Many of you may be familiar with parsing JSON and using APIs to get data from a webpage or a web service. But sometimes an API is not available, or it is not free. Parsing JSON can also take beginners a lot of development time.
An easier, though less powerful, substitute for parsing a document through an API is screen scraping.
Screen Scraping:
In this process, we simply reach the desired tag in an HTML document and read its contents.
XPath is used to reach a particular node.
XPath Syntax:
I'll give just a basic idea of XPath, enough to understand how the HTML Agility Pack works.
More detailed documentation is available at http://www.w3schools.com/xpath/
The following are the basic path expressions used to reach a node in XML:
nodename ( Select all child nodes of the current node that are named 'nodename' )
/nodename ( Select nodename starting from the root of the document )
//nodename ( Select every nodename anywhere in the document )
@attribute ( Select the attribute named 'attribute' of a node )
/nodename//childnode ( Select nodename from the root, then select every childnode anywhere inside that nodename )
Similarly, you can combine these operators to build a path expression that fits your needs.
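To see the path expressions above in action, here is a small sketch that runs them against a tiny XML document using the System.Xml classes that ship with .NET. The sample document, the XPathDemo class, and the Count helper are all made up for illustration; they are not part of the HTML Agility Pack.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical sample document, invented for this illustration.
    public const string SampleXml =
        "<library>" +
            "<book id=\"b1\"><title>First</title></book>" +
            "<shelf><book id=\"b2\"><title>Second</title></book></shelf>" +
        "</library>";

    // Returns how many nodes the XPath expression selects,
    // evaluated with the document itself as the current node.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("library"));         // 1: child of the current node
        Console.WriteLine(Count("book"));            // 0: book is not a direct child of the document
        Console.WriteLine(Count("/library/book"));   // 1: only the book directly under the root element
        Console.WriteLine(Count("//book"));          // 2: every book, at any depth
        Console.WriteLine(Count("//book/@id"));      // 2: the id attribute of every book
        Console.WriteLine(Count("/library//title")); // 2: every title anywhere inside library
    }
}
```

Note how "book" on its own finds nothing while "//book" finds both books: a bare node name only looks at the children of the current node.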
How to get HTML Agility Pack?
After creating your Windows Phone / Windows Store / ASP.NET Web Application / Windows Forms project in Visual Studio, click Manage NuGet Packages in the Project drop-down menu in the menu bar.
A new window opens. Search for HTML Agility Pack in it and install it.
The package is now installed for your application.
Using HTML Agility Pack:
For ease of testing, I created a Web Application, but the package can be used on Windows Phone as well.
Add a Web Form (a file with the .aspx extension) to your project, and add a button, Extract Data, that makes the application extract the data and display it when clicked.
To do that, we simply write the code in the OnClick method of our button, in the aspx.cs file of our web form.
Note: Import the namespace in Webpage.aspx.cs by writing " using HtmlAgilityPack; " along with the other using directives at the start of the Webpage.aspx.cs file.
Now you are ready to enter the world of screen scraping.
Selecting a single node:
For our example, I am using the page at http://www.cs.colorado.edu/~main/bgi/doc/getmouseclick.html. Suppose I want to get the heading from this page.
First, I right-click anywhere on the page and choose Inspect Element. In the side bar that opens, I can see which tag contains the text getmouseclick.
In this example, it is inside the h2 tag.
I only need to look for the h2 tag anywhere in the document, so my XPath expression will be //h2.
Now for the code.
Front End: (.aspx file)
When the ExtractData button is clicked, it calls the ExtractData_Click handler written in the WebPage1.aspx.cs file.
Back End: (.aspx.cs file)
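The original listing for this step is not reproduced here, so what follows is only a minimal sketch of what the code-behind could look like, not the exact code from the screenshots. It assumes the page class is named WebPage1 and that the .aspx file contains, besides the button, a Label with the hypothetical ID ResultLabel to show the result. The FirstText helper is also my own addition.

```csharp
using System;
using HtmlAgilityPack;

public partial class WebPage1 : System.Web.UI.Page
{
    // Assumed markup in WebPage1.aspx (ResultLabel is a hypothetical name):
    //   <asp:Button ID="ExtractData" runat="server" Text="Extract Data"
    //               OnClick="ExtractData_Click" />
    //   <asp:Label ID="ResultLabel" runat="server" />

    protected void ExtractData_Click(object sender, EventArgs e)
    {
        const string url = "http://www.cs.colorado.edu/~main/bgi/doc/getmouseclick.html";

        // Download the page and parse it into a document tree.
        HtmlDocument doc = new HtmlWeb().Load(url);

        // //h2 selects the first h2 tag found anywhere in the document.
        string heading = FirstText(doc, "//h2");
        ResultLabel.Text = Server.HtmlEncode(heading ?? "No matching node found.");
    }

    // Returns the trimmed text of the first node matching the XPath,
    // or null when nothing matches (SelectSingleNode returns null then).
    public static string FirstText(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Checking for null before reading InnerText matters in practice: if the site changes its layout and the h2 tag disappears, the page shows a message instead of throwing an exception.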
Output:
Point to note:
This technique is a good fit for Windows Phone, because the data can be fetched from the website asynchronously. That is, while the page is loading, the phone application can carry on with other tasks and does not hang.
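As a rough sketch of the asynchronous idea, the page can be downloaded with HttpClient and then handed to the HTML Agility Pack for parsing. This is one possible approach under my own assumptions, not necessarily the API you would use on every Windows Phone version; the AsyncScraper class and its method names are hypothetical.

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class AsyncScraper
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the page without blocking the calling thread,
    // then extracts the text of the first node matching the XPath.
    public static async Task<string> GetFirstTextAsync(string url, string xpath)
    {
        string html = await Client.GetStringAsync(url);
        return ExtractFirstText(html, xpath);
    }

    // Pure parsing step, kept separate so it can be tried without a network.
    public static string ExtractFirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }
}
```

From an async event handler, this could be called as: string heading = await AsyncScraper.GetFirstTextAsync(url, "//h2"); the interface stays responsive while the download is in progress.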
I'll discuss further uses of the HTML Agility Pack, such as XPath with attributes and selecting multiple nodes, later.




