Web Crawler Components


In this part we explain the remaining two components needed to build the web spider, and how these components relate to each other in our web spider application.


What Do We Really Need?

"Microsoft Web Browser" COM component

The web browser component gives your application Microsoft Internet Explorer capabilities such as document navigation, document viewing, and data downloading. Adding this component to your application is like adding a complete web browser to your program: you can navigate to any URL from an ordinary desktop application. By adding the web browser control to your application's form, you give your users the ability to browse the World Wide Web as well as folders in the local file system and on a network. Like Microsoft Internet Explorer, the web browser control maintains a history list that lets the user browse forward and backward. The user can navigate to a URL either by clicking a link in the displayed page or by entering the URL in an address bar that your application provides.

You can customize the appearance and behavior of the web browser control by setting its properties from your application code. You can navigate to any URL by calling the control's Navigate method and, most importantly, you can obtain the document currently displayed in the control's browsing area as an MSHTML document object.

To use the web browser component, you must first add it to your Microsoft Visual Studio 2005 toolbox, because it is not there by default. Right-click the toolbox. A pop-up menu appears, as shown in Figure 1 below.

Figure 1 - Tool Box Pop-up menu

Click "Choose Items...". The Choose Toolbox Items dialog box appears, as shown in Figure 2 below.

Figure 2 - Choose Toolbox Items dialogue box

Select the "COM Components" tab. Scroll through the list of components until you find "Microsoft Web Browser", check the box beside it, then click "OK". Now look at your toolbox: the Web Browser component has been added to the end of the "Components" tab, as shown in Figure 3 below.

Figure 3 - The web browser control was added to the toolbox items

You can now select the web browser component from your toolbox and use it like any other component.

In our program we will add a web browser control to our form, although we will not actually use its display features. We add it so that our program can obtain the currently displayed document as an MSHTML document, as we will see later. If you add a Web Browser component to your form and build your application, two files ("AxInterop.SHDocVw.dll" and "Interop.SHDocVw.dll") will be added under your application's "bin" folder, in the same location as the "EXE" file.

"mshtml" DLL

MSHTML was first introduced in 1997. It is the main HTML component of Microsoft Internet Explorer 4.0 and later versions, and it can also be used from applications other than IE. MSHTML provides many features, including an editor that offers a What You See Is What You Get (WYSIWYG) editing environment with a rich set of editing capabilities, as well as HTML rendering and parsing.

Applications can host MSHTML and use its editing features to let users edit HTML content much as they would edit text in a word processor. With its WYSIWYG editing capabilities, MSHTML gives document authors better control over the format and appearance of their documents, and lets them work effectively without knowing HTML. An author can simply click buttons to change font sizes, paragraph formats, colors, and so on. This makes it well suited to building a simple yet powerful WYSIWYG web authoring tool.

In addition to the features above, and this is what we really care about in our program, MSHTML provides rendering and parsing capabilities. You can use MSHTML without activating its user interface, purely for its ability to parse HTML documents. Once the required HTML document is loaded, you can use the object model to access the underlying HTML: its tags, the properties of each tag and element, its forms, and so on. By adding this DLL to your program you can parse and automate the loaded HTML document in many ways. For example, you can get all the hyperlinks in the document, get all the paragraphs, or get the attributes of a particular element using its ID. You can also automate button clicks, text box editing, combo box selection, and so on. In our program we will use this component to get all the hyperlinks in the visited pages.
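The idea of loading a document and then querying it for hyperlinks and for elements by ID can be illustrated without MSHTML. The following is only a rough sketch in Python using the standard library's html.parser module; it is not the MSHTML object model and not the code of this tutorial's program. The names LinkCollector, parse_document, and SAMPLE_PAGE are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects hyperlink targets and the attributes of elements that have an id."""

    def __init__(self):
        super().__init__()
        self.links = []  # href of every <a> tag that has one, in document order
        self.ids = {}    # element id -> dict of that element's attributes

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A hyperlink is an <a> tag with a non-empty href attribute.
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        # Remember every element that carries an id so it can be looked up later.
        if "id" in attributes:
            self.ids[attributes["id"]] = attributes


def parse_document(html_text):
    """Parse an HTML string and return the populated collector."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector


# Hypothetical page content, used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
  <p id="intro" class="lead">Welcome</p>
  <a href="http://example.com/a">A</a>
  <a href="/b">B</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

doc = parse_document(SAMPLE_PAGE)
print(doc.links)         # the hyperlinks found in the page
print(doc.ids["intro"])  # the attributes of the element whose id is "intro"
```

The anchor without an href is skipped, because it names a position in the page rather than linking to another document. In the actual program the equivalent queries would be made through the MSHTML object model instead.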

To add the MSHTML component to your project, right-click the project item in the "Solution Explorer" to open the pop-up menu, then choose "Add Reference...". The Add Reference dialog box appears. On the .NET tab, scroll until you find "Microsoft.MSHTML", select it, then click "OK" (see Figure 4 below). MSHTML is now added to your project references and is ready to use.

Figure 4 - Add Reference dialogue box

The Relation between MSHTML and Web Browser Control

As mentioned in the two sections above, we will use both the web browser control and the MSHTML DLL in our program. The two components are related as follows. We use the web browser control to load a URL and wait until loading completes; we then obtain the browser's document as an MSHTML document. After that we are free to use all of MSHTML's parsing capabilities on this document to get the hyperlinks in it. We repeat this cycle until we have processed all the URLs or find no new URLs to parse.
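The load, parse, and repeat cycle described above can be sketched independently of the two components. The following Python sketch is only an illustration under stated assumptions: load_page stands in for the web browser control (it returns a page's HTML once loading has finished, or None if the page cannot be loaded), extract_links stands in for MSHTML's link parsing, and FAKE_WEB is a made-up in-memory site used in place of real network access. None of these names come from the tutorial's program.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorParser(HTMLParser):
    """Gathers the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html_text):
    """Return the absolute URLs of all hyperlinks in html_text."""
    parser = _AnchorParser()
    parser.feed(html_text)
    parser.close()
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_url, load_page, max_pages=100):
    """Visit pages breadth-first until no new URLs remain or max_pages is hit."""
    pending = deque([start_url])  # URLs waiting to be loaded
    seen = {start_url}            # every URL ever queued, to avoid repeats
    visited = []                  # URLs that were loaded and parsed
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        html_text = load_page(url)  # assumed to return only after loading completes
        if html_text is None:
            continue                # page could not be loaded; skip it
        visited.append(url)
        for link in extract_links(url, html_text):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited


# Hypothetical three-page site; "/three" is linked but does not exist.
FAKE_WEB = {
    "http://site.test/": '<a href="/one">1</a> <a href="/two">2</a>',
    "http://site.test/one": '<a href="/two">2</a> <a href="/">home</a>',
    "http://site.test/two": '<a href="/three">3</a>',
}

order = crawl("http://site.test/", FAKE_WEB.get)
print(order)
```

The seen set is what ends the cycle: a URL is queued only the first time it is discovered, so the loop stops once every reachable page has been processed. The max_pages limit is an extra safeguard added for this sketch.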

Now that we have explained all the namespaces and components we need to create our web spider program, let us have fun and start building it in the next two parts of this tutorial.

To download the complete program, just click here.

For further information

Refer to the online copy of the Microsoft Developer Network (MSDN) at http://msdn.microsoft.com or use your own local copy of MSDN.
