Web Crawler Components
Continued...
In this part we will explain the remaining two components
needed to build a web spider, plus the relationship between these components in our web spider application.
What Do We Really Need?
"Microsoft Web Browser" COM component
The web browser component provides your
application with Microsoft Internet Explorer capabilities such as document
navigation, document viewing, and data downloading. Adding this component to
your application is like adding a complete web browser to your program: you
can navigate to any URL from your ordinary desktop application. By
adding the web browser control to your application form you give your
application's users the ability to browse the World Wide Web as well as
folders in the local file system and on a network. Like Microsoft Internet
Explorer, the web browser control
maintains a history list that lets the user move forward and backward. The
user can navigate to a URL either by clicking a link in the displayed page or
by typing the URL into an address box that your application provides.
Programmatically, you can customize the
appearance and functionality of the web browser control by setting its
properties from your application code. You can navigate to any URL by calling
the control's Navigate method and, most importantly, you can obtain
the document currently displayed in the control's browsing area
as an MSHTML document object.
To make use of the web browser component you must first add it to your Microsoft Visual Studio 2005 toolbox,
because it is not there by default. Right-click the toolbox.
A pop-up menu will appear, as shown in Figure 1 below.
Figure 1 - Toolbox pop-up menu
Click "Choose Items...". The Choose Toolbox Items
dialog box will appear, as shown in Figure 2 below.

Figure 2 - Choose Toolbox Items dialogue box
Select the "COM Components" tab. Browse the
list of components until you find "Microsoft Web Browser", check the check box
beside it, then click "OK". Now look at your toolbox: the web browser
component has been added to the end of the "Components" tab, as shown
in Figure 3 below.

Figure 3 - The web browser control was added to the toolbox items
Now you can select the web browser component
from your toolbox and use it just as you use any other
component.
In our program we will add a web browser
control to our form, although we will not actually use its display
features. The purpose of adding it is to give our program the ability to get the
document at the currently displayed URL as an MSHTML document, as we will see later. If
you add a web browser component to your form and build your application, the
following two files ("AxInterop.SHDocVw.dll", "Interop.SHDocVw.dll") will be
added to your application folder under the "bin" folder, in the same location
as the "EXE" file.
"mshtml" DLL
MSHTML was first introduced in 1997. It is the main
HTML component of Microsoft Internet Explorer 4.0 and later versions, and it can
also be used from applications other than IE. MSHTML provides two main groups of
features: an editor that offers a What You See Is What You Get (WYSIWYG)
editing environment with a rich set of editing capabilities, and HTML
rendering and parsing capabilities.
Applications can host MSHTML and use its editing
features to let users edit HTML content in much the same way they edit text
in a word processor. With its WYSIWYG editing capabilities, MSHTML gives
document authors better control over the format and appearance
of their documents, and lets them work effectively without knowing HTML. An author can
simply click buttons to alter font sizes, paragraph formats, colors, and so on.
As you can see, it is well suited to creating a simple yet powerful WYSIWYG web
authoring tool.
In addition to the above features (and this is what we really care about in
our program), MSHTML provides rendering and parsing capabilities. You
can use MSHTML without activating its user interface and rely only on its ability to
parse HTML documents. Once the required HTML document is loaded, you can use the
object model to access the underlying HTML: its tags, the properties of each tag
and element, its forms, and so on. By adding this DLL to your program
you can perform many parsing and automation tasks on the loaded HTML document.
For example, you can get all the hyperlinks in the document, get all the
paragraphs, or get the attributes of a certain element
using its ID. You can also automate button clicks, text box editing, combo box
selection, and so on. In our program we will use this
component to get all the hyperlinks in the visited pages.
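To make the idea of "getting all the hyperlinks in a document" concrete, here is a minimal sketch. It does not use MSHTML or the tutorial's own code (which appears in later parts); it is written in Python and uses the standard library's html.parser as a stand-in for MSHTML's parsing. The names LinkCollector and extract_links are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL.

    Hypothetical helper: a stand-in for walking the links of a parsed
    document, not the MSHTML object model itself.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all hyperlink targets found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="/about.html">About</a> <a href="http://example.org/x">X</a></p>'
print(extract_links(sample, "http://example.com/index.html"))
# prints ['http://example.com/about.html', 'http://example.org/x']
```

Resolving each link against the page's own URL matters for a spider, because most pages use relative links that cannot be visited as they are written.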
To add the MSHTML component to your project, right-click the
project item in the "Solution Explorer" to open the pop-up menu.
From the menu that appears,
choose "Add Reference...". The Add Reference dialog box appears. On the .NET
tab, scroll until you find "Microsoft.MSHTML", select it, then click OK (see Figure 4
below). MSHTML is now added to your project references and is ready for
use.

Figure 4 - Add Reference dialogue box
The Relation between MSHTML and Web Browser Control
As we mentioned in the last two sections above, we will
use both the web browser control and the MSHTML DLL in our program. These two
components are related to each other as follows. We will use the web
browser control to load our URL and wait until loading completes,
then create an MSHTML document and get the browser's document as
an MSHTML document. After that we are free to work with this document using all
of MSHTML's parsing capabilities to get all the hyperlinks in it, and repeat the cycle
again and again until we have processed all the URLs or find no new URLs to parse.
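The cycle described above (load a page, collect its links, queue the ones not seen before, repeat until nothing is left) can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python, not with the web browser control or MSHTML; the function crawl, its fetch and max_pages parameters, and the in-memory PAGES table are all hypothetical, with fetch standing in for "load the URL and wait until loading completes".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class _Links(HTMLParser):
    """Collects raw href values of <a> tags (stand-in for MSHTML parsing)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; `fetch(url)` returns HTML text or None."""
    pending = deque([start_url])   # URLs waiting to be loaded
    seen = {start_url}             # every URL ever queued, to avoid repeats
    visited = []                   # URLs actually loaded, in visit order
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        html = fetch(url)          # assumed to block until the page is loaded
        if html is None:           # page could not be loaded; skip it
            continue
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link not in seen:   # only queue URLs we have not met before
                seen.add(link)
                pending.append(link)
    return visited


# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://site.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://site.test/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://site.test/b.html": '<a href="missing.html">gone</a>',
}

print(crawl("http://site.test/", PAGES.get))
# prints ['http://site.test/', 'http://site.test/a.html', 'http://site.test/b.html']
```

The set of seen URLs is what makes the loop terminate: pages that link back to each other are queued only once, so the spider stops when no new URLs turn up.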
Now that we have explained all the namespaces and
components we need to create our web spider program, let us have fun and
start building it in the next two parts of this tutorial.
For further information
Refer to the online copy of Microsoft Developers Network at
http://msdn.microsoft.com or use your own local copy of MSDN.