Web Spider Code
This is the last part of our tutorial, in which we explain the key parts of the web spider program's code.
"C_3OXO.vb" Class
This class is the main class in the web spider application. It
contains all the data and code needed to execute the web crawling process. At
the top of the class file, add the following Imports statements.
Imports mshtml
Imports System.Net
Imports System.IO
Imports System.Data.OleDb
Imports System.Data
Then add the following declarations.
Public Class C_3OXO
Dim DBFilePN As String
Dim DBC As c_3OXO_DB
Dim UrlsT As DataTable
Public MLevel As Integer
Public TimOut As Integer
Public F1 As Form1
DBFilePN : a private string that holds the path name of the newly
created database file.
DBC : a private variable of type C_3OXO_DB, which is the database
class.
UrlsT : a private variable of type DataTable. It is the in-memory
table that manages the URLs list (the crawler frontier).
MLevel : a public integer that holds the deepest tree level the
spider is allowed to reach. A value of -1 means there is no limit.
TimOut : a public integer that holds the maximum time, in seconds, the web
browser control may spend on the current URL before it stops trying to load it.
F1 : a public variable of type Form1. It is used to refer to the
Form1 object.
Go Spider
Public Sub GoSpider(ByVal URL As String, ByVal DBF As String)
DBC = New c_3OXO_DB
DBFilePN = DBF
CreateDBOutputFile(DBFilePN)
DBC.Initial(DBFilePN)
GetWebSite(URL)
End Sub
This is the main public subroutine in this class. It is
the sub we call to start the whole process. It receives two
parameters: the web site URL, for example "http://www.google.com",
and the database file name where the user wants to store the web site data. Its
first line creates an instance of the database class and assigns it to the
"DBC" variable. It then assigns the input parameter "DBF" to the class variable "DBFilePN"
so that it is available to all class methods. Next it creates the output database file
by calling "CreateDBOutputFile" and passing it the "DBFilePN" class
variable, and it initializes the database class instance by calling its "Initial"
method. Finally it calls the "GetWebSite" method, which does the actual crawling
work, as we will see later.
Create Output Database File
Private Function CreateDBOutputFile(ByVal odbfpn As String) As Boolean
Dim s As String()
Dim TS As String
Try
s = Environment.GetCommandLineArgs()
TS = s(0).Substring(0, s(0).LastIndexOf("\") + 1) & DBC.DBTemplateFPN
IO.File.Copy(TS, odbfpn)
Catch ex As Exception
Return False
End Try
Return True
End Function
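The main purpose of this private function is to find the
folder of the application EXE file and then copy the template database file
saved in that folder to the newly specified database file path name. It does
so by using the Environment "GetCommandLineArgs" method, which returns the
command line arguments as a string array. The first element of that array is the
executable file name, and the function keeps everything up to the last backslash
to get the EXE folder path. The function returns False if the copy fails, for
example when the template is missing or the output file already exists.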
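As a minimal sketch of the same idea, the template path can also be built with the System.IO.Path helpers instead of manual string slicing. The module name, the function names and the file name "Template.mdb" are assumptions made for illustration; in the program the real template name comes from DBC.DBTemplateFPN.

```vbnet
Imports System
Imports System.IO

Module TemplatePathSketch
    ' Builds the full path of a file that sits in the same folder as exePath.
    Function SiblingPath(ByVal exePath As String, ByVal fileName As String) As String
        Dim folder As String = Path.GetDirectoryName(exePath)
        Return Path.Combine(folder, fileName)
    End Function

    ' Copies the template to the output file; returns False on any failure,
    ' for example a missing template or an output file that already exists.
    Function CopyTemplate(ByVal templatePath As String, ByVal outputPath As String) As Boolean
        Try
            File.Copy(templatePath, outputPath)
            Return True
        Catch ex As Exception
            Return False
        End Try
    End Function

    Sub Main()
        ' The first command line argument is the executable file name.
        Dim exePath As String = Environment.GetCommandLineArgs()(0)
        ' "Template.mdb" is a hypothetical template file name.
        Console.WriteLine(SiblingPath(exePath, "Template.mdb"))
    End Sub
End Module
```

Path.Combine inserts the directory separator itself, so the code does not depend on where the last backslash sits in the string.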
Get Web Site
Private Sub GetWebSite(ByVal URL As String)
F1.SendMessage("Initialization ...")
InitializeSiteTable(URL)
F1.SendMessage("Gets all URLs ...")
GetWebSiteAllURLs()
F1.SendMessage("Saves URLs to database ...")
SaveWebSite()
End Sub
You can think of this method as the maestro of the whole
crawling operation. It tells the user interface to update itself
according to the current state. It initializes the in-memory representation
of the site URLs, as we will see later. It then starts the process of getting all the
links in the given web site. After that it starts saving and inserting
the pages into the database file.
Initialize in-memory URLs Table
Private Sub InitializeSiteTable(ByVal URL As String)
UrlsT = New DataTable
Dim ID As New DataColumn
ID.AllowDBNull = False
ID.ColumnName = "ID"
ID.DataType = GetType(System.Int32)
ID.Unique = True
ID.AutoIncrement = True
UrlsT.Columns.Add(ID)
Dim Href As New DataColumn
Href.AllowDBNull = False
Href.ColumnName = "Href"
Href.DataType = GetType(System.String)
Href.Unique = True
UrlsT.Columns.Add(Href)
Dim Status As New DataColumn
Status.ColumnName = "Status"
Status.DataType = GetType(System.Boolean)
UrlsT.Columns.Add(Status)
'One-element array: the upper bound is 0
Dim PKeys(0) As DataColumn
PKeys(0) = ID
UrlsT.PrimaryKey = PKeys
Dim TRow As DataRow
TRow = UrlsT.NewRow
TRow.Item(1) = URL
TRow.Item(2) = False
UrlsT.Rows.Add(TRow)
End Sub
This subroutine creates a new data table instance and
configures it as follows. It defines three columns: ID, Href, and Status.
The "ID" field is the primary key column of the table. The "Href" field is the
column where the URL of the link is saved. The "Status" column is a
Boolean field that indicates whether the link has been visited or not. At the end, a
new row that contains the web site URL is added to the table. You can think
of this table as the crawler's URLs list. The first row added to the table,
which contains the web site URL, is the seed of the crawler. The URLs added
later make up the frontier of the crawler. We use this in-memory
structure rather than, for example, simple arrays or lists in order to benefit
from the uniqueness check the data table provides: the "Href" column is marked
as unique, so a URL that is already in the table cannot be added again.
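The uniqueness check can be seen in isolation with a small sketch. It assumes a reduced table with only the "Href" column; the module and function names are made up for illustration and are not part of the program.

```vbnet
Imports System
Imports System.Data

Module UniqueUrlSketch
    ' Builds a one-column table whose "Href" values must be unique.
    Function NewUrlTable() As DataTable
        Dim table As New DataTable()
        Dim href As DataColumn = table.Columns.Add("Href", GetType(String))
        href.Unique = True
        Return table
    End Function

    ' Tries to add a URL; returns False when the table rejects it
    ' because the same value is already present.
    Function TryAddUrl(ByVal table As DataTable, ByVal url As String) As Boolean
        Try
            table.Rows.Add(url)
            Return True
        Catch ex As ConstraintException
            Return False
        End Try
    End Function

    Sub Main()
        Dim table As DataTable = NewUrlTable()
        Console.WriteLine(TryAddUrl(table, "http://example.com/")) ' first copy
        Console.WriteLine(TryAddUrl(table, "http://example.com/")) ' duplicate
        Console.WriteLine(table.Rows.Count)
    End Sub
End Module
```

The spider relies on the same behavior: adding a row with a known URL raises an exception, and the calling code simply ignores it.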
Get All URLs
Private Sub GetWebSiteAllURLs()
Dim i As Integer
Dim TS As TimeSpan = TimeSpan.FromSeconds(TimOut)
Dim Rows() As DataRow
'The current level of the web site tree
Dim CLevel As Integer = 0
Do
'increment the current level value by one
If MLevel = -1 Then CLevel = MLevel - 1 Else CLevel += 1
'CLevel += 1
Rows = UrlsT.Select("status = false")
For i = 0 To Rows.Length - 1
Try
F1.AdvanceProgressbar()
F1.SendMessage("Gets URLs: " + Rows(i).Item(1) + " ...")
GetWebPageURLs(Rows(i).Item(1), TS)
' set the status of the row to true
UrlsT.Rows.Find(Rows(i).Item(0)).Item(2) = True
Catch ex As Exception
End Try
Next
Loop While Rows.Length <> 0 And CLevel < MLevel
End Sub
The algorithm behind this subroutine defines a
variable "CLevel" that represents the current working level in the web site tree. It
starts at zero, which represents the top level of the tree: the
web site address or URL. The subroutine then enters a loop that does the following:
1. If MLevel = -1, the web spider traverses the web site until it
finds no new URLs to visit, so CLevel is set to MLevel - 1 and always stays
below MLevel. Otherwise CLevel is incremented by one.
2. Extract all the rows in the URLs table that have a
status of false (not visited yet).
3. Start a For loop that gets the URLs in each
page represented by a row extracted in step 2, then changes the status
of the visited row to true.
The loop goes back to step 1 until there are no new rows to visit or
the current level reaches the maximum allowed level.
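The level-by-level loop can be tried without a browser. The following is a minimal sketch, not the program's code: it assumes a dictionary named "links" that maps each page to the pages it links to, and it uses plain collections in place of the URLs table. All names in it are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module LevelLoopSketch
    ' Visits pages level by level. "links" stands in for the browser;
    ' maxLevel = -1 means "no limit".
    Function Crawl(ByVal seed As String, _
                   ByVal links As Dictionary(Of String, List(Of String)), _
                   ByVal maxLevel As Integer) As List(Of String)
        Dim known As New List(Of String)      ' every URL seen so far, in order
        Dim visited As New HashSet(Of String) ' URLs whose links were collected
        Dim pending As List(Of String)
        Dim level As Integer = 0
        known.Add(seed)
        Do
            level += 1
            ' Snapshot of the URLs not visited yet (the current level).
            pending = New List(Of String)
            For Each url As String In known
                If Not visited.Contains(url) Then pending.Add(url)
            Next
            For Each url As String In pending
                If links.ContainsKey(url) Then
                    For Each target As String In links(url)
                        ' Skip URLs that are already known (uniqueness check).
                        If Not known.Contains(target) Then known.Add(target)
                    Next
                End If
                visited.Add(url)
            Next
        Loop While pending.Count > 0 AndAlso (maxLevel = -1 OrElse level < maxLevel)
        Return known
    End Function

    Sub Main()
        Dim links As New Dictionary(Of String, List(Of String))
        links("a") = New List(Of String)(New String() {"b", "c"})
        links("b") = New List(Of String)(New String() {"d"})
        For Each url As String In Crawl("a", links, 2)
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

Here the "no limit" case is written as an explicit condition in the loop test, whereas the subroutine above gets the same effect by keeping CLevel below MLevel.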
Get Web Page URLs
Private Sub GetWebPageURLs(ByVal url As String, ByVal TS As TimeSpan)
Dim Doc As mshtml.HTMLDocument
Doc = Navigate2WebPage(url, TS)
If Doc Is Nothing Then Return
' Get all URLs in the current doc
Dim AnchorsArr As IHTMLElementCollection = Doc.links
Dim Anchor As IHTMLAnchorElement
'Add each anchor to the URLS table
For Each Anchor In AnchorsArr
Dim NRow As DataRow
NRow = UrlsT.NewRow
Try
NRow.Item(1) = Anchor.href
NRow.Item(2) = False
UrlsT.Rows.Add(NRow)
Catch ex As Exception
End Try
Next
End Sub
This subroutine takes a URL and a time interval. It
defines an HTMLDocument variable, navigates to the URL using the web browser
control and assigns the returned document to that variable. If no document
is returned, it exits.
Then it defines an HTML elements collection and assigns the document's
links to it. It then traverses the collection and adds each link to the
in-memory URLs table with a status of false. A link that is already in the table
is rejected by the uniqueness check, and the resulting exception is ignored.
Navigate to a Web Page
Private Function Navigate2WebPage(ByVal URL As String, ByVal TimeoutInterv _
As TimeSpan) As HTMLDocument
Dim T1, T2 As Date
Dim Interv As TimeSpan
Try
F1.AxWebBrowser1.Navigate2(URL)
T1 = Now()
Do While (F1.AxWebBrowser1.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE)
Application.DoEvents()
T2 = Now
Interv = T2.Subtract(T1)
If TimeSpan.Compare(Interv, TimeoutInterv) = 1 Then Return Nothing
Loop
Catch ex As Exception
Return Nothing
End Try
Return F1.AxWebBrowser1.Document
End Function
This function navigates the web browser control to the
given URL. It waits until the document is completely loaded into the browser
control by testing the ready state of the control, and then returns
the web browser document. If the page does not finish loading within the
allowed time interval, or if an error occurs, it returns Nothing.
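The wait-with-timeout pattern can be sketched on its own, independent of the browser control. The module and function names below are hypothetical, and the sketch sleeps between checks where the function above calls Application.DoEvents() to keep the form responsive.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Threading

Module WaitSketch
    ' Polls isReady until it returns True or the timeout elapses.
    ' Returns True when the condition was met in time, False on timeout.
    Function WaitUntil(ByVal isReady As Func(Of Boolean), ByVal timeout As TimeSpan) As Boolean
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Do While Not isReady()
            If watch.Elapsed > timeout Then Return False
            Thread.Sleep(10) ' short pause between checks
        Loop
        Return True
    End Function

    ' Two trivial conditions used to try the function out.
    Function AlwaysReady() As Boolean
        Return True
    End Function

    Function NeverReady() As Boolean
        Return False
    End Function

    Sub Main()
        Console.WriteLine(WaitUntil(AddressOf AlwaysReady, TimeSpan.FromSeconds(1)))
        Console.WriteLine(WaitUntil(AddressOf NeverReady, TimeSpan.FromMilliseconds(100)))
    End Sub
End Module
```

In a Windows Forms program a sleeping loop like this would block the message pump, which is why the function above pumps events with DoEvents while it waits.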
At this stage the program has done everything needed to
collect the URLs in the given web site and store them in the in-memory URLs
table. The following methods take this URLs table and visit each URL in it in
turn to get the web page HTML text and store it in the database.
Get Web Page
Private Function GetWebPage(ByVal URL As String) As String
Dim myWebRequest As WebRequest
Dim myWebResponse As WebResponse
Try
' Create a new 'WebRequest' object to the mentioned URL.
myWebRequest = WebRequest.Create(URL)
' The response object of 'WebRequest' is assigned to a 'WebResponse' variable.
myWebResponse = myWebRequest.GetResponse()
Catch ex As Exception
Return "ERORR!"
End Try
Dim RString As String
Try
Dim streamResponse As Stream = myWebResponse.GetResponseStream()
Dim SReader As New StreamReader(streamResponse)
RString = SReader.ReadToEnd
streamResponse.Close()
SReader.Close()
myWebResponse.Close()
Catch ex As Exception
Return "ERORR!"
End Try
Return RString
End Function
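This function takes a URL as input and returns a
string that contains the HTML text of the page. This is done by using the "WebRequest"
and "WebResponse" classes. If the request or the reading of the response fails,
the function returns the string "ERORR!" instead of the page text.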
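A possible variant, shown only as a sketch, wraps the response and the reader in Using blocks so that they are closed even when reading fails. The function name "TryGetPage" is hypothetical, and returning Nothing on failure is a choice made for this example; the function above returns an error string instead.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module DownloadSketch
    ' Returns the response body as text, or Nothing when the request fails.
    Function TryGetPage(ByVal url As String) As String
        Try
            Dim request As WebRequest = WebRequest.Create(url)
            ' Using closes the response and the reader on every exit path.
            Using response As WebResponse = request.GetResponse()
                Using reader As New StreamReader(response.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        Catch ex As Exception
            Return Nothing
        End Try
    End Function

    Sub Main()
        Dim html As String = TryGetPage("http://www.google.com")
        If html Is Nothing Then
            Console.WriteLine("Request failed")
        Else
            Console.WriteLine(html.Length)
        End If
    End Sub
End Module
```

Returning Nothing lets the caller tell a failed download apart from a page whose text happens to match the error string.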
Save Web Site
Private Sub SaveWebSite()
Dim i As Integer
Dim str As String
For i = 0 To UrlsT.Rows.Count - 1
F1.AdvanceProgressbar()
str = UrlsT.Rows(i).Item(1)
F1.SendMessage("Saves to database: " + str + " ...")
SaveWebPage(str, GetWebPage(str))
Next
End Sub
This subroutine traverses the URLs table. For each row it gets the URL,
gets the HTML text of that page, and then saves the URL and the page text to the
database file.
Save Web Page
Private Function SaveWebPage(ByVal URL As String, ByVal Page As String) As Integer
Return DBC.Insert(URL, Page)
End Function
This function saves the entered URL and page string to
the database using the insert method of the database class.
"Form1.vb" Class
The button click handler method
In the button click event handler method, some checks on
the user typed URL and database file are carried out.
Dim cls As New C_3OXO
cls.F1 = Me
cls.TimOut = Integer.Parse(Me.Tb_TimeOut.Text)
If Me.CB_MLevel.Checked Then
cls.MLevel = Integer.Parse(Me.Tb_MLevel.Text)
Else
cls.MLevel = -1
End If
Then a new instance of the C_3OXO class is
created. The public variables of the created instance are assigned as shown in
the above code. If the maximum level check box is not checked, MLevel is set
to -1, which means no limit.
cls.GoSpider(Me.TB_URL.Text, Me.TB_DBFile.Text)
Then the "GoSpider" method is called to start the whole
crawling process.
That is all.
To download the complete program, just click
here.