Behind the connection: Internet Explorer Automation Part 3

Today I will present an Internet Explorer automation which queries the Blogger stats page automatically. IE automation is required because the Blogger website makes heavy use of JavaScript to construct the stats page dynamically. Downloading the webpage with an HTTP component won’t work because the numbers we are looking for are not present in the raw HTML. JavaScript must be executed to get hold of them.

The code I will show you also takes care of authentication. If you ask for the stats page without being authenticated first, you get the authentication page instead. The code detects the login page, fills in the form automatically, submits it, requests the stats page again and finally extracts the data.

For those not familiar with the Blogger author interface and its stats page, the screen dump shows an actual view of the page. It shows the stats for this week (at the time of writing this article). What we are interested in is the column on the right showing “Pageviews today 149”, “Pageviews yesterday 434” and the two other lines. This is an HTML table that we have to extract from the document.

As I said above, you must be authenticated before you can get this stats page. If you are not, Blogger shows you the login page, whatever page you asked for in the first place. For your reference, here is a screen dump of the authentication page:

On that page, we see a form with two fields, Email and Password, and a “Sign in” button to click. The program will locate those fields, assign a value to each and then click the button.

Document Object Model (DOM)

The World Wide Web Consortium (W3C) Document Object Model (DOM) is a platform- and language-neutral interface that permits programs or scripts to access and update the content, structure, and style of a document. The W3C DOM includes a model for how a standard set of objects representing HTML and XML documents are combined, and an interface for accessing and manipulating them.

Internet Explorer exposes the DOM through a set of COM interfaces available to external programs such as our Delphi application. This is documented on the MSDN website at:
http://msdn.microsoft.com/en-us/library/ie/hh772384(v=vs.85).aspx

I will only scratch the surface of the DOM: just enough to get you started and to accomplish the task for the sample application.

We saw in the previous article that we can connect to IE with this line:

    FWebBrowser := CreateComObject(CLASS_InternetExplorer) as IWebBrowser2;

And that we can navigate to a URL with this line of code:

    FWebBrowser.Navigate(Url, EmptyParam, EmptyParam, EmptyParam, EmptyParam);

To get hold of the interface which is the entry point for the DOM, we must get the document (whatever it is) and then get the interface to the HTML document (if it exists):

    Doc := FWebBrowser.Document;
    Doc.QueryInterface(IID_IHTMLDocument2, HtmlDoc);

Those lines are simple, but there is a catch. Internet Explorer takes some time to fetch the URL and build the document. A document can be quite complex and may require many downloads: HTML, images, CSS, scripts and more. Once everything is downloaded, the scripts have to be executed. IE exposes several status indicators that tell us when everything is ready. The method WaitComplete below takes a URL, navigates to it and waits until the HTML document interface is available and the document is ready:

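function TQueryBloggerStatistics.WaitComplete(
    const URL : String = ''): IHTMLDocument2;
var
    Doc : IDispatch;
begin
    Result := nil;
    // Navigate only when a URL is given; otherwise wait for the current page
    if URL <> '' then
        FWebBrowser.Navigate(URL, EmptyParam, EmptyParam, EmptyParam, EmptyParam);
    // Wait until IE has finished the navigation or download in progress
    while FWebBrowser.Busy do
        Sleep(250);
    // Wait until a document object exists
    while FWebBrowser.Document = nil do
        Sleep(250);
    Doc := FWebBrowser.Document;
    // The document is not necessarily HTML; return nil if it is not
    if Doc.QueryInterface(IID_IHTMLDocument2, Result) <> S_OK then
        Exit;
    // Wait until the HTML document is fully loaded
    while not SameText(Result.readyState, 'complete') do
        Sleep(250);
end;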

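WaitComplete takes an optional URL and returns the IHTMLDocument2 interface required for handling the document. When no URL is given, it simply waits for the page currently loading. It checks in turn that IE is no longer busy, that a document exists, that the document is an HTML document and that its ready state is “complete”. The code is quite straightforward, but all of these checks are needed.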
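The waiting loops above never give up, so a page that never finishes loading would block the program forever. Here is a minimal sketch of the same checks with a time limit. The unit name, the function WaitCompleteTimeout and its TimeoutMs parameter are hypothetical and not part of the sample application; only Busy, Document and readyState come from the code above.

```pascal
unit WaitWithTimeoutSketch;

interface

uses
    Windows, SysUtils, SHDocVw, MSHTML;

// Hypothetical helper: same checks as WaitComplete, but returns nil
// when the document is not ready within TimeoutMs milliseconds.
function WaitCompleteTimeout(
    const Browser   : IWebBrowser2;
    const TimeoutMs : Cardinal) : IHTMLDocument2;

implementation

function WaitCompleteTimeout(
    const Browser   : IWebBrowser2;
    const TimeoutMs : Cardinal) : IHTMLDocument2;
var
    Start : Cardinal;
    Doc   : IDispatch;
begin
    Result := nil;
    Start  := GetTickCount;
    // Assumes overflow checking is off (the Delphi default), so the
    // subtraction stays correct when the tick counter wraps around.
    while (GetTickCount - Start) < TimeoutMs do begin
        if not Browser.Busy then begin
            Doc := Browser.Document;
            // All three conditions must hold: a document exists, it is
            // an HTML document, and it is completely loaded.
            if Assigned(Doc) and
               (Doc.QueryInterface(IID_IHTMLDocument2, Result) = S_OK) and
               SameText(Result.readyState, 'complete') then
                Exit;
            Result := nil;
        end;
        Sleep(250);
    end;
end;

end.
```

A caller would navigate first and then call something like WaitCompleteTimeout(FWebBrowser, 30000), treating a nil result as "not ready in time".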

Once we’ve got an IHTMLDocument2 interface, we can use it to traverse the document object model (DOM) to find the HTML elements we need and to get or set their properties.

The HTML document has a number of collections such as images, links, scripts and the like. There is also a special collection, named “all”, which returns absolutely everything. We will use it to find what we need. For example, in the login form, we need to get hold of the HTML INPUT tag for each field and for the submit button. Each HTML element has a tag name such as “input” and an ID. The tag name is defined by the HTML standard while the ID is chosen by the web developer, in this case by Blogger. Fortunately, Blogger uses clear and meaningful IDs such as “Email” (for the email input field), “Passwd” (for the password input field) and “Signin” (for the submit button).
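When you do not know which IDs a page uses, it can help to list them first. The following is only a sketch: the unit name and the procedure ListInputIds are hypothetical. It walks the “all” collection and collects the ID of every INPUT element into a string list supplied by the caller.

```pascal
unit ListInputsSketch;

interface

uses
    Windows, Classes, SysUtils, MSHTML;

// Hypothetical helper: adds the ID of every INPUT element to Dest.
procedure ListInputIds(const HtmlDoc : IHTMLDocument2; Dest : TStrings);

implementation

procedure ListInputIds(const HtmlDoc : IHTMLDocument2; Dest : TStrings);
var
    Coll  : IHTMLElementCollection;
    Disp  : IDispatch;
    Elem  : IHTMLElement;
    Index : OleVariant;   // Left unassigned; only the first argument is used
    I     : Integer;
begin
    Coll := HtmlDoc.all;
    for I := 0 to Coll.length - 1 do begin
        // item() returns an IDispatch that must be queried for IHTMLElement
        Disp := Coll.item(I, Index);
        if Assigned(Disp) and
           (Disp.QueryInterface(IID_IHTMLElement, Elem) = S_OK) and
           SameText(Elem.tagName, 'INPUT') then
            Dest.Add(Elem.id);   // Empty string when the element has no ID
    end;
end;

end.
```

Calling ListInputIds with the lines of a memo, for example, gives a quick view of the field names a form expects.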

Since we have to get hold of several HTML elements, I wrote a little function, FindTag:

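function TQueryBloggerStatistics.FindTag(
    const Coll    : IHTMLElementCollection;
    const TagName : String;
    const TagID   : String) : IHTMLElement;
var
    PDisp : IDispatch;
    Var2  : OleVariant;   // Left unassigned; only the first argument of item() is used
    I     : Integer;
begin
    for I := 0 to Coll.Length - 1 do begin
        // item() returns an IDispatch; query it for the IHTMLElement interface
        PDisp := Coll.item(I, Var2);
        if PDisp.QueryInterface(IID_IHTMLElement, Result) = S_OK then begin
            // SameText makes both comparisons case-insensitive
            if SameText(Result.tagName, TagName) and
               SameText(Result.Id, TagID) then
                Exit;
        end;
    end;
    // No matching element in the collection
    Result := nil;
end;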

FindTag has to be called like this:

    HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'EMail');
    if Assigned(HtmlElem) then
        HtmlElem.setAttribute('Value', FUserEMail, 0);

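This excerpt finds the “input” tag whose ID is “Email”. The result, if found, is the interface for handling that HTML element. Here I use the interface to set the attribute “value” to the user email (the variable FUserEMail holds the email address). Note that the call passes 'EMail' while the ID is “Email”; this works because FindTag compares with SameText, which ignores case.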

The FindTag code is relatively simple, although accessing the collection items is a little tricky: item returns an IDispatch, which must then be queried for the IHTMLElement interface. Sorry, but this is how Microsoft designed IE to handle the DOM.
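As an alternative to scanning the whole collection, MSHTML also offers getElementById on the IHTMLDocument3 interface. The sketch below is not the approach used in this article; the unit name and the function FindById are hypothetical. Unlike FindTag, it does not check the tag name, and depending on the document mode IE may compare the ID case-sensitively, so the ID should be passed exactly as it appears in the page.

```pascal
unit FindByIdSketch;

interface

uses
    SysUtils, MSHTML;

// Hypothetical helper: looks an element up by ID only.
// Returns nil when the document does not support IHTMLDocument3
// or when no element has the given ID.
function FindById(
    const HtmlDoc : IHTMLDocument2;
    const TagID   : String) : IHTMLElement;

implementation

function FindById(
    const HtmlDoc : IHTMLDocument2;
    const TagID   : String) : IHTMLElement;
var
    Doc3 : IHTMLDocument3;
begin
    Result := nil;
    // Supports() performs the QueryInterface and tolerates a nil document
    if Supports(HtmlDoc, IHTMLDocument3, Doc3) then
        Result := Doc3.getElementById(TagID);
end;

end.
```

FindTag remains the safer choice when you also want to verify the tag name or when the exact case of the ID is uncertain.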

Detecting and handling the login page

The code below queries a webpage by its URL. Here this URL is supposed to be the stats page of a given Blogger blog. We’ll come back to that URL later. The code uses WaitComplete to fetch the URL and wait until the page is ready and complete, then uses FindTag to see if the page contains an “input” tag with the ID “Email”. If this is the case, we assume we have received the login page. The code then fetches, one after the other, all the other required tags in that page, fills them with the user data and calls the “click” method of the HTML element which is the submit button. And guess what… IE sends the form to Blogger and authentication takes place. The code then waits for the resulting page, navigates to the target URL again and, if the login form shows up once more, reports that the login failed.

    FHtmlDoc := WaitComplete(URL);
    if not Assigned(FHtmlDoc) then
        Exit;

    // Check for login page
    // If found, fill in the form and submit it before continuing
    HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'EMail');
    if Assigned(HtmlElem) then begin
        HtmlElem.setAttribute('Value', FUserEMail, 0);
        HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'Passwd');
        if Assigned(HtmlElem) then begin
            HtmlElem.setAttribute('Value', FUserPassword, 0);
            HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'PersistentCookie');
            if Assigned(HtmlElem) then
                HtmlElem.setAttribute('Checked', '', 0);
            HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'Signin');
            if Assigned(HtmlElem) then begin
                HtmlElem.click;
                Display('Login...');
                // We have found login form and must wait for login to occur
                FHtmlDoc := WaitComplete;
                if not Assigned(FHtmlDoc) then
                    Exit;
                // Login is finished, we must navigate again to the target URL
                FHtmlDoc := WaitComplete(URL);
                if not Assigned(FHtmlDoc) then
                    Exit;
                HtmlElem := FindTag(FHtmlDoc.All, 'INPUT', 'EMail');
                if Assigned(HtmlElem) then begin
                    Display('Login failed');
                    Exit;
                end;
            end;
        end;
    end;
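The login code above sets the field values with setAttribute. A possible alternative, shown here only as a sketch, is to query each element for the typed IHTMLInputElement interface and assign its value and checked properties directly. The unit name and the helpers SetInputText and SetInputChecked are hypothetical and are not part of the sample application.

```pascal
unit InputHelpersSketch;

interface

uses
    SysUtils, MSHTML;

// Hypothetical helpers: both return False when Elem is nil
// or is not an INPUT element.
function SetInputText(
    const Elem  : IHTMLElement;
    const Value : String) : Boolean;
function SetInputChecked(
    const Elem    : IHTMLElement;
    const Checked : Boolean) : Boolean;

implementation

function SetInputText(
    const Elem  : IHTMLElement;
    const Value : String) : Boolean;
var
    Input : IHTMLInputElement;
begin
    // Supports() returns False for a nil element, so the result of
    // FindTag can be passed in without testing it first.
    Result := Supports(Elem, IHTMLInputElement, Input);
    if Result then
        Input.value := Value;
end;

function SetInputChecked(
    const Elem    : IHTMLElement;
    const Checked : Boolean) : Boolean;
var
    Input : IHTMLInputElement;
begin
    Result := Supports(Elem, IHTMLInputElement, Input);
    if Result then
        Input.checked := Checked;
end;

end.
```

With such helpers, filling the password field could be written as SetInputText(FindTag(FHtmlDoc.All, 'INPUT', 'Passwd'), FUserPassword), and the Boolean result tells whether the field was found.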

The next step is to extract the statistics from the stats page. We will do that in the next article. Stay tuned!

Read also part 1 and part 2.

Visit my website: http://www.overbyte.be

4 comments:

Unknown said...: How could I get changed HTML content when content updated by Ajax calls?; 13 May, 2013 19:06
FPiette said...: Ajax calls will be reflected in the DOM. Actually, the script doing the Ajax call will update the document using the DOM itself. So from outside, you should be able to get the same changes.; 13 May, 2013 20:22
Unknown said...: No I can't.

I am already using TWebBrowser component + Delphi 7. And IE 9 32bit is loaded into my computer.

Some parts of content is updated dynamically by Ajax calls. It is using named DIV sections. For example, when user press a button on the form, content of DIV XYZ is updating / changing. But if I get the HTML Text by using DOM methods such as innerHTML, outerHTML, I am getting old content, not updated content...

Thanks a lot for your helps.

Hur; 14 May, 2013 15:40
FPiette said...: Have you tried to get the document interface again?; 14 May, 2013 19:58

May 12, 2013
