Flock's Web Detective service is designed to make detection of states or information on web pages easier. It is used to detect logged-in or logged-out states for various web services, to detect media streams or person information on web pages, etc. With Web Detective, it's possible to examine any combination of URL, Document source, DOM, Form fields and Cookies using string comparison, regular expressions and even XPath expressions. Web Detective also lets you specify special named strings, such as web service URLs that may be subject to change, in a format that makes them easy to update later. Detection rules and named strings are defined in updateable XML files.
Contract ID: "@flock.com/web-detective;1"
Interface: flockIWebDetective
To use Web Detective, you must first define an XML rules file. Links to existing rules files for various web services are provided below.
To specify named strings, your XML file must contain a <strings> element. Example:
<strings>
<string name="domains" value="example.com,example.net"/>
<string name="loginURL">
<![CDATA[http://login.example.com/login.php]]>
</string>
<string name="profileURL">
<![CDATA[http://pages.example.com/%userid%/]]>
</string>
...
</strings>
The strings can be whatever you like. By convention, use %variables% to mark substrings that your code will substitute at run time.
To retrieve a string, use the flockIWebDetective.getString() method.
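As a rough illustration of the %variable% convention, the sketch below fills in a named string after it has been retrieved. The helper name fillTemplate is hypothetical and is not part of Web Detective; the string is hard-coded here because the arguments of getString() are not shown on this page.

```javascript
// Hypothetical helper (not part of Web Detective): replaces each
// %name% placeholder with the matching property of `values`.
// Placeholders with no matching property are left untouched.
function fillTemplate(template, values) {
  return template.replace(/%([A-Za-z0-9_]+)%/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(values, name)
      ? String(values[name])
      : match;
  });
}

// In real code this string would come from the rules file.
const profileURL = "http://pages.example.com/%userid%/";
console.log(fillTemplate(profileURL, { userid: "alice" }));
// prints "http://pages.example.com/alice/"
```

Leaving unknown placeholders in place makes a missing value easy to spot in the resulting URL.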
In general, each rules file will specify a number of rules, in the form of <detect> elements. Each <detect> must have a "type" attribute. For example, <detect type="loggedin">. You may use whatever string you desire for the type, but if you want to follow Flock convention then have a look at the example files below. You may define more than one rule for a given type if you like. In this case, the first matching rule will be used.
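The "first matching rule wins" behaviour can be pictured with the small sketch below. It is only a model of the description above, not Web Detective's implementation: the rule objects, their label and matches properties, and the firstMatchingRule function are all made up for illustration.

```javascript
// Illustrative model only: rules are tried in file order and the first
// one of the requested type whose conditions match is returned.
function firstMatchingRule(rules, type, page) {
  for (const rule of rules) {
    if (rule.type === type && rule.matches(page)) {
      return rule;
    }
  }
  return null; // no rule of that type matched
}

const rules = [
  { type: "loggedin", label: "cookie rule", matches: (p) => p.cookies.includes("session") },
  { type: "loggedin", label: "markup rule", matches: (p) => p.source.includes("Logout") },
  { type: "loggedout", label: "login form rule", matches: (p) => p.source.includes("login.php") }
];

const page = { cookies: ["session"], source: "<a href='/out'>Logout</a>" };
// Both "loggedin" rules match this page; the earlier one is used.
console.log(firstMatchingRule(rules, "loggedin", page).label);
// prints "cookie rule"
```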
To test a rule, use the flockIWebDetective.detect() method, or one of its variants. (See the interface documentation to help decide which variant to use.)
Inside the <detect>, you will need a <conditions> element. The conditions control whether or not the rule matches. Conditions may operate on a URL, a Document, on Cookies, or on some combination of the three.
<detect type="...">
<conditions>
<url domain="example.com"/>
</conditions>
</detect>
The above rule will match if the domain of the page being tested is "example.com". You can also get more specific, though. For example:
<detect type="...">
<conditions>
<url domain="example.com">
<host contains="www."/>
<path contains="login.php"/>
</url>
</conditions>
</detect>
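To make the <url> conditions above more concrete, here is a sketch of roughly equivalent checks in plain JavaScript. It assumes that domain="example.com" also covers subdomains such as www.example.com, which this page does not state explicitly, and the function and property names are invented for the example.

```javascript
// Sketch of the <url domain="..."><host .../><path .../></url> checks.
// Assumption: a domain condition matches the domain itself and any
// subdomain of it.
function matchesUrlRule(pageUrl, rule) {
  const url = new URL(pageUrl);
  const host = url.hostname;
  const domainOk = host === rule.domain || host.endsWith("." + rule.domain);
  const hostOk = !rule.hostContains || host.includes(rule.hostContains);
  const pathOk = !rule.pathContains || url.pathname.includes(rule.pathContains);
  return domainOk && hostOk && pathOk;
}

const loginRule = { domain: "example.com", hostContains: "www.", pathContains: "login.php" };
console.log(matchesUrlRule("http://www.example.com/login.php", loginRule)); // prints true
console.log(matchesUrlRule("http://example.com/login.php", loginRule));     // prints false
```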
If you need to test the contents of a document, you can either use regular expressions, or XPath, or a combination of both.
<detect type="...">
<conditions>
<document>
<regexp><![CDATA[/div id='Welcome'/]]></regexp>
</document>
</conditions>
</detect>
The above rule will match a document whose source contains the string "div id='Welcome'". Of course, regexps are much more powerful than this. You can use any JavaScript regular expression here.
For performance reasons, regexps are run against the document source on a line-by-line basis until a match is found. If you need a regexp to be run against the entire document source as if it were all on one line, do this:
<regexp multiline="true"><![CDATA[/.../]]></regexp>
But beware of the performance impact!
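The difference between the two modes can be seen in the following sketch. It only models the behaviour described above; in particular, flattening the source by replacing line breaks with spaces is an assumption about what "as if it were all on one line" means, and both function names are hypothetical.

```javascript
// Default mode: test each line separately and stop at the first match.
function matchesLineByLine(source, re) {
  return source.split(/\r?\n/).some((line) => re.test(line));
}

// multiline="true" mode (assumed behaviour): treat the whole source as
// a single line, so a pattern can span what were separate lines.
function matchesWholeSource(source, re) {
  return re.test(source.replace(/\r?\n/g, " "));
}

const source = '<div\n id="Welcome">Hello</div>';
const re = /<div\s+id="Welcome"/;

console.log(matchesLineByLine(source, re));  // prints false
console.log(matchesWholeSource(source, re)); // prints true
```

The tag is split across two lines, so only the whole-source test finds it. The line-by-line version can stop early on large documents, which is where the performance difference comes from.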
XPath expressions can be used in a similar way:
<detect type="...">
<conditions>
<document>
<xpath><![CDATA[//div[@id="header"]//a/text()[contains(.,"Logout")]]]></xpath>
</document>
</conditions>
</detect>
The above rule would match a document that had a DIV with id="header" that contained a "Logout" link.
IMPORTANT NOTE: XPath will only work in situations where you actually have a DOM -- for example when the user navigates to a page, the DOM gets loaded, and a "FlockDocumentReady" event is fired. In situations where there is no DOM -- such as when using XMLHttpRequest -- you must rely on regular expressions alone.
To detect field values when a form is submitted, for example, you can specify <form> conditions, using either Basic or XPath syntax (both are shown below). To detect form conditions, you must call flockIWebDetective.detectForm() and pass in the nsIDOMHTMLFormElement as an argument. If you specify a <document> element in your rule, then the form's .ownerDocument will be used. If you specify a <url> element, then .ownerDocument.URL will be used.
Example (Basic syntax):
<detect type="login">
<conditions>
<form>
<field tagname="input" name="username" type="text"/>
<field tagname="input" name="password" type="password"/>
</form>
</conditions>
</detect>
At present, only "input" tagnames are supported. We'll probably extend this to handle textareas at some point. The conditions will only be met if matches are found for all form fields specified. That is, according to the rule above we need to find a text field called "username" and a password field called "password". If you want to match a specific id for a field, then you must use fieldid="<id>".
Example (XPath syntax):
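The "all fields must match" behaviour of Basic syntax can be sketched as follows. Plain objects stand in for the form's input elements, and the functions are illustrative only; they are not how Web Detective itself is written.

```javascript
// One <field> spec matches an input when every attribute given in the
// spec (tagname, name, type, fieldid) agrees with the input.
function fieldMatches(spec, input) {
  if (input.tagName.toLowerCase() !== spec.tagname) return false;
  if (spec.name !== undefined && input.name !== spec.name) return false;
  if (spec.type !== undefined && input.type !== spec.type) return false;
  if (spec.fieldid !== undefined && input.id !== spec.fieldid) return false;
  return true;
}

// The <form> condition is met only if every spec finds a matching input.
function formMatches(specs, inputs) {
  return specs.every((spec) => inputs.some((input) => fieldMatches(spec, input)));
}

const loginSpecs = [
  { tagname: "input", name: "username", type: "text" },
  { tagname: "input", name: "password", type: "password" }
];
const inputs = [
  { tagName: "INPUT", name: "username", type: "text", id: "u" },
  { tagName: "INPUT", name: "password", type: "password", id: "p" }
];
console.log(formMatches(loginSpecs, inputs));             // prints true
console.log(formMatches(loginSpecs, inputs.slice(0, 1))); // prints false
```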
<detect type="login">
<conditions>
<form>
<xpath><![CDATA[[@name="loginform"]]]></xpath>
</form>
</conditions>
</detect>
The xpath [@name="loginform"] actually applies against the form element that gets passed in, so it will match for any form with an attribute name="loginform". A <regexp> element can only be used inside of an <xpath> condition (see Combined XPath and Regexps below).
XPath matching is more powerful than Basic, as you have access to all elements and attributes. Note, however, that sometimes XPath will not work against forms because the HTML of the page is poorly structured and Gecko does not parse the DOM in such a way that the <input> elements are actually contained within the <form> element. In these cases, you'll have to make do with Basic syntax as described above.
With <cookie> conditions, it is generally most useful to test whether certain cookies exist or not.
<detect type="loggedout">
<conditions>
<cookie nomatch="true" host=".example.com" name="loggedin"/>
</conditions>
</detect>
The above rule will match as long as there is no domain cookie called "loggedin" currently set for example.com.
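The effect of nomatch="true" is simply to invert the test, as in this sketch. The cookie list is a plain array here and cookieConditionMet is a hypothetical name; how Web Detective actually reads the cookie store is not covered on this page.

```javascript
// A <cookie> condition looks for a cookie with the given host and name.
// With nomatch set, the condition is met only when no such cookie exists.
function cookieConditionMet(cookies, cond) {
  const found = cookies.some((c) => c.host === cond.host && c.name === cond.name);
  return cond.nomatch ? !found : found;
}

const loggedOut = { nomatch: true, host: ".example.com", name: "loggedin" };

console.log(cookieConditionMet([], loggedOut)); // prints true
console.log(cookieConditionMet([{ host: ".example.com", name: "loggedin" }], loggedOut)); // prints false
```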
It's possible to pull pieces of information off of a page using detection rules. To do this, you must either have a <results> element in your rule, or else use the shorthand syntax described later on this page. You can use regular expressions, XPath, or a combination of both to get the data.
<detect type="userinfo">
<conditions>
<url domain="example.com"/>
<document>
<xpath><![CDATA[//form[@id="logout"]]]></xpath>
</document>
</conditions>
<results>
<document>
<regexp re1="username">
<![CDATA[/input type="hidden" name="usr" value="(.+)"/]]>
</regexp>
</document>
</results>
</detect>
The above rule will match on any page in the "example.com" domain that contains a "logout" form. It will then pull a "username" value from a hidden field called "usr" using the regular expression.
You may use more than one <regexp> to get results, and each one may capture multiple variables. Each variable is a parenthesized group ( ) in the regexp and corresponds, in order, to an attribute "re1", "re2", and so on.
Example:
...
<results>
<document>
<regexp re1="fullname" re2="username">
<![CDATA[/<span id="user">(.+) \[(.+)\]/]]>
</regexp>
<regexp re1="avatarURL">
<![CDATA[/<img id="avatar" src="(.+)"/]]>
</regexp>
</document>
</results>
The above rule will try to pull three pieces of information off of a page: fullname, username and avatarURL.
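Since these are ordinary JavaScript regexps, the mapping from capture groups to "re1", "re2" and so on can be tried out directly. The sketch below uses the first regexp from the example above; the extractResults helper and the sample markup are made up for illustration.

```javascript
// names[0] plays the role of re1, names[1] of re2, and so on:
// each name receives the text of the corresponding capture group.
function extractResults(source, re, names) {
  const match = re.exec(source);
  if (!match) return null;
  const results = {};
  names.forEach((name, i) => {
    results[name] = match[i + 1];
  });
  return results;
}

const source = '<span id="user">Jane Doe [jdoe]</span>';
const re = /<span id="user">(.+) \[(.+)\]/;

console.log(extractResults(source, re, ["fullname", "username"]));
// prints { fullname: 'Jane Doe', username: 'jdoe' }
```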
It's also possible to perform some post-processing on variables extracted from the regexp. To do this, you need to use a slightly different syntax.
...
<results>
<document>
<regexp>
<![CDATA[/<span id="user">(.+) \[(.+)\]/]]>
<re1 name="fullname"/>
<re2 name="username" processing="subst"><![CDATA[s/\+/_/g]]></re2>
</regexp>
</document>
</results>
The above example will retrieve a username from the document, and will then run a substitution on it, replacing all +'s with _'s.
Other post-processing directives are also available, such as unescape and toUpper (both are used in the example below).
These directives do not require a CDATA block. Furthermore, all of these directives (including subst) can be used in concert against a single variable. Just use a comma-separated list to indicate the order that you would like the processing performed in. For example:
<re2 name="username" processing="subst,unescape,toUpper"><![CDATA[s/\+/ /g]]></re2>
The above will take the username from the document, substitute spaces for + signs, URL-decode the result, and then convert the whole thing to upper case.
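The order of processing can be checked with a small stand-alone sketch. This is not Web Detective's code: the parser handles only simple s/pattern/replacement/flags expressions, and using decodeURIComponent for the unescape directive is an assumption.

```javascript
// Parse a simple sed-style "s/pattern/replacement/flags" expression.
// Escaped characters inside the replacement are not unescaped here.
function parseSubst(expr) {
  const m = /^s\/((?:\\.|[^\/\\])*)\/((?:\\.|[^\/\\])*)\/([a-z]*)$/.exec(expr);
  if (!m) throw new Error("Unsupported substitution: " + expr);
  return { re: new RegExp(m[1], m[3]), replacement: m[2] };
}

// Apply the comma-separated directives from left to right.
function applyProcessing(value, processing, substExpr) {
  return processing.split(",").reduce((current, directive) => {
    switch (directive.trim()) {
      case "subst": {
        const subst = parseSubst(substExpr);
        return current.replace(subst.re, subst.replacement);
      }
      case "unescape":
        return decodeURIComponent(current); // assumed meaning of "unescape"
      case "toUpper":
        return current.toUpperCase();
      default:
        throw new Error("Unknown directive: " + directive);
    }
  }, value);
}

console.log(applyProcessing("jane+doe%21", "subst,unescape,toUpper", "s/\\+/ /g"));
// prints "JANE DOE!"
```

Running subst before unescape matters here: if the order were reversed, a literal "+" produced by decoding "%2B" would also be turned into a space.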
XPath expressions can also be used to extract results:
...
<results>
<document>
<xpath name="profileURL" extract="attribute:href">
<![CDATA[//div[@id="welcome"]//a]]>
</xpath>
</document>
</results>
The above rule will find a div called "welcome", look inside it for an 'a' element (link), and then grab the value of the 'href' attribute from that link. The returned variable will be called "profileURL".
The 'extract' attribute is optional in Web Detective syntax. The XPath query itself should select an element in the HTML document, and the 'extract' attribute is used to indicate which property of that selected element to extract. If 'extract' is not specified, then it will default to using the 'nodeValue' property. Depending on the situation, that may not be what you want, however. If you are trying to pull data from a form field, then you likely want to specify extract="value". If you want the value of a specific attribute in the HTML, then you will want to use "attribute:<name>".
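The three cases for 'extract' can be summarised in a short sketch. A plain object stands in for the DOM node so the example runs outside the browser; extractValue and fakeElement are hypothetical names, and reading extract="value" as a property lookup on the node is an assumption.

```javascript
// Decide what to return for a node selected by the XPath query.
function extractValue(node, extract) {
  if (extract === undefined) {
    return node.nodeValue; // default when 'extract' is omitted
  }
  if (extract.startsWith("attribute:")) {
    return node.getAttribute(extract.slice("attribute:".length));
  }
  return node[extract]; // e.g. extract="value" for form fields
}

// Minimal stand-in for a DOM element, for demonstration only.
function fakeElement(attrs, props) {
  return Object.assign(
    { nodeValue: null, getAttribute: (name) => (name in attrs ? attrs[name] : null) },
    props
  );
}

const link = fakeElement({ href: "http://pages.example.com/alice/" }, {});
const input = fakeElement({}, { value: "alice" });

console.log(extractValue(link, "attribute:href")); // prints "http://pages.example.com/alice/"
console.log(extractValue(input, "value"));         // prints "alice"
console.log(extractValue(link));                   // prints null
```

The last call shows why the default may not be what you want: in the DOM, an element node's nodeValue is null.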
It's also possible to use a combination of XPath and regular expressions to pull data from a page. This can be more robust than using regexps alone.
Example:
...
<results>
<document>
<xpath>
<![CDATA[//form[@id="logout"]//input[@name="usr"]/@value]]>
<regexp re1="username"><![CDATA[/(.*)/]]></regexp>
</xpath>
</document>
</results>
The above rule is a more robust version of the earlier regexp-only rule that pulled a "username" from the hidden "usr" field. This time we ensure that the "usr" input field occurs inside the logout form, rather than elsewhere on the page. The regular expression is run against the nodeValue of the node(s) resulting from the XPath expression, which in this case is the value of the "value" attribute.
IMPORTANT NOTE: XPath will only work in situations where you actually have a DOM -- for example when the user navigates to a page, the DOM gets loaded, and a "FlockDocumentReady" event is fired. In situations where there is no DOM -- such as when using XMLHttpRequest or flockHttpRequest -- you must rely on regular expressions alone.
You can pull form field values using either the Basic or XPath syntax described for forms, above. By default, if you specify <field name="username"/>, for example, then the value of the "username" form field will be extracted and returned as "username" in the results hash. If you want the resulting value to be called something else, however, you can specify an extractas="myvar" attribute to rename it.
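The naming rule for extracted form values can be sketched like this. Plain objects stand in for the form's inputs and collectFieldValues is a hypothetical name; only the choice of result key reflects the description above.

```javascript
// Each matched field's value is stored under extractas if given,
// otherwise under the field's own name.
function collectFieldValues(specs, inputs) {
  const results = {};
  for (const spec of specs) {
    const input = inputs.find((candidate) => candidate.name === spec.name);
    if (input) {
      results[spec.extractas || spec.name] = input.value;
    }
  }
  return results;
}

const formInputs = [
  { name: "username", value: "alice" },
  { name: "password", value: "secret" }
];

console.log(collectFieldValues([{ name: "username" }], formInputs));
// prints { username: 'alice' }
console.log(collectFieldValues([{ name: "username", extractas: "myvar" }], formInputs));
// prints { myvar: 'alice' }
```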
There is a shorthand syntax available that can make the XML you have to write simpler in some situations. Here are the main points:
<conditions> element is optional. If you omit it from your rule, then any children of the <detect> element (excluding <results>) will be considered conditions.
<results> element is optional. If you omit it from your rule, then any children of the <detect> element (excluding <conditions>) will be considered results.
<document> element is optional. If you omit it, then any conditions (eg. <xpath>, <regexp>, etc.) not contained within a <url>, <form> or <cookie> element are assumed to apply to the document.
So using the original longhand syntax as previously described, we might have a rule like this:
<detect type="accountinfo">
<conditions>
<url domain="example.com"/>
<document>
<xpath><![CDATA[//form[@id="userinfo"]/input[@id="username"]]]></xpath>
</document>
</conditions>
<results>
<document>
<xpath name="username" extract="value">
<![CDATA[//form[@id="userinfo"]/input[@id="username"]]]>
</xpath>
</document>
</results>
</detect>
And it could be more succinctly written using shorthand syntax, like so:
<detect type="accountinfo">
<url domain="example.com"/>
<xpath name="username" extract="value">
<![CDATA[//form[@id="userinfo"]/input[@id="username"]]]>
</xpath>
</detect>
In the above shorthand rule, the <url> and <xpath> elements are treated as both conditions and results. (No results are actually obtained from the <url>, but that's ok.) Furthermore, the <xpath> is assumed to apply to the document, even though <document> is not expressed.
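One way to think about the shorthand is as an expansion step that rewrites a rule into its longhand form. The sketch below models that with plain objects in place of XML elements. It follows the three points listed above, but the data layout and function names are invented and it is not Web Detective's parser.

```javascript
// Element names that are never implicitly wrapped in <document>.
const NOT_WRAPPED = ["url", "form", "cookie", "document"];

// Put any loose <xpath>/<regexp> style nodes inside an implied <document>.
function wrapLooseInDocument(nodes) {
  const loose = nodes.filter((n) => !NOT_WRAPPED.includes(n.name));
  const rest = nodes.filter((n) => NOT_WRAPPED.includes(n.name));
  return loose.length ? rest.concat([{ name: "document", children: loose }]) : rest;
}

// Expand a <detect> rule: when <conditions> or <results> is omitted,
// the remaining children of <detect> are used in its place.
function expandShorthand(detect) {
  const find = (name) => detect.children.find((c) => c.name === name);
  const others = detect.children.filter(
    (c) => c.name !== "conditions" && c.name !== "results"
  );
  const conditions = find("conditions");
  const results = find("results");
  return {
    type: detect.type,
    conditions: wrapLooseInDocument(conditions ? conditions.children : others),
    results: wrapLooseInDocument(results ? results.children : others)
  };
}

const shorthand = {
  type: "accountinfo",
  children: [
    { name: "url", domain: "example.com" },
    { name: "xpath", resultName: "username", extract: "value" }
  ]
};

console.log(JSON.stringify(expandShorthand(shorthand), null, 2));
```

For the shorthand rule above, both the conditions and the results come out as the <url> element followed by a <document> containing the <xpath>, which mirrors the explanation given for that rule.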
Example rules files:
delicious.xml
facebook.xml
flickr.xml
livejournal.xml
myspace.xml
photobucket.xml
wordpress.xml
xanga.xml
youtube.xml