

Just-in-Time Information Propagation (WebVigiL) - Architecture

Sentinel

WebVigiL provides an expressive language with well-defined semantics for specifying a user's requirements for monitoring the Web. Each monitoring request is termed a sentinel. The specification language supports the following features:

  • A suite of change types at appropriate levels of granularity that are of interest to a large class of users. For example, changes only at the level of a page may be overkill in many cases. One may be looking for changes in keywords or phrases of interest.
  • Ability to monitor a page based on the actual change frequency, or at a user-specified frequency. Specifying the actual change frequency relieves the user of having to know when the page changes; the system then tracks changes on a best-effort basis.
  • Ability to specify detection of multiple types of changes on a page.
  • Notification frequency either as best effort or with pre-determined frequency.
  • Multiple ways to compare changes (e.g., pairwise, every n, or moving n).
  • Specification of a sentinel in terms of previously defined sentinels. The starting and stopping of a sentinel may also be based on other sentinels. This provides a mechanism for tracking correlated changes.

For example, consider the following scenario: Jill wants to be notified by e-mail every 4 days of changes to links and images on the page "http://www.cnn.com", from December 2, 2002 to January 2, 2003, with the page fetched every 2 days. The sentinel generated for this scenario is as follows:

  • Create Sentinel s1 Using http://www.cnn.com
  • Monitor all links AND all images
  • Fetch 2 day
  • From 12/02/02 To 01/02/03
  • Notify By email jill@aol.com Every 4 day
  • Compare pairwise
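
The clauses of a sentinel such as s1 map naturally onto a small record. The following is a minimal sketch, assuming one clause per line in exactly the form shown above; the `Sentinel` class, its field names and `parse_sentinel` are hypothetical and are not part of WebVigiL.

```python
import re
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class Sentinel:
    # Hypothetical record; field names are illustrative, not WebVigiL's own.
    name: str = ""
    url: str = ""
    change_types: list = field(default_factory=list)
    fetch_every_days: int = 0
    start: date = None
    end: date = None
    notify_by: str = ""
    notify_address: str = ""
    notify_every_days: int = 0
    compare: str = ""


def parse_sentinel(text: str) -> Sentinel:
    """Parse the six clauses of the example sentinel, one clause per line."""
    s = Sentinel()
    for raw in text.strip().splitlines():
        line = raw.strip()
        if m := re.fullmatch(r"Create Sentinel (\S+) Using (\S+)", line):
            s.name, s.url = m.groups()
        elif m := re.fullmatch(r"Monitor (.+)", line):
            s.change_types = [c.strip() for c in m.group(1).split(" AND ")]
        elif m := re.fullmatch(r"Fetch (\d+) day", line):
            s.fetch_every_days = int(m.group(1))
        elif m := re.fullmatch(r"From (\S+) To (\S+)", line):
            # Dates are assumed to be month/day/two-digit-year.
            s.start, s.end = (datetime.strptime(d, "%m/%d/%y").date() for d in m.groups())
        elif m := re.fullmatch(r"Notify By (\S+) (\S+) Every (\d+) day", line):
            s.notify_by, s.notify_address = m.group(1), m.group(2)
            s.notify_every_days = int(m.group(3))
        elif m := re.fullmatch(r"Compare (\S+)", line):
            s.compare = m.group(1)
        else:
            raise ValueError(f"unrecognized clause: {line!r}")
    return s


S1_TEXT = """
Create Sentinel s1 Using http://www.cnn.com
Monitor all links AND all images
Fetch 2 day
From 12/02/02 To 01/02/03
Notify By email jill@aol.com Every 4 day
Compare pairwise
"""
s1 = parse_sentinel(S1_TEXT)
```

An unrecognized clause raises an error rather than being ignored, which is one simple way to surface syntactic problems early.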

Verification Module

The verification module provides the communication interface between the system and the user for specifying sentinels. User requests (sentinels) are checked for syntactic and semantic correctness. Valid sentinels are stored in the Knowledge Base, and the change detection module is notified of them. In general, the functionality of the verification module can be categorized as:

  • Load balancing of syntactic validation between client and server, which reduces unnecessary communication. For example, a start date set in the past can be rejected at the client rather than checked at the server.
  • Semantic validation of sentinels at the server, where the dependency information specified in the sentinel is available. For example, if sentinel s1 is specified to start at the end of another sentinel s2, and s2 has already expired at the time of specification, an error should be reported to the user.
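
The split between the two kinds of validation can be sketched as two functions: one that needs no server state, and one that needs the dependency information. This is an illustration only; the function names, the layout of `known_sentinels`, and the extra check that the end date follows the start date are assumptions, not WebVigiL's actual rules.

```python
from datetime import date


def validate_syntax(start: date, end: date, today: date) -> list:
    """Checks that need no server state, so they could run at the client."""
    errors = []
    if start < today:
        errors.append("start date is in the past")
    if end <= start:  # assumed extra check, not stated in the text
        errors.append("end date must be after start date")
    return errors


def validate_semantics(start_on_end_of, known_sentinels: dict, today: date) -> list:
    """Checks that need dependency information held at the server.

    start_on_end_of is the name of the sentinel whose end triggers this one,
    or None. known_sentinels maps a sentinel name to its end date
    (a hypothetical layout).
    """
    errors = []
    if start_on_end_of is not None:
        if start_on_end_of not in known_sentinels:
            errors.append(f"unknown sentinel {start_on_end_of}")
        elif known_sentinels[start_on_end_of] < today:
            errors.append(f"sentinel {start_on_end_of} has already expired")
    return errors
```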

Knowledge Base

The Knowledge Base is a persistent repository containing meta-data about each user, the number and names of the sentinels set by each user, and the details of each sentinel (frequency of notification, change type etc.). The details of a sentinel need to be stored in a persistent and recoverable manner because several modules use this information at run time. For example, the change detection module detects changes based on sentinel information such as the URL to be monitored, the change and compare specifications, and the start and end of the sentinel. The fetch module fetches pages based on the user-specified fetch policy. The notification module requires the appropriate contact information and notification mechanism to notify the user of changes. User information such as the sentinel installation date, the page versions used for change detection, and the storage path of detected changes also needs to be stored so that users can keep track of their sentinels.

To satisfy all the above requirements, the metadata (the WebVigiL Knowledge Base) generated and used by the different modules is stored in a relational DBMS. The monitoring request is parsed, and the sentinel properties are extracted, validated and stored in the KB. For example, the following parameters are stored for notification: the frequency of notification and the mechanism used to notify the user. In addition, important run-time parameters computed by the different modules, such as the status of the created sentinels and the parameters of the change detection module, are also persisted in the KB. Finally, a relational database provides convenient mechanisms for extracting the required information, either through queries or through the JDBC Bridge.
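
To make the stored parameters concrete, the sketch below sets up two tables and runs the kind of query the notification module might issue. It uses Python's built-in `sqlite3` as a stand-in for the relational DBMS (the text mentions JDBC); all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the relational DBMS
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    email   TEXT NOT NULL
);
CREATE TABLE sentinels (
    name              TEXT PRIMARY KEY,
    user_id           INTEGER NOT NULL REFERENCES users(user_id),
    url               TEXT NOT NULL,
    change_types      TEXT NOT NULL,
    compare_option    TEXT NOT NULL,
    fetch_every_days  INTEGER,          -- NULL could stand for "fetch on change"
    start_date        TEXT NOT NULL,
    end_date          TEXT NOT NULL,
    notify_mechanism  TEXT NOT NULL,
    notify_every_days INTEGER,          -- NULL could stand for "best effort"
    status            TEXT NOT NULL DEFAULT 'enabled'
);
""")
conn.execute("INSERT INTO users VALUES (1, 'jill@aol.com')")
conn.execute(
    "INSERT INTO sentinels VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("s1", 1, "http://www.cnn.com", "all links,all images", "pairwise",
     2, "2002-12-02", "2003-01-02", "email", 4, "enabled"),
)

# What the notification module needs: contact, mechanism, and frequency.
row = conn.execute("""
    SELECT u.email, s.notify_mechanism, s.notify_every_days
    FROM sentinels s JOIN users u ON u.user_id = s.user_id
    WHERE s.name = ?
""", ("s1",)).fetchone()
```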

Change Detection Module

Every valid user request arriving at WebVigiL initiates a series of operations that occur at different points in time. These include creating the sentinel (based on its start time), monitoring the requested page, detecting changes of interest, notifying the user(s) of the change, and deactivating the sentinel. In WebVigiL, for every sentinel, the ECA rule generation module generates ECA rules [17, 18] to perform some of these operations. This module is responsible for:

  • Activating and deactivating sentinels
  • Maintaining Change Detection Graph
  • Generating Fetch rules.

Activation/Deactivation

During its lifespan, a sentinel is active and participates in change detection. A sentinel can be disabled (does not detect changes during that period) or enabled (detects changes). By default, a sentinel is enabled during its lifespan. The user can also explicitly change the states of the sentinel during its lifespan. The start/end of a sentinel can be time points or events. When a sentinel's start time is now, it is enabled immediately. But in cases where the start is at a later time point or depends on another event that has not occurred, we need to enable the sentinel only when the start time is reached or the event of interest has occurred.
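
The distinction between a sentinel being within its lifespan and being enabled can be sketched as a small state tracker. The class and attribute names are hypothetical, and events are reduced to plain strings for illustration.

```python
from datetime import datetime


class SentinelState:
    """Hypothetical lifecycle tracker for one sentinel."""

    def __init__(self, start, end, start_event=None):
        self.start = start              # datetime, or None when the start is "now"
        self.end = end                  # datetime
        self.start_event = start_event  # name of an event to wait for, or None
        self.event_seen = start_event is None
        self.user_disabled = False      # explicit override by the user

    def on_event(self, event_name):
        """Record that the event this sentinel depends on has occurred."""
        if event_name == self.start_event:
            self.event_seen = True

    def is_active(self, now):
        """Within the lifespan: start reached (or event seen), end not passed."""
        started = self.event_seen and (self.start is None or now >= self.start)
        return started and now <= self.end

    def is_enabled(self, now):
        """Enabled by default while active, unless the user disabled it."""
        return self.is_active(now) and not self.user_disabled
```

A sentinel whose start depends on another sentinel stays inactive until `on_event` is called with the matching event name, which mirrors the deferred enabling described above.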

Change Detection Graph

When a page is fetched, the change is computed for every sentinel that is interested in that page, and the user is notified. When two or more sentinels are interested in the same type of change on the same page, a naive approach would compute the same change more than once. We avoid this by capturing the relationship between pages and sentinels and grouping the sentinels by change type and page, so that all sentinels interested in the same type of change on the same page are grouped together. To represent this relationship we construct a change detection graph. The change detection graph for the sentinels s1 and s3 is shown in Figure 3.2. The different types of nodes in the graph are as follows:

  • URL node: A URL node is a leaf node that denotes the page of interest. The number of URL nodes in the graph is equal to the number of distinct pages the system is monitoring at any particular instant of time.
  • Change type node: All level-1 nodes in the graph belong to this category. This node represents the type of change on a page (all words, links, images, keywords, phrases, table, list, regular expression, any change).
  • Composite Node: A Composite node represents a combination of change types. All higher-level nodes (> level-1) in the graph belong to this type. Currently we support composite changes on a single page. We plan on extending this to multiple pages.

To facilitate the detection and propagation of changes, the relationship between nodes at different levels of the graph is captured using a subscription/notification mechanism. Higher-level nodes subscribe to lower-level nodes, and each node maintains this subscription information in a subscriber list. At the URL node, the list contains references to the change type nodes. At a change type node, each sentinel has a subscriber entry that contains references to the composite nodes. When a page is fetched, the associated URL node is notified about the page and propagates it to all the change type nodes that have subscribed to it. At each change type node, the change is then computed between the current page and an appropriate reference page (based on the compare option) that is fetched from the page repository. If there is any change, the sentinels subscribed to that node are notified. When the change type is part of a composite change, the corresponding composite nodes are notified as well.
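
The propagation described above can be sketched with two node classes. This is a simplified illustration under several assumptions: composite nodes are omitted, the reference page is simply the previously received version (the pairwise option) rather than a page repository lookup, subscribers are plain callbacks, and the word-level detector is a toy. None of the names come from WebVigiL.

```python
class ChangeTypeNode:
    """Level-1 node: computes one type of change for one page."""

    def __init__(self, change_type, detect):
        self.change_type = change_type
        self.detect = detect          # function(old_page, new_page) -> change or None
        self.subscribers = []         # callbacks for sentinels (or composite nodes)
        self.reference_page = None    # stands in for a page repository lookup

    def notify(self, page):
        old, self.reference_page = self.reference_page, page
        if old is None:
            return                    # first version: nothing to compare against
        change = self.detect(old, page)
        if change:                    # computed once, shared by all subscribers
            for callback in self.subscribers:
                callback(self.change_type, change)


class UrlNode:
    """Leaf node: receives fetched pages and passes them to change type nodes."""

    def __init__(self, url):
        self.url = url
        self.subscribers = []         # ChangeTypeNode instances

    def notify(self, page):
        for node in self.subscribers:
            node.notify(page)


def changed_words(old, new):
    """Toy detector: words present in exactly one of the two versions."""
    return sorted(set(old.split()) ^ set(new.split())) or None


received = []
words = ChangeTypeNode("all words", changed_words)
words.subscribers.append(lambda kind, change: received.append(("s1", kind, change)))
words.subscribers.append(lambda kind, change: received.append(("s3", kind, change)))
page_node = UrlNode("http://www.cnn.com")
page_node.subscribers.append(words)

page_node.notify("alpha beta")    # first version, stored as the reference
page_node.notify("alpha gamma")   # change computed once, delivered to s1 and s3
```

Because both sentinels subscribe to the same change type node, the detector runs once per fetched page regardless of how many sentinels share it.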

Detection algorithms

A detection algorithm associated with each change type node computes the changes between two versions of a page with respect to that change type. To detect a change, the objects of interest are first extracted from the given versions of the page according to the change type. Change detection algorithms have been developed to detect different types of changes to HTML and XML pages. The change types currently supported are: links, images, all words, keywords, phrases and regular expressions. Changes to links, images, words and keywords are captured in terms of insertion or deletion. For phrases, updates are detected in addition to insertions and deletions.
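
As an illustration of the extract-then-compare pattern for the links change type, the sketch below pulls `href` values out of two HTML versions and reports insertions and deletions. It assumes a simple set-based comparison that ignores link order and duplicates; this is not necessarily how WebVigiL's own algorithms work.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def detect_link_changes(old_html, new_html):
    """Report links inserted into or deleted from the newer version."""
    old, new = extract_links(old_html), extract_links(new_html)
    return {"inserted": sorted(new - old), "deleted": sorted(old - new)}
```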

Fetch Module

The Fetch module of WebVigiL retrieves the pages registered with it, serving as a local wrapper for page fetching under the user-set fetch policy: either fetching a page at a user-specified interval, or fetching on change, in which case the system determines the fetch frequency from the change frequency of the page. The Fetch module informs the version controller of every version it fetches, stores that version in the page repository, and notifies the change detection graph of a successful fetch. The wrapper fetches a page only when the properties of the page have changed, where the properties are the page size and the last-modified time stamp. When the time stamp has changed and the page size has increased or decreased, the wrapper fetches and caches the page. When the time stamp has changed but the page size is the same, the wrapper fetches the page and calculates its checksum; the page is cached only if this checksum differs from the checksum of the cached copy.
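
The caching decision can be sketched as a single function. The names are hypothetical, SHA-256 is an arbitrary choice of checksum, and the case of a changed size with an unchanged time stamp, which the text does not describe, is treated here as a change.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageMeta:
    last_modified: str   # e.g. the value of a Last-Modified header
    size: int            # page size in bytes
    checksum: str = ""   # checksum of the cached copy


def checksum(body: bytes) -> str:
    # SHA-256 is an assumption; the text does not name a checksum algorithm.
    return hashlib.sha256(body).hexdigest()


def should_cache(cached: PageMeta, last_modified: str, size: int, fetch_body) -> bool:
    """Decide whether a new version of the page must be cached.

    fetch_body is a zero-argument callable returning the page bytes. It is
    called only when the properties alone cannot settle the question.
    """
    if last_modified == cached.last_modified and size == cached.size:
        return False                      # properties unchanged: no fetch needed
    if size != cached.size:
        return True                       # size changed: fetch and cache
    # Time stamp changed but size is the same: compare checksums.
    return checksum(fetch_body()) != cached.checksum
```

Passing the body as a callable keeps the expensive download out of the cases that the size and time stamp already decide.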

Version Management

An important feature of the WebVigiL architecture is its centralized, server-based repository service (the version controller), which archives and manages versions of pages. WebVigiL retrieves and stores only those pages needed by a sentinel. The primary purpose of the repository service is to reduce the number of network connections to remote web servers, thereby reducing network traffic. When a remote page fetch is initiated, the repository service checks whether the page exists in its cache and, if so, returns the latest cached version. On a cache miss, the repository service requests that the page be fetched from the appropriate remote server. Subsequent requests for the page can then be served from the cache instead of repeatedly invoking the fetch procedure.

The repository service reduces network traffic and the latency of obtaining a web page, because WebVigiL can obtain the "Target Web Pages" from the cache instead of requesting them directly from the remote server. The quality of service of the repository includes managing multiple versions of pages without excessive storage overhead.
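
The cache-hit and cache-miss behaviour can be sketched with an in-memory store. This is an assumption-laden illustration: the class is hypothetical, versions are kept in full in a list, and nothing here addresses the storage overhead concern mentioned above.

```python
class VersionRepository:
    """Hypothetical in-memory version store keyed by URL."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote   # callable(url) -> page content
        self.versions = {}                 # url -> list of versions, oldest first
        self.remote_fetches = 0            # counts trips to the remote server

    def store(self, url, page):
        """Archive a new version of the page."""
        self.versions.setdefault(url, []).append(page)

    def get_latest(self, url):
        if url not in self.versions:       # cache miss: go to the remote server
            self.remote_fetches += 1
            self.store(url, self.fetch_remote(url))
        return self.versions[url][-1]      # cache hit: latest stored version


repo = VersionRepository(lambda url: "page at " + url)
first = repo.get_latest("http://www.cnn.com")    # miss: one remote fetch
second = repo.get_latest("http://www.cnn.com")   # hit: served from the cache
```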

Presentation Module

The principal functionality of this module is to present clearly the detected differences between two web pages to the user. Therefore, computing and displaying the detected differences is very important.

Change Presentation

Existing tools display changes in three ways: (1) merging the two documents, (2) displaying only the changes, and (3) highlighting the differences in both pages. Summarizing the common and changed data in a single merged document has the advantage of displaying the common portions only once. Its disadvantage is that the changes are difficult to view when they are large in number. Displaying only the computed differences is a better option when the user is tracking changes to multiple pages or when the number of changes is large. Highlighting the differences with both pages displayed side by side is preferable for change types like "any change" and "phrase change", where the detected differences are easier to perceive if the change in the new page is shown relative to the old page.

Because WebVigiL will track multiple types of changes on a web page, and will eventually notify users through different media (email, PDA, laptop etc.), a combination of all the presentation styles discussed above will be relevant: the information to be notified will vary with factors such as the notification method, the number of detected differences and the type of change.

Change Notification

Users need to be notified of detected changes. The mechanism selected for notification is important, especially when multiple types of devices with varying capabilities are involved. The three main issues for notification are what to notify, when to notify and how to notify.

Presentation Content

Presentation content should be concise and lucid. Users should be able to clearly perceive the computed differences in the context of their predefined specification. The notification report could contain the following basic information:

  • The change detected in the latest page relative to the reference page
  • User specified type of change like "any change", "all words" etc.
  • URL for which the change detection module is invoked.
  • Small summary explaining the detected change.

The report could also include the status of each change (Insert, Delete or Changed) for change types such as "images", "all links" and "keywords", and/or the timestamps of modification, polling, change detection and notification. The size of the notification report will depend on the maximum amount of information that can be sent to a user while satisfying the network quality of service requirements.

Notification frequency

A detected change can be notified in two ways: (i) immediately when the change is detected, or (ii) after a fixed time interval. A user may want to be notified immediately of changes on particular pages, in which case the notification should be sent as soon as the change is detected. For web pages that are modified frequently, however, changes will be detected very often. Since notifying each of these changes would create a bottleneck on the network, it is preferable to send notifications periodically. Notifications must be sent within the QoS constraints, and the system should allow users to specify the desired notification frequency. For example, in sentinel s1, Jill wants to be notified once every 4 days, irrespective of when the changes are detected.
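
The two notification modes can be sketched with a small buffer that either sends each change at once or releases a batch when the interval has elapsed. The `Notifier` class, its method names and the externally driven `tick` are assumptions made for illustration.

```python
from datetime import datetime, timedelta


class Notifier:
    """Buffers detected changes and releases them immediately or periodically."""

    def __init__(self, send, start, interval=None):
        self.send = send            # callable(list_of_changes)
        self.interval = interval    # timedelta, or None for immediate notification
        self.next_due = None if interval is None else start + interval
        self.pending = []

    def on_change(self, change):
        self.pending.append(change)
        if self.interval is None:
            self.flush()            # immediate mode: send right away

    def tick(self, now):
        """Called by a scheduler; sends the batch once the interval has elapsed."""
        if self.interval is None or now < self.next_due:
            return
        self.flush()
        while self.next_due <= now:          # keep the schedule on fixed steps
            self.next_due += self.interval

    def flush(self):
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()


sent = []
start = datetime(2002, 12, 2)
jill = Notifier(sent.append, start, interval=timedelta(days=4))
jill.on_change("link inserted")
jill.on_change("image deleted")
jill.tick(start + timedelta(days=3))   # interval not yet elapsed: nothing sent
jill.tick(start + timedelta(days=4))   # both changes sent as one batch
```

With an interval of 4 days, as in sentinel s1, changes detected on different days are delivered together in a single notification.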

Notification methods

Different notification options, such as email, fax, PDA and an interactive mode, can be used. The interactive mode is a pull-based approach in which the user pulls the detected changes as and when needed. A dashboard will be provided for the user to view and query the changes generated by his/her sentinels.

Department of Computer Science and Engineering