Extend current functionality of AChecker to validate dynamically generated content

tejasshah93/AChecker-dynamic-validation

layout: post
title: Documentation
date: 2014-08-11 16:21:31 -0700
disqus: y

PROJECT IDEA

- Pre-GSoC State of the Art:

AChecker validates the static HTML of a web page against the WCAG 2.0 guidelines. It takes the source code of the input URL and checks it against the specified WCAG accessibility norms.

However, some accessibility issues cannot be identified by static validation alone. For example, many HTML elements in a web page are generated or modified in response to Document Object Model (DOM) events, and static validation never sees them.

- Purpose & Scope of the Project:

The goal is to improve the checks done by AChecker: treating the currently implemented WCAG 2.0 checks as a black box, the project extends AChecker to validate dynamically generated web content as well.


DESIGN & ARCHITECTURE

Aim: Fetch the HTML that is newly generated when a certain set of events (triggers) fires on the web page, and then validate it.

Basic steps carried out:

  • JavaScript and jQuery scripts detect the common events that manipulate the DOM and cause new HTML content to be generated.
  • Using the PhantomJS and CasperJS libraries (discussed in detail below), all such events are pushed into an array and each of them is evaluated.
    • Thereafter, the new HTML content produced by triggering each dynamic event is fetched.
  • The new HTML content obtained from each trigger is merged with the original source code (without duplication).
  • The merged HTML is passed to AChecker for validation, giving the union of the results. The final AChecker result thus covers all such dynamic events as well as the default static HTML source (see Integration with AChecker).

Crux: Fetching the dynamically generated HTML requires the site to be opened and an event to be triggered. The JavaScript in the page source that handles the triggered event must actually be executed.

- Architectural Requirements

To accomplish the above, we need a headless WebKit browser, i.e. a browser framework without the actual UI which executes JavaScript and is able to modify the DOM => PhantomJS
PhantomJS by itself offers DOM handling and jQuery selector functionality. However, to ease the usage of PhantomJS and to get a better handle on navigation scripting, we use CasperJS too. It is a utility written for PhantomJS which provides high-level functions for manipulating the DOM and accessing it remotely, which makes it the most apt library for this project.


Setting up the environment
For installing PhantomJS:

Using the native package manager (apt-get for Ubuntu and Debian, pacman for Arch Linux, pkg_add for OpenBSD, etc).

e.g.: for Ubuntu
$ sudo apt-get install phantomjs
Click here for more information

For installing CasperJS:
Using npm:

$ npm install -g casperjs

Click here for detailed instructions and alternative methods


IMPLEMENTATION

Developer Manual

- Detecting events that trigger and manipulate DOM contents

Since CasperJS is a utility framework in JavaScript which allows access to the remote DOM, events that manipulate the DOM are detected using standard jQuery selectors, and the HTML entities are returned to the CasperJS environment.

The following gives the gist of the process being carried out:

CasperJS framework

The evaluate() function acts as a gate between the CasperJS environment and the opened web page. Thus, every time a closure is passed to evaluate(), we enter the page and execute code as if using the browser console.


For example, let there be a function __getOnClickTriggerElements()__:
function getOnClickTriggerElements(){
    /* .. JavaScript/jQuery HTML entity selector code here .. */
    return onClickTriggerElements;
}

This function returns a set of HTML elements which are capable of manipulating the DOM and generating new HTML content.

- Evaluating each trigger element separately

Now the above function is evaluated as:

onClickTriggerElements = casper.evaluate(getOnClickTriggerElements);

Here, __onClickTriggerElements__ is the array of HTML entities returned to the CasperJS environment by the __evaluate()__ function. It lists the HTML elements which, when triggered, manipulate the DOM and generate new HTML content. Similar functions are written for the other trigger types (onClick, input forms, mouseover and related mouse events, button triggers, etc.); each fetches the DOM-manipulating elements of its kind and returns them as an array.
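As an illustration of what such a selector function might contain, here is a minimal sketch. The function name `getOnClickTriggerDescriptors`, the `[onclick]` attribute selector and the shape of the returned objects are assumptions made for this example, not the project's actual code. It returns plain descriptors rather than DOM nodes because values returned from evaluate() are serialized on their way back to the CasperJS environment.

```javascript
// Hypothetical page-side helper (not the project's code): collects elements
// that carry an inline onclick attribute and describes each one with a
// plain, JSON-serializable object.
function getOnClickTriggerDescriptors(doc) {
    // Inside casper.evaluate() no argument is passed, so fall back to the
    // page's own document. The parameter exists to make the sketch testable.
    doc = doc || document;
    var nodes = doc.querySelectorAll('[onclick]');
    var descriptors = [];
    for (var i = 0; i < nodes.length; i++) {
        descriptors.push({
            index: i,                              // position among the matches
            tag: nodes[i].tagName.toLowerCase(),   // e.g. "button"
            id: nodes[i].id || null                // null when no id is set
        });
    }
    return descriptors;
}
```

Note that an attribute selector like this only finds inline handlers; handlers bound from script (for example with jQuery's `.on()`) would need a different detection strategy.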

Now that all the trigger elements are available in an array, they need to be triggered one by one so that the newly generated HTML content can be fetched after each trigger.

Following the CasperJS evaluation architecture described above, we trigger each of these elements in the evaluate() function (using casper.each(), since each element of the array is to be triggered separately) and capture the newly generated HTML content.
Assumption: Currently, a wait() call is used on the assumption that the DOM takes at most 1 second to load after an element is triggered. Thereafter, the source code at that instant is captured and sent back to the CasperJS environment from the evaluate() function. For each trigger, an HTML file named data<counter>.html is generated (counter is the running number of such data files), containing the source code of the web page as it stands after that element has been triggered.

A sample code snippet for this would seem like:

// wait for approx. 1000 ms to load the DOM
casper.wait(1000, function(){
    HTMLSource = this.evaluate(function(){
        return document.getElementsByTagName('html')[0].outerHTML;
    });
    // save HTMLSource contents into separate files for each of the triggers
});
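The file-saving step left as a comment in the snippet above could be sketched as follows. `makeSnapshotSaver` and its `writeFile` parameter are hypothetical names introduced for illustration; only the data<counter>.html naming and the HTMLSourceFiles folder come from this document.

```javascript
// Hypothetical helper: returns a function that stores each captured snapshot
// as <dir>/data<counter>.html, with the counter starting at 0 so that
// data0.html can hold the static source.
function makeSnapshotSaver(writeFile, dir) {
    var counter = 0;
    return function saveSnapshot(html) {
        var path = dir + '/data' + counter + '.html';
        writeFile(path, html);   // the actual writing is delegated to the caller
        counter += 1;
        return path;
    };
}
```

In a CasperJS script, `writeFile` could be a thin wrapper around the PhantomJS `fs` module, for example `function (path, content) { fs.write(path, content, 'w'); }`, with `saveSnapshot(HTMLSource)` called inside the `casper.wait()` callback.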
- Merging the new HTML contents generated by different triggers (without duplication)

Crux: Since each DOM-manipulating element is triggered in turn on the same page, the HTML source code grows incrementally and the snapshots contain duplicated content. Consider the following illustration:

Say there are 4 trigger elements on a web page.

  • The file data0.html contains the static source code of the web page being validated.
  • After the 1st trigger is processed, say some new HTML content gets generated; data1.html then contains a snapshot of the page source after the 1st dynamic element has been triggered.
  • After the 2nd trigger is processed, data2.html contains the source code after the 2nd element has been triggered. However, it also includes the HTML content generated by the 1st trigger, since the process is iterative and the triggers are applied to the same DOM one after another.
    • A fresh reload is not done after every trigger because reloading the web page each time would be costly in terms of time.

It is also not correct to assume that the last HTML file generated (say _dataN.html_) contains all of the newly generated HTML along with the original source code, since some triggers may overwrite content written by others.
**Aim**: Get all the new HTML content generated by each trigger (maximization), considering even the slightest change, and merge it.
**Implementation**: Following the iterative approach, a __diff__ of adjacent HTML files is taken using
diff -u data0.html data1.html

which produces output in unified diff format (the format also used by git diff). Iteratively collecting all the '+' lines from the diff outputs gives the dynamic HTML content generated by all the triggers, which is scattered across the different HTML files. All such added lines are merged into one file, say dynamicDOMElements.html. (The file mergeFiles.py does the work described.)
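The extraction of '+' lines can be sketched as below. This is not the project's mergeFiles.py (which is written in Python) but an illustrative JavaScript version; the function name `collectAddedLines` and the choice to drop blank and repeated lines are assumptions.

```javascript
// Hypothetical sketch: given the text of several `diff -u` outputs, keep the
// added ('+') lines, drop exact repeats, and join the rest into one string.
function collectAddedLines(unifiedDiffs) {
    var seen = Object.create(null);   // set of lines already emitted
    var merged = [];
    unifiedDiffs.forEach(function (diffText) {
        diffText.split('\n').forEach(function (line) {
            // Keep added lines only; skip the "+++ <file>" header of each diff.
            // Assumption: no added HTML line itself begins with "++ ".
            if (line.charAt(0) !== '+' || line.indexOf('+++ ') === 0) {
                return;
            }
            var content = line.slice(1);
            if (content.trim() === '' || content in seen) {
                return;               // ignore blank lines and duplicates
            }
            seen[content] = true;
            merged.push(content);
        });
    });
    return merged.join('\n');
}
```

Deduplicating on whole lines is a simplification: two triggers that legitimately generate the same line of markup would be merged into one occurrence.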

Integration with AChecker

Our objective is to take the union of the merged dynamicDOMElements file and the original source code of the web page, and then pass the whole HTML content to AChecker for validation. However, the dynamic content on its own carries no HTML headers. Validated without them, AChecker would report header-related problems for the dynamically obtained HTML even though the original source code is fine in this respect. In a nutshell, AChecker must not report false positives about HTML headers (<!DOCTYPE> declarations and <html> attributes) when validating the new content.

Now that the dynamicDOMElements file is created, instead of combining the page source and the dynamic content as separate pieces and manually adding DOCTYPE and HTML headers, we perform a selective merge of this file into the main source code. Selective merge here means that all the dynamically generated content is placed just above the closing </body></html> tags, thus preserving the original DOCTYPE and HTML headers of the web page. The resulting file, say mergedSourceContent.html, contains the source code of the original web page selectively merged with the dynamically generated content, with the headers preserved (avoiding false positives).

Result: After the above steps we have a mergedSourceContent.html file that includes the content generated by triggering the DOM-manipulating elements. The file also carries the proper DOCTYPE declaration and html attributes (the same as those of the original web page), so AChecker reports no spurious warnings or problems about them. With mergedSourceContent.html generated, the task of integration breaks down as follows:

On getting the URL from the input form, if the URL contains no errors, an execute.sh script is called which carries out the following sequence of steps:
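The selective merge can be illustrated with a short sketch. The project performs this step with bash commands; the JavaScript function below, its name `selectiveMerge`, and its fallback for pages without a closing body tag are assumptions made for illustration.

```javascript
// Hypothetical sketch: insert the dynamic content just before the last
// closing </body> tag so that the original DOCTYPE and <html> attributes
// stay untouched.
function selectiveMerge(sourceHtml, dynamicHtml) {
    var closingBody = /<\/body\s*>/gi;
    var match;
    var index = -1;
    // Walk through all matches so that the last </body> wins.
    while ((match = closingBody.exec(sourceHtml)) !== null) {
        index = match.index;
    }
    if (index === -1) {
        // Assumption: without a closing body tag, appending is the least
        // intrusive fallback.
        return sourceHtml + '\n' + dynamicHtml;
    }
    return sourceHtml.slice(0, index) + dynamicHtml + '\n' + sourceHtml.slice(index);
}
```

Matching the tag case-insensitively keeps the sketch usable on pages that write `</BODY>`.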

  • The CasperJS script is called with the URL as a parameter, and the dynamic content fetched after each trigger is stored in the HTMLSourceFiles folder under the filenames data0.html, data1.html, data2.html and so on.
  • Thereafter the Python script mergeFiles.py is called, and dynamicDOMElements.html gets generated with all the dynamic content merged into one file.
  • After this, a selective merge of the dynamic content and the static source code of the web page is done (using some bash commands). This produces the mergedSourceContent.html file, which contains the web page source code merged with the dynamic content.
  • Replacing the $validate_content variable: the content to be validated is then read from the generated file, instead of being taken directly from the static source code on the web, and is loaded into the $validate_content variable.
    • With the help of some switches, the contents of the mergedSourceContent.html file are also used when the "Show Source" option is enabled in the options menu while validating the URL (i.e. the source code shown is that of the web page together with the dynamic content).


Testing

The implementation has been tested on a sample site built for reference and debugging purposes, hosted here. The site is simple but contains the minimal features required for triggering: 4 DOM-manipulating events which generate new HTML content. The codebase has been made robust enough to handle such elements on other sites. Once complete, it was tested against some other sites, for which it reports additional known, likely and potential problems according to the HTML content they generate. The results have been discussed and found satisfactory.

Screenshots

SampleHTML validation comparison.. Full image here

Comparison sample HTML

--

Google.com validation comparison.. Full image here

Comparison Google.com


Further work and ToDos
  • Currently, the integrated dynamic validation does not offer a separate option for choosing whether to validate dynamic content. Since dynamic validation takes considerably more time, and since the user should be told where exactly a reported problem lies (static source or dynamic content), the dynamic results should be shown in a separate section. This was planned as a todo for the project and discussed at length, but it was not accomplished due to time constraints.

Also, some problems that were noticed recently are:
- Say the site under test has 10 DOM-manipulating elements, and one of them is an input type="submit" (i.e. a form). The codebase currently assumes that submitting would not navigate to another web page and that the site would report warnings about blank fields in place. If the page does navigate elsewhere, say on the 6th trigger, the remaining triggers cannot run successfully, because the web page is not reloaded after every trigger (too costly) and is assumed not to change. Content generated by those remaining triggers is therefore missed.
- Solution: The verbose log of PhantomJS reports something like this for every navigation:
```
[phantom] Navigation requested: url=<some-url-here>, type=Other,
willNavigate=true, isMainFrame=true
```  
The proposed solution is to place a check on every requested navigation: if the URL being navigated to is the same as the given input URL, the navigation is allowed; otherwise it is blocked. This has not been implemented yet, but it appears feasible.
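The URL comparison at the heart of this check might look like the sketch below. The function names are hypothetical, and ignoring the fragment and a trailing slash when comparing is an assumption of this example, not something the project specifies.

```javascript
// Hypothetical sketch of the proposed check: a requested navigation is
// allowed only when it targets the page given as input.
function normalizeUrl(url) {
    // Assumption: "#fragment" changes and a trailing slash do not count as
    // leaving the page.
    return String(url).split('#')[0].replace(/\/$/, '');
}

function shouldAllowNavigation(requestedUrl, inputUrl) {
    return normalizeUrl(requestedUrl) === normalizeUrl(inputUrl);
}
```

How the result is wired in is left open here. PhantomJS exposes an `onNavigationRequested` callback and a `navigationLocked` property on the page object, which may be a suitable place to apply such a check.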

MISCELLANEOUS

For doing standalone work, a local github repo is maintained. Link-to-local-repo-used

Since the implementation discussed above requires reading, writing and modifying files through the Apache-hosted server, it is necessary to give the required permissions on the 'AChecker/checker' folder, i.e. to give the Apache server ownership of the files (the 'apache' user in Fedora and alike, 'www-data' in Ubuntu and similar).

Following commands should do the work:

In Fedora:

sudo chown -R apache:apache <folder-name>  

In Ubuntu:

sudo chown -R www-data:www-data <folder-name>


Contact Me

For any queries or just to get in touch:

Tejas Shah
Email ID: [email protected]
github : tejasshah93
IRC nick: jash4/carver404

