
web-content-analysis

Web scraper for HTML and sitemap.xml content analysis.

This Node.js web app is a small SEO tool with two features, which can be tried on this free-dynoed Heroku app (it may be subject to downtime if overused):

HTML analysis

This tab parses the content of a given URL (http://www.lucsorel.com/ for example) and displays the words in decreasing order of importance, according to weights assigned to different HTML tags: being displayed in a h1 tag brings more weight to a word than being displayed in a h2 tag, and so on. The weights are (rather arbitrarily) defined on the front-end side.
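The weighting can be sketched as follows; the tag weights and the `scoreWords` function are illustrative assumptions, not the app's actual front-end code:

```javascript
// Hypothetical tag weights, for illustration only
// (the real values are defined in the app's front end).
const TAG_WEIGHTS = { h1: 10, h2: 7, a: 6, b: 1, p: 1 };

// tagged: array of { tag, text } pairs extracted from the parsed HTML
function scoreWords(tagged) {
  const scores = {};
  for (const { tag, text } of tagged) {
    const weight = TAG_WEIGHTS[tag] || 0;
    for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      scores[word] = (scores[word] || 0) + weight;
    }
  }
  // sort words by decreasing total weight
  return Object.entries(scores).sort((a, b) => b[1] - a[1]);
}
```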

In the result of the analysis of the http://www.lucsorel.com/ page, you can interpret:

 virtual: 33
a: 3  h2: 2  b: 1

as:

  • a total weight of 33 for the word virtual
  • which appears 3 times in an <a> tag, twice in a <h2> tag and once in a <b> tag
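The total weight is the sum, over the tags, of occurrence count times tag weight. With weights chosen here purely for illustration (the actual values live in the front-end code), the arithmetic looks like:

```javascript
// Hypothetical per-tag weights, chosen for illustration only
const weights = { h2: 7, a: 6, b: 1 };
// occurrence counts from the example above
const counts = { a: 3, h2: 2, b: 1 };

const total = Object.entries(counts)
  .reduce((sum, [tag, count]) => sum + count * weights[tag], 0);
// 3 * 6 + 2 * 7 + 1 * 1 = 33
```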

Sitemap.xml analysis

A sitemap is an XML file, often located at the root of a website alongside the robots.txt file, that lists the URLs of a website to ease the work of indexing engines. Its format is explained on sitemaps.org. Each URL can optionally be characterized with:

  • a priority describing the importance of the page within the site
  • an update frequency letting indexing engines know how often the content changes
  • a last modification date
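A minimal sitemap entry carrying all three optional fields might look like this (the URL and values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2016-05-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```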

For example, www.sitemaps.org/sitemap.xml only describes the URLs and their last modification dates (at the time of writing).

The sitemap analysis is done in two steps (see the example of the www.sitemaps.org/sitemap.xml analysis):

  • the first step lists the URLs along with their optional characteristics and highlights duplicated URLs
  • on the result screen, you can then select URLs to check their existence, HTML title, HTTP status and possible redirection
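The first step above can be sketched as a small function that extracts the `<loc>` entries from the sitemap and flags duplicates. This is a naive regex-based parse for illustration; the actual app may well use a proper XML parser:

```javascript
// Extract the URLs of a sitemap's <loc> elements, flagging duplicates.
// Regex parsing is a simplification; a real XML parser is more robust.
function listUrls(sitemapXml) {
  const urls = [];
  const seen = new Set();
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    const url = match[1];
    urls.push({ url, duplicated: seen.has(url) });
    seen.add(url);
  }
  return urls;
}
```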

Technologies involved