How to use blocks for data extraction

Lu edited this page May 24, 2023 · 1 revision

Information on the syntax, possibilities, and limitations of extraction blocks, plus a working sample taken from one of the available services.


What are blocks?

Blocks identify the sections of the HTML that are of interest for data extraction. They follow a simple YAML syntax, which you can see below. The program iterates over each block, extracts the tag and the attributes, and feeds them into a BeautifulSoup query to extract the section of interest from the service's response content.

Block structure

blocks:
  - tag:
  # or
  - tag:
    attr1: value
  # or
  - tag:
    attr1: value
    attr2: value

The tag placeholder stands for the name of the HTML tag under which the targeted information is located. It is the root element of what BeautifulSoup will extract. You can provide no attributes, one attribute, or several attributes.

What happens internally?

Let's take the block structure from above and follow its way through the code.

blocks:
  - tag:
  # or
  - tag:
    attr1: value
  # or
  - tag:
    attr1: value
    attr2: value

The YAML structure is parsed into a Python dictionary by PyYAML's yaml.safe_load(stream). Shown as JSON, the result of the block section looks like this:

{
  "blocks": [
    {"tag": null},
    {"tag": null, "attr1": "value"},
    {"tag": null, "attr1": "value", "attr2": "value"}
  ]
}

The first item in each dict is the tag; all following items are the attributes. The class value does not have to be complete. If parts of the class are obfuscated, use a clear and unique part of it. Make sure it really is unique and does not also match some unrelated section, which would corrupt the extracted data.
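The tag/attribute split described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: split_block is a hypothetical helper name, and the sketch assumes each block arrives as a dict whose first key is the tag name.

```python
def split_block(block):
    """Split one parsed block into (tag, attrs).

    Hypothetical helper for illustration. It relies on dicts keeping
    insertion order (Python 3.7+): the first key is the tag, every
    remaining item is an attribute.
    """
    keys = list(block)
    tag = keys[0]
    attrs = {key: block[key] for key in keys[1:]}
    return tag, attrs


# A block without attributes yields an empty attribute dict.
print(split_block({"header": None}))
# A block with one attribute keeps it as a key/value pair.
print(split_block({"section": None, "id": "business"}))
```

The value stored under the tag key (null in the YAML) is ignored; only the key name matters.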

Each of those blocks is fed into a bs4 query to fetch the wanted section. With concrete tag names and attribute values filled in, such calls look like this:

soup.find("div", attrs={"class": "partial"})
soup.find("section", attrs={"id": "section", "class": "asd"})
soup.find("header")
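How the program might get from the parsed blocks to calls like these can be sketched as a simple loop. This is an assumption about the flow, not the real implementation: extract_sections is a hypothetical name, and soup is expected to be an already built BeautifulSoup object (anything with a compatible find method works).

```python
def extract_sections(soup, blocks):
    """Run one find() query per block and collect the results.

    Hypothetical sketch: `soup` is assumed to be a BeautifulSoup object,
    `blocks` the list of dicts parsed from the YAML configuration.
    """
    sections = []
    for block in blocks:
        keys = list(block)
        tag = keys[0]  # first key is the tag name
        attrs = {key: block[key] for key in keys[1:]}  # the rest are attributes
        # find() returns the first matching element, or None if nothing matches
        sections.append(soup.find(tag, attrs=attrs))
    return sections


# Usage, assuming bs4 is installed and `html` holds the response content:
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, "html.parser")
#   sections = extract_sections(soup, blocks)
```

Because find() returns None when nothing matches, a caller would probably want to check each result before using it.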

Sample block taken from a working service

houzz:
  # ...
  extract:
    blocks:
      - header:
        data-container: Basic Pro Info
      - section:
        id: business
  # ...
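To see what the program receives from this sample, the visible part of the configuration can be loaded with PyYAML. This is a minimal sketch: the elided parts stay elided as comments, and the lookup path houzz, extract, blocks is simply read off the structure shown above.

```python
import yaml  # PyYAML

# Only the part of the service definition shown in the sample.
CONFIG = """\
houzz:
  # ...
  extract:
    blocks:
      - header:
        data-container: Basic Pro Info
      - section:
        id: business
  # ...
"""

config = yaml.safe_load(CONFIG)
blocks = config["houzz"]["extract"]["blocks"]

# Each block is a dict: the tag name comes first with a null value,
# followed by the attributes.
for block in blocks:
    print(block)
```

Here the tag names (header, section) are the first key of each dict, and the attributes data-container and id follow them.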