How to use blocks for data extraction
Information on the syntax, possibilities, and limitations of extraction blocks. There's also a working sample taken from one available service.
Blocks identify the sections of the HTML that are of interest for data extraction. They follow a simple YAML syntax, which you can see below. The program iterates over each block, extracts the tag and the attributes, and feeds them into a BeautifulSoup query to extract the section of interest from the service's response content.
blocks:
  - tag:
  # or
  - tag:
    attr1: value
  # or
  - tag:
    attr1: value
    attr2: value
Here, tag stands for the name of the HTML tag under which the targeted information is located. It is the root element of what BeautifulSoup will extract. You can provide no attributes, one attribute, or several.
Let's take the block from above and follow its way through the code.
blocks:
  - tag:
  # or
  - tag:
    attr1: value
  # or
  - tag:
    attr1: value
    attr2: value
The YAML structure is parsed into a Python dictionary by PyYAML's yaml.safe_load(stream). Written as JSON, the result for the blocks section looks like this:
{
  "blocks": [
    {"tag": null},
    {"tag": null, "attr1": "value"},
    {"tag": null, "attr1": "value", "attr2": "value"}
  ]
}
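To look at this parsing step in isolation, here is a minimal sketch that loads the same block definition from a string with PyYAML. The names CONFIG and config are illustrative and not taken from the project's code, which reads from a stream instead.

```python
import yaml  # provided by the PyYAML package

# Illustrative config string; the indentation is significant in YAML.
CONFIG = """
blocks:
  - tag:
  - tag:
    attr1: value
  - tag:
    attr1: value
    attr2: value
"""

config = yaml.safe_load(CONFIG)

# Each list item becomes a dict; a key without a value is parsed as None.
for block in config["blocks"]:
    print(block)
```

A key without a value, such as `tag:`, comes back as None in Python, which is what the JSON null above represents.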
The first item in each dict is the tag; all following items are the attributes. A class value does not have to be the complete class attribute. If parts of the class are obfuscated, use a clear and unique class name instead. Check that it really is unique and does not also match some other, useless section that would corrupt the data.
Each block is fed into a bs4 query to fetch the wanted section. With concrete tag names and attributes filled in, those calls look like this:
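The "first item is the tag, the rest are attributes" rule can be sketched as a small helper. The function name split_block is hypothetical and only illustrates the idea; the project's own code may do this differently.

```python
from typing import Any


def split_block(block: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Split a parsed block into (tag name, attribute dict).

    Relies on dicts preserving insertion order (Python 3.7+), so the
    first key is the tag and every later key is an attribute.
    """
    if not block:
        raise ValueError("a block needs at least a tag")
    keys = iter(block)
    tag = next(keys)  # first key: the HTML tag name
    attrs = {key: block[key] for key in keys}  # remaining keys: attributes
    return tag, attrs


tag, attrs = split_block({"section": None, "id": "business"})
print(tag, attrs)  # section {'id': 'business'}
```

Because the split depends on key order, the tag has to be written first in each YAML list item.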
soup.find("div", attrs={"class": "partial"})
soup.find("section", attrs={"id": "section", "class": "asd"})
soup.find("header")
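The following end-to-end sketch shows how such calls could be built from parsed blocks. The HTML snippet, the blocks list, and the found dict are made up for illustration, and the loop is an assumption about how the blocks are consumed rather than the project's actual implementation.

```python
from bs4 import BeautifulSoup

# Made-up response content for illustration.
HTML = """
<header data-container="Basic Pro Info"><h1>Example Pro</h1></header>
<section id="business" class="asd x9f2"><p>Details</p></section>
"""

# Blocks as they would look after YAML parsing.
blocks = [
    {"header": None, "data-container": "Basic Pro Info"},
    {"section": None, "id": "business", "class": "asd"},
]

soup = BeautifulSoup(HTML, "html.parser")
found = {}

for block in blocks:
    tag, *attr_names = block  # first key is the tag, the rest are attributes
    attrs = {name: block[name] for name in attr_names}
    element = soup.find(tag, attrs=attrs)
    found[tag] = element.get_text(strip=True) if element else None

print(found)
```

The second block matches even though the element carries two classes, because BeautifulSoup compares a class value against each individual class name of the element. That is why one clear class name is enough when the others are obfuscated.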
Here is the working sample, taken from the houzz service:
houzz:
  # ...
  extract:
    blocks:
      - header:
        data-container: Basic Pro Info
      - section:
        id: business
  # ...
Thanks for reading all this stuff. I'm happy to see and merge your contributions, be it a new feature, a new service, or simply improvements to the code.