Framework for Automated Measurements of Online Behavioral Advertising

fukuda-lab/OpenOBA


OpenOBA

About OBA and the Framework

📌 The **OpenOBA** Framework (*Framework for Online Behavioral Advertising Measurement*) is a Python web privacy experimentation tool that measures and analyses the occurrence of Online Behavioral Advertising resulting from specific browsing behavior set up by the user or researcher.

Based on Bannerclick's version of the OpenWPM framework, OpenOBA provides a flexible and easy-to-set-up environment where highly configurable experiments involving web crawlers and ad capture can be created and run.

After each experiment's successful run, its configuration parameters, browser profile, and browsing data are saved. This allows the user to load any created experiment to keep feeding the browser with the specified behavior, or to analyze the data collected until that point to measure its OBA occurrence.

Figures explaining its usage can be seen in this folder.

What does it mean to measure OBA?

👉 Measuring OBA means quantifying a user's exposure to online advertisements targeted specifically at them, based on their past web browsing behavior, as a result of *web tracking*.

For a user to be shown targeted ads, their activity and interests must have been profiled and narrowed down to specific categories in the browsers they have used.

To quantify this phenomenon, we need all of the ads that were shown to the user, together with information about their content/category, so that we can count how many of them were related to the user's profile category.
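As a rough illustration of the counting described above, the following sketch computes the share of captured ads whose category matches the trained profile category. The function name `oba_share` and the sample data are hypothetical and not part of OpenOBA; the framework's own metrics live in the ExperimentMetrics class.

```python
from collections import Counter


def oba_share(ad_categories, target_category):
    """Return the fraction of ads whose category equals target_category."""
    if not ad_categories:
        return 0.0
    counts = Counter(ad_categories)
    return counts[target_category] / len(ad_categories)


# Hypothetical categories for five ads captured during control visits.
ads = [
    "Style & Fashion",
    "Travel",
    "Style & Fashion",
    "Automotive",
    "Style & Fashion",
]
share = oba_share(ads, "Style & Fashion")
print(f"{share:.0%} of captured ads match the trained category")
```

A real analysis would also compare this share against ads collected in clean (untrained) browsers, since some ads in the category appear regardless of profiling.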

Installation

Prerequisites

OpenOBA is built on top of Bannerclick's OpenWPM framework ver 0.21.0. It uses the following versions of its parts as reference:

The first prerequisite is mamba, which will be used to install the openwpm conda environment. As stated in the mamba installation guide, we can use miniforge. To install it, run:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

Cloning the repository

git clone https://github.com/fukuda-lab/OpenOBA.git
cd OpenOBA

Install OpenWPM

We will use the same install script as OpenWPM. The only change is that we specifically want Firefox version 108.0.2. If it doesn't work, try deleting the --force flag in the ./install.sh file.

./install.sh

Install additional dependencies

If the last step was successful, we can now just install the missing dependencies. To do this, we have to activate the conda openwpm environment by running:

conda activate openwpm

Run demos

If everything is working correctly, we should be able to run the demo files from the demos folder (with the openwpm env activated). In summary, these demo files show a very basic use of the main classes of the framework: OBAMeasurementExperiment, DataProcesser, and ExperimentMetrics (which was merged with the class previously called OBAAnalysis).

The demos are meant to be run in order: 1, 2, 3, and 4.

  • On macOS, remember to set OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before any python command:

    OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python -m demos.1_create_experiment_demo

See their code to follow/change the directories for data, results, and plots.

1. Create experiment demo

Demo on how to create and run a fresh experiment instance using the OBAMeasurementExperiment class, selecting the experiment instance name and cookie banner action, and setting its training pages. Note that control_visits_rate in the start() method is a percentage (from 0 to 100) that dictates the proportion of control visits.

python -m demos.1_create_experiment_demo
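To make the meaning of a 0 to 100 percentage concrete, here is a minimal sketch of how such a rate could split visits between control and training. This is only an illustration under assumed semantics; `plan_visits` is a hypothetical helper and does not reproduce how start() actually schedules visits.

```python
import random


def plan_visits(n_visits, control_visits_rate, seed=0):
    """Label each visit 'control' or 'training' given a 0-100 percentage."""
    if not 0 <= control_visits_rate <= 100:
        raise ValueError("control_visits_rate must be between 0 and 100")
    rng = random.Random(seed)  # seeded so the plan is reproducible
    return [
        "control" if rng.random() * 100 < control_visits_rate else "training"
        for _ in range(n_visits)
    ]


plan = plan_visits(1000, control_visits_rate=20)
print(plan.count("control") / len(plan))  # close to 0.20 on average
```

With a rate of 0 every visit is a training visit, and with a rate of 100 every visit is a control visit.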

2. Load experiment demo

Demo on how to load an experiment instance previously created with the OBAMeasurementExperiment class, loading its saved browser profile to resume the same experiment instance.

python -m demos.2_load_experiment_demo

3. Data processing demo

Demo on how to filter, process, and categorize all the AdURLs captured during the control visits of an experiment instance using the DataProcesser class.

python -m demos.3_data_processer_demo

4. Ads Analysis example demo

Demo on what an ad analysis script could look like, using the ExperimentMetrics class, which includes some example methods to query, tabulate, and plot an experiment's ad data.

python -m demos.4_ads_analysis_demo

Ad Extraction command demo

To crawl and extract ads from any website without going through the whole process of an OpenOBA experiment, you can use ExtractAdsCommand like any other OpenWPM command. We have published a demo on how to use this command, similar to OpenWPM's demo.

python -m openoba_adscraper_demo

YouTube ads extraction demo

Ads from YouTube videos can be extracted if the browser's autoplay is enabled. Otherwise, the browser display mode has to be "native" for the crawl, so that the user can manually play each video before the ExtractAdsCommand is executed. This does not work flawlessly, so expect to encounter difficulties.

python -m demos.youtube_crawler_demo

Errors

OpenWPM does not output all of the errors directly. For more insights when encountering errors, see the geckodriver.log file created after running OpenWPM in the root directory of the repository, and openwpm.log in the experiment directory folder (inside the data_dir folder) created after creating an experiment.
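A small helper like the one below can make it quicker to check the end of both log files after a failed run. It is only a convenience sketch: the two paths in the loop are placeholders, so adjust them to your repository root and to the experiment directory inside your data_dir.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file, or [] if it does not exist."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the last n lines while streaming the file
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Placeholder locations (assumptions), not paths guaranteed by the framework.
for log in ("geckodriver.log", "datadir/my_experiment/openwpm.log"):
    print(f"--- {log}")
    print("\n".join(tail(log)) or "(not found or empty)")
```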

Paper Experiment

For the input files used in our paper, see the oba/input_run_files folder.

These files show how to set up all three OBA Run instances performed in the experiment, following the same 0, 1, 2, 3 and 4 steps. See the paper for the reasoning behind splitting the experiment into instances.

Other files

Other scripts can be found in the oba_analysis/data_analysis and oba/third_party_analysis folders, with much more code that processes data on ads, cookies, and HTTP requests into CSVs, plots, and markdown files.

This code can be untidy and hard to understand, because it mixes code referring to the OBA Runs and the Random Runs (also called Control Runs throughout the code). It was specific to the experiments performed in the paper, but it could serve as guidance on how to use the methods in the ExperimentMetrics class to analyze OBA, cookies, and HTTP requests.

Documentation

  • Categorizer (private class, internal use only): Given valid credentials, it uses the WebShrinker API to categorize URLs with the IAB taxonomy or WebShrinker's own taxonomy. Used by TrainingPagesHandler and DataProcesser.
  • TrainingPagesHandler (public class if a user wants to access its functionalities, but intended to be used only by the OBACrawler class): This class has several functionalities. In summary, it handles:
    1. fetching training pages from Tranco and saving them in a file
    2. loading training pages that were already fetched on previous dates
    3. categorizing any given set of training pages with the Categorizer (either loaded from Tranco or from a custom training pages list provided by the user) and saving the categorized training pages in an SQLite database
    4. given already categorized pages in an SQLite database, returning a list of all the training pages that belong to an input category
    5. further methods related to the presence of cookie banners on training pages
  • OBAMeasurementExperiment (public class, directly used by the users): This is the entry point of the framework for running the crawls over the pages. This class handles the setup of the environment according to the argument values, including the calls to the TrainingPagesHandler. Its functions include:
    1. init, the setup (initializer), which can either create a new experiment or load an old one, load pages either from the Tranco top list or from a custom list, and optionally categorize those lists, performing the corresponding validations.
    2. Filtering and setting the training pages by category, in case they were categorized beforehand.
    3. Running the actual crawl for the experiment, saving the ad URLs found in the control_sites, and adding all the necessary data about the site visits so that they can be analyzed later. It also handles saving the browser profiles so that they can be loaded when resuming a previously started experiment.
  • DataProcesser (public to the user): This is the other entry point of the framework. It should only receive the experiment_name. With that name it can connect to the SQLite database with all the crawling data (site visits, browser ids, etc.) and to the {experiment_name}_config.json. It is in charge of resolving the ad URLs after they have been extracted during an experiment run.
  • ExperimentMetrics (public): Used to get several insights about the experiment after its ads have been processed. Several other scripts in third_party_analysis and oba_analysis are used to generate tables and analyses for the resources of an experiment.
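The category lookup described for TrainingPagesHandler (returning all training pages that belong to an input category from an SQLite database) can be sketched as follows. The table name, column names, and `pages_in_category` function are assumptions made for this example; the actual schema used by OpenOBA may differ.

```python
import sqlite3


def pages_in_category(conn, category):
    """Return the URLs of all training pages stored under the given category."""
    rows = conn.execute(
        "SELECT url FROM training_pages WHERE category = ? ORDER BY url",
        (category,),  # parameterized query: the category is never interpolated
    )
    return [url for (url,) in rows]


# Hypothetical schema and data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_pages (url TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO training_pages VALUES (?, ?)",
    [
        ("shop.example", "Style & Fashion"),
        ("news.example", "News"),
        ("boots.example", "Style & Fashion"),
    ],
)
print(pages_in_category(conn, "Style & Fashion"))
```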

Experiment run

This is an explanation of some of the parameters for an experiment run using OpenOBA. Adjust the imports/runs according to where the scripts are being run/called.

Description

👉 We want to measure the impact of users' cookie banner choices on their exposure to OBA.

For this, we would need to run three different experiment instances, with the same parameters, but with a different cookie_banner_action each.

In this tutorial, we will show how to run one of those experiment instances to show some of the OpenOBA features.
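One way to keep the three instances identical except for the cookie banner action is to derive each configuration from a shared base dictionary, as in the sketch below. Only action 1 (accept all cookies) is documented in this README; the values 0 and 2 and their labels are assumptions for illustration, and `build_instance_configs` is a hypothetical helper.

```python
import copy

# Parameters shared by every instance of the experiment.
BASE_CONFIG = {
    "fresh_experiment": True,
    "tranco_pages_params": {"updated": True, "size": 100000},
}

# Only 1 (accept all) is documented here; 0 and 2 are assumed values.
COOKIE_BANNER_ACTIONS = {"no_action": 0, "accept": 1, "reject": 2}


def build_instance_configs(base, actions, prefix="example_clothing"):
    """Return one config per action, differing only in name and action."""
    configs = []
    for label, action in actions.items():
        config = copy.deepcopy(base)  # avoid sharing nested dicts between instances
        config["experiment_name"] = f"{prefix}_{label}_cookie_banner_experiment"
        config["cookie_banner_action"] = action
        configs.append(config)
    return configs


for config in build_instance_configs(BASE_CONFIG, COOKIE_BANNER_ACTIONS):
    print(config["experiment_name"], config["cookie_banner_action"])
```

Each resulting dictionary could then be completed with any remaining arguments (for example the WebShrinker credentials) and passed to OBAMeasurementExperiment.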

I. Pre-Crawling phase

  1. Create a new experiment

    First, create a dictionary with the corresponding arguments.

    The experiment_name and fresh_experiment parameters are required; the rest depend on the experiment. In this case:

    • cookie_banner_action of 1: accept all cookies when asked while training
    • tranco_pages_params: training pages will be retrieved from an updated list of the most popular Tranco sites, with a size of 100000
    • We need valid webshrinker_credentials because we will need to categorize the pages. These must be provided by the user.

    oba_cookie_banner_experiment_with_categorization = {
        "experiment_name": "example_clothing_accept_cookie_banner_experiment",
        "fresh_experiment": True,
        "cookie_banner_action": 1,
        "tranco_pages_params": {
            "updated": True,
            "size": 100000,
        },
        # Real values should be provided by the user
        "webshrinker_credentials": {"api_key": API_KEY, "secret_key": SECRET_KEY},
    }

Create the experiment (this will take some time because the pages need to be categorized)

from oba_crawler import OBAMeasurementExperiment

experiment = OBAMeasurementExperiment(**oba_cookie_banner_experiment_with_categorization)

  2. Set the training pages for the experiment

    👉 **Loading an experiment**

    If we first just created the experiment, and now in another script or run we want to load it, we can just do:

    experiment = OBAMeasurementExperiment(experiment_name="example_clothing_accept_cookie_banner_experiment", fresh_experiment=False)

    Now, since we are using the Tranco pages, we need a category to set our training pages. We will pick clothing, which corresponds to the "Style & Fashion" category, since we know its pages have cookie banners (this has to be done every time we want to run an experiment):

    experiment.set_training_pages_by_category(category="Style & Fashion")

    Now we can start the crawling.

II. Crawling (training)

With an experiment created, loaded into an instance of OBAMeasurementExperiment, and its categories set (when using Tranco), we can start a crawling run for as long as we want.

experiment.start(hours=3, minutes=30, browser_mode="headless")

This will always first run clean visits over the control pages (in clean browsers), so that we gather ads that we know are not due to OBA. Then the training + control process will start.
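The two phases and the time budget can be pictured with the following sketch. It is not the implementation of start(): `run_crawl`, its callbacks, and the injectable clock are hypothetical names chosen to show the ordering (clean control visits first, then repeated training + control rounds until the requested duration has elapsed).

```python
import time
from datetime import timedelta


def run_crawl(hours, minutes, clean_visit, training_round, clock=time.monotonic):
    """Run one clean pass, then training rounds until the time budget is spent."""
    deadline = clock() + timedelta(hours=hours, minutes=minutes).total_seconds()
    clean_visit()  # phase 1: baseline ads from an unprofiled browser
    rounds = 0
    while clock() < deadline:  # phase 2: training + control until time runs out
        training_round()
        rounds += 1
    return rounds


# Demonstration with a fake clock that advances 60 "seconds" per reading,
# so the example finishes instantly instead of waiting for real time.
events = []
fake_ticks = iter(range(0, 100_000, 60))
rounds = run_crawl(
    0,
    5,
    clean_visit=lambda: events.append("clean"),
    training_round=lambda: events.append("training+control"),
    clock=lambda: next(fake_ticks),
)
print(rounds, events[0])
```

Passing the clock as a parameter is just a testing convenience; with the default `time.monotonic` the loop would run for the real duration.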

III. Data Processing

Now we are ready to use the DataProcesser to get the landing pages for the ads as shown in the demo files 3 and 4.
