A Python package for flexible subset selection for data visualization along with the data, figures, and examples.

uwgraphics/flexibleSubsetSelection

Flexible Subset Selection for Data Visualization

Abstract

Subset selection is crucial in data visualization for decluttering, summarizing, and emphasizing key insights. This project proposes a unified approach using multi-criterion optimization for flexible subset selection strategies. By defining objective functions, we tailor subsets to meet specific visualization needs, leveraging both traditional and innovative selection methods. General-purpose solvers facilitate rapid prototyping, demonstrated through realistic examples of scatterplot decluttering, dataset summarization, and exemplar highlighting.

Flexible Subset Selection Python Package

Usage

Once the repository is downloaded, installing with pip inside a managed environment is recommended. Dependencies can be installed from the requirements.txt file:

pip install -r requirements.txt

The flexibleSubsetSelection package can be installed locally:

pip install .

Alternatively, if you want to modify the source code of the package, you can install it in editable mode:

pip install -e .

Once installed, the package can be imported, such as by:

import flexibleSubsetSelection as fss

Sets

Overview

The sets.py module includes classes for datasets and their subsets. Dataset objects can be initialized from an existing source, such as a pandas DataFrame, or created by random dataset generation. The dataset can be processed prior to subsetting via scaling, binning, one-hot encoding, and other custom preprocessing functions. Subsets are created from a Dataset and an indicator vector that marks which items of the dataset to include. Datasets and Subsets can be saved and reloaded.

Dataset Class

  • Initialization: Initialize with tabular data or generate random data using methods in generate.py.
  • Preprocessing: Apply custom preprocessing functions.
  • Scaling: Scale dataset values within specified intervals.
  • Binning: Convert continuous data into discrete intervals.
  • One-Hot Encoding: Encode categorical variables into binary vectors.
  • Saving: Save datasets as pickled or csv files.
  • Loading: Load datasets from pickled or csv files.

Example Usage

dataset = fss.Dataset(randTypes="multimodal", size=(500, 10), seed=123)
dataset.save("../data/myNewDataset", fileType="csv")
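As a rough illustration of what the scaling, binning, and one-hot encoding steps listed above do to a column of values, here is a minimal standard-library sketch. The helper names (scale, bin_values, one_hot) are hypothetical and are not the package's API; the Dataset methods may differ in arguments and behavior.

```python
# Illustrative stand-ins for scaling, binning, and one-hot encoding.
# These helper names are hypothetical and are not the package's API.

def scale(values, low=0.0, high=1.0):
    """Linearly rescale values into the interval [low, high]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant columns
    return [low + (v - vmin) / span * (high - low) for v in values]

def bin_values(values, n_bins):
    """Map each value to the index of its equal-width bin (0 .. n_bins - 1)."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins or 1.0  # fall back to 1.0 for constant columns
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

def one_hot(labels):
    """Encode categorical labels as binary vectors, one position per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

column = [2.0, 4.0, 6.0, 10.0]
scaled = scale(column)                  # values mapped into [0, 1]
binned = bin_values(column, n_bins=4)   # bin index per value
encoded = one_hot(["a", "b", "a", "c"])  # one binary vector per label
```

Equal-width bins are only one possible binning strategy; it is used here because it is the simplest to state.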

Subset Class

The Subset class represents subsets of datasets specified by indicator vectors.

  • Initialization: Initialize a subset from a Dataset based on an indicator vector z.
  • Saving: Save subsets as pickled or csv files.
  • Loading: Load subsets from pickled or csv files.

Example Usage

subset = fss.Subset(dataset, z)
subset.save("../data/myNewSubset", fileType="csv")
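The example above assumes an indicator vector z already exists. As a sketch of the idea (not of the Subset class itself), an indicator vector has one entry per dataset row, and an entry of 1 means the row is part of the subset. The data values and the select helper below are made up for illustration.

```python
# Rows of a small tabular dataset (hypothetical values).
rows = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.8, 0.5],
    [0.3, 0.7],
]

# Indicator vector: z[i] == 1 means row i belongs to the subset.
z = [1, 0, 0, 1]

def select(rows, z):
    """Return the rows whose indicator entry is 1 (hypothetical helper)."""
    if len(rows) != len(z):
        raise ValueError("indicator vector must have one entry per row")
    return [row for row, keep in zip(rows, z) if keep]

subset_rows = select(rows, z)
subset_size = sum(z)  # number of selected rows
```

In practice z is usually the output of a solver rather than something written by hand.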

Generate

The generate.py module provides functions for generating random datasets from various distribution types using numpy, scipy, and sklearn. Distribution types include uniform, binary, categorical, normal, multimodal, skew, and blobs. The generation type is specified through the randTypes parameter of Dataset, either as a single string name applied to the whole dataset or as a list of string names, one per column of the new Dataset.

Example Usage

skewDataset = fss.Dataset(randTypes="skew", size=(1000, 2))
variedDataset = fss.Dataset(randTypes=["skew", "normal", "multimodal"], size=(1000, 3))
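To make the per-column convention concrete, the following sketch draws each column from a named distribution using only Python's random module. It is an assumption-laden stand-in: the GENERATORS table, the generate function, and the three distributions shown are hypothetical and do not reproduce generate.py.

```python
import random

# Hypothetical per-column generators keyed by distribution name.
GENERATORS = {
    "uniform": lambda rng: rng.uniform(0.0, 1.0),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "binary": lambda rng: rng.randint(0, 1),
}

def generate(rand_types, size, seed=None):
    """Build a size[0] x size[1] table, drawing each column from its named distribution."""
    n_rows, n_cols = size
    if isinstance(rand_types, str):
        # A single name applies to every column.
        rand_types = [rand_types] * n_cols
    if len(rand_types) != n_cols:
        raise ValueError("need one distribution name per column")
    rng = random.Random(seed)  # seeded generator for reproducible output
    return [[GENERATORS[name](rng) for name in rand_types] for _ in range(n_rows)]

table = generate(["uniform", "normal", "binary"], size=(5, 3), seed=123)
same = generate(["uniform", "normal", "binary"], size=(5, 3), seed=123)
```

Passing the same seed twice yields identical tables, which is the property the seed argument in the earlier Dataset example presumably provides as well.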

Loss Functions and Metrics

The loss.py module provides classes and functions for defining multi-criterion and single-criterion loss functions, as well as various metric functions for evaluating datasets and subsets.

MultiCriterion Class

The MultiCriterion class defines a multi-criterion loss function using a set of objective functions, parameters, and optional weights. It allows combining multiple objectives into a single loss function for subset selection.

  • Initialization: Initialize with a list of objective functions, parameters for each objective, and optional weights.
  • Calculate: Compute the overall loss function by evaluating each objective function with its corresponding parameters and combining them with weights.

Example Usage

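import numpy as np

# One parameter dictionary per objective, in the same order as objectives
objectives = [fss.loss.earthMoversDistance, fss.loss.distinctiveness]
parameters = [{"dataset": dataset.dataArray},
              {"solveArray": "distances", "selectBy": "matrix"}]
weights = np.array([1000, 0.1])

# solveMethod is an existing fss.Solver instance (see Solvers)
solveMethod.loss = fss.loss.MultiCriterion(objectives, parameters, weights=weights)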
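One common way to combine objectives with weights is a weighted sum. The sketch below assumes that form; the source only states that objectives are combined with weights, so MultiCriterion may combine them differently. The two toy objectives and the multi_criterion function are hypothetical.

```python
def mean_gap(subset, full):
    """Objective 1: absolute difference between subset mean and full-data mean."""
    return abs(sum(subset) / len(subset) - sum(full) / len(full))

def negative_spread(subset):
    """Objective 2: negative range of the subset, so wider subsets score lower."""
    return -(max(subset) - min(subset))

def multi_criterion(subset, objectives, parameters, weights):
    """Weighted sum of objectives, each called with its own keyword parameters."""
    return sum(
        w * objective(subset, **params)
        for objective, params, w in zip(objectives, parameters, weights)
    )

full = [1.0, 2.0, 3.0, 4.0, 5.0]
objectives = [mean_gap, negative_spread]
parameters = [{"full": full}, {}]   # one parameter dict per objective
weights = [10.0, 1.0]               # mean preservation weighted more heavily

loss_a = multi_criterion([1.0, 5.0], objectives, parameters, weights)
loss_b = multi_criterion([1.0, 2.0], objectives, parameters, weights)
```

Here the subset [1.0, 5.0] gets the lower loss because it both preserves the mean and spans the full range, which shows how the weights trade one criterion against another.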

UniCriterion Class

The UniCriterion class defines a single-criterion loss function with an objective function and optional parameters for subset selection.

  • Initialization: Initialize with an objective function, solve array name, selection method, and additional parameters.
  • Calculate: Compute the loss by evaluating the objective function on the selected subset for given parameters.

Example Usage

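# Compute a distance matrix and store it on the dataset under the name "distances"
dataset.preprocess(distances=fss.loss.distanceMatrix)

# solveMethod is an existing fss.Solver instance (see Solvers)
solveMethod.loss = fss.UniCriterion(objective=fss.loss.distinctiveness,
                                    solveArray="distances",
                                    selectBy="matrix")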
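The example above precomputes a distance matrix and selects from it by matrix. A plausible reading, sketched here under that assumption, is that the matrix holds pairwise distances between items and that a subset keeps only the rows and columns of the selected items. Both helpers are hypothetical, and Euclidean distance is an assumed choice.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]

def select_submatrix(matrix, z):
    """Keep only the rows and columns whose indicator entry is 1."""
    kept = [i for i, keep in enumerate(z) if keep]
    return [[matrix[i][j] for j in kept] for i in kept]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = distance_matrix(points)

# Select the first and third points; the result is a 2 x 2 matrix.
sub = select_submatrix(distances, z=[1, 0, 1])
```

Precomputing the full matrix once avoids recomputing distances every time a solver evaluates a candidate subset.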

Solvers

The Solver class encapsulates a solver with an algorithm and loss function for subset selection.

  • Initialization: Initialize with a solve algorithm and a loss function.
  • Solve: Execute the algorithm on a specified dataset using optional parameters.

Example Usage

solveMethod = fss.Solver(algorithm=fss.solver.greedyMinSubset, loss=lossFunction)
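To show how an algorithm and a loss function fit together, here is a generic greedy forward selection written against a plain list. It is not the implementation of fss.solver.greedyMinSubset, whose behavior is not described here; greedy_select and mean_gap_loss are hypothetical names.

```python
def greedy_select(items, subset_size, loss):
    """Greedily grow a subset, adding whichever item lowers the loss most at each step."""
    if subset_size > len(items):
        raise ValueError("subset_size cannot exceed the number of items")
    z = [0] * len(items)
    best_loss = float("inf")
    for _ in range(subset_size):
        best_index, best_loss = None, float("inf")
        for i in range(len(items)):
            if z[i]:
                continue  # item already selected
            z[i] = 1  # tentatively include item i
            current = loss([x for x, keep in zip(items, z) if keep])
            z[i] = 0  # undo the tentative inclusion
            if current < best_loss:
                best_index, best_loss = i, current
        z[best_index] = 1  # commit the best candidate of this round
    return z, best_loss

def mean_gap_loss(subset, target=3.0):
    """Example loss: distance of the subset mean from a fixed target."""
    return abs(sum(subset) / len(subset) - target)

items = [1.0, 2.0, 3.0, 4.0, 5.0]
z, final_loss = greedy_select(items, subset_size=3, loss=mean_gap_loss)
```

Because the loss is passed in as a function, the same search loop works with any criterion, which mirrors the separation of algorithm and loss in the Solver class.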

Plot

The plot module provides functions for configuring and creating plots of datasets and subsets using the Matplotlib and Seaborn libraries.

Color Class

  • Color Class: Define color palettes and color bars for consistent visualization.

Figure Operations

  • moveFigure: Move the upper-left corner of a figure to a specified pixel position.
  • clearAxes: Clear all axes in a given figure.
  • removeAxes: Remove all axes from a given figure.
  • setPickEvent: Set a pick event on a figure, invoking a specified function upon selection.

Plot functions

These functions take datasets and/or subsets and generate the corresponding plots. They include scatter, parallelCoordinates, and histogram.

Example Usage

import matplotlib.pyplot as plt

# Initialize color and plot settings
color = fss.plot.Color()
fss.plot.initialize(color)

# Create a scatterplot on a new Matplotlib axes
fig, ax = plt.subplots()
fss.plot.scatter(ax, color, dataset, subset, alpha=0.5)

Data

Data is stored in the data folder. Its subdirectories hold pickle files of pandas DataFrames for the randomly generated data and the calculated subsets used in this project. The example datasets are in the data/exampleDatasets subdirectory.

Figures

Figures are stored in the figures directory as PDFs.

A series of Jupyter notebooks demonstrating the package can be found in the jupyter directory.