ZJU-DAILY/Snoopy

Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns

Snoopy is an effective and efficient semantic join discovery framework powered by proxy-column-based column embeddings. The embeddings are derived from column-to-proxy-column relationships, which are captured by a lightweight column projection function based on approximate graph matching. To learn good proxy columns for guiding the column projection, a rank-aware contrastive learning paradigm is introduced.
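To make the idea of a proxy-column-based embedding concrete, here is a minimal sketch. It is not the repository's implementation: the function name `project_column`, the use of cosine similarity, and the greedy best-match averaging (a simple stand-in for approximate graph matching) are all assumptions made for illustration. Each column is a non-empty list of cell vectors, and its embedding has one score per proxy column.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_column(cells, proxy_columns):
    """Embed a column as its vector of matching scores, one per proxy column.

    Hypothetical helper: assumes `cells` and every proxy column are non-empty
    lists of cell vectors with the same dimensionality.
    """
    embedding = []
    for proxy in proxy_columns:
        # Greedy stand-in for approximate graph matching: pair each cell with
        # its most similar proxy cell, then average the pair scores.
        best = [max(cosine(c, p) for p in proxy) for c in cells]
        embedding.append(sum(best) / len(best))
    return embedding


# Two toy proxy columns, each holding a single 2-d cell vector.
proxies = [[[1.0, 0.0]], [[0.0, 1.0]]]
column = [[1.0, 0.0], [1.0, 0.0]]
print(project_column(column, proxies))
```

Under this sketch, the embedding dimension equals the number of proxy columns, so two columns can be compared without touching their individual cells again.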

Requirements

  • Python 3.7
  • PyTorch 1.10.1
  • CUDA 11.5
  • NVIDIA 3090 GPU

Please refer to the source code for the Python packages that need to be installed.

Datasets

We use WikiTable, Opendata, and WDC. We provide our experimental datasets.

Run Experimental Case

To construct training data:

python DataGen.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3

To learn proxy columns using the generated data:

python train.py --datasets "WikiTable" --type mat --tau 0.2 --list_size 3 --version Your_Model_Version

To perform semantic join search via the learned proxy columns:

python search.py --datasets "WikiTable" --version Your_Model_Version --topk 25
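The three steps above share most of their flags, so they can be chained from one small driver. The sketch below only assembles the command lines shown in this README; the helper names `build_pipeline` and `run_pipeline` are hypothetical and not part of the repository. It prints the commands as a dry run and leaves the actual execution commented out.

```python
import subprocess
import sys


def build_pipeline(dataset, version, gen_type="mat", tau=0.2, list_size=3, topk=25):
    """Return the three README commands as argv lists, in execution order."""
    shared = ["--datasets", dataset, "--type", gen_type,
              "--tau", str(tau), "--list_size", str(list_size)]
    return [
        [sys.executable, "DataGen.py", *shared],
        [sys.executable, "train.py", *shared, "--version", version],
        [sys.executable, "search.py", "--datasets", dataset,
         "--version", version, "--topk", str(topk)],
    ]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_pipeline("WikiTable", "Your_Model_Version")
for argv in commands:
    # Dry run: show the script name and its flags without executing anything.
    print(" ".join(argv[1:]))
# run_pipeline(commands)  # uncomment when running from the repository root
```

Passing the same `version` to training and search keeps the two steps pointed at the same saved model.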

Parameters

  • --datasets: the dataset used (e.g., "WikiTable")

  • --type: the data generation strategy to use ("mat" means embedding-level, and "text" means text-level)

  • --tau: the threshold of cell matching

  • --list_size: the size of the positive ranking list

  • --version: the model version you saved during the training phase and used for online search

  • --topk: the number of top-ranked joinable columns to return
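As an illustration of what --topk controls, the following sketch ranks a small in-memory repository of column embeddings against a query embedding and keeps the k best. The function `top_k_columns`, the toy column names and vectors, and the choice of cosine similarity are assumptions for this example; the scoring actually used by search.py should be checked in the source code.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_columns(query, repository, k):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = ((name, cosine(query, emb)) for name, emb in repository.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical column embeddings, keyed by "table.column".
repository = {
    "cities.name": [0.9, 0.1, 0.0],
    "countries.capital": [0.8, 0.3, 0.1],
    "products.sku": [0.0, 0.2, 0.9],
}
results = top_k_columns([1.0, 0.2, 0.0], repository, k=2)
print([name for name, _ in results])
```

`heapq.nlargest` avoids sorting the whole repository when k is small relative to the number of columns.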

Acknowledgement

The original datasets are from WikiTable, Opendata, and the WDC Web Table Corpus.

The baseline Deepjoin is implemented with the details provided by the authors after contacting them.
