GitHub - EgorDudyrev/caspailleur: Minimalistic python package for mining many concise data representations. Part of SmartFCA project.

Caspailleur is a python package for mining concepts and implications in binary data with FCA framework. Part of SmartFCA ANR project.

Get started

The stable version of the package can be installed from PyPI with:

pip install caspailleur

and the latest version of the package can be installed from GitHub repository:

pip install caspailleur@git+https://github.com/EgorDudyrev/caspailleur

Analysis example

Glossary

The field of Formal Concept Analysis has many mathematical terms and some conflicting notation traditions. Here is the glossary used throughout the caspailleur package: Glossary.md.

Data description

Let us study the following "Fruit" dataset:

title	smooth	fruit	color	form
apple	✓	✓	yellow	round
grapefruit		✓	yellow	round
kiwi		✓	green	oval
plum	✓	✓	blue	oval
mango	✓	✓	green	oval

Let us download binarised version of the same data:

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/EgorDudyrev/FCApy/main/data/mango_bin.csv', index_col=0)
df = df[df['fruit']]

print("__Objects:__", ', '.join(df.index))
print("__Attributes:__", ', '.join(df.columns))
print(df)

Objects: apple, grapefruit, kiwi, plum, mango

Attributes: firm, smooth, fruit, color_is_yellow, color_is_green, color_is_blue, color_is_white, form_is_round, form_is_oval, form_is_cubic

Binarised fruit dataset

title firm smooth fruit ... form_is_round form_is_oval form_is_cubic

apple False True True ... True False False

grapefruit False False True ... True False False

kiwi False False True ... False True False

plum False True True ... False True False

mango False True True ... False True False

[5 rows x 10 columns]

Tip

Caspailleur package can only work with binary data (and is optimised for this). You can consult Paspailleur package that extends Caspailleur functionality for complex non-binary data.

Mining concepts

Now we can find all concepts in the data:

import caspailleur as csp
concepts_df = csp.mine_concepts(df)

print(concepts_df[['extent', 'intent']].map(', '.join))

Concepts table (10 rows)

concept_id extent intent

0 plum, mango, grapefruit, apple, kiwi fruit

1 plum, mango, apple smooth, fruit

2 plum, kiwi, mango form_is_oval, fruit

3 plum, mango form_is_oval, smooth, fruit

4 grapefruit, apple color_is_yellow, form_is_round, fruit

5 kiwi, mango form_is_oval, color_is_green, fruit

6 apple smooth, color_is_yellow, form_is_round, fruit

7 mango form_is_oval, color_is_green, smooth, fruit

8 plum form_is_oval, color_is_blue, smooth, fruit

9 firm, color_is_yellow, form_is_round, form_is_oval, color_is_green, color_is_blue, smooth, form_is_cubic, color_is_white, fruit

The number of concepts is exponential to the number of objects and attributes in the data. To find only the most interesting concepts, specify min_support, min_delta_stability and/or n_stable_concepts parameters:

concepts_df = csp.mine_concepts(
  df, min_support=3, min_delta_stability=1,
  to_compute=['intent', 'keys', 'support', 'delta_stability', 'lesser']
)
print(concepts_df)

concept_id	intent	keys	support	delta_stability	lesser
0	{fruit}	[{}]	5	2	{1, 2}
1	{fruit, smooth}	[{smooth}]	3	1	{}
2	{fruit, form_is_oval}	[{form_is_oval}]	3	1	{}

Mining implications

For many datasets, the number of concepts is too large to be read by hand. Luckily, relationships between attributes can be described via implication bases whose number is usually much smaller (although there may be many implications selecting no objects).

import caspailleur as csp
implications_df = csp.mine_implications(df)

print(implications_df[['premise', 'conclusion']].map(', '.join))

Implications table (15 rows)

implication_id	premise	conclusion
0		fruit
1	form_is_round	color_is_yellow
2	color_is_yellow	form_is_round
3	color_is_green	form_is_oval
4	color_is_blue	form_is_oval, smooth
5	form_is_cubic	firm, color_is_white, color_is_blue, color_is_...
6	color_is_white	firm, color_is_blue, color_is_yellow, form_is_...
7	firm	color_is_white, color_is_blue, color_is_yellow...
8	form_is_oval, form_is_round	firm, color_is_white, color_is_blue, color_is_...
9	color_is_blue, form_is_round	form_is_cubic, color_is_white, firm, color_is_...
10	form_is_round, color_is_green	firm, color_is_white, color_is_blue, form_is_c...
11	color_is_blue, color_is_green	firm, color_is_white, color_is_yellow, form_is...
12	form_is_oval, color_is_yellow	firm, color_is_white, color_is_blue, color_is_...
13	color_is_yellow, color_is_blue	form_is_cubic, color_is_white, firm, color_is_...
14	color_is_yellow, color_is_green	firm, color_is_white, color_is_blue, form_is_c...

We can read the implications in the table and find out dependencies in the data. For example:

every object is a fruit
(from impl. 0: Ø -> fruit);
every round fruit is yellow and vice versa
(from impl. 1: form_is_round -> color_is_yellow, and impl. 2: color_is_yellow -> form is round);
firm fruits do not exist
(from impl. 7: firm -> color_is_white, color_is_blue,... ).

If finding full implication basis takes too much time, one can mine only a part of columns and implications:

implications_df = csp.mine_implications(
  df, basis_name='Canonical', unit_base=True,
  to_compute=['premise', 'conclusion', 'support'],
  min_support=2, min_delta_stability=1
)

print(implications_df)

implication_id	premise	conclusion	support
0	{}	fruit	5
1	{fruit, color_is_green}	form_is_oval	2
2	{fruit, form_is_round}	color_is_yellow	2
3	{fruit, color_is_yellow}	form_is_round	2

Mining descriptions

Finally, Caspailleur can output all descriptions in the data and their characteristics. But note that the number of descriptions = 2^number of attributes.

import caspailleur as csp
descriptions_df = csp.mine_descriptions(df)

print('__n. attributes:__', df.shape[1])
print('__n. descriptions:__', len(descriptions_df))
print('__columns:__', ', '.join(descriptions_df.columns))
print(descriptions_df[['description', 'support', 'is_key']].head(3))

n. attributes: 10
n. descriptions: 1024

columns: description, extent, intent, support, delta_stability, is_closed, is_key, is_passkey, is_proper_premise, is_pseudo_intent

description_id	description	support	is_key
0	{}	5	True
1	{firm}	0	True
2	{smooth}	3	True

Approach for faster computation

Caspailleur does three things to fasten up the computations:

It exploits the connections between characterisic attribute sets.
E.g. a function to compute proper premises takes intents and keys as inputs, and not the original binary data.
The set of intents is computed by LCM algorithm
well-implemented in scikit-mine package: https://pypi.org/project/scikit-mine/;
All intrinsic computations are performed with bitwise operations
provided by bitarray package: https://pypi.org/project/bitarray/

The diagram below presents dependencies between the characteristic attribute sets. For example, the arrow "intents -> keys" means that the set of intents is required to compute the set of keys.

  graph TD;
      S["<b>itemsets</b><br><small><tt>csp.np2bas(...)</tt></small>"];
      A["<b>intents</b><br><small><tt>csp.list_intents_via_LCM(...)</tt></small>"];
      B["<b>keys</b><br><small><tt>csp.list_keys(...)</tt></small>"];
      C["<b>passkeys</b><br><small><tt>csp.list_passkeys(...)</tt></small>"];
      D["<b>intents ordering</b><br><small><tt>csp.sort_intents_inclusion(...)</tt></small>"]; 
      E["<b>pseudo-intents</b><br><small><tt>csp.list_pseudo_intents_via_keys(...)</tt></small>"];
      F["<b>proper premises</b><br><small><tt>csp.iter_proper_premises_via_keys(...)</tt></small>"];
      G["<b>linearity index</b><br><small><tt>csp.linearity_index(...)</tt></small>"];
      H["<b>distributivity index</b><br><small><tt>csp.distributivity_index(...)</tt></small>"];
      
      S --> A
      A --> B;
      A --> C;
      A --> D;
      A --> E; 
      B --> E;  
      B --> F; A --> F; 
      A --> G; D --> G;
      D --> H; A --> H;

In case the diagram is not compiling, visit the GitHub version of README: https://github.com/EgorDudyrev/caspailleur

Note

Although caspailleur package implements many optimisations to fasten up the computations, we do not state that it is the fastest FCA package ever existed. For example, our algorithm for computing pseudo-intent basis is far from the state-of-art. Knowing that, we find caspailleur fast enough for comfortable everyday use.

How to cite

There are no papers written about caspailleur (yet). So you can cite the package itself.

@misc{caspailleur,
  title={caspailleur},
  author={Dudyrev, Egor},
  year={2023},
  howpublished={\url{https://www.smartfca.org/software}},
}

Funding

The package development is supported by ANR project SmartFCA (ANR-21-CE23-0023).

SmartFCA (https://www.smartfca.org/) is a big platform that will contain many extensions of Formal Concept Analysis including pattern structures, Relational Concept Analysis, Graph-FCA and others. While caspailleur is a small python package that covers only the basic notions of FCA.

Name		Name	Last commit message	Last commit date
Latest commit History 184 Commits
.github/workflows		.github/workflows
caspailleur		caspailleur
logo		logo
tests		tests
.gitignore		.gitignore
Glossary.md		Glossary.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
codecov.yml		codecov.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Get started

Analysis example

Glossary

Data description

Mining concepts

Mining implications

Mining descriptions

Approach for faster computation

How to cite

Funding

About

Releases 1

Packages

Contributors 2

Languages

title	firm	smooth	fruit	...	form_is_round	form_is_oval	form_is_cubic
apple	False	True	True	...	True	False	False
grapefruit	False	False	True	...	True	False	False
kiwi	False	False	True	...	False	True	False
plum	False	True	True	...	False	True	False
mango	False	True	True	...	False	True	False
[5 rows x 10 columns]

concept_id	extent	intent
0	plum, mango, grapefruit, apple, kiwi	fruit
1	plum, mango, apple	smooth, fruit
2	plum, kiwi, mango	form_is_oval, fruit
3	plum, mango	form_is_oval, smooth, fruit
4	grapefruit, apple	color_is_yellow, form_is_round, fruit
5	kiwi, mango	form_is_oval, color_is_green, fruit
6	apple	smooth, color_is_yellow, form_is_round, fruit
7	mango	form_is_oval, color_is_green, smooth, fruit
8	plum	form_is_oval, color_is_blue, smooth, fruit
9		firm, color_is_yellow, form_is_round, form_is_oval, color_is_green, color_is_blue, smooth, form_is_cubic, color_is_white, fruit

License

EgorDudyrev/caspailleur

Folders and files

Latest commit

History

Repository files navigation

Get started

Analysis example

Glossary

Data description

Mining concepts

Mining implications

Mining descriptions

Approach for faster computation

How to cite

Funding

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages