Add data lineage new tutorial
Abdullah Mamun committed Jul 24, 2023
1 parent 42ff042 commit 7a5928d
Showing 2 changed files with 50 additions and 10 deletions.
36 changes: 36 additions & 0 deletions doc/data_lineage.md
@@ -0,0 +1,36 @@
# Data lineage

In this tutorial, we will see how to view data lineage using Kolle without writing a single line of code.

Data lineage is commonly defined as the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.

There are two types of data lineage: business lineage and technical lineage. Business lineage provides a high-level view for users who want to see where data comes from and where it goes, for example to comply with data privacy rules or to validate other legal requirements. Technical lineage lets users view the details of each transformation and drill down to the schema and attribute level.

Using Kolle's automated data lineage, users can view the transformations and the links between the producer and consumer models.

Domain: Insurance claim

Source data: [Datasets](https://github.com/databricks-industry-solutions/dlt-insurance-claims/tree/main/data/samples/mongodb/claims.json)

### High level

Import producer model -> Transformation -> Consumer model

End-to-end lineage

### Processing steps

1. Importing source models from claims
2. Data transformation
3. Data lineage between producer and consumer
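
The steps above can be pictured as a small directed graph from producer to consumer. The following is a minimal sketch in Python, not Kolle's implementation: the model names (`claims_raw`, `claims_transformation`, `claims_refined`) and the helper functions are hypothetical and only illustrate how lineage links can be followed downstream.

```python
from collections import defaultdict

# Hypothetical node names; the tutorial does not name the actual models.
edges = [
    ("claims_raw", "claims_transformation"),      # imported producer model -> transformation
    ("claims_transformation", "claims_refined"),  # transformation -> consumer model
]


def build_graph(edge_list):
    """Map each node to the list of nodes that directly consume it."""
    graph = defaultdict(list)
    for source, target in edge_list:
        graph[source].append(target)
    return graph


def downstream(graph, start):
    """Return every node reachable from start, in the order it is discovered."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen


graph = build_graph(edges)
lineage = downstream(graph, "claims_raw")
print(lineage)
```

Following the edges in the opposite direction would answer the other lineage question, namely which sources feed a given consumer model.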

### Technical setup

1. JSON file as the source
2. Kafka for event streaming to ingest and process data in real-time
3. Kolle for metabase repository and lineage

### Show me

[![Introduction](https://img.youtube.com/vi/tFWhxj-SCPA/0.jpg)](https://youtu.be/tFWhxj-SCPA)

24 changes: 14 additions & 10 deletions doc/low_code_introduction.md
@@ -1,6 +1,12 @@
## Low code introduction

Kolle low code is a declarative language based on RDF triples.

Kolle low code is designed to be easy to convert to no-code and to be understandable to both machines and humans. RDF, in turn, is easy to represent and transform. Both the input and the output of the Kolle compiler are RDF triples.

Each line of Kolle input is a triple, and so is each line of output. The final emitter of Kolle can produce a SQL statement, a data lineage UI, a model visualization, an Avro schema, etc.

A triple has a subject, a property, and an object.

Subject: The name of the model or type. It is always a symbol and never nil.
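
To make the subject/property/object shape concrete, here is a minimal Python sketch of reading one whitespace-separated line into a triple. It is an illustration only, not the Kolle compiler: the `Triple` type, the `parse_line` helper, and the choice to map a `nil` property to `None` are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str         # model or type name; must not be nil
    prop: Optional[str]  # property name, or None when the line says nil (assumption)
    obj: str             # object: an attribute reference or an expression


def parse_line(line: str) -> Triple:
    """Split a line into subject, property, and object (hypothetical helper)."""
    parts = line.split(None, 2)  # the object may itself contain spaces
    if len(parts) != 3:
        raise ValueError(f"expected subject, property, and object: {line!r}")
    subject, prop, obj = parts
    if subject == "nil":
        raise ValueError("subject must not be nil")
    return Triple(subject, None if prop == "nil" else prop, obj)


mapping = parse_line("person_refined first_name person/f_name")
selection = parse_line("party_refined nil (select party_raw/*)")
print(mapping)
print(selection)
```

Splitting at most twice keeps a parenthesized object expression intact as a single field.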

@@ -247,7 +253,7 @@ Convert document model to flatten model
Subject Property Object
-----------------------------------------------------------------------------
person_full nil (metadata {:person {:f_name "" :l_name "" dob ""}
:address [{:post_code 23454 :street "" :city ""}]} );
_ _ (flatten person_full);
```
@@ -261,7 +267,7 @@ Apply is used for batch operation when the consumer model needs to remove duplic
Subject Property Object
-----------------------------------------------------------------------------
person_full nil (metadata {:person {:f_name "" :l_name "" dob ""}
:address [{:post_code 23454 :street "" :city ""}]} );
_ _ (flatten person_full);
_ _ (apply distinct _raw)
```
@@ -274,7 +280,7 @@ Get returns one model from the document or hierarchical model.
Subject Property Object
-----------------------------------------------------------------------------
person_full nil (metadata {:person {:f_name "" :l_name "" dob ""}
:address [{:post_code 23454 :street "" :city ""}]} );
person_raw nil (get person_full person);
address_raw nil (get person_full address)
```
@@ -286,7 +292,7 @@ Change attribute value from producer model to consumer model.
```
Subject Property Object
-----------------------------------------------------------------------------
person nil (metadata {:f_name "" :l_name "" gender ""})
person_refined first_name person/f_name
person_refined last_name person/l_name
person_refined gender (replace-value person/gender {"m" "male" "f" "female"} "na")
@@ -299,7 +305,7 @@ Accessing array element from producer model.
```
Subject Property Object
-----------------------------------------------------------------------------
person nil (metadata {:f_name "" :l_name "" gender "" mobile_no ["0176-564"]})
person_refined first_name person/f_name
person_refined last_name person/l_name
person_refined mobile_no (index-of person/mobile_no 1)
@@ -344,15 +350,14 @@ person latest only store latest value from producer person. **id** is primary ke
##### Data quality rule
```
Subject Property Object
-------------------------------------
party_raw id nil
party_raw f_name nil
party_raw l_name nil
party_raw age nil

party_refined nil (select party_raw/*)
party_refined age party_raw/age
party_refined nil (assoc-dv-attr id party_raw/id)
@@ -362,5 +367,4 @@ party_refined nil (where+ (!=null party_raw/f_name party_raw/l_name)
party_refined nil (where+ (< 4 party_raw/l_name))
```
where+ is applied to different attributes to enforce different data quality rules.
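
The idea of stacking several rules, each targeting its own attributes, can be sketched in Python as follows. This is an illustration under stated assumptions, not Kolle's semantics: it assumes a record must satisfy every rule to be accepted, the helpers `not_null`, `min_length`, and `passes_all` are hypothetical, the sample records are made up, and the length rule is an invented example rather than a translation of `(< 4 party_raw/l_name)`.

```python
def not_null(*names):
    """Rule: every named attribute must be present and not None."""
    def rule(record):
        return all(record.get(name) is not None for name in names)
    return rule


def min_length(name, minimum):
    """Hypothetical rule: the named string attribute has at least `minimum` characters."""
    def rule(record):
        value = record.get(name)
        return value is not None and len(value) >= minimum
    return rule


def passes_all(record, rules):
    """Assumption: rules combine with AND, so one failing rule rejects the record."""
    return all(rule(record) for rule in rules)


# One rule per concern, each looking at different attributes.
rules = [
    not_null("f_name", "l_name"),
    min_length("l_name", 5),
]

# Made-up sample records for illustration only.
records = [
    {"id": 1, "f_name": "Ada", "l_name": "Lovelace", "age": 36},
    {"id": 2, "f_name": None, "l_name": "Turing", "age": 41},
    {"id": 3, "f_name": "Max", "l_name": "Lee", "age": 29},
]

accepted = [record["id"] for record in records if passes_all(record, rules)]
print(accepted)
```

Keeping each rule as a separate function mirrors the one-rule-per-line style of the table above and makes it easy to report which rule a record failed.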
