Reusable Dataform Slowly Changing Dimension Type 2 Backfill for Changing schema and initial loads

Slowly Changing Dimension

A slowly changing dimension in data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule.

Slowly Changing Dimension Type 2

Row versioning, Track changes as version records with current flag & active dates and other metadata.

Use cases for this Code?

1 - Initial Load 2 - History Backfill for schema changes 3 - Reload for a specific timestamp due to data ingestion issues/corrupt date, etc

Variables

Key variables in this code is the Source Table details (Catalog, schema and Table name), Target Table details (Catalog, schema and Table name), From date, To date and Portion Size. The other very important variable that we can set (just for CLI at the moment when this Readme was written) is the concurrency limits. This variable helps us set the number of parallel steps we want to execute at a time. The hash (unique and non_unique), eff_date and expiry date column names are optional, but if there are any specific naming conventions it could be used.

Reusable?

The code divides the From and To Date into portions size you enter and merge everything together. So in case of the data being huge for each partion, we could process it for daily all parallel (in case we want to limit, the concurrency could be used to define the max parallel tasks). And in case of smaller partition size we could do all at once.

Flowchart

The other thing that makes it reusable is that uses Information Schema to get all the Non Unique Column names and the internal metadata table for unique Column names. The joins and window functions are majorly made by the hash keys generated in the child tables and hence optimizes the code a little further.

The future steps to this code is a set of assertions to confirm if the data looks as expected. Also, depending on the use case we could either swap the final table created by this process or use a part of it as needed after renaming and droping the processing columns like hashing columns, min_source_eff_date, etc.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
definitions		definitions
.gitignore		.gitignore
README.md		README.md
cloudbuild_scd2_backfill.yaml		cloudbuild_scd2_backfill.yaml
dataform.json		dataform.json
package-lock.json		package-lock.json
package.json		package.json
scd2_run.sh		scd2_run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reusable Dataform Slowly Changing Dimension Type 2 Backfill for Changing schema and initial loads

About

Releases

Packages

Languages

latika-wadhwa/dataform-scd-backfill

Folders and files

Latest commit

History

Repository files navigation

Reusable Dataform Slowly Changing Dimension Type 2 Backfill for Changing schema and initial loads

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages