kamir/etosha

# Welcome to Etosha!

Etosha aims to build a bridge between the Big Data and Linked Data domains.

Our goal is a global "Dataset-Graph": either public and worldwide, like Wikipedia, or an enterprise-wide metadata graph that
spans multiple departments or institutions.

We create tools for contextualization, metadata extraction, and metadata integration.

The project is managed on our public JIRA.

## How Etosha Enables the Global Data Village

Etosha is a distributed, decentralized, fault-tolerant metadata service. Its main purpose is publishing and linking information about datasets. Etosha supports real knowledge management, whereas HCatalog and the Hive metastore focus on low-level technical metadata.

A "cluster-internal" perspective offers analysts, developers, and administrators an easy and consistent set of procedures to aggregate existing information, such as dataset schemas and table or column statistics (which can also be created by Hive-Stats or SARAH), within a data-link server. An external perspective offers features to connect multiple data-link servers and to embed database metadata into non-technical contexts via semantic links. This embedding into the global linked data graph, using DOAx files, allows plausibility checks and supports the interpretation and comparison of results. Access control always stays in the user's hands, because data does not have to be moved to a public provider. Etosha operates as a gateway between emerging data hubs.
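The two perspectives can be illustrated with a minimal sketch. It is not Etosha's actual API or the DOAx format, neither of which is specified here: the `DatasetDescription` class, the `doax:` prefix, and the predicate names `rowCount`, `column`, and `about` are hypothetical, and the dataset URI and the DBpedia target are placeholders. The sketch only shows the idea of keeping technical metadata (schema, statistics) next to semantic links and flattening both into subject-predicate-object triples.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical prefix; the real DOAx vocabulary is not defined in this README.
DOAX = "doax:"


@dataclass
class DatasetDescription:
    """Technical metadata for one dataset plus semantic links to domain concepts."""

    uri: str
    schema: dict[str, str]  # column name -> column type
    row_count: int
    links: list[tuple[str, str]] = field(default_factory=list)  # (predicate, target URI)

    def link(self, predicate: str, target: str) -> None:
        """Attach a semantic link from this dataset to an external resource."""
        self.links.append((predicate, target))

    def triples(self) -> list[tuple[str, str, str]]:
        """Flatten technical metadata and links into (subject, predicate, object) triples."""
        out = [(self.uri, DOAX + "rowCount", str(self.row_count))]
        for column, col_type in self.schema.items():
            out.append((self.uri, DOAX + "column", f"{column}:{col_type}"))
        out.extend((self.uri, predicate, target) for predicate, target in self.links)
        return out


# Cluster-internal view: schema and statistics of a (made-up) table.
trips = DatasetDescription(
    uri="hdfs://cluster-a/warehouse/trips",
    schema={"trip_id": "bigint", "city": "string"},
    row_count=1200,
)
# External view: a semantic link into the global linked data graph.
trips.link(DOAX + "about", "http://dbpedia.org/resource/Taxicab")

for subject, predicate, obj in trips.triples():
    print(subject, predicate, obj)
```

Only the description is shared in this sketch; the data itself stays in the cluster, which matches the access-control point made above.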

Only active linking between technical data and domain knowledge gives a 360° view of any dataset, whether it is a collection of documents, a normalized database model, or even a loose collection of files in HDFS.

Data collection and readily available processing resources became mainstream during the last decade, especially after the breakthrough and subsequent enterprise success of the Hadoop ecosystem. But real insights require not just numbers or facts but also context, and Etosha provides scalable context management services.

Communication is key to success in general, and communication tools are widely accepted, but there is an obvious lack of standards that would allow a dataset network to grow and would support context embedding in a collaborative environment. Such a network would be comparable to Facebook's social graph or the CrunchBase business graph. It would enable cluster-spanning dataset discovery and a public global data catalog, which is considered a driving force towards a new era in data analysis and data-driven business.

The Etosha-Graph will be built to arrange, visualize, filter, and share dataset context information using a class of ontologies described as DOAx files. Etosha follows the path shown by the DOAP project years ago and applies the concept of interlinked project life cycle management to context-sensitive dataset life cycle management.