QIUSHI BAI edited this page Aug 11, 2019 · 5 revisions

Front-end developers need to tell Cloudberry which dataset to query and what the dataset looks like so that Cloudberry can apply its optimization techniques.

To do this, send a DDL (Data Definition Language) JSON file to Cloudberry's /admin/register path using the HTTP POST method. This page explains how to write a DDL JSON file and how to send it to Cloudberry. It continues to use the Twitter data example, whose schema is defined in Prepare Dataset.

DDL JSON

To declare a dataset schema to Cloudberry, write a JSON file including the following components:

  • dataset : the dataset (table) name in your database.
  • schema : the schema definition.
    • typeName : (optional) type name for the dataset. (Only useful for AsterixDB)
    • dimension : the columns to group by. They are usually shown on the x-axis of a visualization.
    • measurement : the columns to apply aggregation functions to, such as count(), sum(), average(), min(), max(). They can also be used to filter the data, but they should not be used as group-by keys.
    • primaryKey : the primary key column name.
    • timeField : the time column name. Used for query slicing.
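
As a minimal sketch of how these components fit together, the DDL can be assembled in code and then serialized to JSON. The helper names `field` and `make_ddl` are hypothetical and not part of Cloudberry; only the JSON keys come from the list above, and the column names are taken from the Twitter example.

```python
import json


def field(name, datatype, is_optional=False, inner_type=None):
    """Build one column entry (hypothetical helper, not a Cloudberry API)."""
    entry = {"name": name, "isOptional": is_optional, "datatype": datatype}
    if inner_type is not None:
        entry["innerType"] = inner_type
    return entry


def make_ddl(dataset, dimension, measurement, primary_key, time_field,
             type_name=None):
    """Assemble the DDL dict; typeName is only added when it is given."""
    schema = {}
    if type_name is not None:
        schema["typeName"] = type_name
    schema["dimension"] = dimension
    schema["measurement"] = measurement
    schema["primaryKey"] = primary_key
    schema["timeField"] = time_field
    return {"dataset": dataset, "schema": schema}


ddl = make_ddl(
    dataset="twitter.ds_tweet",
    dimension=[
        field("create_at", "Time"),
        field("id", "Number"),
        field("hashtags", "Bag", is_optional=True, inner_type="String"),
    ],
    measurement=[field("retweet_count", "Number")],
    primary_key=["id"],
    time_field="create_at",
    type_name="twitter.typeTweet",
)

# Serialize to the JSON text that would be sent to Cloudberry.
print(json.dumps(ddl, indent=2))
```

Building the entries through one helper keeps the key spelling (`isOptional`, `innerType`) consistent across all columns.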

The following JSON request registers the Twitter dataset stored in AsterixDB with the middleware.

{
  "dataset":"twitter.ds_tweet",
  "schema":{
    "typeName":"twitter.typeTweet",
    "dimension":[
      {"name":"create_at","isOptional":false,"datatype":"Time"},
      {"name":"id","isOptional":false,"datatype":"Number"},
      {"name":"coordinate","isOptional":false,"datatype":"Point"},
      {"name":"lang","isOptional":false,"datatype":"String"},
      {"name":"is_retweet","isOptional":false,"datatype":"Boolean"},
      {"name":"hashtags","isOptional":true,"datatype":"Bag","innerType":"String"},
      {"name":"user_mentions","isOptional":true,"datatype":"Bag","innerType":"Number"},
      {"name":"user.id","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.stateID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.countyID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.cityID","isOptional":false,"datatype":"Number"},
      {"name":"geo","isOptional":false,"datatype":"Hierarchy","innerType":"Number",
        "levels":[
          {"level":"state","field":"geo_tag.stateID"},
          {"level":"county","field":"geo_tag.countyID"},
          {"level":"city","field":"geo_tag.cityID"}]}
    ],
    "measurement":[
      {"name":"text","isOptional":false,"datatype":"Text"},
      {"name":"in_reply_to_status","isOptional":false,"datatype":"Number"},
      {"name":"in_reply_to_user","isOptional":false,"datatype":"Number"},
      {"name":"favorite_count","isOptional":false,"datatype":"Number"},
      {"name":"retweet_count","isOptional":false,"datatype":"Number"},
      {"name":"user.status_count","isOptional":false,"datatype":"Number"}
    ],
    "primaryKey":["id"],
    "timeField":"create_at"
  }
}
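
Before sending a DDL like the one above, it can help to check that its cross-references are consistent. The sketch below assumes that `primaryKey`, `timeField`, and each Hierarchy level's `field` should name a declared column, as they do in the example. Whether Cloudberry itself enforces these rules is not stated here, and `check_ddl` is a hypothetical helper.

```python
import copy


def check_ddl(ddl):
    """Return a list of problems found in a DDL dict (empty if none found)."""
    problems = []
    schema = ddl.get("schema", {})
    fields = schema.get("dimension", []) + schema.get("measurement", [])
    names = [f["name"] for f in fields]
    declared = set(names)

    # A column should be declared only once across both lists.
    for name in sorted(declared):
        if names.count(name) > 1:
            problems.append(f"duplicate column: {name}")

    # primaryKey and timeField should point at declared columns.
    for key in schema.get("primaryKey", []):
        if key not in declared:
            problems.append(f"primaryKey refers to undeclared column: {key}")
    time_field = schema.get("timeField")
    if time_field not in declared:
        problems.append(f"timeField refers to undeclared column: {time_field}")

    # Each Hierarchy level should map to a declared column.
    for f in fields:
        if f.get("datatype") == "Hierarchy":
            for level in f.get("levels", []):
                if level["field"] not in declared:
                    problems.append(
                        f"hierarchy {f['name']} level {level['level']} "
                        f"refers to undeclared column: {level['field']}"
                    )
    return problems


good = {
    "dataset": "twitter.ds_tweet",
    "schema": {
        "dimension": [
            {"name": "create_at", "isOptional": False, "datatype": "Time"},
            {"name": "id", "isOptional": False, "datatype": "Number"},
            {"name": "geo_tag.stateID", "isOptional": False, "datatype": "Number"},
            {"name": "geo", "isOptional": False, "datatype": "Hierarchy",
             "innerType": "Number",
             "levels": [{"level": "state", "field": "geo_tag.stateID"}]},
        ],
        "measurement": [
            {"name": "retweet_count", "isOptional": False, "datatype": "Number"},
        ],
        "primaryKey": ["id"],
        "timeField": "create_at",
    },
}

# A deliberately broken copy: misspelled timeField and an undeclared level.
broken = copy.deepcopy(good)
broken["schema"]["timeField"] = "created_at"
broken["schema"]["dimension"][3]["levels"].append(
    {"level": "county", "field": "geo_tag.countyID"}
)

print(check_ddl(good))
print(check_ddl(broken))
```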

Note:

  • Columns that are not of interest for visualization do not need to appear in the schema declaration.
  • isOptional: whether the column can be missing (in semi-structured databases) or null (in traditional relational databases).
  • datatype: the data type of the declared column. The supported types are described below.

Data Types

Cloudberry supports the following data types:

  • Boolean : boolean in databases.
  • Number : a superset including int8, int32, int64, float, double in databases.
  • Point : geo-location point composed of two Numbers, e.g. Point(80.00, -10.0).
  • Time : datetime in databases.
  • String : string in databases. It is usually used for dimension columns, for filtering and "group by".
  • Text : text in databases, or string for databases that do not support text. It applies only to measurement columns, which are filtered by full-text search. It usually implies that an inverted index is built on the field.
  • Bag : set in databases (mainly AsterixDB; traditional relational databases usually do not support sets).
  • Hierarchy : a synthetic field that defines hierarchical relationships between existing columns.

Cloudberry supports the following pre-defined functions for different data types:

Pre-defined Functions

| Datatype  | Filter                | Groupby          | Aggregation                                 |
|-----------|-----------------------|------------------|---------------------------------------------|
| Boolean   | isTrue, isFalse       | self             | distinct-count                              |
| Number    | <, >, ==, in, inRange | bin(scale)       | count, sum, min, max, avg                   |
| Point     | inRange               | cell(scale)      | count                                       |
| Time      | <, >, ==, inRange     | interval(x hour) | count                                       |
| String    | contains, matchs, ~=  | self             | distinct-count, topK                        |
| Text      | contains              |                  | distinct-count, topK (on word-token result) |
| Bag       | contains              |                  | distinct-count, topK (on internal data)     |
| Hierarchy |                       | rollup           |                                             |
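
The table above can be transcribed into a lookup structure so that a front end can check whether a function applies to a column before building a query. This is only a sketch: `FUNCTIONS`, `supports`, and `groupable` are hypothetical names, and the column assignment for the rows with fewer entries (Text, Bag, Hierarchy) is one reading of the table, with `rollup` treated as a group-by function.

```python
# Pre-defined functions per data type, transcribed from the table above.
FUNCTIONS = {
    "Boolean": {"filter": ["isTrue", "isFalse"],
                "groupby": ["self"],
                "aggregation": ["distinct-count"]},
    "Number": {"filter": ["<", ">", "==", "in", "inRange"],
               "groupby": ["bin(scale)"],
               "aggregation": ["count", "sum", "min", "max", "avg"]},
    "Point": {"filter": ["inRange"],
              "groupby": ["cell(scale)"],
              "aggregation": ["count"]},
    "Time": {"filter": ["<", ">", "==", "inRange"],
             "groupby": ["interval(x hour)"],
             "aggregation": ["count"]},
    "String": {"filter": ["contains", "matchs", "~="],
               "groupby": ["self"],
               "aggregation": ["distinct-count", "topK"]},
    # Assumption: Text and Bag have no group-by entry in the table.
    "Text": {"filter": ["contains"],
             "groupby": [],
             "aggregation": ["distinct-count", "topK"]},
    "Bag": {"filter": ["contains"],
            "groupby": [],
            "aggregation": ["distinct-count", "topK"]},
    # Assumption: rollup belongs to the Groupby column.
    "Hierarchy": {"filter": [], "groupby": ["rollup"], "aggregation": []},
}


def supports(datatype, kind, function):
    """True if `function` is listed for `datatype` under `kind`
    ("filter", "groupby", or "aggregation")."""
    return function in FUNCTIONS[datatype][kind]


def groupable(datatype):
    """True if the table lists at least one group-by function for the type."""
    return bool(FUNCTIONS[datatype]["groupby"])


print(supports("Number", "aggregation", "sum"))
print(groupable("Text"))
```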

Register End Point

The front-end application sends the DDL JSON file to Cloudberry's /admin/register path using the HTTP POST method. For example, the DDL above can be registered with the following command:

curl -X POST -H "Content-Type: application/json" -d @JSON_FILE_NAME http://localhost:9000/admin/register
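
The same request can be issued from code instead of the command line. The sketch below uses only the Python standard library and mirrors the curl command: same path, same Content-Type header, same POST method. The function names `build_register_request` and `register` are hypothetical, the sample DDL is a cut-down version of the Twitter example, and the response body format is not assumed, so only the HTTP status is returned.

```python
import json
import urllib.request


def build_register_request(ddl, base_url="http://localhost:9000"):
    """Build (but do not send) the POST request for /admin/register."""
    body = json.dumps(ddl).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(ddl, base_url="http://localhost:9000", timeout=10):
    """Send the DDL and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError if the server cannot be reached.
    """
    request = build_register_request(ddl, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# A cut-down DDL used only to illustrate the request shape.
sample_ddl = {
    "dataset": "twitter.ds_tweet",
    "schema": {
        "dimension": [
            {"name": "create_at", "isOptional": False, "datatype": "Time"},
            {"name": "id", "isOptional": False, "datatype": "Number"},
        ],
        "measurement": [],
        "primaryKey": ["id"],
        "timeField": "create_at",
    },
}

request = build_register_request(sample_ddl)
print(request.get_method(), request.full_url)
```

Calling `register(sample_ddl)` sends the request; it requires a Cloudberry instance listening at the given base URL.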

You can open the following URL to check the schemas of all datasets that have been successfully registered in Cloudberry.

http://localhost:9000/

Now that the dataset is registered with Cloudberry, you can move on to Query Cloudberry.