This image can be used in any environment to spin up Apache Drill in distributed mode. Distributed mode requires an external ZooKeeper instance that is running and reachable.
Prerequisites
- Apache Drill
- Apache ZooKeeper
- Docker installed
- AWS access keys, or an AWS EC2 instance with an attached IAM role that grants read and list permissions.
- AWS S3 bucket
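Distributed mode means every Drillbit registers with the external ZooKeeper, which Drill normally configures through drill-override.conf. The sketch below renders a minimal version of that file. The function name, the cluster id `drillbits1`, and the host name `zookeeper:2181` are illustrative assumptions, not values taken from this repository.

```python
def render_drill_override(cluster_id: str, zk_hosts: list[str]) -> str:
    """Return the text of a minimal drill-override.conf for distributed mode.

    cluster_id and zk_hosts are supplied by the caller; nothing here is
    read from the image or the repository.
    """
    if not zk_hosts:
        raise ValueError("at least one ZooKeeper host:port is required")
    # Drill expects a comma-separated list of host:port pairs.
    zk_connect = ",".join(zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.connect: "{zk_connect}"\n'
        "}\n"
    )


# Example with assumed values: one ZooKeeper container named "zookeeper".
EXAMPLE_CONF = render_drill_override("drillbits1", ["zookeeper:2181"])
```

All Drillbits that should form one cluster need the same `cluster-id` and the same `zk.connect` value.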
If you are deploying Apache Drill to an EC2 instance that has an IAM role attached, add the configuration below to your core-site.xml file before building the Docker image, so that the IAM role is used to access S3 storage.
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.impl.disable.cache</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>
</configuration>
If you are using an AWS access key and secret key instead, copy the configuration below into your core-site.xml file before building the Docker image.
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ENTER-YOUR-ACCESS-KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>ENTER-YOUR-SECRET-KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>
</configuration>
I am located closest to us-east-1, so my S3 endpoint is s3.amazonaws.com. Look up the endpoint for the region nearest to you in the AWS Endpoint Search.
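The two core-site.xml variants differ only in how credentials are supplied, so they can be generated from one helper. The following is a minimal sketch, assuming Python 3.9+ (for `ET.indent`); the function name `build_core_site` and its parameters are hypothetical, and the property names are the ones used in the configurations above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_core_site(endpoint: str = "s3.amazonaws.com",
                    access_key: Optional[str] = None,
                    secret_key: Optional[str] = None,
                    max_connections: int = 100) -> str:
    """Return core-site.xml text for either IAM-role or key-based S3 access."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("provide both access_key and secret_key, or neither")

    props = {}
    if access_key is None:
        # No keys given: rely on the IAM role attached to the EC2 instance.
        props["fs.s3a.aws.credentials.provider"] = (
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
        props["fs.s3a.impl.disable.cache"] = "true"
    else:
        props["fs.s3a.access.key"] = access_key
        props["fs.s3a.secret.key"] = secret_key
    props["fs.s3a.endpoint"] = endpoint
    props["fs.s3a.connection.maximum"] = str(max_connections)

    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print in place (Python 3.9+)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8" ?>\n' + body + "\n"
```

Calling `build_core_site()` yields the IAM-role variant, while `build_core_site(endpoint="s3.us-west-2.amazonaws.com", access_key="...", secret_key="...")` yields the key-based variant for a different region. Write the returned text to core-site.xml before building the image.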
Clone the repository to your local file system and change into its directory:
cd apache-drill
Then build the Docker image and start the containers using the commands below:
docker build -t apachedrill:latest .
docker compose up -d
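Once the containers are running, the Apache Drill UI should be reachable on port 8047.
Alternatively, pull the prebuilt image instead of building it locally:
docker pull ravuri96/apache-drill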
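Drill can take a little while to start after the containers come up, so it may help to wait for the web UI before configuring anything. This is a rough sketch using only the standard library; the helper names, the retry counts, and the assumption that the UI is published on `http://localhost:8047/` are illustrative and depend on your compose file.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, or timed out: Drill is not ready yet.
        return False


def wait_for_drill(url: str = "http://localhost:8047/",
                   attempts: int = 30,
                   delay: float = 2.0,
                   probe: Callable[[str], bool] = http_ok,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until probe succeeds or attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next try
    return False
```

A typical call is `wait_for_drill()` right after `docker compose up -d`; it returns `False` if the UI never answered within the allotted attempts.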
- Create an S3 bucket.
- Add the insurance-data.csv file from the project to the S3 bucket.
- Enable the S3 storage plugin.
To enable the S3 storage plugin, use the following configuration:
{
  "type": "file",
  "connection": "s3a://your-bucket-name",
  "config": null,
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "extractHeader": true,
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "httpd": {
      "type": "httpd",
      "logFormat": "%h %t \"%r\" %>s %b \"%{Referer}i\"",
      "timestampFormat": null
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  },
  "enabled": true
}
Copy the above configuration, paste it into your S3 storage plugin in the Drill UI, replace your-bucket-name with the name of your bucket, and click Update.
The AWS credentials or IAM role are already defined in core-site.xml, so they do not need to be defined again in the storage plugin configuration.
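As an alternative to pasting the JSON into the UI, the plugin can be registered through Drill's REST API, which accepts a POST to `/storage/{name}.json` with the plugin name and configuration. The sketch below only builds the request; the helper name, the base URL `http://localhost:8047`, and the trimmed-down example configuration are assumptions for illustration.

```python
import json
import urllib.request


def build_plugin_request(plugin_config: dict,
                         name: str = "s3",
                         base_url: str = "http://localhost:8047"
                         ) -> urllib.request.Request:
    """Build (but do not send) the request that creates or updates a plugin."""
    payload = json.dumps({"name": name, "config": plugin_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/storage/{name}.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Reduced example config; in practice load the full JSON shown above.
EXAMPLE_CONFIG = {
    "type": "file",
    "connection": "s3a://your-bucket-name",  # placeholder bucket name
    "workspaces": {"root": {"location": "/", "writable": False}},
    "formats": {"csv": {"type": "text", "extensions": ["csv"],
                        "extractHeader": True, "delimiter": ","}},
    "enabled": True,
}
```

To apply it, pass the result to `urllib.request.urlopen(build_plugin_request(EXAMPLE_CONFIG))` once Drill is running.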
All configuration is now complete.
Now go to http://localhost:8047/query and run the following query:
select * from s3.`insurance-data.csv`;
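The same query can also be submitted without the browser through Drill's REST endpoint `/query.json`. This is a minimal sketch under the assumption that Drill is reachable at `http://localhost:8047`; the helper names are hypothetical, and stripping the trailing semicolon is a precaution because some Drill versions reject it over REST.

```python
import json
import urllib.request


def build_query_request(sql: str,
                        base_url: str = "http://localhost:8047"
                        ) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs sql on Drill."""
    # Precaution: drop a trailing semicolon before sending.
    statement = sql.strip().rstrip(";").strip()
    payload = json.dumps({"queryType": "SQL", "query": statement}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_from_response(body: bytes) -> list:
    """Extract the list of row objects from a /query.json response body."""
    return json.loads(body).get("rows", [])
```

For example, `urllib.request.urlopen(build_query_request("select * from s3.`insurance-data.csv`"))` returns a response whose body can be passed to `rows_from_response`.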
Happy Drilling..!!!