Raniksingh/hadoop-discovery-tool
1. Hadoop Assessment Tool

This tool is developed to enable a detailed, automated assessment of Hadoop clusters. It helps estimate the effort of migrating away from the current Hadoop cluster. It generates a PDF report with information about the complete cluster, organized into the following categories:

  1. Hardware & OS Footprint
  2. Framework & Software Details
  3. Data & Security
  4. Network & Traffic
  5. Operations & Monitoring
  6. Application

2. Tool Functionality

The Hadoop Assessment Tool is built to analyze an on-premises Hadoop environment based on various factors and metrics.


  1. This Python-based tool uses the Cloudera Manager API, the generic YARN API, and OS-level CLI requests to retrieve information from the Hadoop cluster
  2. Information from the APIs arrives in the form of JSON payloads
  3. Information from CLI command requests is captured as command output stored in variables
  4. With the help of Python parsing methods, the required insights about each feature are extracted
  5. As the output of a tool execution, a PDF report is generated that contains information about all the features
  6. The script also generates a log file, which contains execution information and errors (if any)
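
A minimal sketch of steps 1-4 above, assuming a hypothetical Cloudera Manager host and placeholder credentials (the real tool collects these from user input at runtime):

  import subprocess
  import requests

  CM_HOST = "cm-host.example.com"   # hypothetical Cloudera Manager host
  CM_PORT = 7180                    # default non-SSL Cloudera Manager port
  CM_AUTH = ("admin", "admin")      # placeholder credentials

  # Steps 1-2: API request; Cloudera Manager answers with a JSON payload.
  # v19 is an example API version; the actual version depends on the CM release.
  resp = requests.get(f"http://{CM_HOST}:{CM_PORT}/api/v19/clusters", auth=CM_AUTH)
  clusters = resp.json()["items"]

  # Step 3: CLI request; the command's output is stored in a variable.
  kernel = subprocess.check_output(["uname", "-r"], text=True).strip()

  # Step 4: parsing; keep only the insights the report needs.
  for cluster in clusters:
      print(cluster["displayName"], cluster.get("fullVersion"), kernel)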

3. Prerequisites

  1. Important highlights of the Tool

    1. The tool runs on Python 3.6, 3.7, and 3.8
    2. The tool supports Cloudera versions CDH 5.13.3 and above, CDH 6.x, and CDH 7.x
    3. The tool runs on the following OS versions and above: CentOS 6, Red Hat 7, Debian 9, Ubuntu 16, SLES 12 SP5
    4. The tool requires pip to be installed
  2. Complete information to run the tool

    1. The tool runs only on one of the master nodes

    2. The tool requires ~265 megabytes of space

    3. Preferred time to run the tool: It is recommended to run the tool during hours when there is the least workload on the Hadoop cluster

    4. This tool supports the following Python versions:

      1. Python 3.6
      2. Python 3.7
      3. Python 3.8
    5. This tool supports the following Cloudera versions:

      1. CDH 5.13.3 and above
      2. CDH 6.x
      3. CDH 7.x
    6. This tool runs on the following Linux versions:

      1. CentOS 6 and above
      2. Red Hat 7 and above
      3. Debian 9 and above
      4. Ubuntu 16 and above
      5. SLES 12 SP5 and above
    7. This tool requires pip to be installed.

    8. This tool requires the command-line package manager to be updated before running the script. Updating refreshes the package list from the operating system’s central repository so that packages can be upgraded. The package manager can be updated with the respective OS command below:

      1. Red Hat/CentOS:
        sudo yum -y update
      2. Ubuntu/Debian:
        sudo apt-get update
      3. openSUSE:
        sudo zypper update
    9. The Cloudera Manager user should have one of the following roles:

      Hadoop Version | Roles
      CDH 5.13.3 | Dashboard User, User Administrator, Full Administrator, Operator, BDR Administrator, Cluster Administrator, Limited Operator, Configurator, Read-Only, Auditor, Key Administrator, Navigator Administrator
      CDH 6.x | Dashboard User, User Administrator, Full Administrator, Operator, BDR Administrator, Cluster Administrator, Limited Operator, Configurator, Read-Only, Auditor, Key Administrator, Navigator Administrator
      CDH 7.x | Auditor, Cluster Administrator, Configurator, Dashboard User, Full Administrator, Key Administrator, Limited Cluster Administrator, Limited Operator, Navigator Administrator, Operator, Read Only, Replication Administrator, User Administrator
    10. The following OS packages will be installed to generate metrics (see the install-fallback sketch at the end of this section):

      1. The tool checks whether each package is already installed
      2. If a package is not already installed, the tool installs it from a local repo
      3. If it is not present in the local repo, the tool downloads it from the internet
    Package | Description
    python-dev | Contains the header files and dependent packages needed to build Python extensions; these must be present on the system so they can be found while running the script.
    gcc, gcc-c++ | Used by the Python pandas functions that build data frames in the script.
    unixODBC-devel | Provides a database-neutral interface for connecting to relational databases.
    cyrus-sasl-devel | Used to connect to the Hive metastore, which enables retrieval of Hive-related features.
    nload | A third-party monitoring tool used for system monitoring and for checking ingress/egress-related information.
    vnstat | A third-party tool used to check network utilization over a period of time.
    iostat | Used to monitor system input/output statistics for devices and partitions.
    libsasl2-dev (Ubuntu-specific) | Installed so that python3-venv can be set up on Ubuntu, since Ubuntu does not provide venv inside the Python package.
    11. This tool comes with the Python packages below, which are installed locally in an ephemeral Python virtual environment (see the sketch at the end of this section):

    certifi==2020.12.5         chardet==4.0.0                cycler==0.10.0
    DateTime==4.3              fpdf==1.7.2                   greenlet==1.0.0
    idna==2.10                 importlib-metadata==3.7.3     kiwisolver==1.3.1
    matplotlib==3.3.4          numpy==1.19.5                 pandas==1.1.5
    Pillow==8.1.2              psycopg2-binary==2.8.6        PyMySQL==1.0.2
    pyodbc==4.0.30             pyparsing==2.4.7              python-dateutil==2.8.1
    pytz==2021.1               requests==2.25.1              scipy==1.5.4
    seaborn==0.11.1            six==1.15.0                   SQLAlchemy==1.3.22
    tqdm==4.59.0               typing-extensions==3.7.4.3    urllib3==1.26.4
    virtualenv==15.1.04        zipp==3.4.1                   zope.interface==5.2.0
    wheel==0.36.2              pip==21.0.1                   cffi==1.14.5
    cryptography==3.4.7        pycparser==2.20               cx-Oracle==8.1.0
    12. The tool supports the following Hive metastore databases (see the connection sketch at the end of this section):
      1. PostgreSQL
      2. MySQL
      3. Oracle
    13. The tool needs access to the dfsadmin utility.
    14. Detailed execution information will be logged in two log files, present at the following locations:
      1. Python log: ./hadoop_assessment_tool_{YYYY-MM-DD_HH:MM:SS}.log
      2. Shell log: ./hadoop_assessment_tool_terminal.log
    15. After a successful tool execution, a PDF report will be generated at the location: ./hadoop_assessment_report_{YYYY-MM-DD_HH-MM-SS}.pdf
  3. User Input Requirements

    1. The tool needs the following permissions and inputs to run and generate the PDF report:
      1. Sudo permission on the node.
      2. Cloudera Manager (the user should have one of the roles mentioned in 3.2.9)
        1. Host IP
        2. Port Number
        3. User Name
        4. Password
        5. Cluster Name
      3. Hive Metastore
        1. User Name
        2. Password
      4. Kafka
        1. Number of brokers
        2. Host Name of each broker
        3. IP of each broker
        4. Port number of each broker
        5. Log directory path of each broker
      5. Whether SSL is enabled (conditional input: prompted only if automatic detection fails)
      6. YARN (conditional input: prompted only if automatic detection fails; see the ResourceManager sketch at the end of this section)
        1. Resource Manager hostname or IP address
        2. Port number
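
For item 3.2.10, a minimal sketch of the check-then-install fallback, assuming a yum-based host (the other package managers listed above would follow the same pattern):

  import subprocess

  def ensure_package(pkg):
      # 1. Check whether the package is already installed.
      if subprocess.run(["rpm", "-q", pkg], capture_output=True).returncode == 0:
          return
      # 2-3. yum resolves the package from any configured local repo first,
      # then falls back to remote repositories reachable over the internet.
      subprocess.run(["sudo", "yum", "-y", "install", pkg], check=True)

  ensure_package("nload")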
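
For item 3.2.11, a minimal sketch of building the ephemeral virtual environment and installing the pinned packages, assuming a hypothetical requirements.txt that lists the versions above:

  import subprocess
  import venv

  # Create a throwaway virtual environment with pip available inside it.
  venv.EnvBuilder(with_pip=True).create("assessment_venv")
  subprocess.run(["assessment_venv/bin/pip", "install", "-r", "requirements.txt"], check=True)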
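
For item 3.2.12, a minimal sketch of reading from a PostgreSQL-backed Hive metastore with the bundled SQLAlchemy and pandas; the host, database name, and credentials are hypothetical, and the MySQL and Oracle cases differ only in the connection URL:

  import pandas as pd
  from sqlalchemy import create_engine

  # Hypothetical metastore endpoint; the tool prompts for these values.
  engine = create_engine("postgresql+psycopg2://hive:hive@metastore-host:5432/metastore")

  # TBLS is the standard Hive metastore table that lists all Hive tables.
  tables = pd.read_sql('SELECT "TBL_NAME", "TBL_TYPE" FROM "TBLS"', engine)
  print(tables.head())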
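
For the YARN inputs in 3.3, a minimal sketch of the kind of read-only call the Resource Manager host and port enable, using YARN's standard /ws/v1/cluster/metrics REST endpoint (the address shown is hypothetical):

  import requests

  RM_HOST, RM_PORT = "rm-host.example.com", 8088   # 8088 is the usual RM web port

  # Read-only request against the ResourceManager's cluster metrics endpoint.
  metrics = requests.get(f"http://{RM_HOST}:{RM_PORT}/ws/v1/cluster/metrics").json()
  print(metrics["clusterMetrics"]["activeNodes"],
        metrics["clusterMetrics"]["appsCompleted"])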

4. Installation Steps

  1. Step 1: Create a tarball of this repository, or clone the repo directly, on an edge node that has access to the cluster management services. The tarball can be uploaded in multiple ways, for example with the SCP command between the local machine and the node, or with a tool like WinSCP if the local system is Windows.

  2. Step 2: Go to the Tarball location (the location where it was uploaded in Step 1)

  3. Step 3: Extract the Tarball hadoop-discovery-tool.tar

tar -xvf hadoop-discovery-tool.tar
  4. Step 4: Go to the hadoop-discovery-tool tool directory
cd hadoop-discovery-tool/tool
  5. Step 5: Give execute permission to the scripts
chmod +x build.sh
chmod +x run.sh
chmod +x os_package_installer.py
chmod +x python_package_installer.py
  6. Step 6: Run the first script, build.sh, to build the environment, using the following command
sudo bash build.sh

Step success message: Hadoop Assessment tool deployed successfully

  7. Step 7: Run the second script, run.sh, to launch the Python script, using the command
sudo bash run.sh

Step success criteria: Hadoop Assessment Tool has been successfully completed and the report is available at the following location

  8. Step 8: The following details will be required during execution of the script:
    1. Step 8.1(Conditional step) - SSL: If the tool is unable to automatically detect SSL enabled on the cluster, it would display the following message
      Do you have SSL enabled for your cluster? [y/n]
      1. Step 8.1.1: If you select 'y', continue to Step 8.2 -
         As SSL is enabled, enter the details accordingly
      2. Step 8.1.2: If you select 'n', continue to Step 8.2 -
         As SSL is disabled, enter the details accordingly
    2. Step 8.2 - Cloudera Manager credentials: the prompt asks whether you want to provide the Cloudera Manager credentials; select 'y' or 'n'
      1. Step 8.2.1: If you select 'y', continue to Step 8.2.1.1 -
         A major number of metrics generation would require Cloudera manager credentials Therefore, would you be able to provide your Cloudera Manager credentials? [y/n]: 
        1. Step 8.2.1.1: Enter Cloudera Manager Host IP

          Enter Cloudera Manager Host IP:
        2. Step 8.2.1.2: Cloudera Manager port - the prompt asks whether your Cloudera Manager port is 7180. If it is, select 'y'; else select 'n'

          Is your Cloudera Manager Port number 7180? [y/n]:
          1. Step 8.2.1.2.1: If you select 'y', continue to Step 8.2.1.3
            Is your Cloudera Manager Port number 7180? [y/n]: 
          2. Step 8.2.1.2.2: If you select 'n', continue to Step 8.2.1.2.3
            Is your Cloudera Manager Port number 7180? [y/n]: 
          3. Step 8.2.1.2.3: Since the port number is not 7180, enter your Cloudera Manager Port number
            Enter your Cloudera Manager Port number: 
        3. Step 8.2.1.3: Cloudera Manager username

          Enter Cloudera Manager username:
        4. Step 8.2.1.4: Cloudera Manager password

          Enter Cloudera Manager password:
        5. Step 8.2.1.5: Select the Cluster

          Select the cluster from the list below:
          1] Cluster 1
          2] Cluster 2
           .
           .
          n] Cluster n
          Enter the serial number (1/2/../n) for the selected cluster name:
      2. Step 8.2.2: If you select 'n', continue to Step 8.4
         A major number of metrics generation would require Cloudera manager credentials Therefore, would you be able to provide your Cloudera Manager credentials? [y/n]: 
    3. Step 8.3: Hive Metastore database credentials - this is only prompted if Cloudera Manager credentials were provided in the previous step. The prompt asks whether you want to provide the Hive Metastore database credentials; select 'y' or 'n'
      1. Step 8.3.1: If you select 'y', continue to Step 8.3.1.1
         To view hive-related metrics, would you be able to enter Hive credentials?[y/n]: 
        1. Step 8.3.1.1: Hive Metastore username - the prompt would ask you to enter your Hive Metastore username
           Enter Hive Metastore username: hive
        2. Step 8.3.1.2: Hive Metastore password - the prompt would ask you to enter your Hive Metastore password
            Enter Hive Metastore password:
      2. Step 8.3.2: If you select ‘n’, continue to the next step
         To view hive-related metrics, would you be able to enter Hive credentials?[y/n]: 
    4. Step 8.4 (Conditional step) - YARN Configurations: If the tool is unable to automatically detect YARN configurations, it would prompt you to enter Yarn credentials, you would have to select 'y' or 'n'
      1. Step 8.4.1: If you select 'y', continue to Step 8.4.1.1
         To view yarn-related metrics, would you be able to enter Yarn credentials?[y/n]:
        1. Step 8.4.1.1: Enter Yarn Resource Manager Host IP or Hostname:
           Enter Yarn Resource Manager Host IP or Hostname:
        2. Step 8.4.1.2: Enter Yarn Resource Manager Port:
           Enter Yarn Resource Manager Port:
      2. Step 8.4.2: If you select 'n', continue to Step 8.5
         To view yarn-related metrics, would you be able to enter Yarn credentials?[y/n]:
    5. Step 8.5: Kafka credentials - the prompt asks whether you want to enter your Kafka credentials; select 'y' or 'n' (see the broker reachability sketch after these steps)
      WARNING: The tool does not validate these inputs; if you enter wrong values, it will not prompt you again.
      1. Step 8.5.1: If you select 'y', continue to Step 8.5.1.1
         To view Kafka-related metrics, would you be able to provide Kafka credentials?[y/n]: 
        1. Step 8.5.1.1: Number of brokers in Kafka
           Enter the number of Kafka brokers:
        2. Step 8.5.1.2: Enter the hostname and port number of each broker - iterated for the number of brokers present
          1. Step 8.5.1.2.1: Hostname or IP of Broker n1
            Enter the hostname or IP of broker n1:
          2. Step 8.5.1.2.2: Port number of broker - the prompt asks whether the broker uses port number 9092. If it is 9092, select 'y'; else select 'n'
            Is your broker hosted on <broker_name> have port number 9092? [y/n]:
            1. Step 8.5.1.2.2.1: If you select 'y', continue to Step 8.5.1.2.3
              Is your broker hosted on <broker_name> have port number 9092? [y/n]:
            2. Step 8.5.1.2.2.2: If you select 'n', continue to next step
              Is your broker hosted on <broker_name> have port number 9092? [y/n]:
            3. Step 8.5.1.2.2.3: Since the port number is not 9092, enter a valid port number for the broker
              Please enter the port number of broker hosted on <broker_name>
          3. Step 8.5.1.2.3: Confirm the log directory path of the broker - the prompt asks whether the broker uses a certain log directory path. If the given path is correct, select 'y'; else select 'n'
            1. Step 8.5.1.2.3.1: If you select 'y' and there are more brokers left, the steps from 8.5.1.2 are repeated
              Does the broker hosted on <broker_name> have the following log directory path /var/local/kafka/data/? [y/n]: 
            2. Step 8.5.1.2.3.2: If you select 'n', continue to the next step
              Does the broker hosted on <broker_name> have the following log directory path /var/local/kafka/data/? [y/n]:
            3. Step 8.5.1.2.3.3: Since the log directory path was different, enter the correct path
              Enter the log directory path of broker hosted on <broker_name>:
      2. Step 8.5.2: If you select 'n', continue to Step 8.6
         To view kafka-related metrics, would you be able to enter Kafka credentials?[y/n]: 
    6. Step 8.6: Date range for the assessment report - select one of the options below to set the time period the report covers (see the date-range sketch after these steps)
      Select the time range of the PDF Assessment report from the options below:
      [1] Week: generates the report from today to 7 days prior
      [2] Month: generates the report from today to 30 days prior
      [3] Custom: generates the report for a custom time period
      Enter the serial number [1/2/3] as required:
      1. If you select 1 or 2, the report is generated automatically for the selected range as per the description.
      2. If you select 3, the following prompt appears. Important note: enter the timing details according to the timezone of the node hosting the tool:
        Enter start date: [YYYY-MM-DD HH:MM]
        2021-03-15 00:00
        Enter end date: [YYYY-MM-DD HH:MM]
        2021-03-30 00:00
  9. Step 9: PDF report - a PDF report is generated at the end of a successful execution; it can be downloaded with the same SCP client or WinSCP tool that was used to upload the tar in Step 1.
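
As referenced in Step 8.5, a minimal sketch of a plain TCP reachability check for a broker host/port pair before entering it; the broker address is hypothetical, and this is an illustration rather than the tool's actual validation (the tool does not re-prompt on bad input):

  import socket

  broker_host, broker_port = "broker-1.example.com", 9092   # 9092 is Kafka's default port

  # Attempt a TCP connection; connect_ex returns 0 when the port is open.
  with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
      sock.settimeout(5)
      reachable = sock.connect_ex((broker_host, broker_port)) == 0
  print(f"{broker_host}:{broker_port} reachable: {reachable}")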
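
As referenced in Step 8.6, a minimal sketch of how the three options map to a start/end timestamp pair (an illustration, not the tool's exact code; the custom dates reuse the sample values above):

  from datetime import datetime, timedelta

  end = datetime.now()                   # "today" in the hosting node's timezone
  choice = "3"                           # 1 = week, 2 = month, 3 = custom
  if choice == "1":
      start = end - timedelta(days=7)    # Week: today back to 7 days prior
  elif choice == "2":
      start = end - timedelta(days=30)   # Month: today back to 30 days prior
  else:                                  # Custom: parse "YYYY-MM-DD HH:MM" inputs
      start = datetime.strptime("2021-03-15 00:00", "%Y-%m-%d %H:%M")
      end = datetime.strptime("2021-03-30 00:00", "%Y-%m-%d %H:%M")
  print(start, "->", end)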

FAQ

  • What is this tool?
    The Hadoop assessment tool (https://github.com/GoogleCloudPlatform/hadoop-discovery-tool) is a quick way to get an understanding of your Hadoop cluster topology, workloads, and utilization; it's a combination of scripts that interact with Cloudera Manager / YARN.

  • Are the scripts invasive?
    The scripts are not invasive and make read-only API calls.

  • Why should I run this tool?
    The tool provides critical information that helps everyone understand the current state and where optimizations can be gained during the migration phase.

  • How long will it take to run?
    A Hadoop Administrator can run the tool in less than half a day.

  • What's the output?
    A PDF report will be generated at the end with a summary of insights and recommendations (Hadoop Assessment tool sample report: https://drive.google.com/file/d/1NmVj4uvxUPj5QwHATsb5aKTVgB9vfpZf/view?resourcekey=0-RMr51QdyjWWL81KvzgxGHw).

  • Why does it require Linux packages to be installed?
    The tool installs some Linux packages to generate the PDF report and its visualizations; the package upgrades are not mandatory if the required dependencies are already satisfied, in which case the upgrade command can be commented out.

  • What if I don't want to install this tool on my cluster?
    You could add a temporary edge node, install this discovery tool there, and then remove the edge node once the PDF report is generated.

  • Is dfsadmin access mandatory?
    No, dfsadmin access is optional.

Contributing

We'd love to accept your patches and contributions to this project. For more details on how to contribute, read CONTRIBUTING.md.
