Init Hadoop Eco System on DigitalOcean

Briefing
Prepare SSH Key
Buy 4 hosts
Pre configuration
Install CDH (Cloudera Distribution Including Apache Hadoop)
Install Java 8 & Spark 2 (If you only use spark 1, ignore these steps)
For python user (If you only use Scala, you can ignore these steps)
Store a image copy, for secure..
Done !!!

Briefing

If you want to use it in production environment, please aware of the security configuration. The front few steps you can refer to this video: Youtube Video Here we use digital ocean as it’s cheap, if you use any IAAS platform as AWS, ALICLOUD, should be similar.

Prepare SSH Key

Create your ssh key as this instruction by digital ocean
I’m using Mac OS, I throw my ssh key as file ~/.ssh/id_rsa.

Buy 4 hosts

Create digital ocean account
Key in your credit card information.
Go to Main page -> Create -> Droplets -> choose CentOS -> one 8GB & three 2GB -> use the ssh key just created.

** Record down your IP **

65.135.11 hadoop-01.bde.com hadoop-01
189.36.21 hadoop-02.bde.com hadoop-02
189.42.80 hadoop-03.bde.com hadoop-03
189.40.146 hadoop-04.bde.com hadoop-04

Pre configuration

Modify the configuration on these four hosts, also you can modify your machine also.

ssh root@hadoop-01.bde.com
# Config Host
vi /etc/hosts
	# remove the self point ...
	# add the above IP Hostname mapping.
	# After configuration, looks like：
		# The following lines are desirable for IPv4 capable hosts
		127.0.0.1 localhost.localdomain localhost
		127.0.0.1 localhost4.localdomain4 localhost4

		# The following lines are desirable for IPv6 capable hosts
		::1 localhost.localdomain localhost
		::1 localhost6.localdomain6 localhost6

		159.65.135.11 hadoop-01.bde.com hadoop-01
		206.189.36.21 hadoop-02.bde.com hadoop-02
		206.189.42.80 hadoop-03.bde.com hadoop-03
		206.189.40.146 hadoop-04.bde.com hadoop-04

vi /etc/selinux/config
	SELINUX=disabled

reboot
ssh root@hadoop-02.bde.com
...
ssh root@hadoop-03.bde.com
...
ssh root@hadoop-04.bde.com
...
vi /etc/hosts

Install CDH (Cloudera Distribution Including Apache Hadoop)

Install CHD

ssh root@hadoop-01.bde.com
yum install wget
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin 
./cloudera-manager-installer.bin 
# Just click 'accept' 

Configure CDH

Wait for a while, let it start the CDH manager, use your own browser, go to http://hadoop-01.bde.com:7180/
Use ‘admin / admin’ login, a wizard will take you initial the whole CDH cluster.
You can use the default setting for most configuraitons. Some special listed down:

Specify hosts for your CDH cluster installation.
```
hadoop-0[1-4].bde.com
```
Provide Login Credentials Use our ssh key created just now.
Select Services I choose the Core with Spark, you can choose others

After we finish the wizard above
At 7180，click YARN - configuration - search Container Memory Set yarn.nodemanager.resource.memory-mb to ‘1.5GB’ Set yarn.scheduler.maximum-allocation-mb to ‘1.5GB’ Distrubute this configuration, click the button as a file beside the power button.

Let your data scientist / algrithm engineer use the HDFS role !!! Don’t give the root role to them ..

vi /etc/ssh/sshd_config
    # To disable tunneled clear text passwords, change to no here!
	#PasswordAuthentication yes
	PermitEmptyPasswords no
	PasswordAuthentication yes

Try login with hdfs

ssh hdfs@hadoop-01.bde.com
	password: hdfs

Good, we have start up a hadoop cluster !

Install Java 8 & Spark 2 (If you only use spark 1, ignore these steps)

Install Java 8 (Spark2 dependency), Cloudera Official Instruction

1. At 7180, stop Cluster all services
2. At 7180, stop Cloudera Management Service
3. Hadoop-01 terminal： 
	service cloudera-scm-agent stop
	service cloudera-scm-server stop
4. In all four hosts, run below commands:
	yum install wget
	cd /usr/java
	wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz"
	tar xzf jdk-8u171-linux-x64.tar.gz
	rm -rf jdk-8u171-linux-x64.tar.gz
5. Back to Hadoop-01 terminal：
	vi /etc/default/cloudera-scm-server
	Add: **export JAVA_HOME=/usr/java/jdk1.8.0_171**

	Run commands:    
	service cloudera-scm-server start
	service cloudera-scm-agent start
6. Wait for the server start up, at 7180，Hosts -> configure -> Java Home Directory, put /usr/java/jdk1.8.0_171
7. At 7180，start Cluster all services
8. At 7180，start Cloudera Management Service

Install Spark2，Cloudera Offical Link

cd
wget http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.3.0.cloudera2.jar
mv SPARK2_ON_YARN-2.3.0.cloudera2.jar /opt/cloudera/csd
chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.3.0.cloudera2.jar
service cloudera-scm-server restart

Wait for 7180 restart，there is a gift shape icon on the top right corner, click it, you can see spark2 in the list，click **download-Distribute-Active**.
Suggest click configuration and reselect **Validate Parcel Relations**
Restart the whole cluster
At 7180，Add service -> Spark2， depend on YARN。
History server select 'hadoop-01'， gateway select all, countine.
At 7180，stop spark，remove spark，choose None to both selection.
Go to Hadoop-01 terminal
	mkdir /var/log/spark2/lineage
	chmod 777 /var/log/spark2/lineage
	（Seems a new function for 2.3, simple debug here only ...）

Try Spark2

1. Switch to hdfs user
	su hdfs
	cd
2. Put a file into HDFS system
	vi purplecow.txt
	then write something.
	hadoop fs -put purplecow.txt .
3. Try spark2-shell
	export JAVA_HOME=/usr/java/jdk1.8.0_171
	spark2-shell
	>>  sc.textFile("purplecow.txt").count()
4. Try pyspark2
	JAVA_HOME=/usr/java/jdk1.8.0_171 pyspark2
	>> sc.textFile("purplecow.txt").count()
5. Exit
	Control + D
	exit

For python user (If you only use Scala, you can ignore these steps)

Install pyspark + juptyer notebook

python --version
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
python get-pip.py
yum install python-devel
yum groupinstall "Development tools"
pip install jupyter
pip install findspark

Run jupyter

export PYSPARK_PYTHON=/usr/bin/python
export HADOOP_USER_NAME=hdfs
export SPARK_HOME=/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2 
export JAVA_HOME=/usr/java/jdk1.8.0_171 
jupyter notebook --ip=159.65.135.11 --port=80 --allow-root &

Try jupyter,Sample 1, Sample 2

import findspark
findspark.init()
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into HDFS
df.write.csv("example.csv")

# Read from HDFS
df_load = sparkSession.read.csv('example.csv')
df_load.show()

Store a image copy, for secure..

DigitalOcean main page -> Image -> create Snapshot

Done !!!

Take a rest ~