Briefing

If you want to use it in production environment, please aware of the security configuration. The front few steps you can refer to this video: Youtube Video Here we use digital ocean as it’s cheap, if you use any IAAS platform as AWS, ALICLOUD, should be similar.

Prepare SSH Key

Buy 4 hosts

  • Create digital ocean account
  • Key in your credit card information.
  • Go to Main page -> Create -> Droplets -> choose CentOS -> one 8GB & three 2GB -> use the ssh key just created.
    servers

** Record down your IP **

159.65.135.11 hadoop-01.bde.com hadoop-01
206.189.36.21 hadoop-02.bde.com hadoop-02
206.189.42.80 hadoop-03.bde.com hadoop-03
206.189.40.146 hadoop-04.bde.com hadoop-04

Pre configuration

Modify the configuration on these four hosts, also you can modify your machine also.

ssh root@hadoop-01.bde.com
# Config Host
vi /etc/hosts
	# remove the self point ...
	# add the above IP Hostname mapping.
	# After configuration, looks like:
		# The following lines are desirable for IPv4 capable hosts
		127.0.0.1 localhost.localdomain localhost
		127.0.0.1 localhost4.localdomain4 localhost4

		# The following lines are desirable for IPv6 capable hosts
		::1 localhost.localdomain localhost
		::1 localhost6.localdomain6 localhost6

		159.65.135.11 hadoop-01.bde.com hadoop-01
		206.189.36.21 hadoop-02.bde.com hadoop-02
		206.189.42.80 hadoop-03.bde.com hadoop-03
		206.189.40.146 hadoop-04.bde.com hadoop-04

vi /etc/selinux/config
	SELINUX=disabled

reboot
ssh root@hadoop-02.bde.com
...
ssh root@hadoop-03.bde.com
...
ssh root@hadoop-04.bde.com
...
vi /etc/hosts

Install CDH (Cloudera Distribution Including Apache Hadoop)

Install CHD

ssh root@hadoop-01.bde.com
yum install wget
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin 
./cloudera-manager-installer.bin 
# Just click 'accept' 

Configure CDH

Wait for a while, let it start the CDH manager, use your own browser, go to http://hadoop-01.bde.com:7180/
Use ‘admin / admin’ login, a wizard will take you initial the whole CDH cluster.
You can use the default setting for most configuraitons. Some special listed down:

  • Specify hosts for your CDH cluster installation.
    hadoop-0[1-4].bde.com
    
  • Provide Login Credentials Use our ssh key created just now.

  • Select Services I choose the Core with Spark, you can choose others

After we finish the wizard above
At 7180,click YARN - configuration - search Container Memory Set yarn.nodemanager.resource.memory-mb to ‘1.5GB’ Set yarn.scheduler.maximum-allocation-mb to ‘1.5GB’ Distrubute this configuration, click the button as a file beside the power button.

Enable password login

Let your data scientist / algrithm engineer use the HDFS role !!! Don’t give the root role to them ..

vi /etc/ssh/sshd_config
    # To disable tunneled clear text passwords, change to no here!
	#PasswordAuthentication yes
	PermitEmptyPasswords no
	PasswordAuthentication yes

Try login with hdfs

ssh hdfs@hadoop-01.bde.com
	password: hdfs

Good, we have start up a hadoop cluster !

Install Java 8 & Spark 2 (If you only use spark 1, ignore these steps)

Install Java 8 (Spark2 dependency), Cloudera Official Instruction

1. At 7180, stop Cluster all services
2. At 7180, stop Cloudera Management Service
3. Hadoop-01 terminal: 
	service cloudera-scm-agent stop
	service cloudera-scm-server stop
4. In all four hosts, run below commands:
	yum install wget
	cd /usr/java
	wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz"
	tar xzf jdk-8u171-linux-x64.tar.gz
	rm -rf jdk-8u171-linux-x64.tar.gz
5. Back to Hadoop-01 terminal:
	vi /etc/default/cloudera-scm-server
	Add: **export JAVA_HOME=/usr/java/jdk1.8.0_171**

	Run commands:    
	service cloudera-scm-server start
	service cloudera-scm-agent start
6. Wait for the server start up, at 7180,Hosts -> configure -> Java Home Directory, put /usr/java/jdk1.8.0_171
7. At 7180,start Cluster all services
8. At 7180,start Cloudera Management Service
cd
wget http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.3.0.cloudera2.jar
mv SPARK2_ON_YARN-2.3.0.cloudera2.jar /opt/cloudera/csd
chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.3.0.cloudera2.jar
service cloudera-scm-server restart

Wait for 7180 restart,there is a gift shape icon on the top right corner, click it, you can see spark2 in the list,click **download-Distribute-Active**.
Suggest click configuration and reselect **Validate Parcel Relations**
Restart the whole cluster
At 7180,Add service -> Spark2, depend on YARN。
History server select 'hadoop-01', gateway select all, countine.
At 7180,stop spark,remove spark,choose None to both selection.
Go to Hadoop-01 terminal
	mkdir /var/log/spark2/lineage
	chmod 777 /var/log/spark2/lineage
	(Seems a new function for 2.3, simple debug here only ...)

Try Spark2

1. Switch to hdfs user
	su hdfs
	cd
2. Put a file into HDFS system
	vi purplecow.txt
	then write something.
	hadoop fs -put purplecow.txt .
3. Try spark2-shell
	export JAVA_HOME=/usr/java/jdk1.8.0_171
	spark2-shell
	>>  sc.textFile("purplecow.txt").count()
4. Try pyspark2
	JAVA_HOME=/usr/java/jdk1.8.0_171 pyspark2
	>> sc.textFile("purplecow.txt").count()
5. Exit
	Control + D
	exit

For python user (If you only use Scala, you can ignore these steps)

Install pyspark + juptyer notebook

python --version
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
python get-pip.py
yum install python-devel
yum groupinstall "Development tools"
pip install jupyter
pip install findspark

Run jupyter

export PYSPARK_PYTHON=/usr/bin/python
export HADOOP_USER_NAME=hdfs
export SPARK_HOME=/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2 
export JAVA_HOME=/usr/java/jdk1.8.0_171 
jupyter notebook --ip=159.65.135.11 --port=80 --allow-root &

Try jupyter,Sample 1, Sample 2

import findspark
findspark.init()
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into HDFS
df.write.csv("example.csv")

# Read from HDFS
df_load = sparkSession.read.csv('example.csv')
df_load.show()

Store a image copy, for secure..

DigitalOcean main page -> Image -> create Snapshot

Done !!!

Take a rest ~