Hybrid Cloud and a sample Spark job launched from on-premise (private cloud) to Azure Databricks (public cloud)
Hybrid cloud allows workloads to move between private and public clouds as computing needs and costs change, giving businesses greater flexibility and more data deployment options.
Let me walk you through a demo of how to launch an Apache Spark job from Python code written on-premise and have it executed in Azure Databricks.
Pre-Requisites:
JDK 1.8 and above
Python 3.5 and above
Any Linux flavor machine
Steps :
To be performed at the Databricks cluster level
Configure the Databricks cluster, token and workspace details. Generate a dedicated token for yourself (Databricks workspace -> Account -> User Settings -> Generate New Token). Copy the token immediately, as it will not be shown again.
Before making a Spark call, ensure the two Spark configs listed below are added in the cluster settings:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
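Optionally, you can sanity-check the generated token and cluster details from any machine that can reach the workspace by calling the Databricks Clusters REST API. A minimal sketch using the requests library; the host, token and cluster ID below are placeholders for your own values:

import requests

host = "https://eastus2.azuredatabricks.net"   # your workspace URL
token = "dapiXXXXXXXXXXXXXXXX"                 # personal access token generated above
cluster_id = "1234-5678-alpha3569"             # cluster you intend to attach to

# GET /api/2.0/clusters/get returns the cluster definition, including its state
resp = requests.get(
    host + "/api/2.0/clusters/get",
    headers={"Authorization": "Bearer " + token},
    params={"cluster_id": cluster_id},
)
resp.raise_for_status()
print(resp.json()["state"])   # e.g. RUNNING, PENDING or TERMINATED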
To be performed on-premise (private cloud)
Create a dedicated virtual environment backed by Python 3.5 and above
## Pick a Python version that matches your Databricks cluster
conda create -n databricks_venv python=3.6
Install the pre-requisite package
pip install pypandoc
Remove any other pyspark installed
pip uninstall pyspark
Install the databricks-connect package
## Pick the right runtime version that matches your Databricks cluster runtime
pip install -U databricks-connect==5.2.*
Activate the virtual environment and configure dbconnect
Obtain the Databricks cluster details from the workspace URL.
Ex. https://<Shard-endpoint-name>/?o=<workspaceID>#/setting/clusters/<clusterID>/configuration
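If it helps, the Org ID (the ?o= value) and the Cluster ID can be pulled out of that URL programmatically. A small sketch; the URL below is a made-up example in the same shape as above:

from urllib.parse import urlparse, parse_qs

# Placeholder workspace URL in the same shape as the example above
url = "https://eastus2.azuredatabricks.net/?o=1234567890#/setting/clusters/1234-5678-alpha3569/configuration"

parsed = urlparse(url)
org_id = parse_qs(parsed.query)["o"][0]                            # the ?o= value is the Org ID
cluster_id = parsed.fragment.split("/clusters/")[1].split("/")[0]  # segment after /clusters/
print("Org ID:", org_id, "Cluster ID:", cluster_id)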
(databricks_venv) [user@host ~]$ databricks-connect configure
Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://eastus2.azuredatabricks.net
Databricks Token [no current value]: dapie452d7ed0
IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set
- (Azure-only) the Spark conf `spark.databricks.service.port 8787` set
Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 1234-5678-alpha3569
Org ID (Azure-only, see ?o=orgId in URL) [0]: <your_org_id, obtain this from the Databricks URL>
Port (Use 15001 for AWS, 8787 for Azure) [15001]: 8787
Updated configuration in /home/suresh/.databricks-connect
* Spark jar dir: /home/suresh/anaconda/envs/test_env/lib/python3.5/site-packages/pyspark/jars
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity
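databricks-connect stores these answers in a small JSON file (/home/<user>/.databricks-connect, as the last lines above show). If you want to double-check what was written without re-running the wizard, a minimal sketch assuming that default location:

import json, os

# ~/.databricks-connect is where `databricks-connect configure` saved the settings above
with open(os.path.expanduser("~/.databricks-connect")) as f:
    cfg = json.load(f)

# Print everything except the token so it does not leak into logs
for key, value in cfg.items():
    if key != "token":
        print(key, "=", value)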
Test the connectivity
(databricks_venv) [user@host~]$ databricks-connect test
* PySpark is installed at /home/suresh/anaconda/envs/databricks_venv/lib/python3.6/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
* Testing scala command
19/05/07 17:14:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:14:40 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:14:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1557263680800).
Spark session available as 'spark'.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
res0: Long = 4950
scala> :quit
* Testing python command
19/05/07 17:15:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:15:00 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:15:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/init_scripts/', name='init_scripts/', size=0), FileInfo(path='dbfs:/init_scripts_jlee/', name='init_scripts_jlee/', size=0), FileInfo(path='dbfs:/init_scripts_jlee>/', name='init_scripts_jlee>/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0)]
* All tests passed.
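The same connectivity check can be done from plain Python: with databricks-connect configured, SparkSession.builder.getOrCreate() attaches to the remote Databricks cluster instead of starting a local Spark. A minimal sketch:

from pyspark.sql import SparkSession

# This session is backed by the Azure Databricks cluster configured above
spark = SparkSession.builder.getOrCreate()

# Sum of 0..99 computed on the cluster; should print 4950, matching the Scala test above
print(spark.range(100).groupBy().sum("id").collect()[0][0])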
Run a sample Spark SQL query against a Databricks table
(databricks_venv) [user@host ~]$ pyspark
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
19/05/07 14:06:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 14:06:29 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
19/05/07 14:06:30 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
19/05/07 14:06:39 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 3.6.2 (default, Jul 20 2017 13:51:32)
SparkSession available as 'spark'.
>>> spark.sql("show databases").show()
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
+------------+
|databaseName|
+------------+
| default|
+------------+
>>> spark.sql("show tables").show(200,False);
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|test    |emp      |false      |
|test    |dept     |false      |
+--------+---------+-----------+
>>> spark.sql("select * from emp limit 20").show();
[Stage 33:=======================================> (35 + 8) / 50]19/05/07 14:15:56 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
[Output trimmed: a single very wide row is returned; the column headers were garbled and are too wide to reproduce legibly here.]
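These queries can also be wrapped into a standalone script and launched from the on-prem machine like any other Python job. A minimal sketch; sample_job.py is just an illustrative name, and emp is the demo table queried above:

from pyspark.sql import SparkSession

# With databricks-connect configured, this executes on the Azure Databricks cluster,
# not on the on-prem machine that launches it.
spark = SparkSession.builder.appName("onprem-to-databricks-demo").getOrCreate()

spark.sql("show databases").show()

# Pull a small sample from the demo table queried interactively above
spark.sql("select * from emp limit 20").show(truncate=False)

Run it from inside the activated virtual environment (e.g. python sample_job.py) and the job shows up in the cluster's Spark UI just like the interactive examples above.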
Remember:
Every time you launch pyspark in the dedicated virtual environment you created for Databricks work, ensure it points to the Databricks cluster and that the Python version matches the virtual environment's default Python version.
When multiple Spark clusters are configured via Python packages, unset the PYSPARK_PYTHON environment variable to avoid conflicts (e.g. $ unset PYSPARK_PYTHON).