Hybrid Cloud and a Sample Spark Job Launched from On-Premise (Private Cloud) to Azure Databricks (Public Cloud)








Hybrid cloud allows workloads to move between private and public clouds as computing needs and costs change, giving businesses greater flexibility and more data deployment options.

Let me walk you through a demo of how one can launch an Apache Spark job from Python code written on-prem and have it executed in Azure Databricks.

Pre-Requisites:

JDK 1.8 or above
Python 3.5 or above
Any Linux flavor machine

Steps :


To be performed at the Databricks cluster level
Configure the Databricks cluster, token, and workspace details. Generate a dedicated token for yourself (Databricks workspace -> Account -> User Settings -> Generate New Token). Copy the token immediately, as it will not be shown again.
Before making a Spark call, ensure the two Spark configs listed below are added in the cluster settings:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787

To be performed on-premise or in the private cloud
Create a dedicated virtual environment backed by Python 3.5 or above
# Pin a Python version (3.5+) that matches your cluster's Python
conda create -n databricks_venv python=3.5

Install the pre-requisite package
pip install pypandoc

Remove any previously installed PySpark
pip uninstall pyspark

Install the databricks-connect package
# Pick the runtime version that matches your Databricks cluster runtime
pip install -U databricks-connect==5.2.*


Activate the virtual environment and configure databricks-connect.
Obtain the Databricks cluster details from the workspace URL.

Ex. https://<Shard-endpoint-name>/?o=<workspaceID>#/setting/clusters/<clusterID>/configuration
(databricks_venv) [user@host ~]$ databricks-connect configure
Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://eastus2.azuredatabricks.net
Databricks Token [no current value]: dapie452d7ed0
IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set
- (Azure-only) the Spark conf `spark.databricks.service.port 8787` set
Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 1234-5678-alpha3569
Org ID (Azure-only, see ?o=orgId in URL) [0]: <your_org_id (obtain this from the Databricks URL)>
Port (Use 15001 for AWS, 8787 for Azure) [15001]: 8787
Updated configuration in /home/suresh/.databricks-connect
* Spark jar dir: /home/suresh/anaconda/envs/test_env/lib/python3.5/site-packages/pyspark/jars
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity
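
To verify what was stored, you can read the config file back from Python. A minimal sketch, assuming the default JSON layout databricks-connect writes to ~/.databricks-connect (check against your own file):

import json
import os

# `databricks-connect configure` stores its answers as JSON in
# ~/.databricks-connect (keys such as host, token, cluster_id,
# org_id, port -- layout assumed here, verify on your machine).
cfg_path = os.path.expanduser("~/.databricks-connect")
with open(cfg_path) as f:
    cfg = json.load(f)

# Never print the token itself; just confirm which cluster we point at.
print("host      :", cfg.get("host"))
print("cluster_id:", cfg.get("cluster_id"))
print("port      :", cfg.get("port"))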



Test the connectivity
(databricks_venv) [user@host~]$ databricks-connect test
* PySpark is installed at /home/suresh/anaconda/envs/databricks_venv/lib/python3.6/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
* Testing scala command
19/05/07 17:14:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:14:40 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:14:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1557263680800).
Spark session available as 'spark'.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id>/sparkUi
res0: Long = 4950

scala> :quit

* Testing python command
19/05/07 17:15:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:15:00 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:15:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id>/sparkUi
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/init_scripts/', name='init_scripts/', size=0), FileInfo(path='dbfs:/init_scripts_jlee/', name='init_scripts_jlee/', size=0), FileInfo(path='dbfs:/init_scripts_jlee>/', name='init_scripts_jlee>/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0)]
* All tests passed.
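
The same dbutils.fs listing can be done from your own Python code on-prem. A minimal sketch, assuming the pyspark.dbutils module that ships with databricks-connect:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# getOrCreate() reads ~/.databricks-connect, so this session is backed
# by the remote Azure Databricks cluster, not a local Spark.
spark = SparkSession.builder.getOrCreate()

# DBUtils exposes the notebook-style dbutils.fs API to local code.
dbutils = DBUtils(spark)
for entry in dbutils.fs.ls("dbfs:/"):
    print(entry.path, entry.size)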



Run a sample Spark SQL query against a Databricks table
(databricks_venv) [user@host ~]$ pyspark
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
19/05/07 14:06:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 14:06:29 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
19/05/07 14:06:30 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
19/05/07 14:06:39 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.2 (default, Jul 20 2017 13:51:32)
SparkSession available as 'spark'.
>>> spark.sql("show databases").show()
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id>/sparkUi
+------------+
|databaseName|
+------------+
|     default|
+------------+

>>> spark.sql("show tables").show(200,False);
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id>/sparkUi
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|test    |emp      |false      |
|test    |dept     |false      |
+--------+---------+-----------+

>>> spark.sql("select * from emp limit 20").show();
[Stage 33:=======================================>                (35 + 8) / 50]19/05/07 14:15:56 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id>/sparkUi
[Stage 37:=========>                                               (8 + 8) / 50]
(28-column header row truncated in this capture)
+----------------------------+-----------------------+---------------------------+------------+----------------------+-------------+--------------------+--------------------+--------------------+---------------------+-----------------+--------------------+----------------------+----------------------+------------------------+-----------------------+--------------------+------------+------------+--------------------------------+-----------------------+------------------------------+------------------------------+------------------------------+-------------------------------+---------------------------+----------------------------+---------------------------------+
|            000ea5f818eb46c6|                      1|                         CT|    20190312|                    -1|           -1|                  -1|                  -1|                  -1|                   -1|               -1|West Village Tire...|          [Automotive]|         [Car Dealers]|                      []|                     []|West Village Tire...|        null|       06093|                               1|                      1|                             1|                             1|                             1|                              1|                          1|                           Y|                                M|
+----------------------------+-----------------------+---------------------------+------------+----------------------+-------------+--------------------+--------------------+--------------------+---------------------+-----------------+--------------------+----------------------+----------------------+------------------------+-----------------------+--------------------+------------+------------+--------------------------------+-----------------------+------------------------------+------------------------------+------------------------------+-------------------------------+---------------------------+----------------------------+---------------------------------+
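
The interactive shell is handy for testing, but the real goal is a plain .py file on-prem whose Spark work runs on the Databricks cluster. A minimal sketch of such a job (the test.emp table name is carried over from the example above and may differ in your workspace):

# sample_job.py -- run with `python sample_job.py` inside databricks_venv
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# databricks-connect routes this session to the Azure Databricks cluster
# configured in ~/.databricks-connect.
spark = SparkSession.builder.getOrCreate()

# A simple distributed computation, executed on the remote cluster:
# the sum of 0..99, mirroring the scala test above (expected: 4950).
spark.range(100).agg(F.sum("id").alias("total")).show()

# Query the Databricks table from the earlier example.
spark.sql("select * from test.emp limit 20").show()

spark.stop()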


Remember: 

Every time you launch pyspark in the dedicated virtual environment you created for Databricks work, ensure it points to the Databricks cluster and that the Python version matches the virtual environment's default Python version.

When multiple Spark clusters are configured via Python packages, unset the environment variable PYSPARK_PYTHON to avoid conflicts (e.g. $ unset PYSPARK_PYTHON).


