Hybrid Cloud and a sample Spark job launched from on-premise (private cloud) to Azure Databricks (public cloud)
Hybrid cloud allows workloads to move between private and public clouds as computing needs and costs change, giving businesses greater flexibility and more data deployment options.
Let me walk you through a demo of how to launch an Apache Spark job from Python code written on-premise and have it executed in Azure Databricks.
Pre-Requisites:
JDK 1.8 and above
Python 3.5 and above
Any Linux flavor machine
Steps :
To be performed at the Databricks cluster level
Configure the Databricks cluster, token and workspace details. Generate a dedicated token for yourself (Databricks workspace -> Account -> User Settings -> Generate New Token). Copy the token immediately, as it will not be shown again.
Before making a Spark call, ensure the two Spark configs listed below are added in the cluster settings:
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
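Optionally, you can sanity-check the generated token and cluster details from any machine that can reach the workspace by calling the Databricks Clusters REST API. A minimal sketch using the requests library; the host, token and cluster ID below are placeholders for your own values:

import requests

host = "https://eastus2.azuredatabricks.net"   # your workspace URL
token = "dapiXXXXXXXXXXXXXXXX"                 # personal access token generated above
cluster_id = "1234-5678-alpha3569"             # cluster you intend to attach to

# GET /api/2.0/clusters/get returns the cluster definition, including its state
resp = requests.get(
    host + "/api/2.0/clusters/get",
    headers={"Authorization": "Bearer " + token},
    params={"cluster_id": cluster_id},
)
resp.raise_for_status()
print(resp.json()["state"])   # e.g. RUNNING, PENDING or TERMINATED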
To be performed on-premise (private cloud)
Create a dedicated virtual environment backed by Python 3.5 and above
## Pick a Python version that matches your Databricks cluster
conda create -n databricks_venv python=3.6
Install the pre-requisite package
pip install pypandoc
Remove any other pyspark installed
pip uninstall pyspark
Install the databricks-connect package
## Pick the right runtime version that matches your Databricks cluster runtime
pip install -U databricks-connect==5.2.*
Activate the virtual environment and configure dbconnect
Obtain the Databricks cluster details from the workspace URL.
Ex. https://<Shard-endpoint-name>/?o=<workspaceID>#/setting/clusters/<clusterID>/configuration
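If it helps, the Org ID (the ?o= value) and the Cluster ID can be pulled out of that URL programmatically. A small sketch; the URL below is a made-up example in the same shape as above:

from urllib.parse import urlparse, parse_qs

# Placeholder workspace URL in the same shape as the example above
url = "https://eastus2.azuredatabricks.net/?o=1234567890#/setting/clusters/1234-5678-alpha3569/configuration"

parsed = urlparse(url)
org_id = parse_qs(parsed.query)["o"][0]                            # the ?o= value is the Org ID
cluster_id = parsed.fragment.split("/clusters/")[1].split("/")[0]  # segment after /clusters/
print("Org ID:", org_id, "Cluster ID:", cluster_id)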
(databricks_venv) [user@host ~]$ databricks-connect configure
Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://eastus2.azuredatabricks.net
Databricks Token [no current value]: dapie452d7ed0
IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set
- (Azure-only) the Spark conf `spark.databricks.service.port 8787` set
Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 1234-5678-alpha3569
Org ID (Azure-only, see ?o=orgId in URL) [0]: <your_org_id, obtain this from the Databricks URL>
Port (Use 15001 for AWS, 8787 for Azure) [15001]: 8787
Updated configuration in /home/suresh/.databricks-connect
* Spark jar dir: /home/suresh/anaconda/envs/test_env/lib/python3.5/site-packages/pyspark/jars
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity
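databricks-connect stores these answers in a small JSON file (/home/<user>/.databricks-connect, as the last lines above show). If you want to double-check what was written without re-running the wizard, a minimal sketch assuming that default location:

import json, os

# ~/.databricks-connect is where `databricks-connect configure` saved the settings above
with open(os.path.expanduser("~/.databricks-connect")) as f:
    cfg = json.load(f)

# Print everything except the token so it does not leak into logs
for key, value in cfg.items():
    if key != "token":
        print(key, "=", value)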
Test the connectivity
(databricks_venv) [user@host~]$ databricks-connect test
* PySpark is installed at /home/suresh/anaconda/envs/databricks_venv/lib/python3.6/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
* Testing scala command
19/05/07 17:14:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:14:40 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:14:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1557263680800).
Spark session available as 'spark'.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
res0: Long = 4950
scala> :quit
* Testing python command
19/05/07 17:15:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 17:15:00 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 17:15:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/init_scripts/', name='init_scripts/', size=0), FileInfo(path='dbfs:/init_scripts_jlee/', name='init_scripts_jlee/', size=0), FileInfo(path='dbfs:/init_scripts_jlee>/', name='init_scripts_jlee>/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0)]
* All tests passed.
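The same connectivity check can be done from plain Python: with databricks-connect configured, SparkSession.builder.getOrCreate() attaches to the remote Databricks cluster instead of starting a local Spark. A minimal sketch:

from pyspark.sql import SparkSession

# This session is backed by the Azure Databricks cluster configured above
spark = SparkSession.builder.getOrCreate()

# Sum of 0..99 computed on the cluster; should print 4950, matching the Scala test above
print(spark.range(100).groupBy().sum("id").collect()[0][0])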
Run a sample Spark SQL query against a Databricks table
(databricks_venv) [user@host ~]$ pyspark
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
19/05/07 14:06:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/07 14:06:29 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/05/07 14:06:29 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
19/05/07 14:06:30 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
19/05/07 14:06:39 WARN HTTPClient: Setting proxy configuration for HTTP client based on env var HTTP_PROXY=http://proxy.hedani.net:8080
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 3.6.2 (default, Jul 20 2017 13:51:32)
SparkSession available as 'spark'.
>>> spark.sql("show databases").show()
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
+------------+
|databaseName|
+------------+
| default|
+------------+
>>> spark.sql("show tables").show(200,False);
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|test    |emp      |false      |
|test    |dept     |false      |
+--------+---------+-----------+
>>> spark.sql("select * from emp limit 20").show();
[Stage 33:=======================================> (35 + 8) / 50]19/05/07 14:15:56 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
View job details at https://eastus2.azuredatabricks.net/?o=<org_id>#/setting/clusters/<cluster_id_141642-fate625/sparkUi
[Output trimmed: a single very wide row is returned; the column headers were garbled and are too wide to reproduce legibly here.]
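These queries can also be wrapped into a standalone script and launched from the on-prem machine like any other Python job. A minimal sketch; sample_job.py is just an illustrative name, and emp is the demo table queried above:

from pyspark.sql import SparkSession

# With databricks-connect configured, this executes on the Azure Databricks cluster,
# not on the on-prem machine that launches it.
spark = SparkSession.builder.appName("onprem-to-databricks-demo").getOrCreate()

spark.sql("show databases").show()

# Pull a small sample from the demo table queried interactively above
spark.sql("select * from emp limit 20").show(truncate=False)

Run it from inside the activated virtual environment (e.g. python sample_job.py) and the job shows up in the cluster's Spark UI just like the interactive examples above.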
Remember:
Every time you launch pyspark in the dedicated virtual environment you created for Databricks work, ensure it points to the Databricks cluster and that the Python version matches the virtual environment's default Python version.
When multiple Spark clusters are configured via Python packages, unset the PYSPARK_PYTHON environment variable to avoid conflicts (e.g. $ unset PYSPARK_PYTHON).