Spark Live by Databricks
I was at SparkLive by Databricks in New York last week. I thought it would be another more-marketing, less-technical meetup, but to my surprise the marketing element was just the first hour; the rest of the day was completely hands-on in the Databricks Community Edition platform.
Below is what I was able to capture during the training.
Introduction
The customer success story from EyeView on how they use the Databricks platform for personalized video advertising was cool. It seems that from Databricks 4 onwards, support for the Google Cloud Platform has been added.
From the Technical part
Surprisingly Spark can be invoked from
- C#, F# (via Mobius)
- JavaScript (via EclairJS)
Many vendors also pull Apache Spark releases into their own stacks, etc.
Roughly 80% of the Databricks documentation content applies to Apache Spark as well: https://docs.databricks.com/index.html
Key components in Spark Programming
- Driver: opens a SparkContext
- Executor: runs on a worker node and caches data in its storage
- Cluster Manager: acquires resources (some example cluster managers are Spark standalone, YARN, Mesos, and Kubernetes, supported from Spark 2.3)
- Job server: Livy and Spark JobServer are job servers that accept requests from the user (UI/notebook) and channel them to spark-submit
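A minimal Scala sketch of how those pieces show up in code (the app name and workload are illustrative; the master URL is what selects the cluster manager):

```scala
import org.apache.spark.sql.SparkSession

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // The driver process starts here: it creates the SparkSession
    // (which wraps the SparkContext) and plans all the work.
    val spark = SparkSession.builder()
      .appName("minimal-app")
      // The master URL picks the cluster manager:
      // "local[*]" (in-process), "spark://host:7077" (standalone),
      // "yarn", "mesos://host:5050", or "k8s://https://host:443".
      .master("local[*]")
      .getOrCreate()

    // Executors on worker nodes run the tasks and can cache partitions.
    val df = spark.range(1000000).toDF("id")
    df.cache()
    println(df.count())

    spark.stop()
  }
}
```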
Key components in a Spark Application
- Query: most queries will trigger at least 1 job
- Jobs: triggered when Spark physically needs to move data; each distinct orderBy will also trigger a job
- Stage: each job is broken into stages at shuffle boundaries, reading some data in, doing an easily parallelized task, and feeding the next stage. A job with no shuffle will have 1 stage.
- Task: inside each stage, exactly 1 task per partition is created
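A sketch of how those layers map onto a concrete query, assuming a Databricks notebook where `spark` already exists (the dataset is the one used later in this post):

```scala
// One query over the Amazon reviews dataset.
val reviews = spark.read.parquet("/databricks-datasets/amazon/data20K")

// filter is a narrow transformation: no data movement,
// so it stays in the same stage as the read.
val bad = reviews.filter("rating = 1")

// orderBy forces a shuffle, so the job is split into (at least) two stages:
// stage 1 reads and filters partitions, stage 2 sorts the shuffled data.
val sorted = bad.orderBy("review")

// The action triggers the job; each stage runs one task per partition.
sorted.count()
```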
Tuning:
- Check whether the file is gzip-compressed; since gzip is non-splittable it will be read in a single thread, so convert it to Parquet
- You need to have more partitions than cores
- As of Spark 2.0, reading from disk (DataSourceScanExec.scala) gives 1 partition per core
- Parquet works better with Spark: when you often read and write data from or to disk, use the Parquet format (it's highly compressed and highly splittable)
- Spark doesn't support indices
- Datasets: if you have Java code that can't be rewritten in Spark because of its complexity, or the code isn't shared with you, Datasets help you achieve calling that Java code from Scala
- Intel OAP: Optimized Analytics Package for Spark Platform (https://github.com/Intel-bigdata/OAP)
- Spark 2.2 onwards (turn it on manually): cost-based optimization is a way to analyze your data and build histograms; before making joins and filters it utilizes those histograms, and Spark re-orders the joins
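The gzip point above can be illustrated outside Spark: a gzip stream has a single header at the start, so a reader dropped at an arbitrary byte offset (as a second task would be) cannot decode anything. A minimal Python sketch:

```python
import gzip
import io
import zlib

# Build a gzip "file" in memory.
data = b"some,row,of,data\n" * 50_000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)
gz = buf.getvalue()

# Reading sequentially from the start works: one reader, one thread.
assert gzip.decompress(gz) == data

# Starting mid-file fails: there is no gzip header at that offset,
# which is why Spark cannot split one gzip file across tasks.
try:
    zlib.decompress(gz[len(gz) // 2:], wbits=zlib.MAX_WBITS | 16)
    splittable = True
except zlib.error:
    splittable = False
print(splittable)
```

Parquet (and plain text, bzip2, LZ4, etc.) avoids this by having block boundaries a reader can seek to, which is what makes it splittable across tasks.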
CBO Takeaway: Benchmarking Perf Improvements on TPC-DS
- 16 queries show speedup > 30%
- Max speedup is 8x
- Geometric mean of speedup is 2.2x
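A sketch of turning CBO on in Spark 2.2+ from a notebook (the table and column names are placeholders):

```scala
// CBO is off by default; enable it for the session.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table- and column-level statistics so the optimizer can
// estimate selectivity before reordering joins and pushing filters.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```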
Notebook App:
It is no different from a Data Science Workbench or Zeppelin notebook, but it additionally supports SQL along with Scala, Python, and R.
List the files
DBFS: a wrapper around HDFS :-)
%fs ls /databricks-datasets/amazon/
Spark-scala code
val query = "SELECT * FROM parquet.`/databricks-datasets/amazon/data20K` WHERE rating = 1 AND review LIKE '%awesome%'"
spark.sql(query)
Databricks SQL equivalent code
%sql SELECT * FROM parquet.`/databricks-datasets/amazon/data20K` WHERE rating = 1 AND review LIKE '%awesome%'
Here's the earlier query with the DataFrame/Dataset API:
val query = spark.table("pageviews").filter("project = 'en'").filter('page like "%Spark%").orderBy('requests desc)
display(query)
Databricks Equivalent
%sql SELECT * FROM pageviews WHERE project = 'en' AND page LIKE '%Spark%' ORDER BY requests DESC;
Rich documentation
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Structured streaming:
- Spark Streaming: 90% of streams work well with the simplified approach. If the source is a socket, you need a Kafka-like streaming source to guarantee a fault-tolerant feed into Spark; other than that, simpler streaming data can be processed with Spark Structured Streaming directly.
- Spark Can get stream data from HDFS/S3/Kafka/Kinesis
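A sketch of a structured stream reading from Kafka (the broker, topic, and paths are placeholders):

```scala
// Read a stream from Kafka; the same DataFrame API applies to streams.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka rows carry binary key/value columns; cast them to work with strings.
val events = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write out with a checkpoint location so the stream recovers after failures.
val query = events.writeStream
  .format("parquet")
  .option("path", "/tmp/events")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .start()
```

The checkpoint directory is what gives the fault-tolerance guarantee mentioned above: offsets and state are persisted there so a restarted query resumes where it left off.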