Spark Live by Databricks



I was at Spark Live by Databricks in New York last week. I expected it to be another heavy-on-marketing, light-on-technical meetup, but to my surprise the marketing element was only the first hour; the rest of the day was completely hands-on in the Databricks Community Edition platform.

    Below is what I was able to capture during the training.

Introduction

 The customer success story from EyeView on how they use the Databricks platform for personalized video advertising was cool.

 It seems that from Databricks 4 onwards, support for Google Cloud Platform has been added.

From the Technical part



Surprisingly, Spark can be invoked from
  • C#, F# (via Mobius)
  • JavaScript (via Eclair)
Since the co-founders of Databricks come from the Apache Spark team, Databricks has a slight advantage over other players in the market in terms of including the latest and greatest Apache Spark releases in their stack.

 Roughly 80% of the Databricks documentation content also applies to plain Apache Spark: https://docs.databricks.com/index.html

Key components in Spark Programming 

  • Driver: opens a Spark context (a minimal sketch follows this list)
  • Executor: runs on a worker node and caches data in its storage
  • Cluster Manager: acquires resources (example cluster managers: Spark standalone, YARN, Mesos, and Kubernetes, which is supported from Spark 2.3)
  • Job server: Livy and Spark Job Server are examples of job servers that accept requests from a user's UI/notebook and channel them to spark-submit
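
A rough sketch of how these pieces meet from the application side (my own example, not from the training material; the master URL and numbers are just for illustration): the driver opens a SparkSession/SparkContext, the master URL picks the cluster manager, and the executors run the tasks and hold cached data.

import org.apache.spark.sql.SparkSession

// Driver: opens the SparkSession, which wraps the SparkContext.
// The master URL decides the cluster manager (local, standalone, YARN, Mesos, Kubernetes).
val spark = SparkSession.builder()
  .appName("driver-executor-sketch")
  .master("local[4]")            // 4 local threads stand in for executor cores
  .getOrCreate()

// Work defined on the driver is shipped to executors as tasks;
// cache() asks the executors to keep the partitions in their storage memory.
val nums = spark.range(0, 1000000).cache()
println(nums.count())            // the first action runs a job and fills the cache

spark.stop()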

Key components in a Spark Application

  • Query - most queries will trigger at least 1 job (see the sketch after this list)
  • Jobs - triggered when Spark physically needs to move data; each distinct orderBy will also trigger a job
  • Stage - each job is broken into stages by shuffle boundaries: read some data in, do an easily parallelized task, and feed the result to the next stage. A job with no shuffle has 1 stage.
  • Task - inside each stage, exactly 1 task per partition is created
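
To make the query → job → stage → task chain concrete, a small sketch (again my own example): one action gives one job, the groupBy and orderBy each add a shuffle boundary (hence extra stages), and each stage runs one task per partition.

// $-syntax needs spark.implicits._ (pre-imported in Databricks notebooks)
import spark.implicits._

val df = spark.range(0, 1000000)
  .withColumn("bucket", $"id" % 10)
  .groupBy("bucket").count()     // shuffle boundary #1 -> new stage
  .orderBy($"count".desc)        // shuffle boundary #2 -> new stage
df.show()                        // the action that actually triggers the job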

Tuning:

  • Check if the file is gzip compressed; since gzip is non-splittable it will run in a single thread, so convert it to Parquet (see the sketch after this list)
  • You need to have more partitions than cores
  • As of Spark 2.0, reading from disk (DataSourceScanExec.scala) gives 1 partition per core
  • Parquet: works better with Spark; when you often read and write data from or to disk, use the Parquet format (it is highly compressed and highly splittable)
  • Spark doesn't support indices
  • Datasets: if you have Java code that can't be rewritten in Spark because of its complexity, or the code is not shared with you, Datasets help you call that Java code from Scala
  • Intel OAP - Optimized Analytics Package for Spark Platform (https://github.com/Intel-bigdata/OAP)
  • Spark 2.2 onwards, cost-based optimization (you have to turn it on manually): a way to analyze your data and build histograms; before making joins and filters Spark utilizes the histograms and re-orders the joins (a configuration sketch follows the benchmark numbers below)
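
A hedged sketch for the gzip/partitions points above (all paths are made up, and 200 partitions is just a placeholder; pick a number comfortably above your core count):

// gzip is not splittable, so this read is effectively single-threaded per file.
val raw = spark.read
  .option("header", "true")
  .csv("/data/raw/events.csv.gz")       // hypothetical path

// Repartition so there are more partitions than cores,
// then persist as Parquet, which is compressed and splittable.
raw.repartition(200)
  .write.mode("overwrite")
  .parquet("/data/clean/events")        // hypothetical path

// Later reads scan the Parquet files in parallel.
val events = spark.read.parquet("/data/clean/events")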

    CBO Takeaway: Benchmarking Perf Improvements on TPC-DS

    16 queries show speedup > 30%
    Max speedup is 8x
    Geometric mean of speedup is 2.2x 
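
For reference, a sketch of how the CBO is switched on in stock Spark 2.2+ (the table and column names here are hypothetical):

// Off by default in Spark 2.2; enable CBO and join re-ordering manually.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// The optimizer needs statistics to work with, so compute them first.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")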

 Notebook App:


   It is not very different from the Data Science Workbench or a Zeppelin notebook, but it additionally supports SQL along with Scala, Python and R.


List the files
DBFS - a wrapper around HDFS :-)
%fs ls /databricks-datasets/amazon/
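
The same listing works from a Scala cell too, via the dbutils helper that Databricks notebooks expose (a quick sketch):

// Scala equivalent of the %fs magic
display(dbutils.fs.ls("/databricks-datasets/amazon/"))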



Spark Scala code

val query = "SELECT * FROM parquet.`/databricks-datasets/amazon/data20K` WHERE rating = 1 AND review LIKE '%awesome%'"
spark.sql(query)

Databricks SQL equivalent code

%sql SELECT * FROM parquet.`/databricks-datasets/amazon/data20K` WHERE rating = 1 AND review LIKE '%awesome%'

Basically, Databricks wraps the rudimentary coding for you.

// Here's the earlier query with the DataFrame/Dataset API
// (the 'page / 'requests symbol syntax needs import spark.implicits._, which Databricks notebooks pre-import):

val query = spark.table("pageviews")
  .filter("project = 'en'")
  .filter('page like "%Spark%")
  .orderBy('requests.desc)

display(query)

Databricks Equivalent


%sql SELECT * FROM pageviews WHERE project = 'en' AND page LIKE '%Spark%' ORDER BY requests DESC;

Rich documentation

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
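
For example (my own variation on the pageviews query above, lower-casing the match just to show a couple of functions), most column expressions you need live in that functions object:

import org.apache.spark.sql.functions._   // col, lower, desc, ...

val top = spark.table("pageviews")
  .where(col("project") === "en" && lower(col("page")).contains("spark"))
  .orderBy(desc("requests"))

display(top)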

Structured streaming:

  • Spark Streaming: 90% of streams work well with the simplified approach. If the source is a socket, you need a Kafka-like streaming source in front of Spark to guarantee fault tolerance; other than that, simpler streaming data can be processed with Spark Structured Streaming (a minimal sketch follows this list)
  • Spark can get stream data from HDFS/S3/Kafka/Kinesis
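
A minimal Structured Streaming sketch (paths, schema and trigger interval are all made up for illustration): new files landing in a directory are treated as a stream, a running count per rating is maintained, and the checkpoint location provides the fault-tolerance guarantee; swapping the source for Kafka is mostly a matter of .format("kafka") plus its options.

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

// Hypothetical schema for incoming review files.
val reviewSchema = StructType(Seq(
  StructField("rating", IntegerType),
  StructField("review", StringType)))

// Treat new JSON files landing in a directory (HDFS/S3/DBFS) as a stream.
val reviews = spark.readStream
  .schema(reviewSchema)
  .json("/data/incoming/reviews/")      // hypothetical path

// Running count of reviews per rating.
val counts = reviews.groupBy("rating").count()

// Write the running counts out; the checkpoint makes the query fault tolerant.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/reviews")  // hypothetical path
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()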
       
