Spark

  • intro
    • fast, general-purpose distributed computing platform
    • Spark makes efficient use of memory and can run some equivalent jobs 10 to 100 times faster than Hadoop's MapReduce
    • Spark's creators abstracted away the fact that one is working with a cluster of machines; instead it feels like working with a collections-based API
  • definition
    • Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters
    • it supports widely used programming languages: Python, Java, Scala, R
    • its libraries cover tasks ranging from SQL to streaming and machine learning
    • runs anywhere from a single laptop to a large cluster
    • easy system to start with, then scale up to big data processing at large scale
  • meaning
    • unified - it supports a wide range of data analytics tasks
      • simple data loading
      • sql queries
      • machine learning and streaming computation
    • computing engine
      • Spark handles loading data from storage systems and performing computation on it
      • it is not a permanent storage system as an end in itself
    • libraries
      • provide a unified API for common data analytics tasks
      • Spark SQL - SQL and structured data
      • MLlib - machine learning
      • Spark Streaming - stream processing
      • GraphX - graph analytics
  • history
    • started at UC Berkeley in 2009
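
The "collections-based API" idea from the intro can be sketched without a cluster. The snippet below is plain Python, not Spark itself: the `partitions` data and the `count_words` helper are made up for illustration, and each inner list stands in for the slice of data one machine would hold. In PySpark, roughly the same word count would be written as a chain of RDD calls such as `flatMap`, `map` and `reduceByKey`.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: each inner list stands in for one partition of a dataset.
partitions = [
    ["spark is fast", "spark is unified"],
    ["spark runs on clusters"],
]


def count_words(lines):
    """Count words within a single partition (the per-machine 'map' side)."""
    return Counter(word for line in lines for word in line.split())


# Each partition is processed independently; on a real cluster this step
# would run in parallel, here it is just a local loop.
partial_counts = [count_words(p) for p in partitions]

# Merge the per-partition results into one total (the 'reduce' side).
word_counts = reduce(lambda a, b: a + b, partial_counts, Counter())

print(word_counts.most_common(2))  # -> [('spark', 3), ('is', 2)]
```

The point of the sketch is only that the calling code reads like ordinary collection processing; how the work is split across partitions and merged is what the engine hides.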

Resources

  • todo