Spark

intro
- fast, general-purpose distributed computing platform
- Spark makes efficient use of memory and can run equivalent jobs up to 10 to 100 times faster than Hadoop's MapReduce, with the largest gains on in-memory workloads
- Spark's creators abstracted away the fact that one is working with a cluster of machines; instead, it feels like working with a collections-based API
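
A toy sketch of the collections-style abstraction described above, in plain Python and not Spark itself. `ToyDistributedList` and its methods are hypothetical names invented for this note; the inner lists stand in for data on separate machines and threads stand in for workers. The point is only that the caller chains `map`, `filter`, and `reduce` as on an ordinary collection and never handles partitions directly.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


class ToyDistributedList:
    """Hypothetical stand-in for a distributed collection (not Spark's API)."""

    def __init__(self, partitions):
        # Each inner list plays the role of the data held on one machine.
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions=4):
        items = list(items)
        # Round-robin split of the items across the partitions.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def _per_partition(self, fn):
        # Apply fn to every partition concurrently; threads stand in for workers.
        workers = max(1, len(self.partitions))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fn, self.partitions))

    def map(self, fn):
        return ToyDistributedList(
            self._per_partition(lambda part: [fn(x) for x in part])
        )

    def filter(self, pred):
        return ToyDistributedList(
            self._per_partition(lambda part: [x for x in part if pred(x)])
        )

    def reduce(self, fn):
        # Reduce each non-empty partition, then combine the partial results.
        partials = [functools.reduce(fn, part) for part in self.partitions if part]
        return functools.reduce(fn, partials)


# The caller only sees collection-style operations, never the partitions.
numbers = ToyDistributedList.from_items(range(1, 11), num_partitions=3)
total = (
    numbers.map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .reduce(lambda a, b: a + b)
)
print(total)  # prints 220 (4 + 16 + 36 + 64 + 100)
```

The reducing function has to be associative and commutative in this sketch, because items are combined per partition first and the round-robin split does not preserve the original order.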
definition
- Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters
- it supports widely used programming languages: Python, Java, Scala, R
- its libraries cover tasks ranging from SQL to streaming and machine learning
- it runs anywhere from a laptop to a large cluster of machines
- easy system to start with and then scale up to big data processing at very large scale
meaning
- unified - it supports a wide range of data analytics tasks
  - simple data loading
  - SQL queries
  - machine learning and streaming computation
- computing engine
  - Spark handles loading data from storage systems and performing computation on it
  - permanent storage is not Spark's goal; it computes over data held in other storage systems
- libraries
  - unified API for common data analytics tasks
  - Spark SQL - SQL
  - MLlib - machine learning
  - Spark Streaming - stream processing
  - GraphX - graph analytics
history
- started at UC Berkeley in 2009

Outline