Real World Uses of Spark

Alibaba, Yahoo, Google, Facebook, and Netflix use Spark
speed is the core attraction of Spark
it offers interactive APIs in multiple languages, including Scala, Java, Python, and R
why Spark is popular
- a favorite among developers, as it lets them write applications in Java, Scala, or Python
- backed by an active developer community, and also supported by a dedicated company, Databricks
- although the majority of Spark applications use HDFS as the underlying file storage, Spark is also compatible with data sources such as Cassandra, MySQL, and AWS S3
- developed on top of the Hadoop ecosystem, which allows for easy and fast development
- the growing volume of big data
applications
- processing streaming data
  - with so much data being processed, it becomes essential for companies to stream and analyze data in real time
  - Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework for all their processing needs
  - common ways businesses use Spark Streaming today:
    - streaming ETL
    - data enrichment
    - trigger event detection
    - complex session analysis
- machine learning
  - MLlib is Spark's machine learning library
  - MLlib covers areas such as clustering, classification, and dimensionality reduction
  - these support very common big data functions such as predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis
- fog computing
  - big data + IoT
  - fog computing decentralizes data processing and storage, performing those functions at the edge of the network instead
  - However, fog computing brings new complexities to processing decentralized data, because it increasingly requires low latency, massively parallel processing of machine learning, and extremely complex graph analytics algorithms.
  - Fortunately, with key stack components such as Spark Streaming, an interactive real-time query tool (Shark), a machine learning library (MLlib), and a graph analysis engine (GraphX), Spark more than qualifies as a fog computing solution.
  - In fact, as the IoT industry gradually and inevitably converges, many industry experts predict that, compared to other open source platforms, Spark has the potential to emerge as the de facto fog infrastructure.
- interactive analysis
  - MapReduce was built to handle batch processing, and SQL-on-Hadoop engines such as Hive or Pig are often too slow for interactive analysis
  - Apache Spark is fast enough to perform exploratory queries without sampling
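
The four Spark Streaming patterns listed under "processing streaming data" (streaming ETL, data enrichment, trigger event detection, session analysis) can be illustrated on a single small batch of events. The sketch below is plain Python, not the Spark Streaming API; it only shows the kind of logic a streaming job might apply to each batch. The record layout ("ts,user,action"), the USER_REGION lookup table, the 30-second session gap, and the error threshold are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical static reference data used for enrichment.
USER_REGION = {"u1": "EU", "u2": "US"}

SESSION_GAP = 30      # assumed: seconds of inactivity that end a session
ERROR_THRESHOLD = 2   # assumed: errors per user that raise an alert


def parse(line):
    """Streaming ETL: turn a raw 'ts,user,action' line into a record, or None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 3 or not parts[0].isdigit():
        return None
    return {"ts": int(parts[0]), "user": parts[1], "action": parts[2]}


def enrich(record):
    """Data enrichment: add the user's region from the static lookup table."""
    return {**record, "region": USER_REGION.get(record["user"], "unknown")}


def detect_triggers(records):
    """Trigger event detection: users with at least ERROR_THRESHOLD error events."""
    errors = defaultdict(int)
    for r in records:
        if r["action"] == "error":
            errors[r["user"]] += 1
    return sorted(user for user, count in errors.items() if count >= ERROR_THRESHOLD)


def sessionize(records):
    """Session analysis: split each user's timestamps into sessions at inactivity gaps."""
    sessions = defaultdict(list)  # user -> list of sessions (each a list of timestamps)
    for r in sorted(records, key=lambda rec: (rec["user"], rec["ts"])):
        user_sessions = sessions[r["user"]]
        if user_sessions and r["ts"] - user_sessions[-1][-1] <= SESSION_GAP:
            user_sessions[-1].append(r["ts"])
        else:
            user_sessions.append([r["ts"]])
    return dict(sessions)


def process_batch(lines):
    """Run all four steps over one batch of raw lines."""
    cleaned = [enrich(r) for r in map(parse, lines) if r is not None]
    return cleaned, detect_triggers(cleaned), sessionize(cleaned)


batch = [
    "0,u1,click",
    "10,u1,error",
    "not-a-record",   # malformed line, dropped by the ETL step
    "20,u2,click",
    "25,u1,error",
    "100,u1,click",
]
records, alerts, sessions = process_batch(batch)
print(alerts)
print(sessions)
```

In an actual Spark job, each of these functions would roughly correspond to a transformation applied to the incoming stream, with Spark handling distribution and state across batches; here the state lives only within one batch, which is a simplification.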