Skip to content

Hadoop vs Spark

Parameter Hadoop Spark
Performance slow, use disk for storage and depends on disk read and write speed fast, in memory performance with reduced disk reading and writing operations
Cost open source - less expensive to run, use affordable consumer hardware, easy to find trained hadoop professional open source, relies on memory consumption - increase cost
Data Processing batch processing, use mapreduce to split large dataset across cluster for parallel analysis for iterative live stream data analysis - works with RDD and DAG to run operations
Fault Tolerance highly fault tolerance - replicates the data across nodes and uses them in case of an issue tracks RDD block creation process, and then it can rebuild a dataset when a partition falls. can also use DAG to rebuild data across nodes
Scalability easily scalable by adding nodes and disks for storage. supports tens of thousands of nodes without a known limit bit more challenging to scale because it relies on RAM for computations. Support thousands of nodes in a cluster
Security secure support LDAP, ACLa, Kebros,SLA, etc not secure. default tuned off. relies on integration with hadoop to achieve necessary security level
ease of use and language support difficult - less supported languages - use java python mapreduce more user friendly - allows interactive shell integration mode, API can be written in java, scala, r, python , spark sql
machine learning slower than spark, data fragments, can be too large and create bottleneck. mahout is the main library much faster with in memory processing. uses mllib for computations
scheduling and resource management use external solutions. YARN is the most common option. oozle is available for workflow scheduling. has built in tools for resource allocation, scheduled and monitoring
  • code sample hadoop vs spark

    • hadoop mapreduce
      • main class
      • mapper class
      • reducer class
    • spark
      • one main class
  • spark toolkit

    • upper level
      • structure streaming
      • advance analytics
      • libraries and ecosystems
    • Structured apis
      • datasets
      • dataframes
      • sql
    • low level API
      • RDD
      • distributed variables