Spark API
- API: application programming interface
- Spark's language APIs aim to provide similar performance across all of its supported languages
- language APIs
    - Scala
    - Java
    - Python
    - SQL
    - R (SparkR)
SparkSession
- we control our Spark Application through a driver process called the SparkSession
- the SparkSession instance is the way Spark executes user-defined manipulations across the cluster
- there is a one-to-one correspondence between a SparkSession and a Spark Application
- the SparkSession object is available to the user and is the entry point to running Spark code (a creation sketch follows below)
- in Python and R, Spark translates the code into code that it can run on the executor JVMs
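A minimal sketch of creating that entry point in Scala; the app name and local-mode master are assumptions for illustration, not part of the notes:

// in Scala: build (or reuse) the SparkSession driver process
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notes-example")   // hypothetical name
  .master("local[*]")         // assumption: local mode for illustration
  .getOrCreate()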
Structured API overview
- The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files.
- these APIs refer to three core types of distributed collection APIs (see the sketch after this list):
    - Datasets
    - DataFrames
    - SQL tables and views
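A minimal Scala sketch touching all three collection types, assuming an existing SparkSession named spark (the view name recs is made up for illustration):

// Dataset: spark.range gives a typed Dataset[java.lang.Long]
val ds = spark.range(3)
// DataFrame: the untyped form (Dataset[Row]) with a named column
val df = ds.toDF("number")
// SQL view: register the DataFrame, then query it with SQL
df.createOrReplaceTempView("recs")
val viaSql = spark.sql("SELECT number + 1 AS plus_one FROM recs")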
- Spark has two notions of structured collections:
    - DataFrames
    - Datasets
- Spark uses an engine called Catalyst that maintains its own type information through the planning and processing of work
- in doing so, Catalyst opens up a wide variety of execution optimizations that make significant differences (its plans can be inspected with explain(), as sketched at the end of this section)
- Spark types map directly to the different language APIs that Spark maintains, and there exists a lookup table for each of these in Scala, Java, Python, SQL, and R
- even if we use Spark's structured APIs from Python or R, the majority of manipulations will operate strictly on Spark types, not Python types
// in Scala
val df = spark.range(500).toDF("number")
df.select(df.col("number") + 10)

# in Python
df = spark.range(500).toDF("number")
df.select(df["number"] + 10)
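In both languages the addition is performed on Spark types through Catalyst, not on native Scala or Python integers. As a quick sketch (reusing the Scala df from above), explain(true) prints the parsed, analyzed, optimized, and physical plans that Catalyst builds for the expression:

// in Scala: inspect Catalyst's plans for the expression
df.select(df.col("number") + 10).explain(true)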