Apache Spark
What is Apache Spark?
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Spark is built on the concept of distributed datasets, which contain arbitrary Java or Python objects. You create a dataset from external data, then apply parallel operations to it. The building block of the Spark API is the RDD (Resilient Distributed Dataset) API. The RDD API offers two types of operations: transformations, which define a new dataset based on previous ones, and actions, which kick off a job to execute on a cluster. On top of the RDD API, Spark provides high-level APIs, such as the DataFrame API and the Machine Learning API, which offer a concise way to conduct common data operations.