
Apache Spark

Apache Spark is a fast, general-purpose cluster computing system for big data applications. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.


Features

Find below some of the main features of Apache Spark:

  • Speed (Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing.)
  • Powerful caching (Spark provides powerful in-memory caching and disk persistence capabilities.)
  • Deployment (Apache Spark clusters can be deployed through Spark’s own standalone cluster manager, on Hadoop YARN, or on Kubernetes.)
  • Real-time (Spark provides real-time computation and low latency thanks to in-memory computation.)
  • Polyglot (Spark provides high-level APIs in Java, Scala, Python, and R, so Spark code can be written in any of these four languages.)

Use Cases

Find below some examples of possible use cases:

  • Performing compute-intensive tasks
  • Performing various relational operations (e.g. text search or simple data operations) on both internal and external data sources
  • Performing Machine Learning (ML) tasks such as feature extraction, classification, regression, clustering, recommendation, and more

Resources

Find below some interesting links providing more information on Apache Spark: