Apache Spark#
Apache Spark is a fast, general-purpose cluster computing system for big-data workloads. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Features#
Find below some of the main features of Apache Spark:
- Speed (Spark runs up to 100 times faster than Hadoop MapReduce for large-scale in-memory data processing)
- Powerful Caching (Spark provides powerful in-memory caching and disk persistence capabilities)
- Deployment (Apache Spark clusters can be deployed through Spark’s own standalone cluster manager, as well as on Hadoop YARN or Kubernetes)
- Real-Time (Spark provides near-real-time computation and low latency because of in-memory computation)
- Polyglot (Spark provides high-level APIs in Java, Scala, Python, and R, so Spark applications can be written in any of these four languages)
Use Cases#
Find below some examples of possible use cases:
- Performing compute-intensive tasks
- Performing various relational operations (e.g. text search or simple data operations) on both internal and external data sources
- Performing Machine Learning (ML) tasks such as feature extraction, classification, regression, clustering, recommendation, and more
Resources#
Find below some interesting links providing more information on Apache Spark: