Apache spark

Kanishdankani
2 min readMay 20, 2020

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Apache Spark defined

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores. Spark also takes some of the programming burdens of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

When we compare with Hadoop Apache spark faster.

Usage

Apache Spark is open source, general-purpose distributed computing engine used for processing and analyzing a large amount of data. Just like Hadoop MapReduce, it also works with the system to distribute data across the cluster and process the data in parallel.

Security

By default there is no any security. This could mean you are vulnerable to attack by default. But you can enable security. YARN RPC encryption being enabled for the distribution of secrets to be secure.

Spark supports AES-based encryption for RPC connections. For encryption to be enabled, RPC authentication must also be enabled and properly configured. AES encryption uses the Apache Commons Crypto library, and Spark’s configuration system allows access to that library’s configuration for advanced users.

There is also support for SASL-based encryption, although it should be considered deprecated. It is still required when talking to shuffle services from Spark versions older than 2.2.0.

Apache spark support SSL configuration, Http security configuration,port configuration for network security

Scalability

Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is the in-memory computation.

Apache spark is the n-tier distributed system.

--

--

Kanishdankani

I am a Software Engineering undergraduate currently living in Mullaitivu, Sri Lanka.My interests range from programming to web development.