Big Data and Apache Spark: A Review
Abstract
Big Data is currently one of the most active topics in Computer Science and Business Intelligence, and an enormous amount of information waits to be documented properly, with emphasis on the market. By market, we mean the technologies currently in use, the prevalent tools, and the companies playing an imperative role in taming data of such colossal reach.
Introduction
This paper details the concept of big data, the nitty-gritty of the field, and the associated analytics. The paper is divided into seven sections. We begin by introducing the concept of Big Data. One of the subsequent sections explains a very important slice of big data: the three V's. We then present sections on big data analytics and on security issues in big data analytics. This is followed by a section that conveys the enormity of the scale at which data is generated in this world: what the sources are, what the sinks are, and how we go about transforming the data to develop lineages or provenances, following the ETL (extract, transform, load) paradigm.
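The ETL flow from sources to sinks, with lineage kept along the way, can be illustrated with a minimal sketch. All names here (`extract`, `transform`, `load`, the lineage tag) are illustrative, not taken from the paper or any particular tool:

```python
# Hypothetical minimal ETL sketch: extract raw rows from a source,
# transform them (attaching a simple provenance tag), and load the
# results into a target store.

def extract(source):
    """Extract: read raw rows from a source (here, an in-memory list)."""
    return list(source)

def transform(rows):
    """Transform: clean and reshape each row, recording a lineage tag."""
    out = []
    for i, (name, value) in enumerate(rows):
        out.append({
            "name": name.strip().lower(),   # normalize the raw field
            "value": value * 2,             # an example transformation
            "lineage": f"source_row_{i}",   # provenance: where the record came from
        })
    return out

def load(records, sink):
    """Load: append the transformed records into the target store."""
    sink.extend(records)
    return sink

raw = [("  Alice ", 1), ("BOB", 2)]
warehouse = []
load(transform(extract(raw)), warehouse)
```

The lineage tag is the key detail: each record in the sink can be traced back to the source row it was derived from, which is the essence of provenance in an ETL pipeline.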
We then discuss the variety of excellent tools available in the market from big names such as Apache, for which we have written two sections: one on Apache Hadoop and the other on Apache Spark. The prime difference we spotlight is Hadoop MapReduce's use of disk compared with Apache Spark's use of memory, which makes Spark the more competitive product and has allowed it to set quite a few benchmark records.
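This disk-versus-memory contrast can be sketched in plain Python (this is not Spark's or Hadoop's actual API, only an illustration of the architectural difference): a MapReduce-style pipeline spills each stage's output to disk for the next stage to reload, while a Spark-style pipeline keeps the intermediate dataset cached in memory.

```python
# Pure-Python sketch of the architectural contrast: MapReduce persists
# intermediate results to disk between stages; Spark keeps them in memory.
import json
import os
import tempfile

def disk_based_pipeline(values):
    """MapReduce-style: stage 1 writes to disk, stage 2 reads it back."""
    path = os.path.join(tempfile.mkdtemp(), "stage.json")
    with open(path, "w") as f:              # stage 1: map, then spill to disk
        json.dump([v * v for v in values], f)
    with open(path) as f:                   # stage 2: reload, then reduce
        return sum(json.load(f))

def in_memory_pipeline(values):
    """Spark-style: the intermediate dataset stays in memory across stages."""
    cached = [v * v for v in values]        # analogous to rdd.map(...).cache()
    return sum(cached)                      # analogous to rdd.reduce(...)

data = list(range(10))
assert disk_based_pipeline(data) == in_memory_pipeline(data) == 285
```

Both pipelines compute the same result; the difference is where the intermediate squares live. Avoiding the round trip through the filesystem on every stage is the main reason iterative and multi-stage workloads run faster on Spark.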
Conclusion
The paper concludes with the proposition that Big Data is a booming field, and that the immense amount of data generated every moment calls for a highly effective management and analysis system that can cope with its magnitude.
Furthermore, this paper seeks to justify the characteristics of Apache Spark and its standing as highly efficient software for the current Big Data landscape. In October 2014, Databricks entered the Sort Benchmark and set a new world record for sorting 100 terabytes (TB) of data, i.e., 1 trillion 100-byte records. The team used Apache Spark on 207 EC2 virtual machines and sorted the 100 TB in 23 minutes. In comparison, the previous world record, set with Hadoop MapReduce, used 2100 machines in a private data center and took 72 minutes. The entry tied with a UCSD research team that builds high-performance systems.