Big Data and Apache Spark: A Review
Abstract
Big Data is currently one of the most active topics in Computer Science and Business Intelligence, and an enormous amount of information waits to be documented properly, with emphasis on the market. By market, we mean the technologies currently in use, the prevalent tools, and the companies playing an imperative role in taming data of such colossal reach.
Introduction
This paper details the concept of big data, the nitty-gritty of the field, and the associated analytics. The paper is divided into seven sections. We begin by introducing the concept of Big Data. One of the subsequent sections explains a very important slice of big data: the three V's. We then present sections on big data analytics and on security issues in big data analytics. This is followed by a section that conveys the enormity of the scale at which data is generated in this world: what the sources are, what the sinks are, and how we go about transforming the data to develop lineages or provenances, following the ETL (extract, transform, load) paradigm.
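The ETL flow from sources to sinks, with lineage kept along the way, can be illustrated with a minimal sketch. All names here (`extract`, `transform`, `load`, the lineage tag) are illustrative, not taken from the paper or any particular tool:

```python
# Hypothetical minimal ETL sketch: extract raw rows from a source,
# transform them (attaching a simple provenance tag), and load the
# results into a target store.

def extract(source):
    """Extract: read raw rows from a source (here, an in-memory list)."""
    return list(source)

def transform(rows):
    """Transform: clean and reshape each row, recording a lineage tag."""
    out = []
    for i, (name, value) in enumerate(rows):
        out.append({
            "name": name.strip().lower(),   # normalize the raw field
            "value": value * 2,             # an example transformation
            "lineage": f"source_row_{i}",   # provenance: where the record came from
        })
    return out

def load(records, sink):
    """Load: append the transformed records into the target store."""
    sink.extend(records)
    return sink

raw = [("  Alice ", 1), ("BOB", 2)]
warehouse = []
load(transform(extract(raw)), warehouse)
```

The lineage tag is the key detail: each record in the sink can be traced back to the source row it was derived from, which is the essence of provenance in an ETL pipeline.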
We then discuss the variety of excellent tools available in the market from big names such as Apache, for which we have written two sections: one on Apache Hadoop and the other on Apache Spark. The prime difference we spotlight is Hadoop MapReduce's use of disk compared with Apache Spark's use of memory, which makes Spark the more competitive product and has allowed it to set quite a few benchmark records.
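This disk-versus-memory contrast can be sketched in plain Python (this is not Spark's or Hadoop's actual API, only an illustration of the architectural difference): a MapReduce-style pipeline spills each stage's output to disk for the next stage to reload, while a Spark-style pipeline keeps the intermediate dataset cached in memory.

```python
# Pure-Python sketch of the architectural contrast: MapReduce persists
# intermediate results to disk between stages; Spark keeps them in memory.
import json
import os
import tempfile

def disk_based_pipeline(values):
    """MapReduce-style: stage 1 writes to disk, stage 2 reads it back."""
    path = os.path.join(tempfile.mkdtemp(), "stage.json")
    with open(path, "w") as f:              # stage 1: map, then spill to disk
        json.dump([v * v for v in values], f)
    with open(path) as f:                   # stage 2: reload, then reduce
        return sum(json.load(f))

def in_memory_pipeline(values):
    """Spark-style: the intermediate dataset stays in memory across stages."""
    cached = [v * v for v in values]        # analogous to rdd.map(...).cache()
    return sum(cached)                      # analogous to rdd.reduce(...)

data = list(range(10))
assert disk_based_pipeline(data) == in_memory_pipeline(data) == 285
```

Both pipelines compute the same result; the difference is where the intermediate squares live. Avoiding the round trip through the filesystem on every stage is the main reason iterative and multi-stage workloads run faster on Spark.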
Conclusion
The paper concludes with the proposition that Big Data is a booming field, and that the immense amount of data generated every moment calls for a highly effective management and analysis system that can cope with its magnitude.
Furthermore, this paper seeks to justify the characteristics of Apache Spark and its standing as highly efficient software for the current Big Data landscape. In October 2014, Databricks entered the Sort Benchmark and set a new world record for sorting 100 terabytes (TB) of data, i.e., 1 trillion 100-byte records. The team used Apache Spark on 207 EC2 virtual machines and sorted the 100 TB in 23 minutes. In comparison, the previous world record, set with Hadoop MapReduce, used 2100 machines in a private data center and took 72 minutes. The entry tied with a UCSD research team that builds high-performance systems.