Big data has become a bit of a household term. But what is it, and how critical is “big data” to the world of programmers, developers, administrators, business analysts and project managers?
In general, big data refers to data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. The truth is that big data is everywhere because it has implications for all sorts of businesses. Big data touches:
- Understanding and Targeting Customers
- Understanding and Optimizing Business Processes
- Personal Quantification and Performance Optimization
- Improving Healthcare and Public Health
- Improving Sports Performance
- Improving Science and Research
- Optimizing Machine and Device Performance
- Improving Security and Law Enforcement
- Improving and Optimizing Cities and Countries
- Financial Trading
The challenges behind big data include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
At times, the term “big data” refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. “There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem.”
The META Group (now Gartner) defines data growth challenges and opportunities as three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner's definition is as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” Gartner's 3Vs definition is still widely used, and it agrees with a consensus definition stating that “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.”
Big Data, R and Statistics
What’s behind the statistics, the analytics, and the visualizations that today’s data scientists and business leaders rely on to make powerful decisions? And why consider R for big data?
Many data science courses are now taught entirely in R, an open-source statistical programming language and one of the essential tools in any data scientist’s tool kit. Thanks to its extensive package repository for statistical and analytics applications, R is growing tremendously in popularity around the world, and many firms are on the lookout for R programmers.
R is a statistical programming and scripting language and environment that data experts use for mapping broad social and marketing trends and for developing financial and climate models that help our economies and communities. R makes it easy to save and rerun analyses on updated data sets, and packages such as Rcpp allow tight integration of R and C++. Unlike Excel and other GUI analysis programs, R is completely auditable: every step of an analysis is recorded in code. Using R takes advantage of the wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.
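As a minimal sketch of that rerunnable, auditable workflow, the script below fits a linear model on R's built-in `mtcars` data set. The commented-out file name at the end is hypothetical, just to show how the same script would repeat the analysis on updated data.

```r
# Every step of the analysis lives in code, so it is auditable
# and can be rerun whenever the data set is updated.

data(mtcars)  # built-in example data set

# Fit a linear model: fuel efficiency (mpg) as a function of weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

# Inspect the fitted coefficients and the full model summary
print(coef(fit))
print(summary(fit))

# Pointed at an updated file, the same script repeats the analysis exactly:
# updated <- read.csv("updated_data.csv")   # hypothetical file name
# fit2 <- update(fit, data = updated)
```

Because the analysis is a script rather than a sequence of mouse clicks, a colleague can rerun it end to end and verify every result.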
So what exactly is R, and where did R start?
R was originally developed in New Zealand by two professors at the University of Auckland, Robert Gentleman and Ross Ihaka, who wanted a better statistical platform for their students. So they created one, modeling R after the statistical language S created by John Chambers.
They, and many others, continued working on R, creating new tools and finding new applications for it every day. Today R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. The tools R provides include:
- An effective data handling and storage facility
- A suite of operators for calculations on arrays
- A large, coherent, integrated collection of intermediate tools for data analysis
- A well-developed, simple, and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities
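A few lines of R illustrate the language facilities listed above: vectorized operations on arrays, a conditional inside a loop, and a user-defined recursive function.

```r
# Array operations: arithmetic is vectorized element-wise
x <- c(1, 2, 3, 4, 5)
squares <- x^2            # c(1, 4, 9, 16, 25)

# Conditional and loop: sum the even squares (4 + 16 = 20)
total <- 0
for (v in squares) {
  if (v %% 2 == 0) {
    total <- total + v
  }
}
print(total)              # 20

# A user-defined recursive function
factorial_r <- function(n) {
  if (n <= 1) return(1)
  n * factorial_r(n - 1)
}
print(factorial_r(5))     # 120
```

In idiomatic R the loop above would usually be replaced by a vectorized expression such as `sum(squares[squares %% 2 == 0])`, but the explicit form shows the control-flow constructs the list mentions.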
Some of the statistical techniques it provides are linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and clustering. While the S language is usually the vehicle for research in statistical methodology, R provides an open-source route to participation in that activity.
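Two of those techniques, clustering and a classical statistical test, can be sketched in a few lines using R's built-in `iris` and `sleep` data sets:

```r
# k-means clustering on the four numeric columns of the iris data set
set.seed(42)                          # make the clustering reproducible
km <- kmeans(iris[, 1:4], centers = 3)
print(table(km$cluster, iris$Species))  # compare clusters to true species

# A classical statistical test: two-sample t-test on the sleep data
print(t.test(extra ~ group, data = sleep))
```

Both functions ship with base R; no extra packages are needed for this sketch.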
That’s a lot of technical talk. Which businesses use R, and what do they use it for? Here are a few examples:
- Google uses R to calculate the ROI on advertising campaigns.
- Ford uses R to improve the design of its vehicles.
- Twitter uses R to monitor user experience.
- The US National Weather Service uses R to predict severe flooding.
- The New York Times uses R to create infographics and interactive data journalism applications.
FREE Webinar on R
R is the best at what it does: letting experts quickly and easily interpret, interact with, and visualize data.
Learn more about how open-source R continues to shape the future of statistical analysis and data science with one of our experts, Jose Portilla.
Save your seat for Thursday, November 17th at 1pm EST, USA by clicking here: