Ever wonder how Google Translate can turn the text of a website written in Danish into mostly intelligible American English? It can do that because its translations are built on statistical analysis of enormous bodies of text.
Everyone has heard of “big data,” but many of us are just getting our feet wet in that particular ocean. So here’s a beginner’s 20,000-ft view of some of the most commonly used but little-understood Big Data terms and concepts, as a primer for future blog posts. We’ll start with the elephant in the room:
What is Big Data?
It refers to massive, complex sets of data, both structured and unstructured, generated by the plethora of data sources across the internet: billions of IoT devices, retail transactions, social media, Web 2.0 sites like Google and eBay, and website click streams. As the sheer amount (volume) and complexity (variety) of data grow, its rapid generation (velocity) challenges businesses to extract meaningful trends and patterns as close in time to its creation as possible. That volume and variety spurred the development of sophisticated data-handling and analytical frameworks such as Hadoop and NoSQL databases. Big data’s value resides in its countless possible applications and, increasingly, in its worth as a bottom-line commercial asset (“infonomics”). The term may also refer to predictive analytics and user-behavior analytics. Big-data analysis holds profound importance for businesses in terms of cost reduction, risk reduction, troubleshooting and sales, as well as for AI, machine learning and scientific analysis.
The Big Data landscape has evolved in a few short years into one of vast complexity. To keep things simple, the image here represents the landscape as of 2012; more recent depictions are far more intricate.
The landscape comprises a constellation of technologies, data sources, and applications manipulating or generating data sets across a range of niche spaces such as health, IoT, finance & economics, social media and so on. Software frameworks for storing, handling and analyzing these data sets include technologies such as Hadoop and non-relational databases like Cassandra. Built upon these, a plethora of tech companies, focusing on infrastructure, analytics or applications, inhabit the space.
The key to corporate innovation is interconnectivity, a huge challenge for businesses. With the enormous variety of data models, platforms and applications available today, integrating data in purposeful ways is challenging and expensive. It takes a great amount of planning and strategic investment to put Big Data together to generate meaningful results. Oracle, IBM’s Cognos and Qlik’s QlikView are examples of successful, innovative offerings in the business-intelligence space.
Hadoop
Created by computer scientists Doug Cutting and Mike Cafarella in 2006, Hadoop is an open-source framework for reliable, scalable, distributed computing. It is a project of the Apache Software Foundation, which develops software products for the public good. Hadoop is used to store and analyze enormous sets of complex data across clusters of networked servers using the MapReduce programming model. Its distributed file system allows for rapid transfer rates and keeps data available even if any of its constituent nodes fails. The term can also refer to the entire ecosystem of applications, such as Chukwa, that can be installed on top of or alongside Hadoop’s primary modules.
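The distributed-storage idea behind that fault tolerance can be sketched in a few lines. This is not Hadoop’s actual API, just an illustration in plain Python of splitting data into blocks and placing each block on several nodes; the block size, replication factor and node names are invented:

```python
def place_blocks(data: bytes, nodes: list, block_size: int = 4, replicas: int = 3):
    # Split the data into fixed-size blocks, HDFS-style.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Assign each block to `replicas` distinct nodes, round-robin.
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement

blocks, placement = place_blocks(b"hello big data!", ["node1", "node2", "node3", "node4"])
# Every block lives on 3 of the 4 nodes, so losing any single node loses no data.
```

Because each block exists on three nodes, any one server can fail and every block is still readable from the survivors, which is the property the paragraph above describes.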
NoSQL
NoSQL (“not only SQL”) databases are high-performance, non-relational, scalable databases that allow for quick retrieval of data and tolerate hardware failures. (“SQL” stands for Structured Query Language, the standard language of relational databases, which date back to the 1970s.) NoSQL’s current popularity was spurred by Web 2.0 companies like Facebook, Amazon and Google. An example of a NoSQL database is Apache Cassandra.
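To make the contrast with relational tables concrete, here is a toy in-memory sketch of the document-store flavor of NoSQL in plain Python: records are fetched by key and need not share a fixed schema. The keys, field names and data are all invented for illustration:

```python
# A toy "document store": keys map to schema-less documents (plain dicts).
store = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Alan", "interests": ["chess", "codes"]},  # different fields!
}

def get(key, default=None):
    # Retrieval is a single key lookup -- fast, and easy to shard across nodes.
    return store.get(key, default)

print(get("user:2")["name"])  # Alan
```

Note that the two records hold different fields; a relational table would force both rows into one fixed set of columns, which is exactly the rigidity NoSQL systems relax.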
MapReduce
MapReduce is a programming model that splits a job into pieces, maps each piece across distributed data servers, then reduces (aggregates) the intermediate results into a meaningful answer to the query. One of the earliest large-scale uses of MapReduce was Google’s web-indexing pipeline, which feeds its PageRank algorithm.
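The model is easiest to see in miniature. Below is the classic word-count example written in plain Python with no Hadoop involved; `map_phase` and `reduce_phase` are hypothetical names standing in for the framework’s map and reduce steps, and the documents are made up:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data is everywhere"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map step runs in parallel on the servers that already hold the data, and the framework shuffles each word’s pairs to a reducer; the logic, though, is exactly this simple.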
Machine Learning
We’re not talking about a supercomputer like Colossus (although some people may worry), but about algorithms that can learn from data and make predictions on it, allowing machines to function beyond static instructions. Machine learning and data mining often use similar methods and overlap, but machine learning focuses on predictive output, while data mining focuses on discovering previously unknown information or patterns within the data.
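“Learning from data rather than static instructions” can be shown in a few lines. This sketch fits a least-squares line to some invented points in plain Python and then predicts an unseen value; real work would reach for a library such as scikit-learn, and the data here is purely illustrative:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b, derived from the data alone.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# "Training data": e.g. hours of machine use vs. observed wear (made up).
xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]
a, b = fit_line(xs, ys)
prediction = a * 5 + b  # predict the value at an x the program never saw
print(round(prediction, 2))
```

No rule for x=5 was ever written into the program; the prediction comes entirely from the pattern in the data, which is the essential difference from static instructions.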
Check out our other posts, now and soon to come, as tekMountain explores machine learning, artificial intelligence and much more. As the innovation center of the Southeast, tekMountain is ready to assist you in realizing your tech-startup vision.