Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.
Industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
- Volume: Organizations collect data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.
- Velocity: Data streams in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors, and smart metering are driving the need to process torrents of data in near-real time.
- Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data, and financial transactions.
Two more dimensions have since been added:
- Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal, and event-triggered peak data loads can be challenging to manage, even more so with unstructured data.
- Complexity: Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse, and transform data across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or your data can quickly spiral out of control.
While Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept has fostered a sharper distinction between big data and Business Intelligence, regarding the data and its use:
- Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.;
- Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, revealing relationships and dependencies and enabling predictions of outcomes and behaviors.
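The contrast can be sketched with a toy example (hypothetical sales figures; plain Python, standard library only): descriptive statistics summarize what happened in a dense data set, while inductive inference fits a model that can predict beyond the observed range.

```python
import math
import statistics

# Hypothetical daily sales figures (dense, structured data)
sales = [120, 135, 150, 160, 171, 185, 205, 220, 241, 260]

# Business Intelligence style: descriptive statistics -- measure and summarize
mean_sales = statistics.mean(sales)
trend = sales[-1] - sales[0]
print(f"mean = {mean_sales}, overall change = {trend}")

# Big data style: inductive inference -- fit a simple model (least squares on
# log(sales), capturing exponential growth) and predict future behavior
xs = list(range(len(sales)))
ys = [math.log(s) for s in sales]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

# Extrapolate to day 12, beyond the observed data
predicted = math.exp(intercept + slope * 12)
print(f"predicted sales on day 12: {predicted:.0f}")
```

The first half only describes the data it was given; the second half infers a relationship and uses it to make a prediction, which is the essential difference the two bullets above draw.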
Big data is large-volume, loosely structured data that cannot be handled by standard database management systems (DBMS, RDBMS, or ORDBMS) and defies traditional storage. A few examples:
- Facebook: 40 PB of data stored, capturing 100 TB per day
- Yahoo: 60 PB of data
- Twitter: 8 TB per day
- eBay: 40 PB of data stored, capturing 50 TB per day
- An example of sensor and machine data is found at the Large Hadron Collider at CERN, the European Organization for Nuclear Research, where scientists can generate 40 terabytes of data every second during experiments.
- Boeing jet engines can produce 10 terabytes of operational information for every 30 minutes of operation. A four-engine jumbo jet can thus create 640 terabytes of data on just one Atlantic crossing.
- Social network data is a new and exciting source of big data that companies would like to leverage. The microblogging site Twitter serves more than 200 million users who produce more than 90 million “tweets” per day, or 800 per second. Each of these posts is approximately 200 bytes in size. On an average day, this traffic amounts to more than 12 gigabytes and, across the whole Twitter ecosystem, the company produces a total of eight terabytes of data per day. In comparison, the New York Stock Exchange produces about one terabyte of data per day.
- In July 2013, Facebook announced they had surpassed the 750 million active-user mark, making the social networking site the largest consumer-driven data source in the world. Facebook users spend more than 700 billion minutes per month on the service, and the average user creates 90 pieces of content every 30 days. Each month, the community creates more than 30 billion pieces of content ranging from Web links, news, stories, blog posts and notes to videos and photos.
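A quick back-of-the-envelope check of the jet-engine figure above (the 8-hour flight duration is an assumption not stated in the text, but a typical Atlantic crossing time):

```python
# From the text: 4 engines, each producing 10 TB per 30 minutes of operation.
# Assumed: an 8-hour Atlantic crossing.
engines = 4
tb_per_engine_per_half_hour = 10
crossing_hours = 8  # assumed flight duration

half_hours = crossing_hours * 2
total_tb = engines * tb_per_engine_per_half_hour * half_hours
print(total_tb)  # 640, matching the 640 TB quoted above
```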
In defining big data, it’s also important to understand the mix of structured, unstructured and multi-structured data that comprises the volume of information.
- Structured data is a generic label for data contained in a database or some other fixed data structure. It is arranged in titled columns and rows that can easily be ordered and processed by data-processing tools; it can be visualized as a perfectly organized filing cabinet where everything is identified, labeled, and easy to access. It is usually managed with SQL in an RDBMS, and it is highly structured: it includes transactions, reference tables, and relationships, as well as the metadata that sets its context. This traditional business data, typically stored in relational database systems, has made up the vast majority of what IT has managed and processed, in both operational and BI systems.
- Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and it is typically text-heavy. Twitter tweets and other social media posts are good examples of unstructured data.
- Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information. As digital disruption transforms communication and interaction channels—and as marketers enhance the customer experience across devices, web properties, face-to-face interactions and social platforms—multi-structured data will continue to evolve.
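The three categories can be illustrated with a short sketch (hypothetical records and field names; a minimal illustration, not a production log parser). The web log line is the multi-structured case: machine-readable fields such as the IP address and status code sit alongside free-form parts such as the request string and user agent.

```python
import json
import re

# Structured: fixed schema, rows and columns (as in an RDBMS table)
order = {"order_id": 1001, "customer": "Alice", "amount": 49.95}

# Unstructured: free text with no imposed schema (e.g., a tweet)
tweet = "Loving the new release! #bigdata"

# Multi-structured: a web-server log line mixing structured fields
# (IP, timestamp, status code) with free-form text (request, user agent)
log_line = '203.0.113.7 - [12/Mar/2014:10:05:12] "GET /index.html" 200 "Mozilla/5.0"'
pattern = (r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] '
           r'"(?P<request>[^"]+)" (?P<status>\d+) "(?P<agent>[^"]+)"')
parsed = re.match(pattern, log_line).groupdict()
print(json.dumps(parsed, indent=2))
```

Extracting the structured fields from such semi-free-form records is exactly the kind of "connect and correlate" work that multi-structured data demands.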