Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.
Industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
- Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
- Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
- Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Two more dimensions have since been added:
- Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage – even more so with unstructured data.
- Complexity: Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.
While Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sharper distinction between big data and Business Intelligence, regarding data and their use:
- Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.;
- Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets, in order to reveal relationships and dependencies and to predict outcomes and behaviors.
Big data is a large volume of unstructured data which cannot be handled by standard database management systems such as a DBMS, RDBMS or ORDBMS. Big data is a very large, loosely structured data set that defies traditional storage. A few examples:
- Facebook: 40 PB of data, capturing 100 TB/day
- Yahoo: 60 PB of data
- Twitter: capturing 8 TB/day
- eBay: 40 PB of data, capturing 50 TB/day
- An example of sensor and machine data is found at the Large Hadron Collider at CERN, the European Organization for Nuclear Research, where scientists can generate 40 terabytes of data every second during experiments.
- Boeing jet engines can produce 10 terabytes of operational information for every 30 minutes of operation. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing.
- Social network data is a new and exciting source of big data that companies would like to leverage. The microblogging site Twitter serves more than 200 million users who produce more than 90 million “tweets” per day. Each of these posts is approximately 200 bytes in size. On an average day, this traffic equals more than 12 gigabytes and, throughout the Twitter ecosystem, the company produces a total of eight terabytes of data per day. In comparison, the New York Stock Exchange produces about one terabyte of data per day.
- In July 2013, Facebook announced they had surpassed the 750 million active-user mark, making the social networking site the largest consumer-driven data source in the world. Facebook users spend more than 700 billion minutes per month on the service, and the average user creates 90 pieces of content every 30 days. Each month, the community creates more than 30 billion pieces of content ranging from Web links, news, stories, blog posts and notes to videos and photos.
In defining big data, it’s also important to understand the mix of structured, unstructured and multi-structured data that comprises the volume of information.
- Structured data is a generic label for data that is contained in a database or some other type of data structure. It is displayed in titled columns and rows which can easily be ordered and processed by data processing tools, and can be visualized as a perfectly organized filing cabinet where everything is identified, labeled and easy to access. It is usually managed with SQL in an RDBMS. It is highly structured and includes transactions, reference tables and relationships, as well as the metadata that sets its context. Traditional business data of this kind is the vast majority of what IT has managed and processed, in both operational and BI systems, and it is usually stored in relational database systems.
- Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data.
- Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information. As digital disruption transforms communication and interaction channels—and as marketers enhance the customer experience across devices, web properties, face-to-face interactions and social platforms—multi-structured data will continue to evolve.
Big Data Types
Big data can be classified as
- Social Networks (or human-sourced information): this information is the record of human experiences, previously recorded in books and works of art, and later in photographs, audio and video. Human-sourced information is now almost entirely digitized and stored everywhere from personal computers to social networks. Data are loosely structured and often ungoverned.
- Internet of Things (or machine-generated data): derived from the phenomenal growth in the number of sensors and machines used to measure and record the events and situations in the physical world. The output of these sensors is machine-generated data, and from simple sensor records to complex computer logs, it is well structured. As sensors proliferate and data volumes grow, it is becoming an increasingly important component of the information stored and processed by many businesses. Its well-structured nature is suitable for computer processing, but its size and speed are beyond traditional approaches.
Human-generated data includes emails, documents, photos and tweets. We are generating this data faster than ever; just imagine the number of videos uploaded to YouTube and the tweets swirling around. This data can be big data too.
Machine-generated data is a new breed of data. This category consists of sensor data and logs generated by “machines”, such as email logs, click-stream logs, etc. Machine-generated data is orders of magnitude larger than human-generated data. Before Hadoop came on the scene, machine-generated data was mostly ignored and rarely captured, because dealing with the volume was either not possible or not cost-effective.
Big Data Challenges
The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, prevent diseases, combat crime” and so on. A few challenges are summarized below:
- Size of big data – Big data is… well… big in size! For a small company that is used to dealing with data in gigabytes, 10 TB of data would be BIG. For companies like Facebook and Yahoo, however, petabytes is big. The sheer size of big data makes it impossible (or at least cost-prohibitive) to store in traditional storage such as relational databases or conventional filers.
- Big data is unstructured or semi-structured – A lot of big data is unstructured. This lack of structure makes relational databases poorly suited to storing big data; moreover, not many databases can cope with storing billions of rows of data.
- Processing this huge volume of data to mine intelligence out of it is also a big challenge.
- Analysis and prediction from such voluminous unstructured data is challenging as well.
A parallel processing framework can solve these problems by applying divide and conquer: the data is divided into smaller sets, which are then processed in parallel. This, however, requires a robust storage platform which can scale to a very large degree (and at reasonable cost) as the data grows, and which tolerates the failure of individual machines. Processing all this data may take thousands of servers, so the price of these systems must be affordable to keep the cost per unit of storage reasonable.
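As a minimal, framework-agnostic sketch of this divide-and-conquer pattern (the sample records are made up), the Java snippet below splits a batch of log lines across worker threads and merges the partial word counts – the same split, process and combine idea that platforms such as Hadoop MapReduce apply at cluster scale:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DivideAndConquer {
    public static void main(String[] args) {
        String[] records = { "error timeout", "ok", "error disk", "ok", "ok" };
        // Divide: the parallel stream splits the data set into chunks.
        // Conquer: each chunk is tokenized and counted on a worker thread.
        // Combine: the partial counts are merged into one result map.
        Map<String, Long> counts = Arrays.stream(records)
                .parallel()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        System.out.println(counts); // e.g. {disk=1, error=2, ok=3, timeout=1}
    }
}
```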
Big Data Benefits
Leveraging big data can be of immense benefit to an organization. The benefits include:
- Discovery of patterns and relations across a variety of data sources for domain-specific competitive insights
- Leveraging social media for competitive advantage as well as identification of social risk
- Cost-effective solutions for storing & processing large volumes of data and getting meaningful analytics
- By adopting a platform that can scale to a massive degree, a company can extend the shelf life of its system and so save money, as the investment involved can be spread over a longer time.
- By getting involved in the big data field now, a company can future-proof itself and reduce risk by building a vastly scalable distributed platform.
Big Data Applications
Due to the various benefits associated with big data, it can be applied in business for better optimization, for example:
- Seismic Data Processing – Seismic data processing using low cost massive parallel processing platform
- Steel Plant Factory Optimization – Optimization of the steel manufacturing factory performance in areas of predictive maintenance, implementation of value engineering and environment management
- Retail Recommender for Consumers – A solution to provide personalized & category-wise recommendations based on learning
- Clinical Trials Analytics – Enhance the understanding of clinical data and also to improve healthcare using faster prediction and analysis
- Pricing – Organizations could vary their prices if they had enough information on each user to know how much they might pay. To a certain degree this already happens in online retail, with airlines targeting previous browsers and some stores changing prices depending on which physical store the customer is nearest.
- Weather – Companies can use sensors to map atmospheric readings. Mobile handsets such as the Samsung S4 contain a barometer, hygrometer (humidity), ambient thermometer and light meter. The prospect of millions of personal weather stations feeding into one machine that averages out readings is exciting, and one that has the potential to improve forecasting.
- Infectious diseases – Data from local climate and temperature helps to find correlations with how infectious disease spreads. This analysis is used to predict the location of future outbreaks.
Non-relational database
A non-relational database is a database that does not use the tabular schema of rows and columns found in most traditional database systems. Instead, non-relational databases use a storage model that is optimized for the specific requirements of the type of data being stored. For example, data may be stored as simple key/value pairs, as JSON documents, or as a graph consisting of edges and vertices.
What all of these data stores have in common is that they don’t use a relational model. Also, they tend to be more specific in the type of data they support and how data can be queried. For example, time series data stores are optimized for queries over time-based sequences of data, while graph data stores are optimized for exploring weighted relationships between entities. Neither format would generalize well to the task of managing transactional data.
The term NoSQL refers to data stores that do not use SQL for queries, and instead use other programming languages and constructs to query the data. In practice, “NoSQL” means “non-relational database,” even though many of these databases do support SQL-compatible queries. However, the underlying query execution strategy is usually very different from the way a traditional RDBMS would execute the same SQL query.
The following sections describe the major categories of non-relational or NoSQL databases.
Document data stores – A document data store manages a set of named string fields and object data values in an entity referred to as a document. These data stores typically store data in the form of JSON documents. Each field value could be a scalar item, such as a number, or a compound element, such as a list or a parent-child collection. The data in the fields of a document can be encoded in a variety of ways, including XML, YAML, JSON, BSON, or even stored as plain text. The fields within documents are exposed to the storage management system, enabling an application to query and filter data by using the values in these fields.
Typically, a document contains the entire data for an entity. What items constitute an entity are application specific. For example, an entity could contain the details of a customer, an order, or a combination of both. A single document might contain information that would be spread across several relational tables in a relational database management system (RDBMS). A document store does not require that all documents have the same structure. This free-form approach provides a great deal of flexibility. For example, applications can store different data in documents in response to a change in business requirements.
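As a minimal sketch of this free-form approach, using plain Java maps rather than any particular document database (all field names are illustrative), two "customer" documents can coexist in the same store with entirely different shapes:

```java
import java.util.List;
import java.util.Map;

public class DocumentStoreSketch {
    public static void main(String[] args) {
        // One document holds the entire entity, including data that an RDBMS
        // would spread across separate customer and order tables.
        Map<String, Object> customer1 = Map.of(
                "id", "c-1001",
                "name", "Alice",
                "orders", List.of(Map.of("orderId", "o-1", "total", 49.99)));

        // A second document in the same store may have a different structure;
        // no shared schema is required.
        Map<String, Object> customer2 = Map.of(
                "id", "c-1002",
                "name", "Bob",
                "loyaltyTier", "gold");

        System.out.println(customer1.get("orders"));
        System.out.println(customer2.get("loyaltyTier"));
    }
}
```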
Columnar data stores – A columnar or column-family data store organizes data into columns and rows. In its simplest form, a column-family data store can appear very similar to a relational database, at least conceptually. The real power of a column-family database lies in its denormalized approach to structuring sparse data, which stems from the column-oriented approach to storing data.
You can think of a column-family data store as holding tabular data with rows and columns, but the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse (that is, a row doesn’t need to have a value for every column).
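Conceptually, a column-family store can be pictured as a sparse nested map: row key → column family → column qualifier → value. The sketch below models that shape in plain Java (the row, family and column names are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnFamilySketch {
    // row key -> column family -> column qualifier -> value
    static Map<String, Map<String, Map<String, String>>> table = new HashMap<>();

    static void put(String row, String family, String column, String value) {
        table.computeIfAbsent(row, r -> new HashMap<>())
             .computeIfAbsent(family, f -> new HashMap<>())
             .put(column, value);
    }

    public static void main(String[] args) {
        // Columns are grouped into families, new columns can appear at any
        // time, and rows can be sparse (user2 has no contact data at all).
        put("user1", "personal", "name", "Alice");
        put("user1", "contact", "email", "alice@example.com");
        put("user2", "personal", "name", "Bob");
        System.out.println(table);
    }
}
```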
Key/value data stores – A key/value store is essentially a large hash table. You associate each data value with a unique key, and the key/value store uses this key to store the data by using an appropriate hashing function. The hashing function is selected to provide an even distribution of hashed keys across the data storage.
Most key/value stores only support simple query, insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation. If the value is large, writing may take some time.
An application can store arbitrary data as a set of values, although some key/value stores impose limits on the maximum size of values. The stored values are opaque to the storage system software. Any schema information must be provided and interpreted by the application. Essentially, values are blobs and the key/value store simply retrieves or stores the value by key.
Key/value stores are highly optimized for applications performing simple lookups using the value of the key, or by a range of keys, but are less suitable for systems that need to query data across different tables of keys/values, such as joining data across multiple tables.
Key/value stores are also not optimized for scenarios where querying or filtering by non-key values is important, rather than performing lookups based only on keys. For example, with a relational database you can find a record by using a WHERE clause to filter the non-key columns, but key/value stores usually do not have this type of lookup capability for values, or if they do, it requires a slow scan of all values.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes on separate machines.
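The hashing and distribution described above can be sketched in a few lines of Java: a made-up store that hashes each key to pick one of several "nodes" (here, just in-memory hash tables), so keys spread evenly and lookups stay cheap:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyValueSketch {
    // Each "node" stands in for a separate machine in a real deployment.
    static List<Map<String, byte[]>> nodes =
            List.of(new HashMap<>(), new HashMap<>(), new HashMap<>());

    // The hash function distributes keys evenly across the nodes.
    static Map<String, byte[]> nodeFor(String key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public static void main(String[] args) {
        // Values are opaque blobs: the store neither parses nor indexes them,
        // and an update means overwriting the whole blob for that key.
        nodeFor("user:42").put("user:42", "{\"name\":\"Alice\"}".getBytes());
        byte[] value = nodeFor("user:42").get("user:42");
        System.out.println(new String(value));
    }
}
```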
Graph data stores – A graph data store manages two types of information, nodes and edges. Nodes represent entities, and edges specify the relationships between these entities. Both nodes and edges can have properties that provide information about that node or edge, similar to columns in a table. Edges can also have a direction indicating the nature of the relationship.
The purpose of a graph data store is to allow an application to efficiently perform queries that traverse the network of nodes and edges, and to analyze the relationships between entities. For example, consider an organization’s personnel data structured as a graph: the entities are employees and departments, and the edges indicate reporting relationships and the department in which each employee works, with the direction of each edge expressing the nature of the relationship.
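A toy version of that personnel graph, with nodes, directed labeled edges, and a simple traversal, in plain Java (all names and relationship labels are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphSketch {
    // An edge carries a label (the relationship) and a direction (from -> to).
    record Edge(String label, String to) {}

    static Map<String, List<Edge>> graph = new HashMap<>();

    static void addEdge(String from, String label, String to) {
        graph.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(label, to));
    }

    public static void main(String[] args) {
        addEdge("Alice", "REPORTS_TO", "Bob");
        addEdge("Alice", "WORKS_IN", "Engineering");
        addEdge("Bob", "WORKS_IN", "Engineering");
        // Traverse outgoing edges: who does Alice report to, where does she work?
        for (Edge e : graph.get("Alice")) {
            System.out.println("Alice " + e.label() + " " + e.to());
        }
    }
}
```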
Relational and Non-Relational databases
Relational databases tend to make one set of trade-offs, and non-relational tend to make a different set of trade-offs. For massive distributed datasets, non-relational sometimes makes more sense.
There is also a sense in which non-relational databases can eliminate a lot of the ORM pain, but again there are always tradeoffs. In some use cases, non-relational storage can be faster, because all the data for a particular hierarchy can be stored closer together on the disk. Also note that non-relational databases do still have query capabilities.
In the end, it’s about making the appropriate set of trade-offs for your particular use-case.
NoSQL
HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
However, HBase has many features which support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles in terms of both storage and processing capacity. An RDBMS can scale well, but only up to a point – specifically, the size of a single database server – and for the best performance it requires specialized hardware and storage devices. HBase features of note are:
- Strongly consistent reads/writes: HBase is not an “eventually consistent” DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
- Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
- Automatic RegionServer failover
- Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
- MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
- Java Client API: HBase supports an easy-to-use Java API for programmatic access (see the sketch after this list).
- Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
- Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
- Operational Management: HBase provides built-in web pages for operational insight, as well as JMX metrics.
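As a brief sketch of the Java Client API mentioned above (assuming an HBase 2.x client on the classpath, a reachable cluster, and an existing table “users” with a column family “personal” – the table, family and row names are illustrative), a basic write and read looks like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```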
HBase Application
HBase isn’t suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be “ported” to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop – but this should be considered a development configuration only.