Data model and terms

Apache Cassandra

Cassandra is essentially a hybrid between a key-value and a row-oriented (or tabular) database.


A column family resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, a value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.


Each key in Cassandra corresponds to a value, which is an object. A key’s value is made up of columns, and columns are grouped together into sets called column families. Column families can, in turn, be grouped into super column families.

Thus, each key identifies a row with a variable number of elements. These column families can then be thought of as tables: a table in Cassandra is a distributed multi-dimensional map indexed by a key.

Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family.
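As a rough mental model (plain Python dictionaries with invented row keys and column names, not Cassandra’s actual storage format or API), a column family behaves like a map from row keys to maps of columns, and different rows are free to carry different column sets:

import time

# Toy in-memory model of a column family:
# row key -> { column name -> (value, timestamp) }.
users = {
    "alice": {
        "email": ("alice@example.com", time.time()),
        "city":  ("Berlin", time.time()),
    },
    "bob": {
        # A row in the same column family may hold a completely different set of columns.
        "email":   ("bob@example.com", time.time()),
        "twitter": ("@bob", time.time()),
        "signup":  ("2013-04-01", time.time()),
    },
}

# Adding a column to a single row at any time is just another map entry.
users["alice"]["twitter"] = ("@alice", time.time())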


Cassandra’s data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table’s primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.

Tables may be created, dropped, and altered at runtime without blocking updates and queries.

Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. Rather, Cassandra emphasizes denormalization through features like collections.
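As a sketch of what this looks like in practice, the snippet below uses the DataStax Python driver (pip install cassandra-driver) against a single node assumed to be reachable on localhost; the demo keyspace, the user_posts table and all column names are invented for illustration only:

from datetime import datetime
from cassandra.cluster import Cluster  # DataStax Python driver

# Assumes a Cassandra node on 127.0.0.1; keyspace and table names are made up.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS demo.user_posts (
        username   text,        -- partition key: determines which node(s) store the row
        posted_at  timestamp,   -- clustering column: orders rows within the partition
        title      text,
        tags       set<text>,   -- collection column, a typical denormalization aid
        PRIMARY KEY ((username), posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC)
""")

# Denormalized insert: everything needed to display the post lives in one row.
session.execute(
    "INSERT INTO demo.user_posts (username, posted_at, title, tags) VALUES (%s, %s, %s, %s)",
    ("alice", datetime(2014, 5, 1, 10, 0), "Hello Cassandra", {"intro", "nosql"}),
)

for row in session.execute(
    "SELECT posted_at, title, tags FROM demo.user_posts WHERE username = %s", ("alice",)
):
    print(row.posted_at, row.title, row.tags)

cluster.shutdown()

Here the composite PRIMARY KEY ((username), posted_at) makes username the partition key and posted_at a clustering column, so all of a user’s posts land in one partition, sorted newest first.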


Data model details
On the surface, Cassandra’s data model seems quite relational. With this in mind, diving deeper into ColumnFamilies, SuperColumns and the like can make Cassandra look like an unfinished RDBMS, lacking features such as JOINs and most rich-query capabilities.

To understand why databases like Cassandra, HBase and BigTable (I’ll call them DSS, Distributed Storage Services, from now on) were designed the way they are, we’ll first have to understand what they were built to be used for.

DSS were designed to handle enormous amounts of data, stored in billions of rows on large clusters. Relational databases incorporate a lot of things that make it hard to distribute them efficiently over multiple machines. DSS simply remove some or all of these ties. No operations are allowed that require scanning extensive parts of the dataset, meaning no JOINs or rich queries.

There are only two ways to query: by key or by key range. The reason DSS keep their data model to the bare minimum is that a single table is far easier to distribute over multiple machines than several normalized relations or graphs.
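To make the two access paths concrete, here is a sketch reusing the hypothetical demo.user_posts table from the earlier snippet, under the same assumptions (local node, DataStax Python driver):

from datetime import datetime
from cassandra.cluster import Cluster

# Reuses the invented demo.user_posts table from the sketch above.
session = Cluster(["127.0.0.1"]).connect("demo")

# 1. Query by key: the partition key routes the request straight to the owning replicas.
by_key = session.execute(
    "SELECT title FROM user_posts WHERE username = %s", ("alice",)
)

# 2. Query by key range: either a slice over the clustering column within one partition ...
by_slice = session.execute(
    "SELECT title FROM user_posts WHERE username = %s AND posted_at >= %s",
    ("alice", datetime(2014, 1, 1)),
)

# ... or a range over the row keys themselves, expressed via their tokens (used for scans/paging).
by_token = session.execute(
    "SELECT username, title FROM user_posts WHERE token(username) > token('alice')"
)

# Arbitrary predicates or joins would require scanning large parts of the cluster,
# which is exactly what this data model rules out.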

Think of the ColumnFamily model as a (distributed Hash-)Map with up to three dimensions. The two-dimensional setup consists of just a ColumnFamily with some columns in it, “some” meaning a couple of billion if you so wish. So a ColumnFamily is just a map of columns.

I have yet to figure out why, but it seems as if all these terms are just names for different dimensions of a map. A three-dimensional Cassandra “table” would be achieved by putting SuperColumns into a ColumnFamily, thus making it a SuperColumnFamily (please hold back any cries of astonishment), a map of a map of columns.

In this setup, the SuperColumnFamily would represent the highest dimension and the SuperColumn would represent the two remaining dimensions, taking the place of the ColumnFamily in the previous example. This multi-dimensional map contains columns, triplets consisting of a name, a value and a timestamp.
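A minimal sketch of those dimensions, again with plain Python dictionaries and invented names (a mental model only, not how Cassandra lays the data out on disk):

import time

# Column: name -> (value, timestamp).
# ColumnFamily (two dimensions): row key -> columns.
# SuperColumnFamily (three dimensions): row key -> super column -> columns.
super_column_family = {
    "alice": {
        "address": {"city": ("Berlin", time.time()), "zip": ("10115", time.time())},
        "contact": {"email": ("alice@example.com", time.time())},
    },
}

# Reading one column means walking the three dimensions in order:
value, timestamp = super_column_family["alice"]["address"]["city"]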

Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its own unique key. Each row can hold up to 2 billion columns. Furthermore, each row must fit onto a single server, because data is partitioned solely by row key. Some other limitations apply, but in most cases they should not concern you.
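A deliberately simplified sketch of why that is: placement is derived from the row key alone. Cassandra really uses the Murmur3 partitioner on a consistent-hashing token ring; the MD5-plus-modulo below, with made-up node names, only illustrates the principle:

import hashlib

# Simplified: the row key alone decides where a row lives.
# Real Cassandra hashes the key with the Murmur3 partitioner onto a token ring;
# the modulo arithmetic here only mimics that idea.
NODES = ["node-a", "node-b", "node-c"]

def owning_node(row_key: str) -> str:
    digest = int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# All columns of a row hash to the same place, so the entire row is stored together
# on that node -- and therefore has to fit on a single server.
for key in ("alice", "bob", "carol"):
    print(key, "->", owning_node(key))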

Various terminology used around Apache Cassandra and big data is summarized below –

A

Aggregation – a process of searching, gathering and presenting data
Algorithms – mathematical formulas or procedures that can perform certain analyses on data
Analytics – the discovery of insights in data
Anomaly detection – the search for data items in a dataset that do not match a projected pattern or expected behaviour. Anomalies are also called outliers, exceptions, surprises or contaminants and they often provide critical and actionable information.
Anonymization – making data anonymous; removing all data points that could be used to identify a person
Application – computer software that enables a computer to perform a certain task
Artificial Intelligence – developing intelligent machines and software that are capable of perceiving the environment, taking corresponding action when required and even learning from those actions.

B

Behavioural Analytics – analytics that informs about the how, why and what instead of just the who and when. It looks at humanized patterns in the data
Big Data Scientist – someone who is able to develop the algorithms to make sense out of big data
Big data startup – a young company that has developed new big data technology
Biometrics – the identification of humans by their characteristics
Brontobytes – approximately 1000 Yottabytes and the size of the digital universe tomorrow. A Brontobyte is a 1 followed by 27 zeros
Business Intelligence – the theories, methodologies and processes to make data understandable

C

Classification analysis – a systematic process for obtaining important and relevant information about data, also called metadata: data about data.
Cloud computing – a distributed computing system over a network used for storing data off-premises
Clustering analysis – the process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data.
Cold data storage – storing old data that is hardly used on low-power servers. Retrieving the data will take longer
Comparative analysis – a step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.
Complex structured data – data that are composed of two or more complex, complicated, and interrelated parts that cannot be easily interpreted by structured query languages and tools.
Computer generated data – data generated by computers such as log files
Concurrency – performing and executing multiple tasks and processes at the same time
Correlation analysis – the analysis of data to determine a relationship between variables and whether that relationship is negative (- 1.00) or positive (+1.00).
Customer Relationship Management – managing the sales and business processes, big data will affect CRM strategies

D

Dashboard – a graphical representation of the analyses performed by the algorithms
Data aggregation tools – tools that transform scattered data from numerous sources into a single new source.
Data analyst – someone analysing, modelling, cleaning or processing data
Database – a digital collection of data stored via a certain technique
Database-as-a-Service – a database hosted in the cloud on a pay-per-use basis, for example Amazon Web Services
Database Management System – collecting, storing and providing access to data
Data centre – a physical location that houses the servers for storing data
Data cleansing – the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency
Data custodian – someone who is responsible for the technical environment necessary for data storage
Data ethical guidelines – guidelines that help organizations be transparent with their data, ensuring simplicity, security and privacy
Data feed – a stream of data such as a Twitter feed or RSS
Data marketplace – an online environment to buy and sell data sets
Data mining – the process of finding certain patterns or information from data sets
Data modelling – the analysis of data objects using data modelling techniques to create insights from the data
Data set – a collection of data
Data virtualization – a data integration process used to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, etc.
De-identification – same as anonymization; ensuring a person cannot be identified through the data
Discriminant analysis – cataloguing of the data; distributing data into groups, classes or categories. A statistical analysis used where certain groups or clusters in data are known upfront and that uses that information to derive the classification rule.
Distributed File System – systems that offer simplified, highly available access to storing, analysing and processing data
Document Store Databases – a document-oriented database that is especially designed to store, manage and retrieve documents, also known as semi structured data.

E

Exploratory analysis – finding patterns within data without standard procedures or methods. It is a means of discovering the data and finding the data set’s main characteristics.
Exabytes – approximately 1000 petabytes or 1 billion gigabytes. Today we create one Exabyte of new information globally on a daily basis.
Extract, Transform and Load (ETL) – a process in databases and data warehousing that involves extracting the data from various sources, transforming it to fit operational needs and loading it into the database

F

Failover – switching automatically to a different server or node should one fail
Fault-tolerant design – a system designed to continue working even if certain parts fail

G

Gamification – using game elements in a non-game context; very useful for creating data and therefore coined the friendly scout of big data
Graph Databases – they use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring elements.
Grid computing – connecting different computer systems from various locations, often via the cloud, to reach a common goal

H

Hadoop – an open-source framework that is built to enable the processing and storage of big data across a distributed file system
HBase – an open source, non-relational, distributed database running in conjunction with Hadoop
HDFS – Hadoop Distributed File System; a distributed file system designed to run on commodity hardware
High-Performance-Computing (HPC) – using supercomputers to solve highly complex and advanced computing problems

I

In-memory – a database management system that stores data in main memory instead of on disk, resulting in very fast processing, storing and loading of the data
Internet of Things – ordinary devices that are connected to the internet anytime, anywhere via sensors

J

Juridical data compliance – relevant when you use cloud solutions where the data is stored in a different country or continent. Be aware that data stored in a different country has to comply with the laws of that country.

K

KeyValue Databases – they store data with a primary key, a uniquely identifiable record, which makes lookups easy and fast. The data stored in a key-value store is normally some kind of primitive of the programming language.

L

Latency – a measure of time delayed in a system
Legacy system – an old system, technology or computer system that is not supported any more
Load balancing – distributing workload across multiple computers or servers in order to achieve optimal results and utilization of the system
Location data – GPS data describing a geographical location
Log file – a file automatically created by a computer to record events that occur while operational

M

Machine2Machine data – data exchanged between two or more machines that are communicating with each other
Machine data – data created by machines via sensors or algorithms
Machine learning – part of artificial intelligence where machines learn from what they are doing and become better over time
MapReduce – a software framework for processing vast amounts of data
Massively Parallel Processing (MPP) – using many different processors (or computers) to perform certain computational tasks at the same time
Metadata – data about data; gives information about what the data is about.
MongoDB – an open-source NoSQL database
Multi-Dimensional Databases – a database optimized for online analytical processing (OLAP) applications and for data warehousing.
MultiValue Databases – a type of NoSQL and multidimensional database that understands three-dimensional data directly. They are primarily giant strings that are perfect for manipulating HTML and XML strings directly

N

Natural Language Processing – a field of computer science involved with interactions between computers and human languages
Network analysis – viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.
NewSQL – a class of modern relational database systems that aim to combine the scalability of NoSQL systems with the transactional guarantees of traditional SQL databases; the term is even newer than NoSQL
NoSQL – sometimes referred to as ‘Not only SQL’; a class of databases that do not adhere to traditional relational database structures and that often relax consistency in order to achieve higher availability and horizontal scaling.

O

Object Databases – they store data in the form of objects, as used by object-oriented programming. They are different from relational or graph databases and most of them offer a query language that allows objects to be found with a declarative programming approach.
Object-based Image Analysis – analysing digital images can be performed with data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.
Operational Databases – they carry out regular operations of an organisation and are generally very important to a business. They generally use online transaction processing that allows them to enter, collect and retrieve specific information about the company.
Optimization analysis – the process of optimization during the design cycle of products done by algorithms. It allows companies to virtually design many different variations of a product and to test that product against pre-set variables.
Ontology – ontology represents knowledge as a set of concepts within a domain and the relationships between those concepts
Outlier detection – an outlier is an object that deviates significantly from the general average within a dataset or a combination of data. It is numerically distant from the rest of the data and therefore, the outlier indicates that something is going on and generally therefore requires additional analysis.

P

Pattern Recognition – identifying patterns in data via algorithms to make predictions of new data coming from the same source.
Petabytes – approximately 1000 terabytes or 1 million gigabytes. The CERN Large Hadron Collider generates approximately 1 petabyte per second
Platform-as-a-Service – a service providing all the necessary infrastructure for cloud computing solutions
Predictive analysis – the most valuable analysis within big data, as it helps predict what someone is likely to buy, visit or do, or how someone will behave in the (near) future. It uses a variety of different data sets such as historical, transactional, social or customer profile data to identify risks and opportunities.
Privacy – to seclude certain data / information about oneself that is deemed personal
Public data – public information or data sets that were created with public funding

Q

Quantified Self – a movement to use applications to track one’s every move during the day in order to gain a better understanding of one’s behaviour
Query – asking for information to answer a certain question

R

Re-identification – combining several data sets to find a certain person within anonymized data
Regression analysis – to define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.
RFID – Radio Frequency Identification; a type of sensor using wireless non-contact radio-frequency electromagnetic fields to transfer data
Real-time data – data that is created, processed, stored, analysed and visualized within milliseconds
Recommendation engine – an algorithm that suggests certain products based on previous buying behaviour or buying behaviour of others
Routing analysis – finding the optimized routing using many different variables for a certain means of transport in order to decrease fuel costs and increase efficiency.

S

Semi-structured data – a form of data that does not have the formal structure of structured data. It does, however, have tags or other markers to enforce a hierarchy of records.
Sentiment Analysis – using algorithms to find out how people feel about certain topics
Signal analysis – the analysis of measurements of time-varying or spatially varying physical quantities to analyse the performance of a product. Especially used with sensor data.
Similarity searches – finding the closest object to a query in a database, where the data object can be of any type of data.
Simulation analysis – a simulation is the imitation of the operation of a real-world process or system. A simulation analysis helps to ensure optimal product performance taking into account many different variables.
Smart grid – refers to using sensors within an energy grid to monitor what is going on in real-time helping to increase efficiency
Software-as-a-Service – a software tool that is used over the web via a browser
Spatial analysis – refers to analysing spatial data such as geographic data or topological data to identify and understand patterns and regularities within data distributed in geographic space.
SQL – a query language for managing and retrieving data in a relational database
Structured data – data that is identifiable as it is organized in structure like rows and columns. The data resides in fixed fields within a record or file or the data is tagged correctly and can be accurately identified.

T

Terabytes – approximately 1000 gigabytes. A terabyte can store up to 300 hours of high-definition video
Time series analysis – analysing well-defined data obtained through repeated measurements of time. The data has to be well defined and measured at successive points in time spaced at identical time intervals.
Topological Data Analysis – focusing on the shape of complex data and identifying clusters and any statistical significance that is present within that data.
Transactional data – dynamic data that changes over time
Transparency – consumers want to know what happens with their data and organizations have to be transparent about that

U

Unstructured data – data that is in general text-heavy, but may also contain dates, numbers and facts.

V

Value – all that available data will create a lot of value for organizations, societies and consumers. Big data means big business and every industry will reap the benefits from big data.
Variability – the meaning of the data can change (rapidly); in (almost) identical tweets, for example, a word can have a totally different meaning
Variety – data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data
Velocity – the speed at which the data is created, stored, analysed and visualized
Veracity – organizations need to ensure that the data is correct as well as the analyses performed on the data are correct. Veracity refers to the correctness of the data
Visualization – with the right visualizations, raw data can be put to use. Visualizations of course do not mean ordinary graphs or pie-charts. They mean complex graphs that can include many variables of data while still remaining understandable and readable
Volume – the amount of data, ranging from megabytes to brontobytes

W

Weather data – an important open public data source that can provide organisations with a lot of insights if combined with other sources

X

XML Databases – XML Databases allow data to be stored in XML format. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported and serialized into any format needed.

Y

Yottabytes – approximately 1000 Zettabytes, or 250 trillion DVDs. The entire digital universe today is 1 Yottabyte and this will double every 18 months.

Z

Zettabytes – approximately 1000 Exabytes or 1 billion terabytes. It is expected that in 2016 over 1 zettabyte will cross our networks globally on a daily basis.
