Data are characteristics or information, usually numerical, that are collected through observation. Data are measured, collected, reported, and analyzed to create information suitable for making decisions.
Types of data
Discrete and Continuous
Attribute or discrete data – These data are based on counting, such as the number of processing errors or the count of customer complaints. Discrete data values can only be non-negative integers such as 0, 1, 2, 3, etc. Discrete data include
- Count or percentage
- Binomial data
- Attribute-Nominal
- Attribute-Ordinal
Variable or continuous data – They are measured on a continuum or scale. Data values for continuous data can be any real number: 2, 3.4691, -14.21, etc. Continuous data can be recorded at many different points and are typically physical measurements like volume, length, size, width, time, temperature, cost, etc.
Data are said to be discrete when they take on only a finite number of points that can be represented by the non-negative integers. An example of discrete data is the number of defects in a sample.
Attribute data can often be presented as variables data instead; for example, 10 scratches could be reported as a total scratch length of 8.37 inches. The ultimate goal of the data collection and the type of data are the most significant factors in the decision to collect attribute or variables data.
Cross-sectional and Time series data – Financial analysts are mostly interested in particular types of data, such as time-series data or cross-sectional data.
- Time-series data – A set of observations collected at usually discrete and equally spaced time intervals. For example, the daily closing price of a certain stock recorded over the last six weeks is time-series data.
- Cross-sectional data – Observations that come from different individuals or groups at the same point in time. For example, the closing prices of a group of 20 different tech stocks on December 15, 1986 would be cross-sectional data.
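The difference between the two views can be sketched in a few lines of Python. This is only an illustration: the tickers `AAA` and `BBB`, the prices, and the helper names `time_series` and `cross_section` are all made up for the example.

```python
from datetime import date

# Hypothetical closing prices, keyed by (ticker, trading day).
prices = {
    ("AAA", date(1986, 12, 12)): 10.0,
    ("AAA", date(1986, 12, 15)): 10.5,
    ("AAA", date(1986, 12, 16)): 10.2,
    ("BBB", date(1986, 12, 12)): 20.0,
    ("BBB", date(1986, 12, 15)): 19.5,
    ("BBB", date(1986, 12, 16)): 19.8,
}

def time_series(prices, ticker):
    """One entity observed at successive points in time, in date order."""
    return sorted((day, p) for (t, day), p in prices.items() if t == ticker)

def cross_section(prices, day):
    """Many entities observed at the same point in time."""
    return {t: p for (t, d), p in prices.items() if d == day}

aaa_series = time_series(prices, "AAA")             # time-series data
dec15 = cross_section(prices, date(1986, 12, 15))   # cross-sectional data
```

The same table of observations yields either kind of data; what changes is whether time or the entity is held fixed.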
Population and Sample Data
When we hear the term “population,” we usually think of the people in our town, region, state or country and their characteristics, such as gender, age, marital status, ethnic membership, religion and so forth. In statistics, the term “population” takes on a slightly different meaning: it comprises all members of a defined group that we are studying or collecting information on for data-driven decisions.
A segment of the population is called a sample. It is a portion of the population, a slice of it that carries the characteristics of the population.
A population includes all of the elements from a set of data. A sample consists of one or more observations from the population.
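As a rough sketch of the distinction, the snippet below builds a made-up population of 1,000 order cycle times and draws a sample of 50 from it. The numbers, the seed, and the variable names are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: cycle times (minutes) for all 1,000 orders.
population = [round(rng.gauss(30, 5), 1) for _ in range(1000)]

# A sample: 50 observations drawn from the population without replacement.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)  # uses every element
sample_mean = statistics.mean(sample)          # an estimate from a subset
```

The sample mean is only an estimate of the population mean; how close it lands depends on the sample size and on how the sample was drawn.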
Converting Data Types – Continuous data tend to be more precise because of their decimal places, but they sometimes need to be converted into discrete data. Because continuous data contain more information than discrete data, information is lost during conversion to discrete data.
Discrete data cannot be converted to continuous data, because a count does not record how much deviation from a standard exists. Instead of measuring that deviation, the user may choose to retain the discrete data because it is easier to use. Converting variable data to attribute data may allow a quicker assessment, but the risk is that information will be lost when the conversion is made.
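The loss of information can be seen in a minimal sketch. The shaft diameters and the specification limits `LOWER` and `UPPER` are hypothetical values, and `to_attribute` is an illustrative helper rather than a standard function.

```python
# Hypothetical shaft diameters in millimetres (continuous data).
diameters = [9.97, 10.02, 10.08, 9.91, 10.00, 10.11]

LOWER, UPPER = 9.95, 10.05  # assumed specification limits

def to_attribute(value, lower=LOWER, upper=UPPER):
    """Collapse a measurement into a pass/fail attribute."""
    return "pass" if lower <= value <= upper else "fail"

attributes = [to_attribute(d) for d in diameters]
defect_count = attributes.count("fail")  # discrete data: a simple count
```

After conversion, 10.08 and 10.11 are both just "fail"; the list of attributes no longer says how far each part was from the limits, so the original measurements cannot be reconstructed from it.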
Data Structuring – It refers to structuring of data elements and is classified as
- Structured data – Any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets. Structured data depends on creating a data model.
- Semi-structured data – A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure; examples are XML and JSON.
- Unstructured data – Information that doesn’t reside in a traditional row-column database. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents.
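The three categories can be contrasted with a small sketch using only Python's standard library. The order records, field names, and e-mail text are invented for the example.

```python
import csv
import io
import json

# Structured: fixed fields in every record (hypothetical order table).
csv_text = "order_id,customer,amount\n1,Ann,25.50\n2,Raj,40.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; records may differ in shape.
json_text = """[
  {"order_id": 1, "customer": {"name": "Ann"}, "tags": ["gift"]},
  {"order_id": 2, "customer": {"name": "Raj", "email": "raj@example.com"}}
]"""
docs = json.loads(json_text)

# Unstructured: free text with no fields to query directly.
email_body = "Hi, my order arrived late and the box was damaged."

csv_fields = [sorted(r.keys()) for r in rows]    # identical for every row
json_fields = [sorted(d.keys()) for d in docs]   # varies between records
```

Every CSV row has the same fields, while the JSON records carry their own keys and nesting; the e-mail body has no fields at all and would need text processing before it could be analyzed.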
Data collection methods
Data collection is based on the crucial questions of what to know, from whom to know it, and what to do with the data. Factors that ensure the data are relevant to the project include
- Person collecting the data, such as a team member, associate, subject matter expert, etc.
- Type of data to collect, such as cost, errors, ratings, etc.
- Time duration, such as hourly, daily, batch-wise, etc.
- Data source, such as reports, observations, surveys, etc.
- Cost of collection
A few types of data collection methods include:
- Check sheets – A check sheet is a structured, well-prepared form for collecting and analyzing data, consisting of a list of items and some indication of how often each item occurs. There are several types of check sheets: confirmation check sheets for confirming whether all steps in a process have been completed, process check sheets to record the frequency of observations within a range of measurement, defect check sheets to record the observed frequency of defects, and stratified check sheets to record the observed frequency of defects by defect type and one other criterion. A check sheet is easy to use, provides a choice of observations and is good for determining frequency over time. It should be used to collect observable data from a process when the collection is managed by the same person or at the same location.
- Coded data – It is used when too many digits would have to be recorded into small blocks, when large sequences of digits are captured from a single observation, or when rounding errors are observed while recording numbers with many digits. It is also used if numeric data are used to represent attribute data, or if the quantity of data is not enough for statistical significance in the sample size. Various types of coded data collection are
- Truncation coding – Storing only the digits that vary, such as 3, 2 or 9 for 1.0003, 1.0002 and 1.0009
- Substitution coding – It stores fractional observations as integers, like recording 32-3/8 inches as 3, the number of 1/8-inch increments above 32 inches.
- Category coding – Using a code for category like “S” for scratch
- Adding/subtracting a constant or multiplying/dividing by a factor – It is usually used for encoding or decoding
- Automatic measurements – A computer or electronic equipment gathers the data without human intervention, such as the radioactivity level in a nuclear reactor.
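The coding schemes listed under coded data can be sketched as follows. This is an illustration under stated assumptions: substitution coding is read here as counting 1/8-inch increments above a 32-inch base, the category codes other than “S” are invented, and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def truncation_code(value, base=1.0, scale=10_000):
    """Keep only the digits that vary: 1.0003 -> 3."""
    return round((value - base) * scale)

def substitution_code(measurement, base=32, unit=Fraction(1, 8)):
    """Count of `unit` increments above `base` (assumed reading)."""
    return int((measurement - base) / unit)

def substitution_decode(code, base=32, unit=Fraction(1, 8)):
    """Recover the original measurement from its integer code."""
    return base + code * unit

truncated = [truncation_code(v) for v in (1.0003, 1.0002, 1.0009)]
coded = substitution_code(Fraction(32) + Fraction(3, 8))  # 32-3/8 inches

# Category coding: "S" for scratch; "D" and "C" are hypothetical codes.
observations = ["S", "D", "S", "C", "S", "D"]
tally = Counter(observations)  # a defect check sheet in miniature
```

Coding by a constant and a factor is reversible as long as the base and unit are recorded alongside the data, which is why `substitution_decode` takes the same parameters as the encoder.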
Data Management
A few important data management terms are
- Data quality – It refers to the level of quality of data. Data are generally considered high quality if they are appropriate for their intended uses in operations, decision making and planning.
- Data cleansing – Data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
- Data validation – It is the process of ensuring that a program operates on clean, correct and useful data. It uses routines, often called “validation rules”, “validation constraints” or “check routines”, that check for the correctness, meaningfulness and security of data that are input to the system.
- Data integrity – It refers to maintaining and assuring the accuracy and consistency of data over its entire life-cycle, and is a critical aspect to the design, implementation and usage of any system which stores, processes, or retrieves data.
- Data governance – It is a control ensuring that data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition and data integrity constraints in the data model.
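Validation rules and cleansing can be sketched as a small check routine. The record layout, the allowed status values, and the `validate` function are assumptions made for this example, not a prescribed rule set.

```python
def validate(record):
    """Return a list of rule violations for one record (empty if clean)."""
    errors = []
    if not isinstance(record.get("order_id"), int):
        errors.append("order_id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("status") not in {"open", "shipped", "closed"}:
        errors.append("status must be open, shipped, or closed")
    return errors

# Hypothetical input: the second record breaks all three rules.
records = [
    {"order_id": 1, "amount": 25.5, "status": "open"},
    {"order_id": "2", "amount": -4, "status": "lost"},
]

# Cleansing step: keep valid records, set aside the rest with reasons.
clean = [r for r in records if not validate(r)]
rejected = [(r, validate(r)) for r in records if validate(r)]
```

Keeping the reasons next to each rejected record makes it possible to correct the record rather than simply discard it.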
Techniques for Assuring Data Accuracy and Integrity
Data integrity and accuracy have a crucial role in the data collection process, as they ensure the usefulness of the data being collected. Data integrity determines whether the information being measured truly represents the desired attribute, and data accuracy determines the degree to which individual or average measurements agree with an accepted standard or reference value.
Data integrity is doubtful if the data collected do not fulfill their purpose. For example, data on finished-goods departures should be gathered from truck departures; if the data are instead recorded on a computing device in the warehouse, their integrity is doubtful. Similarly, data accuracy is doubtful if the measurement device does not conform to the laid-down device standards.
Bad data can be avoided by following a few precautions, such as avoiding emotional bias relative to tolerances, avoiding unnecessary rounding, and screening data to detect and remove data entry errors.
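Screening and accuracy against a reference value can be sketched as below. The reference value, the readings, the tolerance, and the `screen` helper are all hypothetical; a real study would choose the screening rule from the measurement process itself.

```python
import statistics

REFERENCE = 50.00  # accepted reference value (assumed for the example)

# Hypothetical repeated measurements; 500.2 looks like a keying error.
readings = [50.02, 49.98, 50.05, 500.2, 50.01, 49.99]

def screen(values, reference, tolerance):
    """Separate plausible readings from likely data-entry errors."""
    kept = [v for v in values if abs(v - reference) <= tolerance]
    flagged = [v for v in values if abs(v - reference) > tolerance]
    return kept, flagged

kept, flagged = screen(readings, REFERENCE, tolerance=1.0)

# Accuracy: average measurement minus the reference value,
# computed without rounding the individual readings first.
bias = statistics.mean(kept) - REFERENCE
```

Flagged values are returned rather than silently dropped, so that each one can be checked against the original record before it is removed.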
Digital Data
With the change and spread of technology, companies are moving towards digital marketing as consumers move towards e-commerce and mobile commerce. The availability of low-cost internet access and devices has also spurred this shift amongst consumers. Digital data, like the HTML footprints that consumers leave behind when they visit a website, or social media data, have significant value over traditional tools of analytics in multiple ways.
Big Data
Big data is a broad term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.
Big data is a large volume of unstructured data that can’t be handled by standard database management systems such as DBMS, RDBMS or ORDBMS. It is a very large, loosely structured data set that defies traditional storage. A few examples are
- Facebook: 40 PB of data, captures 100 TB/day
- Yahoo: 60 PB of data
- Twitter: 8 TB/day
- eBay: 40 PB of data, captures 50 TB/day
In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information.
- Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s text-heavy. Good examples of unstructured data are Metadata, Twitter tweets, and other social media posts.
- Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.
Big Data is usually characterized by the following “V” attributes
- Volume – The data being handled are so voluminous that they frequently exceed a server’s storage and processing capacity. When vertically scalable solutions are not acceptable because of cost or zero-downtime requirements, horizontally scalable solutions are used instead.
- Variety – Data from different sources are aggregated, i.e. from online, mobile and social media sources and from ubiquitous sensors.
- Veracity – It refers to the lack of clarity or certainty in the data. The data are not well-structured relational data such as transactions; hence, companies must be able to store any data in a form that can be analyzed.
- Velocity – It refers to the speed needed to analyze the data and make decisions in step with the data being generated.
Big data can come from multiple sources, such as
- Web Data – It still counts as big data
- Click stream data – It is important in online advertising and e-commerce
- Sensor Data – Sensors embedded in roads to monitor traffic, and in miscellaneous other applications, generate a large volume of data
- Connected Devices – Smart phones are a great example.
- Social network profiles or Social media data – Sites like Facebook, Twitter, LinkedIn generate a large amount of data.
- Social influencers – Editor, analyst and subject-matter expert blog comments, user forums, Twitter & Facebook “likes,” Yelp-style catalog and review sites, and other review-centric sites like Apple’s App Store, Amazon, etc.
- Activity-generated data – Computer and mobile device log files, aka “The Internet of Things.” This category includes web site tracking information, application logs, and sensor data.
- Public – Microsoft Azure Marketplace/DataMarket, The World Bank, SEC/EDGAR, Wikipedia, IMDb, etc.