Test data management (TDM) is the process of planning, designing, storing, and managing the data used in software quality-testing processes and methodologies. It gives the software quality and testing team control over the data, files, rules, and policies produced during the entire software-testing life cycle. Test data management is also known as software test data management.
The main objective of test data management is to support checking and testing the quality of the software. Throughout the software testing life cycle, it controls the files, rules, and other artifacts produced during processing.
It separates test data from production data, minimizes and optimizes the size of test data sets, and generates testing reports. Test data management tools are used to implement the process.
The TDM Process
A tester cannot simply claim “there are probably defects” in a system without attempting to identify and report them. They must interact with the system and replicate the potential defects that have been found. Similarly, a tester cannot provide adequate results without access to the relevant systems and an appropriate sample of the data those systems use. For data to return the most value, it must be managed using quality processes. The key phases involved in a TDM process are:
- Planning
- Analysis
- Design
- Build
- Maintenance
Planning Phase
- Assign Test Data Manager (TDM)
- Define data requirements and templates for data management
- Prepare documentation including list of tests and data landscape reference
- Establish a service level agreement
- Set up the test data management team
- Obtain sign-off on plans and supporting documents
Analysis Phase
- Perform initial setup and sync exercises, including data profiling for each individual data store
- Assign and record version numbers for existing data in all environments
- Collection/consolidation of data requirements
- Update project lists
- Analyze data requirements and latest distribution log
- Assess gaps and the impact of data modification
- Define data security, backup, storage, and access policies
- Prepare reports
Design Phase
- Decide strategy for data preparation
- Identify regions needing data to be loaded/refreshed
- Identify appropriate methods
- Identify data sources and providers
- Identify tools
- Data Distribution plans
- Coordination/communication plan
- Test activities plan
- Document for data plan
Build Phase
- Execute plans
- Execute masking/de-identification where applicable
- Back up data
- Update logs
Maintenance Phase
- Support change requests, unplanned data needs, problems/incidents
- Prioritize requests where applicable
- Analyze requirements and consider if they can be met from existing/modified current data including data assigned to other projects
- Perform required data modifications
- Back up new data
- Assign version markers and log with appropriate description
- Review status of ongoing projects
- Data profile exercises
- Assess/address gaps
- Refresh data where needed
- Schedule and communicate maintenance
- If necessary, redirect requests
- Documentation and reports
TDM Checklist
- Identify common test data elements
- Aging, masking and archiving of test data
- Prioritization and allocation of test data
- Generating reports and dashboards for metrics
- Creating and implementing business rules
- Building an automation suite for master data preparation
- Masking, archiving, versioning, and aging of data
TDM Challenges and Solutions
There are many challenges that can complicate the TDM process such as sensitive data masking and resource consumption. An overlooked challenge can cause major setbacks. Several common topics for consideration have been listed below.
Challenges of Test Data Management include:
- Additional time for data set up/management instead of actual testing
- Additional administrative efforts in test data management
- Additional expense including personnel and hardware
- Inaccurate/difficult to access data negatively impacts testing
- Sensitivity of private information (credit cards, medical records, etc.)
- Storage required for test data
- Potential for data loss
- Use of real data versus fake data generated from scratch
- Poorly communicated data requests result in inadequate data returns
- Identification of data anomalies
- Conflicting test priorities
- Timely data reversions
Data masking and de-identification – Masking and de-identification are essential for complying with privacy laws and standards. Several approaches make it possible to use realistic data without compromising the confidentiality of sensitive information:
- You could remove all sensitive information, such as credit card or social security numbers, but this may leave the data unable to accurately cover test requirements.
- One method is to generate fake data from scratch that fits the appropriate format. This can be time consuming for personnel; however, an automated script can be used to quickly generate required data.
- If you need to return the data to its original format, in some circumstances, a reversible algorithm can be used to alter the data. However, if the algorithm is known or discovered this could potentially allow for the private data to be compromised.
- A numeric variance, such as +/- 10%, can be used to change information (finance, demographics, etc.) just enough to make it untrue but still valid enough for appropriate use.
- Data encryption is an extensive approach that may be less effective than it appears if access rights are granted carelessly.
- Masking out displayed values, for example replacing characters with XX or **, allows systems to continue using the data without making it available for easy access.
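Several of the approaches above can be sketched in a few lines of Python. The snippet below is illustrative only; the function names and data formats are assumptions, not part of any specific masking tool. It shows masking out a displayed value, applying a numeric variance, and generating fake data from scratch that fits a required format:

```python
import random

def mask_out(value: str, visible: int = 4) -> str:
    """Mask out a value, keeping only the last `visible` characters readable."""
    return "*" * (len(value) - visible) + value[-visible:]

def numeric_variance(amount: float, pct: float = 0.10) -> float:
    """Shift a numeric value by up to +/- pct: untrue, but still valid for testing."""
    return round(amount * (1 + random.uniform(-pct, pct)), 2)

def fake_card_number() -> str:
    """Generate fake data from scratch that fits a 16-digit card format."""
    return "".join(str(random.randint(0, 9)) for _ in range(16))
```

For example, `mask_out("4111111111111111")` returns `"************1111"`, which a system can still display and route without exposing the full number.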
Solutions to reduce challenge impact include:
- Ensure connectivity of relevant parties before data set up
- Testing environments and data requirements are well-defined
- Smaller data sets that accurately sample full data coverage
- Involved parties meet and confirm requirements are fully addressed
- Back up data and assign versions
- Log the versions with relevant details for quick reference and conversions
- Data partitions are assigned to entire teams/projects, not to individual members
- Maintain records of data distribution
- Unused data/partitions made available for other relevant projects
- Masking and de-identification of sensitive information
- Scope of project defines masking tools for complete and consistent masking with realistic representation
- Masking tools jointly decided by relevant parties
- Standard request and documentation templates
- Refresh test data as needed, including periodic updates with new extracts, to accurately cover customer data
- Subset of metadata to accommodate changes
- Regular scheduled maintenance
- Insert row and database editing changes with multilevel undo capabilities
- Cloud storage (may violate privacy protection)
- Outsourcing of processes to expert companies
- Networking with other professionals
- Automation can be used to expedite processes and lower resource cost, including:
- Masking/De-identification of sensitive information
- Comparisons between baseline and successive test runs
TDM Techniques That Empower Software Testing
- Exploring the test data
- Validating your test data
- Building test data for reusability
- Automating TDM tasks to accelerate the process
Exploring the Test Data
Data can be present in diverse forms and different formats, which can be spread across multiple systems as well. The respective teams need to search for the right data sets on the basis of their requirements and the test cases. Locating the right data in the required format and within the time constraints is absolutely critical. This intensifies the need for a robust test management tool that can deal with end-to-end business requirements for testing an application.
It is evident that manually locating and retrieving data is a tedious task that can reduce the efficiency of the process. Hence, it is important to bring into play a test data management solution that ensures effective coverage analysis and data visualization. Exploring and analyzing the data sets further helps establish an effective test data management approach.
Validating Your Test Data
As organizations implement agile methodologies, data can be sourced even from actual users. This data mostly comes via the application, a practice followed for generating and exploring the test data that QA teams leverage for conducting test cases. Hence, the test data must be protected against any breach during the development process: sensitive personal data such as names, addresses, financial information, and contact details must not be exposed.
This test data can be further simulated to create a realistic environment, which can influence the end results. Real data is necessary for testing applications; it is sourced from production databases and later masked to safeguard it. It is critical that the test data is validated and that the resulting test cases give a true picture of the production environment when the application goes live.
Eventually, test data will determine where the application breaks in the actual (real world) set-up.
Building Test Data for Reusability
Reusability is key to ensuring cost-effectiveness and maximizing testing efforts. Test data must be built and segmented to maximize its reusability. It should be accessible from a central repository, and the objective should be to reuse it as much as possible and optimize the value of work already done.
By making the data reusable, bottlenecks and issues within the data are removed and the data is fully versioned. Ultimately, no time is wasted resolving unforeseen issues with the data. Data sets are stored as reusable assets in the central repository and provided to the respective teams for further use and validation. In this way, test data is available for building test cases quickly and easily.
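The central-repository idea can be sketched in Python as follows. This is an illustrative sketch, not a real tool: the class and method names are hypothetical, and a production repository would persist data sets rather than hold them in memory.

```python
class TestDataRepository:
    """A minimal central repository: named data sets stored as versioned, reusable assets."""

    def __init__(self):
        self._store = {}  # data set name -> list of versions (each a copy of the rows)

    def publish(self, name, rows):
        """Store a copy of the data set and return the version number assigned."""
        versions = self._store.setdefault(name, [])
        versions.append([dict(r) for r in rows])  # copy, so later edits don't corrupt the asset
        return len(versions)

    def fetch(self, name, version=None):
        """Return a copy of a specific version (defaults to the latest)."""
        versions = self._store[name]
        rows = versions[-1] if version is None else versions[version - 1]
        return [dict(r) for r in rows]
```

Because every published version is kept, a team can fetch the exact data set a past test run used, which supports the version logging described earlier.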
Automation Can Accelerate the Process
Test Data Management entails scripting, data masking, data generation, cloning, and provisioning. Automation of all these activities can prove to be absolutely effective. It will not only accelerate the process, but also make it much more efficient.
During the Data Management process, the test data gets linked to a specific test, which can be fed into an automation tool that ensures that the data is provided in the expected format whenever required. Automating the process ensures quality of the test data during the development and testing process.
As with regression testing or any kind of recurring test, the production of test data itself can be automated. This helps replicate massive traffic and user volumes for an application, creating a production-like scenario for testing. It saves time in the long run, reduces effort, and helps expose any errors in the data on an ongoing basis. Ultimately, the QA team is in a better position to streamline and validate its test data management efforts.
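Automated test data production can be sketched as below. The record shape and names are illustrative assumptions; the point is that a fixed seed makes a large synthetic data set reproducible, so every test run sees the same "traffic":

```python
import random
import string

def generate_users(count, seed=42):
    """Deterministically generate synthetic user records to simulate production-scale load."""
    rng = random.Random(seed)  # fixed seed -> the same data set on every run
    users = []
    for i in range(count):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        users.append({"id": i + 1, "email": f"{name}@example.test"})
    return users

batch = generate_users(10_000)  # large enough to mimic production volume
```

Because generation is deterministic, a defect found against this data can be replayed exactly in a later run.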
Test Data Management Tools
- Informatica
- CA Test Data Manager (Datamaker)
- InfoSphere Optim
- Solix EDMS
- vTestcenter
- TechArcis
- SAP Test Data Migration Server
Best Practices
Implementing a test data management approach involves a few steps; applying the following five best practices can help simplify the testing process before an application goes to production.
Discover and understand the test data – Data is scattered across systems and resides in different formats. In addition, different rules may be applied to data depending on its type and location. Organizations should identify their test data requirements based on the test cases, which means they must capture the end-to-end business process and the associated data for testing. Capturing the proper test data could involve a single application or multiple applications. For example, a business may have a customer relationship management (CRM) system, an inventory management application, and a financial application that are all related and require test data.
Extract a subset of production data from multiple data sources – Extracting a subset of data is designed to ensure realistic, referentially intact test data from across a distributed data landscape without added cost or administrative challenges. In addition, the best approaches to collecting a data subset include obtaining metadata in the subset to accommodate data model changes quickly and accurately. In this way, obtaining a subset creates realistic test databases small enough to support rapid test runs but large enough to accurately reflect the variety of production data. Part of an automated subset gathering process involves creating test data to force error and boundary conditions, which includes inserting rows and editing database tables along with multilevel undo capabilities.
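A referentially intact subset can be sketched in Python as follows. This is an illustrative sketch; the table and column names (`customers`, `orders`, `customer_id`) are assumptions standing in for a real distributed data landscape:

```python
def subset_with_integrity(customers, orders, sample_ids):
    """Extract a subset of customers plus only the orders that reference them,
    so the smaller test database stays referentially intact."""
    kept_customers = [c for c in customers if c["id"] in sample_ids]
    kept_ids = {c["id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders
```

The resulting subset is small enough for rapid test runs, yet every foreign key still resolves, which is the property that naive row sampling breaks.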
Mask or de-identify sensitive test data – Masking helps secure sensitive corporate, client, and employee information and also helps ensure compliance with government and industry regulations. Capabilities for de-identifying confidential data must provide a realistic look and feel, and should consistently mask complete business objects such as customer orders across test systems.
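One way to achieve the consistency requirement above, sketched here in Python (the function name, token format, and salt are illustrative assumptions), is deterministic pseudonymization: the same input always maps to the same token, so one customer remains the same customer across every test system:

```python
import hashlib

def pseudonymize(value, salt="tdm-demo"):
    """Map a sensitive value to a stable token. The same input always yields
    the same output, so masked records stay consistent across test systems."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "CUST_" + digest[:10]
```

Note that the salt must be kept secret and the mapping is intentionally one-way; if data must be restored later, a reversible algorithm (with its attendant risks, as noted earlier) is needed instead.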
Automate expected and actual result comparisons – The ability to identify data anomalies and inconsistencies during testing is essential in measuring the overall quality of the application. The most efficient way to achieve this goal is by employing an automated capability for comparing the baseline test data against results from successive test runs—speed and accuracy are essential. Automating these comparisons helps save time and identify problems that might otherwise go undetected.
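An automated baseline comparison can be sketched in Python as follows (illustrative only; the result shape, keyed by record id, is an assumption):

```python
def compare_runs(baseline, actual):
    """Compare baseline test results against a successive run, keyed by record id,
    and report anomalies: missing, unexpected, and changed records."""
    missing = sorted(set(baseline) - set(actual))
    unexpected = sorted(set(actual) - set(baseline))
    changed = sorted(k for k in baseline.keys() & actual.keys()
                     if baseline[k] != actual[k])
    return {"missing": missing, "unexpected": unexpected, "changed": changed}
```

Run after every test cycle, a comparison like this surfaces data anomalies that a manual spot-check would likely miss.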
Refresh test data – During the testing process, test data often diverges from the baseline, which can result in a less-than-optimal test environment. Refreshing test data helps improve testing efficiencies and streamline the testing process while maintaining a consistent, manageable test environment.
Case study: The importance of test data management
Proper test data management can be an essential process for cost-effective continuous testing. Consider the following scenario in a US insurance company. The director of software quality was fed up because lead project managers and quality assurance (QA) staff were complaining almost daily about the amount of time they spent acquiring, validating, organizing, and protecting test data.
Complicated front-end and back-end systems in this scenario consistently caused budget overruns. Contingency plans were being built into project schedules because the team expected test data failures and reworking. Project teams added 15 percent to all test estimates to account for the effort to collect data from back-end systems, and 10 percent of all test scenarios were not executed because of missing or incomplete test data. Costly production defects were the result.
With 42 back-end systems needed to generate a full end-to-end system test, the organization in this example could not confidently launch new features. Testing in production was becoming the norm. In fact, claims could not be processed in certain states because of application defects that the teams skipped over during the testing process. Moreover, IT was consuming an increasing number of resources, yet application quality was declining rapidly.
The insurance company in this scenario clearly lacked a test data management strategy aligned to business results. Something had to change. The director of software quality assembled a cross-functional team and asked the following tough questions:
- What is required to create test data?
- How much does test data creation cost?
- How far does the problem extend?
- How is the high application defect rate affecting the business?
Finding the answers to these questions was an involved process. No one had a complete understanding of the full story.
Through the analysis process, the team in this scenario discovered that requests for test data came too late, with too many redundancies. There were no efficient processes to provide test data for all of them. Teams would use old test data because of the effort involved in getting new test data, but using old test data often resulted in a high number of defects. In addition, the security risks of exposing sensitive data during testing were rampant.
After fully analyzing the problems, the team in this example concluded that with every new USD14 million delivery, a hidden USD3 million was spent on test data management. Hidden costs were attributed to the following sources:
- Labor required to move data to and from back-end systems and to identify the right data required for tests
- Time spent manipulating data so it would work for various testing scenarios
- Storage space for the test data
- Production defects not tested because test data was not available
- Masking sensitive data to protect privacy
- Skipped test scenarios
After implementing a process to govern test data management, the insurance company in this scenario was able to reduce the costs of testing by USD400,000 annually. The organization also implemented IBM solutions to help deliver comprehensive test data management capabilities for creating fictionalized test databases that accurately reflect end-to-end business processes.
The insurance company in this example can today easily refresh test systems from across the organization in record time while finding defects in advance. The organization now has the enhanced ability to process claims across all 50 states cost-effectively. Testing in production is no longer the norm. In this scenario, implementing test data management not only helped the organization achieve significant cost savings, it helped reduce untested scenarios by 44 percent during a 90-day period and minimize required labor by 42 percent annually.
The insurance company in this case study scenario now has an enterprise test data process that helps reduce costs, improve predictability, and enhance testing—including enabling automation, cloud testing, mobile testing, and more. People, processes, and technologies came together to make a real change.