At Grantbook, we are often engaged in data cleaning tasks either as stand-alone projects or as a part of a larger grants management system (GMS) implementation. Even though data is a key component of an organization’s story, we often find that data cleaning or migration work is not given the time or resources that it needs. Accurate and well governed data is often the best way for grantmakers to track their funding, know who they're giving to, and begin to unpack their impact story.
If this is the first time your organization is taking on a data cleansing project, it can definitely be daunting knowing where to start. The good news? We’ve been in your shoes and having worked on several data prep and cleansing projects over the years, we’ve picked up some tips and tricks along the way.
We’ve broken this information out into five sections:
- What is Clean Data? - an overview of what we mean by “clean data” and why it’s important
- Data Strategy & Governance - the most important first step to ensure that your organizational strategy, roles, and processes are all aligned to ensure that your cleaned data is informative and reliable
- Cleaning Checklist - how to go about the process of data cleaning
- Data Maintenance & Hygiene - best practices to ensure your data continues to stay clean on an ongoing basis
- Frequently Used Data Cleaning Tools - available tools (or helpers if you will) that will allow you to speed up the process
What is clean data?
Before jumping into the how-to’s, we feel it is important to establish what we mean by “clean data”. In broad strokes, we define clean data as data that is accurate, complete, consistent, and unique.
Different types of data will have different requirements for keeping it accurate and up to date. For instance, information about organizations is more static than information about individuals.
Staff can draw more reliable inferences and have a more holistic picture from datasets that are complete.
Where possible there should be uniformity between similar records in the system. Data entry conventions for record names and fields makes for easier reporting and segmentation of the database.
Too many duplicate records can skew information in reports. It can also provide inaccurate and confusing information to those who need it on the programming side. Keeping duplicates low or out of the system gives users confidence in the data being tracked.
When migrating to a new grants management system or implementing a new system, clean data is integral to the success of your implementation. Clean data will enable you to migrate the appropriate unique organizations, contacts and grants in the new system, as well as establish the correct relationships between these data points. Clean data in fields that are used to drive functional logic is also important.
Data Strategy & Governance: The Most Important First Step
We have seen clients pawn off data tasks to interns without organizational context, migrate every single field from an old system to a new one only to realize 30% were never populated, or underestimate the larger strategy and planning discussions involved in the process.
From our experiences with data in cleaning and migration contexts, we have distilled a few best practices for planning and expectation setting that should be considered before starting any data project.
- Organize a data discovery session: Use this session as an opportunity to review your organization’s data, flag risks, and find common issues and patterns (duplicates, incorrect data, data sitting in the wrong fields, fields used incorrectly, unused fields, etc.). This is also a good setting to discuss post-cleaning governance tasks and ensure that leadership is on board with the data vision.
- Know the goals of the data: Are there fields that should not be carried over to the new system? What are the reports or impact dashboards you want to create from your grant data and which data points need to be collected in order to make this happen?
- Set time expectations: Establish the time commitment required to complete data tasks at the outset of a project. If the data is cleaner, the time commitment for the data project will be lighter. If the data is messier, you will need more time and might need to bring in individuals who have more organizational context to help sort through the data.
- Establish roles: It is important for everyone to have a role. The projects that we’ve seen have the most success always have a data champion, someone assigned to QA, decision makers, and organizational context holders.
- Create a task list of cleaning activities: What gets measured gets done. Having a task list ensures that each person knows what they are responsible for.
Cleaning Checklist: How to Clean your Data
- Start Small: Begin small by cleaning just a subset of your data. This will help you identify a standardized way to go about the process, establish a more accurate sense of time required and unearth potential issues earlier on.
- Data Extraction: Export out the raw data required from your system for each data model (i.e. Organizations, Contacts, Grants etc).
- File Format & Encoding: Pay attention to the file format and file encoding when exporting out data. CSV files are a commonly used file format for large datasets (it will take up less storage space) and UTF-8 is considered a safe encoding language across OS and multilingual languages
- File Labeling: When exporting out your data, make sure to clearly label every data file. Include the data model, the date of export and the label ‘Raw Data’ to signify that the file represents the original data before it was scrubbed.
- Create a copy: Always create a copy of your ‘raw data’ file and use this copied file (once relabeled to say ‘Scrubbed Data’) as the file to use when you begin scrubbing the data. The goal is to always maintain a record of your original raw data so you have a frame of reference if needed.
- Log all changes: Data cleaning always involves a variety of tasks (often more than anticipated) and it can become very easy to lose track of the changes made. Keeping a log (a simple spreadsheet or google doc will do the trick) is a good way to track the history of changes made and can be especially useful when you have multiple people working on cleansing the data as opposed to just one individual person.
- Check for duplicates: Use this as an opportunity to check for duplicates (based on name, email address, phone number etc) and do so at the start of the process. This will reduce both the amount of data you have to scrub as well as making data validation (at the end of the process) much simpler.
- Standardize Formats: Review the formats being used for applicable data types (dates, phone numbers, postal codes etc) and set and apply a standard format to all.
- Special Characters & Accents: Depending on the file format and encoding chosen, special characters can display weirdly once the data is exported. For example, the term “Açaí” might display as “AÃ§aÃ” so make sure to do a check and clean as needed.
- Maintain backups: Always maintain a backup of all files (both raw and scrubbed data files). Having an ongoing backup for all versions of scrubbed data files allows you to have version history and helps avoid those grey hairs in case anything were to happen to your master files. Avoid storing data locally though (both master files and backups) as it can be a security risk
- Clean workspace: Data scrubbing is the type of task that requires chunks of uninterrupted time and a tidy digital and mental workspace so avoid having other documents open and other tasks/meetings on the go as you work on the data as it only increases the chance of error
- Data Import: Once you’ve finalized a scrub of all your data, import it back into your system
- Data Validation: Once the scrubbed data is back in your system, validate its accuracy by conducting a quality review. You can do this by (i) reviewing a handful of records for each data model to verify accuracy (ii) reviewing a sample of records with special characters and accents to ensure proper formatting and (iiI) running reports for each data model to extract out the clean data and cross reference against the data in your master (scrubbed data) spreadsheets
Data Maintenance & Hygiene: Keeping Your Data Clean
- Create a list of data protocols supported by a data glossary to uphold data cleanliness
- Schedule trainings/workshops with staff to help get them informed and engaged on the new data policies
- Conduct frequent data validation (e.g. every quarter) to ensure data policies are being followed and clean data is continuing to be maintained
- Leverage any available existing functionality of your system in maintaining the integrity of your data e.g. turn on duplicate management (if available)
Frequently Used Data Cleaning Tools
There are several tools that can help speed up data cleansing but it can be hard knowing which to use so listed below are some of the tools we’ve had success using in the past (and it includes some free options!):
- Duplicate Check for Salesforce (has free trial options)
- Native duplicate rules functionality in Salesforce (org functionality is better than contacts)
- OpenRefine (free)
- Fuzzy Lookup for Excel from Microsoft (helps identify potential duplicates based on close matches) : https://www.microsoft.com/en-ca/download/details.aspx?id=15011
- If viewing data in a spreadsheet, use the filter function on fields that should have values from a certain set. The filter dropdown will show all the values in the column and you can spot anomalies. The filter also has options to look for blank values across fields that should have values