The Mobenzi tools include a number of features that allow you to gather accurate information and streamline data cleaning protocols. In this post, we’ll cover a few of the ways you can harness these features to maintain good data quality, and highlight some of the recent enhancements that provide even greater control.
Preventing data input errors
The most important place to start with data quality is at the source – i.e. during data collection. This in turn is influenced by your form’s design. It makes sense to invest upfront in this process: the more accurately you can gather data initially, the less time, resources and energy you’ll spend cleaning up afterwards.
Mandatory vs optional fields
By default, most fields you add to your form are required – the form cannot be submitted until all required fields have been captured. It is possible however to set a field to “optional” instead. This is sometimes necessary (for example, if you aren’t sure whether an answer can always be provided) but generally it’s best to set all fields to required and use skip logic to hide fields that don’t always apply (more on that later).
By implementing validation you can avoid many data input errors before they ever reach your dataset. Most field types can have both simple and advanced validation rules applied. Simple validation examples include static range checks (e.g. fixed min and max values for a respondent’s age) or validating one captured value against another (e.g. ensuring a child’s age is less than their parent’s).
If the input needs to conform to a specific format (e.g. exactly 13 digits), you can use a regular expression to enforce this.
When validating dates, the
=today() function is useful as it allows you to implement relative constraints. For instance, to ensure the date specified falls within the last week, you can add a validation rule that the date entered is after
today() - 7.
Double data entry is another way to ensure key fields have been captured correctly. It’s very simple to configure a validation rule on a confirmation field that ensures the value entered (e.g. ID number) matches the value entered in the first field.
Applying appropriate skip logic can ensure you don’t end up with data anomalies – such as pregnancy-related information being captured for male respondents. You should decide which questions are applicable/inapplicable based on previous inputs and build this skip logic into your form. As a reminder, you can apply skip logic to an individual field, a group of fields or an entire page.
Advanced validation strategies
For more advanced scenarios, you can write custom expressions that leverage built-in functions, including “IF” statements, and reference multiple fields if necessary to evaluate whether to display a warning or confirmation message when potentially invalid data is captured. An expression can either be used in a validation rule directly, or you can store the result in a Calculated Value and use the result to show a validation warning, such as the example below.
And if you want to implement even more sophisticated types of validation that can’t be constructed using the built-in functions, you can build your own widget that can be embedded into a form to execute completely custom code and return a validation result. For example, based on a child’s entered weight and age, you could build a widget like the example below that uses the WHO growth guidelines to compare against population norms. If the values entered fall significantly outside the normal range, you could prompt to double-check the values entered.
Making data corrections
Once a completed form has been uploaded, the next phase of data quality management can begin. Should any errors in captured data be detected or reported, it’s possible to correct them by editing the submission. All changes are automatically logged, recording the date/time the change was made, who the change was made by, what the value was changed from and to, and (optionally) why the change was made. This audit history ensures full traceability back to the original source.
It’s worth noting another feature at this point related to form versioning and data integrity. When a new version of a form is published, a snapshot of the entire form is stored. Any data captured using that version of the form is linked to that snapshot. This ensures that, if you edit a submission, the corresponding version of the form that was in use during initial data entry can be retrieved. This prevents data integrity problems arising where data that was originally captured no longer maps up to later versions of the forms (e.g. where option lists may have changed or fields have been added or removed).
Tracking data cleaning progress
In January 2021 we released a powerful new feature that allows you to “hide” a field during data collection. This feature has several use cases, but one of the most powerful is how it can be used to track data cleaning progress.
Regardless of the data cleaning process you decide to follow and the checks you perform, a common challenge is how to keep track of “clean” data, i.e. data that has been reviewed, corrected (if necessary) and can be included in analysis, versus data that is of questionable or unknown quality.
By adding one or two hidden fields to your form, you can easily solve this challenge. In the simplest case, you would just add a single select field with three options: “Pending”, “Cleaned” and “Rejected”. Set the default value of the field to “Pending” and make the field hidden – this will result in the field being hidden during data entry and its value set to “Pending” when the form uploads.
Although the field (named “qc_status” in the example above) is not shown during data collection, it is visible and can be edited once the form has been uploaded to the backend. During data review, a quality manager can use this field to record the current status of the record. Like any other field, its value can be modified and the change will be tracked. For example, if they picked up an issue they might set the value from Pending to Rejected.
If the query can be resolved, the data can be corrected and the status could be set to Cleaned. This could be used to indicate that the data was ready for analysis.
Using this approach, you can also easily pull a list of records based on their current status. For example, to get a “clean” dataset, you would simply create a filter to return submissions where qc_status = “Cleaned”. You could have similar filters for other statuses to keep track of records that needed to be reviewed or excluded from analysis. A useful feature to remember here is the ability to save a filter as a “published report” – allowing you to easily retrieve the list (filter) again or to share it with a colleague.