Data Cleaning in Machine Learning

Data Cleaning:

In an ideal world, you’d receive data and put it straight into the system for processing. Then your favourite actor or actress would hand you your favourite drink and pat you on the back for a job well done.

In the real world, data is messy, usually unclean, and error-prone. The following sections offer some basic checks you should do, and I’ve included some sample data so you can see clearly what to look for.

1. Presence Checks:
First things first, check that data has been entered at all. Within web-based businesses, registration usually involves at least an e-mail address, first name, and last name. It’s amazing how many times users will try to avoid putting in their names.

The presence check is simple enough. If a field is empty or null, and that piece of data is important to the analysis, then you can’t use the records from which it is missing.

For example, if the first name and e-mail are missing from a record, the record should be fixed or rejected. In theory, the data could still be used if knowing who the customer is was not important.
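A presence check along these lines can be sketched in a few lines of Python. This is only an illustration: the sample records, the e-mail address, and the choice of required fields are made up for the example, and `missing_fields` and `split_by_presence` are hypothetical helper names.

```python
import csv
import io

# Hypothetical sample data; the second record has no first name or e-mail.
SAMPLE = """firstname,lastname,email,age
Jason,Bell,jason@example.com,42
,Bell,,42
"""

# Assumption: these are the fields the analysis cannot do without.
REQUIRED_FIELDS = ("firstname", "lastname", "email")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or blank."""
    return [name for name in required if not (record.get(name) or "").strip()]


def split_by_presence(rows):
    """Separate usable records from those that must be fixed or rejected."""
    accepted, rejected = [], []
    for row in rows:
        missing = missing_fields(row)
        if missing:
            rejected.append((row, missing))
        else:
            accepted.append(row)
    return accepted, rejected


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
accepted, rejected = split_by_presence(rows)
for row, missing in rejected:
    print("rejected:", row, "missing:", missing)
```

Keeping the list of missing fields alongside each rejected record makes it easier to decide later whether the record can be repaired rather than thrown away.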

2. Type Checks:
With relational databases you have schemas, so there’s already an expectation of what type of data is going where. If data of the wrong type is written to a field, the database engine will throw an error and complain.

With text data, such as CSV files, that’s not the case, so it’s worth looking at each field and ensuring that it holds the type of value you’re expecting. In the sample below, the second record has the first name and the age swapped.

#firstname, lastname, email, age
Jason,Bell,[email protected],42
42,Bell,[email protected],Jason
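One way to catch a swap like the one in the second record is to test whether each field can be read as the type you expect. The following is a minimal sketch, assuming that `age` must be an integer and that a purely numeric `firstname` is an error; `looks_like_int` and `type_errors` are hypothetical helper names.

```python
def looks_like_int(value):
    """True if the text can be parsed as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def type_errors(record):
    """Return a list of type problems found in one record."""
    errors = []
    # Assumption: a name made up only of digits is a data-entry error.
    if looks_like_int(record["firstname"]):
        errors.append("firstname is numeric")
    if not looks_like_int(record["age"]):
        errors.append("age is not an integer")
    return errors


good = {"firstname": "Jason", "lastname": "Bell", "age": "42"}
swapped = {"firstname": "42", "lastname": "Bell", "age": "Jason"}

print(type_errors(good))     # no problems found
print(type_errors(swapped))  # both fields flagged
```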

3. Length Checks:
Field lengths must be checked, too. Once again, relational databases exercise a certain amount of control, but textual data can be error-prone if people don’t follow the rules of the schema.
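For text data, a length check can mirror the limits a database schema would impose. The sketch below assumes some maximum lengths purely for illustration; the real limits should come from your own schema, and `length_errors` is a hypothetical helper name.

```python
# Assumed limits for illustration only; take the real ones from your schema.
MAX_LENGTHS = {"firstname": 50, "lastname": 50, "email": 254}


def length_errors(record, limits=MAX_LENGTHS):
    """Return {field: actual_length} for every field that exceeds its limit."""
    too_long = {}
    for field, limit in limits.items():
        length = len(record.get(field) or "")
        if length > limit:
            too_long[field] = length
    return too_long


ok_record = {"firstname": "Jason", "lastname": "Bell", "email": "jason@example.com"}
bad_record = {"firstname": "J" * 51, "lastname": "Bell", "email": "jason@example.com"}

print(length_errors(ok_record))
print(length_errors(bad_record))
```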

4. Range Checks:
Range or reasonableness checks are used with numeric or date values. Age ranges are the main talking point here. Until advances in medical science prolong life, you can make a fairly good assumption that the upper limit of someone’s lifespan is about 120. You can even play it safe and extend the upper limit to 150; anyone who claims to be older than that is lying or just trying to put in a false value to trip up the system.
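The age check described above can be sketched as follows. The lower bound of 0 is an assumption added for the example, and `age_in_range` is a hypothetical helper name; the upper bound can be relaxed to 150 if you want to play it safe.

```python
MIN_AGE = 0    # assumption: negative ages are never valid
MAX_AGE = 120  # upper lifespan suggested in the text


def age_in_range(value, low=MIN_AGE, high=MAX_AGE):
    """True if value parses as an integer age within [low, high]."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return False
    return low <= age <= high


for candidate in ("42", "130", "-1", "Jason"):
    print(candidate, age_in_range(candidate))
```

Passing `high=150` gives the more lenient version of the check without changing the function.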

5. Format Checks:
When you know that certain data must follow a given format, it’s always good to check it. Knowledge of regular expressions is a big advantage here. E-mail addresses can be used and abused in web forms and database tables, so it’s always a good idea to validate what you can at the source.

There’s much discussion in the developer world about what a correct e-mail regular expression is. The official standard for the e-mail address specification is RFC 5322. A regular expression that matches the full specification is a huge pattern. What you’re looking for is something that will catch the majority of e-mail addresses:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

The main thing to do is create a set of test cases covering every form of e-mail address you think you will come across. Don’t just test it once; keep retesting it over time.
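A small table of test cases makes that retesting cheap. The sketch below uses the pattern quoted above, written as one expression; the `re.IGNORECASE` flag is an added assumption so that upper-case addresses are accepted, and the addresses in `TEST_CASES` are made-up examples rather than a complete list. `is_valid_email` and `failing_cases` are hypothetical helper names.

```python
import re

# The pattern quoted above, split across two string literals for readability.
# Assumption: matching is case-insensitive.
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def is_valid_email(value):
    """True if the whole string matches the pattern."""
    return bool(EMAIL_RE.fullmatch(value or ""))


# Made-up addresses mapped to the result we expect from the check.
TEST_CASES = {
    "jason.bell@example.com": True,
    "Jason.Bell@Example.COM": True,
    "first+tag@sub.example.co.uk": True,
    "no-at-sign.example.com": False,
    "two@@example.com": False,
    "trailing.dot.@example.com": False,
    "user@localhost": False,
    "": False,
}


def failing_cases(cases=TEST_CASES):
    """Return the addresses whose result differs from the expectation."""
    return [addr for addr, expected in cases.items()
            if is_valid_email(addr) != expected]


print("failing cases:", failing_cases())
```

Whenever a new kind of address turns up in real data, add it to `TEST_CASES` and rerun the check before changing the pattern.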