How Is Data Used in Machine Learning?

1. Raw Text: Plain raw text files are used for many publications. If you look at the likes of Project Gutenberg, you’ll see that you can download works as raw text files. The data is unstructured, so it rarely comes in a form that you can work with directly.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse eget metus quis erat tempor hendrerit. Vestibulum turpis ante, bibendum vitae nisi non, euismod blandit dui. Maecenas tristique consectetur est nec elementum. Maecenas porttitor, arcu sed gravida tempus, purus tellus lacinia erat, dapibus euismod felis enim eget nisl. Nunc mollis volutpat ligula. Etiam interdum porttitor nulla non lobortis.

Common encodings for text files are ASCII and UTF-8, which is the most widely used encoding of Unicode. If international characters are required, UTF-8 or another Unicode encoding such as UTF-16 is the usual choice. Note that PDF documents, Rich Text Format files, and Word documents are not raw text files. Microsoft Office documents (such as Word files) are particularly troublesome because of “smart quotes” and other extraneous non-text characters that wreak havoc in Java programs.
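As a minimal sketch of dealing with these issues (written in Python rather than Java, purely for brevity), the snippet below decodes bytes with an explicit encoding and then maps smart quotes to plain ASCII quotes. The function names and the sample sentence are illustrative assumptions, not part of any particular library.

```python
# Map the four common "smart" quotation marks to plain ASCII quotes.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_TRANSLATION = str.maketrans(SMART_QUOTES)


def decode_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes explicitly instead of relying on the platform default."""
    return raw.decode(encoding)


def normalize_quotes(text: str) -> str:
    """Replace smart quotes with their plain ASCII equivalents."""
    return text.translate(_TRANSLATION)


# Hypothetical sample: text pasted from a word processor, stored as UTF-8.
raw = "\u201cMaecenas tristique\u201d isn\u2019t plain ASCII".encode("utf-8")
clean = normalize_quotes(decode_text(raw))
print(clean)
```

Naming the encoding explicitly keeps the result the same across machines; the quote mapping covers only the four characters listed and would need extending for other typographic characters.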

2. Comma-Separated Values: The CSV format is widely used across the data landscape. The comma character is used between each field of data. You might find that other delimiters are used, such as the tab character (TSV) and the pipe (|) symbol (PSV). Delimiters are not limited to one character either. If you look at something like the USDA Food Database, you’ll see ~^~ between text fields: the caret (^) separates the fields and the tildes (~) surround text values. The following CSV record was generated by a fake name generator site. It’s always good to use fake data when you’re testing things.

1,male,Mr.,Joe,L,Perry,50 Park Row,EDERN,,LL53 2SQ,GB,United Kingdom,[email protected],Annever,eiThahph9Ah,077 6473 7650,Fry, 7/4/1991,Visa,4539148712302735,342,2/2018,YB 20 98 60 A,1Z 23F 389
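A record like this can be split into fields with very little code. The sketch below uses Python's standard `csv` module for single-character delimiters and a plain string split for a multi-character delimiter. The shortened sample lines and the helper names are assumptions made for illustration.

```python
import csv
import io

# Shortened, illustrative versions of the record above.
comma_line = "1,male,Mr.,Joe,L,Perry,50 Park Row,EDERN"
pipe_line = "1|male|Mr.|Joe|L|Perry|50 Park Row|EDERN"
multi_line = "1~^~male~^~Mr.~^~Joe"


def parse_delimited(line: str, delimiter: str = ",") -> list:
    """Parse one line with a single-character delimiter via the csv module."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter)
    return next(reader)


def parse_multichar(line: str, delimiter: str) -> list:
    """csv.reader only accepts one-character delimiters, so split manually.

    Note: this simple split does not handle quoted fields.
    """
    return line.split(delimiter)


comma_fields = parse_delimited(comma_line)
pipe_fields = parse_delimited(pipe_line, delimiter="|")
multi_fields = parse_multichar(multi_line, "~^~")
print(comma_fields)
```

The `csv` module is worth preferring over a manual split wherever it applies, because it copes with quoted fields that contain the delimiter itself.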

3. JSON: JavaScript Object Notation (JSON) is a commonly used data format that uses key/value pairs to exchange data between machines, particularly over the web. It is widely used as a lighter-weight alternative to XML. Don’t be fooled by the use of the word JavaScript; you don’t need JavaScript to use this data format.

There are JSON parsers for most programming languages. The earlier CSV example used fake name data; here are some of the fields from the first entry of the CSV in JSON notation:

{
  "StreetAddress":"50 Park Row",
  "ZipCode":"LL53 2SQ",
  "CountryFull":"United Kingdom",
  "EmailAddress":"[email protected]",
  "TelephoneNumber":"077 6473 7650",
  "NationalID":"YB 20 98 60 A",
  "UPS":"1Z 23F 389 61 4167 727 1",
  "Occupation":"Nephrology nurse",
  "Company":"Friendly Advice",
  "Vehicle":"1999 Alfa Romeo 145",
  "FeetInches":"5' 10\""
}

Many application programming interfaces (APIs) use JSON to send response data back to the requesting program. Some parsers might take the JSON data and represent it as an object. Others might be able to create a hash map of the data for you to access.
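To illustrate the two styles of parser output, here is a small sketch using Python's standard `json` module. By default it returns a dictionary (a hash map); passing an `object_hook` is one way to get attribute-style access instead. The payload reuses a few fields from the example above, and the variable names are illustrative.

```python
import json
from types import SimpleNamespace

# A few fields from the example above, wrapped in braces to form one object.
payload = r'''
{
  "StreetAddress": "50 Park Row",
  "ZipCode": "LL53 2SQ",
  "CountryFull": "United Kingdom",
  "FeetInches": "5' 10\""
}
'''

# Style 1: a hash map (Python dict) keyed by field name.
as_map = json.loads(payload)

# Style 2: an object with attributes, built by converting each JSON object.
as_object = json.loads(payload, object_hook=lambda d: SimpleNamespace(**d))

print(as_map["ZipCode"])
print(as_object.CountryFull)
```

The dictionary form is the more flexible of the two when keys are not known in advance; the object form reads more naturally when the structure is fixed.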

4. YAML: Like JSON, YAML (a recursive acronym for “YAML Ain’t Markup Language”) is a data format rather than a document markup format. It’s not as widely used as JSON, but from a distance it looks very similar. Unlike JSON, it relies on indentation to express nesting.

date: 2014-01-02
bill-to: &id001
    given: Jason
    family: Bell
    lines: |
        458 Some Street Somewhere
        In Some Suburb
    city: MyCity
    state: CA
    postal: 55555

5. XML: Extensible Markup Language (XML) followed on from the popular use of Standard Generalized Markup Language (SGML) for document markup. The idea was for XML to be easily read by humans and also by machines. At first glance, XML looks like Hypertext Markup Language (HTML), and XHTML is a reformulation of HTML that follows strict XML syntax. XML gets criticism for its complexity, especially when reading large structures.

That’s one reason it’s popular for web-based APIs to use JSON as their response format. There are still a large number of APIs delivering XML response data, though, so it’s worthwhile to look at how it works:

<?xml version="1.0" encoding="UTF-8" ?>
<StreetAddress>50 Park Row</StreetAddress>
<ZipCode>LL53 2SQ</ZipCode>
<CountryFull>United Kingdom</CountryFull>
<EmailAddress>[email protected]</EmailAddress>
<TelephoneNumber>077 6473 7650</TelephoneNumber>
<NationalID>YB 20 98 60 A</NationalID>
<UPS>1Z 23F 389 61 4167 727 1</UPS>
<Occupation>Nephrology nurse</Occupation>
<Company>Friendly Advice</Company>
<Vehicle>1999 Alfa Romeo 145</Vehicle>
<FeetInches>5' 10"</FeetInches>

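Most common languages have XML parsers available, using either a Document Object Model (DOM) parser or a Simple API for XML (SAX) parser. A DOM parser loads the whole document into memory as a tree that you can navigate, whereas a SAX parser streams through the document and fires events as it meets each element. Each approach has advantages and disadvantages depending on the size and complexity of the XML document you are working with.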
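The difference between the two parser styles is easier to see side by side. The sketch below uses Python's standard `xml.dom.minidom` and `xml.sax` modules to read the same small document. Because the fragment shown earlier has no single root element, the sketch wraps a few of its fields in a hypothetical `<Person>` root; that name and the `FieldCollector` class are assumptions for illustration.

```python
import xml.sax
from xml.dom import minidom

# A few fields from the example, wrapped in a hypothetical <Person> root
# so that the document is well-formed.
document = b"""<?xml version="1.0" encoding="UTF-8" ?>
<Person>
  <StreetAddress>50 Park Row</StreetAddress>
  <ZipCode>LL53 2SQ</ZipCode>
  <CountryFull>United Kingdom</CountryFull>
</Person>"""

# DOM style: parse everything into a tree, then walk the tree.
dom = minidom.parseString(document)
dom_fields = {
    node.tagName: node.firstChild.data
    for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
}


# SAX style: react to events as the parser streams through the document.
class FieldCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None
        self._chunks = []

    def startElement(self, name, attrs):
        self._tag = name
        self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect them.
        self._chunks.append(content)

    def endElement(self, name):
        # Only leaf elements close while they are still the current tag.
        if name == self._tag:
            self.fields[name] = "".join(self._chunks).strip()
        self._tag = None


handler = FieldCollector()
xml.sax.parseString(document, handler)
sax_fields = handler.fields
print(dom_fields == sax_fields)
```

The DOM version is shorter because the tree is already built by the time the code runs; the SAX version never holds the whole document in memory, which is what makes it attractive for very large files.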

6. Spreadsheets: Talk to any finance person in your organization, and you’ll discover that their entire world revolves around spreadsheets. Programmers tend to shun spreadsheets in favour of data formats that make their lives easier. You can’t ignore them, though. Spreadsheets are the lifeblood of an organization, and they probably hold most of the organization’s data.

There are lots of different spreadsheet programs, but the most commonly used applications are Microsoft Excel, Google Sheets, and LibreOffice Calc. Fortunately, there are programming APIs that you can use to extract the data from spreadsheets directly, which saves a lot of work converting the spreadsheet to the likes of CSV files. It’s worth studying the formulas in the spreadsheets because there might be some algorithms lurking there that are worth their weight in gold.

7. Databases: If you’ve been brought up with web programming, then you have probably had some exposure to databases and database tables. Common ones are MySQL, PostgreSQL, Microsoft SQL Server, and Oracle.

Recently, there’s been an explosion of NoSQL (meaning Not Only SQL) databases, such as MongoDB, CouchDB, Cassandra, Redis, and HBase, which all bring their own flavours to data storage. These document, key/value, and wide-column stores move away from the rigid table-like structures of traditional relational databases.
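The contrast between a rigid table and a schemaless store can be sketched with the standard library alone. In the snippet below, SQLite (bundled with Python) stands in for the relational databases named above, and a plain dictionary of JSON strings stands in for a document store. Neither is the API of MySQL, MongoDB, or any other product mentioned; the table and key names are hypothetical.

```python
import json
import sqlite3

# Relational side: the table fixes its columns up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, given TEXT NOT NULL, family TEXT NOT NULL)"
)
conn.execute("INSERT INTO people (given, family) VALUES (?, ?)", ("Joe", "Perry"))
row = conn.execute("SELECT id, given, family FROM people").fetchone()

# Writing to a column the table does not define is rejected.
try:
    conn.execute(
        "INSERT INTO people (given, family, vehicle) VALUES (?, ?, ?)",
        ("Joe", "Perry", "1999 Alfa Romeo 145"),
    )
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True

# Document side (a dict standing in for a document store): each record
# carries its own structure, so extra fields need no schema change.
documents = {}
documents["person:1"] = json.dumps({"given": "Joe", "family": "Perry"})
documents["person:2"] = json.dumps(
    {"given": "Joe", "family": "Perry", "vehicle": "1999 Alfa Romeo 145"}
)
second = json.loads(documents["person:2"])
print(row, schema_enforced, second["vehicle"])
```

The flexibility on the document side comes at a cost: the code that reads the data has to cope with fields that may or may not be present.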

8. Images: The common data formats previously mentioned mainly deal with text or numbers in various forms, but you can’t discount images. There is a lot you can learn from images. Whether you’re trying to use facial recognition or emotion tracking, or trying to determine whether an image shows a cat or a dog (yes, it has been done), there are several APIs that will help.

The most popular formats are Portable Network Graphics (PNG) and JPEG, both of which are regularly used on the web. If storage and processing power are freely available, TIFF and BMP are alternatives: the files are much larger, but they typically retain more image information than a lossy JPEG.
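When images arrive from mixed sources, a file extension is not always trustworthy, and a first step is often to check which format a file really is. The sketch below identifies the four formats mentioned above from the well-known signature bytes at the start of each file. The function name is illustrative, and the check covers only these four formats.

```python
# Leading "magic" bytes for the formats discussed above.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def sniff_image_format(header: bytes):
    """Return the format name for the given leading bytes, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


# Usage with in-memory bytes; for a real file, read its first few bytes:
#     with open(path, "rb") as fh:
#         header = fh.read(8)
print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

This only identifies the container format; decoding pixels for tasks such as classification needs an image library, which is outside the scope of this sketch.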