Machine learning works well where there is lots of data. Nowadays data is almost everywhere and comes in different formats, both structured and unstructured. One of the biggest challenges in machine learning is the quality of data. Data does not come on a silver platter: you have to work hard to clean it and to understand which data is useful for your analysis and which is misleading. The quality of your predictions is directly affected by the quality of your data. Most of the time in your machine learning tasks you will be working with data from external sources such as CSV files, Excel files, APIs, databases and, more interestingly, streaming data from social sites. In this post we are going to learn how to load a data set for machine learning and train our model. We will use the famous Iris data set, which can be downloaded here: iris data set. This post assumes that you are familiar with Pandas.

Loading Machine Learning Data

There are many ways of loading data for use in machine learning, but perhaps the best approach is to use the Pandas library. Pandas is an open source Python tool for data analysis. It is ideal for data wrangling, cleaning and analysis. If you are not familiar with Pandas you can visit my post series to learn about it. We will be using the Iris data set for our example. The Iris data set is a famous data set that is often used in introductions to machine learning as a supervised learning example for classification problems. It consists of 150 samples of flowers from three species, namely Iris setosa, Iris virginica and Iris versicolor. The data set has four features: petal length, petal width, sepal length and sepal width.

Our Project Setup

In the first part of our project we are going to load the Iris data set that comes with scikit-learn, do a simple analysis of it, then train our model and make predictions. In the second part we are going to load the data from a CSV file, do some analysis, then train our model and make predictions. For more details on how to load data using Pandas from different sources such as databases, APIs, Excel spreadsheets and more, you can visit my previous post on Pandas Data Analysis.

Loading Iris Data Set Using load_iris() Function

Load Iris data set
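
As a minimal sketch of this step, we can call the load_iris() function from scikit-learn's datasets module. The variable name iris is our own choice; the function returns a dictionary-like object that holds the measurements, the labels and some metadata.

```python
from sklearn.datasets import load_iris

# Load the copy of the Iris data set that ships with scikit-learn
iris = load_iris()

# The returned object behaves like a dictionary
print(type(iris))
print(iris.keys())
```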

Get Features
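
The features are stored in the data attribute of the loaded object. The sketch below assumes we keep them in a variable called X, following the usual scikit-learn convention.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
X = iris.data

print(iris.feature_names)
print(X.shape)   # (150, 4): 150 samples, 4 features
print(X[:5])     # first five rows
```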

Labels/Target
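
The labels are stored in the target attribute as integers, and target_names maps each integer to a species name. The variable name y is again just the usual convention.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Label vector: one integer (0, 1 or 2) per flower
y = iris.target

print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
print(y.shape)            # (150,)
```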

Model Training
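
Here is a rough sketch of the training step, assuming we use the K-Nearest Neighbors classifier that appears later in this post. The value n_neighbors=5 and the sample measurements are illustrative choices, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data    # features
y = iris.target  # labels

# n_neighbors=5 is an illustrative choice
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# A made-up flower: sepal length, sepal width, petal length, petal width (cm)
sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = knn.predict(sample)

print(iris.target_names[prediction][0])  # setosa
```

Note that we train and predict on the same data here only to keep the example short.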

Loading Data Using Pandas

Instead of the datasets module from scikit-learn, we now use the pandas.read_csv(filename) function to load the Iris data set. First you need to download the Iris data set in CSV format here: iris data set.

Analyzing Data Set

Let’s now load the Iris data from a CSV file. Download the data set from here, then drop it in your working directory. To import CSV data using Pandas we use the pandas.read_csv(filename) function, which returns a Pandas DataFrame. With Pandas we can run many kinds of analysis on our Iris CSV data set, which is a recommended step in machine learning because it helps us understand our data better. Let’s do a simple analysis using Pandas. For a more comprehensive analysis you can visit my previous Pandas Data Analysis series.
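
The following is a minimal sketch of that analysis. The file name iris.csv and the column names are assumptions, so adjust them to match the file you downloaded. To keep the example runnable on its own, it first writes a stand-in CSV built from scikit-learn's bundled copy of the data.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the downloaded file (assumed file name and column names)
iris = load_iris()
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
frame = pd.DataFrame(iris.data, columns=columns)
frame["species"] = iris.target_names[iris.target]
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_path)

print(df.shape)                      # number of rows and columns
print(df.head())                     # first five rows
print(df.describe())                 # summary statistics per feature
print(df["species"].value_counts())  # number of samples per species
print(df.isnull().sum())             # missing values per column
```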

Model Training

Now we can go ahead and train our model and make predictions.

We follow the standard procedure for machine learning. First we import the necessary libraries, which include Pandas and the K-Nearest Neighbors algorithm. Then we load the data using Pandas and split it into the features, X (note the capital X), and the label/target, y (lower case y). We then train our model using the fit() function, and finally we make predictions for our given data samples.
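
The steps above can be sketched as follows. The file name, the column names, n_neighbors=5 and the sample measurements are all assumptions made for illustration. As before, the example writes a stand-in CSV from scikit-learn's bundled data so that it runs on its own.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the downloaded file (assumed file name and column names)
iris = load_iris()
feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
frame = pd.DataFrame(iris.data, columns=feature_columns)
frame["species"] = iris.target_names[iris.target]
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
frame.to_csv(csv_path, index=False)

# 1. Load the data using Pandas
df = pd.read_csv(csv_path)

# 2. Prepare the features (X) and the label/target (y)
X = df[feature_columns]
y = df["species"]

# 3. Train the model; n_neighbors=5 is an illustrative choice
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# 4. Predict the species of a made-up flower
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=feature_columns)
prediction = knn.predict(sample)
print(prediction[0])  # setosa
```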

Conclusion

As a data practitioner you will regularly work with data from external sources such as CSV files, databases, APIs and many more. We have learned how to use Pandas, an excellent tool for data analysis tasks. We have only looked at how to load CSV data for our project. To learn how to load other data sources you can visit my previous post on Pandas Data Analysis. Data comes from different sources and in different forms, both structured and unstructured, so it needs to be organized in a standard format before a machine learning model can learn from it. Data cleaning forms a significant part of the machine learning process and contributes heavily to its overall outcome.

What’s Next

In this post we have learned how to load data for machine learning using Pandas. In the next post we are going to learn how to split our data into training, test and validation sets.
