
Data upload in the Azure ML platform, new experiment and data cleaning (ETL)

In this blog, we are going to download a free dataset and then work on it using the Azure platform. We start by exploring some of the places where free data can be obtained. There are numerous sites which provide free data to download and work on. The one we are going to use is https://data.gov.uk/. You can use the same type of site from your own country; for example, https://data.gov.in/ is the data source for the Indian government's data. Today we are going to work on a dataset from the UK about education.
We expect free data to be noisy, erroneous and not work-ready. That is why we will start with the first step of data processing, which is called ETL: extract, transform and load.
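For readers who like to see the idea in code before the clicks, here is a minimal pandas sketch of the same three ETL stages. The file names are placeholders, and the transform step is only a toy example of cleaning:

```python
import pandas as pd

# Extract: read the raw CSV (placeholder name for the file we download below)
raw = pd.read_csv("maintained_schools.csv")

# Transform: a toy cleaning step - drop columns that are more than half
# empty, then fill the remaining gaps with a sentinel value
cleaned = raw.dropna(axis=1, thresh=int(0.5 * len(raw))).fillna(0)

# Load: write the work-ready table back out
cleaned.to_csv("maintained_schools_clean.csv", index=False)
```

We will do exactly these steps, but through the Azure ML point-and-click interface.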

Downloading the data from the source:

First, enter https://data.gov.uk in your browser's address bar or in your search engine.
The link opens the UK government's data source. The page has all these options available, as in the picture below:

Clearly there are many departments for which we can find data to analyse. For this example we will download from Education, so click on Education and proceed to the next level.
It will look like:

Now, on clicking Education, a lot of datasets with their links and meta-descriptions come on screen.
The corresponding page looks like the above.

Lastly, we select CSV as the format on the left, apply the filter and then click on the dataset named "Maintained schools and academies inspections and outcomes".

This opens a description of the data, a summary and links. Now we will download the data for up to 31st March 2017.
The data is now downloaded to your machine. Let's move on to uploading the data to the Azure platform.
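As an aside, if you prefer scripting the download to clicking through the site, a few lines of Python do the same job. The URL below is a placeholder, so copy the actual CSV link from the dataset page:

```python
import requests

# Placeholder URL - copy the real CSV link from the dataset page on data.gov.uk
CSV_URL = "https://example.data.gov.uk/path/to/maintained_schools.csv"

response = requests.get(CSV_URL, timeout=60)
response.raise_for_status()  # stop here if the download failed

with open("maintained_schools.csv", "wb") as f:
    f.write(response.content)
```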

Uploading and getting to know the data:

First, open the Azure ML platform. Then click +NEW in the bottom-left corner. It brings up a screen like the one below:


Now click on the option named DATASET. This opens the option to upload from a local file. Clicking this pops up a dialogue box, which looks like the one below:
Now browse your PC using the Choose File option and locate the specific file, which in this case is named Maintained_schools_and.... Select it and click Open, which uploads it; then close the dialogue box by clicking the tick in the bottom-right corner.
This leads to the following uploading screen, which looks like:

This ends by showing that the data has been uploaded. Click OK and the upload is complete.

Opening the dataset in a new experiment:

It follows a similar path. Going to +NEW again, open EXPERIMENT. This offers a lot of experiment templates, many of which lead to different teaching experiments with various machine learning and other aspects already implemented. But to use our own dataset, we will go with BLANK EXPERIMENT, so select Blank Experiment.
This directly opens the environment of an experiment. A new experiment environment looks like the following:

Now, on the far left, you can see all the options available from the main homepage. One can still open projects, web services, datasets, trained models and settings from this page. To the right of that, one can see many more options, such as saved datasets, trained models, data transformation, machine learning and other categories.

Understanding the experiment environment:

An experiment environment works in the following way. One has to search the left bar of functions described above, and then drag the required function onto the experiment canvas where it says "drag and drop items here". As the heading of the post suggests, our main goal is to work on a dataset here, so click on Saved Datasets.
It opens into two options: My Datasets and Samples. Any dataset you upload yourself is saved under My Datasets, while data already provided in the cloud by Microsoft sits under Samples.
As we uploaded the data ourselves, it will be under the My Datasets tag.
Now click on My Datasets. It drops down a list of all the datasets uploaded from your PC. Go on and select the dataset named Maintained_schools_..., then drag it and drop it at the top of the experiment canvas.
Now let us see what comes built in with this data. Go to the data box, select it, and then right-click on it. This opens a list of available options, which looks like the following:

"Delete" option, on clicking that will delete this box from the environment. Copy,cut,paste bears the obvious meanings. The dataset option again opens in a number of options, which are:(1) download
(2) Visualise (3) generate data access code (4) open in a new notebook. download lets one download the data in .dataset form. Visualise gives the option to see a basic summary of the data. To get a view of the data and understand basic items about it, we will click on this visualisation option. 
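As a side note before we visualise: Generate Data Access Code is meant to let you pull the dataset into your own Python session. The snippet it produces looks roughly like the sketch below, where the IDs and the dataset name are placeholders you copy from your own workspace:

```python
# Placeholders only: copy your real workspace ID, authorization token and
# exact dataset name from your own Azure ML Studio workspace.
from azureml import Workspace

ws = Workspace(
    workspace_id="<your-workspace-id>",
    authorization_token="<your-authorization-token>",
    endpoint="https://studioapi.azureml.net",
)
ds = ws.datasets["<your-dataset-name>"]
frame = ds.to_dataframe()  # the dataset arrives as a pandas DataFrame
print(frame.head())
```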
Now, visualisation will lead to the following box:
Clearly, this shows all the columns and also provides extra visualisations. To begin with, it gives a histogram of each column. Selecting a column brings up a bunch of summary statistics and visualisations; the default visualisation is the histogram. The default statistics provided are mean, median, min, max, unique values, missing values and feature type.
Unique values give a sense of how many distinct values a feature takes. So let's observe the columns. The columns named Organisation ID, URN and LAESTAB have all values unique, so these can be regarded as unique identifiers associated with each entry.
One important aspect of the data is missing values. With many missing values, analysis is not a good option. If you inspect the 41 columns, you can see that two columns, i.e. number of warnings received in previous year and previous category of concern, have too many missing values. Some other columns also have some missing values.
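If you want to double-check these observations outside the Studio, the same summary statistics, unique-value counts and missing-value counts are easy to reproduce in pandas; the file name below is a placeholder:

```python
import pandas as pd

df = pd.read_csv("maintained_schools.csv")  # placeholder file name

# Summary statistics comparable to the visualisation panel
print(df.describe())  # mean, min, max and quartiles for numeric columns

# Unique values per column: identifier-like columns such as URN should
# have as many unique values as there are rows
print(df.nunique())

# Missing values per column, worst offenders first
print(df.isna().sum().sort_values(ascending=False))
```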
Hence comes our second action: cleaning the data.

Cleaning of data:

There is a ready-made tool for cleaning missing data. Type "clean" into the search bar of the function panel. A function called "Clean Missing Data" pops up under the Manipulation category of Data Transformation. Drag and drop this below the main data box.
Now observe that there is a round port at the bottom middle of the data box. If you click on it and drag the cursor out, a connector emerges. Hold the connector.
Each function that has to be applied has ports on its top. Connect the connector to the top port. One can observe that hovering the mouse over these bottom and top ports describes what comes out of them or what they have to be connected to, i.e. dataset, processes etc.
Once you connect to the top port, some properties, which are of course defaults at the beginning, pop up on the right. You can change or manipulate these to get different behaviour.
The important properties for missing data manipulation are "Launch column selector" and the replacement value. We will not go into details here and will run the box with the defaults.
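For intuition, here is a rough pandas analogue of what the module does with those two properties. The column names are taken from the description above and may not match the real file exactly, and the replacement value 0 is purely an assumption, not the module's actual default:

```python
import pandas as pd

df = pd.read_csv("maintained_schools.csv")  # placeholder file name

# The "column selector": the two heavily-missing columns noted above.
# Check df.columns for the exact names in the real file.
cols_to_clean = [
    "Number of warnings received in previous year",
    "Previous category of concern",
]

# The "replacement value": 0 here is an assumption for illustration
df[cols_to_clean] = df[cols_to_clean].fillna(0)
```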

To run an experiment, observe that at the bottom there are options such as Run, Save, Save As and others. Clicking Run runs the whole experiment. Here it may not make a big difference, but in a big enough experiment a full run can take quite a long time. Hence we will do it a bit differently: we will select the missing data cleaning box and click Run Selected.

Once "Finished running" pops up and a green tick appears on the side of the box, it means the experiment has run successfully.

This new dataset can now be used for various analyses, SQL transformations and other interesting work. We will work further on this data in the next blog. Please stay tuned for more. Thanks for reading.


