Data upload in Azure ML platform, new experiment, data cleaning and ETL

In this blog, we are going to download a free dataset and then work on it using the Azure platform. We start by exploring some of the places where free data is available. There are numerous sites which provide free data to download and work on. The one we are going to use is https://data.gov.uk/. You can also use a similar site from your own country; for example, https://data.gov.in/ is the data source for the Indian government's data. For today's work we are going to use a dataset from the UK about education.
We expect free data to be noisy, erroneous and not ready to work with. That's why we will start with the first step of data processing, which is called ETL: extract, transform and load.
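To make the three ETL stages concrete, here is a minimal sketch in Python using only the standard library. It is not part of the Azure workflow in this post. The sample rows, the column names other than URN, and the SQLite table are assumptions made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded CSV file.
# Only the URN column name comes from the real dataset; the rest is assumed.
RAW_CSV = """URN,School name,Overall effectiveness
100001, Example Primary ,2
100002,Sample Academy,
"""

def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and turn empty cells into None."""
    cleaned = []
    for row in rows:
        cleaned.append(
            {key: ((value or "").strip() or None) for key, value in row.items()}
        )
    return cleaned

def load(rows, connection):
    """Load: write the cleaned rows into a SQLite table."""
    connection.execute(
        "CREATE TABLE schools (urn TEXT, name TEXT, effectiveness TEXT)"
    )
    connection.executemany(
        "INSERT INTO schools VALUES (?, ?, ?)",
        [(r["URN"], r["School name"], r["Overall effectiveness"]) for r in rows],
    )
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT urn, name, effectiveness FROM schools ORDER BY urn"
).fetchall()
print(loaded)
```

Each stage is a separate function, so a noisy source can be handled by changing only the transform step.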

Downloading the data from the source:

First, enter https://data.gov.uk in the address bar of your browser, or search for it in your search engine.
The link opens the UK government's data portal. The page shows the options in the picture below:

Clearly, there are many departments for which we can find data to analyze. For this example we will download data from Education. So please click on Education and proceed to the next level.
It will look like:

On clicking Education, a lot of datasets appear on screen with their links and meta-descriptions.
The corresponding page looks like the above.

Lastly, we select CSV as the format on the left, apply the filter and then click on the dataset named "Maintained schools and academies inspections and outcomes".

This opens a description of the data, a summary and links. Now we will download the data for the period up to 31st March, 2017.
Now the data is downloaded to your machine. Let's move on to uploading the data to the Azure platform.

Uploading and knowing the data:

First, open the Azure ML platform. Then click +NEW in the left corner. It brings up a screen like the one below:


Now click on the option named DATASET. This opens an option to upload from a local file. Clicking it pops up a dialogue box, which looks like the one below:
Now browse your PC using the Choose File option and pick the specific file, which in this case is named Maintained_schools_and.... Select it and click Open. To start the upload, close the dialogue box by clicking the tick in the bottom right corner.
This leads to the following upload progress view, which looks like:

When it finishes, it shows that the data has been uploaded. Click OK and the upload is complete.

Opening the dataset in a new experiment:

This works in a similar way. Go to +NEW again and open EXPERIMENT. This shows a lot of experiment templates, many of which lead to tutorial experiments with machine learning and other aspects already implemented. To use our own dataset, however, we will go with BLANK EXPERIMENT, so select Blank Experiment.
This directly opens the environment of an experiment. A new experiment environment looks like the following:

Now, on the far left, you can see all the options available from the main homepage. One can still open projects, web services, datasets, trained models and settings from this page. To the right of that, one can see many more options such as saved datasets, trained models, data models, machine learning and others.

Understanding the experiment environment:

An experiment environment works in the following way. One has to search the left bar of functions mentioned above, and then drag the required function onto the experiment canvas where it says "drag and drop items here". As the heading of the post suggests, our main goal is to work on a dataset here. Hence click on Saved Datasets.
It expands into two options, i.e. My Datasets and Samples. Any dataset you upload will be saved under My Datasets, and any data already stored in the cloud by Microsoft will be under Samples.
As we uploaded the data ourselves, it will be under My Datasets.
Now click on My Datasets. It drops down a list of all the datasets uploaded from your PC. Select the dataset named Maintained_schools_... and drag and drop it at the top of the experiment environment.
Now let us see what comes built in with this data. Select the dataset box, then right click on it. This opens a list of available actions. It looks like the following:

"Delete" removes this box from the environment. Copy, cut and paste have their obvious meanings. The dataset option expands into a number of further options, which are: (1) Download (2) Visualise (3) Generate data access code (4) Open in a new notebook. Download lets one download the data in .dataset form. Visualise gives a basic summary of the data. To get a view of the data and understand the basics about it, we will click on the Visualise option.
Visualise leads to the following box:
Clearly, this shows all the columns and also provides extra visualisations. To begin with, it shows a histogram for each column. Selecting a column brings up a set of summary statistics and visualisations for it. The default visualisation is a histogram. The default statistics are mean, median, min, max, unique values, missing values and feature type.
Unique values tells how many distinct values the feature has. So, let's look at the columns. The columns named Organisation ID, URN and LAESTAB have all values unique, so these features can be considered identifiers associated with each entry.
One important aspect of the data is missing values. Doing analysis on data with many missing values is not a good option. If you inspect the 41 columns, you can see that two columns, i.e. number of warnings received in previous year and previous category of concern, have too many missing values. Some other columns also have some missing values.
This brings us to our second action, called cleaning of data.
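The unique value and missing value counts that the visualisation box reports can be approximated outside the Studio as well. The following is a rough sketch in plain Python, not the Studio's own code. The sample rows and cell values are hypothetical, and the Studio may count these statistics slightly differently (for example, in how it treats blank cells).

```python
import csv
import io

# Hypothetical sample; the real file has 41 columns and many more rows.
# The cell values (including "SM") are made up for illustration.
SAMPLE_CSV = """URN,LAESTAB,Previous category of concern
100001,2013614,
100002,2023500,SM
100003,2026007,
"""

def summarise(text):
    """Return {column: {"unique": n, "missing": m}} for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        # Treat empty cells (and short rows, which give None) as missing.
        present = [v for v in values if v not in (None, "")]
        summary[column] = {
            "unique": len(set(present)),
            "missing": len(values) - len(present),
        }
    return summary

summary = summarise(SAMPLE_CSV)
for column, stats in summary.items():
    print(column, stats)
```

A column whose unique count equals the number of rows, such as URN in this sample, behaves like an identifier, while a column with a high missing count is a candidate for cleaning.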

Cleaning of data:

There is a ready-made tool for cleaning missing data. Type "clean" in the search bar of the function bar. A function called "Clean Missing Data" pops up under the Manipulation category of Data Transformation. Drag and drop it below the main data box.
Now observe that there is a small circle at the bottom middle of the data box. If you click on it and drag the cursor out, a connector emerges. Hold the connector.
Each function to be applied has ports on its top. Connect the connector to the top port. On hovering the mouse over these bottom and top ports, one can see a description of what comes out of them or what has to be connected to them, i.e. dataset, processes etc.
Once you connect to the top port, some properties, which are of course at their defaults in the beginning, pop up on the right. You can change these to get different results.
The important properties for missing data manipulation are "Launch column selector" and the replacement value. We do not go into details here and run the box with the defaults.
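To give an idea of what the two properties mean, here is a small sketch in plain Python that replaces missing cells with a fixed value in a chosen set of columns. It only illustrates the idea and is not how the Studio module is implemented. The `clean_missing` function, the sample rows and the replacement value of "0" are assumptions; check the properties panel for the actual defaults.

```python
import csv
import io

# Hypothetical sample rows; the exact header text and values are assumptions.
SAMPLE_CSV = """URN,Number of warnings received in previous year,Previous category of concern
100001,,
100002,1,SM
100003,,
"""

def clean_missing(rows, replacement="0", columns=None):
    """Return copies of rows with empty cells replaced.

    `columns` plays the role of the column selector: None means all columns.
    `replacement` plays the role of the replacement value.
    """
    cleaned = []
    for row in rows:
        targets = columns if columns is not None else list(row.keys())
        new_row = dict(row)  # copy, so the input rows stay untouched
        for column in targets:
            if new_row[column] in (None, ""):
                new_row[column] = replacement
        cleaned.append(new_row)
    return cleaned

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_missing(rows)
for row in cleaned:
    print(row)
```

Passing a list of column names restricts the replacement to those columns, which is roughly what choosing columns in the selector does.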

To run an experiment, observe that at the bottom there are options such as Run, Save, Save As and others. Clicking Run will run the whole experiment. Here it may not make a big difference, but in a big enough experiment it may take a really long time. Hence we will do it a bit differently: we will select the missing data cleaning box and click Run Selected.

Once "finished running" pops up and a green tick appears on the side of the box, it means the experiment has run successfully.

This new dataset can now be used for various analyses, SQL transformations and other interesting work. We will work further on this data in the next blog. Please stay tuned for more. Thanks for reading.


