
SQL queries in Azure ML Studio part 1

So, I am back, and we are going to continue with some new work on the school inspection data we loaded previously.
In the last post I uploaded a dataset from UK data into Azure ML Studio. If you have not seen that post, you can refer to it here. If you are not familiar with the Azure ML platform, you can find a brief introduction here. Now let's explore what SQL queries we can perform here. 
For this work, I will first mention some of the standard and easy SQL constructs.

SQL Basics:

First of all, the basic syntax of a query looks like this:

select #column, #column from #table_name
#conditions;

So, the three easy parts are: you write what you want to retrieve in the select portion, then you mention the table in the database from which you are going to retrieve the data, and lastly, to make the query do something useful, you provide the conditions. A minimal example is given just below.
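For instance, here is a minimal sketch using made-up names (name and score are hypothetical columns, just for illustration):

    select name, score from t1
    where score > 50;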

Now, I will mention some of the constructs which we are going to use in a minute. Here I will refer to the tables as t1, t2 and t3, but in real databases you use the actual table names.

(1) selecting the whole table:

    code: select * from t1;
    result: Gives the whole table. 
    Caution: This query can overload or kill the server if the server is weak and/or your data is too big. Also, you may be denied permission to run this type of query if you are working on someone else's data, e.g. if you are a temporary data analyst or similar.
    Advantage: In a database where you do not have a visualise option like Azure ML provides, this lets you view the table and get a basic understanding of each column. Even if it is not allowed, you can still work perfectly well; you will just need information about the schema.

(2) grouping by some column:
   
    code syntax: select aggregate_function(column), grouping_column from t1
                          group by grouping_column;
    result: gives one aggregated row per group.
   
   Explanation: there are quite a few aggregate functions. An aggregate function creates a summary, or aggregate, of the column it is applied to. I will mention three of them below which I will use frequently; a combined example follows the list:
(a) Count: count(X), where X is a column, counts the number of appearances in that column. It is used with a condition or a grouping and therefore gives the count per condition or per group. It is often used with categorical variables, as for them counting is the most sensible aggregate.
(b) Sum: sum(X), where X is a column, adds up all the data in that column. This is good for balance sheets or tally-marked kinds of variables, where the sum is a sensible aggregate of the column. One cannot use it on a non-numeric variable; it is strictly an aggregate for numerical variables.
(c) Average: avg(X), where X is a column, computes the average of all the data in that column. This is good for marks, trends of scores or values, temperatures and many other types of data. Again one has to be cautious: it is strictly for numerical variables.
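
To make this concrete, here is a minimal sketch that uses all three aggregates at once, assuming t1 has a hypothetical categorical column region and a hypothetical numerical column pupils:

    select region, count(*) as school_count, sum(pupils) as total_pupils, avg(pupils) as average_pupils
    from t1
    group by region;

Each output row corresponds to one distinct value of region, with the count, sum and average computed within that group.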

(3) conditioning:

    code syntax: select #something from t1 where #conditions;

    Conditioning is one of the most important parts of a query. What is a query without a condition!
  Important points are:
  (a) SQL supports "and" to link multiple conditions.
  (b) You can provide mathematical relations like factor < 20 and relations > 60;
  (c) You cannot filter the aggregated results using where when a grouping is involved (where still works, but it only filters the individual rows before they are grouped). For grouped data the conditioning goes like the following:
      select #some aggregate functions of columns from t1 
      group by column name
      having #conditions;
  Here you have to express the conditions with having. Another mistake one can make is to write a condition about individual data points when the grouping has already merged many of them together; this returns an error. For this reason, the having conditions can only refer to the grouping column or to aggregate functions. A bit of thought about the abstract concept of grouping helps in understanding this. A combined where/having example is given below.
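
The minimal sketch below puts where and having together, again with the hypothetical columns region and pupils:

    select region, avg(pupils) as average_pupils
    from t1
    where pupils > 0
    group by region
    having avg(pupils) > 200;

The where clause filters individual rows before the grouping happens; the having clause then filters whole groups using the aggregate value.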
 
I think this is enough talking. I will now explore some basics of the uploaded data using the Apply SQL Transformation module.

Apply SQL Transformation: 


This is the screen from the last experiment session, where we uploaded and cleaned the data. Now, go to the search bar on the left side and type "apply sql"; the Apply SQL Transformation module pops up, which again sits under the Manipulation part of Data Transformation.

Drag and drop the module into the experiment area. Observe that there are three input ports on top of the Apply SQL Transformation module and one output port at the bottom. From left to right, the top three ports can be connected to datasets, which will be referred to as t1, t2 and t3 in the SQL queries. There is also one output, which can be used to view the result or to save it. More about that later.

Now, connect the cleaned dataset to the t1 port. If you intend to use a new dataset, you take its lower port, which is the dataset's output port, and connect it to the t1 port of the Apply SQL Transformation module.

This is trivial, but note that you can connect multiple Apply SQL Transformation modules to one output port of a dataset, but not multiple datasets to one input port of an Apply SQL Transformation module.

Also, if you are a beginner in SQL, you can chain many Apply SQL Transformation modules together in a network to ease up your queries and do the work step by step, rather than, as in conventional databases, having to put everything together into one frighteningly long SQL query.

OK, so on the right side of the experiment screen, under Properties, a dialogue box opens up when you connect one or more datasets to the Apply SQL Transformation module. Here you write your SQL queries. One disadvantage of the Azure ML platform's SQL dialogue box is that you do not get auto-completion or suggestions based on the existing columns. But that's OK: if you are using the Azure ML platform, you can simply write down the column names in a text file and refer to them whenever needed.

Now, that's all the talk, and next we will work towards some real questions.

For our work's sake, I will note down the column names here, with a bit of description from the early inspection of the data:

(1) Web Link (2) Organisation ID (3) URN (4) LAESTAB (5) School name
(6) Region (7) Ofsted region (8) Local authority (9) Parliamentary constituency (10) Postcode
(11) Type of education (12) Phase of education (13) Sixth form (14) Predecessor URN
(15) Number of warning notices issued in 2016/17 academic year
(16) Total number of pupils (17) Deprivation index (18) Inspection number
(19) Inspection start date (20) Inspection end date (21) Publication date (22) Inspection type
(23) Inspection type grouping (24) Event type grouping (25) Overall effectiveness
(26) Category of concern
(27) Outcomes for pupils (28) Quality of teaching, learning and assessment
(29) Effectiveness of leadership and management (30) Personal development, behaviour and welfare
(31) Early years provision (where applicable) (32) 16 - 19 study programmes
(33) Previous inspection number (34) Previous inspection start date
(35) Previous inspection end date (36) Previous publication date
(37) Previous overall effectiveness (38) Previous category of concern
(39) Previous outcomes for pupils (40) Previous quality of teaching, learning and assessment
(41) Previous effectiveness of leadership and management
(42) Previous personal development, behaviour and welfare
(43) Previous early years provision (where applicable)
(44) Previous 16 - 19 study programmes

So, as we previously observed, the first 5 columns are more or less identification related, mostly there to secure the uniqueness of each record and maintain the official record. But the questions of first-level depth will originate from column 6 onwards.
One may want to distinguish the schools according to their regions, Ofsted regions or constituency. One may also want to count how many schools there are per postcode or per constituency, and perhaps find out the weaker portions of the country where there has been a record of fewer schools for years. A sketch of such a count is given below.
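
For example, a count of schools per region could look like the sketch below. The Apply SQL Transformation module speaks SQLite-style SQL, so I am assuming that column names containing spaces can be quoted with square brackets; if that does not work in your version, renaming the columns beforehand is an alternative.

    select Region, count([School name]) as school_count
    from t1
    group by Region;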
Now, columns 18 to 26 deal entirely with the inspections and with the effects and records of those inspection results or notices issued. These can be used for categorical distinctions of schools, like how many different schools come under which category of concern, which concerns are increasing, and so on. One can also summarise the overall effectiveness numbers, as they somewhat reflect how the inspected schools, and thus the system being inspected, are doing overall. So we can see that different columns will give rise to different concepts and insights as we think about and read more of the data. One such categorical count is sketched just below.
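
As one concrete example of such a categorical breakdown, this minimal sketch counts how many schools fall under each category of concern (with the same assumption about bracket-quoting column names that contain spaces):

    select [Category of concern], count(*) as school_count
    from t1
    group by [Category of concern];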

There are a bunch of previous inspection records in the table, in columns 33 to 44. These are clearly reference data. One can compare them with the current columns and identify schools which have, in some sense or other, done better or worse between the two inspections. A sketch of such a comparison follows.
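
A simple starting point for such a comparison is to list the schools whose grade changed between the two inspections, as in the minimal sketch below. Whether a change counts as better or worse depends on the grading convention (for Ofsted grades, a lower number is normally the better rating), so here I only pick out the schools where the two grades differ:

    select [School name], [Previous overall effectiveness], [Overall effectiveness]
    from t1
    where [Overall effectiveness] <> [Previous overall effectiveness];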

Such data can give rise to concerns, insights and maybe even prescriptions, using SQL queries alone. From a data-science perspective, the main concern about an SQL query is not writing the query itself; it is deciding which questions are worth asking.

We will now delve into forming concrete questions, answering them and visualising the results using SQL in part 2.
This is the link for part 2. Enjoy reading!



    
