
Posts

Showing posts from October, 2020

what is docker?

Prologue: I had been hearing about kubernetes applications in the cloud for quite a while, and I was feeling low about not understanding what kubernetes is. So I headed over to linkedin learning, an awesome platform, to learn about kubernetes. But the kubernetes courses listed an understanding of docker as a prerequisite, so I started learning about docker. Introduction: (Photo by Andy Li on Unsplash) In this post, I want to give you a preliminary understanding of what docker and a container are, and maybe a basic introduction to a few usages and concepts. This is the first part of the learn-docker series I am writing, following my own learning curve in understanding docker, containers, containerization and other processes. Now let's get started with our learning. what is docker? Docker is a utility which carves up your computer into small sealed containers that are isolated from the outside world. Each container holds its own code, environment and everything else it needs. It also builds
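As a minimal sketch of that isolation in practice, here is an example using the docker python SDK (docker-py); the alpine image is only an illustrative choice, and it assumes the docker daemon is installed and running locally:

import docker

# connect to the local docker daemon
client = docker.from_env()

# run a throwaway container: it gets its own filesystem and process space,
# sealed off from the host; remove=True cleans it up afterwards
output = client.containers.run("alpine", "echo hello from a container", remove=True)
print(output.decode())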

2 ways to optimize your aws machine operations

 Introduction: Many times, as a data scientist, you will be working on cloud machines hosted on aws, gcp or another provider. These machines are costly, so when you run your code on them, you are actively increasing the project's spending. During my last session on aws, I encountered a moderately large program to run, which by my estimate would take 48-72 hrs. To decrease the time needed as well as optimize the operations, I took 2 steps. This post will briefly describe these processes. (1) cythonizing my codes: Let's face it: python is SLOW. Yes, python is slow, and that's why most of the standard computation libraries are written on a cython or c++ backend. But being a python- and pandas-dependent data scientist, I write most of my code in pure python. Such code is very slow to run compared to, say, a c++ or cython equivalent of the same code. So the easiest way to reduce operation time is to create cython libraries
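To make this concrete, here is a minimal sketch of the kind of numeric loop that benefits from cythonizing, written in cython's pure-python mode; the function name is made up for illustration, and it assumes cython is installed (pip install cython):

import cython

# a plain python loop; the static C-type annotations let cython
# compile it down to fast C, while it still runs as ordinary python
def sum_of_squares(n: cython.int) -> cython.double:
    total: cython.double = 0.0
    i: cython.int
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10_000))

# to compile: save as fast_ops.pyx and run `cythonize -i fast_ops.pyx`,
# then import the built module like any other python module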

word similarity using spacy

 Introduction: In text mining algorithms, as well as nlp-based data modeling, word similarity is a very common feature. Word similarity in an nlp context refers to the semantic similarity between two words, phrases or even two documents. We will discuss how to calculate word similarity using the spacy library. what is similarity in NLP and how is it calculated? In NLP, lexical similarity between two texts refers to the degree to which the texts have the same literal and semantic meaning, i.e. how similar the texts are in meaning; this is what similarity metrics in NLP calculate. There are many different ways to create word similarity features, but the core logic is mostly the same in all cases: create two representative vectors for the two items, using either universal vectors created from pre-trained models like word2vec, glove, fasttext, bart and others, or using the present document with methods like tf-idf match, pagerank procedures, etc. whatever
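As a minimal sketch of this in spacy (it assumes a model with real word vectors, e.g. en_core_web_md, installed via python -m spacy download en_core_web_md):

import spacy

# a medium or large model is needed: the small models ship without word vectors
nlp = spacy.load("en_core_web_md")

doc1 = nlp("cat")
doc2 = nlp("dog")

# cosine similarity between the two word vectors
print(doc1.similarity(doc2))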

spacy exploration part 4: neural network model training using spacy

 Introduction: We have discussed different aspects of spacy in part 1 , part 2 and part 3 . Up to this point, we have used the pre-trained models. However, in many cases you may need to tweak or improve the models, or enter new categories into the tagger or entity recognizer for specific projects or tasks. In this part, we will discuss how to modify the neural network model or train our own models, as well as the different technical issues which arise in these cases. How does training work? 1. initialize the model with random weights, using nlp.begin_training. 2. predict a batch of samples with the current model, via nlp.update. 3. compare the predictions with the true labels, calculate the weight changes based on that comparison, and update the weights. 4. finally, reiterate from step 2. Refer to the following diagram for a better understanding. Now in practice, the goal is to create annotated data using prodigy, brat or, most easily, the PhraseMatcher feature of spacy. Using PhraseMatcher we can quickly label
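A minimal sketch of that loop with spacy v2's API (the GADGET label and the single training example are made up for illustration):

import random
import spacy
from spacy.util import minibatch

# a blank pipeline with a fresh named entity recognizer
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

TRAIN_DATA = [
    ("I bought a new iPhone", {"entities": [(15, 21, "GADGET")]}),
]

optimizer = nlp.begin_training()  # step 1: initialize random weights

for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=2):
        texts = [text for text, _ in batch]
        annotations = [ann for _, ann in batch]
        # steps 2-3: predict the batch, compare with the labels, update weights
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(epoch, losses)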

spacy exploration part 3: spacy data structures and pipelines

 Introduction: We discussed dependency parsing in part 2 of the spacy exploration series. Now, in part 3 of our spacy exploration, we will explore some more NLP concepts through spacy pipelines and utilities. Let's dive in. How does spacy work internally? Spacy uses every optimization possible to make processing as fast as possible. One of the main tricks is to use hash codes for the strings, and to turn them back into strings as late as possible. This helps because an integer hash takes a fixed amount of space and can be processed faster than a string in most operations. For this reason, all strings are hash-coded, and the vocabulary object behaves like a double dictionary: using the hash you can find the string, and using the string you can find the hash. See the following examples to get the idea about hashing:   Now, let's go over the data structures of the main objects in nlp. First we will see how to create a doc object manually to understand the r
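As a minimal sketch of the two-way lookup and of building a doc object by hand (assuming a blank english pipeline):

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# processing a text stores its strings in the vocab's string store
doc = nlp("I love coffee")
coffee_hash = nlp.vocab.strings["coffee"]  # string -> hash
print(coffee_hash)
print(nlp.vocab.strings[coffee_hash])      # hash -> string: the "double dictionary"

# building a Doc manually from words and trailing-space flags
words = ["Hello", "world", "!"]
spaces = [True, False, False]
manual_doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(manual_doc.text)  # Hello world!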

Psychopy builder running silently but not working

 Recently I started to use the psychopy builder version on linux, ubuntu 18.04. After learning how to use psychopy as a python library, I needed to use it as a software with the builder version. Now, after writing the builder test, when I tried to run it, it silently did not run, without showing any error. I went through a painstaking procedure to solve this, and that's why I am writing this post about how I solved it. Enough of the prologue; let's dig into the possible solutions. Solution no 1: One of the reasons this issue occurs is that the psychopy builder version gets confused by the graphics cards. The solution to this issue, in its details and variations, is mentioned in this github issue thread. But if you want to avoid going through it, here is the solution mentioned there: First of all, check for /etc/X11/xorg.conf.d/20-intel.conf. If you have this file, that is good. Put the following content inside the file:

Section "Device"
    Identifier "Intel Graphics"
    Driver "intel"
EndSection

Sorting guide in C

  Sorting  Written by: Manjula Sharma Sorting means arranging a set of data in a particular order. In our daily life, we can see so many applications using data in sorted order, such as a telephone directory, a merit list, roll numbers, etc. So, sorting is nothing but the storage of data in sorted order (it can be in ascending or descending order). Some sorting techniques: bubble sort, selection sort, insertion sort, merge sort, quick sort, heap sort, radix sort, shell sort, bucket sort, counting sort. The main component in any sorting is the key comparison, because most of the time we need to compare the key with elements in a list. The more comparisons there are, the more time the program takes to execute. Let 'n' be the input size of an array. If n increases, the execution time also increases, so execution time varies with different volumes of data. The efficiency of any sorting technique can be represented using mathematical notations. All the sorting technique efficiencies are in-b
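Since the guide itself works in C, the following is only a quick illustrative sketch, in python, of how the number of key comparisons drives execution time; the explicit comparison counter is added just for illustration:

# bubble sort with an explicit key-comparison counter
def bubble_sort(data):
    comparisons = 0
    n = len(data)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            comparisons += 1  # one key comparison per pair inspected
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return data, comparisons

print(bubble_sort([5, 2, 9, 1]))  # ([1, 2, 5, 9], 6) -> roughly n*(n-1)/2 comparisons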