
2 ways to optimize your AWS machine operations

Introduction:

As a data scientist, you will often work on cloud machines, whether on AWS, GCP or another provider. These machines are costly, so every hour your code runs adds to the project's spending. During my last session on AWS, I had a moderately large program to run, which by my estimate would take 48-72 hours. To cut that time down and generally optimize the operation, I took 2 steps. This post briefly describes them.

(1) Cythonizing my code:

Let's face it: Python is SLOW. Yes, Python is slow, and that's why most of the standard computation libraries are written on top of Cython or C++. But being a Python- and pandas-dependent data scientist, I write most of my code in pure Python. That code is very slow to run compared with a C++ or Cython equivalent of the same logic. So the easiest way to reduce operation time is to build Cython extensions from your existing Python modules. I will not describe exactly how to do that here, as Paperspace has an awesome guide on how to turn Python scripts into Cython.
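For reference, compiling an existing module usually comes down to a small setup.py like the sketch below. This is only a minimal sketch; the module name helpers.py is a hypothetical stand-in for your own file.

# setup.py -- minimal sketch for compiling an existing Python module with Cython
# (assumes Cython is installed; "helpers.py" is a hypothetical module name)
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="helpers",
    ext_modules=cythonize("helpers.py", compiler_directives={"language_level": "3"}),
)

Running python setup.py build_ext --inplace then produces a compiled extension that you import exactly like the original module.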

While following the guide, I noticed one error that can occur if you don't follow coding style standards. Python code may run fine with tabs and spaces mixed, even though that is a clear violation of the PEP 8 style guide; but when you turn that Python code into a Cython module, it raises an error. The obvious solution is to go to the offending line manually and fix it. But if you have a bigger file and many errors like this keep coming up, you may want to use the autopep8 module to clean them out.
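If you would rather do that cleanup from Python rather than the command line, autopep8 exposes a fix_code function; the sketch below assumes a hypothetical file called messy_helpers.py.

# clean_style.py -- sketch: normalize mixed tabs/spaces and other PEP 8 issues
# ("messy_helpers.py" is a hypothetical file name)
import autopep8

with open("messy_helpers.py") as f:
    source = f.read()

fixed = autopep8.fix_code(source, options={"aggressive": 1})

with open("messy_helpers.py", "w") as f:
    f.write(fixed)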

Once you Cythonize your helper modules and libraries, you can expect around a 40-50% speed increase. Now, if you are not under a hard developer-time constraint, i.e. your project timeline is not very tight and you can spend an extra day or two, you should also consider writing proper Cython syntax (static type declarations) into your code. Again, I will not cover writing Cython syntax or Cython augmenting files in this post, but you can get started from the official Cython documentation.
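To give a flavour of what that looks like, here is a tiny, hypothetical example of a typed function in a .pyx file; the name and types are only an illustration, not part of my original code.

# helpers.pyx -- sketch of adding static C types to a hot loop (Cython syntax)
def column_sum(double[:] values):
    cdef Py_ssize_t i, n = values.shape[0]
    cdef double total = 0.0
    for i in range(n):
        total += values[i]
    return total

It compiles through the same setup.py route as before, just pointing cythonize at the .pyx file; the typed loop avoids Python object overhead on every iteration, which is where most of the speed-up comes from.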

Using this, there was what I would call a significant improvement in the code's performance across multiple sample files: around a 40% decrease in total operation time in my three best runs. But there was one more problem left.

If you have worked with cloud machines, you must have seen a MemoryError in the middle of your code at least once. I used very simple tactics to solve this issue, and that is the second point I am going to discuss.

(2) Zipping bigger files and cleaning your "virtual home":

You run lots of code in the cloud. You type df in the terminal and see that there is not enough disk space left, and that's why your code stopped. If you are a beginner, chances are that,

(1) you didn't put checkpoints in your code and save intermediate outputs, or

(2) if you are an amateur like me, you did save intermediate outputs, but they clogged up the disk and stopped your code along the way!

Whatever the reason may be, you now have to clean a whole lot of things out of the machine's storage. The steps to follow are:

(1) Check whether any resources are left over from code that you don't need for the current operations. Save these files to your relevant cloud storage in zipped form, both to speed up any re-download and to reduce storage bucket costs (see the sketch after this list).

(2) Delete the intermediate outputs you don't need if you have to rerun the process. It may be possible in your code flow to break it up and restart from the point closest to where it stopped; that is what intermediate outputs are for. So judge, delete, and keep only the most important intermediate outputs. Also, modify your code to make the most of the machine time you have already spent.

(3) Zip the moderate (>100 MB) to big (>1 GB) files. If you are reading a CSV file, for example with pd.read_csv, it also accepts a .gz version of the same file. The advantage is that less storage is occupied and the chance of your code stopping decreases dramatically. Change your code to store its intermediate outputs in zipped form too, as in the sketch below.
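Putting points (1)-(3) together, here is a minimal sketch of how that can look with pandas and boto3. The file names, bucket name, column name and feature-building step are all hypothetical; the only things relied on are that pandas infers gzip compression from a .gz extension and that boto3's upload_file call exists.

# checkpoint_and_zip.py -- sketch: gzip-compressed intermediate outputs + S3 backup
# (file names, bucket name and the feature-building step are hypothetical)
import os
import pandas as pd
import boto3

CHECKPOINT = "intermediate_features.csv.gz"   # .gz extension => pandas writes it gzipped

if os.path.exists(CHECKPOINT):
    # restart from the checkpoint instead of recomputing everything
    features = pd.read_csv(CHECKPOINT)        # compression inferred from the .gz extension
else:
    raw = pd.read_csv("raw_data.csv.gz")      # reading a .gz file works the same way
    # hypothetical expensive feature-building step
    features = raw.groupby("user_id", as_index=False).mean(numeric_only=True)
    features.to_csv(CHECKPOINT, index=False)  # stored compressed on disk

# push the compressed checkpoint to your storage bucket, so it can be
# deleted locally later without losing the work already done
boto3.client("s3").upload_file(CHECKPOINT, "my-project-bucket", CHECKPOINT)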

Now, even if you do all of that, let's assume the worst possible scenario: your code still hits a MemoryError in one or more worker processes (obviously you have multiprocessed your code if you are using multiple cores). In this scenario, the problem is that your code exceeds the memory each process can be allotted. That means your hand is forced, and you have to pick a machine with both more cores and more memory.

For EC2 instances, check this page and decide which instance type you want to change your machine to. To change the machine:

from the instance name > (right click) > select "Instance settings" > select "Change instance type"

which opens a dialogue box with a drop-down of the different instance types to choose from. Choose the most suitable one and go with it.
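If you prefer doing the same thing from code, the resize can also be scripted with boto3. This is only a sketch; the instance ID and target type are placeholders, and note that the instance has to be stopped before its type can be changed.

# resize_instance.py -- sketch: change an EC2 instance type with boto3
# (the instance ID and target type are placeholders)
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"

# the instance must be stopped before its type can be changed
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m5.4xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])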

As a rough calculation: moving to an instance in the same family with 4x the cores also gives you roughly 4x the memory. So whether you keep the same number of worker processes (each now with about 4x the memory headroom) or run 4x as many workers (each handling a quarter of the data), the per-worker memory pressure drops by roughly 4x. Do that kind of back-of-the-envelope arithmetic and pick the instance size accordingly.

Unfortunately, I don't have experience with Google Cloud or other environments to provide a similarly detailed guide for this instance-resizing issue.

Anyway, those are my 2 steps to optimize my AWS operations. Share this if you liked the suggestions; comment to suggest changes or point out unintended mistakes, if any; and/or to add something more you would like to read along these lines.

Thanks for reading!
