Introduction:
As a data scientist, you will often work on cloud machines hosted on AWS, GCP, or another provider. These machines are costly, so every hour your code runs adds to the project's spending. During my last AWS session, I had a moderately large program to run, which by my estimate would have taken 48-72 hours. To cut that time down and optimize the operations, I took 2 steps. This post briefly describes them.
(1) Cythonizing my code:
Let's face it: Python is SLOW. Yes, Python is slow, and that's why most of the standard computation libraries are written with Cython or C++ under the hood. But being a Python- and pandas-dependent data scientist, I write most of my code in pure Python. That code is very slow to run when compared, say, to an equivalent C++ or Cython implementation. So the easiest way to reduce operation time is to build Cython modules from your existing Python modules. I will not describe exactly how to do that, as Paperspace has an awesome guide on how to turn Python scripts into Cython.
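To give an idea of what that compilation step involves, here is a minimal sketch of a setup script, assuming your helper module is a file called helpers.py (the name is just an illustration, not something from the Paperspace guide):

    # setup.py -- compile helpers.py into a C extension module with Cython
    # Build with: python setup.py build_ext --inplace
    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        ext_modules=cythonize(
            "helpers.py",  # hypothetical pure-Python module to compile
            compiler_directives={"language_level": "3"},
        )
    )

After building, import helpers picks up the compiled extension instead of the plain .py file, with no change to the calling code.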
While following the guide, though, I noticed one error which may occur if you don't follow coding-style standards. Python code may run fine with tabs and spaces mixed, even though that is a clear violation of the PEP 8 style guide; but when you turn your Python code into a Cython module, it raises an error. The obvious fix is to go to the offending line manually and correct it. But if you have a bigger file and lots of errors like this are coming up, then you may want to use the autopep8 module to clean them out.
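As a rough sketch of that cleanup, autopep8 can be run from the command line (autopep8 --in-place --aggressive helpers.py) or used as a library; the file name helpers.py is again just a stand-in:

    # clean_style.py -- normalize mixed tabs/spaces and other PEP 8 issues before cythonizing
    import autopep8

    with open("helpers.py") as f:  # hypothetical module with mixed indentation
        source = f.read()

    fixed = autopep8.fix_code(source, options={"aggressive": 1})

    with open("helpers.py", "w") as f:
        f.write(fixed)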
Once you cythonize your helper modules and libraries, you can expect around a 40-50% speed increase. Now, if you don't have a developer-time constraint, i.e. your project schedule is not very tight and you can spend an extra day or two, you should also consider writing proper Cython syntax into your code. Again, I will not cover writing Cython syntax or Cython augmentation files in this post, but you can get started from the official Cython documentation.
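Just to give a flavor of what those additions look like, here is a small sketch of a numeric helper rewritten with static Cython type declarations in a .pyx file; the function and its name are made up for illustration:

    # helpers_fast.pyx -- the same loop as pure Python, but with typed C variables
    # cython: language_level=3

    cpdef double rolling_total(double[:] values):
        """Sum a 1-D buffer of doubles without Python-object overhead."""
        cdef Py_ssize_t n = values.shape[0]
        cdef Py_ssize_t i
        cdef double total = 0.0
        for i in range(n):
            total += values[i]
        return total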
Using this, there was what I would call a significant improvement in the code's performance across multiple sample files: around a 40% decrease in total operation time in the top 3 runs. But there was one more problem left.
If you have worked with cloud machines, you must have seen a MemoryError in the middle of your code at least once. I used very simple tactics to solve this issue, and that is the second point I am going to discuss.
(2) Zipping bigger files and cleaning your "virtual home":
You run lots of code in the cloud. If you run df in the terminal, you see that there is not enough disk space left, and that's why your code stopped. If you are a beginner, chances are that
(1) you didn't put stop points in your code and save intermediate outputs (see the sketch after this list), or
(2) if you are an amateur like me, you did save intermediate outputs, but they clogged up even more disk space and stopped your code along the way!
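Here is a minimal sketch of what such stop points can look like in a pandas workflow; the file names and the cleaning step below are hypothetical, not from my actual project:

    # checkpoint.py -- persist an intermediate output so a crashed run can resume
    import os
    import pandas as pd

    CHECKPOINT = "stage1_cleaned.pkl"  # hypothetical intermediate output file

    def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
        # Stand-in for an expensive cleaning step.
        return raw.dropna()

    def run_stage_one(raw_path: str) -> pd.DataFrame:
        if os.path.exists(CHECKPOINT):
            # Resume from the saved stage instead of redoing the expensive work.
            return pd.read_pickle(CHECKPOINT)
        cleaned = clean_data(pd.read_csv(raw_path))
        cleaned.to_pickle(CHECKPOINT)  # the "stop point": written to disk
        return cleaned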
Whatever the reason may be, you now have to clean a whole lot of things off the machine's disk. The steps to follow here are:
(1) Check whether any resource left over from the code is not needed for the current operations. Save those files to your relevant cloud storage in zipped form, both to speed up any re-download and to reduce storage-bucket costs.
(2) Delete the intermediate outputs you don't need if you have to rerun the process. It may be possible in your code-flow to break the flow and restart it from the point closest to where it stopped; that's what intermediate outputs are for. So judge, delete, and keep only the most important intermediate outputs. Also, modify your code to make the most of the machine time already spent.
(3) Zip the moderate (>100 MB) to big (>1 GB) files. Keep in mind that if you are reading a CSV file, for example with pd.read_csv, it also accepts a .gz version of the same file. The advantage is that less disk space is occupied and the chance of your code stopping decreases dramatically. Change your code to store the intermediate outputs in zipped form too, as sketched below.
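As a rough illustration of point (3), with hypothetical file names, pandas reads and writes gzip-compressed CSVs directly:

    # compressed_io.py -- keep intermediate outputs gzipped to save disk space
    import pandas as pd

    # read_csv infers gzip compression from the .gz extension.
    df = pd.read_csv("intermediate_stage2.csv.gz")

    # ... your processing on df goes here ...

    # Write the next intermediate output compressed as well.
    df.to_csv("intermediate_stage3.csv.gz", index=False, compression="gzip")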
Now, even if you do all that, let's assume the worst possible scenario: your code still gets a MemoryError in one or more worker processes [obviously you have multiprocessed your code if you are using multiple cores]. In this scenario, the issue is that your code exceeds the virtual memory it is being assigned. That means your hand is forced, and you are bound to select a machine with both more cores and more memory assigned to it.
For EC2 instances, check this page and decide which instance type you want to change your machine to. To change the machine:
from the instance name > (right click) > select "Instance settings" > select "Change instance type"
which opens a dialogue box with a drop-down of different instance types to choose from. Choose the most suitable one and go with it.
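If you would rather script this step than click through the console, a boto3 sketch along the following lines should also work; the instance ID and the target type are placeholders, and the instance must be stopped before its type can be changed:

    # resize_instance.py -- change an EC2 instance type with boto3
    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"  # placeholder instance ID

    # The instance has to be stopped before its type can be changed.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": "m5.2xlarge"},  # placeholder target type
    )

    ec2.start_instances(InstanceIds=[instance_id])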
Moving to a machine in the same category with, say, 4x as many cores (and correspondingly more memory) reduces the per-thread load in two ways: roughly a 4x bigger memory bound per worker, times 4x more workers to spread the data across, i.e. up to 16x more headroom. So do that calculation for your own workload and change the instance accordingly.
Unfortunately, I don't have experience with Google Cloud or other environments to provide the same kind of detailed guide for this instance issue.
Anyway, those are my 2 steps for optimizing my AWS operations. Share this if you liked the suggestions; comment to suggest changes or point out unintended mistakes, if any, and/or to add anything more you would like to read along these lines.
Thanks for reading!