I'm writing a Python script to load, filter, and transform a large dataset using pandas. Iteratively changing and testing the script is very slow due to the load time of the dataset: loading the parquet files into memory takes 45 minutes while the transformation takes 5 minutes.
Is there a tool or development workflow that will let me test changes I make to the transformation without having to reload the dataset every time?
Here are some options I'm considering:
Develop in a Jupyter notebook: I use notebooks for prototyping and scratch work, but I find myself making mistakes or accidentally making my code unreproducible when I develop in them. I'd like a solution that doesn't rely on a notebook if possible, as reproducibility is a priority.
Use Apache Airflow (or a similar tool): I know Airflow lets you define specific steps in a data pipeline that flow into one another, so I could break my script into separate "load" and "transform" steps. Is there a way to use Airflow to "freeze" the results of the load step in memory and iteratively run variations on the transformation step that follows?
Store the dataset in a proper database in the cloud: I don't know much about databases, and I'm not sure how to evaluate whether this would be more efficient. I imagine there is zero load time to interact with a remote database (because it's already loaded into memory on the remote machine), but there would likely be a delay in transmitting the results of each query from the remote database to my local machine?
Thanks in advance for your advice on this open ended question.
For a lot of work like that, I'll break it up into intermediate steps and pickle the results. I'll check if the pickle file exists before running the data load or transformation.
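A minimal sketch of that pattern, assuming a hypothetical cache path and a load/transform split that isn't in the original post:

import pickle
from pathlib import Path

import pandas as pd

CACHE = Path("raw_dataset.pkl")  # hypothetical cache location

def load_raw() -> pd.DataFrame:
    # The expensive ~45-minute step: reading the Parquet files.
    return pd.read_parquet("data/")  # placeholder path

def get_raw() -> pd.DataFrame:
    # Reuse the pickled result if it exists; otherwise load and cache it.
    if CACHE.exists():
        with CACHE.open("rb") as f:
            return pickle.load(f)
    df = load_raw()
    with CACHE.open("wb") as f:
        pickle.dump(df, f)
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # The ~5-minute step you actually want to iterate on.
    return df  # placeholder

if __name__ == "__main__":
    result = transform(get_raw())

For a single dataframe, df.to_pickle() / pd.read_pickle() (or writing an intermediate Parquet file) does the same job with less ceremony.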
I have a blob storage trigger inside of a function app where I'm taking parquet files from a storage container, transforming the data into a dataframe and inserting the data into an Azure SQL DB.
I've eliminated any parallelism by restricting the function to one machine and setting "batchSize": 1 in my host.json file.
I chose to do so because when I allowed the script to run in parallel, sometimes the instances would clash and try to access the same table at the same time, causing problems.
This works fine, but it's obviously fairly slow, especially with the number of files it's initially running on. I've also decided not to run the bigger files at all (files of more than about 2 MB), because I'd rather side-load them than keep the function held up.
I'm just looking for suggestions/direction on how I could make this process go faster. Also, is it normal for a parquet file of 2 MB or greater to take a long time for what I'm doing? I feel like 2 MB is not a huge file size, but I don't have any reference to go off of.
Some additional information: my main dependencies for the script are pandas (for the dataframe) and SQLAlchemy (specifically to_sql() to insert the dataframe into the DB).
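For reference, a rough sketch of that insert path as described, with a hypothetical connection string and table name (not from the original post); the chunked to_sql() call is just one knob commonly experimented with for larger files:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name.
engine = create_engine("mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server")

def load_blob_to_sql(parquet_path: str) -> None:
    # Read one Parquet blob into a dataframe ...
    df = pd.read_parquet(parquet_path)
    # ... and append it to the target table in batches, so a bigger file
    # does not become one enormous INSERT statement.
    df.to_sql("target_table", engine, if_exists="append", index=False, chunksize=1000)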
I am working with a legacy Kubeflow project; the pipelines have a few components that apply filters to a data frame.
To do this, each component downloads the data frame from S3, applies its filter, and uploads the result to S3 again.
The components that use the data frame for training or validating the models likewise download it from S3.
The question is whether this is a best practice, or whether it is better to share the data frame directly between components, because the upload to S3 can fail and then fail the pipeline.
Thanks
As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 in between pipeline steps.
This stores the intermediate results of the pipeline, and as long as the steps take a long time and are restartable, it may be worth doing. What "a long time" means depends on your use case, though.
Passing the data directly from component to component. This saves you storage throughput and very likely the not insignificant time to store and retrieve the data to / from S3. The downside being: if you fail mid-way in the pipeline, you have to start from scratch.
So the questions are:
Are the steps idempotent (restartable)?
How often does the pipeline fail?
Is it easy to restart the processing from some mid-point?
Do you care about the processing time more than the risk of losing some work?
Do you care about the incurred cost of S3 storage/transfer?
The question is whether this is a best practice
The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in upstream components and downloads the data in downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100GB datasets will probably not work reliably).
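As a rough sketch of that built-in data passing, using the KFP v1 lightweight-component API (the component name and filter are hypothetical; the exact decorators depend on the SDK version used by the legacy project):

from kfp.components import InputPath, OutputPath, create_component_from_func

def filter_df(input_csv_path: InputPath("CSV"), output_csv_path: OutputPath("CSV")):
    # KFP mounts the upstream output as a local file and collects whatever is
    # written to output_csv_path; the component code never talks to S3 itself.
    import pandas as pd  # imports live inside lightweight component functions
    df = pd.read_csv(input_csv_path)
    df = df[df["value"] > 0]  # placeholder filter
    df.to_csv(output_csv_path, index=False)

filter_op = create_component_from_func(
    filter_df, base_image="python:3.9", packages_to_install=["pandas"]
)

The backing artifact storage is configured once at the pipeline/cluster level rather than hand-rolled inside every component.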
or whether it is better to share the data frame directly between components
How can you "directly share" in-memory python object between different python programs running in containers on different machines?
because the upload to S3 can fail and then fail the pipeline
The failed pipeline can just be restarted. The caching feature will make sure that already finished tasks won't be re-executed.
Anyways, what is the alternative? How can you send the data between distributed containerized programs without sending it over the network?
I'm trying to load a ~67 GB dataframe (6,000,000 features by 2,300 rows) into Dask for machine learning. I'm using a 96-core machine on AWS that I wish to utilize for the actual machine learning bit. However, Dask loads CSVs in a single thread. It has already taken a full 24 hours and it hasn't loaded.
# I tried to display a progress bar, but it is not implemented for dask's read_csv
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()
df = dd.read_csv('../Larger_than_the_average_CSV.csv')
Is there a faster way to load this into Dask and make it persistent? Should I switch to a different technology (Spark on Scala or PySpark?)
Dask is probably still loading it as I can see a steady 100% CPU utilization in top.
The code you show in the question probably takes no time at all, because you are not actually loading anything, just setting up the job prescription. How long the real load takes will depend on the blocksize you specify.
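For example (the blocksize value here is only an illustration, not a recommendation):

import dask.dataframe as dd

# Larger blocks mean fewer partitions/tasks; smaller blocks mean more parallelism.
df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize='256MB')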
There are two main bottlenecks to consider for actual loading:
getting the data from disc into memory, i.e. raw data transfer over a single disc interface,
parsing that data into in-memory objects.
There is not much you can do about the former if you are on a local disc, and you would expect it to be a small fraction of the total time.
The latter may suffer from the GIL, even though dask will execute in multiple threads by default (which is why it may appear only one thread is being used). You would do well to read the dask documentation about the different schedulers, and should try using the Distributed scheduler, even though you are on a single machine, with a mix of threads and processes.
Finally, you probably don't want to "load" the data at all, but process it. Yes, you can persist into memory with Dask if you wish (dask.persist, funnily), but please do not use many workers to load the data just so you then make it into a Pandas dataframe in your client process memory.
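A minimal sketch of that suggestion, reusing the same hypothetical CSV path; the worker and thread counts are placeholders to tune:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local "distributed" cluster: several worker processes sidestep the GIL
# during parsing, each with a couple of threads.
cluster = LocalCluster(n_workers=8, threads_per_worker=2)  # placeholder sizing
client = Client(cluster)

df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize='256MB')
df = df.persist()  # materialize the partitions in the workers' memory
# ... then keep operating on df with dask, rather than pulling everything
# back into a single pandas dataframe in the client process.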
I recently created a Python script that performed some natural language processing tasks and worked quite well in solving my problem. But it took 9 hours. I first investigated using Hadoop to break the problem down into steps, hoping to take advantage of the scalable parallel processing I'd get by using Amazon Web Services.
But a friend of mine pointed out that Hadoop is really for large amounts of data stored on disk, on which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low hundreds of MBs) on which I perform many complex operations, taking up a lot of memory during the process and taking many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
Parallel Python is one option for distributing things over multiple machines in a cluster.
This example shows how to do a MapReduce-like script, using processes on a single machine. Secondly, if you can, try caching intermediate results. I did this for an NLP task and obtained a significant speed-up.
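In that spirit, a rough MapReduce-like sketch with the standard-library multiprocessing module (word counting is just a stand-in for the real NLP work):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    # "Map": count words in one chunk of documents.
    return Counter(word for line in lines for word in line.split())

def reduce_counts(a, b):
    # "Reduce": merge two partial counts.
    return a + b

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "quick quick fox"] * 1000
    chunks = [docs[i::4] for i in range(4)]      # split the work into 4 chunks
    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)   # map step in parallel processes
    totals = reduce(reduce_counts, partials)     # reduce step in the parent
    print(totals.most_common(3))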
My package, jug, might be very appropriate for your needs. Without more information, I can't really say how the code would look, but I designed it for sub-Hadoop-sized problems.
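For a flavour of it, a tiny hypothetical jugfile (the real structure would depend on the NLP steps); you run it with jug execute, possibly from several machines sharing the same store:

# jugfile.py -- run with: jug execute jugfile.py
from jug import TaskGenerator

@TaskGenerator
def preprocess(doc):
    # placeholder for an expensive NLP step
    return doc.lower().split()

@TaskGenerator
def combine(token_lists):
    # placeholder aggregation over all preprocessed documents
    return sum(len(tokens) for tokens in token_lists)

docs = ["First document ...", "Second document ..."]
tokenised = [preprocess(d) for d in docs]
total = combine(tokenised)

Intermediate results are memoised on disk, so re-running only executes tasks that haven't finished yet.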
Is there a way to reduce the I/O associated with either MySQL or a Python script? I am thinking of using EC2, and the costs seem okay, except I can't really predict my I/O usage and I am worried it might blindside me with costs.
I basically develop a Python script to parse data and upload it into MySQL. Once it's in MySQL, I do some fairly heavy analytics on it (creating new columns and tables; basically a lot of math and finance-based analysis on a large dataset). So are there any design best practices to avoid heavy I/O? I think memcached stores everything in memory and accesses it from there; is there a way to get MySQL or other scripts to do the same?
I am running the scripts fine right now on another host with 2 GB of RAM, but the EC2 instance I was looking at has about 8 GB, so I was wondering if I could use the extra memory to save me some money.
By I/O I assume you mean disk I/O... and assuming you can fit everything into memory comfortably, you could:
Disable swap on your box†
Use MySQL MEMORY tables while you are processing (or perhaps consider an SQLite3 in-memory store if you are only using the database for the convenience of SQL queries; a small sketch follows below)
Also: unless you are using EBS, I didn't think Amazon charged for I/O on your instance. EBS is much slower than your instance storage, so only use it when you need the persistence, i.e. not while you are crunching data.
†probably bad idea
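A minimal sketch of the SQLite in-memory idea, with a hypothetical table and columns just to show the shape of it:

import sqlite3

import pandas as pd

# Everything lives in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Load the parsed data into an in-memory table (hypothetical dataframe).
df = pd.DataFrame({"ticker": ["AAA", "BBB", "AAA"], "price": [10.5, 20.0, 11.0]})
df.to_sql("prices", conn, index=False)

# Run the analytical SQL against the in-memory store.
result = pd.read_sql_query(
    "SELECT ticker, AVG(price) AS avg_price FROM prices GROUP BY ticker", conn
)
print(result)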
You didn't really specify whether it's writes or reads. My guess is that you can do it all in a MySQL instance on a ramdisk (tmpfs under Linux).
Operations such as ALTER TABLE and copying big data around end up creating a lot of I/O requests because they move a lot of data. This is not the same as if you've just got a lot of random (or more predictable) queries.
If it's a batch operation, maybe you can do it entirely in a tmpfs instance.
It is possible to run more than one MySQL instance on the machine, and it's pretty easy to start up an instance on a tmpfs: just use mysql_install_db with the datadir in a tmpfs, then run mysqld with appropriate params. Stick that in some shell scripts and you'll get it to start up. As it's in a tmpfs, it won't need much memory for its buffers - just set them fairly small.