I have a blob storage trigger inside a function app where I'm taking parquet files from a storage container, transforming the data into a dataframe, and inserting it into an Azure SQL DB.
I've eliminated any parallelism by restricting the function to one machine and setting "batchSize": 1 in my host.json file.
I chose to do so because when I allowed the script to run in parallel, sometimes the instances would clash and try to access the same table at the same time, causing problems.
This works fine, but it's obviously fairly slow, especially with the number of files it's initially running on. I've also decided not to run bigger files at all (anything more than about 2 MB), because I'd rather side-load them than keep the function held up.
I'm just looking for suggestions/direction on how I could make this process faster. Also, is it normal for a parquet file of 2 MB or more to take a long time for what I'm doing? I feel like 2 MB is not a huge file size, but I don't have any reference point to go off of.
Some additional information: my script's main dependencies are pandas (for the dataframe) and SQLAlchemy (specifically to_sql() to insert data from the dataframe into the DB).
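For reference, a minimal sketch of the read-and-insert path; the connection string, table name, and the fast_executemany/chunksize options are placeholders and assumptions, not the exact code:

```python
import io

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the Azure SQL DB; the driver name is an assumption.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # optional pyodbc bulk-insert speedup
)

def process_blob(blob_bytes: bytes, table_name: str) -> None:
    """Read one parquet blob into a dataframe and append it to the target table."""
    df = pd.read_parquet(io.BytesIO(blob_bytes))
    # chunksize bounds how many rows are sent per executemany batch;
    # fast_executemany above makes each of those batches much cheaper with pyodbc.
    df.to_sql(table_name, engine, if_exists="append", index=False, chunksize=1000)
```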
I'm writing a Python script to load, filter, and transform a large dataset using pandas. Iteratively changing and testing the script is very slow due to the load time of the dataset: loading the parquet files into memory takes 45 minutes while the transformation takes 5 minutes.
Is there a tool or development workflow that will let me test changes I make to the transformation without having to reload the dataset every time?
Here are some options I'm considering:
Develop in a Jupyter notebook: I use notebooks for prototyping and scratch work, but I find myself making mistakes or accidentally making my code unreproducible when I develop in them. I'd like a solution that doesn't rely on a notebook if possible, as reproducibility is a priority.
Use Apache Airflow (or a similar tool): I know Airflow lets you define specific steps in a data pipeline that flow into one another, so I could break my script into separate "load" and "transform" steps. Is there a way to use Airflow to "freeze" the results of the load step in memory and iteratively run variations on the transformation step that follows?
Store the dataset in a proper Database on the cloud: I don't know much about databases, and I'm not sure how to evaluate if this would be more efficient. I imagine there is zero load time to interact with a remote database (because it's already loaded into memory on the remote machine), but there would likely be a delay in transmitting the results of each query from the remote database to my local machine?
Thanks in advance for your advice on this open ended question.
For a lot of work like that, I'll break it up into intermediate steps and pickle the results. I'll check if the pickle file exists before running the data load or transformation.
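A minimal sketch of that pattern, assuming the data lives in a directory of parquet files (the paths and checkpoint file name are placeholders):

```python
import os
import pickle

import pandas as pd

CHECKPOINT = "loaded_dataset.pkl"   # hypothetical intermediate file
PARQUET_DIR = "data/"               # hypothetical location of the parquet files

def load_dataset() -> pd.DataFrame:
    # The slow step: read every parquet file and concatenate (placeholder logic).
    files = sorted(
        os.path.join(PARQUET_DIR, f)
        for f in os.listdir(PARQUET_DIR)
        if f.endswith(".parquet")
    )
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

def load_or_restore() -> pd.DataFrame:
    """Do the expensive load once, then reuse the pickled copy on later runs."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    df = load_dataset()
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(df, f)
    return df

if __name__ == "__main__":
    df = load_or_restore()
    # Iterate on the transformation here without paying the load cost again.
```

For dataframes specifically, writing the intermediate result with to_parquet() instead of pickle works just as well and usually reloads faster.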
I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.
I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.
As I understand it, python-rq uses pickle to serialise the function to be executed, including the parameters, and adds this along with other values to a Redis hash.
Since the parameters contain the information to be saved to the database, they are quite large (~50 MB), and when this is serialised and saved to Redis, it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for just 100 MB. In fact, I very often get OOM errors like:
OOM command not allowed when used memory > 'maxmemory'.
I have two questions:
Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
Is there a way to not serialise the parameter but rather a reference to it?
Your thoughts on the best solution are much appreciated!
Since you mentioned in your comment that your task input is a large list of key/value pairs, I'm going to recommend the following (a rough sketch follows the steps):
Load up your list of key/value pairs in a file.
Upload the file to Amazon S3.
Get the resulting file URL, and pass that into your RQ task.
In your worker task, download the file.
Parse the file line-by-line, inserting the documents into Mongo.
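A rough sketch of those steps, assuming boto3 for S3, rq for the queue, and pymongo for the inserts; the bucket, key, database, and collection names are placeholders, and I pass the S3 key rather than a full URL:

```python
import gzip
import json

import boto3
from pymongo import MongoClient
from redis import Redis
from rq import Queue

BUCKET = "my-task-bucket"   # hypothetical bucket name

def enqueue_batch(pairs, key="batches/batch-0001.json.gz"):
    """Producer: write the key/value pairs to a small gzipped file on S3
    and enqueue only the object key (a reference), not the data itself."""
    body = gzip.compress("\n".join(json.dumps(p) for p in pairs).encode("utf-8"))
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body)
    Queue(connection=Redis()).enqueue(load_batch, key)

def load_batch(key):
    """Worker: download the file and insert documents into Mongo line by line."""
    coll = MongoClient()["mydb"]["documents"]        # hypothetical db/collection
    local_path = "/tmp/" + key.rsplit("/", 1)[-1]    # simple local scratch path
    boto3.client("s3").download_file(BUCKET, key, local_path)
    with gzip.open(local_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            coll.insert_one(json.loads(line))
```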
Using the method above, you'll be able to:
Quickly break up your tasks into manageable chunks.
Upload these small, compressed files to S3 quickly (use gzip).
Greatly reduce your Redis usage by requiring much less data to be passed over the wire.
Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.
Hope this helps!
It turns out that the solution that worked for me was to save the data to Amazon S3 storage and then pass the URI to the function in the background task.
Part of a project I am involved in consists of developing a scientific application for internal use, working with a large collection of files (about 20,000), each of which is ~100 MB in size. The files are accompanied by meta information, used to select subsets of the whole set.
Update after reading the response:
Yes, the processing takes place in a single server room.
The application selects two subsets of these files. In the first stage it processes each file individually and independently, yielding up to 30 items per file for the second stage. Each resulting item is also stored in a file, with sizes varying from 5 to 60 KB.
In the second stage the app processes all possible pairs of results produced in the first stage, where the first element of a pair comes from the 1st subset and the second from the 2nd (a cross-join or Cartesian product of the two sets).
The typical number of items in the first subset is in the thousands, and in the second in the tens of thousands, so the number of all possible pairs in the second stage is in the hundreds of millions.
Typical processing time is about 1 second for a single source 100 MB file and microseconds for a single pair of 1st-stage results. The application is not for real-time processing; its general use case is to submit a job for overnight calculation and obtain the results in the morning.
We already have a version of the application, developed earlier when we had much less data. It is written in Python and uses the file system and data structures from the Python standard library. The computations are performed on 10 PCs, connected with self-developed software written with Twisted. Files are stored on a NAS and on the PCs' local drives. Now the app performs very poorly, especially in the second stage and afterwards, during aggregation of results.
Currently I am looking at MongoDB to accomplish this task. However, I do not have much experience with such tools and am open to suggestions.
I have conducted some experiments with MongoDB and PyMongo and found that loading a whole file from the database takes about 10 seconds over Gigabit Ethernet. The minimal chunk size for processing is ~3 MB, and it is retrieved in 320 ms. Loading files from a local drive is faster.
The MongoDB config contained just a single line with a path.
However, a very appealing feature of the database is its ability to store meta information and support searching it, as well as automatic replication.
It is also persistent data storage, so the computations can be continued after an accidental stop (currently we have to start over).
So, my questions are:
Is MongoDB a right choice?
If yes, then what are the guidelines for a data model?
Is it possible to improve retrieval time for files?
Or, is it reasonable to store files in a file system, as before, and store paths to them in the database?
Creation of the list of all possible pairs for the 2nd stage has been performed in the client Python code and also took a rather long time (I haven't measured it); a rough sketch of that pair generation follows below.
Will the MongoDB server do better?
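For reference, the pair generation on the client is essentially a cross-join like this sketch; the subset sizes and the processing call are placeholders. itertools.product yields pairs lazily, so the cost only shows up when the full list is materialised or consumed:

```python
from itertools import islice, product

# Hypothetical stand-ins for stage-1 result identifiers
# (thousands in the 1st subset, tens of thousands in the 2nd).
subset_1 = [f"a-{i}" for i in range(5_000)]
subset_2 = [f"b-{j}" for j in range(30_000)]

# The cross-join itself: product() is lazy, so nothing is materialised yet.
pairs = product(subset_1, subset_2)

# The total number of pairs follows directly from the subset sizes.
print(len(subset_1) * len(subset_2))        # 150,000,000

# Building list(pairs) is what blows up time and memory; consuming the
# generator in chunks keeps memory flat and lets batches go to workers.
first_batch = list(islice(pairs, 1_000))    # e.g. hand 1,000 pairs to a worker
```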
In this case you could go for sharded GridFS as part of Mongo.
That will allow for a faster file retrieval process while still keeping the metadata together with the file record.
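A minimal PyMongo/GridFS sketch of that, keeping metadata on the file record and reading it back in chunks (the database name, file name, and metadata fields are made up):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["experiments"]    # hypothetical database name
fs = gridfs.GridFS(db)

# Store a source file together with its metadata in one GridFS record.
with open("sample_0001.dat", "rb") as f:                 # hypothetical file
    file_id = fs.put(
        f,
        filename="sample_0001.dat",
        metadata={"subset": 1, "instrument": "A"},       # hypothetical fields
    )

# Select files by metadata, then stream their contents in chunks.
for grid_out in fs.find({"metadata.subset": 1}):
    while True:
        chunk = grid_out.read(3 * 1024 * 1024)           # ~3 MB chunks, as in the tests
        if not chunk:
            break
        # process(chunk)                                 # placeholder for stage-1 work
```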
Another way to speed things up when using only a replica set is to have a kind of logical balancer and fetch files one time from the master and the next time from a slave (or another slave, in a kind of round-robin way).
Storing files in the file system will always be a bit faster, and as long as this is about one server room (i.e. processed locally) I would probably stick with it, but with a serious concern about backups.
I'm working on a Python app that I want to scale to accommodate about 150 writes per second, spread out among about 50 different sources.
Is MongoDB a good candidate for this? I'm split between writing to a database or just making a log file for each source and parsing them separately.
Any other suggestions for logging a lot of data?
I would say that MongoDB is a very good fit for log collection, because:
MongoDB has amazingly fast writes.
Logs are not so important, so it's okay to lose some of them in case of server failure, which means you can run MongoDB without the journaling option to avoid write overhead.
In addition, you can use sharding to increase write speed; at the same time you can just move the oldest logs to a separate collection or into the file system.
You can easily export data from the database to JSON/CSV.
Once you have everything in the database, you will be able to query the data to find the logs you need.
So, my opinion is that MongoDB is a perfect fit for things like logs. You don't need to manage a lot of log files in the file system; MongoDB does this for you.
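A minimal PyMongo sketch of that kind of log collection, using an unacknowledged write concern in the same spirit as running without journaling (the database, collection, and field names are made up):

```python
import datetime

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()    # hypothetical local/default server
logs = client["logging"].get_collection(
    "events",
    write_concern=WriteConcern(w=0),   # unacknowledged writes: fast, but losable
)

def write_log(source: str, message: str) -> None:
    # One small document per log line; ~150 writes/s is a light load for this.
    logs.insert_one({
        "source": source,
        "message": message,
        "ts": datetime.datetime.utcnow(),
    })

def recent_logs(source: str):
    # Later, querying is just a filter on the same collection.
    return client["logging"]["events"].find({"source": source}).sort("ts", -1).limit(100)
```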
Is there a way to reduce the I/O associated with either MySQL or a Python script? I am thinking of using EC2, and the costs seem okay, except that I can't really predict my I/O usage and I am worried it might blindside me with costs.
Basically, I developed a Python script to parse data and upload it into MySQL. Once it's in MySQL, I do some fairly heavy analytics on it (creating new columns and tables, basically a lot of math and finance-based analysis on a large dataset). So are there any design best practices to avoid heavy I/O? I think memcached stores everything in memory and accesses it from there; is there a way to get MySQL or other scripts to do the same?
I am running the scripts fine right now on another host with 2 GB of RAM, but the EC2 instance I was looking at has about 8 GB, so I was wondering if I could use the extra memory to save some money.
By I/O I assume you mean disk I/O... and assuming you can fit everything into memory comfortably, you could:
Disable swap on your box†
Use MySQL MEMORY tables while you are processing (or perhaps consider using an SQLite3 in-memory store if you are only using the database for the convenience of SQL queries; see the sketch below)
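A minimal sketch of the SQLite in-memory option (the table and data are made up):

```python
import sqlite3

# ":memory:" keeps the whole database in RAM, so queries never touch disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, day TEXT, close REAL)")  # made-up schema

# Load parsed rows (placeholder data) and crunch them with SQL, all in memory.
rows = [("ABC", "2012-01-03", 10.5), ("ABC", "2012-01-04", 10.9)]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

avg = conn.execute("SELECT symbol, AVG(close) FROM prices GROUP BY symbol").fetchall()
print(avg)    # [('ABC', 10.7)]

# Only the final results need to be written back to MySQL / persistent storage.
conn.close()
```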
Also: unless you are using EBS, I didn't think Amazon charged for I/O on your instance. EBS is much slower than your instance storage, so only use it when you need the persistence, i.e. not while you are crunching data.
†probably bad idea
You didn't really specify whether it's writes or reads. My guess is that you can do it all in a MySQL instance on a ramdisk (tmpfs under Linux).
Operations such as ALTER TABLE and copying big data around end up creating a lot of I/O requests because they move a lot of data. This is not the same as just having a lot of random (or more predictable) queries.
If it's a batch operation, maybe you can do it entirely in a tmpfs instance.
It is possible to run more than one MySQL instance on the machine, and it's pretty easy to start up an instance on a tmpfs: just use mysql_install_db with the datadir in a tmpfs, then run mysqld with appropriate params. Stick that in some shell scripts and you'll get it to start up. As it's on a ramfs, it won't need to use much memory for its buffers; just set them fairly small.