I am trying to build a REST API in Python that relies on large amounts of data being loaded dynamically into memory and processed. The data is loaded into Pandas DataFrames, but my question is not specific to Pandas and I might need other data structures.
After a request to the API, I would like to load the useful data (e.g., read from disk or from a DB) and keep it in memory because other requests relying on the same data should follow. After some time, I would need to drop the data in order to save memory.
In practice, I would like to keep a list of Pandas DataFrames in memory. The DataFrames in the list would be those needed to fulfill the latest requests. Some DataFrames can be very large (e.g., several GB), so I think I cannot afford to retrieve them from a DB every time without a big overhead. This is why I want to keep them in memory for the next requests.
I started with Flask when the API relied on a single, fixed DataFrame. But now I cannot find a way to dynamically load new DataFrames and make them persistent across multiple requests. The loading of a new DataFrame should be triggered inside a request when necessary, and the new DataFrame should be available for the following requests. I do not know how to achieve that with Flask or with any other framework.
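To make the goal concrete, here is a rough sketch of the behaviour I am after, using a plain module-level dict as the cache and a hypothetical load_dataframe helper (I am not sure this is safe or idiomatic with Flask, which is exactly my question):

import time

import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical module-level cache: dataset name -> (DataFrame, last-used timestamp)
_cache = {}
CACHE_TTL_SECONDS = 600  # arbitrary: drop DataFrames unused for 10 minutes

def load_dataframe(name):
    # Hypothetical loader: read the dataset from disk or a DB
    return pd.read_parquet(f"/data/{name}.parquet")

def get_dataframe(name):
    now = time.time()
    # Evict stale entries to save memory
    for key in [k for k, (_, ts) in _cache.items() if now - ts > CACHE_TTL_SECONDS]:
        del _cache[key]
    if name not in _cache:
        _cache[name] = (load_dataframe(name), now)
    df, _ = _cache[name]
    _cache[name] = (df, now)  # refresh the last-used timestamp
    return df

@app.route("/stats/<name>")
def stats(name):
    df = get_dataframe(name)  # loaded once, reused by later requests
    return jsonify({"rows": len(df), "columns": list(df.columns)})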
We use the google-cloud-bigquery Python library to query BigQuery and process the results in our Python script. The processing portion transforms and enriches the data and in the end creates JSON objects.
This is how we use the BQ library in our script (simplified):
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT col1, col2, ... FROM <table>"
queryjob = client.query(query)
result_set = queryjob.result(page_size=50000)
for page in result_set.pages:
    transform_records()
In general, and for moderately sized tables, this works just fine. However, we run into a performance issue when querying a table that returns 11 million records, ~3.5 GB in total. Even if we leave out the processing, just fetching the pages takes ~80 minutes (we did not observe any significant difference between running it locally and in a VM / cluster that resides in the same region as the BigQuery dataset).
Any ideas on how to reduce the loading time?
What we tried:
Varying the page size: The obvious assumption that larger page sizes, and hence fewer pages, reduce HTTP overhead holds true. However, we noticed that setting the page size above ~8,500 did not have any effect (the maximum number of records returned by the API per page was ~8,500). Still, this only accounts for an improvement of a few percent in loading time.
Iterating over the result set records instead of pages: gave us roughly the same performance.
Separating the data loading from the processing by putting the loading portion into a background thread and using a multiprocessing queue to share the data with the processing workers - obviously no impact on the pure time spent receiving the data from BQ.
Trying to fetch multiple pages in parallel - we think this could help reduce the loading time drastically, but we did not manage to do so.
What we did not try:
Using the BQ Storage API, or rather a method that fetches data from BQ using this API (i.e. result_set.to_arrow_iterable / to_dataframe_iterable): We would like to avoid the mess of having to deal with data type conversions, as the output of the processing part will be a JSON object.
Using the BQ REST API directly, without the comfort that the bigquery library offers, in order to fetch multiple pages of the result set simultaneously: This seems somewhat complicated and we are not even sure if the API itself allows for this simultaneous access to pages.
Exporting the data to GCS first by using the client.extract_table method: We used this approach in other use cases and are aware that fetching data from GCS is way faster. However, as we get acceptable performance for most of our source tables, we would rather avoid this extra step of exporting to GCS.
The approach that you have mentioned should be avoided, considering the size of the data.
One of the following approaches can be applied:
Transform the table data in BigQuery using built-in functions, UDFs, or Remote Functions, and save the transformed data to another table.
Export the transformed table data to Cloud Storage in one or more CSV or JSON files.
Load the CSV / JSON files into the non-GCP system using a compute service.
If the transformation is not feasible in BigQuery, then
Export the raw table data to Cloud Storage in one or more CSV or JSON files.
Load each CSV / JSON file on a compute service, transform the data, and load the transformed data into the non-GCP system.
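As a rough sketch of the export step, using the extract_table method of the google-cloud-bigquery library (the table and bucket names below are made up):

from google.cloud import bigquery

client = bigquery.Client()

# Made-up names for illustration
source_table = "my-project.my_dataset.transformed_table"
destination_uri = "gs://my-bucket/exports/transformed-*.json"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish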
I have a 5 GB DataFrame (with thousands of columns), and every time a user makes an input in my web app, I load the DataFrame, grab a column from it, and return some calculations.
My frontend is a Vue app and the backend is in Flask.
The DataFrame is a large matrix, so it exceeds the column limit of a database (at least the ones I tried).
Where could I store this DataFrame so that I can load it quickly?
Where did the data come from to populate the dataframe?
If it came from a database, for example, you could just grab the data the user needs at that time rather than putting all your data into memory.
On the other hand, if your data didn't come from a database, then that would be my first suggestion. Put the data into a database; these things were built for that purpose. You can create a DataFrame from an SQL query (pandas.read_sql).
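For example, pulling only the column a user needs could look roughly like this (the table, column, and connection string are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")  # made-up connection string
df = pd.read_sql("SELECT some_column FROM my_table", engine)  # fetch only what is needed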
If you really must have 5 GB of data in memory, then perhaps an in-memory database would be suitable. Redis comes to mind for this purpose.
You could either pickle the DataFrame and store the whole thing as an object, or you could break it down into records, store them individually, and pull down only the data the user requests. This would remove the requirement for your application to keep 5 GB of data in memory all the time.
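A rough sketch of the pickled-DataFrame variant with Redis (the key name is arbitrary, and pickling very large frames has its own cost):

import pickle

import pandas as pd
import redis

r = redis.Redis(host="localhost", port=6379)

# Store the whole DataFrame as one pickled object
df = pd.DataFrame({"a": [1, 2, 3]})
r.set("my_dataframe", pickle.dumps(df))

# Later, in another request, pull it back down
df_again = pickle.loads(r.get("my_dataframe"))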
I'm looking for the most effective way to load Google Analytics data, which is represented as JSON files with a nested object structure, into a relational database in parallel, in order to collect and analyze these statistics later.
I've found pandas.io.json.json_normalize, which can flatten nested data into a flat structure; there is also a PySpark solution for converting JSON to a DataFrame, as described here, but I am not sure about its performance.
Can you describe the best way of loading data from the Google Analytics API into an RDBMS?
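For illustration, flattening a nested record with json_normalize (exposed as pandas.json_normalize in recent pandas versions) looks roughly like this; the field names are made up and not the real Google Analytics schema:

import pandas as pd

# Made-up nested structure, not the actual Google Analytics schema
records = [
    {"id": 1,
     "device": {"category": "mobile", "os": "Android"},
     "totals": {"visits": 3, "pageviews": 7}},
]
flat = pd.json_normalize(records, sep="_")
# Columns become: id, device_category, device_os, totals_visits, totals_pageviews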
I think this question can best be answered with more context about what data you want to consume and how you'll be consuming it. For example, if you will be consuming only a few of all the available fields, then it makes sense to store only those; or if you'll be using some specific field as an index, then maybe we can index that field as well.
One thing that comes to mind off the top of my head is the JSON type in Postgres, as it is built in and has several helper functions for operating on the data later on.
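A minimal sketch of that idea with psycopg2, assuming a hypothetical ga_events table with a jsonb column named payload:

import json

import psycopg2

conn = psycopg2.connect("dbname=analytics user=me")  # made-up connection details
cur = conn.cursor()

# Assumed table: CREATE TABLE ga_events (id serial PRIMARY KEY, payload jsonb);
record = {"device": {"category": "mobile"}, "totals": {"visits": 3}}
cur.execute("INSERT INTO ga_events (payload) VALUES (%s)", [json.dumps(record)])
conn.commit()

# Query a nested field later using the JSON operators
cur.execute("SELECT payload -> 'device' ->> 'category' FROM ga_events")
print(cur.fetchall())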
References:
https://www.postgresql.org/docs/9.3/datatype-json.html
https://www.postgresql.org/docs/9.3/functions-json.html
If you can update here with what decision you make, it would be great to know.
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, the current solution is an AWS EC2 & RDS instance (MySQL, I think, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the DB. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation here.
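A rough sketch of that round trip (the table name and connection details are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@host/dbname")  # made-up credentials

# Read a slice of the table into local memory
df = pd.read_sql("SELECT * FROM measurements WHERE batch = 7", engine)

# ... modify df locally; other users can still change the underlying rows ...

# Write the result back; to_sql appends or replaces rows, it does not merge changes
df.to_sql("measurements", engine, if_exists="append", index=False)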
I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go towards the JSON and PostgreSQL solution. I am on a Pandas project that started with pickle files on the filesystem and loaded the data into a class object for the data processing with pandas. However, as the data became large, we played with SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
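As a sketch of what the JSON-and-PostgreSQL route can look like with SQLAlchemy (the model and column names are made up):

import pandas as pd
from sqlalchemy import Column, DateTime, Integer, func
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class UploadedFrame(Base):  # made-up model name
    __tablename__ = "uploaded_frames"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer)
    uploaded_at = Column(DateTime, server_default=func.now())
    data = Column(JSON)  # maps to PostgreSQL's json type

# Storing: serialize the DataFrame to a JSON-compatible dict
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
frame = UploadedFrame(user_id=42, data=df.to_dict(orient="split"))

# Loading: rebuild the DataFrame from the stored JSON
restored = pd.DataFrame(**frame.data)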
Have fun! Pandas rocks!