I am working on a real-time information retrieval system which performs large queries on a local SQL Server database with around 4.5 million rows.
Each query returns, on average, 400,000 rows and is parameterized, as in the following example:
SELECT Id, Features, Edges, Cluster, Objects FROM db.Image
WHERE Cluster = 16
AND Features IS NOT NULL
AND Objects IS NOT NULL
These are the times I am getting with my current approach:
Query time: 4.52361000 seconds
Query size: 394048 rows, 5 columns
While not necessarily unusable, this will only get worse as query sizes grow, so I need a more efficient way to read large numbers of rows into a DataFrame.
Currently, I'm using pyodbc to build a connection to SQL Server and pd.read_sql to read the query results directly into a DataFrame, which is then manipulated. I am looking for ways to improve the query times significantly while still being able to use DataFrame operations after the data is fetched. So far I have tried Dask DataFrames and ConnectorX, as well as failed attempts at parallelizing the queries with multithreading, but to no avail.
How can I significantly reduce the time it takes to read this amount of data, whether through other libraries, multithreading, or even entirely different file formats?
Code Sample
import pandas as pd

conn = connection()  # I have a function that returns a pyodbc connection
cluster = 16
command = '''SELECT Id, Features, Edges, Cluster, Objects FROM Common.Image
WHERE Cluster = {} AND Features IS NOT NULL AND Objects IS NOT NULL'''.format(cluster)
result = pd.read_sql(command, conn)
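For reference, the same query can be run with a bound parameter instead of string formatting; this is a minimal sketch that assumes connection() returns a pyodbc connection:
import pandas as pd

conn = connection()  # the question's helper that returns a pyodbc connection
command = '''SELECT Id, Features, Edges, Cluster, Objects FROM Common.Image
WHERE Cluster = ? AND Features IS NOT NULL AND Objects IS NOT NULL'''
result = pd.read_sql(command, conn, params=[16])  # pyodbc-style "?" placeholder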
EDIT
Following @tadman's comment:
Consider caching this if practical, like once you've fetched the data you could save it in a more compact form (Google Protobuf, Parquet, etc.) Reading in that way can be considerably faster, as you're usually just IO bound, not server/CPU bound.
I looked at Parquet caching and landed on a considerably faster way to fetch my data:
Created a compressed Parquet file for each of my data clusters (1 to 21).
Using pyarrow, read the necessary cluster file with
df_pq = pq.read_table("\\cluster16.parquet")
Convert the parquet file to a pandas DataFrame with df = df_pq.to_pandas()
Proceed as usual
With this method, I reduced the total time to 1.12400 seconds.
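For context, a minimal sketch of that caching flow; fetch_cluster() is a hypothetical helper standing in for the SQL query above, and the cache directory and compression codec are assumptions:
import pyarrow.parquet as pq

# one-off step: write each cluster to its own compressed Parquet file
for cluster in range(1, 22):
    df_cluster = fetch_cluster(cluster)  # hypothetical helper that runs the SQL query for one cluster
    df_cluster.to_parquet("cache/cluster{}.parquet".format(cluster), compression="snappy")

# at query time: read only the cluster that is needed and convert to pandas
df = pq.read_table("cache/cluster16.parquet").to_pandas()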
PySpark DataFrames can outperform pandas DataFrames on large datasets because the work is distributed across cores (or a cluster) and evaluated lazily, and they are not limited to a single machine's memory.
If you already have the dataframe in pandas, you can convert it like this:
spark_df = spark.createDataFrame(df)
modified_df = spark_df.filter("query here").collect()
You can convert back to pandas with toPandas() if you need to after the main SQL-style querying.
link:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html
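A minimal, hedged sketch of the round trip described above (the filter expression and column values are placeholders):
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("filter-example").getOrCreate()

pdf = pd.DataFrame({"Cluster": [15, 16, 16], "Id": [1, 2, 3]})  # stand-in pandas data
spark_df = spark.createDataFrame(pdf)

filtered = spark_df.filter("Cluster = 16")  # lazy, SQL-style predicate
back_to_pandas = filtered.toPandas()        # convert back to pandas for further work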
Related
I have the following problem:
The downside:
I have a large amount of data that does not fit into a pandas DataFrame in memory.
The upside:
The data is mostly independent. The only restriction is that all elements with the same id have to be processed together in one chunk.
What I tried:
Dask looked like a perfect fit for this problem.
I have a Dask Kubernetes cluster with multiple workers and no problem loading the data from the SQL database into a Dask DataFrame.
The calculation itself is not that easy to implement in Dask because some functions are missing or problematic in Dask (e.g. MultiIndex, pivot). However, because the data is mostly independent, I tried to do the calculations chunk-wise in pandas. When I call set_index("id") on the Dask DataFrame, all rows with the same id end up in the same partition. I therefore wanted to iterate over the partitions, do the calculations on a (temporary) pandas DataFrame, and store the result right away.
The code basically looks like this:
from dask import dataframe as dd
from distributed import Client

client = Client("kubernetes address")

# Loading:
futures = []
for i in range(x):
    df_chunk = load_from_sql(i)       # pseudocode: fetch one chunk from the database
    future = client.scatter(df_chunk)
    futures.append(future)

dask_df = dd.from_delayed(futures, meta=df_chunk)
dask_df = dask_df.set_index("id")
dask_df.map_partitions(lambda part: calculation(part)).compute()

where

def calculation(part):
    part = ...                        # do something in pandas
    part.to_csv(...)                  # or to_sql: store the result somewhere
    del part                          # or client.cancel(part): release the memory of the temporary pandas df
With small amounts of data this runs smoothly, but with a lot of data the workers' memory fills up and the process stops with a CancelledError.
What am I doing wrong, or are there alternatives in Dask for working memory-efficiently with chunk-wise data from a Dask DataFrame?
Loading the data directly in chunks from the database is currently not an option.
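For illustration only, one memory-bounded pattern (not the question's code) is to walk the partitions in small batches via to_delayed(), so only a few partition computations are in flight at once; the batch size and the body of calculation() are placeholders:
import dask
from dask import delayed

# dask_df is assumed to already exist and be indexed by "id"
tasks = [delayed(calculation)(part) for part in dask_df.to_delayed()]

batch = 4  # hypothetical batch size; tune to worker memory
for i in range(0, len(tasks), batch):
    dask.compute(*tasks[i:i + batch])  # calculation() writes its own output and returns nothing large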
I have a 7 GB PostgreSQL table which I want to read into Python and do some analysis on. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into a Dask DataFrame first, perform some aggregation, and switch back to Pandas for subsequent analysis. I used the lines of code below for that.
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (like an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, more so when I am working on a single machine and not a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for performing analysis on large tables that do not fit in memory using Python?
If your data fits into the RAM of your machine then you're better off using Pandas; in that case Dask adds scheduling overhead and will not outperform it.
Alternatively, you can play around with the chunk size (the bytes_per_chunk or npartitions arguments of read_sql_table) and see if things improve. The best way to figure this out is to look at the Dask diagnostics dashboard and see what is taking Dask so long. That will help you make a much more informed decision.
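A hedged sketch of both suggestions, assuming the (older) Dask version used in the question, whose read_sql_table takes a uri keyword; the chunk size is illustrative:
import dask.dataframe as dd
from dask.distributed import Client

client = Client()                 # local cluster; the diagnostics dashboard is at client.dashboard_link
print(client.dashboard_link)

df = dd.read_sql_table(
    'table_xyz',
    uri="postgresql+psycopg2://user:pwd@remotehost/dbname",
    index_col='column_xyz',
    schema='private',
    bytes_per_chunk="64 MiB",     # smaller chunks -> more, lighter partitions
)
avg = df.groupby("col1").col2.mean().compute()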
I have existing code that reads streaming data and stores it in a pandas DataFrame (new data comes in every 5 minutes); I then split this data by category (~350 categories).
Next, I write all the new data (as it has to be stored incrementally) using to_csv in a loop.
The Pseudocode is given below:
for row in parentdf.itertuples():  # insert into <tbl>
    mycat = row.category  # this is the ONLY parameter passed to the key function below
    try:
        df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)
        df.insert(0, 'NewCol', sym)
        df = df.assign(calculatedCol=functions1(params))
        df = df.assign(calculatedCol1=functions2(params, 20))
        df = df.assign(calculatedCol3=functions3(more_params, 20))
        df[20:].to_csv(outfile, mode='a', header=False, index=False)
    except Exception:
        pass  # exception handling elided in this pseudocode
The category-wise reading and storing to CSV takes about 2 minutes per cycle, which is close to 0.34 seconds for each of the 350 categories written incrementally.
I am wondering whether I can make the above process faster and more efficient by using Dask DataFrames.
I looked at dask.org, including the use cases, and didn't find a clear answer.
Additional details: I am using Python 3.7 and Pandas 0.25.
Further, the code above doesn't raise any errors; we have already added a good amount of exception handling around it.
My key function, i.e. FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.
Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:
how much data is ingested every 5 minutes (10MB or 10TB)?
where is the code being executed (AWS Lambda or a big cluster of machines)?
what data operations does FnforExtractingNParsingData perform?
Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
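As a minimal illustration of parallel CSV writes (the partition count and output pattern are placeholders, and df stands for whatever pandas DataFrame you end up with each cycle):
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)       # split the pandas DataFrame into partitions
ddf.to_csv("output/part-*.csv", index=False)  # writes one CSV per partition in parallel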
Setup: I have a pre-processed dataset on an MS SQL Server that is about 500,000,000 rows and 20 columns, where one is a rather long text column (varchar(1300)), which amounts to about 35 GB of data space in the SQL database. I'm working on the physical machine where the MS SQL Server is running, so no network traffic is needed, and it has 128 GB of RAM. MS SQL Server is set to take 40 GB of RAM at maximum. I want to import the dataset into Python for further processing. Assume some deep learning experimentation; this is important because I need to be able to transfer the text column as is.
Anecdote:
For testing the import code, I used a small subsample of the dataset of about 700,000 rows. This takes about 1 minute to run, Python goes up to 700 MB of RAM usage, and saving the variable to the filesystem after importing results in a file of about 250 MB. By extrapolation, importing the full dataset should take about 700 minutes and result in a 175 GB file. That is quite a lot, especially compared to, say, copying the full 31 GB table within SQL, which takes a few minutes at most. I let it run for a day to see what happens, to no avail.
Alternatives: I tried using pyodbc directly instead of pandas and sqlalchemy, which led me to believe that the problem lies in how pyodbc deals with data import: it stores the queried data in a rows object, which I only managed to read row-wise in a loop, and that seems very inefficient to me. I don't know if pandas and sqlalchemy manage to do it differently. I also tried not importing the full dataset with a single SELECT statement but splitting it up into lots of smaller ones, which resulted in the small test dataset taking 30 minutes instead of 1 minute to load.
Question: How do I load this large (but not-so-large, compared to other databases) dataset into Python at all? And there has to be a way to do this efficiently: it should not take significantly longer than copying the full table within SQL, and it should not take significantly more space than the table occupies in the SQL database. I do not understand why the data size blows up so much during the process.
The solution should not require extracting the table to any medium other than Python first (i.e. no .csv files or the like), though the use of any other Python packages is fine.
import pyodbc
import pandas as pd
import sqlalchemy

def load_data():
    query = "select * from data.table"
    engine = sqlalchemy.create_engine(
        'mssql+pyodbc://server/database?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes'
    )
    dat = pd.read_sql(query, engine)
    dat = dat.sort_values(['id', 'date'])
    return dat
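The splitting idea mentioned above can also be done as a single streamed query rather than many separate SELECTs; a hedged sketch, where the chunk size and output path are illustrative, spills each chunk to a Parquet file instead of holding everything in one DataFrame:
import pandas as pd
import sqlalchemy
import pyarrow as pa
import pyarrow.parquet as pq

engine = sqlalchemy.create_engine(
    'mssql+pyodbc://server/database?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes'
)

writer = None
for chunk in pd.read_sql("select * from data.table", engine, chunksize=100_000):
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        writer = pq.ParquetWriter("data_table.parquet", table.schema)  # columnar, compressed on disk
    writer.write_table(table)
if writer is not None:
    writer.close()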
I’m currently using a Postgres database to store survey answers.
The problem I’m facing is that I need to generate a pivot table from the Postgres database.
When the dataset is small, it’s easy to just read the whole dataset and use Pandas to produce the pivot table.
However, my current database now has around 500k rows, and it’s increasing by around 1000 rows per day. Reading the whole dataset is not efficient anymore.
My question is: do I need to use HDFS to store the data on disk and supply it to Pandas for pivoting?
My customers need to view the pivot table output in near real time. Is there any way to solve this?
My theory is that I’ll create the pivot table output of the 500k rows once and store the output somewhere; then, when new data gets saved into the database, I’ll only need to merge the new data into the existing pivot table. I’m not quite sure whether Pandas supports working this way, or whether it needs the full dataset to do the pivoting.
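To make the incremental idea concrete, a minimal sketch that assumes the pivot is an additive aggregation such as a count or sum; the column names and cache file are hypothetical:
import pandas as pd

agg = pd.read_parquet("pivot_cache.parquet")  # previously computed pivot, cached on disk

def merge_new_rows(agg, new_rows):
    new_agg = new_rows.pivot_table(index="question", columns="answer",
                                   values="respondent_id", aggfunc="count", fill_value=0)
    return agg.add(new_agg, fill_value=0)     # additive aggregates can simply be summed

# new_rows = the ~1000 rows saved since the last refresh
# agg = merge_new_rows(agg, new_rows)
# agg.to_parquet("pivot_cache.parquet")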
Have you tried using pickle? I'm a data scientist and use it all the time with datasets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following.
import pickle
save_data = open('path/file.pickle', 'wb') #wb stands for write bytes
pickle.dump(pd_data, save_data)
save_data.close()
In the above code what you're doing is saving your data in a compact format that can quickly be loaded using:
pickle_data = open('path/file.pickle', 'rb') #rb stands for read bytes
pd_data = pickle.load(pickle_data)
pickle_data.close()
At that point you can append the new 1,000 rows to your data (pd_data) and save it again using pickle. If your data will continue to grow and you expect memory to become a problem, I suggest finding a way to append or concatenate the data rather than merge or join, since the latter two can also result in memory issues.
You will find that this cuts out significant load time when reading from disk (I use Dropbox and it's still lightning fast). What I usually do to reduce that even further is segment my datasets into groups of rows and columns, then write methods that load the pickled data as needed (super useful for graphing).
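A short sketch of that append-then-repickle cycle, assuming new_rows holds the freshly arrived data; paths are placeholders:
import pickle
import pandas as pd

with open('path/file.pickle', 'rb') as f:
    pd_data = pickle.load(f)

pd_data = pd.concat([pd_data, new_rows], ignore_index=True)  # append rather than merge/join

with open('path/file.pickle', 'wb') as f:
    pickle.dump(pd_data, f)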