I am using the influxdb Python library to get a list of series in my database. There are about 20k series. I am then trying to build a Pandas DataFrame out of those series, snapping the timestamps to a 15-second frequency (I'd like to get rid of the nanoseconds; I also wonder why InfluxDB's Python library's documented get_list_series() call is missing in all versions I've tried, but those are other questions...). I'd like to end up with one big DataFrame.
Here's the code:
from influxdb import DataFrameClient
import pandas as pd

# ... get series list ...

temp_df = pd.DataFrame()
for series in series_list:
    df = dfclient.query(
        'select time,temp from {} where "series" = \'{}\''.format(location, series)
    )[location].asfreq('15S')
    df.columns = [series]
    if temp_df.empty:
        temp_df = df
    else:
        temp_df = temp_df.join(df, how='outer')
This starts out fine, but after a few hundred series it slows down quickly, grinding nearly to a halt. I am sure I'm not using Pandas the right way, and I'm hoping you can tell me how to do it properly.
For what it's worth, I'm running this on relatively powerful hardware (which is why I believe I'm doing this the wrong way).
One more thing: the time index of each series I pull from InfluxDB may differ from all the others, which is why I'm using join. I'd like to end up with a DataFrame with a column for each series and the datetimes appropriately sorted in the index; join does that.
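A minimal sketch of the usual alternative to repeated joins, assuming the same dfclient, location, and series_list as above: collect each per-series frame in a list and concatenate once, which outer-joins all the time indexes in a single pass.

import pandas as pd

frames = []
for series in series_list:
    df = dfclient.query(
        'select time,temp from {} where "series" = \'{}\''.format(location, series)
    )[location].asfreq('15S')
    df.columns = [series]
    frames.append(df)

# One concat outer-joins every time index at once instead of
# re-joining the ever-growing frame thousands of times.
temp_df = pd.concat(frames, axis=1).sort_index()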
Related
I've been trying to adapt my code to use Dask so the processing can be spread across multiple machines. While the initial data load is not time-consuming, the subsequent processing takes roughly 12 hours on an 8-core i5, which isn't ideal, so I figured Dask would help distribute the work. The following code works fine with the standard Pandas approach:
import pandas as pd

artists = pd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")

artists["name"] = artists["name"].astype("str")
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip()
)
Converting to Dask seemed straightforward, but I'm hitting hiccups along the way. The following Dask-adapted code throws a ValueError: cannot reindex from a duplicate axis error:
import dask.dataframe as dd
from dask.distributed import Client

artists = dd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")

artists["name"] = artists["name"].astype(str).compute()
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip()
    .compute()
)

if __name__ == '__main__':
    client = Client()
The best I can discern is that Dask won't allow reassignment to an existing Dask DataFrame. So this works:
...
artists_new = artists["name"].astype("str").compute()
...
However, I really don't want to create a new DataFrame variable each time; I'd rather keep reassigning back into the existing DataFrame, mainly because I have multiple data-cleaning steps before processing.
While the tutorial and guides are useful, they are pretty basic and don't cover such use cases.
What are the preferred approaches here with Dask DataFrames?
Every time you call .compute() on a Dask DataFrame or Series, it is converted into its pandas equivalent. So what is happening in this line
artists["name"] = artists["name"].astype(str).compute()
is that you are computing the string column and then assigning a pandas Series back to a Dask Series (without ensuring alignment of partitions). The solution is to call .compute() only on the final result, while intermediate steps can use regular pandas syntax:
# modified example (.compute is removed)
artists["name"] = artists["name"].astype(str).str.lower()
I have a large dataset with multiple date columns that I need to clean up, mostly by removing the timestamp since it is always 00:00:00. I want to write a function that collects all columns of datetime type and formats them all at once, instead of having to tackle each one individually.
I figured it out. This is what I came up with and it works for me:
def tidy_dates(df):
    # pick up every timezone-aware datetime column and format it as a date string
    for col in df.select_dtypes(include="datetime64[ns, UTC]"):
        df[col] = df[col].dt.strftime("%Y-%m-%d")
    return df
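For illustration, a small usage sketch (the frame and column names here are made up):

import pandas as pd

# hypothetical frame: one tz-aware datetime column, one value column
df = pd.DataFrame({
    "created": pd.to_datetime(["2021-01-01 00:00:00", "2021-02-03 00:00:00"], utc=True),
    "value": [1, 2],
})

df = tidy_dates(df)
print(df["created"])  # now plain "YYYY-MM-DD" strings (dtype object)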
I've got a little problem. I have two Dask dataframes with the following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the value from DF1.csv into DF2.csv at time t (DATE) and column (EVENTNAME). I am using Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't use direct assignment of values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)  # results in an error
I also tried a merge, but the resulting Dask dataframe had no partitions, which results in a MemoryError because all the data gets loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
I also tried something like this:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
# results in: ValueError: Array conditional must be same shape as self
Any ideas?
For everyone who is interested, you can use:
#DF1
df.pivot(index='date', columns='event', values='value')  # creates the DF2 layout memory-efficiently
See also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
My earlier approach took a huge amount of time, was horribly memory hungry, and didn't produce the results I was looking for. Just use Pandas pivot if you want to reshape your dataframe.
Edit:// And there is no reason to use Dask anymore, which speeds up the whole process even further ;)
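A minimal pandas sketch of that pivot, assuming the pipe-delimited layouts and file names from the question:

import pandas as pd

# DF1 layout: one row per (DATE, EVENTNAME) pair
df1 = pd.read_csv("DF1.csv", sep="|", parse_dates=["DATE"])

# Pivot to the DF2 layout: one column per event name, indexed by DATE.
# pivot raises if a (DATE, EVENTNAME) pair occurs twice; use pivot_table
# with an aggregation function in that case.
df2 = df1.pivot(index="DATE", columns="EVENTNAME", values="VALUE")

df2.to_csv("DF2.csv", sep="|")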
I have searched all over the internet and somehow none of the approaches seem to work in my case.
I have two large csv files (each with a million+ rows and about 300-400MB in size). They are loading fine into data frames using the read_csv function without having to use the chunksize parameter.
I even performed certain minor operations on this data like new column generation, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain. The operation takes forever.
Mine is a Windows 7 PC with 8 GB of RAM, and the Python version is 2.7.
Thank you.
Edit: I tried chunking methods too. When I do this, I don't get MemoryError, but the RAM usage explodes and my system crashes.
When you merge data using pandas.merge, it needs memory for df1, df2, and the merged result all at once, which I believe is why you get a memory error. You should read df2 back in with the chunksize option and merge the data chunk by chunk.
It might not be the best way, but you can try this.
For a large data set you can use the chunksize option in pandas.read_csv:
df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")
df2_key = df2.Colname2
# creating a empty bucket to save result
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv",index_label=False)
# save data which only appear in df1 # sorry I was doing left join here. no need to run below two line.
# df_result = df1[df1.Colname1.isin(df2.Colname2)!=True]
# df_result.to_csv("df3.csv",index_label=False, mode="a")
# deleting df2 to save memory
del(df2)
def preprocess(x):
df2=pd.merge(df1,x, left_on = "Colname1", right_on = "Colname2")
df2.to_csv("df3.csv",mode="a",header=False,index=False)
reader = pd.read_csv("yourdata2.csv", chunksize=1000) # chunksize depends with you colsize
[preprocess(r) for r in reader]
This will save the merged data as df3.csv.
The reason you might be getting MemoryError: Unable to allocate... could be duplicates or blanks in your dataframe. Check the column you are joining on (when using merge) and see if it has duplicates or blanks. If so, get rid of them with this command:
df.drop_duplicates(subset='column_name', keep=False, inplace=True)
Then re-run your python/pandas code. This worked for me.
In general, the chunked version suggested by @T_cat works great.
However, the memory explosion might be caused by joining on columns that have NaN values.
So you may want to exclude those rows from the join, as sketched below.
see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
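A minimal sketch of that idea, with tiny placeholder frames standing in for the two large CSVs:

import pandas as pd

# tiny placeholder frames; in practice these are your two large CSVs
df1 = pd.DataFrame({"key": [1, 2, None], "a": [10, 20, 30]})
df2 = pd.DataFrame({"key": [1, None, 3], "b": ["x", "y", "z"]})

# drop rows whose join key is missing before merging
merged = df1.dropna(subset=["key"]).merge(
    df2.dropna(subset=["key"]), on="key", how="inner"
)
print(merged)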
Maybe the left data frame has NaN values in the merge column, which causes the final merged dataframe to bloat.
If that is acceptable, fill the merge column in the left data frame with zeros:
df['left_column'] = df['left_column'].fillna(0)
Then do the merge. See what you get.
I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use Pandas. I want to group by two columns, "A" and "B", and whenever column "C" has a value, I want to repeat that value down the column for the rest of that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to Pandas.
Thank you.
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
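For what it's worth, a rough, untested-at-scale sketch of one way around the missing GroupBy.fillna: shuffle so that all rows with the same 'A' value land in one partition, then do the grouped forward-fill with plain pandas inside each partition (the float dtype in meta is an assumption about column 'C'):

import dask.dataframe as dd

# df is the Dask DataFrame with columns A, B, C from the question.
# set_index shuffles the data so every value of A lives in exactly one
# partition, which makes the per-partition groupby safe.
df = df.set_index('A')

df['C'] = df.map_partitions(
    # inside each partition we are back in pandas: group by the index
    # level A plus column B and forward-fill C within each group
    lambda pdf: pdf.groupby(['A', 'B'])['C'].ffill(),
    meta=('C', 'f8'),  # assumes C is a float column
)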