Here is what I tried first:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = dd.from_pandas(pd.DataFrame(dict(x=np.random.normal(size=100),
                                      y=np.random.normal(size=100))), chunksize=40)
cat = df.map_partitions(lambda d: np.digitize(d['x'] + d['y'], [.3, .9]),
                        meta=pd.Series([], dtype=int, name='x'))
cat.to_hdf('/tmp/cat.h5', '/cat')
This fails with "cannot properly create the storer...".
I next tried to save cat.values instead:
import dask.array as da
da.to_hdf5('/tmp/cat.h5', '/cat', cat.values)
This fails with "cannot convert float NaN to integer", which I am guessing is because cat.values has NaN for its shape and chunk sizes.
How do I get both of these to work? Note the actual data would not fit in memory.
This works fine:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(dict(x=np.random.normal(size=100),
                       y=np.random.normal(size=100)))
ddf = dd.from_pandas(df, chunksize=40)
cat = ddf.map_partitions(lambda d: pd.Series(np.digitize(d['x'] + d['y'], [.3, .9])),
                         meta=('x', int))
cat.to_hdf('cat.h5', '/cat')
You were missing the pd.Series wrapper around the call to np.digitize, which meant the output of map_partitions was a NumPy array rather than a pandas Series, hence the error. In the future, when debugging, it can be useful to compute a bit of data from steps along the way to see where the error is (for example, I found this issue by running .head() on cat).
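As for the second failure in the question (saving cat.values with da.to_hdf5): below is only a sketch of one possible route, assuming a Dask version where Series.to_dask_array(lengths=True) is available and using an illustrative output filename; it materializes the unknown chunk lengths first, which is exactly what the NaN shape and chunk sizes were about.
import dask.array as da

# Using the fixed cat from the snippet above; lengths=True walks the
# partitions once so the resulting dask array has concrete (non-NaN) chunks.
arr = cat.to_dask_array(lengths=True)
da.to_hdf5('cat_values.h5', '/cat', arr)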
Related
I am noticing something a little strange in pandas (1.4.3). Is this the expected behaviour, the result of an optimization, or a bug? Basically, I'd like to guarantee that the dtype does not change unexpectedly, or at least see an error raised when it does, so any tips are welcome.
If you assign to all values of a series (column) in a DataFrame this way, the dtype is altered:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df1.iloc[:, df1.columns.get_loc("a")] = 0
>>> df1["a"].dtype
dtype('int64')
and if you index the rows in a different way pandas does not convert the dtype
>>> df2 = pd.DataFrame({"a": np.array([1,2,3], dtype="int32")})
>>> df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0
>>> df2["a"].dtype
dtype('int32')
Not really an answer, but some thoughts that might help you in your quest. My guess as to what is happening is this: of the multiple choices in your question, I am picking option A, optimization.
I think when pandas sees df1.iloc[:, df1.columns.get_loc("a")] = 0 it treats it as a full replacement of the column across all rows. No slicing, even though df1.iloc[: ...] is involved; [:] gets translated into an all-rows, not-a-slice mode. When it sees = 0 it sees that (via broadcast) as a full column of int64, and since it is a full replacement, the new column takes the dtype of the assigned values.
But when it sees df2.iloc[0:len(df2.index), df2.columns.get_loc("a")] = 0 it goes into index-slice mode. Even though it is a full-column index slice, it doesn't know that and makes an early decision to go into index-slice mode. Index-slice mode then operates on the assumption that only part of the column is going to be updated, not replaced. In that update mode the column is assumed to be partially modified in place, so it retains its existing dtype.
I got the above hypothesis from looking around at this: https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/base.py
If I didn't have a day job I might have the time to actually find the smoking gun in those 6242 lines of code.
If you look at this code (I rewrote your code a little differently to see what is happening in the middle):
import pandas as pd
import numpy as np

dfx = pd.DataFrame({"x": np.array([4, 5, 6], dtype="int32")})
P = dfx.iloc[:, dfx.columns.get_loc("x")] = 0   # chained assignment: P is also 0
P1 = dfx.iloc[:, dfx.columns.get_loc("x")]
print(P1)                 # the column is silently changed to int64 (while keeping
                          # the value 0), int64 being pandas' default integer dtype
print(P)
print(dfx["x"].dtype)     # int64

dfy = pd.DataFrame({"y": np.array([4, 5, 6], dtype="int32")})
Q = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")] = 0
print(Q)
Q1 = dfy.iloc[0:len(dfy.index), dfy.columns.get_loc("y")]
print(Q1)
print(dfy["y"].dtype)     # stays int32

print(len(dfx.index))
print(len(dfy.index))
Don't know why this is happening, but adding square brackets seems to solve the issue:
df1.iloc[:, [df1.columns.get_loc("a")]] = 0
Another solution seems to be:
df1.iloc[range(len(df1.index)), df1.columns.get_loc("a")] = 0
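For what it's worth, here is a quick check of both workarounds (just a sketch against the pandas 1.4.x behaviour described above; the expected outputs are taken from the answers, not re-verified here):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df1.iloc[:, [df1.columns.get_loc("a")]] = 0             # list of column positions
print(df1["a"].dtype)                                   # expected: int32

df2 = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
df2.iloc[range(len(df2.index)), df2.columns.get_loc("a")] = 0
print(df2["a"].dtype)                                   # expected: int32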
I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5 GB+ in total.
Total CSV size: 3.5 GB+
Total RAM size: 16 GB
Library used: Dask
Shape of combined df: 6 million rows and 57 columns
I have a method that just eliminates unwanted characters from essential columns, like:
import re
import pandas as pd

def stripper(x):
    try:
        # intended to skip floats/NA; non-string values fall through to the except anyway
        if type(x) != float or type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x
And I am applying the above method to certain columns like so:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)
And I am also filling null values of one column with the values from another column:
df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])
These are the two operations I need to perform, and after these I am just doing .head() to get some values (as Dask works with lazy evaluation):
temp_df = df.head(10000)
But when I do this, it keeps eating RAM until my total 16 GB is exhausted and the kernel dies.
How can I solve this issue? Any help would be appreciated.
I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)
To expand on @richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)

ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()
It would also be good to know how many partitions you've split the Dask DataFrame into-- each partition should fit comfortably in memory (i.e. < 1GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
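As a rough sketch of that last point (the glob path, blocksize, and the 100MB target below are illustrative, not from the original post), partition sizes can be controlled either when reading the CSVs or afterwards:
import dask.dataframe as dd

# read all six CSVs at once; blocksize sets roughly how large each partition is
df = dd.read_csv("data/*.csv", blocksize="100MB", dtype=str)
print(df.npartitions)

# or rebalance an already-built Dask DataFrame
df = df.repartition(partition_size="100MB")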
I am very new to parallelizing my Python code. I am trying to perform some analysis on an xarray, then fill in a pandas DataFrame with the results. The columns of the DataFrame are independent, so I think it should be trivial to parallelise using dask delayed, but I can't work out how. My xarrays are quite big, so this loop takes a while and uses a lot of memory. It could also be chunked by time instead, if that's easier (this might help with memory)!
Here is the un-parallelized version:
from time import sleep
import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd

data1 = np.random.rand(4, 3, 3)
data2 = np.random.randint(4, size=(3, 3))
locs1 = ["IA", "IL", "IN"]
locs2 = ['a', 'b', 'c']
times = pd.date_range("2000-01-01", periods=4)
xarray1 = xr.DataArray(data1, coords=[times, locs1, locs2], dims=["time", "space1", "space2"])
xarray2 = xr.DataArray(data2, coords=[locs1, locs2], dims=["space1", "space2"])

def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

final_df = pd.DataFrame(columns=range(4), index=times)
for column in final_df:
    final_df[column] = delayed_where(xarray1, xarray2, column)
I would like to parallelize the for loop, and have tried:
final_df_delayed=pd.DataFrame(columns=range(4),index=times)
for column in final_df:
final_df_delayed[column]=delayed(delayed_where)(xarray1,xarray2,column)
final_df.compute()
Or maybe something with dask delayed?
final_df_dd=dd.from_pandas(final_df, npartitions=2)
for column in final_df:
final_df_dd[column]=delayed(delayed_where)(xarray1,xarray2,column)
final_df_dd.compute()
But none of these work. Can anyone help?
You're using delayed correctly, but it's not possible to construct a dask dataframe in the way you specified.
from dask import delayed
import dask

@delayed
def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

@delayed
def form_df(list_col_results):
    final_df = pd.DataFrame(columns=range(4), index=times)
    for n, column in enumerate(final_df):
        final_df[column] = list_col_results[n]
    return final_df

delayed_cols = [delayed_where(xarray1, xarray2, col) for col in final_df.columns]
delayed_df = form_df(delayed_cols)
delayed_df.compute()
Note that the enumeration is a clumsy way to get the correct order of the columns, but your actual problem might guide you to a better way of specifying this (e.g. by explicitly specifying each column as an individual argument).
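For example, one way to sidestep the positional bookkeeping (just a sketch, reusing the decorated delayed_where plus xarray1, xarray2, and times from above) is to key the delayed results by column id, compute them as a batch, and assemble the frame from the resulting dict:
import dask
import pandas as pd

delayed_cols = {col: delayed_where(xarray1, xarray2, col) for col in range(4)}
(computed,) = dask.compute(delayed_cols)   # dask traverses the dict and computes each value
final_df = pd.DataFrame({col: res[col] for col, res in computed.items()})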
I have a very large dataframe that I am resampling a large number of times, so I'd like to use dask to speed up the process. However, I'm running into challenges with the groupby apply. An example data frame would be
import numpy as np
import pandas as pd
import random

test_df = pd.DataFrame({'sample_id': np.array(['a', 'b', 'c', 'd']).repeat(100),
                        'param1': random.sample(range(1, 1000), 400)})
test_df.set_index('sample_id', inplace=True)
which I can normally groupby and resample using
N = 5;i=1
test = test_df\
.groupby(['sample_id'])\
.apply(pd.DataFrame.sample, n=N, replace=False)\
.reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
Which I wrap into a method that iterates over an N gradient i times. The actual dataframe is very large with a number of columns, and before anyone suggests, this method is a little bit faster than an np.random.choice approach on the index-- it's all in the groupby. I've run the overall procedure through a multiprocessing method, but I wanted to see if I could get a bit more speed out of a dask version of the same. The problem is the documentation suggests that if you index and partition then you get complete groups per partition-- which is not proving true.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df, npartitions=8)
df1=df1.persist()
df1.divisions
creates
('a', 'b', 'c', 'd', 'd')
which unsurprisingly results in a failure
N = 5;i=1
test = df1\
.groupby(['sample_id'])\
.apply(pd.DataFrame.sample, n=N, replace=False)\
.reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
ValueError: Metadata inference failed in groupby.apply(sample).
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
ValueError("Cannot take a larger sample than population when 'replace=False'")
I have dug all around the documentation on keywords, dask dataframes & partitions, and groupby aggregations, and am simply missing the solution if it's there in the documents. Any advice on how to create a smarter set of partitions and/or get the groupby with sample playing nicely with dask would be deeply appreciated.
It's not quite clear to me what you are trying to achieve and why you need to add replace=False (which is the default), but the following code works for me. I just needed to add meta.
import dask.dataframe as dd

df1 = dd.from_pandas(test_df.reset_index(), npartitions=8)

N = 5
i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(lambda x: x.sample(n=N),
           meta={"sample_id": "object",
                 "param1": "f8"})\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
If you then want to drop sample_id you just need to add:
test = test.drop("sample_id", axis=1)
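And since the question mentions iterating over an N gradient i times, here is one way that loop might be expressed lazily (just a sketch building on the snippet above; the bootstrap count of 10 is arbitrary), so that everything is computed in a single pass at the end:
import dask.dataframe as dd

def bootstrap(ddf, N, n_boot):
    samples = []
    for b in range(n_boot):
        s = ddf.groupby(['sample_id'])\
               .apply(lambda x: x.sample(n=N),
                      meta={"sample_id": "object", "param1": "f8"})\
               .reset_index(drop=True)
        s['bootstrap'] = b
        s['resample'] = N
        samples.append(s)
    return dd.concat(samples)

result = bootstrap(df1, N=5, n_boot=10).compute()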
After reading a file with pandas using these lines:
import pandas as pd
import numpy as np
df = pd.read_csv('PN_lateral_n_eff.txt', header=None)
df.columns = ["effective_index"]
here is my output:
effective_index
0 2.568393573877396+1.139080496494329e-006i
1 2.568398351899841+1.129979376397734e-006i
2 2.568401556986464+1.123872317134941e-006i
After that, I cannot use NumPy to convert it into a real number, because the pandas dtype is object. I tried this:
np.real(df, dtype = float)
TypeError: real() got an unexpected keyword argument 'dtype'
Any way to do that?
Looks like astype(complex) works with Numpy arrays of strings, but not with Pandas Series of objects:
cmplx = (df['effective_index']
         .str.replace('i', 'j')   # Go engineering
         .values                  # Go NumPy
         .astype('str')           # Go string
         .astype(complex))        # Go complex (np.complex is removed in newer NumPy)
# array([2.56839357+1.13908050e-06j, 2.56839835+1.12997938e-06j,
#        2.56840156+1.12387232e-06j])
df['effective_index'] = cmplx     # Go Pandas again
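And if the end goal is the real part as a plain float (my reading of the question, not part of the original answer; the new column names are just for illustration), it can be taken straight from the complex array:
df['n_real'] = cmplx.real    # real part, float64
df['n_imag'] = cmplx.imag    # imaginary part, if needed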