How do I improve the performance in parallel computing with dask - python

I have a pandas dataframe that I converted to a dask dataframe:
df.shape = (60893, 2)
df2.shape = (7254909, 2)
df['name_clean'] = df['Name'].apply(lambda x: re.sub(r'\W+', '', x).lower(), meta=('x', 'str'))
names = df['name_clean'].drop_duplicates().values.compute()
df2['found'] = df2['name_clean2'].apply(lambda x: any(name in x for name in names), meta=('x', 'str'))  ~ takes 834 ms
df2.head(10)  ~ takes 3 min 54 sec
How can I see the shape of a dask dataframe?
Why does .head() take so much time? Am I doing it the right way?

You cannot iterate over a dask.dataframe or dask.array. You need to call the .compute() method to turn it into a pandas DataFrame/Series or NumPy array first.
Note that just calling the .compute() method and then discarding the result doesn't do anything; you need to save the result to a variable:
dask_series = df.Name.apply(lambda x: re.sub(r'\W+', '', x).lower(),
                            meta=('x', 'str'))
pandas_series = dask_series.compute()
for name in pandas_series:
    ...
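On the side question about shape: in Dask the column count is known up front, but the row count is lazy and has to be computed. A minimal sketch with a toy frame (the column and values here are illustrative, not the original df):
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'Name': ['Alice', 'Bob', 'Chuck']}), npartitions=2)
print(ddf.shape)               # (lazy row-count scalar, 1)
print(ddf.shape[0].compute())  # 3, computes the row count
print(len(ddf))                # 3, also triggers a computation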

Related

How to add multiple columns to a dataframe based on calculations

I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to have the same initial value (col 1 all NaN, col 2 all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis=1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis=1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis=1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis=1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # TZ is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
Update 1
Following on the information from MoRe, I was able to get the essentials working. I needed to augment it by adding the column names, and then specify the index in the merge.
data = pd.Series(df.apply(lambda x: [
    add_start_time(x['started_at']),
    add_is_dark(x['started_at']),
    yrmo(x['year'], x['month']),
    calc_duration_in_minutes(x['started_at'], x['ended_at']),
    add_start_cat(x['started_at'])
], axis=1))
new_df = pd.DataFrame(data.tolist(),
                      data.index,
                      columns=['start_time', 'is_dark', 'yrmo',
                               'duration', 'start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]), function2(x[column_name]), function3(x[column_name])], axis=1))
pd.DataFrame(data.tolist(), data.index)
If I understood you correctly, this is your answer. But before anything else, please consider the swifter package :)
First create a series of lists and then convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallelism to improve speed on large datasets... on small ones it isn't helpful and can even be slower.
https://pypi.org/project/swifter/
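For reference, plain pandas can also expand the returned list straight into named columns with result_type='expand' in a single apply; a minimal sketch, assuming df and the asker's helper functions (add_start_time, add_is_dark, calc_dur) are already defined:
new_cols = df.apply(
    lambda row: [add_start_time(row['started_at']),
                 add_is_dark(row['started_at']),
                 calc_dur(row['started_at'], row['ended_at'])],
    axis=1, result_type='expand')
new_cols.columns = ['start_time', 'is_dark', 'duration']
df = df.join(new_cols)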

Dask Running Out Of Memory (16GB) When using apply

I am trying to perform some string manipulation on data (combined from 6 csvs), about 3.5 GB+ of combined csv size.
Total csv size: 3.5 GB+
Total RAM size: 16 GB
Library used: Dask
Shape of combined df: 6 million rows and 57 columns
I have a method that just eliminates unwanted characters from essential columns like:
def stripper(x):
    try:
        if type(x) != float or type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x
And I am applying the above method to certain columns as:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)
I am also filling null values of one column with the values from another column:
df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])
These are the two operations I need to perform, and after these I just call .head() to get values (as Dask works with lazy evaluation):
temp_df = df.head(10000)
But when I do this, it keeps eating RAM, my total 16 GB of RAM goes to zero, and the kernel dies.
How can I solve this issue? Any help would be appreciated.
I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)
To expand on @richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)
ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()
It would also be good to know how many partitions you've split the Dask DataFrame into: each partition should fit comfortably in memory (i.e. < 1 GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
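As a rough illustration of checking and adjusting the partitioning (a sketch; the file glob and the 100MB target are assumptions for the example, not values from the question):
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")                 # hypothetical glob over the 6 csvs
print(ddf.npartitions)                          # current number of partitions
ddf = ddf.repartition(partition_size="100MB")   # aim for moderately sized partitions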

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is Cython, but you can also change the engine to numba, or use the njit decorator directly, to speed things up. Look up "Enhancing performance" in the pandas docs.
Numba converts Python code to optimized machine code; pandas is highly integrated with NumPy and hence with numba as well. You can experiment with the parallel, nogil, cache, and fastmath options for further speedup. This method shines for huge inputs where speed is needed.
With numba you can do eager compilation, or else the first execution takes a little time for compilation and subsequent calls will be fast.
import numba as nb
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
a = df1.values

# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())

# This is just an illustration, not the correct logic. Change the logic according to your needs.
# @nb.njit((nb.int64,))
# def f(x):
#     total = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             total += (x[i] == a[j]).sum()
#     return total

# Experiment with the engine
print(df2['Sequence'].apply(f))
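For what it's worth, here is a minimal runnable sketch of the eager-compilation idea mentioned above: the explicit signature is compiled at decoration time, so the first call pays no compilation cost. The function name overlap_fraction and its position-by-position equality count are illustrative, not the asker's exact overlap definition:
import numba as nb
import numpy as np

@nb.njit(nb.float64(nb.int64[:, :], nb.int64[:, :]))
def overlap_fraction(seq, ref):
    # Fraction of positions where the two 2-D arrays agree elementwise.
    eq = 0
    for i in range(seq.shape[0]):
        for j in range(seq.shape[1]):
            if seq[i, j] == ref[i, j]:
                eq += 1
    return eq / seq.size

a = np.random.randint(1, 5, (100, 7)).astype(np.int64)
b = np.random.randint(1, 5, (100, 7)).astype(np.int64)
print(overlap_fraction(a, b))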
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)

How to save a dask series to hdf5

Here is what I tried first:
df = dd.from_pandas(pd.DataFrame(dict(x=np.random.normal(size=100),
                                      y=np.random.normal(size=100))),
                    chunksize=40)
cat = df.map_partitions(lambda d: np.digitize(d['x'] + d['y'], [.3, .9]),
                        meta=pd.Series([], dtype=int, name='x'))
cat.to_hdf('/tmp/cat.h5', '/cat')
This fails with "cannot properly create the storer...".
I next tried to save cat.values instead:
da.to_hdf5('/tmp/cat.h5', '/cat', cat.values)
This fails with "cannot convert float NaN to integer", which I am guessing is due to cat.values not having known shape and chunk-size values (they are NaN).
How do I get both of these to work? Note the actual data would not fit in memory.
This works fine:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(dict(x=np.random.normal(size=100),
                       y=np.random.normal(size=100)))
ddf = dd.from_pandas(df, chunksize=40)
cat = ddf.map_partitions(lambda d: pd.Series(np.digitize(d['x'] + d['y'], [.3, .9])),
                         meta=('x', int))
cat.to_hdf('cat.h5', '/cat')
You were missing the pd.Series wrapper around the call to np.digitize, which meant the output of map_partitions was a NumPy array instead of a pandas Series (hence the error). In the future, when debugging, it may be useful to try computing a bit of data from steps along the way to see where the error is (for example, I found this issue by running .head() on cat).
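For illustration, a small sketch of that spot-checking approach on the same toy data (the variable name step and the printed checks are just an example, not part of the original answer):
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame(dict(x=np.random.normal(size=100),
                                       y=np.random.normal(size=100))),
                     chunksize=40)
step = ddf.map_partitions(lambda d: pd.Series(np.digitize(d['x'] + d['y'], [.3, .9])),
                          meta=('x', int))
print(step.head())   # computes only the first partition, a cheap sanity check
print(step.dtype)    # confirm the dtype looks right before calling .to_hdf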

Using a dataframe to construct another in a for loop

As the title says, I've been trying to build a pandas DataFrame from another df using a for loop, calculating new columns with the last one built.
So far, I've tried:
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10,1,10,dtype = int)
This works:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-1)
But when I try building df and df1 at the same time like so:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-df1[i])
    df1[i-1] = df1[i].apply(lambda a: a-1)
It returns a lot of gibberish plus the line:
ValueError: Wrong number of items passed 10, placement implies 1
In this example, I am well aware that I could build df1 first and build df after. But it returns the same error if I try :
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-df1[i])
    df1[i-1] = df1[i].apply(lambda a: a-df[i])
Which is what I really need in the end.
Any help is much appreciated,
Alex
apply is trying to apply a function along an axis that you specify. It can be 0 (applying the function to each column) or 1 (applying the function to each row). By default, it applies the function to the columns. In your first example:
for i in steps:
    print(i)
    df[i-1] = df[i].apply(lambda a: a-1)
Each column is looped over because of your for loop, and .apply subtracts 1 from the entire column. You can think of a as being your entire column. It is exactly the same as the following:
for i in steps:
    print(i)
    df[i - 1] = df[i] - 1
Another way to see what .apply does is the following. Assume I have this dataframe:
df = pd.DataFrame(np.random.rand(10,4))
df.sum() and df.apply(lambda a: np.sum(a)) yield exactly the same result. It is just a simple example, but you can do more powerful calculations if needed.
Note that .apply is not the fastest method, so try to avoid it if you can.
An example where apply would be useful is if you have a function some_fct() defined that takes an int or float as argument, and you would like to apply it to the elements of a dataframe column.
import pandas as pd
import numpy as np
import math

def some_fct(x):
    return math.sin(x) / x

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10,2))
Obviously, some_fct(df[0]) would not work, as the function takes an int or float as argument while df[0] is a Series. However, using the apply method, you can apply your function to the elements of df[0], which are themselves floats.
df[0].apply(lambda x: some_fct(x))
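As a side note to the point above that .apply is not the fastest method, this particular example can also be written with NumPy ufuncs operating on the whole Series at once; a minimal sketch on the same toy data:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 2))

# Vectorized equivalent of df[0].apply(lambda x: some_fct(x))
result = np.sin(df[0]) / df[0]
print(result)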
Found it, I just needed to drop the .apply!
Example:
df = pd.DataFrame(np.arange(10))
df.columns = [10]
df1 = pd.DataFrame(np.arange(10))
df1.columns = [10]
steps = np.linspace(10,1,10,dtype = int)
for i in steps:
    print(i)
    df[i-1] = df[i] - df1[i]
    df1[i-1] = df1[i] + df[i]
It does exactly what it should!
I don't have enough knowledge about Python to explain why
pd.DataFrame().apply()
will not use what is outside of itself.
