I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5 GB+ of total CSV size.
**Total CSV size:** 3.5 GB+
**Total RAM size:** 16 GB
**Library used:** Dask
**Shape of combined DataFrame:** 6 million rows and 57 columns
I have a method that just eliminates unwanted characters from essential columns like:
import re

def stripper(x):
    try:
        if type(x) != float or type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x
And I am applying the above method to certain columns as:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)
I am also filling null values of one column with the values from another column:
df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])
These are the two operations I need to perform, and after these I just call .head() to get some values (as Dask works with lazy evaluation):
temp_df = df.head(10000)
But when I do this, it keeps eating RAM until my entire 16 GB is exhausted and the kernel dies.
How can I solve this issue? Any help would be appreciated.
I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)
To expand on #richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)

ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()
It would also be good to know how many partitions you've split the Dask DataFrame into: each partition should fit comfortably in memory (i.e. < 1 GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs). A rough way to check and adjust this is sketched below.
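As a hedged sketch of that check (my own illustration, not from the original answer), assuming the six CSVs are loaded directly with dd.read_csv, you could inspect the partition count and ask Dask to repartition by a target size:

import dask.dataframe as dd

# assumption: the combined data is read straight into Dask rather than pandas
df = dd.read_csv("data/*.csv", dtype=str)
print(df.npartitions)                        # how many partitions Dask chose

# aim for roughly 100 MB per partition so each chunk stays far below 16 GB
df = df.repartition(partition_size="100MB")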
I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500,000 rows while my df1 usually has only around 50-100 rows, so the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is cython, but you can also change the engine to numba, or use the njit decorator yourself, to speed things up. Look up the "Enhancing performance" section of the pandas docs.
Numba converts Python code to optimized machine code; pandas is highly integrated with numpy and hence with numba as well. You can experiment with the parallel, nogil, cache and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With numba you can do eager compilation; otherwise the first execution takes a little time for compilation and subsequent calls are fast (a hedged sketch of eager compilation follows the example below).
import numba as nb
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just an illustration, not the correct logic. Change the logic according to your needs:
# @nb.njit((nb.int64,))
# def f(x):
#     sum = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             sum += (x[i] == a[j]).sum()
#     return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
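To illustrate the eager-compilation point above, here is a minimal hedged sketch (my own addition, not part of the original answer): passing an explicit signature to njit compiles at decoration time instead of on the first call, and cache/fastmath are optional knobs. The function name and the all-pairs overlap logic are just assumptions for the example.

import numba as nb
import numpy as np

# Eager compilation: the explicit signature is compiled immediately.
@nb.njit(nb.float64(nb.int64[:, :], nb.int64[:, :]), cache=True, fastmath=True)
def overlap_fraction(x, ref):
    matches = 0
    for i in range(x.shape[0]):
        for j in range(ref.shape[0]):
            for k in range(ref.shape[1]):
                if x[i, k] == ref[j, k]:
                    matches += 1
    return matches / (ref.size * x.shape[0])

ref = df1.values.astype(np.int64)  # df1 from the snippet above
print(df2['Sequence'].apply(lambda s: overlap_fraction(np.asarray(s, dtype=np.int64), ref)))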
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
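If the apply itself becomes the bottleneck, a further hedged refinement (my own sketch, not from the original answer) is to broadcast the comparison in a single numpy operation, assuming every Sequence entry has the same shape as df1. Note the stacked array can be large (several GB for 500,000 rows of 100x7 integers), so you may need to process df2 in chunks:

import numpy as np

seqs = np.stack(df2['Sequence'].to_list())           # shape (len(df2), 100, 7)
df2['Overlap'] = (seqs == df1.values).mean(axis=(1, 2))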
I have a dataset on Google Playstore data. It has twelve features (one float, the rest objects) and I would like to manipulate one of them a bit so that I can convert it to numeric form. The feature column I'm talking about is the Size column, and here's a snapshot of what it looks like:
As you can see, it's a string consisting of the number with the scale appended to it. Checking through the rest of the feature, I discovered that aside from megabytes (M), there are also some entries in kilobytes (K), and some entries where the size is the string "Varies according to device".
So my ultimate plan to deal with this is to:
1. Strip the last character from all the entries under Size.
2. Convert the convertible entries to floats.
3. Rescale the k entries by dividing them by 1000 so as to represent them properly.
4. Replace the "Varies according to device" entries with the mean of the feature.
I know how to do 1, 2 and 4, but 3 is giving me trouble because I'm not sure how to differentiate the k entries from the M ones and divide only those specific entries by 1000. If all of them were M or k there would be no issue, as I've dealt with that before, but having to discriminate makes it trickier, and I'm not sure what form the syntax should take (my attempts keep throwing errors).
By the way if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise if anything!
Any help would be greatly appreciated. Thank you!!
------------------------Edit------------------------
A minimal reproducible example of an attempt would be:
import pandas as pd

data = pd.read_csv("playstore-edited.csv",
                   index_col=("App"),
                   parse_dates=True,
                   infer_datetime_format=True)
x = data
var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i] == "k":
        ls.append(i)
x.Size.loc[ls, "Size"] = x.Size.loc[ls, "Size"] / 1000
This throws the following error:
IndexingError: Too many indexers
I know the last part of the code is off, but I'm not sure how to express what I want.
As written in the comment: if you strip the final letter into a new column, you can then condition on that column for the division.
df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M','6K']})
df['Scale'] = df['Size'].str[-1]
df['Size'] = df['Size'].str[:-1].astype(float)  # float so the K-to-M division keeps its fractional value
df.loc[df['Scale'] == 'K', 'Size'] = df.loc[df['Scale'] == 'K', 'Size'] / 1000
df = df.drop('Scale', axis=1)
df
Process the Size column with a regex and then do your conversions (this assumes pandas and numpy are imported as pd and np):
df = (
    df
    # extract the numeric part; entries like "Varies according to device" become NaN
    .assign(New_Size=lambda x: pd.to_numeric(
        x['Size'].str.replace('[A-Za-z ]+', '', regex=True), errors='coerce'))
    # extract the scale part (the first run of letters)
    .assign(Scale=lambda x: x['Size'].str.extract('([A-Za-z]+)', expand=False))
    # convert the K entries to MB
    .assign(Size=lambda x: np.where(x['Scale'] == 'K', x['New_Size'] / 1000, x['New_Size']))
    # mark the converted rows as MB
    .assign(Scale=lambda x: np.where(x['Scale'] == 'K', 'M', x['Scale']))
    # rows without a real size get the mean of the Size column
    .assign(Size=lambda x: np.where(x['Scale'] != 'M', x['Size'].mean(), x['Size']))
)
I have a very large dataframe that I am resampling a large number of times, so I'd like to use dask to speed up the process. However, I'm running into challenges with the groupby apply. An example data frame would be
import numpy as np
import pandas as pd
import random
test_df = pd.DataFrame({'sample_id': np.array(['a', 'b', 'c', 'd']).repeat(100),
                        'param1': random.sample(range(1, 1000), 400)})
test_df.set_index('sample_id', inplace=True)
which I can normally groupby and resample using
N = 5;i=1
test = test_df\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
Which I wrap into a method that iterates over an N gradient i times. The actual dataframe is very large with a number of columns and, before anyone suggests it, this method is a little faster than an np.random.choice approach on the index; the cost is all in the groupby. I've run the overall procedure through a multiprocessing method, but I wanted to see if I could get a bit more speed out of a Dask version of the same. The problem is that the documentation suggests that if you index and partition, you get complete groups per partition, which is not proving true.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df, npartitions=8)
df1=df1.persist()
df1.divisions
creates
('a', 'b', 'c', 'd', 'd')
which unsurprisingly results in a failure
N = 5;i=1
test = df1\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
ValueError: Metadata inference failed in groupby.apply(sample).
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
ValueError("Cannot take a larger sample than population when 'replace=False'")
I have dug all around the documentation on keywords, dask dataframes & partitions, and groupby aggregations, and am simply missing the solution if it's there in the documents. Any advice on how to create a smarter set of partitions and/or get the groupby with sample playing nicely with dask would be deeply appreciated.
It's not quite clear to me what you are trying to achieve, and why you need to add replace=False (which is the default), but the following code works for me; I just needed to add meta.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df.reset_index(), npartitions=8)
N = 5
i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(lambda x: x.sample(n=N),
           meta={"sample_id": "object",
                 "param1": "f8"})\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
If you then want to drop sample_id you just need to add
df = df.drop("sample_id", axis=1)
I have a 10 GB CSV file with 170,000,000 rows and 23 columns that I read into a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype = {'tax_id': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_id'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax numbers are unique, I would recommend setting tax_num to the index and then indexing on that. As it stands, you call isin which is a linear operation. However fast your machine is, it can't do a linear search on 170 million records in a reasonable amount of time.
df.set_index('tax_num', inplace=True) # df = df.set_index('tax_num')
df['class'] = 'B'
df.loc[h, 'class'] = 'A'
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
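As a rough, hedged sketch of that idea (my own addition, not from the original answer), the same vectorized logic can run per partition with dask. The question uses both tax_id and tax_num, so the column name below is an assumption:

import dask.dataframe as dd

h_set = set(h)  # constant-time membership tests, shared with every partition

ddf = dd.read_csv(f, dtype={'tax_id': str})   # same file and dtype as above
ddf['class'] = 'B'
ddf['class'] = ddf['class'].mask(ddf['tax_id'].isin(h_set), 'A')
result = ddf.compute()   # or ddf.to_csv('classified-*.csv') to stay out of memory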
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to be using it for membership testing. list objects have linear time membership testing, set objects have very optimized constant-time performance for membership testing. That is the lowest hanging fruit here. So use
h = set(h) # convert list to set
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
I have a pandas dataframe that I converted to a dask dataframe:
df.shape = (60893, 2)
df2.shape = (7254909, 2)
df['name_clean'] = df['Name'].apply(lambda x: re.sub(r'\W+', '', x).lower(), meta=('x', 'str'))
names = df['name_clean'].drop_duplicates().values.compute()
df2['found'] = df2['name_clean2'].apply(lambda x: any(name in x for name in names), meta=('x', 'str'))  # ~ takes 834 ms
df2.head(10)  # ~ takes 3 min 54 sec
How can I see the shape of a dask dataframe?
Why does .head() take so much time? Am I doing it the right way?
You cannot iterate over a dask.dataframe or dask.array. You need to call the .compute() method to turn it into a pandas dataframe/series or NumPy array first.
Note that just calling the .compute() method and then discarding the result doesn't do anything; you need to save the result as a variable.
import re

dask_series = df.Name.apply(lambda x: re.sub(r'\W+', '', x).lower(),
                            meta=('x', 'str'))
pandas_series = dask_series.compute()

for name in pandas_series:
    ...
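On the shape question (a hedged addition of my own, not part of the original answer): the row count of a dask dataframe is lazy and has to be computed, while the column count is known up front.

nrows = df2.shape[0].compute()   # row count is lazy and triggers a computation
ncols = df2.shape[1]             # column count is known without computing
print(nrows, ncols)              # len(df2) also works and computes the row count

The .head(10) call is also where the lazy .apply defined above actually executes, which is why it looks so much slower than the line that merely defined the column.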