TL;DR: I want to right-align this df, overwriting NaN's/shifting them to the left:
In [6]: series.str.split(':', expand=True)
Out[6]:
0 1 2
0 1 25.842 <NA>
1 <NA> <NA> <NA>
2 0 15.413 <NA>
3 54.154 <NA> <NA>
4 3 2 06.284
to get it as continuous data with the right-most columns filled:
0 1 2
0 0 1 25.842 # 0 or NA
1 <NA> <NA> <NA> # this NA should remain
2 0 0 15.413
3 0 0 54.154
4 3 2 06.284
What I'm actually trying to do:
I've got a Pandas Series of Durations/timedeltas which are roughly in an H:M:S format - but sometimes the 'H' or the 'H:M' parts can be missing - so I can't just pass it onto Timedelta or datetime. What I want to do is convert them to seconds, which I've done but it seems a bit convoluted:
In [1]: import pandas as pd
...:
...: series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
...: t = series.str.split(':') # not using `expand` helps for the next step
...: t
Out[1]:
0 [1, 25.842]
1 <NA>
2 [0, 15.413]
3 [54.154]
4 [3, 2, 06.284]
dtype: object
In [2]: # reverse it so seconds are first; and NA's are just empty
...: rows = [i[::-1] if i is not pd.NA else [] for i in t]
In [3]: smh = pd.DataFrame.from_records(rows).astype('float')
...: # left-aligned is okay since it's continuous Secs->Mins->Hrs
...: smh
Out[3]:
0 1 2
0 25.842 1.0 NaN
1 NaN NaN NaN
2 15.413 0.0 NaN
3 54.154 NaN NaN
4 6.284 2.0 3.0
If I don't do this fillna(0) step then it generates NaN's for the seconds-conversion later.
In [4]: smh.iloc[:, 1:] = smh.iloc[:, 1:].fillna(0) # NaN's in first col = NaN from data; so leave
...: # convert to seconds
...: smh.iloc[:, 0] + smh.iloc[:, 1] * 60 + smh.iloc[:, 2] * 3600
Out[4]:
0 85.842
1 NaN
2 15.413
3 54.154
4 10926.284
dtype: float64
^ Expected end result.
(Alternatively, I could write a small Python-only function to split on :'s and then convert based on how many values each list has.)
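For comparison, the Python-only alternative mentioned above might look roughly like this. It is a minimal sketch: the name hms_to_seconds is made up for illustration, and it assumes missing values arrive as None rather than pd.NA.

```python
import math


def hms_to_seconds(text):
    """Convert 'S', 'M:S' or 'H:M:S' text to seconds; None gives NaN."""
    if text is None:
        return math.nan
    parts = [float(p) for p in text.split(':')]
    # Pair the fields from the right with their weights: seconds, minutes, hours.
    return sum(v * w for v, w in zip(reversed(parts), (1, 60, 3600)))


values = ['1:25.842', None, '0:15.413', '54.154', '3:2:06.284']
seconds = [hms_to_seconds(v) for v in values]
```

Because zip stops at the shorter input, a value with only seconds or only minutes and seconds needs no padding at all.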
You can attack the problem earlier by padding series with '0:' as follows:
# setup
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
# create a padding of 0 series
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) and c > 0 else '' for c in counts], dtype='string')
# apply padding
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
print(t)
Output
0 1 2
0 0 1 25.842
1 <NA> <NA> <NA>
2 0 0 15.413
3 0 0 54.154
4 3 2 06.284
Let's try NumPy to right-align the dataframe. The basic idea is to sort each row (axis=1) so that the NaN values come before the non-NaN values, while keeping the order of the non-NaN values intact. Keeping that order is only guaranteed with a stable sort:
i = np.argsort(np.where(df.isna(), -1, 0), axis=1, kind='stable')
df[:] = np.take_along_axis(df.values, i, axis=1)
0 1 2
0 NaN 1.0 25.842
1 NaN NaN NaN
2 NaN 0.0 15.413
3 NaN NaN 54.154
4 3.0 2.0 6.284
In order to get the total seconds you can multiply the right aligned dataframe by [3600, 60, 1] and take sum along axis=1:
df.mul([3600, 60, 1]).sum(1)
0 85.842
1 0.000
2 15.413
3 54.154
4 10926.284
dtype: float64
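Note that row 1 comes out as 0.000 here rather than NaN. A self-contained sketch of the same right-align idea that keeps the all-NaN row as NaN could look like the following. It assumes the split values were already converted to floats, and it relies on min_count=1 in the sum.

```python
import numpy as np
import pandas as pd

# Left-aligned values as floats, mirroring the split output in the question.
df = pd.DataFrame([[1.0, 25.842, np.nan],
                   [np.nan, np.nan, np.nan],
                   [0.0, 15.413, np.nan],
                   [54.154, np.nan, np.nan],
                   [3.0, 2.0, 6.284]])

# False (NaN) sorts before True (value); a stable sort keeps the value order.
order = np.argsort(df.notna().to_numpy(), axis=1, kind='stable')
aligned = pd.DataFrame(np.take_along_axis(df.to_numpy(), order, axis=1),
                       index=df.index, columns=df.columns)

# min_count=1 makes an all-NaN row sum to NaN instead of 0.
seconds = aligned.mul([3600, 60, 1]).sum(axis=1, min_count=1)
```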
1. Using the NA-sorting approach from Shubham's answer, I've come up with this, which uses Pandas apply and Python's sorted:
series = pd.Series(['1:25.842', pd.NA, '0:15.413', '54.154', '3:2:06.284'], dtype='string')
df = series.str.split(':', expand=True)
# key for sorted is `pd.notna`, so False(0) sorts before True(1)
df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
(And then multiply as needed.) But it's quite slow, see below.
2. By pre-padding the '0:'s in Dani's answer, I can then create pd.Timedelta's directly and get their total_seconds:
res = ... # from answer
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
(But doing the expand-split and then multiply+sum is faster across ~10k rows.)
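As a side note, the per-element lambda can probably be replaced by the vectorized .dt.total_seconds() accessor. The sketch below uses hypothetical, already zero-padded H:M:S strings rather than the res series from the answer, and I have not timed it against the variants below.

```python
import pandas as pd

# Hypothetical input that is already zero-padded to H:M:S.
padded = pd.Series(['00:01:25.842', None, '00:00:15.413', '03:02:06.284'])

# Unparseable or missing entries become NaT, which turns into NaN seconds.
seconds = pd.to_timedelta(padded, errors='coerce').dt.total_seconds()
```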
Performance caveats, with 10k rows of data:
Initial code/attempt in my question, row reversal - so maybe I'll stick with it:
%%timeit
t = series.str.split(':')
rows = [i[::-1] if i is not pd.NA else [] for i in t]
smh = pd.DataFrame.from_records(rows).astype('float')
smh.mul([1, 60, 3600]).sum(axis=1, min_count=1)
# 14.3 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy argsort + take_along_axis:
%%timeit
df = series.str.split(':', expand=True)
i = np.argsort(np.where(df.isna(), -1, 0), 1)
df[:] = np.take_along_axis(df.values, i, axis=1)
df.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 30.1 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Padding beforehand:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
t = res.str.split(':', expand=True)
t.apply(pd.to_numeric, errors='coerce').mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 48.3 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Padding beforehand, timedeltas + total_seconds:
%%timeit
counts = 2 - series.str.count(':')
pad = pd.Series(['0:' * c if pd.notna(c) else '' for c in counts], dtype='string')
res = pad.str.cat(series)
pd.to_timedelta(res, errors='coerce').map(lambda x: x.total_seconds())
# 183 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pandas apply + Python sorted (very slow):
%%timeit
df = series.str.split(':', expand=True)
df = df.apply(sorted, axis=1, key=pd.notna, result_type='broadcast')
df.apply(pd.to_numeric).mul([3600, 60, 1]).sum(axis=1, min_count=1)
# 1.4 s ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import numpy as np
import pandas as pd
ind = [0, 1, 2]
cols = ['A','B','C']
df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
Say you have a pandas dataframe df looking like:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
If you want to capture a single element from each column in cols at the matching row index in ind, the output should look like a series:
A 0
B 4
C 8
What I've tried so far was:
df.loc[ind,cols]
which gives the undesired output:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
Any suggestions?
context:
The next step would be mapping the output of a df.idxmax() call on one dataframe onto another dataframe with the same column names and indexes, but I can likely figure that out once I know how to do the transformation above.
You can use DataFrame.lookup():
In [6]: pd.Series(df.lookup(df.index, df.columns), index=df.columns)
Out[6]:
A 0
B 4
C 8
dtype: int32
or:
In [14]: pd.Series(df.lookup(ind, cols), index=df.columns)
Out[14]:
A 0
B 4
C 8
dtype: int32
Explanation:
In [12]: df.lookup(df.index, df.columns)
Out[12]: array([0, 4, 8])
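One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the calls above fail on current versions. A rough label-based equivalent, sketched here with Index.get_indexer plus NumPy indexing (the variable names are just for illustration), would be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['A', 'B', 'C'])
ind = [0, 1, 2]
cols = ['A', 'B', 'C']

# Translate row and column labels into integer positions.
row_pos = df.index.get_indexer(ind)
col_pos = df.columns.get_indexer(cols)

# Pick one element per (row, column) pair.
picked = pd.Series(df.to_numpy()[row_pos, col_pos], index=cols)

# Same thing with a different row per column.
picked_other = pd.Series(
    df.to_numpy()[df.index.get_indexer([0, 2, 1]), col_pos], index=cols)
```

get_indexer returns -1 for labels it cannot find, so it is worth checking for that before indexing if the labels are not guaranteed to exist.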
Here's a vectorized one with NumPy's advanced-indexing to select one element per column, given the row indices ind per col -
pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Sample run -
In [107]: ind = [0, 2, 1] # different one than sample for variety
...: cols = ['A','B','C']
...: df = pd.DataFrame(np.arange(9).reshape((3,3)),columns=cols)
...:
In [109]: df
Out[109]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
In [110]: pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
Out[110]:
A 0
B 7
C 5
dtype: int64
Runtime test
Let's compare the proposed one against the pandas built-in vectorized lookup method from @MaxU's solution. Since we are checking how well the vectorized ones scale, let's use a greater number of cols -
In [111]: ncols = 10000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [112]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
1000 loops, best of 3: 718 µs per loop
# Proposed in this post
In [113]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
1000 loops, best of 3: 410 µs per loop
In [114]: ncols = 100000
...: df = pd.DataFrame(np.random.randint(0,9,(100,ncols)))
...: ind = np.random.randint(0,100,(ncols)).tolist()
...:
# @MaxU's solution
In [115]: %timeit pd.Series(df.lookup(ind, df.columns), index=df.columns)
100 loops, best of 3: 8.83 ms per loop
# Proposed in this post
In [116]: %timeit pd.Series(df.values[ind, np.arange(len(ind))], df.columns)
100 loops, best of 3: 5.76 ms per loop
There is another way using a MultiIndex, if you like using .loc:
df1=df.reset_index().melt('index').set_index(['index','variable'])
df1.loc[list(zip(df.index,df.columns))]
Out[118]:
value
index variable
0 A 0
1 B 4
2 C 8
There should be a more direct way but this is what I could think of,
val = [df.iloc[i,i] for i in df.index]
pd.Series(val, index = df.columns)
A 0
B 4
C 8
dtype: int64
You could zip the column and index values you would like to retrieve the values for and then create a series from that:
pd.Series([df.loc[id_, col] for id_, col in zip(ind, cols)], df.columns)
A 0
B 4
C 8
Or if you always just need the diagonal value:
pd.Series(np.diag(df), df.columns)
Will be much faster
Is there an efficient way to delete columns that have at least 20% missing values?
Suppose my dataframe is like:
A B C D
0 sg hh 1 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
After removing the columns, the dataframe becomes like this:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
You can use boolean indexing on the columns where the count of notnull values is larger than 80%:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
This is useful for many cases, e.g., dropping the columns where the number of values larger than 1 would be:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna() as illustrated by @EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
    df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
You can call dropna and pass a thresh value to drop the columns that don't meet your threshold criteria:
In [10]:
frac = len(df) * 0.8
df.dropna(thresh=frac, axis=1)
Out[10]:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 NaN 6
5 y 8
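If you would rather state the rule directly as a share of missing values, something like the sketch below should also work. The placement of the B and C values is my assumption, since the sample table in the question does not show which rows they belong to.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the sample frame from the question.
df = pd.DataFrame({
    'A': ['sg', 'gf', 'hh', 'dd', np.nan, 'y'],
    'B': ['hh', np.nan, np.nan, np.nan, np.nan, np.nan],
    'C': [1, np.nan, np.nan, np.nan, np.nan, np.nan],
    'D': [7, 9, 10, 8, 6, 8],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Keep only the columns with less than 20% missing.
result = df.loc[:, missing_share < 0.2]
```

Column A has one missing value out of six (about 17%), so it stays, while B and C are dropped.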
Having a dataframe in python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
how can I create a result dataframe which would yield all cases and either an "A" if the case had only "A's" assigned, "B" if it was only "B's" or "MIXED" if the case had both A and B?
Result would be then:
Case Type
1 A
2 MIXED
3 B
Here is an option, where we firstly collect the TYPE as list by group of CASE and then check the length of unique TYPE, if it is larger than 1, return MIXED otherwise the TYPE by itself:
import pandas as pd
import numpy as np
groups = (df.groupby('CASE').agg(lambda g: [g.TYPE.unique()])
          .apply(lambda row: np.where(len(row.TYPE) > 1, 'MIXED', row.TYPE[0]), axis=1))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# dtype: object
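On recent pandas versions the function passed to agg receives each column as a Series, so the g.TYPE access above may not work as written. A simpler variant of the same idea (check the number of unique values per group) is sketched below; the helper name summarize is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})


def summarize(types):
    # A single distinct value is returned as-is; anything else is MIXED.
    return types.iloc[0] if types.nunique() == 1 else 'MIXED'


result = df.groupby('CASE')['TYPE'].agg(summarize)
```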
df['NTYPES'] = df.groupby('CASE').transform(lambda x: x.nunique())
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
df.groupby('TYPE', as_index=False).first().drop('NTYPES', axis=1)
TYPE CASE
0 A 1
1 B 3
2 MIXED 2
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (these are slow, so avoiding them may become important if your dataset gets sufficiently large).
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
.unstack(level=0)
.fillna(0))
Here's what vc looks like
CASE 1 2 3
TYPE
A 1.0 0.5 0.0
B 0.0 0.5 1.0
Notice that all the information is contained in the first row. Cutting said row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
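If the binning feels like too much machinery, the same "no apply" goal can probably be reached with two built-in groupby reductions. This is only a sketch and not part of the timings further down.

```python
import pandas as pd

df = pd.DataFrame({'CASE': [1] * 3 + [2] * 2 + [3] * 3,
                   'TYPE': ['A'] * 4 + ['B'] * 4})

g = df.groupby('CASE')['TYPE']

# Keep the first TYPE where the group has one distinct value, else MIXED.
types = g.first().where(g.nunique() == 1, 'MIXED')
```

Both first() and nunique() return a series indexed by CASE, so where() lines them up without any explicit loop.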
Here is a somewhat ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: %paste
(df.groupby('CASE')['TYPE']
.apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
.reset_index()
.drop('level_1', axis=1)
)
## -- End pasted text --
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing: against 800K rows DF:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop
I work with large datasets, making pandas group and groupby functions take a long time/use too much memory. I have heard some people say groupby can be slow, but am having trouble finding a better solution.
If my dataframe has 2 columns similar to:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
a b
1 1
2 1
2 1
4 1
I wish to return a list of values that match to a value in another column:
a b list_of_b
1 1 [1]
2 1 [1,1]
2 1 [1,1]
4 1 [1]
I currently use:
df_group = df.groupby('a')
df['list_of_b'] = df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
The code above works for small inputs, but not on large dataframes (more than 1,000,000 rows). Does anyone have a faster way to do this?
Shortest solution I can think of:
df = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
df.join(pd.Series(df.groupby(by='a').apply(lambda x: list(x.b)), name="list_of_b"), on='a')
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
On a 4K row df I get the following:
In [29]:
df_group = df.groupby('a')
%timeit df.apply(lambda row: df_group.get_group(row['a'])['b'].tolist(), axis=1)
%timeit df['a'].map(df.groupby('a')['b'].apply(list))
1 loops, best of 3: 4.37 s per loop
100 loops, best of 3: 4.21 ms per loop
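Spelled out on the sample frame, the faster of the two timed variants works roughly like this: build one list per group, then map each row's key onto it. This is a small sketch using agg(list) in place of apply(list).

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 4], 'b': [1, 1, 1, 1]})

# One list of b values per distinct value of a.
lists = df.groupby('a')['b'].agg(list)

# Look up each row's key in that per-group series.
df['list_of_b'] = df['a'].map(lists)
```

Rows that share a key end up referencing the same list object, so mutate those lists with care.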
Just doing the grouping and then joining back to the original dataframe seems to be quite a bit faster:
def make_lists(df):
    g = df.groupby('a')
    def list_of_b(x):
        return x.b.tolist()
    return df.set_index('a').join(
        pd.DataFrame(g.apply(list_of_b),
                     columns=['list_of_b']),
        rsuffix='_').reset_index()
This gives me 192ms per loop with 1M rows generated like this:
df1 = pd.DataFrame({'a':[1,2,2,4], 'b':[1,1,1,1]})
low = 1
high = 10
size = 1000000
df2 = pd.DataFrame({'a':np.random.randint(low,high,size),
'b':np.random.randint(low,high,size)})
make_lists(df1)
Out[155]:
a b list_of_b
0 1 1 [1]
1 2 1 [1, 1]
2 2 1 [1, 1]
3 4 1 [1]
In [156]:
%%timeit
make_lists(df2)
10 loops, best of 3: 192 ms per loop