I have a Pandas DataFrame.
LeafId pidx pidy count
1 x y 10
1 x y 20
1 x z 30
3 b q 10
1 x y 20
We can see that there are multiple rows with pidx = x and pidy = y.
I want to sum the count column and get dataframe df2 as:
LeafId pidx pidy count
1 x y 50
1 x z 30
3 b q 10
I know one way of doing it:
df2 = df.groupby(['pidx','pidy']).agg({'LeafId':'first', 'count':'sum'}).reset_index()
But I want the most efficient way of doing it for a huge DataFrame (millions of records), which will take the least amount of time.
Is there any better way of doing this?
Also, instead of putting LeafId inside .agg(), can I do the following?
df2 = df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index()
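For reproducibility, here is a minimal sketch of the sample frame above, comparing the .agg() approach with grouping by all three keys:

import pandas as pd

df = pd.DataFrame({'LeafId': [1, 1, 1, 3, 1],
                   'pidx':   ['x', 'x', 'x', 'b', 'x'],
                   'pidy':   ['y', 'y', 'z', 'q', 'y'],
                   'count':  [10, 20, 30, 10, 20]})

a = df.groupby(['pidx', 'pidy']).agg({'LeafId': 'first', 'count': 'sum'}).reset_index()
b = df.groupby(['LeafId', 'pidx', 'pidy'], as_index=False)['count'].sum()
print(a)  # same totals as b, only the column order differs
print(b)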
If you need to group by the LeafId, pidx and pidy columns:
df1 = df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum()
print (df1)
LeafId pidx pidy count
0 1 x y 50
1 1 x z 30
2 3 b q 10
I tried some timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')

df = pd.DataFrame({'LeafId': np.random.randint(1000, size=N),
                   'pidx': np.random.choice(L1, N),
                   'pidy': np.random.choice(L2, N),
                   'count': np.random.randint(1000, size=N)})
#print (df)
print (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
print (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
In [261]: %timeit (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
1 loop, best of 3: 544 ms per loop
In [262]: %timeit (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
1 loop, best of 3: 466 ms per loop
With smaller groups (LeafId cardinality raised from 1,000 to 10,000):
np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')

df = pd.DataFrame({'LeafId': np.random.randint(10000, size=N),
                   'pidx': np.random.choice(L1, N),
                   'pidy': np.random.choice(L2, N),
                   'count': np.random.randint(10000, size=N)})
print (df)
print (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
print (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
In [264]: %timeit (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
1 loop, best of 3: 933 ms per loop
In [265]: %timeit (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
1 loop, best of 3: 775 ms per loop
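One further idea for very large frames: converting the grouping keys to category dtype lets groupby work on integer codes, which can shave off time. A minimal sketch of that idea (not benchmarked here, worth timing on your own data):

# Sketch only: categorical keys plus observed=True to skip unused combinations
df_cat = df.copy()
for col in ['LeafId', 'pidx', 'pidy']:
    df_cat[col] = df_cat[col].astype('category')

out = (df_cat.groupby(['LeafId', 'pidx', 'pidy'], observed=True, as_index=False)['count']
             .sum())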
I have a dataframe with 3 columns; each row holds the probabilities that the feature T takes the value 1, 2, or 3 for that row.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
For row 0, T is 1 with 80% chance, 2 with 10% and 3 with 10%
I want to simulate the value of T for each row and change the columns T1,T2, T3 to binary features.
I have a solution, but it loops over the rows of the dataframe and is really slow (my real dataframe has over 1 million rows):
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        if pos == choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
Is there a way to vectorize this code ?
Thank you !
Here's one based on vectorized random.choice with a given matrix of probabilities -
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out

ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
Verify with a large number of draws that the output frequencies match the given probabilities -
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
In [140]: df
Out[140]:
T1 T2 T3
0 0.80 0.10 0.1
1 0.50 0.20 0.3
2 0.01 0.89 0.1
In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)
In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
[0.50051, 0.20113, 0.29836],
[0.01015, 0.89045, 0.0994 ]])
In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
[0.5 , 0.2 , 0.3 ],
[0.01, 0.89, 0.1 ]])
Timings on 1,000,000 rows -
# Setup input
In [169]: N = 1000000
...: a = np.random.rand(N,3)
...: df = pd.DataFrame(a/a.sum(1,keepdims=1), columns=['T1','T2','T3'])
# #gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop
# Soln from this post
In [172]: %%timeit
...: ar_out = matrixprob_to_onehot(df.values)
...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
We can use numpy for this:
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
This generates a single column of random values and compares it against the cumulative sum across the columns of each row, which results in a DataFrame of booleans where the first False value shows which "bucket" the random value falls in. With idxmin, we get the column label of that bucket, which we then convert back to a one-hot encoding with pd.get_dummies.
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]
df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
print(result)
Output:
0 1 2
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 0 0 1
6 0 1 0
7 0 1 0
8 0 0 1
9 0 1 0
A note:
Most of the slowdown comes from pd.get_dummies; if you instead build the output frame with Divakar's method of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns) applied to a boolean one-hot array, it gets a lot faster.
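To make that concrete, here is a minimal sketch of the faster end-to-end variant (assuming np/pd are imported and df holds row-normalised probabilities as above):

# Pick the bucket with the boolean trick, then build the one-hot frame directly
mask = np.random.rand(len(df), 1) > df.values.cumsum(axis=1)  # True while rand exceeds the cumulative sum
idx = mask.argmin(axis=1)                                      # first False per row = chosen bucket
onehot = np.zeros(df.shape, dtype='i1')
onehot[np.arange(len(df)), idx] = 1
result = pd.DataFrame(onehot, index=df.index, columns=df.columns)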
I am trying to understand the code provided below (which I found online but do not fully understand). I want to remove user names that do not appear in my dataframe at least 4 times, without modifying the dataframe in any other way. Does the following code solve this problem, and if so, can you explain how filter combined with the lambda achieves it? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the groupby filtration documentation — filter keeps the rows of every group for which the lambda returns True.
A faster solution for a bigger DataFrame is transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
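For intuition, a minimal sketch (reusing the sample frame above) of what the transform approach computes; filter(), by contrast, calls the lambda once per group with that group's sub-DataFrame and keeps the group's rows when it returns True. Note that > 4 keeps names appearing five or more times; use >= 4 if you literally mean "at least 4".

# transform('size') broadcasts each group's size back to every row, so the
# comparison > 4 yields a boolean mask aligned with df's index
sizes = df.groupby('userName')['userName'].transform('size')
print(sizes.values)    # [5 5 5 5 5 3 3 3 6 6 6 6 6 6]
print(df[sizes > 4])   # same rows as the filter() output above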
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)   # integer codes f and the unique names u
    c = np.bincount(f)           # number of occurrences of each unique name
    m = c[f] > k                 # per-row mask: does this row's name occur more than k times?
    return df[m]
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
#jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop
I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a':randn(10), 'b':randn(10), 'c':randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0].count()
should work
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So @DSM's suggestions are correct and much faster than my suggestion.
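Since the end goal in the question is a bar plot, here is a minimal end-to-end sketch (assuming matplotlib is available):

import matplotlib.pyplot as plt

counts = (df < 2.0).sum()   # per-column count of values below 2.0
counts.plot(kind='bar')
plt.show()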
Method chaining is also possible, since each comparison operator has a corresponding method (e.g. < is lt(), <= is le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine the two boolean frames with boolean operators:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that it contains the following? When filling, a tuple (x, y) fills two locations, (x, y) and (y, x), and (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a DataFrame (using pandas), then it can easily be written to a file with a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer above for a neater way to construct df.
Not as elegant as I'd like (and not using pandas; Python 2 syntax), but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w

keys = adj.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True

keys = k.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
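For newer environments, a minimal Python 3 sketch of the second variant, assuming myDict as defined in the question:

keys = sorted({name for pair in myDict for name in pair})

def cell(u, v):
    # Look the pair up in either orientation; the diagonal and missing pairs are 0
    return str(myDict.get((u, v), myDict.get((v, u), 0)))

print('\t' + '\t'.join(keys))
for u in keys:
    print(u + '\t' + '\t'.join(cell(u, v) for v in keys))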
Got it working using the pandas package.
# Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # Initialize with zeros

# Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")
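A note of caution: on newer pandas, chained indexing like myDF[col][row] = ... can raise SettingWithCopyWarning or stop working under copy-on-write. A minimal sketch of the same fill using .loc:

# Row label first, then column label; the matrix is symmetric, so either
# orientation gives the same result here
for (u, v), w in myDict.items():
    myDF.loc[u, v] = w
    myDF.loc[v, u] = w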
Let's say my data frame contains these data:
>>> df = pd.DataFrame({'a': ['l1','l2','l1','l2','l1','l2'],
...                    'b': ['1','2','2','1','2','2']})
>>> df
a b
0 l1 1
1 l2 2
2 l1 2
3 l2 1
4 l1 2
5 l2 2
l1 should correspond to 1 whereas l2 should correspond to 2.
I'd like to create a new column 'c' such that, for each row, c = 1 if a = l1 and b = 1 (or a = l2 and b = 2). If a = l1 and b = 2 (or a = l2 and b = 1) then c = 0.
The resulting data frame should look like this:
a b c
0 l1 1 1
1 l2 2 1
2 l1 2 0
3 l2 1 0
4 l1 2 0
5 l2 2 1
My data frame is very large so I'm really looking for the most efficient way to do this using pandas.
df = pd.DataFrame({'a': numpy.random.choice(['l1', 'l2'], 1000000),
                   'b': numpy.random.choice(['1', '2'], 1000000)})
A fast solution assuming only two distinct values:
%timeit df['c'] = ((df.a == 'l1') == (df.b == '1')).astype(int)
10 loops, best of 3: 178 ms per loop
#Viktor Kerkes:
%timeit df['c'] = (df.a.str[-1] == df.b).astype(int)
1 loops, best of 3: 412 ms per loop
#user1470788:
%timeit df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
1 loops, best of 3: 363 ms per loop
#herrfz
%timeit df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
1 loops, best of 3: 387 ms per loop
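For context on the first timing: with only two levels per column, a == 'l1' and b == '1' are both True exactly for the pair (l1, 1) and both False exactly for (l2, 2), so comparing the two boolean Series for equality marks the matching rows. A minimal sketch of a more general variant, using a hypothetical mapping from labels in a to the matching value of b, in case there are more than two levels:

# 'mapping' is illustrative; extend it for additional label pairs
mapping = {'l1': '1', 'l2': '2'}
df['c'] = (df.a.map(mapping) == df.b).astype(int)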
You can also use the string methods.
df['c'] = (df.a.str[-1] == df.b).astype(int)
df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
You can just use logical operators. I'm not sure why you're using strings of 1 and 2 rather than ints, but here's a solution. The astype at the end converts it from boolean to 0's and 1's.
df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)