I'm trying to better understand pandas' group operations.
As an example, let's say I have a dataframe which has a list of sets played in tennis matches.
tennis_sets = pd.DataFrame({
    'date': ['27/05/13', '27/05/13', '28/05/13', '28/05/13',
             '28/05/13', '29/05/13', '29/05/13'],
    'player_A': [6, 6, 2, 6, 7, 6, 6],
    'player_B': [4, 3, 6, 7, 6, 1, 0],
})
Resulting in
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
I'd like to determine the overall score for each match played on a given day. This should look like
date player_A player_B
0 27/05/13 2 0
1 28/05/13 1 2
2 29/05/13 2 0
So, I could do this by creating a new numpy array and iterating as follows:
matches = tennis_sets.groupby('date')
scores = np.zeros((len(matches), 2))
for i, (_, match) in enumerate(matches):
    a, b = match.player_A, match.player_B
    scores[i] = np.c_[sum(a > b), sum(b > a)]
I could then reattach this new scores array to the dates. However, it seems unlikely that this is the preferred way of doing things.
To create a new dataframe with each date and match score as above, is there a better way I can achieve this using pandas' api?
Yes, there are ways to do this in pandas. There may be a more elegant solution, but here's a quick one that adds boolean win columns and then sums them per date with groupby:
In [13]: tennis_sets
Out[13]:
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
In [14]: tennis_sets["pA_wins"] = tennis_sets["player_A"] > tennis_sets["player_B"]
In [15]: tennis_sets["pB_wins"] = tennis_sets["player_B"] > tennis_sets["player_A"]
In [18]: tennis_sets
Out[18]:
date player_A player_B pA_wins pB_wins
0 27/05/13 6 4 True False
1 27/05/13 6 3 True False
2 28/05/13 2 6 False True
3 28/05/13 6 7 False True
4 28/05/13 7 6 True False
5 29/05/13 6 1 True False
6 29/05/13 6 0 True False
In [21]: matches = tennis_sets.groupby("date").sum()
In [22]: matches[["pA_wins", "pB_wins"]]
Out[22]:
pA_wins pB_wins
date
27/05/13 2 0
28/05/13 1 2
29/05/13 2 0
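The helper columns aren't strictly necessary. A sketch of the same idea using DataFrame.assign, which computes the win flags on a copy and leaves the original frame untouched (the reset_index call at the end is just to get date back as a regular column, matching the desired output):

```python
import pandas as pd

tennis_sets = pd.DataFrame({
    'date': ['27/05/13', '27/05/13', '28/05/13', '28/05/13',
             '28/05/13', '29/05/13', '29/05/13'],
    'player_A': [6, 6, 2, 6, 7, 6, 6],
    'player_B': [4, 3, 6, 7, 6, 1, 0],
})

# Replace the set scores with boolean win flags on a copy, then count
# set wins per date; summing booleans yields integer counts.
scores = (
    tennis_sets
    .assign(player_A=tennis_sets['player_A'] > tennis_sets['player_B'],
            player_B=tennis_sets['player_B'] > tennis_sets['player_A'])
    .groupby('date')[['player_A', 'player_B']]
    .sum()
    .reset_index()
)
print(scores)
```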
I'm getting KeyError: False when I run this line:
df['Eligible'] = df[('DeliveryOnTime' == "On-time") | ('DeliveryOnTime' == "Early")]
I've been trying to find a way to execute this condition using np.where and .loc as well, but neither works. I'm open to other ideas on how to fill the new column Eligible using data from DeliveryOnTime.
I've tried these:
np.where
df['Eligible'] = np.where((df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early"), 1, 1)
.loc()
df['Eligible'] = df.loc[(df['DeliveryOnTime'] == "On-time") & (df['DeliveryOnTime'] == "Early"), 'Total Orders'].sum()
Sample Data:
data = {'ID': [1, 1, 1, 2, 2, 3, 4, 5, 5],
'DeliveryOnTime': ["On-time", "Late", "Early", "On-time", "On-time", "Late", "Early", "Early", "On-time"],
}
df = pd.DataFrame(data)
#For the sake of example data, the count of `DeliveryOnTime` will be the total number of orders.
df['Total Orders'] = df['DeliveryOnTime'].count()
The right syntax is:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
# OR
df['Eligible'] = df['DeliveryOnTime'].isin(["On-time", "Early"])
Output:
>>> df
ID DeliveryOnTime Total Orders Eligible
0 1 On-time 9 True
1 1 Late 9 False
2 1 Early 9 True
3 2 On-time 9 True
4 2 On-time 9 True
5 3 Late 9 False
6 4 Early 9 True
7 5 Early 9 True
8 5 On-time 9 True
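Since the np.where attempt suggests a numeric 0/1 flag may have been the goal, the boolean mask can be cast directly. A small sketch on the sample data (rebuilt here without the Total Orders column for brevity):

```python
import pandas as pd

data = {'ID': [1, 1, 1, 2, 2, 3, 4, 5, 5],
        'DeliveryOnTime': ["On-time", "Late", "Early", "On-time", "On-time",
                           "Late", "Early", "Early", "On-time"]}
df = pd.DataFrame(data)

# isin builds the boolean mask; astype(int) turns True/False into 1/0.
df['Eligible'] = df['DeliveryOnTime'].isin(["On-time", "Early"]).astype(int)
print(df)
```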
The df references are misplaced.
Please try:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
Output:
>>> df
   ID DeliveryOnTime  Eligible
0   1        On-time      True
1   1           Late     False
2   1          Early      True
3   2        On-time      True
4   2        On-time      True
5   3           Late     False
6   4          Early      True
7   5          Early      True
8   5        On-time      True
You cannot compare bare strings like 'DeliveryOnTime' == "On-time"; the comparison has to reference the column through df. Here is a solution that flags all rows that are either 'On-time' or 'Early' and counts them per ID:
df["eligible"] = df.DeliveryOnTime.isin(['On-time', 'Early'])
df['TotalOrders'] = df['eligible'].groupby(df['ID']).transform('sum')
df
   ID DeliveryOnTime  eligible  TotalOrders
0   1        On-time      True            2
1   1           Late     False            2
2   1          Early      True            2
3   2        On-time      True            2
4   2        On-time      True            2
5   3           Late     False            0
6   4          Early      True            1
7   5          Early      True            2
8   5        On-time      True            2
I have a dataframe which is called "df". It looks like this:
a
0 2
1 3
2 0
3 5
4 1
5 3
6 1
7 2
8 2
9 1
I would like to produce a cumulative sum column which:
Sums the contents of column "a" cumulatively;
Until it gets a sum of "5";
Resets the cumulative total to 0 when it reaches a sum of "5", and continues with the summing process.
I would like the dataframe to look like this:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
In the dataframe, the column "a_cumm_sum" contains the results of the cumulative sum.
Does anyone know how I can achieve this? I have hunted through the forums and seen similar questions, for example this one, but they did not meet my exact requirements.
You can take the cumulative sum and floor-divide it by 5; multiplying that back by 5 gives the total already consumed by completed blocks. Subtract it from the next row's cumulative sum, then shift the result back into place:
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
df
Out[1]:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
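For reference, here is the same computation with the intermediates spelled out as comments (a runnable sketch on the sample data; c and g are the names used above):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]})

c = df['a'].cumsum()   # 2, 5, 5, 10, 11, 14, 15, 17, 19, 20
g = 5 * (c // 5)       # total already consumed by completed blocks of 5
# Next row's cumsum minus the consumed total, shifted back into place;
# the first row has nothing to subtract, so fill it with 'a' itself.
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
print(df)
```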
Solution #2 (more robust):
Per Trenton's comment, a good, diverse sample dataset goes a long way toward working out unbreakable logic for these types of problems; I probably would have come up with a better solution the first time around with one. Here is a solution that handles the sample dataset Trenton mentioned in the comments. As shown, there are more conditions to handle because of carry-over. On a large dataset this would still be much more performant than a for loop, but the logic is much harder to vectorize:
df = pd.DataFrame({'a': {0: 2, 1: 4, 2: 1, 3: 5, 4: 1, 5: 3, 6: 1, 7: 2, 8: 2, 9: 1}})
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
over = (df['a_cumm_sum'].shift(1) - 5)
df['a_cumm_sum'] = df['a_cumm_sum'] - np.where(over > 0, df['a_cumm_sum'] - over, 0).cumsum()
s = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum']*-1, 0).cumsum()
df['a_cumm_sum'] = np.where((df['a_cumm_sum'] > 0) & (s > 0), s + df['a_cumm_sum'],
df['a_cumm_sum'])
df['a_cumm_sum'] = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'].shift() + df['a'], df['a_cumm_sum'])
df
Out[2]:
a a_cumm_sum
0 2 2.0
1 4 6.0
2 1 1.0
3 5 6.0
4 1 1.0
5 3 4.0
6 1 5.0
7 2 2.0
8 2 4.0
9 1 5.0
The assignment can be combined with a condition. The code is as follows:
import numpy as np
import pandas as pd
a = [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]
df = pd.DataFrame(a, columns=["a"])
df["cumsum"] = df["a"].cumsum()
df["new"] = df["cumsum"] % 5
df.loc[(df["cumsum"] % 5 == 0) & (df["a"] != 0), "new"] = 5
df
The output is as follows:
a cumsum new
0 2 2 2
1 3 5 5
2 0 5 0
3 5 10 5
4 1 11 1
5 3 14 4
6 1 15 5
7 2 17 2
8 2 19 4
9 1 20 5
Working:
Basically, take the cumulative sum modulo 5. The remainder is also 0 wherever the running total lands exactly on a multiple of 5, but those rows should read 5, not 0; so for rows where the cumulative sum is divisible by 5 and "a" itself is non-zero, set the value to 5.
EDIT:
As Trenton McKinney pointed out in the comments, the OP likely wanted to reset to 0 whenever the cumsum exceeded 5. That makes the definition a recurrence, which is usually difficult to express with pandas/numpy (see David's solution). I'd recommend using numba to speed up the for loop in this case.
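A rough sketch of that numba approach (hedged: it assumes numba is installed, and falls back to a plain-Python decorator otherwise; capped_cumsum is a name chosen here for illustration):

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # optional JIT speed-up
except ImportError:
    def njit(f):
        return f  # no-op fallback: same logic, plain Python speed

@njit
def capped_cumsum(a, cap=5):
    # Running total that resets once it reaches the cap, implementing
    # the recurrence row by row.
    out = np.empty_like(a)
    total = 0
    for i in range(a.shape[0]):
        if total >= cap:
            total = a[i]      # start a new block with the current value
        else:
            total += a[i]
        out[i] = total
    return out

df = pd.DataFrame({'a': [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]})
df['a_cumm_sum'] = capped_cumsum(df['a'].to_numpy())
print(df)
```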
Another alternative: using groupby
In [78]: df.groupby((df['a'].cumsum() % 5 == 0).shift().fillna(False).cumsum()).cumsum()
Out[78]:
a
0 2
1 5
2 0
3 5
4 1
5 4
6 5
7 2
8 4
9 5
You could try using this for loop:
lastvalue = 0
newcum = []
for i in df['a']:
    if lastvalue >= 5:
        lastvalue = i
    else:
        lastvalue += i
    newcum.append(lastvalue)
df['a_cum_sum'] = newcum
print(df)
Output:
a a_cum_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
The above for loop iterates through the a column. When the running total has reached 5 or more, it resets to the current value i; otherwise it adds i to the running total.
I have a Pandas dataframe with a column full of values I want to replace with another, unconditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, unconditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows, and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) raises KeyError: True. I tried data.loc[1, 'ColumnName'] = new_value and it works, but it really looks like a hack.
If I knew the len() of data['ColumnName'] I could create an array of that size, filled with my new_value, and simply replace the column with that array. Ten lines of code to do something simpler than the conditional case, which takes one line: this is also not OK.
How can I tell Pandas in one line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
..and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
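If the original frame should stay untouched, DataFrame.assign returns a copy with the column overwritten; the scalar broadcasts to every row just as in the in-place version. A minimal sketch (the column values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'InvoiceNO': [1, 2, 3, 4, 5],
                   'Month': [1, 1, 2, 3, 7]})

# assign returns a new frame; the scalar 1234 is broadcast down the column.
df2 = df.assign(InvoiceNO=1234)
print(df2['InvoiceNO'].tolist())  # every row is now 1234
print(df['InvoiceNO'].tolist())   # original frame unchanged
```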
import pandas as pd
df = pd.DataFrame(
{'num1' : [3, 5, 9, 9, 14, 1],
'num2' : [3, 5, 9, 9, 14, 1]},
index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1
This question already has answers here:
How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
(3 answers)
Closed 4 years ago.
I've consistently run into this issue of having to assign a unique ID to each group in a data set. I've used this when zero padding for RNN's, generating graphs, and many other occasions.
This can usually be done by concatenating the values in each pd.groupby column. However, it is often the case the number of columns that define a group, their dtype, or the value sizes make concatenation an impractical solution that needlessly uses up memory.
I was wondering if there was an easy way to assign a unique numeric ID to groups in pandas.
You just need ngroup on the data from seeiespi (or pd.factorize):
df.groupby('C').ngroup()
Out[322]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
More options:
pd.factorize(df.C)[0]
Out[323]: array([0, 0, 1, 2, 2, 2, 2, 1, 1], dtype=int64)
df.C.astype('category').cat.codes
Out[324]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int8
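Note that ngroup numbers groups in sorted key order by default, which is why its output above differs from factorize's first-appearance order. Passing sort=False makes the two agree; a small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'C': ['a', 'a', 'c', 'b', 'b', 'b', 'b', 'c', 'c']})

# sort=False numbers groups in order of first appearance, like factorize.
ids = df.groupby('C', sort=False).ngroup()
codes = pd.factorize(df['C'])[0]
print(ids.tolist())  # [0, 0, 1, 2, 2, 2, 2, 1, 1]
```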
I came up with a simple solution that I constantly reference and wanted to share:
df = pd.DataFrame({'A':[1,2,3,4,6,3,7,3,2],'B':[4,3,8,2,6,3,9,1,0], 'C':['a','a','c','b','b','b','b','c','c']})
df = df.sort_values('C')
df['gid'] = (df.groupby(['C']).cumcount()==0).astype(int)
df['gid'] = df['gid'].cumsum()
In [17]: df
Out[17]:
A B C gid
0 1 4 a 1
1 2 3 a 1
2 3 8 b 2
3 4 2 b 2
4 6 6 b 2
5 3 3 b 2
6 7 9 c 3
7 3 1 c 3
8 2 0 c 3
I am trying to clip outliers in the DataFrame based on quantiles for each column. Let's say
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
0 1
0 0.734355 0.594992
1 -0.745949 0.597601
2 0.295606 0.972196
3 0.474539 1.462364
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -1.286660
7 0.713914 0.135179
8 -0.435309 -0.344975
9 1.200617 -0.392945
I currently use
df_clipped = df.apply(lambda col: col.clip(*col.quantile([0.05,0.95]).values))
0 1
0 0.734355 0.594992
1 -0.706865 0.597601
2 0.295606 0.972196
3 0.474539 1.241788
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -0.884488
7 0.713914 0.135179
8 -0.435309 -0.344975
9 0.990799 -0.392945
This works but I am wondering if there is a more elegant pandas/numpy based approach.
You can use clip and align the quantile Series along the columns:
df.clip(df.quantile(0.05), df.quantile(0.95), axis=1)
Out:
0 1
0 0.734355 0.594992
1 -0.706864 0.597601
2 0.295606 0.972196
3 0.474539 1.241788
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -0.884488
7 0.713914 0.135179
8 -0.435309 -0.344975
9 0.990799 -0.392945
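A quick sketch confirming the two approaches agree (using a seeded generator rather than the unseeded randn above, so the run is reproducible; the exact numbers are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)))

# Per-column clip via apply, as in the question...
clipped_apply = df.apply(lambda col: col.clip(*col.quantile([0.05, 0.95])))
# ...and the vectorized form: the quantile Series align with the columns.
clipped_clip = df.clip(df.quantile(0.05), df.quantile(0.95), axis=1)

print(clipped_apply.equals(clipped_clip))
```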