I am trying to summarize values across each group where the types match and apply that to the row where store=1.
The example below for Group A contains one store=1 and three store=2.
I would like to roll up all type 3 values in group A to the store=1 row.
Sample data:
data = {'group':['A','A','A','A','B','B','B','B'],'store':['1','2','2','2','1','2','2','2'],'type':['3','3','1','1','5','0','5','5'],'num':['10','20','30','40','50','60','70','80']}
t1=pd.DataFrame(data)
group store type num
A 1 3 10
A 2 3 20
A 2 1 30
A 2 1 40
B 1 5 50
B 2 0 60
B 2 5 70
B 2 5 80
and the correct output should be a new column ('new_num') containing a list at the store=1 row for each group where the types match.
group store type num new_num
A 1 3 10 ['10','20']
A 2 3 20 []
A 2 1 30 []
A 2 1 40 []
B 1 5 50 ['50','70','80']
B 2 0 60 []
B 2 5 70 []
B 2 5 80 []
IIUC
t1['new_num']=[[] for x in range(len(t1))]
t1.loc[t1.store=='1','new_num']=[y.loc[y.type.isin(y.loc[y.store=='1','type']),'num'].tolist() for x , y in t1.groupby('group',sort=False)]
t1
Out[369]:
group store type num new_num
0 A 1 3 10 [10, 20]
1 A 2 3 20 []
2 A 2 1 30 []
3 A 2 1 40 []
4 B 1 5 50 [50, 70, 80]
5 B 2 0 60 []
6 B 2 5 70 []
7 B 2 5 80 []
Setup
ncol = [[] for _ in range(t1.shape[0])]
res = t1.set_index('group').assign(new_num=ncol)
1) Using some wonky string concats and groupby's
u = t1.group + t1.type
check = u[t1.store.eq('1')]
m = t1.loc[u.isin(check)].groupby('group')['num'].agg(list)
res.loc[res.store.eq('1'), 'new_num'] = m
2) If you'd like to stray even further from the light, use an abomination of a pivot
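f = t1.pivot_table(
    index=['group', 'type'],
    columns='store',
    values='num',
    aggfunc=list
).reset_index()
# keyword arguments: positional axis arguments to drop() are not accepted by recent pandas
m = f[f['1'].notnull()].set_index('group').drop(columns='type').sum(axis=1)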
res.loc[res.store.eq('1'), 'new_num'] = m
Both somehow manage to produce:
store type num new_num
group
A 1 3 10 [10, 20]
A 2 3 20 []
A 2 1 30 []
A 2 1 40 []
B 1 5 50 [50, 70, 80]
B 2 0 60 []
B 2 5 70 []
B 2 5 80 []
While a terrible use of pivot, I actually think that solution is pretty neat:
store group type 1 2
0 A 1 NaN [30, 40]
1 A 3 [10] [20]
2 B 0 NaN [60]
3 B 5 [50] [70, 80]
The pivot produces the aggregation above. The rows with a non-null value in the '1' column are exactly the matching group-type combinations you are after, and summing across those rows concatenates the lists into the aggregated list you need.
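For comparison, the same matching can be sketched with an explicit merge on the (group, type) pairs rather than string concatenation or a pivot. This is only a sketch under the sample data above; the names keys, matched, lists and is_anchor are illustrative and do not come from the answers.

```python
import pandas as pd

data = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'store': ['1', '2', '2', '2', '1', '2', '2', '2'],
        'type':  ['3', '3', '1', '1', '5', '0', '5', '5'],
        'num':   ['10', '20', '30', '40', '50', '60', '70', '80']}
t1 = pd.DataFrame(data)

# (group, type) pairs that occur on a store == '1' row
keys = t1.loc[t1['store'].eq('1'), ['group', 'type']].drop_duplicates()

# keep only the rows whose (group, type) pair matches one of those keys
matched = t1.merge(keys, on=['group', 'type'])

# one list of 'num' values per group
lists = matched.groupby('group')['num'].agg(list)

# the list goes on the store == '1' row, an empty list everywhere else
is_anchor = t1['store'].eq('1')
t1['new_num'] = [lists[g] if a else [] for g, a in zip(t1['group'], is_anchor)]
print(t1)
```

Because the sample columns hold strings, the lists contain '10', '20' and so on rather than integers.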
I have a DataFrame which has a column containing these values with % occurrence
I want to convert the value with the highest occurrence to 1 and the rest to 0.
How can I do the same using Pandas?
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'availability': np.random.randint(0, 100, 10), 'some_col': np.random.randn(10)})
print(df)
"""
availability some_col
0 9 -0.332662
1 35 0.193257
2 1 2.042402
3 50 -0.298372
4 52 -0.669655
5 3 -1.031884
6 44 -0.763867
7 28 1.093086
8 67 0.723319
9 87 -1.439568
"""
df['availability'] = np.where(df['availability'] == df['availability'].max(), 1, 0)
print(df)
"""
availability some_col
0 0 -0.332662
1 0 0.193257
2 0 2.042402
3 0 -0.298372
4 0 -0.669655
5 0 -1.031884
6 0 -0.763867
7 0 1.093086
8 0 0.723319
9 1 -1.439568
"""
Edit
If you are trying to mask the rows with the values that occur most often instead, try this:
df = pd.DataFrame(
    {
        'availability': [10, 10, 20, 30, 40, 40, 50, 50, 50, 50],
        'some_col': np.random.randn(10)
    }
)
print(df)
"""
availability some_col
0 10 0.954199
1 10 0.779256
2 20 -0.438860
3 30 -2.547989
4 40 0.587108
5 40 0.398858
6 50 0.776177 # <--- Most Frequent is 50
7 50 -0.391724 # <--- Most Frequent is 50
8 50 -0.886805 # <--- Most Frequent is 50
9 50 1.989000 # <--- Most Frequent is 50
"""
df['availability'] = np.where(df['availability'].isin(df['availability'].mode()), 1, 0)
print(df)
"""
availability some_col
0 0 0.954199
1 0 0.779256
2 0 -0.438860
3 0 -2.547989
4 0 0.587108
5 0 0.398858
6 1 0.776177
7 1 -0.391724
8 1 -0.886805
9 1 1.989000
"""
Try:
df.availability.apply(lambda x: 1 if x == df.availability.value_counts().idxmax() else 0)
You can use Series.mode() to get the most frequent value(s) and isin to check whether each value in the column is among them:
df['col'] = df['availability'].isin(df['availability'].mode()).astype(int)
You can compare to the mode with isin, then convert the boolean to integer (True -> 1, False -> 0):
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)
Example (here, 2 and 4 are tied as the most frequent values), shown as a new column "col2" for clarity:
col col2
0 0 0
1 2 1
2 2 1
3 2 1
4 4 1
5 4 1
6 4 1
7 1 0
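The answers above handle ties differently, which the following sketch tries to make visible on the same tied example. The variable names are illustrative; which of the tied values idxmax picks is not asserted here.

```python
import pandas as pd

s = pd.Series([0, 2, 2, 2, 4, 4, 4, 1], name="col")

# mode() returns every tied value, so both 2 and 4 are flagged
flag_all_ties = s.isin(s.mode()).astype(int)

# value_counts().idxmax() returns a single label, so only one tied value is flagged
top = s.value_counts().idxmax()
flag_single = s.eq(top).astype(int)

print(flag_all_ties.tolist())  # [0, 1, 1, 1, 1, 1, 1, 0]
print(flag_single.sum())       # 3
```

If every tied value should become 1, the mode/isin form is the one to use; the idxmax form keeps just one of them.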
I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meets several criteria.
Using the following example and code from stackoverflow 25119524
import pandas as pd
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count +=1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable
cowmast['xmast'] = cowmast['Cow'].apply(rolling_count) #new column in dataframe
cowmast
The output is xmast (number of times mastitis) for each cow
Cow Lact DIM xmast
0 1 1 45 1
1 1 2 25 2
2 1 2 28 3
3 2 2 70 1
4 2 2 95 2
5 2 2 98 3
6 2 2 120 4
7 2 3 80 5
What I would like to do is restart the count for each cow (Cow) and lactation (Lact), and only increment the count when the number of days (DIM) between rows is more than 7.
To incorporate more than one condition, so that the count resets for each cow's lactation (Lact), I used the following code.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
# pass the output column as a plain string, since it names a single column
count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], 'Lxmast')
That produces the following output
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.
The output I am looking for is called "Adjusted" in the table below.
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
In the example above, for cow 1 lact 2 the count is not incremented when the DIM goes from 25 to 28, as the difference between the two events is less than 7 days. The same holds for cow 2 lact 2 when it goes from 95 to 98. For the larger increments, 70 to 95 and 98 to 120, the count is increased.
Thank you for your help
John
Actually, the code that sets up xmast and Lxmast can be simplified considerably by using the highest-voted solution in the referenced question.
Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
(df['Lact'] != df['Lact'].shift()).cumsum()])
.cumcount()+1
)
Data Input
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, let's continue with the last part of your requirement, quoted below:
What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.
We can do it as follows.
To make the code more readable, let's define two grouping sequences for the code we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the code that sets up Lxmast in a more readable form:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, let's turn to the main task. Say we create another new column, Adjusted, for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().abs().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM, compute each row's difference from the previous row with .diff(), take the absolute value with .abs(), and check whether it is > 7 with .gt(7); that is the code fragment ['DIM'].diff().abs().gt(7). We then group by the same keys again with .groupby([m_Cow, m_Lact]), since this third condition applies within the grouping of the first two. Finally, .cumsum() on the third condition increments the count only when that condition is true.
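The steps just described can be sketched with each intermediate result named, which makes it easier to inspect where the count changes. This is a self-contained sketch on the sample data; gap and new_event are illustrative names, and the explicit astype(int) is an added precaution rather than part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'Cow':  ["1", "1", "1", "2", "2", "2", "2", "2"],
    'Lact': [1, 2, 2, 2, 2, 2, 2, 3],
    'DIM':  [45, 25, 28, 70, 95, 98, 120, 80],
})

# run identifiers: a new value starts whenever Cow (or Lact) changes
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()

# days since the previous row within the same cow and lactation (NaN on the first row)
gap = df.groupby([m_Cow, m_Lact])['DIM'].diff().abs()

# a row counts as a new event only when the gap exceeds 7 days
new_event = gap.gt(7).astype(int)

# running number of new events per cow/lactation, starting at 1
df['Adjusted'] = new_event.groupby([m_Cow, m_Lact]).cumsum() + 1
print(df)
```

A gap of exactly 7 days does not increment the count here, because gt(7) is a strict comparison; ge(7) would be the alternative if that case should count.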
Just in case you want to increment the count only when DIM increases by > 7 (e.g. 70 to 78) and not when it decreases by > 7 (e.g. 78 to 70), you can remove the .abs() part of the code above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Edit (Possible simplification depending on your data sequence)
As your sample data already have the main grouping keys Cow and Lact in sorted order, there is an opportunity to simplify the code further.
Different from the sample data from the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the B in the last row is separated from the other B's, and the count is required to reset to 1 rather than continue from the previous B's count of 2 (to become 3). Hence, the grouping needs to compare each row with the previous row to get the correct groups. Otherwise, when we use .groupby() and all the B values are grouped together during processing, the count may not be reset to 1 for the last entry.
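The difference between the two groupings can be seen in a small sketch on the B, B, A, A, A, B column from the referenced question. The names run_id, consecutive and plain are illustrative.

```python
import pandas as pd

s = pd.Series(list('BBAAAB'), name='col')

# run identifier: increases whenever the value differs from the previous row
run_id = (s != s.shift()).cumsum()

# counting within runs resets at the last, separated B
consecutive = s.groupby(run_id).cumcount() + 1

# counting within plain values lumps all B's together, so the last B continues to 3
plain = s.groupby(s).cumcount() + 1

print(consecutive.tolist())  # [1, 2, 1, 2, 3, 1]
print(plain.tolist())        # [1, 2, 1, 2, 3, 3]
```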
If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted by instruction such as:
df = df.sort_values(['Cow', 'Lact'])
then we can simplify the code as follows
(when data already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
['DIM'].diff().abs().gt(7)
.groupby([df['Cow'], df['Lact']])
.cumsum()+1
)
This gives the same values in the three columns xmast, Lxmast and Adjusted.
I'm trying to make a column in a dataframe depicting a group or bin that observation belongs to. The idea is to sort the dataframe according to some column, then develop another column denoting which bin that observation belongs to. If I want deciles, then I should be able to tell a function I want 10 equal (or close to equal) groups.
I tried the pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, etc. Take the following for example:
import numpy as np
import pandas as pd
x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))
df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)
This gives a df of 14 observations. How could I get groups of 5 equal bins?
The desired result would be the following:
x y group
0 1 0.926273 1
1 2 0.678101 1
2 3 0.636875 1
3 4 0.802590 2
4 5 0.494553 2
5 6 0.874876 2
6 7 0.607902 3
7 8 0.028737 3
8 5 0.493545 3
9 45 0.498140 4
10 64545 0.938377 4
11 65 0.613015 4
12 6456 0.288266 5
13 564 0.917817 5
Group by N rows, and find ngroup
df['group']=df.groupby(np.arange(len(df.index))//3).ngroup()+1
x y group
0 1 0.548801 1
1 2 0.096620 1
2 3 0.713771 1
3 4 0.922987 2
4 5 0.283689 2
5 6 0.807755 2
6 7 0.592864 3
7 8 0.670315 3
8 5 0.034549 3
9 45 0.355274 4
10 64545 0.239373 4
11 65 0.156208 4
12 6456 0.419990 5
13 564 0.248278 5
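The // 3 above hard-codes the number of rows per bin. One possible way to derive it from the requested number of bins is sketched below; rows_per_bin is an illustrative name, and rounding up with math.ceil is an assumption about how uneven splits should be handled.

```python
import math

import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x})

bins = 5
# 14 rows over 5 bins rounds up to 3 rows per bin
rows_per_bin = math.ceil(len(df) / bins)

# integer-divide the row position by the bin size to get a 1-based label
df['group'] = np.arange(len(df)) // rows_per_bin + 1
print(df)
```

Rounding up can yield fewer groups than requested for some sizes (for example 6 rows and 4 bins gives only 3 groups), so this only suits cases where that is acceptable.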
Another option is to generate the list of group labels from near_split:
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)
bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)
Output:
x y group
0 1 0.313614 1
1 2 0.765079 1
2 3 0.153851 1
3 4 0.792098 2
4 5 0.123700 2
5 6 0.239107 2
6 7 0.133665 3
7 8 0.979318 3
8 5 0.781948 3
9 45 0.264344 4
10 64545 0.495561 4
11 65 0.504734 4
12 6456 0.766627 5
13 564 0.428423 5
You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():
bins = 5
splits = np.array_split(df, bins)
for i in range(len(splits)):
    splits[i]['group'] = i + 1
df = pd.concat(splits)
Or as a one-liner using assign():
df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])
x y group
0 1 0.145781 1
1 2 0.262097 1
2 3 0.114799 1
3 4 0.275054 2
4 5 0.841606 2
5 6 0.187210 2
6 7 0.582487 3
7 8 0.019881 3
8 5 0.847115 3
9 45 0.755606 4
10 64545 0.196705 4
11 65 0.688639 4
12 6456 0.275884 5
13 564 0.579946 5
Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:
bins = 5
l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1
df['group'] = group
Result:
x y group
0 1 0.205168 1
1 2 0.105466 1
2 3 0.545794 1
3 4 0.639346 2
4 5 0.758056 2
5 6 0.982090 2
6 7 0.942849 3
7 8 0.284520 3
8 5 0.491151 3
9 45 0.731265 4
10 64545 0.072668 4
11 65 0.601416 4
12 6456 0.239454 5
13 564 0.345006 5
This seems to follow the splitting logic of np.array_split (i.e. try to evenly split the bins, but add onto earlier bins if that isn't possible).
While the code is less concise, it doesn't use any loops, so it theoretically should be faster with larger amounts of data.
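As a quick check of the claim about the splitting logic, the bin sizes from the repeat-based approach can be compared with the chunk sizes np.array_split produces for the same length. This is a sketch for the 14-row, 5-bin case only; it splits a plain index array rather than the DataFrame.

```python
import numpy as np

n, bins = 14, 5

# base size for every bin, then one extra row for the first (n % bins) bins
sizes = np.full(bins, n // bins)
sizes[: n % bins] += 1

# 1-based group label repeated once per row in each bin
group = np.repeat(np.arange(1, bins + 1), sizes)

# chunk sizes that np.array_split produces for the same length
split_sizes = [len(part) for part in np.array_split(np.arange(n), bins)]

print(sizes.tolist())  # [3, 3, 3, 3, 2]
print(split_sizes)     # [3, 3, 3, 3, 2]
```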
Just because I was curious, going to leave this perfplot testing here...
import numpy as np
import pandas as pd
import perfplot
def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)
    return df
def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1
    return group
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)
def array_split(df, bins=5):
    splits = np.array_split(df, bins)
    for i in range(len(splits)):
        splits[i]['group'] = i + 1
    return pd.concat(splits)
perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index))//3).ngroup()+1,
        lambda df: array_split(df)
    ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)
Under the if-then section of the pandas documentation cookbook, we can assign values in one column, based on a condition being met for a separate column using loc[].
df = pd.DataFrame({'AAA' : [4,5,6,7],
'BBB' : [10,20,30,40],
'CCC' : [100,50,-30,-50]})
# AAA BBB CCC
# 0 4 10 100
# 1 5 20 50
# 2 6 30 -30
# 3 7 40 -50
df.loc[df.AAA >= 5,'BBB'] = -1
# AAA BBB CCC
# 0 4 10 100
# 1 5 -1 50
# 2 6 -1 -30
# 3 7 -1 -50
But what if I want to write a condition that involves the previous or subsequent row using .loc[]? For example, say I want to assign df.BBB=5 wherever the difference between the df.CCC of the current row and the df.CCC of the next row is greater than or equal to 50. Then I would like to create a condition that gives me the following data frame:
# AAA BBB CCC
# 0 4 5 100 <-| 100 - 50 = 50, assign df.BBB = 5
# 1 5 5 50 <-| 50 -(-30)= 80, assign df.BBB = 5
# 2 6 -1 -30 <-| -30 -(-50)= 20, don't assign df.BBB = 5
# 3 7 -1 -50 <-| (-50) -0 =-50, don't assign df.BBB = 5
How can I get this result?
Edit
The answer I'm hoping to find is something like
mask = df['CCC'].current - df['CCC'].next >= 50
df.loc[mask, 'BBB'] = 5
because I'm interested in the general problem of how to access values above or below the current row being considered in a dataframe (not necessarily in solving this one toy example).
diff() will work on the example I first described, but what of other cases, say, where we want to compare two elements instead of subtracting them?
What if I take the previous data frame and I want to find all rows where the current entry in df.BBB matches the next one, and then assign df.CCC based on those comparisons?
if df.BBB.current == df.BBB.next:
    df.CCC = 1
# AAA BBB CCC
# 0 4 5 1 <-| 5 == 5, assign df.CCC = 1
# 1 5 5 50 <-| 5 != -1, do nothing
# 2 6 -1 1 <-| -1 == -1, assign df.CCC = 1
# 3 7 -1 -50 <-| -1 != 0, do nothing
Is there a way to do this with pandas using .loc[]?
Given
>>> df
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
you can compute a boolean mask first via
>>> mask = df['CCC'].diff(-1) >= 50
>>> mask
0 True
1 True
2 False
3 False
Name: CCC, dtype: bool
and then issue
>>> df.loc[mask, 'BBB'] = 5
>>>
>>> df
AAA BBB CCC
0 4 5 100
1 5 5 50
2 6 30 -30
3 7 40 -50
More generally, you can compute a shift
>>> df['CCC_next'] = df['CCC'].shift(-1) # or df['CCC'].shift(-1).fillna(0)
>>> df
AAA BBB CCC CCC_next
0 4 5 100 50.0
1 5 5 50 -30.0
2 6 30 -30 -50.0
3 7 40 -50 NaN
... and then do whatever you want, such as:
>>> df['CCC'].sub(df['CCC_next'], fill_value=0)
0 50.0
1 80.0
2 20.0
3 -50.0
dtype: float64
>>> mask = df['CCC'].sub(df['CCC_next'], fill_value=0) >= 50
>>> mask
0 True
1 True
2 False
3 False
dtype: bool
although for the specific problem in your question the diff approach is sufficient.
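The same shift idea also covers the comparison case from the question's edit, where a row is flagged when its BBB equals the BBB of the next row. Below is a minimal sketch, assuming the BBB values from the asker's second table (5, 5, -1, -1).

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [5, 5, -1, -1],
                   'CCC': [100, 50, -30, -50]})

# shift(-1) lines each row up with the value from the next row;
# the last row has no successor, so it compares against NaN and yields False
mask = df['BBB'].eq(df['BBB'].shift(-1))

df.loc[mask, 'CCC'] = 1
print(df)
```

Using shift(1) instead would compare each row with the previous one, and any elementwise comparison or arithmetic can replace eq.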
You can use the enumerate function to access a row and its position at the same time, and so obtain the previous and next rows from the position of the current row. An example script is below for reference:
import pandas as pd
df = pd.DataFrame({'AAA' : [4,5,6,7],
'BBB' : [10,20,30,40],
'CCC' : [100,50,-30,-50]}, index=['a','b','c','d'])
print('row_pre','row_pre_AAA','row','row_AA','row_next','row_next_AA')
for irow, row in enumerate(df.index):
    if irow==0:
        row_next = df.index[irow+1]
        print('row_pre', "df.loc[row_pre,'AAA']", row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    elif irow>0 and irow<df.index.size-1:
        row_pre = df.index[irow-1]
        row_next = df.index[irow+1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    else:
        row_pre = df.index[irow-1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], 'row_next', "df.loc[row_next,'AAA']")
Output as below:
row_pre row_pre_AAA row row_AA row_next row_next_AA
row_pre df.loc[row_pre,'AAA'] a 4 b 5
a 4 b 5 c 6
b 5 c 6 d 7
c 6 d 7 row_next df.loc[row_next,'AAA']
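For larger frames, the same previous/current/next view can be sketched without an explicit loop by shifting the column. The frame neighbors and its column names are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]}, index=['a', 'b', 'c', 'd'])

# shift(1) brings the previous row's value down, shift(-1) brings the next row's value up
neighbors = pd.DataFrame({
    'prev_AAA': df['AAA'].shift(1),
    'AAA': df['AAA'],
    'next_AAA': df['AAA'].shift(-1),
})
print(neighbors)
```

The first row has no predecessor and the last row has no successor, so those cells hold NaN instead of the placeholder strings printed by the loop.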
I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top level index.
For example Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of column Time "starting" at row b so that I could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df.groupby(level=0)['Time'].shift()
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
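Since the product only needs Time minus the previous Time within each top-level group, a groupby diff is one possible shortcut. The sketch below rebuilds the sample frame from the question; step is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

# Time minus the previous Time within each top-level index group (NaN on each first row)
step = df.groupby(level=0)['Time'].diff()

# the first row of each group has no predecessor, so its product is set to 0
df['Product'] = (step * df['Value']).fillna(0).astype(int)
print(df)
```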
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time':[1,2,5,1,2,10],'Value':[1,5,7,5,9,11]},
index = [['a','a','a','b','b','b'],[1,2,3,1,2,3]])
def product(x):
    x['Product'] = (x['Time']-x.shift()['Time'])*x['Value']
    return x
# group_keys=False keeps the original two-level index on recent pandas versions
df = df.groupby(level=0, group_keys=False).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)