Why is this groupby transform not working?

For a dummy dataset, in which each id corresponds to one match:
df2 = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                   data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
                         [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                         [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
                         [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']])
If I want to keep only the records where both rows of the same id have a duration greater than 120 and a score greater than 1500, this works fine:
cond = df2['duration'].gt(120) & df2['score'].gt(1500)
out = df2[cond.groupby(df2['id']).transform('all')]
and returns the 2 instances of the same id. However, if I want to keep only the pairs of ids where the user is 'abc', it does not work. I have tried:
out = df2[(df2['user'].eq('abc')).groupby(df2['id']).transform('all')]
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('all')]
and they both return empty DataFrames. How can I solve this? The outcome should be any match that user 'abc' played in.

From the comments, you want 'any', not 'all': transform('all') keeps a group only if every row in it matches, and since each id always includes a second, different user, the condition is never true for the whole group.
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('any')]
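With the dummy data above, this keeps every match that user 'abc' played in (ids 1 and 2):
   id  score  duration user
0   1    800        60  abc
1   1    900        60  zxc
2   2    800       250  abc
3   2   5000       250  bvc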

Related

Identify index of all elements in a list comparing with another list

For instance, I have a list A:
A = [100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800]
And I have list B:
B = [100, 200, 200, 500, 600, 200, 500]
I need to identify, for each element of B, its index in A, matching repeated values to successive occurrences.
I have tried:
list_index = [A.index(i) for i in B]
It returns:
[0, 1, 1, 5, 6, 1, 5]
But what I need is:
[0, 1, 3, 5, 6, 9, 10]
How can I solve it?
You can iterate through the enumeration of A to keep track of the indices and yield the values where they match:
A = [100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800]
B = [100, 200, 200, 500, 600, 200, 500]

def get_indices(A, B):
    a_it = enumerate(A)
    for n in B:
        for i, an in a_it:
            if n == an:
                yield i
                break

list(get_indices(A, B))
# [0, 1, 3, 5, 6, 9, 10]
This avoids scanning A from the beginning with index() for every element of B; the shared iterator over A is consumed only once.
You can try something like this. Walk over both lists and append the index when the elements are equal:
A = [100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800]
B = [100, 200, 200, 500, 600, 200, 500]

i, j = 0, 0
list_index = []
while j < len(B):
    if B[j] == A[i]:
        list_index.append(i)
        j += 1
    i += 1
print(list_index)
Output:
[0, 1, 3, 5, 6, 9, 10]
You can create a list called indices and put the index of B's first element into it. Then, for each remaining item of B, take the slice of A starting just after the last index in indices, find the item's index within that slice, add it to that last index + 1, and append the result to indices.
indices = [A.index(B[0])]
for v in B[1:]:
    indices.append(A[indices[-1] + 1:].index(v) + indices[-1] + 1)
# indices: [0, 1, 3, 5, 6, 9, 10]
Here's what I would use:
A = [100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800]
B = [100, 200, 200, 500, 600, 200, 500]

list_index = []
removedElements = 0
for i in B:
    indexInA = A.index(i)
    A.pop(indexInA)
    list_index.append(indexInA + removedElements)
    removedElements += 1
print(list_index)
import numpy as np

A = np.array([100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800])
B = [100, 200, 200, 500, 600, 200, 500]

idx = np.arange(len(A))
# map each distinct value of B to all of its positions in A
indices = {i: idx[A == i].tolist() for i in set(B)}
# consume the earliest remaining position for each element of B
[indices[i].pop(0) for i in B]
# [0, 1, 3, 5, 6, 9, 10]
I loop through B and set each checked index to None in A, so repeated values find their next occurrence. Note that this alters A.
A = [100, 200, 300, 200, 400, 500, 600, 400, 700, 200, 500, 800]
B = [100, 200, 200, 500, 600, 200, 500]

res = []
for i in B:
    idx = A.index(i)
    res.append(idx)
    A[idx] = None
print(res)
Output:
[0, 1, 3, 5, 6, 9, 10]

groupby column if value is less than some value

I have a dataframe like
df = pd.DataFrame({'time': [1, 5, 100, 250, 253, 260, 700], 'qty': [3, 6, 2, 5, 64, 2, 5]})
df['time_delta'] = df.time.diff()
and I would like to group by time_delta such that all rows where the time_delta is less than 10 are grouped together, the time_delta column is dropped, and qty is summed.
The expected result is
pd.DataFrame({'time': [1, 100, 250, 700], 'qty': [9, 2, 71, 5]})
Basically, I am hoping there is something like a df.groupby(time_delta_func(10)).agg({'time': 'min', 'qty': 'sum'}) function. I read up on pd.Grouper, but its time-based grouping seems strict and interval-based.
You can do it with gt (greater than) and cumsum to create a new group each time the time_delta is greater than 10:
res = (
    df.groupby(df['time_delta'].gt(10).cumsum(), as_index=False)
      .agg({'time': 'first', 'qty': 'sum'})
)
print(res)
   time  qty
0     1    9
1   100    2
2   250   71
3   700    5
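For reference, the intermediate group key looks like this (the NaN from the first diff() is not greater than 10, so the first two rows share group 0):
df['time_delta'].gt(10).cumsum()
# 0    0
# 1    0
# 2    1
# 3    2
# 4    2
# 5    2
# 6    3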

Python Pandas exchange column value

While using a pandas.DataFrame, I want to swap whole column values. I find that DF.loc[wInd, 'Column'] and DF.loc[:, 'Column'] behave differently: the first exchanges the values, but the second leaves me with the same values in both columns. Why are they different? Thank you.
wInd = LineCircPart_df.index
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[wInd, f'Start{cWd}'], LineCircPart_df.loc[wInd, f'End{cWd}']) = (
        LineCircPart_df.loc[wInd, f'End{cWd}'], LineCircPart_df.loc[wInd, f'Start{cWd}'])
and I need to add .copy() to the values being assigned when I use [:], like:
wInd = LineCircPart_df.index
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[:, f'Start{cWd}'], LineCircPart_df.loc[:, f'End{cWd}']) = (
        LineCircPart_df.loc[:, f'End{cWd}'].copy(), LineCircPart_df.loc[:, f'Start{cWd}'].copy())
Any suggestions?
Example updated as follows:
LineCircPart_df = pd.DataFrame({'StartX': [3000, 4000, 5000], 'StartY': [30, 40, 50],
                                'StartAngle': [3, 4, 5], 'EndX': [6000, 7000, 8000],
                                'EndY': [60, 70, 80], 'EndAngle': [6, 7, 8]})
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[:, f'Start{cWd}'], LineCircPart_df.loc[:, f'End{cWd}']) = (
        LineCircPart_df.loc[:, f'End{cWd}'], LineCircPart_df.loc[:, f'Start{cWd}'])
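As to why: my assumption (not from an answer here) is that on pre-copy-on-write pandas, .loc with a [:] slice can return views of the underlying columns, while .loc with an array of labels like wInd returns copies, so the slice version overwrites one side of the swap before the second assignment reads it. A minimal sketch of that suspected failure mode:
import pandas as pd

df = pd.DataFrame({'Start': [1, 2], 'End': [3, 4]})
# tuple assignment evaluates the whole right-hand side first
start, end = df.loc[:, 'Start'], df.loc[:, 'End']
df.loc[:, 'Start'] = end   # Start now holds the old End values
df.loc[:, 'End'] = start   # if `start` is a view, it already reflects that overwrite
print(df)  # on pre-copy-on-write pandas both columns can end up identical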

Rolling sum for a window of 2 days

I am trying to compute, in Python, a rolling 2-day sum of the Amount column based on Trans_Date, grouped by ID, for the table below.
ID  Trans_Date  Trans_Time  Amount
1   03/23/2019  06:51:03     100
1   03/24/2019  12:32:48     600
1   03/24/2019  14:15:35     250
1   06/05/2019  16:18:21      75
2   02/01/2019  18:02:52     200
2   02/02/2019  10:03:02     150
2   02/03/2019  23:47:51     800
3   01/18/2019  11:12:58    1000
3   01/23/2019  22:12:41      15
Ultimately, I am trying to achieve the result below:
ID  Trans_Date  Trans_Time  Amount  2d_Running_Total
1   03/23/2019  06:51:03     100    100
1   03/24/2019  12:32:48     600    700
1   03/24/2019  14:15:35     250    950
1   06/05/2019  16:18:21      75     75
2   02/01/2019  18:02:52     200    200
2   02/02/2019  10:03:02     150    350
2   02/03/2019  23:47:51     800    950
3   01/18/2019  11:12:58    1000   1000
3   01/23/2019  22:12:41      15     15
This link came very close to solving it, but for ids that have multiple transactions on the same day, it gives the same value for every row of that day:
https://python-forum.io/Thread-Rolling-sum-for-a-window-of-2-days-Pandas
This should do it:
import pandas as pd

# create dummy data
df = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount'],
    data=[
        [1, '03/23/2019', '06:51:03', 100],
        [1, '03/24/2019', '12:32:48', 600],
        [1, '03/24/2019', '14:15:35', 250],
        [1, '06/05/2019', '16:18:21', 75],
        [2, '02/01/2019', '18:02:52', 200],
        [2, '02/02/2019', '10:03:02', 150],
        [2, '02/03/2019', '23:47:51', 800],
        [3, '01/18/2019', '11:12:58', 1000],
        [3, '01/23/2019', '22:12:41', 15]
    ]
)
# expected result, for comparison
df_out = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total'],
    data=[
        [1, '03/23/2019', '06:51:03', 100, 100],
        [1, '03/24/2019', '12:32:48', 600, 700],
        [1, '03/24/2019', '14:15:35', 250, 950],
        [1, '06/05/2019', '16:18:21', 75, 75],
        [2, '02/01/2019', '18:02:52', 200, 200],
        [2, '02/02/2019', '10:03:02', 150, 350],
        [2, '02/03/2019', '23:47:51', 800, 950],
        [3, '01/18/2019', '11:12:58', 1000, 1000],
        [3, '01/23/2019', '22:12:41', 15, 15]
    ]
)
# combine date and time into a datetime object and set it as the index
df['Trans_DateTime'] = pd.to_datetime(df['Trans_Date'] + ' ' + df['Trans_Time'])
df = df.set_index('Trans_DateTime')
# group by ID and apply a 2-day rolling window to the Amount column
df['2d_Running_Total'] = df.groupby('ID')['Amount'].rolling('2d').sum().values.astype(int)
df = df.reset_index(drop=True)
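One caveat: assigning through .values relies on df being sorted by ID and timestamp, which holds here. A sketch of an order-independent variant (my addition, assuming the combined timestamps are unique per row, as they are in this data), run while df is still indexed by Trans_DateTime:
rolled = (
    df.groupby('ID')['Amount']
      .rolling('2d')
      .sum()
      .reset_index(level='ID', drop=True)  # back to a plain Trans_DateTime index
)
df['2d_Running_Total'] = rolled.astype(int)  # aligned by index rather than by position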

NumPy replace values with arrays based on conditions

I am trying to produce a sorter that places each weekly total (for multiple different products) into the right buckets, based on the maximum cumulative allowance less what has already been sorted.
maxes = np.array([[100, 200, 300], [100, 400, 900]])
weeklytotals = np.array([[100, 150, 200, 250], [200, 400, 600, 800]])
The desired output would be:
result = np.array([[[100, 0, 0], [100, 50, 0], [100, 100, 0], [100, 150, 0]],
                   [[100, 100, 0], [100, 300, 0], [100, 400, 100], [100, 400, 300]]])
I do not want to use loops, but I am racking my brain on how to avoid them. Thanks in advance, still a Python beginner. I want to use NumPy because the end implementation will need to be extremely fast.
One vectorized approach could be:
result = np.minimum(weeklytotals[:,:,None], maxes.cumsum(1)[:,None,:])
result[...,1:] -= result[...,:-1]
result
#array([[[100, 0, 0],
# [100, 50, 0],
# [100, 100, 0],
# [100, 150, 0]],
# [[100, 100, 0],
# [100, 300, 0],
# [100, 400, 100],
# [100, 400, 300]]])
First, calculate the cumulative capacity of the buckets:
maxes.cumsum(1)
#array([[ 100, 300, 600],
# [ 100, 500, 1400]])
Then calculate the cumulative amount in the buckets by taking the minimum of the weekly total and the capacity:
result = np.minimum(weeklytotals[:,:,None], maxes.cumsum(1)[:,None,:])
#array([[[100, 100, 100],
# [100, 150, 150],
# [100, 200, 200],
# [100, 250, 250]],
# [[100, 200, 200],
# [100, 400, 400],
# [100, 500, 600],
# [100, 500, 800]]])
Take the difference of amounts between buckets and assign them back (except for the first bucket):
result[...,1:] -= result[...,:-1]
result
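Equivalently (my variant, not part of the original answer), the differencing step can be done with np.diff, which avoids the in-place update:
cum = np.minimum(weeklytotals[:, :, None], maxes.cumsum(1)[:, None, :])
# prepend=0 keeps the first bucket's amount as-is while differencing the rest
result = np.diff(cum, axis=-1, prepend=0)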
