Pandas - Identify non-unique rows, grouping any pairs except in particular case - python

This is an extension to this question.
I am trying to figure out a non-looping way to identify (auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain 1 or more rows) within each TDateID, GroupID combination. Except I need it to ignore that paired grouping if all the rows have Structure = "s".
Here is an example DataFrame:
Index  Cents  Structure  SD_YF  TDateID  GroupID
10     182.5  s          2.1    0        0
11     182.5  s          2.1    0        0
12     153.5  s          1.05   0        1
13     153.5  s          1.05   0        1
14     43     p          11     1        2
15     43     p          11     1        2
4      152    s          21     1        2
5      152    s          21     1        2
21     53     s          13     2        3
22     53     s          13     2        3
24     252    s          25     2        3
25     252    s          25     2        3
In pandas form:
df = pd.DataFrame({'Index': [10, 11, 12, 13, 14, 15, 4, 5, 21, 22, 24, 25],
                   'Cents': [182.5, 182.5, 153.5, 153.5, 43.0, 43.0,
                             152.0, 152.0, 53.0, 53.0, 252.0, 252.0],
                   'Structure': ['s', 's', 's', 's', 'p', 'p', 's', 's', 's', 's', 's', 's'],
                   'SD_YF': [2.1, 2.1, 1.05, 1.05, 11.0, 11.0,
                             21.0, 21.0, 13.0, 13.0, 25.0, 25.0],
                   'TDateID': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'GroupID': [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})
My ideal output would be:
Index  Cents  Structure  SD_YF  TDateID  GroupID  UniID
10     182.5  s          2.1    0        0        1
11     182.5  s          2.1    0        0        2
12     153.5  s          1.05   0        1        3
13     153.5  s          1.05   0        1        4
14     43     p          11     1        2        5
15     43     p          11     1        2        6
4      152    s          21     1        2        5
5      152    s          21     1        2        6
21     53     s          13     2        3        7
22     53     s          13     2        3        8
24     252    s          25     2        3        9
25     252    s          25     2        3        10
Note UniID 5: indexes 14 and 4 are paired together. The same goes for UniID 6 (indexes 15 and 5). I hope that makes sense!
The following code worked great, except it would need to be adapted for the "ignore the pairing when all rows in the grouping have Structure = 's'" part.
df['UniID'] = (df['GroupID']
               + df.groupby('GroupID').ngroup().add(1)
               + df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount())

Do the IDs need to be consecutive?
If the occurrence of "duplicate" rows is small, looping over just those groups might not be too bad.
First set an ID for all the pairs using the code you have (and add an indicator column marking that a row belongs to a group). Then select out all the rows in groups (using the indicator column) and iterate over those groups. If a group is all 's', reassign the IDs so that each of its rows gets a unique one. A rough sketch of this is below.
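Here is a minimal sketch of that idea (my reading of the approach, reusing the formula from the question; it only guarantees uniqueness, not the exact or consecutive ID values shown in the ideal output):

import numpy as np

# step 1: pair IDs from the formula in the question
df['UniID'] = (df['GroupID']
               + df.groupby('GroupID').ngroup().add(1)
               + df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount())

# step 2: indicator column - True where a row shares its ID with another row
df['in_group'] = df.duplicated('UniID', keep=False)

# step 3: loop only over the shared-ID groups and break up the all-'s' ones
next_id = df['UniID'].max() + 1
for uid, grp in df[df['in_group']].groupby('UniID'):
    if grp['Structure'].eq('s').all():
        df.loc[grp.index, 'UniID'] = np.arange(next_id, next_id + len(grp))
        next_id += len(grp)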

The tricky thing is to imagine how this should generalize. Here is my understanding: create a sequential count ignoring the p, then back fill those.
m = df['Structure'].eq('s')
# 's' rows get a running count of the 's' rows seen so far;
# 'p' rows get that count plus a running count of the 'p' rows
df['UniID'] = m.cumsum() + (~m).cumsum().mask(m, 0)
Output:
Index Cents Structure SD_YF TDateID GroupID UniID
0 10 182.5 s 2.10 0 0 1
1 11 182.5 s 2.10 0 0 2
2 12 153.5 s 1.05 0 1 3
3 13 153.5 s 1.05 0 1 4
4 14 43.0 p 11.00 1 2 5
5 15 43.0 p 11.00 1 2 6
6 4 152.0 s 21.00 1 2 5
7 5 152.0 s 21.00 1 2 6
8 21 53.0 s 13.00 2 3 7
9 22 53.0 s 13.00 2 3 8
10 24 252.0 s 25.00 2 3 9
11 25 252.0 s 25.00 2 3 10

Related

Averaging every 10 rows of one column within a dataframe, pulling every tenth item from the others?

Let's say I have the following sample dataframe:
import random
import pandas as pd

df = pd.DataFrame({'depth': list(range(0, 21)),
                   'time': list(range(0, 21)),
                   'metric': random.choices(range(10), k=21)})
df
Out[65]:
depth time metric
0 0 0 2
1 1 1 3
2 2 2 8
3 3 3 0
4 4 4 8
5 5 5 9
6 6 6 5
7 7 7 1
8 8 8 6
9 9 9 6
10 10 10 7
11 11 11 2
12 12 12 7
13 13 13 0
14 14 14 6
15 15 15 0
16 16 16 5
17 17 17 6
18 18 18 9
19 19 19 6
20 20 20 8
I want to average every ten rows of the "metric" column (preserving the first row as is) and pull every tenth item from the depth and time columns. For example:
depth time metric
0 0 0 2
10 10 10 5.3
20 20 20 4.9
I know that groupby is usually used in these situations, but I do not know how to tweak it to get my desired outcome:
df[['metric']].groupby(df.index //10).mean()
Out[66]:
metric
0 4.8
1 4.8
2 8.0
#BENY's answer is on the right track but not quite right. Should be:
df.groupby((df.index+9)//10).agg({'depth':'last','time':'last','metric':'mean'})
You can do rolling with reindex + fillna
df.rolling(10).mean().reindex(df.index[::10]).fillna(df)
depth time metric
0 0.0 0.0 2.0
10 5.5 5.5 5.3
20 15.5 15.5 4.9
Or to match output for depth and time:
out = (df.assign(metric=df['metric'].rolling(10).mean()
                        .reindex(df.index[::10])
                        .fillna(df['metric']))
         .dropna(subset=['metric']))
print(out)
depth time metric
0 0 0 2.0
10 10 10 5.3
20 20 20 4.9
Let us do agg
g = df.index.isin(df.index[::10]).cumsum()[::-1]
df.groupby(g).agg({'depth':'last','time':'last','metric':'mean'})
Out[263]:
depth time metric
1 20 20 4.9
2 10 10 5.3
3 0 0 2.0

Pandas Divide Dataframe by Another Based on Column Values

I want to divide a pandas dataframe by another based on the column values.
For example let's say I have:
>>> df = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
>>> df
NAME NUM VALUE
0 CA 1 10
1 CA 2 20
2 CA 3 30
3 AZ 1 40
4 AZ 2 50
5 AZ 3 60
6 TX 1 70
7 TX 2 80
8 TX 3 90
>>> states = pd.DataFrame({'NAME': ['CA', "AZ", "TX"], 'DIVISOR': [10, 5, 1]})
>>> states
NAME DIVISOR
0 CA 10
1 AZ 5
2 TX 1
For each NAME and NUM I want to divide the VALUE column in df by the DIVISOR column of the respective state.
Which would give a result of
>>> result = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [1, 2, 3, 8, 10, 12, 70, 80, 90]})
>>> result
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
5 AZ 3 12
6 TX 1 70
7 TX 2 80
8 TX 3 90
Let us do map
df['NEW VALUE'] = df['VALUE'].div(df['NAME'].map(states.set_index('NAME')['DIVISOR']))
df
Out[129]:
NAME NUM VALUE NEW VALUE
0 CA 1 10 1.0
1 CA 2 20 2.0
2 CA 3 30 3.0
3 AZ 1 40 8.0
4 AZ 2 50 10.0
5 AZ 3 60 12.0
6 TX 1 70 70.0
7 TX 2 80 80.0
8 TX 3 90 90.0
You can use merge as well
result = df.merge(states,on=['NAME'])
result['NEW VALUE'] = result.VALUE/result.DIVISOR
print(result)
NAME NUM VALUE NEW VALUE DIVISOR
0 CA 1 10 1.0 10
1 CA 2 20 2.0 10
2 CA 3 30 3.0 10
3 AZ 1 40 8.0 5
4 AZ 2 50 10.0 5
5 AZ 3 60 12.0 5
6 TX 1 70 70.0 1
7 TX 2 80 80.0 1
8 TX 3 90 90.0 1
I feel like there must be a more elegant way to accomplish what you are looking for, but this is the route that I usually take.
myresult = df.copy()
for i in range(len(df['NAME'])):
    for j in range(len(states['NAME'])):
        if df['NAME'][i] == states['NAME'][j]:
            myresult['VALUE'][i] = df['VALUE'][i] / states['DIVISOR'][j]
myresult.head()
Out[10]>>
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
This is a very brute force method. You start by looping through each value in the data frame df, then you loop through each element in the data frame states. Then for each comparison, you look to see if the NAME columns match. If they do, you do the VALUE / DIVISOR.
You will get a warning for using the .copy() method.
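As an aside: the warning is most likely the SettingWithCopyWarning triggered by the chained assignment myresult['VALUE'][i] = ... rather than by .copy() itself; assigning through .loc in a single step usually avoids it:

myresult.loc[i, 'VALUE'] = df['VALUE'][i] / states['DIVISOR'][j]  # inside the inner if-block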

Reduce the number of rows in a dataframe based on a condition

I have a dataframe which consists of 9821 rows and one column. The values in it are listed in groups of 161 produced 61 times (161X61=9821). I need to reduce the number of rows to 9660 (161X60=9660) by replacing the first 2 values of each group of 161 into an average of those 2 values. In more simple words, in my existing dataframe the following groups of indexes (0, 1), (61, 62) ... (9760, 9761) need to be averaged in order to get a new dataframe with 9660 rows. Any ideas?
this is what I have (groups of 4 produced 3 times - 4X3=12):
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
this is what I want (groups of 3 produced 3 times - 3X3=9):
0 10.5
1 12
2 13
3 14.5
4 16
5 17
6 18.5
7 20
8 21
I'm not super happy with this answer but I'm putting it out there for review.
>>> df[df.index%4 == 0] = df.groupby(df.index//4).apply(lambda s: s.iloc[:2].mean()).values
>>> df = df[:-3]
>>> df
0
0 10.5
1 11.0
2 12.0
3 13.0
4 14.5
5 15.0
6 16.0
7 17.0
8 18.5
rotation - my existing dataframe (161X61)
rot - my new dataframe (161X60)

arr = np.zeros((9821, 1))
rot = pd.DataFrame(arr, index=range(0, 9821))
for i in range(0, 9821):
    if i == 0:
        rot.iloc[i, 0] = (rotation.iloc[i, 0] + rotation.iloc[i+1, 0]) / 2
    elif (i % 61) == 0:
        rot.iloc[i-1, 0] = (rotation.iloc[i, 0] + rotation.iloc[i+1, 0]) / 2
        rot.iloc[i, 0] = 'del'
    else:
        if i == 9820:
            rot.iloc[i, 0] = 'del'
            break
        rot.iloc[i, 0] = rotation.iloc[i+1, 0]
rot.columns = ['alpha']
rot = rot[~rot['alpha'].isin(['del'])]
rot.reset_index(drop=True, inplace=True)
rot
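As an aside, the same reduction can be done without an explicit loop by collapsing the first two rows of each block onto a shared group key and taking the mean per key. A minimal sketch on the small example from the question (group_size is an assumption here; set it to the real block length for the full dataframe):

import pandas as pd

df = pd.DataFrame({'alpha': range(10, 22)})  # the 4 x 3 toy data from the question

group_size = 4  # block length; use the real block length for the full dataframe

pos = df.index % group_size
key = df.index - (pos > 0).astype(int)  # rows 0 and 1 of each block share a key
out = df.groupby(key)['alpha'].mean().reset_index(drop=True)
# out: 10.5, 12, 13, 14.5, 16, 17, 18.5, 20, 21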

Using pandas cut function with groupby and group-specific bins

I have the following sample DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
to which I add the percentage of the Value using
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)
and add the Order using pd.cut() with pre-defined percentage bins
percentage = np.array([10, 20, 50, 70, 100])/100
df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])
which gives
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 2
5 B 9 3 0.230769 3
6 B 4 4 0.307692 3
7 C 13 5 0.384615 3
8 C 6 6 0.461538 3
9 C 18 7 0.538462 4
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 5
My Question
Now, instead of having a single percentage array (bins) for all Tags (groups), I have a separate percentage array for each Tag group, i.e., A, B and C. How can I apply df.groupby('Tag') and then apply pd.cut() using a different percentage array for each group, taken from the following dictionary? Is there a direct way that avoids for loops like the one I use below?
percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
               'B': np.array([20, 40, 60, 90, 100])/100,
               'C': np.array([30, 50, 60, 80, 100])/100}
Desired outcome (Note: Order is now computed for each Tag using different bins):
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
My Attempt
orders = []
for k, g in df.groupby(['Tag']):
    percentage = percentages[k]
    g['Order'] = pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
    orders.append(g)
df_final = pd.concat(orders, axis=0, join='outer')
You can apply pd.cut within groupby,
df['Order'] = df.groupby('Tag').apply(lambda x: pd.cut(x['Percent_value'], bins=np.insert(percentages[x.name],0,0), labels=[1,2,3,4,5])).reset_index(drop = True)
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4

Joining tables or mapping values from a table to another table [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes where I want to map multiple columns from the smaller df to the bigger df. The bigger df is 5000 rows and I want to join based on conditions from the smaller tables. For example the bigger dataframe is:
status type slot br
2 1 2 5
2 1 1 5
2 1 2 5
2 1 2 5
2 1 56 26
2 1 76 5
The second dataframe is as follows:
slot name from to br
1 4PM 16 19 5
2 7PM 19 22 5
3 10PM 10 12 5
76 1PM 13 16 5
56 Lun 12 14 26
So basically I want to map the columns in the second dataframe to the first one based one the two columns "slot" and "br" so that the end result will have a join between the two as follows:
status type slot br name from to
2 1 2 5 7PM 19 22
2 1 1 5 4PM 16 19
2 1 2 5 7PM 19 22
2 1 2 5 7PM 19 22
2 1 56 26 Lun 12 14
2 1 76 5 1PM 13 16
I tried using if statements but it gave me an error. I think I need a more efficient solution using joins, though an if statement would be fine too.
This should work:
df = pd.DataFrame({'status': [2, 2, 2, 2, 2, 2],
                   'type': [1, 1, 1, 1, 1, 1],
                   'slot': [2, 1, 2, 2, 56, 76],
                   'br': [5.0, 5.0, 5.0, 5.0, 26.0, np.nan]})
df2 = pd.DataFrame({'slot': [1, 2, 3, 76, 56],
                    'name': ['4PM', '7PM', '10PM', '1PM', 'Lun'],
                    'from': [16, 19, 10, 13, 12],
                    'to': [19, 22, 12, 16, 14],
                    'br': [5, 5, 5, 5, 26]})
print(df.merge(df2, on=['slot', 'br'], how='left'))
status type slot br name from to
0 2 1 2 5.0 7PM 19.0 22.0
1 2 1 1 5.0 4PM 16.0 19.0
2 2 1 2 5.0 7PM 19.0 22.0
3 2 1 2 5.0 7PM 19.0 22.0
4 2 1 56 26.0 Lun 12.0 14.0
5 2 1 76 NaN NaN NaN NaN
This is a simple merge; the how parameter specifies how you want the merge to work, in this instance keeping all the keys from your left frame.
new_df = pd.merge(df,df2,on=['slot','br'],how='left')
print(new_df)
status type slot br name from to
0 2 1 2 5 7PM 19 22
1 2 1 1 5 4PM 16 19
2 2 1 2 5 7PM 19 22
3 2 1 2 5 7PM 19 22
4 2 1 56 26 Lun 12 14
5 2 1 76 5 1PM 13 16
I would caution you to first understand how merges work before going any further; check out Pandas Merging 101.
