Pandas: How to print the groupby values - python

I have the following data set in Table_Record:
Seg_ID Lock_ID Code
111 100 1
222 121 2
333 341 2
444 100 1
555 100 1
666 341 2
777 554 4
888 332 5
I am using this SQL query to find the Seg_IDs where Lock_ID is repeated:
Select Code,Lock_ID,Seg_ID from Table_Record group by Code, Lock_ID;
Seg_ID Lock_ID Code
111 100 1
444 100 1
555 100 1
222 121 2
333 341 2
666 341 2
777 554 4
888 332 5
How can I achieve the same using Pandas?
Expected output from Pandas is:
e.g.
Seg_ID (111,444,555) has Lock_id (1).
Seg_ID (222,333,666) has Lock_ID (2).

First get all codes by filtering only duplicated Lock_ID values, then filter the original DataFrame by boolean indexing with isin:
codes = df.loc[df.duplicated(['Lock_ID']), 'Code'].unique()
df1 = df[df['Code'].isin(codes)]
print (df1)
Seg_ID Lock_ID Code
0 111 100 1
1 222 121 2
2 333 341 2
3 444 100 1
4 555 100 1
5 666 341 2
Then use groupby with f-strings:
for k, v in df1.groupby('Code')['Seg_ID']:
    print(f'Seg_ID {tuple(v)} has Code ({k})')
Seg_ID (111, 444, 555) has Code (1)
Seg_ID (222, 333, 666) has Code (2)
If you want the output as a DataFrame, use apply with tuple:
df2 = df1.groupby(['Code'])['Seg_ID'].apply(tuple).reset_index()
print (df2)
Code Seg_ID
0 1 (111, 444, 555)
1 2 (222, 333, 666)

Simply use groupby. As far as I can understand from your code, you'd want:
grouped = df.groupby(['Code', 'Lock_ID'])
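From that grouped object you can then print which Seg_IDs share each Code/Lock_ID pair, much like the accepted answer above; a minimal sketch (the exact message format is an assumption, not part of the original answer):
# iterate over the groups; each group is the sub-frame sharing one Code/Lock_ID pair
for (code, lock_id), group in grouped:
    print(f"Seg_ID {tuple(group['Seg_ID'])} has Lock_ID ({lock_id})")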


Splitting string multiple times in DataFrame

I have a column in a DataFrame that contains a string from which I must retrieve two pieces of information by different separators:
ID STR
280 11040402-38.58551%;11050101-9.29086%;11070101-52.12363%
351 11130203-35%;11130230-65%
510 11070103-69%
655 11090103-41.63463%;11160102-58.36537%
666 11130205-50.00%;11130207-50%
I have been trying to use the .apply method on this series together with a lambda function to make the splitting in one go, to no avail:
df['STR'].apply(lambda x: y.split('-') for y in x.split(';'))
Ideally, not only would I be able to split the string in one go, but also separate the left side of the - from the right side:
ID STR.LEFT STR.RIGHT
280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
351 [11130203, 11130230] [35%, 65%]
510 [11070103] [69%]
655 [11090103, 11160102] [41.63463%, 58.36537%]
666 [11130205, 11130207] [50.00%, 50%]
I believe this could be achievable with .apply and slicing, but any other solution is welcome.
You can try splitting several times:
# set ID as index
df.set_index('ID', inplace=True)
new_series = df.STR.str.split(';', expand=True).stack().reset_index(level=-1,drop=True)
new_df = new_series.str.split('-', expand=True)
new_df.groupby('ID').agg(list).reset_index()
Output:
ID 0 1
-- ---- ------------------------------------ --------------------------------------
0 280 ['11040402', '11050101', '11070101'] ['38.58551%', '9.29086%', '52.12363%']
1 351 ['11130203', '11130230'] ['35%', '65%']
2 510 ['11070103'] ['69%']
3 655 ['11090103', '11160102'] ['41.63463%', '58.36537%']
4 666 ['11130205', '11130207'] ['50.00%', '50%']
str.split
Assuming the pattern is always 'l-r;l-r;l-r...':
s = df.STR.str.split('-|;')
df[['ID']].join(pd.concat({'STR.LEFT': s.str[::2], 'STR.RIGHT': s.str[1::2]}, axis=1))
ID STR.LEFT STR.RIGHT
0 280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 351 [11130203, 11130230] [35%, 65%]
2 510 [11070103] [69%]
3 655 [11090103, 11160102] [41.63463%, 58.36537%]
4 666 [11130205, 11130207] [50.00%, 50%]
If you want to explode these lists into separate rows
s = df.STR.str.split('-|;')
i = np.arange(len(df)).repeat(s.str.len() // 2)
d = {'STR.LEFT': np.concatenate(s.str[::2]),
     'STR.RIGHT': np.concatenate(s.str[1::2])}
df[['ID']].iloc[i].assign(**d).reset_index(drop=True)
ID STR.LEFT STR.RIGHT
0 280 11040402 38.58551%
1 280 11050101 9.29086%
2 280 11070101 52.12363%
3 351 11130203 35%
4 351 11130230 65%
5 510 11070103 69%
6 655 11090103 41.63463%
7 655 11160102 58.36537%
8 666 11130205 50.00%
9 666 11130207 50%
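As a side note, on newer pandas (0.25 or later) the same row-wise expansion can be written with DataFrame.explode; this is only a sketch under the assumption that df holds the ID and STR columns shown above, not part of the original answer:
# split into 'l-r' pairs, explode to one pair per row, then split each pair on '-'
out = (df.assign(pair=df['STR'].str.split(';'))
         .explode('pair')
         .reset_index(drop=True))
pairs = out['pair'].str.split('-', expand=True)
out = out[['ID']].assign(**{'STR.LEFT': pairs[0], 'STR.RIGHT': pairs[1]})
print(out)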
A single str.extractall call will suffice to extract the pairs into separate columns. You can then aggregate them into lists using groupby.
(df['STR'].str.extractall(r'(.*?)-(.*?)(?:;|$)')
          .groupby(level=0)
          .agg(list)
          .set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1))
STR.LEFT STR.RIGHT
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 [11130203, 11130230] [35%, 65%]
2 [11070103] [69%]
3 [11090103, 11160102] [41.63463%, 58.36537%]
4 [11130205, 11130207] [50.00%, 50%]
To join with ID, you use just that: join.
(df['STR'].str.extractall(r'(.*?)-(.*?)(?:;|$)')
          .groupby(level=0)
          .agg(list)
          .set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1)
          .join(df['ID']))
STR.LEFT STR.RIGHT ID
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%] 280
1 [11130203, 11130230] [35%, 65%] 351
2 [11070103] [69%] 510
3 [11090103, 11160102] [41.63463%, 58.36537%] 655
4 [11130205, 11130207] [50.00%, 50%] 666

Pandas: Group by index and apply max to a column

I am trying to group the dataframe into chunks of 3 rows and get the row with the highest delta value from each group, but the max method applies to every column independently. How can I achieve this?
What I do:
In [69]: fr
Out[69]:
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 295
4 1516195008033 233
5 1516196049407 252
In [70]: fr.groupby(fr.index // 3).max()
Out[70]:
ping delta
0 1516192904988 161
1 1516196049407 295
Result I want to get:
ping delta
0 1516190798773 161
1 1516193952748 295
If you want the first value in the ping column and the max value in delta:
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
If you want the max value in delta and the whole corresponding rows:
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
3 1516193952748 295
A better sample to show the difference:
print (fr)
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 233 <-swapped values 233
4 1516195008033 295 <-swapped values 295
5 1516196049407 252
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
4 1516195008033 295

Append a new column to a dataframe based on another dataframe with matching rows, filling the non-matching ones with values from an existing column

I have a data frame that looks like
df1
UserID group day sp PU
213 test 12/11/14 3 311
314 control 13/11/14 4 345
354 test 13/08/14 5 376
and a second data frame df2 that holds information about the values in df1's UserID column; rows whose UserID matches between df2 and df1 should become test-Red, and the others should keep their existing value.
df2
UserID
213
What I am aiming for is to append a new column group2 to df1, derived from the group column in df1 using the matching values from df2 as well as the values already there in df1, as follows. For instance, UserID 213 matches between df1 and df2, so the newly appended column 'group2' should contain test-Red for that row; otherwise it should carry over the value from the group column.
df1
UserID group day sp PU group2
213 test 12/11/14 3 311 test-Red
314 control 13/11/14 4 345 control
354 test 13/08/14 5 376 test-NonRed
This is what I tried:
def converters(df2, df1):
    if df1['UserId'] == df2['UserId']:
        val = "test-Red"
    elif df1['group'] == "test":
        val = "test-NonRed"
    else:
        val = "control"
    return val
But it throws the following error:
ValueError: Series lengths must match to compare
Use numpy.where:
df1['new'] = np.where(df1['UserID'].isin(df2['UserID']), 'test-Red',
                      np.where(df1['group'] == 'test', 'test-NonRed', df1['group']))
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
Or numpy.select:
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'] == 'test'
df1['new'] = np.select([m1, m2], ['test-Red', 'test-NonRed'], default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
More general solution:
print (df1)
UserID group day sp PU
0 213 test 12/11/14 3 311
1 314 control 13/11/14 4 345
2 354 test 13/08/14 5 376
3 2131 test1 12/11/14 3 311
4 314 control1 13/11/14 4 345
5 354 test1 13/08/14 5 376
df2 = pd.DataFrame({'UserID':[213, 2131]})
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'].isin(df1.loc[m1, 'group'])
df1['new'] = np.select([m1, m2],
                       [df1['group'] + '-Red', df1['group'] + '-NonRed'],
                       default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
3 2131 test1 12/11/14 3 311 test1-Red
4 314 control1 13/11/14 4 345 control1
5 354 test1 13/08/14 5 376 test1-NonRed
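For context on the original error: comparing two Series of different lengths element-wise (df1['UserId'] == df2['UserId']) raises a ValueError, while isin, as used in the masks above, is a membership test and does not care about length. A minimal sketch illustrating the difference, using values from the sample data:
import pandas as pd

s1 = pd.Series([213, 314, 354])  # like df1['UserID']
s2 = pd.Series([213])            # like df2['UserID']

# s1 == s2          # would raise ValueError: Series lengths must match to compare
print(s1.isin(s2))  # True, False, False -> works regardless of length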
Could you use pd.merge and specify the how='outer' parameter? This would include all the data from both tables being joined,
i.e.:
df1.merge(df2, how='outer', on='UserID')
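If you take the merge route, one way to actually derive the group2 column from it is the indicator flag; this is only a sketch of how that idea could be completed (the nested np.where mirrors the accepted answer, and it assumes df2 lists only the UserIDs that should become test-Red):
import numpy as np

# left join keeps every df1 row; _merge tells us whether the UserID was found in df2
merged = df1.merge(df2, on='UserID', how='left', indicator=True)
merged['group2'] = np.where(merged['_merge'] == 'both', 'test-Red',
                            np.where(merged['group'] == 'test', 'test-NonRed',
                                     merged['group']))
merged = merged.drop(columns='_merge')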

Reset secondary index in pandas dataframe to start at 1

Suppose I construct a multi-index dataframe like the one show here:
prim_ind = np.array(range(0, 1000))
for i in range(0, 1000):
    prim_ind[i] = round(i / 4)
d = {'prim_ind': prim_ind,
     'sec_ind': np.array(range(1, 1001)),
     'a': np.array(range(325, 1325)),
     'b': np.array(range(8318, 9318))}
df = pd.DataFrame(d).set_index(['prim_ind', 'sec_ind'])
The sec_ind runs sequentially from 1 upwards, but I want to reset this second index so that for each of the prim_ind levels the sec_ind always starts at 1. I have been trying to work out if I can use reset_index to do this, but am failing miserably.
I know I could iterate over the dataframe to get this outcome, but that would be a horrible way to do it and there must be a more pythonic way - can anyone help?
Note: the dataframe I'm working with is actually imported from csv; the code above is just to illustrate this question.
You can use cumcount to number the rows within each group:
df.index = [df.index.get_level_values(0), df.groupby(level=0).cumcount() + 1]
Or better, if you also want index names, use MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0),
                                      df.groupby(level=0).cumcount() + 1],
                                     names=df.index.names)
print (df)
a b
prim_ind sec_ind
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
So the sec_ind column is not necessary; you can also use:
d = {'prim_ind': prim_ind,
     'a': np.array(range(325, 1325)),
     'b': np.array(range(8318, 9318))}
df = pd.DataFrame(d)
print (df.head(8))
a b prim_ind
0 325 8318 0
1 326 8319 0
2 327 8320 0
3 328 8321 1
4 329 8322 1
5 330 8323 1
6 331 8324 2
7 332 8325 2
df = df.set_index(['prim_ind', df.groupby('prim_ind').cumcount() + 1]) \
       .rename_axis(('first', 'second'))
print (df.head(8))
a b
first second
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
2 332 8325

Pandas: Get highest value from a column for each unique value in another column

How do I get the highest value in one column for each unique value in another column, returning the same dataframe structure back?
Here is a pandas dataframe example:
reg.nr counter value ID2 categ date
1 37367 421 231385 93 A 20.01.2004
2 37368 428 235156 93 B 21.01.2004
3 37369 408 234251 93 C 22.01.2004
4 37372 403 196292 93 D 23.01.2004
5 55523 400 247141 139 E 24.01.2004
6 55575 415 215818 139 F 25.01.2004
7 55576 402 204404 139 A 26.01.2004
8 69940 402 62244 175 B 27.01.2004
9 69941 402 38274 175 C 28.01.2004
10 69942 404 55171 175 D 29.01.2004
11 69943 416 55495 175 E 30.01.2004
12 69944 407 90231 175 F 31.01.2004
13 69945 411 75382 175 A 01.02.2004
14 69948 405 119129 175 B 02.02.2004
I want to return the row with the highest value of the "counter" column for each unique value of the "ID2" column. Afterwards, the new pandas dataframe should look like this:
reg.nr counter value ID2 categ date
1 37368 428 235156 93 B 21.01.2004
2 55575 415 215818 139 F 25.01.2004
3 69943 416 55495 175 E 30.01.2004
One way, using drop_duplicates:
In [332]: df.sort_values('counter', ascending=False).drop_duplicates(['ID2'])
Out[332]:
reg.nr counter value ID2 categ date
2 37368 428 235156 93 B 21.01.2004
11 69943 416 55495 175 E 30.01.2004
6 55575 415 215818 139 F 25.01.2004
For the desired output, you could sort on two columns and reset the index:
In [336]: (df.sort_values(['ID2', 'counter'], ascending=[True, False])
.drop_duplicates(['ID2']).reset_index(drop=True)
)
Out[336]:
reg.nr counter value ID2 categ date
0 37368 428 235156 93 B 21.01.2004
1 55575 415 215818 139 F 25.01.2004
2 69943 416 55495 175 E 30.01.2004
df.loc[df.groupby('ID2')['counter'].idxmax(), :].reset_index()
index reg.nr counter value ID2 categ date
0 2 37368 428 235156 93 B 21.01.2004
1 6 55575 415 215818 139 F 25.01.2004
2 11 69943 416 55495 175 E 30.01.2004
First, you group your dataframe by the ID2 column. Then you take the counter column and calculate the index of the (first) maximal element of that column in each group. Then you use these indexes to filter your initial dataframe. Finally, you reset the index (if you need to).
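To see the intermediate step this refers to, you can print the index Series that idxmax produces before passing it to loc; a minimal sketch, assuming df is the frame from the question (the printed values follow from the sample data):
# one index label per ID2 group: the position of the first maximal counter
idx = df.groupby('ID2')['counter'].idxmax()
print(idx)
# ID2
# 93      2
# 139     6
# 175    11
# Name: counter, dtype: int64

print(df.loc[idx].reset_index())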
