Find maximum in Dataframe based on variable values - python

I have a dataframe of the form:
A | B | C | D
a | x | r | 1
a | x | s | 2
a | y | r | 1
b | w | t | 4
b | z | v | 2
I'd like to be able to return something like (showing unique values and frequency)
A | freq of most common value in column B | maximum of column D based on the most common value in column B | most common value in column B
a | 2 | 2 | x
b | 1 | 4 | w
At the moment I can calculate everything except one column of the result DataFrame quite fast via
df = (df.groupby('A', sort=False)['B']
        .apply(lambda x: x.value_counts().head(1))
        .reset_index())
but to calculate the remaining column ("maximum of column D based on the most common value in column B") I have written a for-loop, which is slow for a lot of data.
Is there a fast way?
The question is linked to: Count values in dataframe based on entry

Use merge, then get the row with the maximum D per group via DataFrameGroupBy.idxmax:
df1 = (df.groupby('A', sort=False)['B']
         .apply(lambda x: x.value_counts().head(1))
         .reset_index()
         .rename(columns={'level_1':'E'}))
#print (df1)
df = df1.merge(df, left_on=['A','E'], right_on=['A','B'], suffixes=('','_'))
df = df.loc[df.groupby('A')['D'].idxmax(), ['A','B','D','E']]
print (df)
A B D E
1 a 2 2 x
2 b 1 4 w
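For reference, the input frame from the question can be rebuilt like this to try the snippet (note that the 'level_1' column name produced by reset_index can differ between pandas versions, which is what the rename above targets):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b'],
                   'B': ['x', 'x', 'y', 'w', 'z'],
                   'C': ['r', 's', 'r', 't', 'v'],
                   'D': [1, 2, 1, 4, 2]})
# in df1 above, A is the group, E the most common B value and B its frequency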

Consider doing this in 3 steps:
find most common B (as in your code):
df2 = (df.groupby('A', sort=False)['B']).apply(lambda x: x.value_counts().head(1)).reset_index()
build DataFrame with max D for each combination of A and B
df3 = df.groupby(['A','B']).agg({'D':max}).reset_index()
merge 2 DataFrames to find max Ds matching the A-B pairs selected earlier
df2.merge(df3, left_on=['A','level_1'], right_on=['A','B'])
The column D in the resulting DataFrame will be what you need
A level_1 B_x B_y D
0 a x 2 x 2
1 b w 1 w 4
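If you want the output shaped exactly as in the question, one possible tidy-up of that merge result (the names freq_B, max_D and most_common_B are only illustrative):
res = df2.merge(df3, left_on=['A', 'level_1'], right_on=['A', 'B'])
out = res.rename(columns={'B_x': 'freq_B', 'D': 'max_D', 'level_1': 'most_common_B'})
out = out[['A', 'freq_B', 'max_D', 'most_common_B']]
#    A  freq_B  max_D most_common_B
# 0  a       2      2             x
# 1  b       1      4             w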

Related

Find out if values in dataframe are between values in other dataframe

I'm new to pandas and I'm trying to understand if there is a method to find out if two values from one row in df1 are between two values from one row in df2.
Basically my df1 looks like this:
start | value | end
1 | TEST | 5
2 | TEST | 3
...
and my df2 looks like this:
start | value | end
2 | TEST2 | 10
3 | TEST2 | 4
...
Right now I've got it working with two loops:
for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row2[1]["start"] >= row[1]["start"] and row2[1]["end"] <= row[1]["end"]:
            print(row2)
but this doesn't feel like it's the pandas way to me.
What I'm expecting is that row number 2 from df2 is getting printed because 3 > 1 and 4 < 5, i.e.:
3 | TEST2 | 4
Is there a method to do this in the pandas kind of working?
You could use a cross merge to get all combinations of df1 and df2 rows, and filter using classical comparisons. Finally, get the indices and slice:
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique()
       )
df2.loc[idx]
NB. I am using unique here to ensure that a row is selected only once, even if there are several matches.
output:
start value end
1 3 TEST2 4
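For completeness, a self-contained reproduction of the example above (how='cross' requires pandas 1.2 or later):
import pandas as pd

df1 = pd.DataFrame({'start': [1, 2], 'value': ['TEST', 'TEST'], 'end': [5, 3]})
df2 = pd.DataFrame({'start': [2, 3], 'value': ['TEST2', 'TEST2'], 'end': [10, 4]})

idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique())
print(df2.loc[idx])
#    start  value  end
# 1      3  TEST2    4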

Faster way to index pandas dataframe multiple times

For every row in df_a, I am looking to find rows in df_b where the id's are the same and the df_a row's location falls within the df_b row's start and end location.
df_a looks like:
| Name | id | location |
|------|----|----------|
| a    | 1  | 202013   |
df_b looks like:
| Name | id | location_start | location_end |
|------|----|----------------|--------------|
| x    | 1  | 202010         | 2020199      |
Unfortunately, df_a and df_b are both nearly a million rows. This code is taking about 10 hours to run on my local machine. Currently I'm running the following:
for index, row in df_a.iterrows():
    matched = df_b[(df_b['location_start'] < row['location'])
                   & (df_b['location_end'] > row['location'])
                   & (df_b['id'] == row['id'])]
Is there any obvious way to speed this up?
You can do this:
Consider my sample dataframes below:
In [90]: df_a = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location':[202013, 102013]})
In [91]: df_b = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location_start':[202010, 1020199],'location_end':[2020199, 1020299] })
In [92]: df_a
Out[92]:
Name id location
0 a 1 202013
1 b 2 102013
In [93]: df_b
Out[93]:
Name id location_start location_end
0 a 1 202010 2020199
1 b 2 1020199 1020299
In [95]: d = pd.merge(df_a, df_b, on='id')
In [106]: indexes = d[d['location'].between(d['location_start'], d['location_end'])].index.tolist()
In [107]: df_b.iloc[indexes, :]
Out[107]:
Name id location_start location_end
0 a 1 202010 2020199
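One caveat: the merged frame gets a fresh RangeIndex, so slicing df_b with iloc and those labels only lines up when the merge is one-to-one and in order, as in this sample. A more defensive sketch (the b_idx name is only illustrative) carries df_b's own row labels through the merge:
d = df_a.merge(df_b.rename_axis('b_idx').reset_index(), on='id')
mask = d['location'].between(d['location_start'], d['location_end'])
result = df_b.loc[d.loc[mask, 'b_idx'].unique()]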

Pandas reverse order of some columns not all

There are a few solutions to reverse the order of all columns here on SO like for example so:
df = df.iloc[:,::-1]
What I want to achieve is to reverse only the order of the columns after the first two, i.e. keep A and B in place and reverse the order of the remaining columns, as I have a couple of hundred columns and don't want to write the order out manually.
My initial columns are:
| A | B | C | D | E | F |
|---|---|---|---|---|---|
And The Result I am looking for is:
| A | B | F | E | D | C |
|---|---|---|---|---|---|
Thanks!
df[list(df.columns[:2]) + list(df.columns[:1:-1])]
Use numpy.r_ to concatenate the indices and then select with DataFrame.iloc:
df = pd.DataFrame(columns=list('ABCDEF'))
print (df.iloc[:, np.r_[0:2, len(df.columns)-1:1:-1]])
Empty DataFrame
Columns: [A, B, F, E, D, C]
Index: []
Or select both parts of the DataFrame and combine them with join:
print (df.iloc[:, :2].join(df.iloc[:, :1:-1]))
Empty DataFrame
Columns: [A, B, F, E, D, C]
Index: []
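A quick check of the first one-liner on a non-empty frame (the values are only illustrative):
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('ABCDEF'))
print(df[list(df.columns[:2]) + list(df.columns[:1:-1])])
#    A  B  F  E  D  C
# 0  1  2  6  5  4  3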

How to add values to a row if a column exists in a df

I have a list of items that I loop over to find values, for example dfexample1:
#List of documents:
list1 = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']
#each document contains a DF with the same Test column containing the same Test letters:
doc1.pdf: df1
Test | Value
a | 5.5
b | 6.5
c | 8.5
doc2.pdf: df2
Test | Value
a | 6.5
b | 11.5
c | 13.5
doc3.pdf: df3
Test | Value
a | 12.5
b | 3.5
c | 9.5
I have a df with the Test values as the columns:
a | b | c
In each loop I am trying to extract the value of each test and add them to the df.
In the first example the same analysis will be repeated over a list of pdf documents, so I need to extract each of those values and add them to the df.
I tried this:
for item in list1:
    for index, row in dfexample1.iterrows():
        if item in row[0]:
            value1 = row[1]
Verifying if the value exists as a column in the df:
if item in df.columns:
    #How can I insert this value in the next row?
My expected result would be:
a    | b    | c
5.5  | 6.5  | 8.5
6.5  | 11.5 | 13.5
12.5 | 3.5  | 9.5
I guess .T would not work, because dfexample1 will be repeated with different values while looping across the documents.
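A minimal sketch of one way to build those rows, assuming each document has already been parsed into a DataFrame with Test and Value columns (df1, df2, df3 below are stand-ins for the parsed contents of doc1.pdf, doc2.pdf and doc3.pdf):
import pandas as pd

# stand-ins for the parsed contents of the three documents
df1 = pd.DataFrame({'Test': ['a', 'b', 'c'], 'Value': [5.5, 6.5, 8.5]})
df2 = pd.DataFrame({'Test': ['a', 'b', 'c'], 'Value': [6.5, 11.5, 13.5]})
df3 = pd.DataFrame({'Test': ['a', 'b', 'c'], 'Value': [12.5, 3.5, 9.5]})

# one row per document, with the Test letters becoming the columns
rows = [d.set_index('Test')['Value'] for d in (df1, df2, df3)]
result = pd.DataFrame(rows).reset_index(drop=True)
# result has columns a, b, c and one row per document, e.g. 5.5 | 6.5 | 8.5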

Matrix Multiplication of 2 Pandas DF's

I have 2 pandas DataFrames that are (156915, 22) and (22, 2) in shape. DF1, (156915, 22), has column names that match the rows of DF2's first column. I want to do matrix multiplication where DF1.columns = DF2['col1']. Here's a quick view of what the DFs may look like. I would like to return a pandas DataFrame of the same shape as DF1. Thank you in advance!
DF1:
A | B | C
1 | 15 | 8
5 | 3 | 2
DF2:
col1 | col2
A | 5
B | 1
C | 0
One method:
df3 = df2.set_index('col1')
df1[df3.index].apply(lambda x: x*df3['col2'].T, axis=1)
If your columns in DF1 are in the same order as your col1 in DF2, then using np.dot should work:
np.dot(DF1, DF2['col2'])
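If the goal is really a result with the same shape as DF1 (each column scaled by its weight from DF2), a hedged alternative is to align on the column labels with mul, assuming DF2['col1'] covers exactly DF1's columns:
weights = DF2.set_index('col1')['col2']
scaled = DF1.mul(weights, axis=1)
#     A   B  C
# 0   5  15  0
# 1  25   3  0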
