This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 10 months ago.
I'm new to Python and working through a problem where I need to pull the row with the largest Index value for each year. Here is an example table:
Year  Index  D_Value
2010  13     85
2010  14     92
2010  15     76
2011  9      68
2011  10     73
2012  100    94
2012  101    89
So, the desired output would look like this:
Year  Index  D_Value
2010  15     76
2011  10     73
2012  101    89
I've tried researching how to apply max() and .loc, but I'm not sure what the best approach is for this scenario. Any help would be greatly appreciated. The code below generates the test table.
import pandas as pd
data = {'Year':[2010,2010,2010,2011,2011,2012,2012],'Index':[13,14,15,9,10,100,101],'D_Value':[85,92,76,68,73,94,89]}
df = pd.DataFrame(data)
print(df)
You can use groupby + rank
df['Rank'] = df.groupby(by='Year')['Index'].rank(ascending=False)
print(df[df['Rank'] ==1])
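As an alternative to ranking, here is a minimal sketch using groupby with idxmax, which returns the row label of the largest Index in each Year. It rebuilds the df from the question and assumes each year has a single maximum; if there are ties, idxmax keeps only the first occurrence. The names max_rows and result are just illustrative.

```python
import pandas as pd

data = {'Year': [2010, 2010, 2010, 2011, 2011, 2012, 2012],
        'Index': [13, 14, 15, 9, 10, 100, 101],
        'D_Value': [85, 92, 76, 68, 73, 94, 89]}
df = pd.DataFrame(data)

# Row label of the largest 'Index' within each 'Year'
max_rows = df.groupby('Year')['Index'].idxmax()

# Pull those rows from the original frame and renumber them
result = df.loc[max_rows].reset_index(drop=True)
print(result)
```

This avoids adding a helper Rank column to the original frame.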
Related
I have a data frame like the one below. I am trying to plot the mean of 'number' for each year, with the mean on the y-axis and the year on the x-axis. I think I need to build a new data frame with two columns, 'year' and 'avg number'. How would I go about doing that?
year number
0 2010 40
1 2010 44
2 2011 33
3 2011 32
4 2012 34
5 2012 56
When opening a question about pandas, please make sure you follow these guidelines: How to make good reproducible pandas examples. It will help us reproduce your environment.
Assuming your dataframe is stored in the df variable:
df.groupby('year').mean().plot()
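If you also want the intermediate two-column frame the question describes, a sketch along these lines should work. The frame is rebuilt from the sample above, and the name avg is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2010, 2011, 2011, 2012, 2012],
                   'number': [40, 44, 33, 32, 34, 56]})

# as_index=False keeps 'year' as a regular column instead of the index
avg = (df.groupby('year', as_index=False)['number']
         .mean()
         .rename(columns={'number': 'avg number'}))
print(avg)
```

From there, avg.plot(x='year', y='avg number') would draw the same line as the one-liner, provided matplotlib is installed.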
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
Sorry if the title is not clear, I'm a newbie. Hopefully an example will make it more understandable.
So let's take this DataFrame:
Area AorB Population
Hah A 23
Hah B 8
Hah C 78
Ryu A 150
Ryu B 61
Ryu C 17
I'd like to create a DataFrame with one column holding the two Areas and three columns named Apop, Bpop and Cpop holding the corresponding populations. It should look like this:
Area Apop Bpop Cpop
Hah 23 8 78
Ryu 150 61 17
It might sound dumb, but I've been searching for hours for how to do this. Help!
Let us try
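# pandas 2.0+ requires keyword arguments here, so df.pivot(*df) no longer works
out = df.pivot(index='Area', columns='AorB', values='Population').add_suffix('pop').reset_index()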
Out[8]:
AorB Area Apop Bpop Cpop
0 Hah 23 8 78
1 Ryu 150 61 17
Another way:
# Select 'Population' before unstacking so the columns stay single-level
df.set_index(['Area','AorB'])['Population'].unstack().add_suffix('pop').reset_index()
AorB Area Apop Bpop Cpop
0 Hah 23 8 78
1 Ryu 150 61 17
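The stray AorB in the output header is the name of the column axis left over from the pivot. A self-contained sketch that rebuilds the sample frame and drops that name with rename_axis might look like this; treat it as one possible cleanup rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({'Area': ['Hah', 'Hah', 'Hah', 'Ryu', 'Ryu', 'Ryu'],
                   'AorB': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Population': [23, 8, 78, 150, 61, 17]})

out = (df.pivot(index='Area', columns='AorB', values='Population')
         .add_suffix('pop')           # A, B, C -> Apop, Bpop, Cpop
         .rename_axis(columns=None)   # drop the leftover 'AorB' axis name
         .reset_index())              # turn 'Area' back into a column
print(out)
```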
Imagine I have the following dataframe:
import numpy as np
import pandas as pd
np.random.seed(42)
t = pd.DataFrame({'year': 4*['2018']+3*['2019']+4*['2016'],
                  'pop': np.random.randint(10, 100, size=(11)),
                  'production': np.random.randint(2000, 40000, size=(11))})
print(t)
year pop production
2018 61 3685
2018 24 2769
2018 81 4433
2018 70 7311
2019 30 39819
2019 92 19568
2019 96 21769
2016 84 30693
2016 84 8396
2016 97 29480
2016 33 27658
I want to find the sum of production divided by the sum of the pop by each year, my final dataframe would be something like:
tmp = t.groupby('year').sum()
tmp['production']/tmp['pop']
year
2016 322.909396
2018 77.110169
2019 372.275229
I was thinking if it could be done using groupby year and then using agg based on two columns, something like:
#doesn't work
t.groupby('year').agg(prod_per_pop = (['pop', 'production'],
lambda x: x['production'].sum()/x['pop'].sum()))
My question is basically if it is possible to use any pandas groupby method to achieve that in an easy way rather than having to create another dataframe and then having to divide.
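You can sum both columns per group and then divide row-wise with a lambda and axis=1, all in a single line. Note the double brackets: selecting several columns from a groupby needs a list.
t.groupby('year')[['pop','production']].agg('sum').apply(lambda x: x['production']/x['pop'], axis=1)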
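Another option, closer to what the question attempted with agg, is groupby followed by apply, where the lambda receives each group's sub-frame and can read both columns. This is only a sketch: it hard-codes the values printed in the question so it does not depend on the random seed, and the names ratio and prod_per_pop are illustrative.

```python
import pandas as pd

t = pd.DataFrame({
    'year': 4*['2018'] + 3*['2019'] + 4*['2016'],
    'pop': [61, 24, 81, 70, 30, 92, 96, 84, 84, 97, 33],
    'production': [3685, 2769, 4433, 7311, 39819, 19568,
                   21769, 30693, 8396, 29480, 27658],
})

# Each group g is a DataFrame holding the 'pop' and 'production' rows of one year
ratio = (t.groupby('year')[['pop', 'production']]
           .apply(lambda g: g['production'].sum() / g['pop'].sum())
           .rename('prod_per_pop'))
print(ratio)
```

apply is typically slower than summing and dividing on large frames, so the two-step version in the question remains a reasonable choice.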
This question already has answers here:
Python: Checking to which bin a value belongs
(3 answers)
Closed 3 years ago.
I am trying and failing here. All I want to do is take a "Time_of_Event" value from this dataframe:
events_data = {'Time_of_Event':[8, 22, 24,34,61,62,73,79,86]}
my_events_df = pd.DataFrame(events_data)
And search it against the "Job_Start_Times" of this dataframe:
job_data = {'Job_Start_Time':[20,50,75], 'Job_Name':['Job_01','Job_02','Job_03']}
my_jobs_df = pd.DataFrame(job_data)
And find which range it falls in, and return/append the "Job_Name" to my first "my_events_df" dataframe.
For example, for the value of 8 in "Time_of_Event", I want to return "Job_01". For the value of 61, I want to return "Job_02", as 61 falls between 50 and 75.
I've tried some for loops, if-elses, and I haven't made much progress. Any help is appreciated!
We can try with pd.merge_asof
new_df = (pd.merge_asof(my_events_df.sort_values('Time_of_Event'),
                        my_jobs_df,
                        left_on='Time_of_Event',
                        right_on='Job_Start_Time',
                        direction='backward')
            .drop(columns='Job_Start_Time')
            .bfill())
print(new_df)
Time_of_Event Job_Name
0 8 Job_01
1 22 Job_01
2 24 Job_01
3 34 Job_01
4 61 Job_02
5 62 Job_02
6 73 Job_02
7 79 Job_03
8 86 Job_03
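For comparison, the same lookup can be sketched with numpy's searchsorted, which finds where each event time would slot into the sorted start times. This is a different technique from merge_asof, shown under the assumption that events before the first start time should map to the first job, as in the example for the value 8. The names jobs, starts and pos are illustrative.

```python
import numpy as np
import pandas as pd

my_events_df = pd.DataFrame({'Time_of_Event': [8, 22, 24, 34, 61, 62, 73, 79, 86]})
my_jobs_df = pd.DataFrame({'Job_Start_Time': [20, 50, 75],
                           'Job_Name': ['Job_01', 'Job_02', 'Job_03']})

# searchsorted needs the start times in ascending order
jobs = my_jobs_df.sort_values('Job_Start_Time')
starts = jobs['Job_Start_Time'].to_numpy()

# Position of the last start time <= each event (-1 when there is none)
pos = np.searchsorted(starts, my_events_df['Time_of_Event'].to_numpy(), side='right') - 1

# Assumption: events earlier than every start time fall back to the first job
pos = np.clip(pos, 0, None)

my_events_df['Job_Name'] = jobs['Job_Name'].to_numpy()[pos]
print(my_events_df)
```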
This question already has answers here:
How to analyze all duplicate entries in this Pandas DataFrame?
(3 answers)
Closed 7 years ago.
I am new to Python. I would like to find the duplicated lines in a data frame.
To explain myself, I have the following data frame
type(data)
pandas.core.frame.DataFrame
data.head()
User Hour Min Day Month Year Latitude Longitude
0 0 1 48 17 10 2010 39.75000 -105.000000
1 0 6 2 16 10 2010 39.90625 -105.062500
2 0 3 48 16 10 2010 39.90625 -105.062500
3 0 18 25 14 10 2010 39.75000 -105.000000
I would like to find the duplicated lines in this data frame and to return the 'User' that corresponds to this line.
Thanks a lot,
Is this what you are looking for?
user = data[data.duplicated()]['User']
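Note that duplicated() with its defaults compares every column and flags only the second and later copies of a row. A small sketch of two variations follows, using the four rows shown in data.head(). Treating rows with the same Latitude and Longitude as duplicates is an assumption about what the question means, and the names full_dupes, same_place and users are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'User': [0, 0, 0, 0],
                     'Hour': [1, 6, 3, 18],
                     'Min': [48, 2, 48, 25],
                     'Day': [17, 16, 16, 14],
                     'Month': [10, 10, 10, 10],
                     'Year': [2010, 2010, 2010, 2010],
                     'Latitude': [39.75, 39.90625, 39.90625, 39.75],
                     'Longitude': [-105.0, -105.0625, -105.0625, -105.0]})

# keep=False marks every copy of a duplicated row, not just the repeats
full_dupes = data[data.duplicated(keep=False)]

# Assumption: only compare the location columns
same_place = data[data.duplicated(subset=['Latitude', 'Longitude'], keep=False)]
users = same_place['User']

print(len(full_dupes))
print(users)
```

On these four rows no line is identical across all columns, so full_dupes is empty, while all four rows share a location with another row.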