create a new dataframe based on a given dataframe [duplicate] - python

This question already has answers here:
Group dataframe and get sum AND count?
(4 answers)
Closed 1 year ago.
I have a table that looks like this:
user id  observation
25       2
25       3
25       2
23       1
23       3
the desired outcome is:
user id  observation  retention
25       7            3
23       4            2
I want to keep one row per unique user id, sum the observation column values into the observation column, and add a retention column showing how many times each id appears in the dataset.
Any help will be appreciated, thanks.

Use the groupby() method and chain the agg() method to it:
outputdf = df.groupby('user id', as_index=False).agg(observation=('observation', 'sum'), retention=('observation', 'count'))
Now if you print outputdf you will get your desired output:
   user id  observation  retention
0       23            4          2
1       25            7          3
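For reference, here is a minimal, self-contained version of this answer using the sample data from the question (a sketch; it assumes nothing beyond pandas itself):

import pandas as pd

# sample data from the question
df = pd.DataFrame({'user id': [25, 25, 25, 23, 23],
                   'observation': [2, 3, 2, 1, 3]})

# named aggregation: sum the observations and count the rows per user id
outputdf = df.groupby('user id', as_index=False).agg(
    observation=('observation', 'sum'),
    retention=('observation', 'count'),
)
print(outputdf)
#    user id  observation  retention
# 0       23            4          2
# 1       25            7          3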

You have to use groupby:
import pandas as pd

d = {'user id': [25, 25, 25, 23, 23], 'observation': [2, 3, 2, 1, 3]}
# get the dataframe
df = pd.DataFrame(data=d)
# aggregate the observation column twice: once as a sum, once as a count
# (a list keeps the column order deterministic, unlike a set)
df_new = df.groupby('user id')['observation'].agg(['sum', 'count']).reset_index()
# rename the columns as you desire
df_new.columns = ['user id', 'observation', 'retention']
df_new
Output:
   user id  observation  retention
0       23            4          2
1       25            7          3


drop rows based on a condition based on another [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 6 months ago.
I have the following data frame
user_id  value
1        5
1        7
1        11
1        15
1        35
2        8
2        9
2        14
I want to drop all rows that do not hold the maximum value for each user_id, resulting in a 2-row data frame:
user_id  value
1        35
2        14
How can I do that?
You can use pandas.DataFrame.max after the grouping.
Assuming that your original dataframe is named df, try the code below:
out = df.groupby('user_id', as_index=False)['value'].max()
>>> print(out)
   user_id  value
0        1     35
1        2     14
Edit:
If you want to group by more than one column, use this:
out = df.groupby(['user_id', 'sex'], as_index=False, sort=False)['value'].max()
>>> print(out)
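If your dataframe has other columns you want to keep, another common approach is to select the full rows that hold each group's maximum via idxmax (a sketch, assuming each group's maximum is unique):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 2, 2, 2],
                   'value': [5, 7, 11, 15, 35, 8, 9, 14]})

# idxmax returns the index label of each group's maximum value;
# loc then keeps those entire rows, preserving any other columns
out = df.loc[df.groupby('user_id')['value'].idxmax()]
print(out)
#    user_id  value
# 4        1     35
# 7        2     14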

How to calculate the average of a column where the row meets a certain condition in Pandas [duplicate]

This question already has answers here:
Pandas Groupby: Count and mean combined
(6 answers)
Closed last year.
Basically I have this Dataframe:
import pandas as pd

# named 'data' rather than 'dict' to avoid shadowing the built-in
data = {'number': [1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 4, 6, 6],
        'time': [34, 33, 41, 36, 43, 22, 24, 32, 29, 28, 33, 32, 55, 51]}
df = pd.DataFrame(data)
print(df)
Output:
    number  time
0        1    34
1        1    33
2        1    41
3        1    36
4        1    43
5        2    22
6        2    24
7        2    32
8        4    29
9        4    28
10       4    33
11       4    32
12       6    55
13       6    51
And I want to transform the df, or create another one, so that there is a single row per unique 'number', with the 'time' column holding the average of the records that shared that 'number'. Also, there should be a third column called 'count' that shows how many records each 'number' had.
Thanks.
Simply use groupby + agg:
agg = df.groupby('number')['time'].agg(['count', 'mean']).reset_index()
Output:
>>> agg
   number  count  mean
0       1      5  37.4
1       2      3  26.0
2       4      4  30.5
3       6      2  53.0
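If you prefer to set the final column names in the same step, named aggregation works here too (a sketch using the question's data):

import pandas as pd

df = pd.DataFrame({'number': [1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 4, 6, 6],
                   'time': [34, 33, 41, 36, 43, 22, 24, 32, 29, 28, 33, 32, 55, 51]})

# named aggregation sets the output column names directly
agg = df.groupby('number', as_index=False).agg(
    count=('time', 'count'),
    mean=('time', 'mean'),
)
print(agg)
#    number  count  mean
# 0       1      5  37.4
# 1       2      3  26.0
# 2       4      4  30.5
# 3       6      2  53.0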

Doing .diff() on pandas column(s) gives wrong output? [duplicate]

This question already has answers here:
Subtract consecutive columns in a Pandas or Pyspark Dataframe
(2 answers)
Closed 2 years ago.
I am trying to take the difference of a column using .diff() in a dataframe with a date column and a value column.
import pandas as pd
d = {'Date': ['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3, 4]}
df1 = pd.DataFrame(data=d)
df1.diff(axis=1)
Pandas gives me this output:
         Date  a
0  11/11/2011  2
1  11/12/2011  3
2  11/13/2011  4
This is just df1, not the difference. I expect the output to be:
         Date    a
0  11/11/2011  NaN
1  11/12/2011    1
2  11/13/2011    1
df1.set_index('Date').diff(axis=0) saves the day.
axis=1 means you are subtracting columns, not rows; your target result runs down the rows, so use axis=0 (the default) instead.
Second, it is not valid to subtract strings: pandas will raise an error, since Python does not support subtraction on str.
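Putting both points together, a minimal sketch of the fix using the question's data:

import pandas as pd

d = {'Date': ['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3, 4]}
df1 = pd.DataFrame(data=d)

# move the string column into the index, then diff down the rows
out = df1.set_index('Date').diff(axis=0)
print(out)
#               a
# Date
# 11/11/2011  NaN
# 11/12/2011  1.0
# 11/13/2011  1.0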

How to calculate the mean between multiple rows that match by a given id [duplicate]

This question already has answers here:
group by in group by and average
(3 answers)
Closed 4 years ago.
I want to calculate the mean of multiple rows that share a single matching value, and store the result in another csv file.
The given data is:
ID  salary  days_of_work  ...
1   2000    3             ...
1   1890    2             ...
1   2109    4             ...
2   .
2   .
2   .
2
3
3
...
...
And then obtain, in another file, one single row for every ID that contains the mean of the data in the other columns, like this:
ID salary days_of_work ...
1 1999.6667 3 ...
2 ...
3 ...
.
.
.
Update:
I tried to do this, but for a file that has utc_time instead of ID:
import pandas as pd
keep_col = ['utc_time','temperature','pressure','humidity','wind_direction','wind_speed/kph']
pd.read_csv('Gridpoints.csv', names=keep_col).to_csv("GridPoints/test.csv", index=False)
f=pd.read_csv("Gridpoints"+".csv")
df = f[keep_col]
df.groupby(['utc_time']).mean()
df.to_csv("GridPoints/test.csv", index=False)
So first I drop the columns I don't need, and then on the resulting dataframe I want to group by the utc_time column, but it doesn't do anything.
First you need to group by ID and then calculate the mean.
import pandas as pd
df = pd.read_csv('Book1.csv')
df1 = df.groupby(['ID'], as_index=False)[['Salary', 'days']].mean()
print(df1)
   ID       Salary  days
0   1  1999.666667   3.0
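The same idea applies to the utc_time file from the update; the key point is that groupby().mean() returns a new dataframe that must be assigned before saving. A sketch, assuming the file and column names from the question:

import pandas as pd

keep_col = ['utc_time', 'temperature', 'pressure', 'humidity',
            'wind_direction', 'wind_speed/kph']
df = pd.read_csv('Gridpoints.csv')[keep_col]

# assign the grouped result; numeric_only skips any non-numeric columns
out = df.groupby('utc_time', as_index=False).mean(numeric_only=True)
out.to_csv('GridPoints/test.csv', index=False)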

Expand Dataframe containing JSON object into larger dataframe

I have a dataframe in pandas with two columns. One is an ID and the other is a long JSON object with the same structure in every row. My goal is to create a column for each key in the JSON object.
Here is an example of the input
ID         request_json
175431467  {"Rate":"50","Groups":"7 months - 3 years"}
I'd like to expand this into a dataframe with three columns: ID, Rate, and Groups.
What's the best way to do this?
You can use the DataFrame constructor with join or concat:
import json
import pandas as pd

df = df[['ID']].join(pd.DataFrame(df['request_json'].apply(json.loads).values.tolist()))
print(df)
          ID              Groups Rate
0  175431467  7 months - 3 years   50
Or:
df = pd.concat([df['ID'],
                pd.DataFrame(df['request_json'].apply(json.loads).values.tolist())],
               axis=1)
print(df)
          ID              Groups Rate
0  175431467  7 months - 3 years   50
If the request_json column already holds parsed dicts rather than JSON strings, json_normalize can flatten them directly (note that pd.io.json.json_normalize and the 'r' alias are deprecated; current pandas uses pd.json_normalize and 'records'):
In [38]: pd.json_normalize(df.to_dict('records'))
Out[38]:
          ID request_json.Groups request_json.Rate
0  175431467  7 months - 3 years                50
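A self-contained sketch of the first approach, using the sample row from the question (the closing brace of the JSON snippet is assumed, since it was truncated above):

import json
import pandas as pd

df = pd.DataFrame({
    'ID': [175431467],
    'request_json': ['{"Rate":"50","Groups":"7 months - 3 years"}'],
})

# parse each JSON string into a dict, expand the dicts into columns,
# and join them back onto the ID column
expanded = df[['ID']].join(pd.json_normalize(df['request_json'].apply(json.loads).tolist()))
print(expanded)
#           ID Rate              Groups
# 0  175431467   50  7 months - 3 years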
