Expand Dataframe containing JSON object into larger dataframe - python

I have a dataframe in pandas with two columns. One is an ID and the other is a long JSON object whose structure is the same for every row. My goal is to create a column for each key in the JSON object.
Here is an example of the input
ID         request_json
175431467  {"Rate":"50","Groups":"7 months - 3 years"}
I'd like to expand this into a dataframe with three columns: ID, Rate, and Groups.
What's the best way to do this?

You can use the DataFrame constructor with join or concat:
import json
import pandas as pd

df = df[['ID']].join(pd.DataFrame(df['request_json'].apply(json.loads).values.tolist()))
print(df)
          ID              Groups Rate
0  175431467  7 months - 3 years   50
Or:
df = pd.concat([df['ID'],
                pd.DataFrame(df['request_json'].apply(json.loads).values.tolist())],
               axis=1)
print(df)
          ID              Groups Rate
0  175431467  7 months - 3 years   50

If request_json already holds parsed dicts rather than strings, json_normalize can flatten them directly (pd.json_normalize and to_dict('records') are the modern spellings of pd.io.json.json_normalize and to_dict('r')):
In [38]: pd.json_normalize(df.to_dict('records'))
Out[38]:
          ID request_json.Groups request_json.Rate
0  175431467  7 months - 3 years                50
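For reference, here is a minimal end-to-end sketch of the first approach on a self-contained copy of the question's data (the parsed and out names are my own):
import json
import pandas as pd

# a hypothetical one-row frame shaped like the question's data
df = pd.DataFrame({
    'ID': [175431467],
    'request_json': ['{"Rate":"50","Groups":"7 months - 3 years"}'],
})

# parse each JSON string into a dict, then expand the dicts into columns
parsed = pd.DataFrame(df['request_json'].apply(json.loads).tolist())
out = df[['ID']].join(parsed)
print(out)
#           ID Rate              Groups
# 0  175431467   50  7 months - 3 years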

How to calculate the average of a column where the row meets a certain condition in Pandas [duplicate]

This question already has answers here:
Pandas Groupby: Count and mean combined
(6 answers)
Closed last year.
Basically I have this DataFrame:
import pandas as pd
data = {'number': [1,1,1,1,1,2,2,2,4,4,4,4,6,6],
        'time': [34,33,41,36,43,22,24,32,29,28,33,32,55,51]}
df = pd.DataFrame(data)
print(df)
Output:
    number  time
0        1    34
1        1    33
2        1    41
3        1    36
4        1    43
5        2    22
6        2    24
7        2    32
8        4    29
9        4    28
10       4    33
11       4    32
12       6    55
13       6    51
And I want to transform the df, or create another one, so that there is a single row per unique 'number'; the 'time' column should hold the average of the records that shared that 'number'; and there should be a third column called 'count' showing how many records each 'number' had.
The output expected is (one row per number, its average time, and the record count):
   number  time  count
0       1  37.4      5
1       2  26.0      3
2       4  30.5      4
3       6  53.0      2
Thanks.
Simply use groupby + agg:
agg = df.groupby('number')['time'].agg(['count', 'mean']).reset_index()
Output:
>>> agg
   number  count  mean
0       1      5  37.4
1       2      3  26.0
2       4      4  30.5
3       6      2  53.0
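If you want the averaged column to keep the name time and the count column to be called count, as the question describes, named aggregation is one way to do it (a sketch over the same df; the out name is my own):
out = df.groupby('number', as_index=False).agg(
    time=('time', 'mean'),    # average time per number
    count=('time', 'count'),  # how many records each number had
)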

create a new dataframe based on given dataframe [duplicate]

This question already has answers here:
Group dataframe and get sum AND count?
(4 answers)
Closed 1 year ago.
I have a table that looks like this:
user id  observation
25       2
25       3
25       2
23       1
23       3
The desired outcome is:
user id  observation  retention
25       7            3
23       4            2
I want to keep a single row per unique user id, sum that id's values in the observation column, and add a retention column showing how many times the id appeared in the dataset.
Any help will be appreciated, thanks.
Use the groupby() method and chain the agg() method to it:
outputdf = df.groupby('user id', as_index=False).agg(
    observation=('observation', 'sum'),
    retention=('observation', 'count'),
)
Now if you print outputdf you will get your desired output:
   user id  observation  retention
0       23            4          2
1       25            7          3
You have to use group by:
import pandas as pd

d = {'user id': [25, 25, 25, 23, 23], 'observation': [2, 3, 2, 1, 3]}
# get the dataframe
df = pd.DataFrame(data=d)

# aggregate with a list (not a set) so the column order is deterministic
df_new = df.groupby('user id')['observation'].agg(['sum', 'count']).reset_index()
# rename the columns as you desire
df_new.columns = ['user id', 'observation', 'retention']
df_new
Output:
   user id  observation  retention
0       23            4          2
1       25            7          3

Doing .diff() on pandas column(s) gives wrong output? [duplicate]

This question already has answers here:
Subtract consecutive columns in a Pandas or Pyspark Dataframe
(2 answers)
Closed 2 years ago.
I am trying to take the difference of a column using .diff() in a dataframe with a date column and a value column.
import pandas as pd
d = {'Date': ['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3, 4]}
df1 = pd.DataFrame(data=d)
df1.diff(axis=1)
Pandas gives me this output:
         Date  a
0  11/11/2011  2
1  11/12/2011  3
2  11/13/2011  4
which is just df1 itself, not the difference. I expect the output to be:
         Date    a
0  11/11/2011  NaN
1  11/12/2011    1
2  11/13/2011    1
df1.set_index('Date').diff(axis=0) saves the day
axis=1 means you are subtracting columns, not rows; your target result is row-wise, so use axis=0 (the default) instead.
Second, it is not valid to subtract strings like the Date values: Python does not support - between str objects, so that would raise an error.
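A minimal sketch of the row-wise fix on the question's data (the a_diff column name is my own):
import pandas as pd

d = {'Date': ['11/11/2011', '11/12/2011', '11/13/2011'], 'a': [2, 3, 4]}
df1 = pd.DataFrame(data=d)

# diff down the rows of the numeric column only; the first row has no
# predecessor, so it comes out as NaN
df1['a_diff'] = df1['a'].diff()
print(df1)
#          Date  a  a_diff
# 0  11/11/2011  2     NaN
# 1  11/12/2011  3     1.0
# 2  11/13/2011  4     1.0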

Concatenate multiple pandas groupby outputs

I would like to make multiple .groupby() operations on different subsets of a given dataset and bind them all together. For example:
import pandas as pd
df = pd.DataFrame({"ID":[1,1,2,2,2,3],"Subset":[1,1,2,2,2,3],"Value":[5,7,4,1,7,8]})
print(df)
   ID  Subset  Value
0   1       1      5
1   1       1      7
2   2       2      4
3   2       2      1
4   2       2      7
5   3       3      8
I would then like to concatenate the following objects and store the result in a pandas data frame:
gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"]).mean()
# Why do gr1 and gr2 have column names in different rows?
I realize that df.groupby(["ID","Subset"]).mean() would give me the concatenated object I'm looking for. Just bear with me, this is a reduced example of what I'm actually dealing with.
I think the solution could be to transform gr1 and gr2 to pandas data frames and then concatenate them like I normally would.
In essence, my questions are the following:
How do I convert a groupby result to a data frame object?
In case this can be done without transforming the series to data frames, how do you bind two groupby results together and then transform that to a pandas data frame?
PS: I come from an R background, so to me it's odd to group a data frame by something and have the output return as a different type of object (series or multi index data frame). This is part of my question too: why does .groupby return a series? What kind of series is this? How come a series can have multiple columns and an index?
The result in your example is not a Series but a DataFrame indexed by a pandas MultiIndex (the ID and Subset levels). To return a flat dataframe from a single aggregation function over a single value column, you can use the following. Note the inclusion of as_index=False.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr1
   ID  Subset  Value
0   1       1      6
This however won't work if you wish to aggregate with multiple functions at once. If you wish to avoid using df.groupby(["ID","Subset"]).mean(), then you can use the following for your example.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"], as_index=False).mean()
>>> pd.concat([gr1, gr2]).reset_index(drop=True)
   ID  Subset  Value
0   1       1      6
1   2       2      4
If you're only concerned with dealing with a specific subset of rows, the following could be applicable, since it removes the necessity to concatenate results.
>>> values = [1, 2]
>>> df[df['Subset'].isin(values)].groupby(["ID","Subset"], as_index=False).mean()
   ID  Subset  Value
0   1       1      6
1   2       2      4
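On the PS: grouping by multiple keys and calling .mean() returns a DataFrame whose index is a MultiIndex; selecting a single column first gives a Series with that same MultiIndex. Either way, reset_index() turns the result back into a flat dataframe, as in this quick sketch:
gr = df.groupby(["ID", "Subset"]).mean()   # DataFrame with a (ID, Subset) MultiIndex
flat = gr.reset_index()                    # index levels become ordinary columns

s = df.groupby(["ID", "Subset"])["Value"].mean()  # a Series with the same MultiIndex
flat_s = s.reset_index()                          # also back to a flat DataFrame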

How to calculate the mean between multiple rows that match by a given id [duplicate]

This question already has answers here:
group by in group by and average
(3 answers)
Closed 4 years ago.
I want to calculate the mean of multiple rows that share a single matching value and store it in another csv file.
The given data is:
ID salary days_of_work ...
1 2000 3 ...
1 1890 2 ...
1 2109 4 ...
2 .
2 .
2 .
2
3
3
...
And then obtain in another file, for every ID, one single row that contains the mean of the data in the other columns, like this:
ID salary days_of_work ...
1 1999.6667 3 ...
2 ...
3 ...
.
.
.
Update:
I tried to do this, but for a file that has utc_time instead of ID:
import pandas as pd
keep_col = ['utc_time','temperature','pressure','humidity','wind_direction','wind_speed/kph']
pd.read_csv('Gridpoints.csv', names=keep_col).to_csv("GridPoints/test.csv", index=False)
f=pd.read_csv("Gridpoints"+".csv")
df = f[keep_col]
df.groupby(['utc_time']).mean()
df.to_csv("GridPoints/test.csv", index=False)
So first I keep only the columns I need, and then on the resulting dataframe I try to group by the utc_time column, but it doesn't do anything.
First you need to group by ID and then calculate the mean. Note that in your attempt the result of df.groupby(['utc_time']).mean() is never assigned to a variable, so the ungrouped df is what gets written to the csv.
import pandas as pd
df = pd.read_csv('Book1.csv')
df1 = df.groupby(['ID'], as_index=False)[['salary', 'days_of_work']].mean()
print(df1)
   ID       salary  days_of_work
0   1  1999.666667           3.0
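For the utc_time variant in the update, a minimal sketch, assuming Gridpoints.csv has a header row containing these columns and that the remaining columns are numeric (the file and column names are taken from the question):
import pandas as pd

keep_col = ['utc_time', 'temperature', 'pressure', 'humidity',
            'wind_direction', 'wind_speed/kph']

df = pd.read_csv('Gridpoints.csv')[keep_col]

# groupby().mean() returns a new frame; assign it before writing the csv
means = df.groupby('utc_time', as_index=False).mean()
means.to_csv('GridPoints/test.csv', index=False)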
