This question already has answers here: How to access pandas groupby dataframe by key (6 answers). Closed 8 years ago.
I want to group a dataframe by a column called 'A' and inspect a particular group.
grouped = df.groupby('A', sort=False)
However, I don't know how to access a group. For example, I expected that
grouped.first()
would give me the first group, or that
grouped['foo']
would give me the group where A == 'foo'.
However, pandas doesn't work like that.
I couldn't find a similar example online.
Try grouped.get_group('foo'); that is what you need.
from io import StringIO  # on Python 2.x use: from StringIO import StringIO
import pandas
data = pandas.read_csv(StringIO("""\
area,core,stratum,conc,qual
A,1,a,8.40,=
A,1,b,3.65,=
A,2,a,10.00,=
A,2,b,4.00,ND
A,3,a,6.64,=
A,3,b,4.96,=
"""), index_col=[0,1,2])
groups = data.groupby(level=['area', 'stratum'])
groups.get_group(('A', 'a')) # make sure it's a tuple
                    conc qual
area core stratum
A    1    a         8.40    =
     2    a        10.00    =
     3    a         6.64    =
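To address the original question's simpler setup directly, here is a minimal sketch on a hypothetical df with a column 'A' containing the value 'foo'. Note that grouped.first() returns the first row of every group, not the first group; to get the first group itself, iterate over the (key, subframe) pairs:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [1, 2, 3]})
grouped = df.groupby('A', sort=False)

# Fetch the group whose key is 'foo'
print(grouped.get_group('foo'))

# Iterating a GroupBy yields (key, subframe) pairs, so the first
# group is simply the first item of that iterator:
first_key, first_group = next(iter(grouped))
print(first_key)
print(first_group)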
This question already has answers here: How do I create variable variables? (17 answers). Closed 4 months ago.
I have a df in python with different cities.
I am trying to create a df for each city.
So I wrote this code in Python and it works; it does what I need. But I was wondering whether there is any other way to create the name of each df rather than using
globals()["df_"+str(ciudad)] = new_grouped_by
If I try this:
"df_"+str(ciudad) = new_grouped_by
it gives me this error: SyntaxError: can't assign to operator
Any tips/suggestions would be more than welcome!
def get_city():
    for ciudad in df["Ciudad"].unique():
        # print(ciudad)
        grouped_by = df.groupby('Ciudad')
        new_grouped_by = [grouped_by.get_group(ciudad) for i in grouped_by.groups]
        globals()["df_" + str(ciudad)] = new_grouped_by

get_city()
A simple way would be to store the dataframes in a dictionary with the city names as keys:
import pandas as pd
data = zip(['Amsterdam', 'Amsterdam', 'Barcelona'],[1,22,333])
df = pd.DataFrame(data, columns=['Ciudad', 'data'])
new_dfs = dict(list(df.groupby('Ciudad')))
Calling new_dfs['Amsterdam'] will then give you the dataframe:
      Ciudad  data
0  Amsterdam     1
1  Amsterdam    22
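A small usage sketch of why the dictionary is the better container: you can loop over every city and its dataframe in one pass, which is exactly what the globals() approach makes awkward:

for city, city_df in new_dfs.items():
    print(city, len(city_df))  # city name and the number of rows in its frame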
This question already has answers here: Split / Explode a column of dictionaries into separate columns with pandas (13 answers). Closed 9 months ago.
So here's my simple example (the JSON field in my actual dataset is deeply nested, so I'm unpacking things one level at a time). I need to keep certain columns from the dataset after json_normalize().
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
(The original post included screenshots of the starting dataframe, the expected result mocked up in Excel, and the actual output; the code below reproduces the data.)
import json
import pandas as pd
d = {'report_id': [100, 101, 102], 'start_date': ["2021-03-12", "2021-04-22", "2021-05-02"],
'report_json': ['{"name":"John", "age":30, "disease":"A-Pox"}', '{"name":"Mary", "age":22, "disease":"B-Pox"}', '{"name":"Karen", "age":42, "disease":"C-Pox"}']}
df = pd.DataFrame(data=d)
display(df)
df = pd.json_normalize(df['report_json'].apply(json.loads), max_level=0, meta=['report_id', 'start_date'])
display(df)
Looking at the documentation on json_normalize(), I think the meta parameter is what I need to keep report_id and start_date, but it doesn't seem to work: the fields I expect to keep do not appear in the final dataset.
Does anyone have advice? Thank you.
Since you're dealing with a pretty simple JSON column alongside a structured index, you can just normalize your frame and then use .join to join the result back along your index.
from ast import literal_eval
df.join(
    pd.json_normalize(df['report_json'].map(literal_eval))
).drop('report_json', axis=1)
report_id start_date name age disease
0 100 2021-03-12 John 30 A-Pox
1 101 2021-04-22 Mary 22 B-Pox
2 102 2021-05-02 Karen 42 C-Pox
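Since the question already imports json, an equivalent sketch using json.loads instead of literal_eval (both parse these simple strings; literal_eval is handy when the column holds Python dict literals rather than strict JSON):

df.join(
    pd.json_normalize(df['report_json'].apply(json.loads))
).drop(columns='report_json')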
This question already has answers here: Apply function to each row of pandas dataframe to create two new columns (5 answers) and How to add multiple columns to pandas dataframe in one assignment? (13 answers). Closed 3 years ago.
I am trying to create multiple new dataframe columns using a function. When I run the simple code below, however, I get the error KeyError: "['AdjTime1' 'AdjTime2'] not in index".
How can I correct this to add the two new columns ('AdjTime1' & 'AdjTime2') to my dataframe?
Thanks!
import pandas as pd

df = pd.DataFrame({'Runner': ['Wade', 'Brian', 'Jason'], 'Time': [80, 75, 98]})

def adj_speed(row):
    adjusted_speed1 = row['Time']*1.5
    adjusted_speed2 = row['Time']*2.0
    return adjusted_speed1, adjusted_speed2

df[['AdjTime1','AdjTime2']] = df.apply(adj_speed, axis=1)
Just do something like this (assuming you have a list of values you want to multiply Time by):
l = [1.5, 2.0]
for e, i in enumerate(l):
    df['AdjTime'+str(e+1)] = df.Time*i
print(df)
Runner Time AdjTime1 AdjTime2
0 Wade 80 120.0 160.0
1 Brian 75 112.5 150.0
2 Jason 98 147.0 196.0
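If you want to keep the row-wise function from the question, the original assignment can also be made to work. A sketch using apply's result_type parameter, which expands a returned tuple into separate columns:

# result_type='expand' turns each returned tuple into two columns
df[['AdjTime1', 'AdjTime2']] = df.apply(adj_speed, axis=1, result_type='expand')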
This question already has an answer here: Renaming columns when using resample (1 answer). Closed 5 years ago.
The line of code below takes columns that represent each month's total sales and averages the sales by quarter.
mdf = tdf[sel_cols].resample('3M',axis=1).mean()
What I need to do is title the columns with a str (I cannot use the pandas Period type).
I am attempting to use the following code, but I cannot get it to work.
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year, [1, 2, 3, 4][x.quarter==1]))
I want the columns to read... 2000q1, 2000q2, 2000q3, 2000q4, 2001q1,... etc, but keep getting wrong things like 2000q1, 2000q1, 2000q1, 2000q2, 2001q1.
How can I use the .format method to make this work properly?
The easiest way is to use the .quarter attribute of each datetime column label directly, like so:
mdf = tdf[sel_cols].resample('3M',axis=1).mean().rename(columns=lambda x: '{:}q{:}'.format(x.year,x.quarter))
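A minimal sketch of the rename itself, on hypothetical quarter-end column labels standing in for the output of the resample step:

import pandas as pd
import numpy as np

# Hypothetical quarter-end timestamps as column labels
cols = pd.to_datetime(['2000-03-31', '2000-06-30', '2000-09-30', '2000-12-31'])
mdf = pd.DataFrame(np.random.rand(2, 4), columns=cols)

# Each label is a Timestamp, so .year and .quarter are plain attributes
mdf = mdf.rename(columns=lambda x: '{}q{}'.format(x.year, x.quarter))
print(mdf.columns.tolist())  # ['2000q1', '2000q2', '2000q3', '2000q4']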
This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
I have a dictionary of pandas dataframes; each frame contains timestamps and the market caps corresponding to those timestamps. The keys of the dictionary are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and, using the pd.merge method, merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin],
                               left_on='timestamp', right_on='timestamp')
but this duplicates the 'dogecoin' data in 'merged'; however, if data2['merged'] is not initialized to data2['dogecoin'] (or some similar data), the merge won't work, because there are no existing values in 'merged' to merge against.
EDIT: my desired result is to create one merged dataframe, stored as a new element data2['merged'] in the dictionary, containing the merged data frames from the other elements of data2.
Try seeding data2['merged'] with an actual named dataframe, then merging the rest in; you must begin with at least the first frame, and skip it in the loop so it isn't merged with itself:
data2['merged'] = data2['dashcoin']

# leave out the first element
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes; it's just about how to write a loop when the first element has to be treated differently from the rest.
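For completeness, the first-element special-casing can be avoided entirely by folding pd.merge over the list with functools.reduce. A minimal sketch, assuming each data2[coin] has a 'timestamp' column:

from functools import reduce
import pandas as pd

# reduce() merges the frames pairwise left-to-right,
# so no seed frame or coins[1:] slicing is needed
data2['merged'] = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='inner'),
    [data2[coin] for coin in coins],
)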