I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...
My dataframe consists of all United States and territories for years 2013-2016 with several attributes.
>>> df.head(2)
state enrollees utilizing enrol_age65 util_age65 year
1 Alabama 637247 635431 473376 474334 2013
2 Alaska 30486 28514 21721 20457 2013
>>> df.tail(2)
state enrollees utilizing enrol_age65 util_age65 year
214 Puerto Rico 581861 579514 453181 450150 2016
215 U.S. Territories 24329 16979 22608 15921 2016
I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.
Desired Output:
enrollees utilizing
year state
2013 California 3933310 3823455
New York 3133980 3002948
Florida 2984799 2847574
...
2016 California 4516216 4365896
Florida 4186823 3984756
New York 4009829 3874682
So far I've tried the following:
df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)
Which yields just the first 3 rows in the GroupBy object:
enrollees utilizing
year state
2013 Alabama 637247 635431
Alaska 30486 28514
Arizona 707683 683273
I've also tried a lambda function:
df.groupby(['year','state'])['enrollees','utilizing']\
.apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
Which yields the absolute largest 3 in the GroupBy object:
enrollees utilizing
year state
2016 California 4516216 4365896
2015 California 4324304 4191704
2014 California 4133532 4011208
I think it may have to do with the indexing of the GroupBy object, but I am not sure...Any guidance would be appreciated!
Well, you could do something not that pretty.
First, get a list of the unique years using set():
years_list = list(set(df.year))
Create a dummy dataframe and a concatenation helper function that I've used in the past:
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoids retyping the same concat line of code for every df.
    The parameters are the temporary df created at each loop iteration and the
    concatenated df that will contain all values, which must first be
    initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Create the dummy final df:
df_final = pd.DataFrame()
Now loop over each year, concatenating into the new DF:
for year in years_list:
    # query() filters the rows; @year refers to the external variable,
    # in this case the year coming from the loop
    df2 = df.query("year == @year")
    # temporary DF with only that year: sum, sort, and keep the top 3
    df_temp = df2.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum() \
        .sort_values(by='enrollees', ascending=False).head(3)
    # finally, call our function, which keeps concatenating the temporary DFs
    df_final = concatenate_loop_dfs(df_temp, df_final)
and done.
print(df_final)
You then need to sort your GroupBy result with .sort_values('enrollees', ascending=False).
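A more concise alternative (a minimal sketch, assuming the df from the question): sum per (year, state), then keep the 3 largest rows within each year with nlargest:
sums = df.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum()
top3 = sums.groupby(level='year', group_keys=False).apply(
    lambda g: g.nlargest(3, 'enrollees'))
print(top3)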
Related
This is an example of my dataframe. I would like to apply the groupby function, but I get the following output:
Example dataframe:
x sampling time y
1 morning 19
2 morning 19.1
3 morning 20
1 midday 17
2 midday 18
3 midday 19
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11F3DEB0>
My code:
file = pd.read_excel(r'C:/Users/...
r'2021_Licor_sinusoidal.xlsx')
df = file[['x', 'Sampling time', 'Y']].copy()
df.columns = ['x', 'Sampling time', 'Y']
grouped = df.groupby(['Sampling time'])
print(grouped)
Thank you in advance
Pandas' groupby is lazy in nature, which means that it doesn't return any actual data until you specifically ask for it. As mentioned in the comments, it returns a DataFrameGroupBy object, which is iterable and can be used with built-in aggregation methods like sum() and count(). This tutorial will give you more insight regarding the operation and the way it works.
In a little more detail:
by_state = df.groupby("state")
print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>
Here you get the 'strange' object/output you mentioned, since groupby delays the split-apply-combine process until you invoke a method on the grouped data, as mentioned above. Since the object is iterable, you can of course iterate over it.
for state, frame in by_state:
print(f"First 2 entries for {state!r}")
print("------------------------")
print(frame.head(2), end="\n\n")
# First 2 entries for 'AK'
# ------------------------
# last_name first_name birthday gender type state party
# 6619 Waskey Frank 1875-04-20 M rep AK Democrat
# 6647 Cale Thomas 1848-09-17 M rep AK Independent
# First 2 entries for 'AL'
# ------------------------
# last_name first_name birthday gender type state party
# 912 Crowell John 1780-09-18 M rep AL Republican
# 991 Walker John 1783-08-12 M sen AL Republican
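Back to the data in the question: calling an aggregation method on the grouped object is what actually materialises a result (a small sketch, using the column names from the question):
grouped = df.groupby('Sampling time')
print(grouped['Y'].mean())   # mean of Y per sampling time
print(grouped['Y'].count())  # number of observations per sampling time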
I need to merge data from one dataframe onto another.
The main dataframe consists of survey answers with a year, month, and region variable.
The data I need to merge onto this is the weather data for that specific month. This data is stored in my second data frame for weather stations with a year variable, a temperature average variable for each month (eg. value1, value2, ... value12), and a region variable.
I've tried to merge the two dataframes on region and year, and my plan was then afterwards to select the average temperature variable which coincides with the survey.
df1
---------------------------
year month regions
2002 january Pais Vasco
2002 february Pais Vasco
2003 march Pais Vasco
2002 november Florida
2003 december Florida
... ... ...
---------------------------
df2
-----------------------------------------------
year value1 value2 ... value12 regions
2002 10 11 ... 9 Pais Vasco
2003 11 11 ... 10 Pais Vasco
2004 12 11 ... 10 Pais Vasco
2002 11 11 ... 9 Florida
2003 10 11 ... 9 Florida
-----------------------------------------------
So in this example I need for my first survey observation to get the corresponding temperature (value1) data from the region Pais Vasco and year 2002.
When I tried to merge with
df_merged = pd.merge(df1, df2, how = "left", on =["regions", "year"])
I just get a dataframe with way more observations than my original survey dataframe.
I would convert this data to tidy format first. Assuming value1, value2, etc. correspond to one value per month, use pd.wide_to_long to turn df2 into long (tidy) format and then merge.
tidy = pd.wide_to_long(df, stubnames=['value'], i=['year', 'region'], j='month', sep='') \
.reset_index()
You need to normalise your months so that they are represented the same way in both frames (e.g. all integers or all month names). How you do this is outside the scope of this answer.
Then,
df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')
If this raises an error, then you have multiple observations for the same ['year', 'month', 'region'] key. Fix that by dropping duplicates. How you do so is almost certainly based heavily on your data.
sobek noticed that you have a typo, saying 'regions' rather than 'region' in your merge command. Make sure you're referring to columns that actually exist.
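Putting it together, a minimal sketch under the assumption that df1 stores lowercase English month names and that the shared key column is called regions as in the sample data (adjust the names to whatever your frames actually use):
import pandas as pd

# reshape df2: value1..value12 -> one row per (year, regions, month number)
tidy = pd.wide_to_long(df2, stubnames=['value'], i=['year', 'regions'],
                       j='month', sep='').reset_index()

# map df1's month names to the same integer representation
month_map = {m: i for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'], start=1)}
df1['month'] = df1['month'].str.lower().map(month_map)

# many survey rows may share a key, but the weather side must be unique
merged = df1.merge(tidy, on=['year', 'month', 'regions'], how='left',
                   validate='m:1')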
I have produced some data which lists parks in proximity to different areas of East London with use of the FourSquare API. It here in the dataframe, df.
Location,Parks,Borough
Aldborough Hatch,Fairlop Waters Country Park,Redbridge
Ardleigh Green,Haynes Park,Havering
Bethnal Green,"Haggerston Park, Weavers Fields",Tower Hamlets
Bromley-by-Bow,"Rounton Park, Grove Hall Park",Tower Hamlets
Cambridge Heath,"Haggerston Park, London Fields",Tower Hamlets
Dalston,"Haggerston Park, London Fields",Hackney
Import data with df = pd.read_clipboard(sep=',')
What I would like to do is group by the borough column and count the distinct parks in that borough so that for example 'Tower Hamlets' = 5 and 'Hackney' = 2. I will create a new dataframe for this purpose which simply lists total number of parks for each borough present in the dataframe.
I know I can do:
df.groupby(['Borough', 'Parks']).size()
But I need to split parks by the delimiter ',' such that they are treated as unique, distinct entities for a borough.
What do you suggest?
Thanks!
The first rule of data science is to clean your data into a useful format.
Reformat the DataFrame to be usable:
df.Parks = df.Parks.str.split(',\s*') # per user piRSquared
df = df.explode('Parks') # pandas v 0.25
Now the DataFrame is in a proper format that can be more easily analyzed
df.groupby('Borough').Parks.nunique()
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
That's three lines of code, but now the DataFrame is in a useful format, upon which more insights can easily be extracted.
Plot
df.groupby(['Borough']).Parks.nunique().plot(kind='bar', title='Unique Parks Counts by Borough')
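Since the question mentions building a new dataframe of totals per borough, the counts can be turned back into a regular DataFrame (a small sketch, reusing the exploded df from above):
park_counts = (df.groupby('Borough').Parks.nunique()
                 .reset_index(name='park_count'))
print(park_counts)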
If you are using Pandas 0.25 or greater, consider the answer from Trenton_M
His answer provides a good suggestion for creating a more useful data set.
IIUC:
df.groupby('Borough').Parks.apply(
lambda s: len(set(', '.join(s).split(', ')))
)
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64
Similar
df.Parks.str.split(', ').groupby(df.Borough).apply(lambda s: len(set().union(*s)))
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64
I have a dataframe of this form. However, In my final dataframe, I'd like to only get a dataframe that has unique values per year.
Name Org Year
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
6 Babson College doclist[5] 2008
So ideally, my dataframe will look like this instead
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
What I've done so far. I've used groupby by year, and I seem to be able to get the unique names by year. However, I am stuck because I lose all the other information, such as the "Org" column. Advice appreciated!
#how to get unique rows per year?
q = z.groupby(['Year'])
#print q.head()
#q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
For this I get the following output. How do I include the other column information as well as remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008 6 Babson College
68 European Economic And Social Committee
66 European Union
72 Ewing Marion Kauffman Foundation
If all you want to do is keep the first entry for each name, you can use drop_duplicates on the original DataFrame. Note that this will keep the first entry based on however your data is sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: z.drop_duplicates(subset='Name')
Out[98]:
Name Org Year
0 New York University doclist[1] 2004
1 Babson College doclist[2] 2008
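If you instead want uniqueness per year (so the same name could appear once in 2004 and again in 2008), include Year in the subset; a small sketch, assuming the frame is named z as in the question:
unique_per_year = z.drop_duplicates(subset=['Year', 'Name'])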
Suppose I have a dataframe indexed by monthly timesteps. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data into yearly data and apply other operations. Is there some way I could quickly group them, say, by decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have an index attribute like .year to group on directly, you can still pass a lambda that extracts the decade from each index label.
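For instance, on the same df (each index label, here a Timestamp, is passed to the lambda):
>>> df.groupby(lambda ts: (ts.year // 10) * 10).sum()   # same result as above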
If your DataFrame has headers, say DataFrame['Population', 'Salary', 'vehicle count'],
make Year your index: DataFrame = DataFrame.set_index('Year')
then use the code below to resample the data into decades of 10 years; it also gives you the sum of all the other columns within each decade (note that the index must be datetime-like for resample to work):
dataframe = dataframe.resample('10AS').sum()
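A minimal sketch, reusing a monthly DatetimeIndex like the one built in the earlier answer; note that the '10AS' bins start at the first year in the data (here 2001) rather than at calendar decades:
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2001', periods=500, freq='M')
df = pd.DataFrame({'A': 5 * np.arange(len(dates)) + 2}, index=dates)

# bins are [2001, 2011), [2011, 2021), ...
print(df.resample('10AS').sum())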
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group it up with
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc[:, 0] here chooses the first column in your dataframe.
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases