Merging two dataframes with different structure using pandas - python

I need to merge data from one dataframe onto another.
The main dataframe consists of survey answers with a year, month, and region variable.
The data I need to merge onto this is the weather data for that specific month. This data is stored in my second dataframe of weather stations, with a year variable, an average temperature variable for each month (e.g. value1, value2, ..., value12), and a region variable.
I've tried to merge the two dataframes on region and year; my plan was then to select the average temperature variable that coincides with the survey month.
df1
---------------------------
year month regions
2002 january Pais Vasco
2002 february Pais Vasco
2003 march Pais Vasco
2002 november Florida
2003 december Florida
... ... ...
---------------------------
df2
-----------------------------------------------
year value1 value2 ... value12 regions
2002 10 11 ... 9 Pais Vasco
2003 11 11 ... 10 Pais Vasco
2004 12 11 ... 10 Pais Vasco
2002 11 11 ... 9 Florida
2003 10 11 ... 9 Florida
-----------------------------------------------
So in this example I need for my first survey observation to get the corresponding temperature (value1) data from the region Pais Vasco and year 2002.
When I tried to merge with
df_merged = pd.merge(df1, df2, how = "left", on =["regions", "year"])
I just get a dataframe with way more observations than my original survey dataframe.

I would convert this data to tidy format. Assuming value1, value2, etc. correspond to value and month, use pd.wide_to_long to reshape the weather frame into long (tidy) format, then merge.
tidy = pd.wide_to_long(df2, stubnames=['value'], i=['year', 'region'], j='month', sep='') \
         .reset_index()
You need to normalise your months so that both frames use the same representation (e.g. all month numbers or all month names). How you do this is outside the scope of this answer.
Then,
df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')
If this raises an error, then you have multiple observations for the same ['year', 'month', 'region'] key. Fix that by dropping duplicates. How you do so almost certainly depends heavily on your data.
sobek noticed that you have a typo, saying 'regions' rather than 'region' in your merge command. Make sure you're referring to columns that actually exist.
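Putting the steps together, here is a minimal sketch of the whole flow. It assumes the wide weather frame is df2 with columns value1 through value12 and regions (the column name from the question's sample data), that the survey months in df1 are lowercase English month names, and that the month_map dictionary below is just one hypothetical way to normalise the months:
import pandas as pd

# Hypothetical month-name -> month-number mapping, used only to normalise the
# survey months; adapt it to however your months are actually stored.
month_map = {'january': 1, 'february': 2, 'march': 3, 'april': 4,
             'may': 5, 'june': 6, 'july': 7, 'august': 8,
             'september': 9, 'october': 10, 'november': 11, 'december': 12}

# Reshape the wide weather frame (value1..value12) into one row per year/region/month.
tidy = (pd.wide_to_long(df2, stubnames=['value'], i=['year', 'regions'], j='month', sep='')
          .reset_index())

# Put the survey months on the same (numeric) representation.
df1 = df1.assign(month=df1['month'].str.lower().map(month_map))

# Attach exactly one weather value to each survey row; validate='m:1' checks that
# the weather side is unique per (year, month, regions).
df_merged = df1.merge(tidy, on=['year', 'month', 'regions'], how='left', validate='m:1')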

Related

How to convert "event" data into country-year data by summating information in columns? Using python/pandas

I am trying to convert a dataframe in which each row is a specific event and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year. In this data set each event is an occurrence of terrorism, and I want to sum the columns nkill, nhostages, and nwounded per year. The data set covers 16 countries in West Africa over the years 2000-2020, with roughly 8000 events recorded in total. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID   iyear  country_txt  nkill  nwounded  nhostages
10000102  2000   Nigeria      3      10        0
10000103  2000   Mali         1      3         15
10000103  2000   Nigeria      15     0         0
10000103  2001   Benin        1      0         0
10000103  2001   Nigeria      1      3         15
...
And I would like it to look like this:
country_txt  iyear  total_nkill  total_nwounded  total_nhostages
Nigeria      2000   200          300             300
Nigeria      2001   250          450             15
So basically, I want to add up nkill, nwounded, and nhostages for each country-year group, so that I end up with a list of all the countries and years together with the total deaths, injuries, and hostages taken per year. The countries also have an associated numeric code (the column is just "country") if it is easier to write the code with a number instead of country_txt.
For a solution, I've been looking at the pandas groupby function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like the melt or pivot functions could be helpful.
This simplified example shows how you could use groupby -
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
                   'year': [2000, 2000, 2001, 2000],
                   'events1': [3, 4, 5, 2],
                   'events2': [1, 6, 3, 4]})
df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)
which gives the total of each type of event by country and by year
              events1  events2
country year
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3
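Applied to the column names from the question, the same pattern would look roughly like this (a sketch assuming the frame is called df and has the columns shown above; the total_* names are just the desired output labels):
totals = (df.groupby(['country_txt', 'iyear'])[['nkill', 'nwounded', 'nhostages']]
            .sum()
            .reset_index()
            .rename(columns={'nkill': 'total_nkill',
                             'nwounded': 'total_nwounded',
                             'nhostages': 'total_nhostages'}))
print(totals.head())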

How to produce a new data frame of mean monthly data, given a data frame consisting of daily data?

I have a data frame containing the daily CO2 data since 2015, and I would like to produce the monthly mean data for each year, then put this into a new data frame. A sample of the data frame I'm using is shown below.
month day cycle trend
year
2011 1 1 391.25 389.76
2011 1 2 391.29 389.77
2011 1 3 391.32 389.77
2011 1 4 391.36 389.78
2011 1 5 391.39 389.79
... ... ... ... ...
2021 3 13 416.15 414.37
2021 3 14 416.17 414.38
2021 3 15 416.18 414.39
2021 3 16 416.19 414.39
2021 3 17 416.21 414.40
I plan on using something like the code below to create the new monthly-mean data frame, but the main problem I'm having is selecting the specific subset for each month of each year so that the mean can then be taken over it. If I could pick out all of year 2015 for month 1 and then average it, and so on, that might work?
Any suggestions would be hugely appreciated, and if I need to make any edits please let me know, thanks so much!
dfs = list()
for l in L:
    dfs.append(refined_data[index = 2015, "month" = 1. day <=31].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
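A groupby over year and month is the usual way to avoid the manual subsetting. A minimal sketch, assuming the frame is called refined_data, that year is its index, and that month, cycle, and trend are columns as in the sample above:
import pandas as pd

# Average the daily values within each (year, month) group.
monthly_means = (refined_data
                 .reset_index()                   # turn the 'year' index into a column
                 .groupby(['year', 'month'])[['cycle', 'trend']]
                 .mean()
                 .reset_index())
print(monthly_means.head())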

Cumulative sum of all previous values

A similar question has been asked for cumsum and grouping but it didn't solve my case.
I have a financial balance sheet of a lot of years and need to sum all previous values by year.
This is my reproducible set:
df = pd.DataFrame(
    {"Amount": [265.95, 2250.00, -260.00, -2255.95, 120],
     "Year": [2018, 2018, 2018, 2019, 2019]})
The result I want is the following:
Year Amount
2017 0
2018 2255.95
2019 120.00
2020 120.00
So actually in a loop going from the lowest year in my whole set to the highest year in my set.
...
df[df.Year<=2017].Amount.sum()
df[df.Year<=2018].Amount.sum()
df[df.Year<=2019].Amount.sum()
df[df.Year<=2020].Amount.sum()
...
The first step is an aggregated sum per year; then use Series.cumsum and Series.reindex over all possible years with forward filling of missing values, and finally replace the leading missing values with 0:
years = range(2017, 2021)
df1 = (df.groupby('Year')['Amount']
         .sum()
         .cumsum()
         .reindex(years, method='ffill')
         .fillna(0)
         .reset_index())
print(df1)
Year Amount
0 2017 0.00
1 2018 2255.95
2 2019 120.00
3 2020 120.00

Pandas Groupby Multiple Columns - Top N

I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...
My dataframe consists of all United States and territories for years 2013-2016 with several attributes.
>>> df.head(2)
state enrollees utilizing enrol_age65 util_age65 year
1 Alabama 637247 635431 473376 474334 2013
2 Alaska 30486 28514 21721 20457 2013
>>> df.tail(2)
state enrollees utilizing enrol_age65 util_age65 year
214 Puerto Rico 581861 579514 453181 450150 2016
215 U.S. Territories 24329 16979 22608 15921 2016
I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.
Desired Output:
enrollees utilizing
year state
2013 California 3933310 3823455
New York 3133980 3002948
Florida 2984799 2847574
...
2016 California 4516216 4365896
Florida 4186823 3984756
New York 4009829 3874682
So far I've tried the following:
df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)
Which yields just the first 3 rows in the GroupBy object:
enrollees utilizing
year state
2013 Alabama 637247 635431
Alaska 30486 28514
Arizona 707683 683273
I've also tried a lambda function:
df.groupby(['year','state'])['enrollees','utilizing']\
.apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
Which yields the absolute largest 3 in the GroupBy object:
enrollees utilizing
year state
2016 California 4516216 4365896
2015 California 4324304 4191704
2014 California 4133532 4011208
I think it may have to do with the indexing of the GroupBy object, but I am not sure... Any guidance would be appreciated!
Well, you could do something not that pretty.
First, get a list of the unique years using set():
years_list = list(set(df.year))
Create a dummy dataframe and a concatenation helper function that I've made in the past:
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoids retyping the same line of code for every df.
    The parameters are the temporary df created at each loop iteration and the
    concatenated df that will contain all values, which must first be initialized
    (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Create the dummy final DF:
df_final = pd.DataFrame()
Now loop over each year, concatenating into the new DF:
for year in years_list:
    # query() filters rows; @year refers to the external variable, in this case
    # the loop input. This leaves a temporary DF with only that year, which is
    # then summed, sorted, and trimmed to the top 3.
    df2 = df.query("year == @year")
    df_temp = (df2.groupby(['year', 'state'])[['enrollees', 'utilizing']]
                  .sum()
                  .sort_values(by='enrollees', ascending=False)
                  .head(3))
    # Finally, call the helper that keeps concatenating the temporary DFs.
    df_final = concatenate_loop_dfs(df_temp, df_final)
and done.
print(df_final)
You then need to sort your grouped result with .sort_values('enrollees', ascending=False).
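A more compact alternative (a sketch, not part of the original answer) would be to aggregate once and then take the first three rows per year from the sorted result:
top3 = (df.groupby(['year', 'state'])[['enrollees', 'utilizing']]
          .sum()
          .sort_values('enrollees', ascending=False)          # largest enrollees first
          .groupby(level='year')
          .head(3)                                            # first 3 rows per year = top 3
          .sort_index(level='year', sort_remaining=False))
print(top3)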

pandas dataframe group year index by decade

Suppose I have a dataframe with an index of monthly timesteps. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data into yearly data and apply other operations. Is there some way I could quickly group them, let's say, by decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still use lambda x: (x.year//10)*10.
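For example, with the DatetimeIndex of the df built above, the lambda form of that same trick looks like this (a minimal sketch; it produces the same decade totals as the (x//10)*10 version):
>>> df.groupby(lambda ts: (ts.year // 10) * 10).sum()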
If your dataframe has headers, say DataFrame['Population', 'Salary', 'vehicle count'], make the year your index: DataFrame = DataFrame.set_index('Year') (note that resample needs a datetime-like index, so Year must be a datetime column). Then use the code below to resample the data into ten-year bins; it also gives you the sum of every other column within that decade:
dataframe = dataframe.resample('10AS').sum()
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up with
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc here chooses the first column in your dataframe.
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
