How do you separate a pandas dataframe by year in python?

I am trying to make a graph that shows the average temperature each day over a year by averaging 19 years of NOAA data (side note: is there any better way to get historical weather data? NOAA's seems super inconsistent). I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is the datetime64[ns] type
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
# separate the data by station
oceanside = data[data.STATION == 'USC00047767']
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:,'DATE'] = pd.to_datetime(oceanside.loc[:,'DATE'],format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year
I've been trying to separate the data by year, so I can then average it. I would like to do this without using a for loop because I plan on doing this with much larger data sets and that would be super inefficient. I looked in the pandas documentation but I couldn't find a function that seemed like it would do that. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!

Create a dict of dataframes where each key is a year:
df_by_year = dict()
for year in oceanside.DATE.dt.year.unique():
    df_by_year[year] = oceanside[oceanside.DATE.dt.year == year]
Get the data for a single year:
oceanside[oceanside.DATE.dt.year == 2019]
Get the average for each year:
oceanside.groupby(oceanside.DATE.dt.year).mean()
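For the original goal (the average for each calendar day across all the years, rather than one average per year), a minimal sketch along the same lines, assuming DATE has already been converted to datetime as in the question and picking TMAX/TMIN/PRCP as example columns:

daily_avg = (oceanside
             .groupby([oceanside.DATE.dt.month, oceanside.DATE.dt.day])[['TMAX', 'TMIN', 'PRCP']]
             .mean())
daily_avg.index.names = ['month', 'day']   # one row per day of year, averaged over all years
daily_avg['TMAX'].plot()                   # e.g. plot the long-term mean daily maximum
plt.show()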

Related

Merge several Dataframes with outside temperature and power generation

I have several dataframes of heating devices which contain data over 1 year. One time step is 15 min, and each df has two columns: outside_temp and heat_generation. Each df looks like this:
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
3 11.0 0
4 11.0 300
5 10.9 49
... ... ...
35037 -5.1 450
35038 -5.1 450
35039 -5.1 450
35040 -5.2 600
I now want to know, across all heat devices (and therefore all dataframes), how much heat_production is needed at which outside_temp -> I was thinking about groupby or something else, but I don't know the best way to handle this amount of data. When directly merging the dfs, the problem is that the same outside temperature appears several times and the heat production of course differs. To solve this, I could imagine taking the average heat_production for each device at a given outside_temperature. Of course it can also be the case that a device did not measure a specific temperature (e.g. the device is located in a warmer or colder area -> therefore NaN values are possible).
In the end I want to get some kind of polynomial/sigmoid function to see how much heat_production is necessary at a given outside temperature.
In the end I want to have a dataframe like this:
outside_temp heat_production_average_device_1 heat_production_average_device_2 ...etc
-20.0 790 NaN
-19.9 789 NaN
-19.8 788 790
-19.7 NaN 780
-19.6 770 NaN
.
.
.
19.6 34 0
19.7 32 0
19.8 30 0
19.9 32 0
20.0 0 0
Any idea what's the best way to do so?
Given:
>>> df1
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
>>> df2
outside_temp heat_production
3 11.0 0
4 11.0 300
5 10.9 49
Doing:
def my_func(i, df):
    renamer = {'heat_production': f'heat_production_average_device_{i}'}
    return (df.groupby('outside_temp')
              .mean()
              .rename(columns=renamer))
dfs = [df1, df2]
dfs = [my_func(i+1, df) for i, df in enumerate(dfs)]
df = pd.concat(dfs, axis=1)
print(df)
Output:
heat_production_average_device_1 heat_production_average_device_2
outside_temp
11.0 245.0 150.0
11.1 175.0 NaN
10.9 NaN 49.0
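The question also mentions wanting a polynomial/sigmoid relationship between outside temperature and heat production. A minimal sketch of a polynomial fit on the combined frame df from above, assuming NumPy and an arbitrarily chosen polynomial degree (the tiny example frame has too few points for a meaningful fit, so treat this as illustrative only):

import numpy as np

col = 'heat_production_average_device_1'
known = df[col].dropna()              # ignore temperatures this device never measured
coeffs = np.polyfit(known.index.values, known.values, deg=2)
poly = np.poly1d(coeffs)
print(poly(0.0))                      # estimated heat production for device 1 at 0.0 degrees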

Loop through Series to find which have the same index value

I'd like to concatenate/merge my pandas Series together. This is my data structure (for extra information):
dictionary = { 'a':{'1','2','3','4'}, 'b':{'1','2','3','4'} }
There are many more values at both levels, and each number corresponds to a Series that contains timeseries data. I would like to merge all of 'a' together into one dataframe; the only trouble is that some of the data is yearly, some quarterly and some monthly.
So what I'm looking to do is loop through my data, something like this:
for level1 in dictData:
    for level2 in dictData[level1]:
        dictData[level1][level2].index.equals(dictData[level1][level2])
but obviously here I'm just comparing the series to itself! How would I compare each element to all the others? I know I'm missing something fairly fundamental. Thank you.
EDIT:
Here's some samples of actual data:
{'noT10101': {'A191RL': Gross domestic product
1947-01-01 -1.1
1947-04-01 -1.0
1947-07-01 -0.8
1947-10-01 6.4
1948-01-01 4.1
... ...
2020-01-01 -5.0
2020-04-01 -31.4
2020-07-01 33.4
2020-10-01 4.3
2021-01-01 6.4
[370 rows x 1 columns], 'DGDSRL': Goods
1947-01-01 2.9
1947-04-01 7.4
1947-07-01 2.7
1947-10-01 1.5
1948-01-01 2.0
... ...
2020-01-01 0.1
2020-04-01 -10.8
2020-07-01 47.2
2020-10-01 -1.4
2021-01-01 26.6
[370 rows x 1 columns], 'A191RP': Gross domestic product, current dollars
1947-01-01 9.7
1947-04-01 4.7
1947-07-01 6.0
1947-10-01 17.3
1948-01-01 10.0
... ...
2020-01-01 -3.4
2020-04-01 -32.8
2020-07-01 38.3
2020-10-01 6.3
2021-01-01 11.0
[370 rows x 1 columns], 'DSERRL': Services
1947-01-01 0.4
1947-04-01 5.9
1947-07-01 -0.8
1947-10-01 -2.1
1948-01-01 2.7
... ...
2020-01-01 -9.8
2020-04-01 -41.8
2020-07-01 38.0
2020-10-01 4.3
2021-01-01 4.2
[370 rows x 1 columns],
As you can see, dictionary key 'noT10101' corresponds to a series of keys 'A191RL', 'DGDSRL', 'A191RP', etc., whose associated value is a Series. So when I am accessing .index I am looking at the index of that Series, i.e. the datetime values. In this example they all match, but in some cases they don't.
You can use the pandas concat function. It would be something like this:
import pandas as pd
import numpy as np
df1 = pd.Series(np.random.random_sample(size=5),
                index=pd.Timestamp("2021-01-01") + np.arange(5) * pd.Timedelta(days=365),
                dtype=float)
df2 = pd.Series(np.random.random_sample(size=12),
                index=pd.Timestamp("2021-01-15") + np.arange(12) * pd.Timedelta(days=30),
                dtype=float)
dictData = {"a": {"series": df1, "same_series": df1}, "b": {"series": df1, "different_series": df2}}
new_dict = {}
for level1 in dictData:
    new_dict[level1] = pd.concat(list(dictData[level1].values()))
Notice that I tried to mimic both yearly and monthly granularity. The point is that the granularity of the series being concatenated doesn't matter.
The result will be something like this:
{'a': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
dtype: float64,
'b': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-05-01 0.614485
2021-05-31 0.611967
2021-06-30 0.820435
2021-07-30 0.839613
2021-08-29 0.507669
2021-09-28 0.471049
2021-10-28 0.550482
2021-11-27 0.723789
2021-12-27 0.209169
2022-01-26 0.664584
2022-02-25 0.901832
2022-03-27 0.946750
dtype: float64}
Take a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
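The original question (comparing every series' index to every other one's) can also be tackled directly. A minimal sketch, assuming the nested dictData structure from the question and using itertools.combinations so that each pair is compared exactly once, never a series against itself:

from itertools import combinations

for level1 in dictData:
    # compare every series in this group against every other one, once per pair
    for (name_a, s_a), (name_b, s_b) in combinations(dictData[level1].items(), 2):
        same = s_a.index.equals(s_b.index)
        print(level1, name_a, name_b, "same index" if same else "different index")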

pandas dataframe interpolate for Nans with groupby using window of discrete days of the year

The small reproducible example below sets up a dataframe that is 100 yrs in length containing some randomly generated values. It then inserts 3 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby) with a condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12ths are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean of April 12th values between 1915 and 1925; a missing April 12th value in 2000 would be filled in with the mean of April 12th values between 1995 and 2005, etc. I tried playing around with adding a .rolling() to the lambda function in the last line of the script, but was unsuccessful in my attempt.
Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that would be flexible enough to still use a minimum of 10 values when missing values are close to the beginning and ending of the time series?
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]
# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index
# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))
This answers both parts:
build a DF dfr that holds the calculation you want
the lambda function returns a dict {year: val, ...}
make sure the indexes are named in a reasonable way
expand the dict out into columns with apply(pd.Series)
reshape by putting the year columns back into the index
merge() the built DF with the original DF; the vals column contains the NaNs and column 0 is the value to fill with
finally fillna()
import random
import numpy as np
import pandas as pd

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe - simplified from question...
df = pd.DataFrame({"Date": dates, "vals": vals})
df[df.isna().any(axis=1)]
ystart = df.Date.dt.year.min()
# generate rolling means for month/day. bfill for when it's the start of the series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
         .agg(lambda s: {y + ystart: v for y, v in enumerate(s.dropna().rolling(5).mean().bfill())})
         .to_frame().rename_axis(["month", "day"])
       )
# expand dict into columns and reshape to be indexed by month, day, year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year", axis=1).stack().to_frame()
# get df index back, plus vals & fillna (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month, df.Date.dt.day, df.Date.dt.year], right_index=True)
# finally what we really want to do - fill the NaNs
df.fillna(dfm[0])
Analysis: taking the NaN for 11-Apr-1918, the default fill is 22, as it is backfilled from 1921: (12+2+47+47+2)/5 == 22
dfm.query("key_0==4 & key_1==11").head(7)
      key_0  key_1  key_2                 Date  vals     0
100       4     11   1918  1918-04-11 00:00:00   nan    22
465       4     11   1919  1919-04-11 00:00:00    12    22
831       4     11   1920  1920-04-11 00:00:00     2    22
1196      4     11   1921  1921-04-11 00:00:00    47    27
1561      4     11   1922  1922-04-11 00:00:00    47    36
1926      4     11   1923  1923-04-11 00:00:00     2  34.6
2292      4     11   1924  1924-04-11 00:00:00    37  29.4
I'm not sure how closely I've matched the intent of your question. The approach I've taken satisfies two requirements:
Need an arbitrary number of averages
Use those averages to fill in the NAs
Simply put, instead of filling in the NAs with values from before and after dates, I fill in the NAs with averages extracted from a number of years in the same row.
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
df['Date'] = pd.to_datetime(df['Date'])
df['mm-dd'] = df['Date'].apply(lambda x:'{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x:'{:04}'.format(x.year))
df = df.iloc[:,1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:,:-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)
df.loc[df['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 NaN NaN 34.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN 49.0 NaN NaN NaN 32.0 NaN NaN 76 21.6
04-11 NaN 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 NaN 5.0 33.0 3 29.7
04-12 NaN 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 NaN 1.0 12.0 3 21.3
04-13 NaN 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 NaN 2.0 46.0 3 21.3
04-14 NaN 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 NaN 16.0 8.0 3 24.7
df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 21.6 21.6 34.0 21.6 21.6 21.6 2.0 21.6 21.6 21.6 ... 21.6 49.0 21.6 21.6 21.6 32.0 21.6 21.6 76.0 21.6
04-11 29.7 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 29.7 5.0 33.0 3.0 29.7
04-12 21.3 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 21.3 1.0 12.0 3.0 21.3
04-13 21.3 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 21.3 2.0 46.0 3.0 21.3
04-14 24.7 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 24.7 16.0 8.0 3.0 24.7
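Neither answer restricts the average to exactly the 10 nearest same-day values, which was the original ask. A minimal sketch of that idea, using a centred rolling window over the years within each (month, day) group; the window size and min_periods are arbitrary choices, and near the ends of the series the window shrinks rather than shifts, so the bonus question is only approximately covered:

import random
import numpy as np
import pandas as pd

dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50) for _ in range(len(dates))]
vals[100:200] = [np.nan] * 100          # one arbitrary gap
df = pd.DataFrame({"Date": dates, "vals": vals})

def fill_from_nearest_years(s, window=10):
    # s holds the values for one (month, day) across the years, in year order;
    # the centred rolling mean skips NaNs and shrinks near the series ends.
    est = s.rolling(window, center=True, min_periods=1).mean()
    return s.fillna(est)

df["vals"] = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
                .transform(fill_from_nearest_years))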

Python Stacked bar chart from DF with index dates?

I have created a data frame in python using pandas that has the following output with date being the index:
Date Daily Anger Daily Haha Daily Like Daily Love Daily Sad Daily WoW
2019-08-31 1 2.0 132.0 8.0 0.0 5.0
2019-09-30 0 1.0 41.0 4.0 0.0 0.0
2019-10-31 15 1.0 117.0 4.0 0.0 2.0
2019-11-30 0 3.0 84.0 4.0 0.0 4.0
2019-12-31 2 17.0 98.0 20.0 5.0 7.0
I'm trying to get these values into a stacked bar chart where the x axis is the date and the y axis is the total values across these metrics.
I've spent the last couple of hours trying to get this to work with Google, with no success. Could anyone help me?
If Date is a column, use the x parameter of DataFrame.plot.bar:
df.plot.bar(x='Date', stacked=True)
If Date is the DatetimeIndex, use only the stacked parameter:
df.plot.bar(stacked=True)
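A minimal runnable sketch using the figures from the question, with Date as a DatetimeIndex (the second case above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'Daily Anger': [1, 0, 15, 0, 2],
     'Daily Haha': [2.0, 1.0, 1.0, 3.0, 17.0],
     'Daily Like': [132.0, 41.0, 117.0, 84.0, 98.0],
     'Daily Love': [8.0, 4.0, 4.0, 4.0, 20.0],
     'Daily Sad': [0.0, 0.0, 0.0, 0.0, 5.0],
     'Daily WoW': [5.0, 0.0, 2.0, 4.0, 7.0]},
    index=pd.to_datetime(['2019-08-31', '2019-09-30', '2019-10-31',
                          '2019-11-30', '2019-12-31']))
df.plot.bar(stacked=True)   # one bar per date, segments stack the six metrics
plt.show()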

selecting dates of a dataframe and looping over date index

I have a dataframe with dates from 2006 to 2016 and, for each date, 7 corresponding values.
The data is like below:
H PS T RH TD WDIR WSP
date
2006-01-01 11:28:00 38 988.6 0.9 98.0 0.6 120.0 14.4
2006-01-01 11:28:00 46 987.6 0.5 91.0 -0.7 122.0 15.0
2006-01-01 11:28:00 57 986.3 0.5 89.0 -1.1 124.0 15.5
2006-01-01 11:28:00 66 985.1 0.5 90.0 -1.1 126.0 16.0
2006-01-01 11:28:00 74 984.1 0.4 90.0 -1.1 127.0 16.5
2006-01-01 11:28:00 81 983.3 0.4 90.0 -1.1 129.0 17.0
I want to select a few columns for each year (for example T and RH for all of 2006). So, for each year from 2006 to 2016, select a bunch of columns and then write each of the new dataframes to a file.
I did the following:
df_H_T=(df[['RH','T']])
mask = (df_H_T['date'] >'2016-01-01 00:00:00') & (df_H_T['date'] <='2016-12-31 23:59:59')
df_H_T_2006 =df.loc[mask]
print(df_H_T_2006.head(20))
print(df_H_T_2006.tail(20))
But it is not working because it seems it doesn't know what 'date' is, yet when I print the head of the dataframe, date seems to be there. What am I doing wrong?
My second question is: how can I put this in a loop over the year variable so that I don't write each new dataframe by hand, selecting one year at a time up to 2016? (I'm new and have never used loops in Python.)
Thanks,
Ioana
date is in the original dataframe, but then you take df_H_T=df[['RH','T']], so now date isn't in df_H_T. You can use masks generated from one dataframe to slice another, as long as they have the same index. So you can do
mask = (df['date'] >'2016-01-01 00:00:00') & (df['date'] <='2016-12-31 23:59:59')
df_H_T_2006 =df_H_T.loc[mask]
(Note: you're applying the mask to df, but presumably you want to apply it to df_H_T).
If date is in datetime format, you can just take df['date'].apply(lambda x: x.year == 2016). For your for-loop, it would be
df_H_T = df[['RH', 'T']]
years = range(2006, 2017)   # 2006 through 2016
for year in years:
    mask = df['date'].apply(lambda x: x.year == year)
    df_H_T_cur_year = df_H_T.loc[mask]
    print(df_H_T_cur_year.head(20))
    print(df_H_T_cur_year.tail(20))
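An alternative that avoids building a mask by hand for each year: group on the year directly and write each year's T and RH columns to its own file. This is only a sketch; the file names are hypothetical, and it assumes date is the DatetimeIndex as in the printed frame (if it is a regular column, group on df['date'].dt.year instead):

for year, grp in df.groupby(df.index.year):
    # one CSV per year containing only the T and RH columns
    grp[['T', 'RH']].to_csv(f'T_RH_{year}.csv')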
