I'd like to concatenate/merge my pandas Series together. This is my data structure (for reference):
dictionary = {'a': {'1': <Series>, '2': <Series>, '3': <Series>, '4': <Series>}, 'b': {'1': <Series>, '2': <Series>, '3': <Series>, '4': <Series>}}
There are many more values at both levels, and each number corresponds to a Series that contains timeseries data. I would like to merge all of 'a' together into one dataframe; the only trouble is that some of the data is yearly, some quarterly, and some monthly.
So what I'm looking to do is loop through my data, something like this:
for level1 in dictData:
    for level2 in dictData[level1]:
        dictData[level1][level2].index.equals(dictData[level1][level2])
but obviously here I'm just comparing each series's index to the series itself! How would I compare each element to all the others? I know I'm missing something fairly fundamental. Thank you.
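Presumably I need something pairwise instead. A minimal sketch of what I mean, using itertools.combinations over the level-2 keys (illustrative only, assuming each value is a Series):
from itertools import combinations

# compare the index of every pair of series under each level-1 key
for level1 in dictData:
    for k1, k2 in combinations(dictData[level1], 2):
        same = dictData[level1][k1].index.equals(dictData[level1][k2].index)
        print(level1, k1, k2, same)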
EDIT:
Here's some samples of actual data:
{'noT10101': {'A191RL': Gross domestic product
1947-01-01 -1.1
1947-04-01 -1.0
1947-07-01 -0.8
1947-10-01 6.4
1948-01-01 4.1
... ...
2020-01-01 -5.0
2020-04-01 -31.4
2020-07-01 33.4
2020-10-01 4.3
2021-01-01 6.4
[370 rows x 1 columns], 'DGDSRL': Goods
1947-01-01 2.9
1947-04-01 7.4
1947-07-01 2.7
1947-10-01 1.5
1948-01-01 2.0
... ...
2020-01-01 0.1
2020-04-01 -10.8
2020-07-01 47.2
2020-10-01 -1.4
2021-01-01 26.6
[370 rows x 1 columns], 'A191RP': Gross domestic product, current dollars
1947-01-01 9.7
1947-04-01 4.7
1947-07-01 6.0
1947-10-01 17.3
1948-01-01 10.0
... ...
2020-01-01 -3.4
2020-04-01 -32.8
2020-07-01 38.3
2020-10-01 6.3
2021-01-01 11.0
[370 rows x 1 columns], 'DSERRL': Services
1947-01-01 0.4
1947-04-01 5.9
1947-07-01 -0.8
1947-10-01 -2.1
1948-01-01 2.7
... ...
2020-01-01 -9.8
2020-04-01 -41.8
2020-07-01 38.0
2020-10-01 4.3
2021-01-01 4.2
[370 rows x 1 columns],
As you can see, dictionary key 'noT10101' corresponds to a second level of keys 'A191RL', 'DGDSRL', 'A191RP', etc., whose associated value is a Series. So when I am accessing .index I am looking at the index of that Series, i.e. the datetime values. In this example they all match, but in some cases they don't.
You can use the pandas concat function. It would be something like this:
import pandas as pd
import numpy as np

# a yearly series
df1 = pd.Series(np.random.random_sample(size=5),
                index=pd.Timestamp("2021-01-01") + np.arange(5) * pd.Timedelta(days=365),
                dtype=float)
# a monthly series
df2 = pd.Series(np.random.random_sample(size=12),
                index=pd.Timestamp("2021-05-01") + np.arange(12) * pd.Timedelta(days=30),
                dtype=float)

dictData = {"a": {"series": df1, "same_series": df1},
            "b": {"series": df1, "different_series": df2}}

new_dict = {}
for level1 in dictData:
    new_dict[level1] = pd.concat(list(dictData[level1].values()))
Notice that I tried to mimic both yearly and monthly granularity. The point is that the granularity of the series being concatenated doesn't matter.
The result will be something like this:
{'a': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
dtype: float64,
'b': 2021-01-01 0.213574
2022-01-01 0.263514
2023-01-01 0.627435
2024-01-01 0.388753
2024-12-31 0.990316
2021-05-01 0.614485
2021-05-31 0.611967
2021-06-30 0.820435
2021-07-30 0.839613
2021-08-29 0.507669
2021-09-28 0.471049
2021-10-28 0.550482
2021-11-27 0.723789
2021-12-27 0.209169
2022-01-26 0.664584
2022-02-25 0.901832
2022-03-27 0.946750
dtype: float64}
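Note that concat along the default axis=0 simply stacks the series end to end. If what you're after is one dataframe per level-1 key with each series as its own column, aligned on the dates, a sketch along these lines may be closer to the goal (dates present in only some series come out as NaN):
new_dict = {}
for level1 in dictData:
    # passing the inner dict makes its keys the column names
    new_dict[level1] = pd.concat(dictData[level1], axis=1)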
Take a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Related
Suppose I have a Pandas dataframe with 'Date' column whose values have gaps like below:
>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])
>>> df
Date $
0 2021-01-02 1.0
1 2021-01-05 2.0
2 2021-02-05 3.0
I would like to fill the gaps in the 'Date' column for the period from Jan 01, 2021 to Feb 28, 2021 while copying (forward-filling) the values. From some reading up on StackOverflow posts like this, I came up with the following solution to transform the dataframe as shown below:
# I need to first convert values in 'Date' column to datetime64 type
>>> df['Date'] = pd.to_datetime(df['Date'])
# Then I have to set 'Date' column as the dataframe's index
>>> df = df.set_index(['Date'])
# Without doing the above two steps, the call below returns an error
>>> df_new=df.asfreq(freq='D', how={'start':'2021-01-01', 'end':'2021-02-28'}, method='ffill')
>>> df_new
$
Date
2021-01-02 1.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 2.0
2021-01-06 2.0
2021-01-07 2.0
2021-01-08 2.0
2021-01-09 2.0
2021-01-10 2.0
...
2021-01-31 2.0
2021-02-01 2.0
2021-02-02 2.0
2021-02-03 2.0
2021-02-04 2.0
2021-02-05 3.0
But as you can see above, df_new only starts at '2021-01-02' instead of '2021-01-01' AND it ends on '2021-02-05' instead of '2021-02-28'. I hope I'm entering the input for the how parameter correctly above.
Q1: What else do I need to do to make the resulting dataframe look like below:
>>> df_new
$
Date
2021-01-01 1.0
2021-01-02 1.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 2.0
2021-01-06 2.0
2021-01-07 2.0
2021-01-08 2.0
2021-01-09 2.0
2021-01-10 2.0
...
2021-01-31 2.0
2021-02-01 2.0
2021-02-02 2.0
2021-02-03 2.0
2021-02-04 2.0
2021-02-05 3.0
2021-02-06 3.0
...
2021-02-28 3.0
Q2: Is there any way I can accomplish this more simply (e.g. without having to set the 'Date' column as the index of the dataframe)?
Thanks in advance for your suggestions/answers!
You can find the min/max date, create a new pd.date_range() using the MonthBegin/MonthEnd date offsets, and reindex:
df.Date = pd.to_datetime(df.Date)

# snap out to the first day of the starting month and the last day of the ending month
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

# reindex to the full daily range; ffill fills forward, bfill covers the leading gap
df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
print(df)
Prints:
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
...
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
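For Q2, one possible way to avoid set_index is to left-merge onto a frame built from the full range and fill afterwards. A sketch reusing the dr computed above, starting from the original three-row df (with Date already converted to datetime):
full = pd.DataFrame({"Date": dr})
df_new = full.merge(df, on="Date", how="left").ffill().bfill()
print(df_new)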
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times using different methods for each column. For user and total, I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'] = resampled['user'].ffill()
resampled['total'] = resampled['total'].ffill()
resampled['value'] = resampled['value'].fillna(0)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column, so I tried the following:
resampled = df.resample('5T').agg({'user':'ffill',
'value':'sum',
'total':'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and if there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, but if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here as it would do so for the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred']*len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
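The likely explanation for the agg output above: each function in the dict is applied inside each 5-minute bin, and an empty bin has no earlier row to carry forward, so 'ffill' yields NaN there while 'sum' yields 0. A sketch that keeps the per-column dict but reaches the expected output is to aggregate first and then forward-fill across bins (a two-step workaround rather than a pure agg call):
resampled = df.resample('5T').agg({'user': 'ffill', 'value': 'sum', 'total': 'ffill'})
# fill across bins for the columns that should carry forward
resampled[['user', 'total']] = resampled[['user', 'total']].ffill()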
I am trying to make a graph that shows the average temperature for each day over a year by averaging 19 years of NOAA data (side note: is there any better way to get historical weather data? NOAA's seems super inconsistent). I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is of type datetime64[ns].
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
# separate the data by station; .copy() avoids SettingWithCopyWarning on the assignment below
oceanside = data[data.STATION == 'USC00047767'].copy()
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:, 'DATE'] = pd.to_datetime(oceanside.loc[:, 'DATE'], format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year
I've been trying to separate the data by year, so I can then average it. I would like to do this without using a for loop because I plan on doing this with much larger data sets and that would be super inefficient. I looked in the pandas documentation but I couldn't find a function that seemed like it would do that. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!
Create a dict of dataframes where each key is a year
df_by_year = dict()
for year in oceanside.DATE.dt.year.unique():
    year_data = oceanside[oceanside.DATE.dt.year == year]
    df_by_year[year] = year_data
Get data by a single year
oceanside[oceanside.DATE.dt.year == 2019]
Get average for each year
oceanside.groupby(oceanside.DATE.dt.year).mean(numeric_only=True)
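Since the original goal is the average for each calendar day across all 19 years (rather than per year), grouping by month and day may be closer to what's needed. A sketch, assuming the daily max temperature TMAX is the column of interest:
# mean TMAX for each (month, day) pair across every year in the data
day_avg = oceanside.groupby([oceanside.DATE.dt.month, oceanside.DATE.dt.day])['TMAX'].mean()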
I have a dataframe that has 5-minute timestamps as the index and I would like to switch to 15-minute periods. So I would like to take the mean of 3 5-minute periods, and then assign the index value of the first period to that mean, building another dataframe.
df1=
variable_1
(Settlement_Date,)
2018-06-30 20:30:00 4.5
2018-06-30 20:35:00 3.8
2018-06-30 20:40:00 4.2
2018-06-30 20:45:00 4.1
2018-06-30 20:50:00 6.0
2018-06-30 20:55:00 3.3
2018-06-30 21:00:00 1.9
2018-06-30 21:05:00 2.8
2018-06-30 21:10:00 3.1
... ...
I want this dataframe to become something like this
df1=
variable_1
(Settlement_Date,)
2018-06-30 20:30:00 4.2
2018-06-30 20:45:00 4.5
2018-06-30 21:00:00 2.6
... ...
I have tried to use a for loop, but am having issues getting the date back into the dataframe:
mean_list = []
date_list = []
for i in range(len(df1)-3):
    mean_holding = df1[:i+3].mean()
    date_holding = df1.iloc[i+3]
    mean_list.append(mean_holding)
    date_list.append(date_holding)
I believe you need resample with mean:
df = df.resample('15Min').mean()
Alternative solution with Grouper:
df = df.groupby(pd.Grouper(freq='15Min')).mean()
print (df)
variable_1
(Settlement_Date,)
2018-06-30 20:30:00 4.166667
2018-06-30 20:45:00 4.466667
2018-06-30 21:00:00 2.600000
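By default each 15-minute bin is labeled by its left edge, which is exactly the "index value of the first period" asked for. If a different labeling were wanted, resample takes label and closed arguments, e.g.:
# label each bin by its right edge instead of the default left edge
df.resample('15Min', label='right').mean()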
I have a time series of daily rainfall data that looks like this:
PRCP
year_month_day
1797-01-01 00:00:00 0.0
1797-01-02 00:00:00 0.0
1797-01-03 00:00:00 1.1
1797-01-04 00:00:00 0.0
1797-01-05 00:00:00 3.5
1797-02-01 00:00:00 8.1
1797-02-02 00:00:00 3.0
1797-02-03 00:00:00 0.0
1797-02-04 00:00:00 0.0
1797-02-05 00:00:00 0.0
1797-03-01 00:00:00 0.0
1797-03-02 00:00:00 0.0
1797-03-03 00:00:00 0.0
1797-03-04 00:00:00 0.0
1797-03-05 00:00:00 1.5
1797-04-01 00:00:00 6.3
1797-04-02 00:00:00 24.0
1797-04-03 00:00:00 0.0
1797-04-04 00:00:00 2.2
1797-04-05 00:00:00 5.9
1797-05-01 00:00:00 0.0
1797-05-02 00:00:00 15.9
1797-05-03 00:00:00 0.0
1797-05-04 00:00:00 0.0
1797-05-05 00:00:00 0.0
1797-06-01 00:00:00 1.6
1797-06-02 00:00:00 0.0
1797-06-03 00:00:00 0.0
1797-06-04 00:00:00 7.9
1797-06-05 00:00:00 0.0
I have been able to import it with the index column as a pandas datetime object. I am trying to count all of the non-zero raindays per month. I can group by month with:
grouped = df.groupby(pd.Grouper(freq='M'))
and can count everything per month with:
raindays = grouped.resample("M").count()
But that also counts days with 0 rainfall. I found hints about using nunique(), but it doesn't seem to work with resample, e.g.:
raindays = grouped.resample("M").nunique()
returns error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'nunique'
Is there a way to count non zero values in a grouped pandas object?
Mask those 0s and try again.
df.mask(df.PRCP.eq(0)).groupby(pd.Grouper(freq='M')).count()
Or, the more obvious version with replace (numpy is needed for np.nan):
import numpy as np

df.replace({0: np.nan}).groupby(pd.Grouper(freq='M')).count()
PRCP
year_month_day
1797-01-31 2
1797-02-28 2
1797-03-31 1
1797-04-30 4
1797-05-31 1
1797-06-30 2
Using factorize and bincount
f, u = pd.factorize(df.index + pd.offsets.MonthEnd(0))
pd.Series(np.bincount(f, df.PRCP.values != 0).astype(int), u)
1797-01-31 2
1797-02-28 2
1797-03-31 1
1797-04-30 4
1797-05-31 1
1797-06-30 2
dtype: int64
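For comparison, the same monthly count can likely be written as a plain boolean sum, since summing True values counts them:
# True marks a non-zero rain day; summing per month counts them
df['PRCP'].ne(0).resample('M').sum()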