I have a data frame (df1) of start and end dates that looks like this.
Start Date End Date
1875-01-01 1877-09-30
1881-07-01 1886-03-31
1888-01-01 1889-06-30
1890-10-01 1890-12-31
.
.
.
2016-10-01 2018-12-31
I have a different data frame (df2) which consists of a daily time series. For example,
Date Value
1875-01-01 7.21
1875-01-02 7.23
1875-01-03 7.22
1875-01-04 7.12
.
.
.
2018-12-31 3.12
I set the dates as the index of df2.
I am trying to build a stats table from df2 using the date ranges in df1.
First I created an empty data frame to hold the values. For example,
outputtable = pd.DataFrame(columns = ('Max','Min','Ave'))
for i in df1.index:
    try:
        df3 = df2.loc[df1['Start Date'][i]:df1['End Date'][i]]
        minimum = df3['Value'].min()
        maximum = df3['Value'].max()
        average = df3['Value'].mean()
        outputtable[-1] = [minimum, maximum, average]
    except:
        pass
I used try/except because some of the date ranges in df1 are not covered by df2. In that case, I want the code to skip that range and move on to the next set of dates.
I want the code to go through every row of df1, compute the stats (min, max and mean) and put them into outputtable for further calculations. So far the code above is not working. Help would be much appreciated.
Desired output
Start Date End Date Min Max Ave
1875-01-01 1877-09-30 7 8 7.2
1881-07-01 1886-03-31 1 4 2.2
1888-01-01 1889-06-30 2 6.5 3
1890-10-01 1890-12-31 3 5 4.2
.
.
.
2016-10-01 2018-12-31 1 2 1.7
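One likely culprit in the code above is the line outputtable[-1] = [minimum, maximum, average]: indexing a DataFrame with [-1] creates a column named -1 instead of appending a row. A minimal sketch of one way to build the table (assuming df2 has a sorted DatetimeIndex and df1's columns hold dates that pandas can slice on) collects the rows in a list and constructs the frame at the end:

import pandas as pd

rows = []
for i in df1.index:
    # .loc slicing on a sorted DatetimeIndex works even if the exact
    # endpoints are missing from df2
    window = df2.loc[df1['Start Date'][i]:df1['End Date'][i], 'Value']
    if window.empty:
        continue  # no daily data for this date range, skip it
    rows.append({'Start Date': df1['Start Date'][i],
                 'End Date': df1['End Date'][i],
                 'Min': window.min(),
                 'Max': window.max(),
                 'Ave': window.mean()})

outputtable = pd.DataFrame(rows)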
I have a dataset that follows a weekly indexation, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get what the interpolated values would be for every date in list_dates, but within a given window (for example: using only 4 values in the df to calculate the interpolation, split between before and after, i.e. the first 2 dates before the list date and the first 2 dates after it).
To get the interpolated value for the date 12/01/2021 in the list, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat each interpolated value before computing the next one in the list, as that would mean using one interpolated date to calculate the next (for example using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as shown below.
I slightly modified your input data to show the interpolation with a DatetimeIndex (method='time'):
import pandas as pd

# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']

# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')

# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()

# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
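If only the rows for list_dates are needed afterwards, they can be selected back out of the interpolated frame (a small usage sketch, reusing the new_dates index built above):

# selects only 2021-01-12 (15.0) and 2021-01-13 (16.0)
print(df.loc[new_dates])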
I have a dataframe as follows:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
2018-01-01 C 10 235
2018-02-01 C 15 130
I want to use groupby dynamically, i.e. I do not wish to hard-code the column names on which groupby would be applied. Specifically, I want to compute the mean of each Group for the last two months.
As we can see, not every Group has data for all dates in the above dataframe. So the tasks are as follows:
Add a dummy row based on the date whenever data for Date = 2018-03-01 is not present for a Group (e.g. add a row for A and C).
Perform groupby to compute the mean of the last two months' Value and Duration.
So my approach is as follows:
For Task 1:
s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date','Group'])
df = df.set_index(['Date','Group']).reindex(s).reset_index().sort_values(['Group','Date']).ffill(axis=0)
Can we have a better method for achieving the 'add row' task? The reference is found here.
For Task 2:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp
df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt,'Group',df_cols)
The reference for the above approach is found here.
The above code throws IndexError: Column(s) ['index','Group','Date','Value','Duration'] already selected
The expected output is
Group   Value   Duration
A       10      60
B       27.5    224
C       15      130
(Since a row is added for 2018-03-01 with the same values as 2018-02-01, we are computing the mean of the last two values for each Group.)
Use GroupBy.agg instead of transform if you need the output filled with the aggregate values:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()

df = cond_grp_by(df, 'Group', df_cols)
print(df)
Group Value Duration
0 A 10.0 60.0
1 B 27.5 224.0
2 C 15.0 130.0
If you need the last value per group, use GroupBy.last:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()

df = cond_grp_by(df, 'Group', df_cols)
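For the 'add row' part of the question (Task 1), a possible sketch is the same MultiIndex.from_product / reindex idea from the question, but with a group-wise forward fill so one Group's missing months are never filled with another Group's values (this assumes Date is a sortable datetime column and that Value and Duration are the columns to fill):

import pandas as pd

# Build the full Date x Group grid, then fill each Group from its own previous rows
full_idx = pd.MultiIndex.from_product(
    [df['Date'].unique(), df['Group'].unique()], names=['Date', 'Group'])

df_full = df.set_index(['Date', 'Group']).reindex(full_idx).reset_index()
df_full = df_full.sort_values(['Group', 'Date'])
df_full[['Value', 'Duration']] = df_full.groupby('Group')[['Value', 'Duration']].ffill()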
I extracted some dates and temperature values from an XML file and wanted to make a dataframe out of them. So after some loops I defined the variables temperature and date and appended their values to a list defined outside the loop (placeholder). Later I made a DataFrame from them and assigned the column names directly while creating it. But I realized that every time I run my code the column names get assigned correctly or incorrectly at random.
Here is my code:
placeholder = []
for timeserie in timeseries:
    date = re.findall('<entryisIntraday\D*(\d*.\d*.\d*)', timeserie)
    temperature = re.findall('<value>(.*)<\/value>', timeserie)[0]
    placeholder.append([date, temperature])

print(placeholder)
df = pd.DataFrame(placeholder, columns={"DATE", "TEMP"})
print(df)
After running the code, sometimes the result is like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'] ...
TEMP DATE
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
and sometimes like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'], ...
DATE TEMP
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
I didn't have this problem when I assigned the column names after building the DataFrame:
df = pd.DataFrame(placeholder)
df = df.rename(columns={0: "DATE", 1: "TEMP"})
How can I solve this problem?
The columns argument of the DataFrame constructor should be a list, not a set. A set has no defined order, so the two column names can be picked up in either order:
df = pd.DataFrame(placeholder, columns = ["DATE", "TEMP"])
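A quick way to see why the order changes between runs (a small illustrative sketch, not the asker's code): string hashing in Python is randomized per interpreter process, so the iteration order of a small set of strings can differ from one run of the script to the next.

# Run this script several times: the printed order may flip between runs
# because of per-process hash randomization (PYTHONHASHSEED).
print(list({"DATE", "TEMP"}))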
So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code that gets the graph looking the way I wanted, but it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)

for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
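To get from there to the bar chart the OP asked for, the unobserved months can then be filled with zero before plotting (a small sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

df_a['count'] = df_a['count'].fillna(0)  # months with no events become 0 instead of NaN
df_a.plot(kind='bar')
plt.show()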
I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the information with to_period('M') and then use map.
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can map directly on the converted index:
df_daily['a'] / df_daily.index.to_period('M').map(df_monthly)
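If, as mentioned at the end of the question, you also want to keep the monthly numbers alongside the daily data (duplicated within each month), the same mapping can simply be stored as columns (a small sketch reusing the month column created above):

df_daily['monthly'] = df_daily['month'].map(df_monthly)      # monthly value repeated for each day
df_daily['normalized'] = df_daily['a'] / df_daily['monthly']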
You can create a temporary key from the index's month, then merge both dataframes on the key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop('key', axis=1)
a 0
2017-01-01 0 1.0
2017-01-02 1 1.0
2017-01-03 2 1.0
2017-01-04 3 1.0
2017-01-05 4 1.0
For division you can then simply do :
df_new['b'] = df_new['a'] / df_new[0]