Pandas: Group by date and the median of another variable - python

This is a demo example of my DataFrame. The full DataFrame has multiple additional variables and covers 6 months of data.
sentiment date
1 2015-05-26 18:58:44
0.9 2015-05-26 19:57:31
0.7 2015-05-26 18:58:24
0.4 2015-05-27 19:17:34
0.6 2015-05-27 18:46:12
0.5 2015-05-27 13:32:24
1 2015-05-28 19:27:31
0.7 2015-05-28 18:58:44
0.2 2015-05-28 19:47:34
I want to group the DataFrame by just the day of the date column, but at the same time aggregate the median of the sentiment column.
Everything I have tried with groupby, the dt accessor and timegrouper has failed.
I want to return a pandas DataFrame not a GroupBy object.
The date column has dtype M8[ns] (datetime64[ns]) and the sentiment column is float64.

You fortunately have the tools you need listed in your question.
In [61]: df.groupby(df.date.dt.date)[['sentiment']].median()
Out[61]:
sentiment
2015-05-26 0.9
2015-05-27 0.5
2015-05-28 0.7
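If you need date back as a regular column (a flat DataFrame rather than a date-indexed one), chaining reset_index() onto the same expression should do it:
df.groupby(df.date.dt.date)[['sentiment']].median().reset_index()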

I would do this:
import numpy as np  # np.median is used in the agg below
df['date'] = df['date'].apply(lambda x: x.date())
df = df.groupby('date').agg({'sentiment': np.median}).reset_index()
You first replace the datetime column with the date.
Then you perform the groupby+agg operation.
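As a side note, the apply can be replaced by the dt accessor mentioned in the question, which does the same conversion without a Python-level lambda:
df['date'] = df['date'].dt.date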

I would do this, because you can do multiple aggregations (like median, mean, min, max, etc.) on multiple columns at the same time:
df.groupby(df.date.dt.date).agg({'sentiment': ['median']})
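For example, assuming the full DataFrame also has other numeric columns (other_col below is only a placeholder for one of them), something like this computes several statistics per column in one pass:
df.groupby(df.date.dt.date).agg({'sentiment': ['median', 'mean'], 'other_col': ['min', 'max']})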

You can get any number of metrics using one groupby and the .agg() function:
1) Create a new column extracting the date.
2) Use groupby and apply numpy.median, numpy.mean, etc.
import numpy as np
import pandas as pd
x = [[1, '2015-05-26 18:58:44'],
     [0.9, '2015-05-26 19:57:31']]
t = pd.DataFrame(x, columns=['a', 'b'])
t.b = pd.to_datetime(t['b'])
t['datex'] = t['b'].dt.date
t.groupby(['datex']).agg({
    'a': np.median
})
Output -
               a
datex
2015-05-26  0.95

Related

Compare master and child dataframe and extract new rows based on two column values only

I have two Dataframes as:
Master_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,120.0,1.0,32.0,96156.9,81.05,1.15,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,80.0,0.8,29.0,161207.0,56.95,0.95,6500
PCJEWELLER,82.5,0.55,31.0,154772.0,56.95,0.95,6500
PCJEWELLER,85.0,0.6,33.0,147882.0,56.95,0.7,6500
PCJEWELLER,90.0,0.5,37.0,138977.0,56.95,0.55,6500
and Child_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
I want to compare Child_DF with Master_DF based on the columns (Symbol, Strike_Price), i.e. if a Symbol & Strike_Price pair is already present in Master_DF then that row is not considered new data.
New Rows are:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
You can use a right merge with indicator=True, then query for 'right_only', and finally reindex() to get the columns in the order of child:
(master.merge(child, on=['Symbol', 'Strike_Price'], how='right',
              suffixes=('_', ''), indicator=True)
       .query('_merge == "right_only"')).reindex(child.columns, axis=1)
Symbol Strike_Price C_BidPrice Pecentage Margin_Req Underlay \
2 JETAIRWAYS 150.0 1.3 22.0 44156.9 81.05
3 PCJEWELLER 100.0 1.8 29.0 441207.0 46.95
C_LTP LotSize
2 1.05 2200
3 4.95 6500
First merge both dataframes on Symbol and Strike_Price, setting indicator=True and how='right':
result = pd.merge(master_df[['Symbol', 'Strike_Price']], child_df,
                  on=['Symbol', 'Strike_Price'], indicator=True, how='right')
Then filter the _merge column for 'right_only' to get the desired result:
result = result[result['_merge'] == 'right_only']
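To match the expected output exactly you will probably also want to drop the helper column that indicator=True adds:
result = result.drop(columns='_merge')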

merge different dataframes and add other columns from base dataframe

I am trying to merge different dataframes.
Assume these two melted dataframes.
melted_dfs[0]=
Date Code delta_7
0 2014-04-01 GWA 0.08
1 2014-04-02 TVV -0.98
melted_dfs[1] =
Date Code delta_14
0 2014-04-01 GWA nan
1 2014-04-02 XRP -1.02
I am looking to merge both of the above dataframes along with the Volume & GR columns from my base dataframe.
base_df =
Date Code Volume GR
0 2014-04-01 XRP 74,776.48 482.76
1 2014-04-02 TRR 114,052.96 460.19
I tried to use Python's built-in reduce function by converting all dataframes into a list, but it throws an error:
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
# feature_dfs is a list which contains all the above dfs.
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
Any help is appreciated. Thanks!
This should work; as the error states, the Date column in some of the dfs is not in datetime format:
from functools import reduce  # reduce is not a builtin on Python 3
feature_dfs = [x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x, y: pd.merge(x, y, on=['Date', 'Code']), feature_dfs)
One of your dataframes in feature_dfs probably has a non-datetime dtype.
Try printing the datatypes and index of the DataFrames:
for i, df in enumerate(feature_dfs):
    print('DataFrame index: {}'.format(i))
    df.info()  # info() prints the dtypes itself
    print('-' * 72)
I would assume that one of the DataFrames will show a line like:
Date    X non-null object
indicating that the Date column does not have a datetime dtype. That DataFrame is the culprit, and the index printed above tells you which one it is.
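Once you have found it, converting that frame's Date column in place (the same fix as the other answer, applied to the single offending frame at index i from the loop above) should let the reduce/merge succeed:
feature_dfs[i]['Date'] = pd.to_datetime(feature_dfs[i]['Date'])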

How to combine daily data with monthly data in Pandas?

I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the information with to_period('M') and then use map.
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can use
df_daily['a'] / df_daily.index.to_period('M').to_series().map(df_monthly)
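If you also want the monthly number attached as a column (the "duplicate values within one month" case mentioned in the question), the same period lookup can be assigned back; a sketch along those lines:
df_daily['monthly'] = df_daily.index.to_period('M').map(df_monthly.to_dict())
df_daily['a'] / df_daily['monthly']  # same normalised values as above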
You can create a temporary key from the index's month, then merge both dataframes on that key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop(columns='key')
a 0
2017-01-01 0 1.0
2017-01-02 1 1.0
2017-01-03 2 1.0
2017-01-04 3 1.0
2017-01-05 4 1.0
For division you can then simply do:
df_new['b'] = df_new['a'] / df_new[0]

How to resample a pandas dataframe backwards

Hi I am trying to resample a pandas DataFrame backwards.
This is my dataframe:
import numpy as np
import pandas as pd
from random import randint
seconds = np.arange(20, 700, 60)
timedeltas = pd.to_timedelta(seconds, unit='s')
vals = np.array([randint(-10, 10) for a in range(len(seconds))])
df = pd.DataFrame({'values': vals}, index=timedeltas)
then I have
In [252]: df
Out[252]:
values
00:00:20 8
00:01:20 4
00:02:20 5
00:03:20 9
00:04:20 7
00:05:20 5
00:06:20 5
00:07:20 -6
00:08:20 -3
00:09:20 -5
00:10:20 -5
00:11:20 -10
and
In [253]: df.resample('5min').mean()
Out[253]:
values
00:00:20 6.6
00:05:20 -0.8
00:10:20 -7.5
and what I would like is something like
Out[***]:
values
00:01:20 6
00:06:20 valb
00:11:20 -5.8
where the value at each new time is what I get if I roll back through the dataframe and compute the mean of each bin going from the end towards the start. For example, in this
case the last value should be
valc = (-6 - 3 - 5 - 5 - 10) / 5
valc = -5.8
which is the average of the last 5 values, and the first one should be the average of only the first 2 values because that "bin" is incomplete.
Reading the pandas documentation I thought I had to use the parameter how='last', but in my current version of pandas (0.20.3) this is not working. Additionally I tried the closed and convention options, but I wasn't able to achieve this.
Thanks for the help
The easiest way is to sort the index in reverse order, then resample to get the desired results:
df.sort_index(ascending=False).resample('5min').mean()
Resample reference: when the resample starts, the first bin has the maximum available length, in this case 5. The closed, label and convention parameters are helpful, but they do not compute the mean going from the end backwards; to do that, sort first.

Using Python's Pandas to find average values by bins

I just started using pandas to analyze groundwater well data over time.
My data in a text file looks like (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned by 5 year increments and with a count:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)
I can find the overall average by site with:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.
The most concise way is probably to convert this to time series data and then downsample to get the means:
In [75]:
print(df)
ID Level
1
1980-04-17 485438103132901 -7.10
1980-05-06 485438103132901 -6.80
1979-09-10 483622101085001 -6.70
1979-07-31 485438103132901 -6.20
1980-11-11 483845101112801 -5.37
1980-11-11 484123101124601 -5.30
1977-07-06 485438103132901 -4.98
In [76]:
df.Level.resample('60M').mean()
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
Alternatively, you may use groupby together with cut:
In [99]:
print(df.groupby(pd.cut(df.index.year,
                        pd.date_range('1960', periods=5, freq='5A').year,
                        include_lowest=True)).mean())
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
In [100]:
print(df.groupby(['ID',
                  pd.cut(df.index.year,
                         pd.date_range('1960', periods=5, freq='5A').year,
                         include_lowest=True)]).mean())
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
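Since the question also asks for a count alongside each average, swapping .mean() for .agg() on the Level column gives both in one go; a small variation on the grouping above:
df.groupby(['ID',
            pd.cut(df.index.year,
                   pd.date_range('1960', periods=5, freq='5A').year,
                   include_lowest=True)])['Level'].agg(['mean', 'count'])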
So what I like to do is create a separate column with the rounded bin number:
import numpy as np
bin_width = 50000
mult = 1. / bin_width
# ser is the series of values you want to bin, e.g. df['date'] here
df['bin'] = np.floor(ser * mult + .5) / mult
Then just group by the bins themselves:
df.groupby('bin').mean()
Another note: you can do multiple truth evaluations in one go:
df[(df.date > a) & (df.date < b)]
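Putting that together for this question's YYYYMMDD integer dates (5 years corresponds to a step of 50000 in that encoding), a rough sketch using the column names from the question:
bin_width = 50000
mult = 1. / bin_width
df['bin'] = np.floor(df['date'] * mult + .5) / mult
df.groupby(['site_no', 'bin'])['wtr_lvl'].agg(['mean', 'count'])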
