I have a dataframe with time series data in one column and a date index. The data spans 2000 to 2015 and looks like this:
2000-02-24 NaN
2000-02-25 NaN
2000-02-26 0.272
2000-02-27 0.417
2000-02-28 0.837
2000-02-29 1.082
2000-03-01 0.613
2000-03-02 0.709
2000-03-03 0.857
2000-03-04 0.391
2000-03-05 0.470
2000-03-06 0.288
2000-03-07 0.286
I want data only from the months March to July for each year. Is there any way to do that in pandas?
You can filter using the DatetimeIndex attribute month:
df[(df.index.month >= 3) & (df.index.month <= 7)]
DatetimeIndex.month is one-based, so 3 corresponds to March and 7 to July.
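A minimal, self-contained sketch (with made-up data) illustrating the filter:
import numpy as np
import pandas as pd

# dummy daily series spanning the same years as the question
idx = pd.date_range('2000-01-01', '2015-12-31', freq='D')
df = pd.DataFrame({'value': np.random.random(len(idx))}, index=idx)

# keep only rows whose month falls in March (3) through July (7)
march_to_july = df[(df.index.month >= 3) & (df.index.month <= 7)]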
I'm trying to get percentage changes per month/year so that I see years in the index and months in the columns.
Here's what the original data looks like:
time
2009-12-31 1.692868
2010-01-03 1.693478
2010-01-04 1.681354
2010-01-05 1.681792
2010-01-06 1.676942
2010-01-07 1.685896
2010-01-08 1.675619
2010-01-09 1.675620
2010-01-10 1.671965
2010-01-11 1.668323
I then used the following expression to obtain the monthly percentage change:
prices.resample("M").ffill().pct_change().apply(lambda x: round(x*100,2))
Here's the data I received:
time
2009-12-31 NaN
2010-01-31 1.32
2010-02-28 0.48
2010-03-31 -0.49
2010-04-30 0.11
2010-05-31 4.45
2010-06-30 -1.30
2010-07-31 -4.09
2010-08-31 1.08
2010-09-30 -3.72
2010-10-31 -1.91
2010-11-30 2.93
2010-12-31 -3.42
2011-01-31 0.14
2011-02-28 -0.83
2011-03-31 -0.40
2011-04-30 -3.91
2011-05-31 0.88
2011-06-30 -0.34
2011-07-31 -2.66
However, my final goal is to have percentage changes for each month, so that I have years in the index and months in the columns. How can I do it? I would appreciate any advice. Also, I am wondering how to build a similar heatmap with matplotlib.
Here's an example of what I need.
You can obtain your graph directly without the first step:
As your example is a bit short, let's use this dummy one:
import numpy as np
import pandas as pd

dates = pd.date_range('2009-01-01', '2020-12-31')
df = pd.DataFrame({'time': dates,
                   'value': np.random.random(size=len(dates)),
                   }).set_index('time')
>>> print(df)
value
time
2009-01-01 0.661110
2009-01-02 0.757710
2009-01-03 0.490736
2009-01-04 0.148575
2009-01-05 0.715500
... ...
2020-12-27 0.715620
2020-12-28 0.171634
2020-12-29 0.412512
2020-12-30 0.814756
2020-12-31 0.427202
[4383 rows x 1 columns]
processing:
Here we'll use pivot_table to reshape the data, taking the mean of the values, then we'll apply pct_change:
# just to get month names in order
cols = pd.date_range('2020', '2021', freq='M').month_name().str[:3]

df2 = (df.assign(year=df.index.year,
                 month=df.index.month_name().str[:3],
                 )
         .pivot_table(index='year', columns='month', values='value', fill_value=0)
         .pct_change()
         [cols]
       )
output:
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
2009 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010 0.214070 0.065160 -0.073928 -0.013328 0.145379 -0.000346 0.118528 0.069972 0.037107 -0.249954 -0.244608 -0.087839
2011 0.031421 0.126091 -0.032321 -0.000812 0.004430 -0.084645 0.026099 0.020069 0.073262 0.087346 0.228409 0.093019
2012 -0.095717 -0.248492 0.094027 0.173968 0.307899 0.075966 -0.200719 0.030460 -0.185117 0.107859 -0.090682 -0.109882
2013 0.075015 0.242875 -0.049873 -0.195608 -0.144225 0.017974 0.311462 0.041423 0.277412 -0.113914 0.175273 0.045258
2014 0.018353 -0.113219 0.189669 0.064207 0.036269 0.022477 -0.232103 0.109159 -0.103024 -0.088224 -0.159047 0.067562
2015 -0.094678 0.186993 -0.128900 0.074652 0.054206 0.093470 0.111634 -0.053931 0.034411 -0.088852 0.181860 -0.055049
2016 0.057190 0.029102 0.011317 -0.051180 -0.181694 -0.084899 0.013056 -0.078995 -0.198341 0.377086 -0.096291 -0.181843
2017 -0.161556 -0.059750 -0.051224 -0.202536 0.165222 -0.086402 0.116095 -0.029666 0.224123 -0.010386 -0.081571 0.381159
2018 0.109618 -0.004155 -0.007470 0.251640 -0.100422 -0.113325 -0.161298 -0.107079 0.023862 -0.029307 0.070167 -0.144116
2019 0.027455 -0.189825 0.142514 -0.037071 0.100118 0.157974 0.020722 0.022490 -0.187602 0.168074 0.187713 0.209489
2020 0.014801 0.310334 -0.037249 -0.010381 -0.231910 0.012961 0.128481 -0.083552 0.186090 -0.055755 -0.102882 -0.020587
plotting:
Let's use seaborn.heatmap to plot with the "vlag" colormap (blue/red is much better than green/red for colorblind readers):
import seaborn as sns
ax = sns.heatmap(df2, cmap='vlag', center=0, annot=True, fmt='.2f')
ax.figure.set_size_inches(8, 6)
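The question also mentions matplotlib; here is a minimal sketch of the same heatmap using plain matplotlib's imshow (assuming df2 from above; the colormap and figure size are illustrative choices):
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
# diverging colormap centred on 0, like the seaborn version
lim = df2.abs().max().max()
im = ax.imshow(df2, cmap='RdBu_r', vmin=-lim, vmax=lim, aspect='auto')
ax.set_xticks(range(len(df2.columns)))
ax.set_xticklabels(df2.columns)
ax.set_yticks(range(len(df2.index)))
ax.set_yticklabels(df2.index)
fig.colorbar(im, ax=ax)
plt.show()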
I had a similar objective a while ago and solved it using the groupby method. You do have to use an aggregation function (I used mean()); since you already have only a single value per month and year, this is a bit inefficient, but it still works for your case.
df["month"] = df.index.month
# Have a dataframe named df with a month column, and group it by year
df2 = df.groupby([df.index.year, 'month'])[1].mean().unstack()
print(df2)
This gives this output:
month 1 2 3 4 5 6 7 8 9 10 11 12
0
2009 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010 1.32 0.48 -0.49 0.11 4.45 -1.30 -4.09 1.08 -3.72 -1.91 2.93 -3.42
2011 0.14 -0.83 -0.40 -3.91 0.88 -0.34 -2.66 NaN NaN NaN NaN NaN
Now I don't use matplotlib myself, but with plotly you can make a heatmap quite easily from this:
import plotly.graph_objects as go

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
times = list(df.index.year.unique())
fig2 = go.Figure(data=go.Heatmap(
    z=df2,
    x=months,
    y=times))
fig2.update_xaxes(title_text='Months')
fig2.update_yaxes(title_text='Years', dtick=1)
fig2.show()
I am trying to reindex my pandas dataframe to a column-wise MultiIndex. Most answers I've explored seem to address only the row-wise case. My current df looks like this:
ticker calendardate eps price ps revenue
0 ABNB 2019-12-31 -2.59 NaN NaN 4.80
1 ABNB 2020-12-31 -16.12 146.80 25.962 3.37
2 AMZN 2019-12-31 23.46 1847.84 3.266 2.80
3 AMZN 2020-12-31 42.64 3256.93 4.233 3.86
I want a MultiIndex based upon calendardate so that my output looks like this:
ticker eps price ps revenue
2019 2020 2019 2020 2019 2020 2019 2020
0 ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.80 3.37
1 AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.80 3.86
Any help would be appreciated. Thanks
We can use str.split to split the column calendardate on the delimiter '-', then use str[0] to select the year portion of the split column. Now set the index of the dataframe to ticker along with the extracted year, followed by unstack to reshape.
y = df['calendardate'].str.split('-', n=1).str[0]
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
If the dtype of column calendardate is datetime then we can instead use:
y = df['calendardate'].dt.year
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
eps price ps revenue
calendardate 2019 2020 2019 2020 2019 2020 2019 2020
ticker
ABNB -2.59 -16.12 NaN 146.80 NaN 25.962 4.8 3.37
AMZN 23.46 42.64 1847.84 3256.93 3.266 4.233 2.8 3.86
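For reference, here is a self-contained sketch reproducing the example above, assuming calendardate is stored as a string:
import pandas as pd

df = pd.DataFrame({
    'ticker': ['ABNB', 'ABNB', 'AMZN', 'AMZN'],
    'calendardate': ['2019-12-31', '2020-12-31', '2019-12-31', '2020-12-31'],
    'eps': [-2.59, -16.12, 23.46, 42.64],
    'price': [None, 146.80, 1847.84, 3256.93],
    'ps': [None, 25.962, 3.266, 4.233],
    'revenue': [4.80, 3.37, 2.80, 3.86],
})

# extract the year, move it into the row index, then unstack it into column levels
y = df['calendardate'].str.split('-', n=1).str[0]
out = df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
print(out)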
I have a dataframe which has rates for multiple ('N') currencies over a time period.
dataframe
Dates AUD CAD CHF GBP EUR
20/05/2019 0.11 -0.25 -0.98 0.63 0.96
21/05/2019 0.14 -0.35 -0.92 1.92 0.92
...
02/01/2020 0.135 -0.99 -1.4 0.93 0.83
First, I would like to reshape the dataframe to look like the below, as I want to join it with another table in a similar format:
dataframe
Dates Pairs Rates
20/05/2019 AUD 0.11
20/05/2019 CAD -0.25
20/05/2019 CHF -0.98
...
...
02/01/2020 AUD 0.135
02/01/2020 CAD -0.99
02/01/2020 CHF -1.4
Then, for every currency, I would like to plot a histogram. So with the above, it would be 5 separate histograms, one per currency.
I assume I would need some sort of loop, but I'm not sure of the easiest way to approach it.
Thanks
Use DataFrame.melt first:
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
df = df.melt('Dates', var_name='Pairs', value_name='Rates')
print (df)
Dates Pairs Rates
0 2019-05-20 AUD 0.110
1 2019-05-21 AUD 0.140
2 2020-01-02 AUD 0.135
3 2019-05-20 CAD -0.250
4 2019-05-21 CAD -0.350
5 2020-01-02 CAD -0.990
6 2019-05-20 CHF -0.980
7 2019-05-21 CHF -0.920
8 2020-01-02 CHF -1.400
9 2019-05-20 GBP 0.630
10 2019-05-21 GBP 1.920
11 2020-01-02 GBP 0.930
12 2019-05-20 EUR 0.960
13 2019-05-21 EUR 0.920
14 2020-01-02 EUR 0.830
And then DataFrameGroupBy.hist:
df.groupby('Pairs').hist()
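If you want each histogram as its own labelled figure instead, a small loop over the groups works too (a sketch, assuming the melted df from above):
import matplotlib.pyplot as plt

# one labelled histogram per currency
for pair, grp in df.groupby('Pairs'):
    fig, ax = plt.subplots()
    grp['Rates'].hist(ax=ax)
    ax.set_title(pair)
plt.show()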
Dates rates
7/26/2019 1.04
7/30/2019 1.0116
7/31/2019 1.005
8/1/2019 1.035
8/2/2019 1.01
8/6/2019 0.9886
8/12/2019 0.965
df = df.merge(
    pd.DataFrame({'Dates': df['Dates'] + pd.offsets.BDay()}), on='Dates', how='outer'
).sort_values('Dates').bfill().dropna().reset_index(drop=True)
print(df)
I tried the above code but it's unable to fix consecutive missing business days; it only fixes a single missing day. In the dataframe above, 29th July 2019 and then the 5th, 7th, 8th and 9th of August are missing, and these are weekdays. I need to populate the missing weekday dates and assign each one the rate of the next available date. For example, assign the 30th July 2019 rate to the missing 29th July 2019 as well, and so on for all missing dates. Please suggest. Thanks. I expect the following output:
Dates rates
7/26/2019 1.04
7/29/2019 1.0116
7/30/2019 1.0116
7/31/2019 1.005
8/1/2019 1.035
8/2/2019 1.01
8/5/2019 0.9886
8/6/2019 0.9886
8/7/2019 0.965
8/8/2019 0.965
8/9/2019 0.965
8/12/2019 0.965
You can use reindex with bdate_range to create the missing business days, filling the rates from the next available date with method='bfill':
new_df = df.set_index('Dates')\
           .reindex(pd.bdate_range(df.Dates.min(), df.Dates.max(), name='Dates'),
                    method='bfill')\
           .reset_index()
print (new_df)
Dates rates
0 2019-07-26 1.0400
1 2019-07-29 1.0116
2 2019-07-30 1.0116
3 2019-07-31 1.0050
4 2019-08-01 1.0350
5 2019-08-02 1.0100
6 2019-08-05 0.9886
7 2019-08-06 0.9886
8 2019-08-07 0.9650
9 2019-08-08 0.9650
10 2019-08-09 0.9650
11 2019-08-12 0.9650
You could create a Series of all business days, then outer merge and bfill the missing values. This will retain any non-business days in your initial DataFrame (if any) and will also use their values in the filling.
import pandas as pd

#df['Dates'] = pd.to_datetime(df['Dates'])
s = pd.Series(pd.date_range(df['Dates'].min(), df['Dates'].max(), freq='D'),
              name='Dates')
s = s[s.dt.dayofweek.lt(5)]  # keep Monday-Friday only
df = df.merge(s, how='outer').sort_values('Dates').bfill()
Dates rates
0 2019-07-26 1.0400
7 2019-07-29 1.0116
1 2019-07-30 1.0116
2 2019-07-31 1.0050
3 2019-08-01 1.0350
4 2019-08-02 1.0100
8 2019-08-05 0.9886
5 2019-08-06 0.9886
9 2019-08-07 0.9650
10 2019-08-08 0.9650
11 2019-08-09 0.9650
6 2019-08-12 0.9650
I have a csv which looks like this:
Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc
My goal is to aggregate the data by month and calculate the average for each month. The dates might not start on the 1st, or in January. The problem is that I have a lot of data spanning multiple years. I would therefore like to find the earliest month and, from there, start counting months and their averages. For example:
Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3
I'm using Pandas to open the csv
data = pd.read_csv("pks.csv", sep=",")
so in data['Date'] I have dates and in data['Sentiment'] I have values. Any idea how to do it?
Probably the simplest approach is to use the resample command. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part; I am reading in your sample data from a multi-line string):
>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'],
                     index_col='Date')
>>> df
Sentiment
Date
2014-01-03 0.40
2014-01-04 -0.03
2014-01-09 0.00
2014-01-10 0.07
2014-01-12 0.00
2014-02-24 0.00
2014-02-25 0.00
2014-02-25 0.00
2014-02-26 0.00
2014-02-28 0.00
2014-03-01 0.10
2014-03-02 -0.50
2014-03-03 0.00
2014-03-08 -0.06
2014-03-11 -0.13
2014-03-22 0.00
2014-03-23 0.33
2014-03-23 0.30
2014-03-25 -0.14
2014-03-28 -0.25
>>> df.resample('M').mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
And if you want a month counter, you can add it after your resample:
>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg
Sentiment cnt
2014-01-31 0.088 0
2014-02-28 0.000 1
2014-03-31 -0.035 2
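If you want the counter to start at 1, as in your example output, shift the range:
>>> agg['cnt'] = range(1, len(agg) + 1)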
You can also do this with the groupby method and pd.Grouper, the successor to the now-removed pd.TimeGrouper (group by month and then call the mean convenience method that is available with groupby).
>>> df.groupby(pd.Grouper(freq='M')).mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
To get the monthly average values of a DataFrame when it has daily data rows in 'Sentiment', I would:
Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date',inplace=True)
Then convert the index dates into a month index: df.index.month
Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).Sentiment.mean()
I'll go slowly through each step here:
Generating a DataFrame with dates and values
First you need to import pandas and NumPy, as well as the datetime module:
import pandas as pd
import numpy as np
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random integers between 0 and 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment']=np.random.randint(0,100,size=(len(date_rng)))
The df has two columns, 'date' and 'Sentiment':
date Sentiment
0 2018-01-07 34
1 2018-01-14 32
2 2018-01-21 15
3 2018-01-28 0
4 2018-02-04 95
5 2018-02-11 53
6 2018-02-18 7
7 2018-02-25 35
8 2018-03-04 17
Set the 'date' column as the index of the DataFrame:
df.set_index('date',inplace=True)
df has one column 'Sentiment' and the index is 'date':
Sentiment
date
2018-01-07 34
2018-01-14 32
2018-01-21 15
2018-01-28 0
2018-02-04 95
2018-02-11 53
2018-02-18 7
2018-02-25 35
2018-03-04 17
Capture the month number from the index
months=df.index.month
Obtain the mean value of each month by grouping by month:
monthly_avg=df.groupby(months).Sentiment.mean()
The monthly mean of the dataset, monthly_avg, is:
date
1 20.25
2 47.50
3 17.00
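One caveat: grouping by df.index.month alone pools the same calendar month from different years together. If the data spans several years and you want a continuing month count, as in the question, group by year and month and then renumber, for example:
monthly_avg = df.groupby([df.index.year, df.index.month]).Sentiment.mean()
monthly_avg.index = range(1, len(monthly_avg) + 1)  # continuous month counter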