I have two dataframes:
a.head()
AAPL SPY date
0 1000000.000000 1000000.000000 2010-01-04
1 921613.643818 969831.805642 2010-02-04
2 980649.393244 1000711.933790 2010-03-04
3 980649.393244 1000711.933790 2010-04-04
4 1232535.257461 1059090.504583 2010-05-04
and
b.head()
date test
0 2010-01-26 22:17:44 990482.664854
1 2010-03-09 22:37:17 998565.699784
2 2010-03-12 02:11:23 989957.374785
3 2010-04-05 18:01:37 994315.860439
4 2010-04-06 11:06:50 987887.723816
After I set the index for a and b (set_index('date')), I can use the pandas plot() function to create a nice plot with the date as the x-axis and the various columns as y-values. What I want to do is plot two dataframes with different indices on the same figure. As you can see from a and b, the indices are different, and I want to plot them on the same figure.
I tried merge and concat to join the dataframes together, but the resulting plot is not what I'd like, because those functions insert numpy.NaN where the dates don't match, which creates discontinuities in my plots. I could use fillna(), but that's not what I want either; I'd rather the lines connect the points than drop down to 0.
Assuming you want the same time scale on the x-axis, you will need timestamps as the index for both a and b before concatenating the columns.
You can then use interpolation to fill in the missing data, optionally with ffill() as an additional operation if you want to fill forward past the last observed data point.
df = pd.concat([a, b.set_index('date')], axis=1)
df.interpolate(method='time').plot()  # optionally: df.interpolate(method='time').ffill().plot()
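As a minimal end-to-end sketch (assuming a has not been indexed yet and that both date columns parse as real timestamps; everything else matches the question):
import pandas as pd
# make sure both 'date' columns are actual datetimes, not strings
a['date'] = pd.to_datetime(a['date'])
b['date'] = pd.to_datetime(b['date'])
# align both frames on one DatetimeIndex; rows present in only one frame get NaN in the other's columns
df = pd.concat([a.set_index('date'), b.set_index('date')], axis=1)
# time-weighted interpolation connects the points across the NaN gaps instead of letting the lines break
df.interpolate(method='time').ffill().plot()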
I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
Following is the code to get the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but in the result the season appears only once per group, in the index.
I'd like the season to fill in each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is that you want a dataframe without a multi-level index.
The index can't be reset when a name in the index and a column name are the same.
Use pandas.Series.reset_index and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
It works with the following implementation, because .groupby creates a pandas.Series.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
Using the OP's code for a
As already noted above, use normalize=True to get normalized values.
The solution in the OP creates a DataFrame because the .groupby is wrapped in the DataFrame constructor, pandas.DataFrame.
To reset the index, first pandas.DataFrame.rename the bin column, and then use pandas.DataFrame.reset_index.
a = (pd.DataFrame(df.groupby('season')['bin'].value_counts()
                  / df.groupby('season')['bin'].count())
       .rename(columns={'bin': 'normalized_bin'})
       .reset_index())
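Both routes should produce the same table; here is a quick sanity check with the test df built above (rows are sorted first so that ordering differences don't matter):
via_series = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
via_frame = (pd.DataFrame(df.groupby('season')['bin'].value_counts()
                          / df.groupby('season')['bin'].count())
               .rename(columns={'bin': 'normalized_bin'})
               .reset_index())
key = ['season', 'bin']
# should print True
print(via_series.sort_values(key).reset_index(drop=True)
                .equals(via_frame.sort_values(key).reset_index(drop=True)))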
Other Resources
See "Pandas unable to reset index because name exist" to reset by a level.
Plotting
It is easier to plot from the multi-index Series by using pandas.Series.unstack() and then pandas.DataFrame.plot.bar.
For side-by-side bars, set stacked=False.
The bars are all equal to 1, because this is normalized data.
import matplotlib.pyplot as plt

s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar; each bar sums to 1 because the data is normalized
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for the normalize parameter:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it in the pandas.Series.value_counts documentation.
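For example, with the test df from the longer answer above, this reproduces the same normalized values as the table shown there:
bin_probs = df.groupby('season')['bin'].value_counts(normalize=True)
print(bin_probs.loc['fall'].head(3))  # 0.156, 0.116, 0.108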
I have a dataset that has multiple values received per second, up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time more granularly than the second, so multiple rows share the same hh:mm:ss timestamp. Those are fine, but I also have several seconds missing across the set, i.e., not showing at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from this, I don't want to aggregate but I do need to create the missing seconds, insert them and fill (NaN, etc) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a nested if loop that compares the previous timestamp to the current one: if they are equal (pt == ct), no action; if the previous is one second behind (pt == ct - 1), no action; but if it trails the current cell by 2 or more seconds (pt <= ct - 2), insert the missing rows. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here as well as in various functions on pandas.pydata.org but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.
This can be solved by creating a pandas Series containing all the timepoints you want to consider, and then merging it with the original dataframe.
For example:
# one row for every second between the first and last observed timestamps
start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
# the outer merge keeps the duplicate timestamps and inserts the missing seconds
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
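For reference, here is a self-contained version of the above (the column name value is an assumption, chosen to match the sample output):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-24 03:36:53', '2020-08-24 03:36:53',
                            '2020-08-24 03:36:53', '2020-08-24 03:36:55']),
    'value': [5, 8, 6, 8],
})

start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
# duplicates are kept and the missing 03:36:54 row is inserted; drop .fillna(0) to keep NaN instead
out = df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
print(out)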
I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number, so that, for example, the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and perform the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to repeat its value within each month.
You can extract the information with to_period('M') and then use map.
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can do the same in one line; note the to_series(index=df_daily.index), which keeps the daily index so the division aligns (with the default PeriodIndex the two series would not align and you would get NaN):
df_daily['a'] / df_daily.index.to_period('M').to_series(index=df_daily.index).map(df_monthly)
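A quick check with the setup from the question (January's values divide by 1, February's by 2):
result = df_daily['a'] / df_daily.index.to_period('M').to_series(index=df_daily.index).map(df_monthly)
print(result.loc['2017-01-02'])  # 1 / 1 = 1.0
print(result.loc['2017-02-01'])  # 31 / 2 = 15.5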
You can create a temporary key from the month of each index, then merge the two dataframes on that key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop(columns='key')
a 0
2017-01-01 0 1.0
2017-01-02 1 1.0
2017-01-03 2 1.0
2017-01-04 3 1.0
2017-01-05 4 1.0
For the division you can then simply do:
df_new['b'] = df_new['a'] / df_new[0]
I have a dataframe with dates (datetime) in Python. How can I plot a histogram of the occurrences with 30 min bins using this dataframe?
starttime
1 2016-09-11 00:24:24
2 2016-08-28 00:24:24
3 2016-07-31 05:48:31
4 2016-09-11 00:23:14
5 2016-08-21 00:55:23
6 2016-08-21 01:17:31
.............
989872 2016-10-29 17:31:33
989877 2016-10-02 10:00:35
989878 2016-10-29 16:42:41
989888 2016-10-09 07:43:27
989889 2016-10-09 07:42:59
989890 2016-11-05 14:30:59
I have tried looking at examples from Plotting series histogram in Pandas and A per-hour histogram of datetime using Pandas, but they seem to use a bar plot, which is not what I need. I attempted to create the histogram using temp.groupby([temp["starttime"].dt.hour, temp["starttime"].dt.minute]).count().plot(kind="hist"), giving me the results shown below.
If possible I would like the x-axis to display the time (e.g. 07:30:00).
I think you need a bar plot, and for an axis with times the simplest approach is to convert the datetimes to strings with strftime:
temp = temp.resample('30T', on='starttime').count()
ax = temp.groupby(temp.index.strftime('%H:%M')).sum().plot(kind="bar")
# hide some of the tick labels so the bar plot stays readable
spacing = 2
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
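If you want the tick labels to include seconds, like 07:30:00 in the question, group on strftime('%H:%M:%S') instead:
ax = temp.groupby(temp.index.strftime('%H:%M:%S')).sum().plot(kind="bar")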
Hi, I am trying to resample a pandas DataFrame backwards.
This is my dataframe:
import numpy as np
import pandas as pd
from random import randint

seconds = np.arange(20, 700, 60)
timedeltas = pd.to_timedelta(seconds, unit='s')
vals = np.array([randint(-10, 10) for a in range(len(seconds))])
df = pd.DataFrame({'values': vals}, index=timedeltas)
then I have
In [252]: df
Out[252]:
values
00:00:20 8
00:01:20 4
00:02:20 5
00:03:20 9
00:04:20 7
00:05:20 5
00:06:20 5
00:07:20 -6
00:08:20 -3
00:09:20 -5
00:10:20 -5
00:11:20 -10
and
In [253]: df.resample('5min').mean()
Out[253]:
values
00:00:20 6.6
00:05:20 -0.8
00:10:20 -7.5
and what I would like is something like
Out[***]:
values
00:01:20 6
00:06:20 valb
00:11:20 -5.8
where the value at each new time comes from rolling the dataframe back from the end and computing the mean in each bin, going from back to front. For example, in this case the last value should be
valc = (-6 - 3 - 5 - 5 - 10) / 5 = -5.8
which is the average of the last 5 values, and the first one should be the average of only the first 2 values, because that "bin" is incomplete.
Reading the pandas documentation I thought I had to use the parameter how='last', but in my current version of pandas (0.20.3) this is not working. I also tried the closed and convention options, but I wasn't able to get this behaviour.
Thanks for the help
The easiest way is to sort the index in reverse order, then resample to get the desired results:
df.sort_index(ascending=False).resample('5min').mean()
Resample reference: when the resample starts, the first bin has the maximum available length, in this case 5. The closed, label and convention parameters are helpful, but they do not compute the mean going from back to front; to do that, use the sort.
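If you would rather make the backwards binning explicit instead of relying on the sort, one alternative sketch (not part of the answer above) groups each row by its distance from the final timestamp:
# bin 0 = the last 5 minutes, bin 1 = the 5 minutes before that, ...
end = df.index.max()
bins = ((end - df.index).total_seconds() // 300).astype(int)
df.groupby(bins)['values'].mean()
# with the sample data: bin 0 -> -5.8, bin 1 -> 6.2, bin 2 -> 6.0 (the incomplete first bin)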