AttributeError: Unknown property color_cycle - python

I am learning pandas and trying to plot the id column, but I get the error AttributeError: Unknown property color_cycle and an empty graph. The empty graph only appears in the interactive shell; when I run the same code as a script, I get the same error and no graph appears at all.
Below is the log:
>>> import pandas as pd
>>> pd.set_option('display.mpl_style', 'default')
>>> df = pd.read_csv('2015.csv', parse_dates=['log_date'])
>>> employee_198 = df[df['employee_id'] == 198]
>>> print(employee_198)
id version company_id early_minutes employee_id late_minutes \
90724 91635 0 1 NaN 198 NaN
90725 91636 0 1 NaN 198 0:20:00
90726 91637 0 1 0:20:00 198 NaN
90727 91638 0 1 0:05:00 198 NaN
90728 91639 0 1 0:25:00 198 NaN
90729 91640 0 1 0:15:00 198 0:20:00
90730 91641 0 1 NaN 198 0:15:00
90731 91642 0 1 NaN 198 NaN
90732 91643 0 1 NaN 198 NaN
90733 91644 0 1 NaN 198 NaN
90734 91645 0 1 NaN 198 NaN
90735 91646 0 1 NaN 198 NaN
90736 91647 0 1 NaN 198 NaN
90737 91648 0 1 NaN 198 NaN
90738 91649 0 1 NaN 198 NaN
90739 91650 0 1 NaN 198 0:10:00
90740 91651 0 1 NaN 198 NaN
90741 91652 0 1 NaN 198 NaN
90742 91653 0 1 NaN 198 NaN
90743 91654 0 1 NaN 198 NaN
90744 91655 0 1 NaN 198 NaN
90745 91656 0 1 NaN 198 NaN
90746 91657 0 1 1:30:00 198 NaN
90747 91658 0 1 0:04:25 198 NaN
90748 91659 0 1 NaN 198 NaN
90749 91660 0 1 NaN 198 NaN
90750 91661 0 1 NaN 198 NaN
90751 91662 0 1 NaN 198 NaN
90752 91663 0 1 NaN 198 NaN
90753 91664 0 1 NaN 198 NaN
90897 91808 0 1 NaN 198 0:04:14
91024 91935 0 1 NaN 198 0:21:43
91151 92062 0 1 NaN 198 0:42:07
91278 92189 0 1 NaN 198 0:16:36
91500 92411 0 1 NaN 198 0:07:12
91532 92443 0 1 NaN 198 NaN
91659 92570 0 1 NaN 198 0:53:03
91786 92697 0 1 NaN 198 NaN
91913 92824 0 1 NaN 198 NaN
92040 92951 0 1 NaN 198 NaN
92121 93032 0 1 4:22:35 198 NaN
92420 93331 0 1 NaN 198 NaN
92421 93332 0 1 NaN 198 3:51:15
log_date log_in_time log_out_time over_time remarks \
90724 2015-11-15 No In No Out NaN [Absent]
90725 2015-10-18 10:00:00 17:40:00 NaN NaN
90726 2015-10-19 9:20:00 17:10:00 NaN NaN
90727 2015-10-25 9:30:00 17:25:00 NaN NaN
90728 2015-10-26 9:34:00 17:05:00 NaN NaN
90729 2015-10-27 10:00:00 17:15:00 NaN NaN
90730 2015-10-28 9:55:00 17:30:00 NaN NaN
90731 2015-10-29 9:40:00 17:30:00 NaN NaN
90732 2015-10-30 9:00:00 17:30:00 0:30:00 NaN
90733 2015-10-20 No In No Out NaN [Absent]
90734 2015-10-21 No In No Out NaN [Maha Asthami]
90735 2015-10-22 No In No Out NaN [Nawami/Dashami]
90736 2015-10-23 No In No Out NaN [Absent]
90737 2015-10-24 No In No Out NaN [Off]
90738 2015-11-01 9:15:00 17:30:00 0:15:00 NaN
90739 2015-11-02 9:50:00 17:30:00 NaN NaN
90740 2015-11-03 9:30:00 17:30:00 NaN NaN
90741 2015-11-04 9:40:00 17:30:00 NaN NaN
90742 2015-11-05 9:38:00 17:30:00 NaN NaN
90743 2015-11-06 9:30:00 17:30:00 NaN NaN
90744 2015-11-08 9:30:00 17:30:00 NaN NaN
90745 2015-11-09 9:30:00 17:30:00 NaN NaN
90746 2015-11-10 9:30:00 16:00:00 NaN NaN
90747 2015-11-16 9:30:00 17:25:35 NaN NaN
90748 2015-11-07 No In No Out NaN [Off]
90749 2015-11-11 No In No Out NaN [Laxmi Puja]
90750 2015-11-12 No In No Out NaN [Govardhan Puja]
90751 2015-11-13 No In No Out NaN [Bhai Tika]
90752 2015-11-14 No In No Out NaN [Off]
90753 2015-10-31 No In No Out NaN [Off]
90897 2015-11-17 9:44:14 17:35:01 NaN NaN
91024 2015-11-18 10:01:43 17:36:29 NaN NaN
91151 2015-11-19 10:22:07 17:43:47 NaN NaN
91278 2015-11-20 9:56:36 17:37:00 NaN NaN
91500 2015-11-22 9:47:12 17:46:44 NaN NaN
91532 2015-11-21 No In No Out NaN [Off]
91659 2015-11-23 10:33:03 17:30:00 NaN NaN
91786 2015-11-24 9:34:11 17:32:24 NaN NaN
91913 2015-11-25 9:36:05 17:35:00 NaN NaN
92040 2015-11-26 9:35:39 17:58:05 0:22:26 NaN
92121 2015-11-27 9:08:45 13:07:25 NaN NaN
92420 2015-11-28 No In No Out NaN [Off]
92421 2015-11-29 13:31:15 17:34:44 NaN NaN
shift_in_time shift_out_time work_time under_time
90724 9:30:00 17:30:00 NaN NaN
90725 9:30:00 17:30:00 7:40:00 0:20:00
90726 9:30:00 17:30:00 7:50:00 0:10:00
90727 9:30:00 17:30:00 7:55:00 0:05:00
90728 9:30:00 17:30:00 7:31:00 0:29:00
90729 9:30:00 17:30:00 7:15:00 0:45:00
90730 9:30:00 17:30:00 7:35:00 0:25:00
90731 9:30:00 17:30:00 7:50:00 0:10:00
90732 9:30:00 17:30:00 8:30:00 NaN
90733 9:30:00 17:30:00 NaN NaN
90734 9:30:00 17:30:00 NaN NaN
90735 9:30:00 17:30:00 NaN NaN
90736 9:30:00 17:30:00 NaN NaN
90737 9:30:00 17:30:00 NaN NaN
90738 9:30:00 17:30:00 8:15:00 NaN
90739 9:30:00 17:30:00 7:40:00 0:20:00
90740 9:30:00 17:30:00 8:00:00 NaN
90741 9:30:00 17:30:00 7:50:00 0:10:00
90742 9:30:00 17:30:00 7:52:00 0:08:00
90743 9:30:00 17:30:00 8:00:00 NaN
90744 9:30:00 17:30:00 8:00:00 NaN
90745 9:30:00 17:30:00 8:00:00 NaN
90746 9:30:00 17:30:00 6:30:00 1:30:00
90747 9:30:00 17:30:00 7:55:35 0:04:25
90748 9:30:00 17:30:00 NaN NaN
90749 9:30:00 17:30:00 NaN NaN
90750 9:30:00 17:30:00 NaN NaN
90751 9:30:00 17:30:00 NaN NaN
90752 9:30:00 17:30:00 NaN NaN
90753 9:30:00 17:30:00 NaN NaN
90897 9:30:00 17:30:00 7:50:47 0:09:13
91024 9:30:00 17:30:00 7:34:46 0:25:14
91151 9:30:00 17:30:00 7:21:40 0:38:20
91278 9:30:00 17:30:00 7:40:24 0:19:36
91500 9:30:00 17:30:00 7:59:32 0:00:28
91532 9:30:00 17:30:00 NaN NaN
91659 9:30:00 17:30:00 6:56:57 1:03:03
91786 9:30:00 17:30:00 7:58:13 0:01:47
91913 9:30:00 17:30:00 7:58:55 0:01:05
92040 9:30:00 17:30:00 8:22:26 NaN
92121 9:30:00 17:30:00 3:58:40 4:01:20
92420 9:30:00 17:30:00 NaN NaN
92421 9:30:00 17:30:00 4:03:29 3:56:31
>>> employee_198['id'].plot()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 3497, in __call__
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2587, in plot_series
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2384, in _plot
plot_obj.generate()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 987, in generate
self._make_plot()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1664, in _make_plot
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1678, in _plot
lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1300, in _plot
return ax.plot(*args, **kwds)
File "C:\Python27\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
return func(ax, *args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes\_axes.py", line 1427, in plot
for line in self._get_lines(*args, **kwargs):
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 386, in _grab_next_args
for seg in self._plot_args(remaining, kwargs):
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 374, in _plot_args
seg = func(x[:, j % ncx], y[:, j % ncy], kw, kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 280, in _makeline
seg = mlines.Line2D(x, y, **kw)
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 366, in __init__
self.update(kwargs)
File "C:\Python27\lib\site-packages\matplotlib\artist.py", line 856, in update
raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property color_cycle
>>>

There's currently a bug in pandas 0.17.1 when it is used with Matplotlib 1.5.0. You can check which versions you have installed:
import pandas
import matplotlib
print(pandas.__version__)
print(matplotlib.__version__)
Instead of using
import pandas as pd
pd.set_option('display.mpl_style', 'default')
use:
import matplotlib.style
matplotlib.style.use('ggplot')
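As a rough illustration of the workaround, the sketch below sets a Matplotlib stylesheet and then plots a Series without touching the display.mpl_style option. The small frame is a hypothetical stand-in for employee_198, and the Agg backend is only chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Apply the stylesheet through Matplotlib instead of pd.set_option(...).
plt.style.use("ggplot")

# Hypothetical stand-in for the employee_198 frame from the question.
employee_198 = pd.DataFrame({"id": [91635, 91636, 91637, 91638]})

ax = employee_198["id"].plot()
n_lines = len(ax.get_lines())  # one line is drawn for the single Series
```

In a script, plt.show() (with an interactive backend) or plt.savefig(...) would then be needed to actually see the figure.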

Related

dataframe data transfer with selected values to another dataframe

My goal is to take the Sabah column of the dataframe prdt and write each of its values into the repeated rows labelled Sabah in the dataframe prcal.
prcal
Vakit Start_Date End_Date Start_Time End_Time
0 Sabah 2022-01-01 2022-01-01 NaN NaN
1 Güneş 2022-01-01 2022-01-01 NaN NaN
2 Öğle 2022-01-01 2022-01-01 NaN NaN
3 İkindi 2022-01-01 2022-01-01 NaN NaN
4 Akşam 2022-01-01 2022-01-01 NaN NaN
..........................................................
2184 Sabah 2022-12-31 2022-12-31 NaN NaN
2185 Güneş 2022-12-31 2022-12-31 NaN NaN
2186 Öğle 2022-12-31 2022-12-31 NaN NaN
2187 İkindi 2022-12-31 2022-12-31 NaN NaN
2188 Akşam 2022-12-31 2022-12-31 NaN NaN
2189 rows × 5 columns
prdt
Day Sabah Güneş Öğle İkindi Akşam Yatsı
0 2022-01-01 06:51:00 08:29:00 13:08:00 15:29:00 17:47:00 19:20:00
1 2022-01-02 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:21:00
2 2022-01-03 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:22:00
3 2022-01-04 06:51:00 08:29:00 13:09:00 15:31:00 17:49:00 19:22:00
4 2022-01-05 06:51:00 08:29:00 13:10:00 15:32:00 17:50:00 19:23:00
...........................................................................
360 2022-12-27 06:49:00 08:27:00 13:06:00 15:25:00 17:43:00 19:16:00
361 2022-12-28 06:50:00 08:28:00 13:06:00 15:26:00 17:43:00 19:17:00
362 2022-12-29 06:50:00 08:28:00 13:07:00 15:26:00 17:44:00 19:18:00
363 2022-12-30 06:50:00 08:28:00 13:07:00 15:27:00 17:45:00 19:18:00
364 2022-12-31 06:50:00 08:28:00 13:07:00 15:28:00 17:46:00 19:19:00
365 rows × 7 columns
I selected every row labelled Sabah with prcal.iloc[::6,:].
I made a list from prdt['Sabah'].
When I combine them with prcal.iloc[::6,:] = prdt['Sabah'][0:365], I get a ValueError:
ValueError: Must have equal len keys and value when setting with an iterable
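One possible reading of the error is that the left-hand side selects all five columns of each Sabah row while the right-hand side is a single column of values. The sketch below, on two days of made-up data, targets a single column instead. It assumes that Start_Time is the column to fill and that the Sabah rows in prcal appear in the same date order as the rows of prdt; neither is stated in the question.

```python
import pandas as pd

# Small hypothetical stand-ins for the frames in the question.
prdt = pd.DataFrame({
    "Day": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "Sabah": ["06:51:00", "06:52:00"],
    "Güneş": ["08:29:00", "08:30:00"],
})
prcal = pd.DataFrame({
    "Vakit": ["Sabah", "Güneş", "Sabah", "Güneş"],
    "Start_Date": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]
    ),
    "Start_Time": [None, None, None, None],
})

# Select the Sabah rows by label rather than by position, and name one
# target column so both sides of the assignment have the same length.
mask = prcal["Vakit"] == "Sabah"
prcal.loc[mask, "Start_Time"] = prdt["Sabah"].to_numpy()
```

Using to_numpy() assigns the values positionally, so pandas does not try to align the two frames on their (different) indexes.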

Continuous dates for products in Pandas

I started to work with Pandas and I have some issues that I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a timeseries for each product in a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
It has to be the same range for all products even if one product doesn't appear in one date in the range.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range but it doesn't work because there are identical indexes.
idx = pd.date_range('25-12-2019', '01-01-2020')
df = df.reindex(idx)
I also have tried to index by date and product and then reindex, but I don't know how to put the product that is missing.
Any more ideas?
Thanks in advance
We can use pd.date_range and groupby.reindex to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert the index to datetime:
df2.index = pd.to_datetime(df2.index)
Create the unique combinations of date and product:
import itertools
idx = pd.date_range("25-12-2019", "01-01-2020")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex the dataframe:
df2.set_index("product", append=True).reindex(temp).sort_index(
level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically tidyverse, it can be achieved with the complete method. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (A PR has been submitted already for this).
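The reindexing idea above can also be sketched with pd.MultiIndex.from_product, which builds every (product, date) pair without itertools. This is a minimal sketch on the sample data from the question; the names full_index and out are illustrative only.

```python
import pandas as pd

# Sample data from the question, with the dates as the index.
df = pd.DataFrame(
    {
        "product": [14, 14, 14, 4, 4, 4],
        "udsStock": [161, 1340, 856, 3132, 3177, 500],
        "udsSales": [848, 914, 0, 439, 616, 883],
    },
    index=pd.to_datetime(
        ["2019-12-26", "2019-12-27", "2019-12-30",
         "2019-12-25", "2019-12-27", "2020-01-01"]
    ).rename("date"),
)

# Every product paired with every day in the requested range.
dates = pd.date_range("2019-12-25", "2020-01-01", freq="D", name="date")
full_index = pd.MultiIndex.from_product(
    [df["product"].unique(), dates], names=["product", "date"]
)

out = (
    df.set_index("product", append=True)  # index becomes (date, product)
      .swaplevel()                         # reorder to (product, date)
      .reindex(full_index)                 # missing pairs become NaN rows
      .reset_index(level="product")        # product back to a column
)
```

Because product is part of the index while reindexing, the (date, product) pairs are unique, which avoids the duplicate-index problem described in the question.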

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", for which no solution is provided.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at a 5-minute frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
For example, you can resample to 5-minute bins and take the mean of each bin:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5min').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it has a datetime dtype, then try
ts.asfreq('5min')
or use
ts.asfreq('5min', method='ffill')
to pull previous values forward.
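A minimal sketch of the asfreq route followed by interpolation, using a few rows modelled on the sample in the question. The timestamps are assumed to have already been parsed into a DatetimeIndex, and the names regular and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Rows modelled on the question's sample; 00:15, 00:25 and 00:35 are absent.
ts = pd.DataFrame(
    {
        "Temp": [30.6, np.nan, 30.7, 30.7, 30.9],
        "Hum": [49, 51, np.nan, 51, 51],
    },
    index=pd.to_datetime(
        ["2018-04-01 00:05", "2018-04-01 00:10", "2018-04-01 00:20",
         "2018-04-01 00:30", "2018-04-01 00:40"]
    ),
)

# asfreq inserts an all-NaN row for every missing 5-minute step.
regular = ts.asfreq("5min")

# Time-weighted linear interpolation over the now-regular index.
filled = regular.interpolate(method="time")
```

Note that asfreq needs a unique index; if the same timestamp occurs more than once, resample('5min') with an aggregation such as last() or mean() is the safer choice.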
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. In this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
The output above shows the catch: the date column was converted to datetime, but the time column has dtype object. Below, a proper index is created with a lambda function (datetime.datetime.combine from the standard library is used here because pd.datetime is not available in recent pandas versions).
import datetime
rawidx = rawpd.apply(lambda r: datetime.datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd DataFrame as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a file ready for interpolation
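For readers without the Excel file, the whole blank-table approach can be sketched end to end on in-memory data. The frame below mimics what read_excel is assumed to return (a datetime Date column and datetime.time values in Time); raw_index and interpolated are illustrative names.

```python
import datetime

import numpy as np
import pandas as pd

# Assumed shape of the data after reading it from Excel.
rawpd = pd.DataFrame({
    "Date": pd.to_datetime(["2018-04-01"] * 5),
    "Time": [datetime.time(1, 0), datetime.time(1, 5), datetime.time(1, 10),
             datetime.time(1, 20), datetime.time(1, 30)],
    "Col1": [1.0, 2.0, np.nan, np.nan, 5.0],
    "Col2": [10.0, np.nan, 10.0, 10.0, 10.0],
})

# Combine each date with its time of day into one timestamp index.
raw_index = pd.DatetimeIndex(
    [datetime.datetime.combine(d, t) for d, t in zip(rawpd["Date"], rawpd["Time"])]
)
rawpd2 = rawpd[["Col1", "Col2"]].set_axis(raw_index, axis=0).sort_index()

# Blank target table on a regular 5-minute grid.
time5min = pd.date_range(start="2018-04-01 01:00", periods=7, freq="5min")
targpd = pd.DataFrame(np.nan, index=time5min, columns=["Col1", "Col2"])

# update() copies the non-NaN values from rawpd2 into targpd in place.
targpd.update(rawpd2)

# The regular grid can now be interpolated.
interpolated = targpd.interpolate(method="time")
```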
I have got it to work. Thank you, everyone, for your time. Here is the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
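If the nested parse_dates form is not accepted by the installed pandas version, a similar result can probably be reached by combining the two columns after reading. The sketch below works on strings shaped like the question's sample; it assumes the dates are month-first and that the trailing "a"/"p" means AM/PM, neither of which is confirmed in the question.

```python
import pandas as pd

# Hypothetical rows shaped like the sample in the question.
df = pd.DataFrame({
    "Date": ["04/01/18"] * 5,
    "Time": ["12:05 a", "12:10 a", "12:20 a", "12:30 a", "12:40 a"],
    "Temp": [30.6, float("nan"), 30.7, 30.7, 30.9],
})

# Expand the trailing a/p marker so that %p can parse it (assumption).
time_text = (
    df["Time"]
    .str.replace(r"\s*a$", " AM", regex=True)
    .str.replace(r"\s*p$", " PM", regex=True)
)

# Build one timestamp column from the date and time text.
df["Date_Time"] = pd.to_datetime(
    df["Date"] + " " + time_text, format="%m/%d/%y %I:%M %p"
)

# Same resampling step as in the working code above.
out = df.set_index("Date_Time").resample("5min").last().reset_index()
```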

Group by column and resampled date and get rolling sum of other column

I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
Following the suggestions of @user3483203, I have gotten a bit further, but I am not quite there. I've amended the example data above with a new row to illustrate the problem better.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id, and for each month calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample. See the expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids which may result in around 10 million rows. (100k ids, dates range back up to 10 years, 10years * 12months = 120 months per id, 100k*120 = 12million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case since it's nicely grouped by id and has the dates nicely lined up by month. Here it seems like I could use something like df['val'].rolling(4) and make sure it skips NaN values and put that result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date since it adds the three preceding (8+7+6+5). The rest for that id is NaN since four values are not available.
The problems
While it may look like the data is grouped by id and date, it seems like it's actually grouped by date. I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.

Select pandas dataframe rows between dates and set column value

In the dataframe below, I want to set row values in the column p50 to NaN if they are below 2.0 between the dates May 15th and August 15th 2018.
date p50
2018-03-02 2018-03-02 NaN
2018-03-03 2018-03-03 NaN
2018-03-04 2018-03-04 0.022590
2018-03-05 2018-03-05 NaN
2018-03-06 2018-03-06 -0.042227
2018-03-07 2018-03-07 NaN
2018-03-08 2018-03-08 NaN
2018-03-09 2018-03-09 -0.028646
2018-03-10 2018-03-10 NaN
2018-03-11 2018-03-11 -0.045244
2018-03-12 2018-03-12 NaN
2018-03-13 2018-03-13 NaN
2018-03-14 2018-03-14 -0.020590
2018-03-15 2018-03-15 NaN
2018-03-16 2018-03-16 -0.028317
2018-03-17 2018-03-17 NaN
2018-03-18 2018-03-18 NaN
2018-03-19 2018-03-19 NaN
2018-03-20 2018-03-20 NaN
2018-03-21 2018-03-21 NaN
2018-03-22 2018-03-22 NaN
2018-03-23 2018-03-23 NaN
2018-03-24 2018-03-24 -0.066800
2018-03-25 2018-03-25 NaN
2018-03-26 2018-03-26 -0.104135
2018-03-27 2018-03-27 NaN
2018-03-28 2018-03-28 NaN
2018-03-29 2018-03-29 -0.115200
2018-03-30 2018-03-30 NaN
2018-03-31 2018-03-31 -0.000455
... ...
2018-07-03 2018-07-03 NaN
2018-07-04 2018-07-04 2.313035
2018-07-05 2018-07-05 NaN
2018-07-06 2018-07-06 NaN
2018-07-07 2018-07-07 NaN
2018-07-08 2018-07-08 NaN
2018-07-09 2018-07-09 0.054513
2018-07-10 2018-07-10 NaN
2018-07-11 2018-07-11 NaN
2018-07-12 2018-07-12 3.711159
2018-07-13 2018-07-13 NaN
2018-07-14 2018-07-14 6.583810
2018-07-15 2018-07-15 NaN
2018-07-16 2018-07-16 NaN
2018-07-17 2018-07-17 0.070182
2018-07-18 2018-07-18 NaN
2018-07-19 2018-07-19 3.688812
2018-07-20 2018-07-20 NaN
2018-07-21 2018-07-21 NaN
2018-07-22 2018-07-22 0.876552
2018-07-23 2018-07-23 NaN
2018-07-24 2018-07-24 1.077895
2018-07-25 2018-07-25 NaN
2018-07-26 2018-07-26 NaN
2018-07-27 2018-07-27 3.802159
2018-07-28 2018-07-28 NaN
2018-07-29 2018-07-29 0.077402
2018-07-30 2018-07-30 NaN
2018-07-31 2018-07-31 NaN
2018-08-01 2018-08-01 3.202214
The dataframe has a datetime index. I do the following:
mask = (group['date'] > '2018-5-15') & (group['date'] <= '2018-8-15')
group[mask].loc[group[mask]['p50'] < 2.]['p50'] = np.NaN
However, this does not update the dataframe. How to fix this?
I think you should use .loc, like this:
mask = (group['date'] > '2018-5-15') & (group['date'] <= '2018-8-15')
group.loc[mask&(group['p50'] < 2),'p50']=np.nan
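A small runnable sketch of the difference, on made-up values: group[mask] returns a new object, so assigning into it changes a temporary copy, whereas a single .loc call with a combined mask writes into group itself. The sample dates and numbers below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample spanning dates inside and outside the window.
dates = pd.to_datetime(
    ["2018-05-01", "2018-05-20", "2018-06-01",
     "2018-07-04", "2018-08-15", "2018-09-01"]
)
group = pd.DataFrame(
    {"date": dates, "p50": [0.5, 1.0, 3.0, 0.05, 1.5, 0.2]}, index=dates
)

# Rows inside the window (exclusive start, inclusive end, as in the question).
in_window = (group["date"] > pd.Timestamp("2018-05-15")) & (
    group["date"] <= pd.Timestamp("2018-08-15")
)

# One .loc call with row mask and column label modifies group in place.
group.loc[in_window & (group["p50"] < 2.0), "p50"] = np.nan
```

Values below 2.0 that fall outside the window (the first and last rows here) are left untouched.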
