Plot DataFrame over a one-year period - python

I have dataframe:
temp_old temp_new
Year 2013 2014 2015 2016 2017 2018 2013 2014 2015 2016 2017 2018
Date
2013-01-01 23:00:00 21.587569 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 00:00:00 21.585347 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 01:00:00 21.583472 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.882083
2018-02-05 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.878472
When I plot this df, each year's data is drawn over its own stretch of the time axis.
My goal is to show it without separating by year, i.e. one curve per year drawn over a single January-December range on one chart.
Update (code used to plot):
df_sep_by_year.plot(figsize=(15,8))

Simply remove the year from your Date column: instead of 2013-01-01 23:00:00, use 01-01 23:00:00, and adjust the other records the same way.
import pandas as pd

# remove the datetime index
df.reset_index(inplace=True)
# create a new column without the year; ':02d' zero-pads so string sorting works
df['new_date'] = df.Date.apply(lambda x: '{:02d}-{:02d}-{:02d}:00:00'.format(x.month, x.day, x.hour))
# set the new column as the index
df.set_index('new_date', inplace=True)
# remove the old datetime column
df = df.drop(labels=['Date'], axis=1)
# flatten the column MultiIndex, e.g. ('temp_old', 2013) -> 'temp_old_2013'
df.columns = ['_'.join(str(level) for level in col) for col in df.columns]
# for each column, drop its missing rows, then re-join on the shared
# month-day index so values from different years line up side by side
df = pd.concat([df[[x]].dropna(axis=0, how='any') for x in df.columns], axis=1).dropna(axis=1, how='all')
df.plot()
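For reference, here is a minimal self-contained sketch of the same strip-the-year idea on synthetic data (the 'temp' column and the sinusoidal values are made up for illustration); pivoting by year is an equivalent way to get one column per year over a shared January-December axis:
import numpy as np
import pandas as pd

# hypothetical hourly series spanning three years
idx = pd.date_range('2013-01-01', '2015-12-31 23:00:00', freq='H')
df = pd.DataFrame({'temp': 20 + 5 * np.sin(np.arange(len(idx)) / 1000.0)}, index=idx)

# key each observation by its position within the year, ignoring the year itself
within_year = df.index.strftime('%m-%d %H:00')

# pivot so each year becomes its own column over a shared Jan-Dec axis
overlay = (df.assign(key=within_year, year=df.index.year)
             .pivot_table(index='key', columns='year', values='temp'))
overlay.plot(figsize=(15, 8))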

Related

Pandas: Filling NaN values in dataframe with monthly mean

The dataframe I am working with is as follows:
date AA1 AB2 AC3 AD4
0 1996-01-01 00:00:00 NaN NaN NaN NaN
1 1996-01-01 01:00:00 NaN 19.2 NaN NaN
2 1996-01-01 02:00:00 NaN 16.4 NaN NaN
3 1996-01-01 03:00:00 NaN 23.5 NaN NaN
4 1996-01-01 04:00:00 20.4 NaN NaN NaN
... ... ... ... ... ...
219164 2020-12-31 20:00:00 13.4 NaN 23.0 26.6
219165 2020-12-31 21:00:00 14.2 NaN 19.6 28.3
219166 2020-12-31 22:00:00 13.5 NaN 17.9 20.5
219167 2020-12-31 23:00:00 NaN NaN 16.7 20.7
219168 2021-01-01 00:00:00 NaN NaN NaN NaN
These are hourly data readings taken from different sensors from the year 1996 to 2021.
My goal is to be able to fill the NaN values with the monthly mean for each of the columns based on the date.
I have tried grouping the data and getting the monthly means for the group, though I am not sure where to go from here to transfer the grouped means to the original, larger dataframe, filling in some of the NaN values.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
tem = df.groupby(['year', 'month']).mean().reset_index()
The resulting dataframe looks like this, with fewer rows because of the grouping:
year month AA1 AB2 AC3 AD4
0 1996 1 20.1 18.3 NaN NaN
1 1996 2 NaN NaN NaN NaN
2 1996 3 NaN NaN NaN NaN
3 1996 4 NaN NaN NaN NaN
4 1996 5 NaN NaN NaN NaN
... ... ... ... ... ... ...
296 2020 9 NaN NaN 15.7 20.2
297 2020 10 NaN NaN 15.3 19.7
298 2020 11 NaN NaN 26.7 25.9
299 2020 12 NaN NaN 24.6 25.3
300 2021 1 NaN NaN NaN NaN
Any advice on how I can implement this would be helpful. In the end, I need the original dataset indices, dates and columns, but with the NaN values filled with the means calculated from the monthly groups. The months with all NaN values can be ignored for the time being.
Assuming your date column is of type datetime64 or equivalent:
df['AA1'] = df['AA1'].fillna(df.groupby(df.date.dt.month)['AA1'].transform('mean'))
Or, looping over all your columns (except the date column):
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby(df.date.dt.month)[col].transform('mean'))
If you only want the mean of the month in that specific year, add df.date.dt.year to the groupby keys:
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby([df.date.dt.year, df.date.dt.month])[col].transform('mean'))
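As a minimal runnable sketch of the transform-based fill (with made-up readings standing in for the sensor data):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('1996-01-01', periods=6, freq='H'),
    'AA1': [np.nan, 19.2, 16.4, np.nan, 23.5, np.nan],
})

# broadcast each (year, month) group's mean back to the original shape,
# then use it only where the original value is missing
monthly_mean = df.groupby([df.date.dt.year, df.date.dt.month])['AA1'].transform('mean')
df['AA1'] = df['AA1'].fillna(monthly_mean)
print(df)  # the NaNs become 19.7, the January 1996 mean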

How to filter column based on another column date range

I currently have a dataframe where the 1st column is dates (1990 - 2020) and the subsequent columns are 'stocks' that are trading, with NaN where they are not yet being traded. Is there any way to filter the columns based on a date range? For example, if 2 years is selected, all stocks with no null values from 2019-2020 (2 years) would be filtered in.
import pandas as pd
df = pd.read_csv("prices.csv")
df.head()
display(df)
date s_0000 s_0001 s_0002 s_0003 s_0004 s_0005 s_0006 s_0007 s_0008 ... s_2579 s_2580 s_2581 s_2582 s_2583 s_2584 s_2585 s_2586 s_2587 s_2588
0 1990-01-02 NaN 13.389421 NaN NaN NaN NaN NaN 0.266812 NaN ... NaN 1.950358 NaN 7.253997 NaN NaN NaN NaN NaN NaN
1 1990-01-03 NaN 13.588601 NaN NaN NaN NaN NaN 0.268603 NaN ... NaN 1.985185 NaN 7.253997 NaN NaN NaN NaN NaN NaN
2 1990-01-04 NaN 13.610730 NaN NaN NaN NaN NaN 0.269499 NaN ... NaN 1.985185 NaN 7.188052 NaN NaN NaN NaN NaN NaN
3 1990-01-05 NaN 13.477942 NaN NaN NaN NaN NaN 0.270394 NaN ... NaN 1.985185 NaN 7.188052 NaN NaN NaN NaN NaN NaN
4 1990-01-08 NaN 13.477942 NaN NaN NaN NaN NaN 0.272185 NaN ... NaN 1.985185 NaN 7.385889 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7806 2020-12-23 116.631310 22.171579 15.890000 16.577030 9.00 65.023491 157.495850 130.347580 27.481012 ... 19.870001 42.675430 2.90 8.850000 9.93 NaN 0.226 207.470001 158.974014 36.650002
7807 2020-12-24 116.641243 21.912146 15.660000 16.606722 8.77 65.292725 158.870193 131.352829 27.813406 ... 20.180000 42.508686 2.88 8.810000 9.91 NaN 0.229 205.270004 159.839264 36.009998
7808 2020-12-28 117.158287 22.191536 16.059999 16.200956 8.93 66.429459 157.011383 136.050766 28.272888 ... 19.959999 42.528305 2.69 8.760000 9.73 NaN 0.251 199.369995 161.500122 36.709999
7809 2020-12-29 116.561714 21.991972 15.860000 16.745275 8.80 66.529175 154.925140 134.239273 27.705866 ... 19.530001 41.949623 2.59 8.430000 9.61 NaN 0.243 197.839996 162.226105 36.610001
7810 2020-12-30 116.720795 22.899990 16.150000 17.932884 8.60 66.299828 155.884232 133.094650 27.725418 ... 19.870001 42.390987 2.65 8.540000 9.72 NaN 0.230 201.309998 163.369812 36.619999
so I want to do something like:
year = int(input("Enter number of years: "))
# e.g. year = 3
If year is 3, the selected date range would be the last 3 years up to 2020 (2018-2020).
You could try the following code:
df[(df['date'] >= '2019-01-01') & (df['date'] <= '2020-12-30')]
Once you have filtered the rows, you can drop every stock column that still contains a NaN:
df.dropna(axis=1)
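To parameterize the cutoff by the number of years instead of hard-coding the dates, something along these lines should work (a sketch assuming the date column parses as datetime64 and, per the question, the goal is to keep only the stock columns with no NaN in the window):
import pandas as pd

df = pd.read_csv("prices.csv", parse_dates=["date"])

years = int(input("Enter number of years: "))

# the window ends at the last date in the data (2020-12-30 in the question)
last_date = df["date"].max()
cutoff = last_date - pd.DateOffset(years=years)

recent = df[df["date"] > cutoff]   # keep the last `years` years of rows
filtered = recent.dropna(axis=1)   # keep only stocks traded throughout the window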

Get last non-NaN value for each month in pandas

I have a DataFrame of the form
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-01 NaN NaN NaN
2015-01-02 NaN NaN NaN
2015-01-05 1 NaN NaN
2015-01-06 NaN NaN NaN
2015-01-07 NaN NaN NaN
2015-01-08 NaN 0.2 NaN
2015-01-09 NaN NaN NaN
2015-01-12 5 NaN NaN
2015-01-13 NaN NaN NaN
2015-01-14 NaN NaN NaN
2015-01-15 NaN NaN NaN
And I would like, for each month, to get the last non-NaN value of each column (NaN if there is no valid value). Hence resulting in something like
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5 0.2 NaN
2015-02-28 10 1 3
2015-03-31 NaN NaN 3
2015-04-30 10 1 3
I had two ideas to perform this:
Do a ffill with a limit that goes to the end of the month. Something like df.ffill(<add good thing here>).resample('M').last().
Use last_valid_index with resample('M').
Using resample
df.resample('M').last()
Out[82]:
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5.0 0.2 NaN
Use groupby and last:
# Do this if the index isn't a DatetimeIndex.
# df.index = pd.to_datetime(df.index)
df.groupby(df.index + pd.offsets.MonthEnd(0)).last()
ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5.0 0.2 NaN
...
Using df.dropna(how='all') will remove each row where all the values are NaN, and will get you most of the way there.
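Putting the groupby answer together with the sample data from the question gives a small self-contained check:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-08',
                      '2015-01-12', '2015-01-15'])
df = pd.DataFrame({'ACA_FP': [np.nan, 1, np.nan, 5, np.nan],
                   'AC_FP': [np.nan, np.nan, 0.2, np.nan, np.nan],
                   'AI_FP': np.nan}, index=idx)

# snap every date to its month end, then take the last non-null value per group
out = df.groupby(df.index + pd.offsets.MonthEnd(0)).last()
print(out)  # 2015-01-31: ACA_FP 5.0, AC_FP 0.2, AI_FP NaN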

Python interpolate not working on rows

Related to Error in gapfilling by row in Pandas, I would like to interpolate instead of using fillna. Currently, I am doing this:
df.ix[:,'2015':'2100'].interpolate(axis = 1, method = 'linear')
However, this does not seem to replace the NaN's. Any suggestion?
--EDIT
This does not seem to work either:
df.apply(pandas.Series.interpolate, inplace = True)
This looks like a bug (I'm using Pandas 0.16.2 with Python 3.4.3).
Using a subset of your data:
>>> df.ix[:3, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
The linear interpolation works fine and returns a new dataframe.
>>> df.ix[:3, '2015':'2020'].interpolate(axis=1, method='linear')
2015 2016 2017 2018 2019 2020
0 0.001248 0.001255 0.001261 0.001268 0.001275 0.001281
1 0.009669 0.009728 0.009786 0.009845 0.009904 0.009963
2 0.020005 0.020134 0.020264 0.020393 0.020522 0.020651
3 0.025557 0.025687 0.025818 0.025949 0.026080 0.026211
The original is still untouched.
>>> df.ix[:4, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
4 0.060077 NaN NaN NaN NaN 0.060909
So let's try to change it in place using the inplace=True parameter.
df.ix[:3, '2015':'2020'].interpolate(axis=1, method='linear', inplace=True)
>>> df.ix[:4, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
4 0.060077 NaN NaN NaN NaN 0.060909
The changes didn't hold.
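The usual workaround (consistent with the diagnosis above, though not part of the original answer) is to assign the interpolated slice back instead of relying on inplace=True; with the long-removed .ix indexer replaced by .loc:
# interpolate a copy of the slice, then write the result back into the frame
df.loc[:, '2015':'2100'] = df.loc[:, '2015':'2100'].interpolate(axis=1, method='linear')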

creating dictionary of dataframes within for-loop throws AssertionError [Pandas for Python]

I am trying to build a set of dataframes from a folder full of csv files. I first create a dictionary of dataframes using the following for loop
mydir = os.getcwd()
pdatahistorypath = os.path.join(mydir, pdatahistoryfolder)
currentcsvfilenames = os.listdir(pdatahistorypath)
dframes = {}
for filey in currentcsvfilenames:
    thispath = os.path.join(pdatahistorypath, filey)
    sitedata = pd.read_csv(thispath, header=4)
    sitedata = sitedata.drop('Unnamed: 16', axis=1)  # drops waste column
    sitedata['Date'] = pd.to_datetime(sitedata['Date'])
    sitedata.index = sitedata['Date']  # reassign the index to the date column
    dframes[filey[:-4]] = sitedata
I then pull them into a panel using
mypanel = pd.Panel(dframes) # create panel
From that panel, I pull the oldest and the latest dates, floor the oldest date to the nearest 20-minute boundary, and create a DatetimeIndex covering that timespan in 20-minute intervals
first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1] # the very last date in series
multiplier = (1e9)*60*20 # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']  # keys were stored without the '.csv' extension
Then, I am trying to reindex my irregularly timestamped data to the 20 minute interval series I created before, idx
df2 = df.reindex(idx)
Problem is, I am getting the following error
Traceback (most recent call last):
File "C:/Users/ble1usb/Dropbox/Git/ers-dataanalyzzer/pandasdfmaker.py", line 50, in <module>
df2 = df.reindex(idx)#, method=None)#, method='pad', limit=None) # reindex to the datetimeindex built from first/last dates
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2625, in reindex
fill_value, limit, takeable)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2707, in _reindex_index
copy, fill_value)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2723, in _reindex_with_indexers
fill_value=fill_value)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1985, in reindex_indexer
return BlockManager(new_blocks, new_axes)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1001, in __init__
self._verify_integrity()
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1236, in _verify_integrity
raise AssertionError("Block ref_items must be BlockManager "
AssertionError: Block ref_items must be BlockManager items
In debugging this problem, I have discovered that the following works just fine. I have tried to recreate every difference I can think of, short of the dataframes being created inside the loop
dframes = {}
dfpath = r'C:\Users\ble1usb\Dropbox\Git\ers-dataanalyzzer\datahistoryPandas\Naka-1.csv'
sitedata = pd.read_csv(dfpath, header=4)
sitedata = sitedata.drop('Unnamed: 16', axis=1) # drops waste column
sitedata['Date'] = pd.to_datetime(sitedata['Date'])
sitedata.index = sitedata['Date'] # reassign the index to the date column
dframes['Naka-1'] = sitedata
dframes['myOtherSite'] = sitedata[sitedata['Out ppm'] > 3]
mypanel = pd.Panel(dframes)
first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1] # the very last date in series
multiplier = (1e9)*60*20 # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']
df2 = df.reindex(idx)
Here is the output of the previous block of code (I am losing some data to rounding, to be solved later)
>>> print df2.tail(15)
Date Status Alarms Present RPM Hours Oil Pres. Out ppm Ratio In Out Inlet psi Bag psi Disch. psi Hi Pres Coolant Temp Comm
2013-12-10 16:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 17:00:00 2013-12-10 17:00:00 Running none 2,820 9,384 53 0 0 469 473 5.56 0.72 268.1 0 1 Normal
2013-12-10 17:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 17:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 18:00:00 2013-12-10 18:00:00 Running none 2,820 9,385 54 0 0 462 470 12.28 0.82 259.1 0 1 Normal
2013-12-10 18:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 18:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:00:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 20:00:00 2013-12-10 20:00:00 Running none 2,880 9,387 55 0 0 450 456 10.91 0.73 249.9 0 1 Normal
2013-12-10 20:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 20:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 21:00:00 2013-12-10 21:00:00 Running none 2,820 9,388 54 0 0 440 449 8.16 0.62 243.1 0 1 Normal
2013-12-10 21:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
So, I know it should be working. I can't think of anything else that would be causing this Assertion Error.
Anything I could try?
You should be using resample rather than reindexing with a date_range:
idx = pd.date_range(t3, last_date, freq="20min")
df2 = df.reindex(idx)
Might rather be:
df.resample('20min', 'last')
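The two-argument resample call reflects the pandas API of the time; in modern pandas the aggregation is a chained method:
# modern equivalent of df.resample('20min', 'last')
df2 = df.resample('20min').last()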
