I have a DataFrame of the form
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-01 NaN NaN NaN
2015-01-02 NaN NaN NaN
2015-01-05 1 NaN NaN
2015-01-06 NaN NaN NaN
2015-01-07 NaN NaN NaN
2015-01-08 NaN 0.2 NaN
2015-01-09 NaN NaN NaN
2015-01-12 5 NaN NaN
2015-01-13 NaN NaN NaN
2015-01-14 NaN NaN NaN
2015-01-15 NaN NaN NaN
And I would like, for each month, to get the last non-NaN value of each column (NaN if there is no valid value). Hence resulting in something like
eqt_code ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5 0.2 NaN
2015-02-28 10 1 3
2015-03-31 NaN NaN 3
2015-04-30 10 1 3
I had two ideas to perform this:
Do a ffill with a limit that goes to the end of the month. Something like df.ffill(<add good thing here>).resample('M').last().
Use last_valid_index with resample('M').
Using resample
df.resample('M').last()
Out[82]:
eqt_code    ACA_FP  AC_FP  AI_FP
BDATE
2015-01-31     5.0    0.2    NaN
Use groupby and last:
# Do this if the index isn't a DatetimeIndex.
# df.index = pd.to_datetime(df.index)
df.groupby(df.index + pd.offsets.MonthEnd(0)).last()
ACA_FP AC_FP AI_FP
BDATE
2015-01-31 5.0 0.2 NaN
...
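A note on pd.offsets.MonthEnd(0): adding it rolls a timestamp forward to the end of its month, while a timestamp already on a month end stays put, so every date within a month maps to the same group key:
pd.Timestamp('2015-01-12') + pd.offsets.MonthEnd(0)  # Timestamp('2015-01-31 00:00:00')
pd.Timestamp('2015-01-31') + pd.offsets.MonthEnd(0)  # unchanged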
Using df.dropna(how='all') will remove each row where all the values are NaN, and will get you most of the way there.
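For instance, a minimal sketch combining it with the resampling above (assuming the sample frame df with its DatetimeIndex):
# Drop rows that are entirely NaN, then take the last value per month.
# resample('M').last() already skips NaN, so dropna here mainly shrinks the frame.
out = df.dropna(how='all').resample('M').last()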
Related
I have a pandas dataframe as following:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
2020-02-11 17:30:00 6 7 NaN NaN 3
2020-02-11 17:30:00 NaN NaN 3 3 NaN
2020-02-12 15:30:00 2 2 NaN NaN 3
2020-02-16 14:30:00 4 NaN NaN NaN 1
2020-02-16 14:30:00 NaN 7 NaN NaN NaN
2020-02-16 14:30:00 NaN NaN 8 2 NaN
The dates are identical for some rows. Is it possible to add 1, 2, or 3 seconds to the 2nd, 3rd, and 4th occurrences of an identical date? So if a date is unique, leave it as is. If there are two identical dates, leave the first as is but add 1 second to the second. And if there are three identical dates, leave the first as is, add 1 second to the second, and 2 seconds to the third. Is this possible to do easily in pandas?
You can use groupby.cumcount combined with pandas.to_timedelta with unit='s' to add incremental seconds to the duplicated rows:
s = pd.to_datetime(df['Date time'])
df['Date time'] = s + pd.to_timedelta(s.groupby(s).cumcount(), unit='s')
As a one-liner with the Python 3.8+ walrus operator:
df['Date time'] = ((s := pd.to_datetime(df['Date time']))
                   + pd.to_timedelta(s.groupby(s).cumcount(), unit='s'))
Output:
Date time LifeTime1 LifeTime2 LifeTime3 LifeTime4 LifeTime5
0 2020-02-11 17:30:00 6.0 7.0 NaN NaN 3.0
1 2020-02-11 17:30:01 NaN NaN 3.0 3.0 NaN
2 2020-02-12 15:30:00 2.0 2.0 NaN NaN 3.0
3 2020-02-16 14:30:00 4.0 NaN NaN NaN 1.0
4 2020-02-16 14:30:01 NaN 7.0 NaN NaN NaN
5 2020-02-16 14:30:02 NaN NaN 8.0 2.0 NaN
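For intuition: on the original frame, cumcount numbers the rows within each group of identical timestamps starting from 0, and that count is exactly the per-row offset in seconds:
s = pd.to_datetime(df['Date time'])
s.groupby(s).cumcount()
# 0    0
# 1    1
# 2    0
# 3    0
# 4    1
# 5    2
# dtype: int64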
Right, so I'm a bit rusty with Python (pulling it out again after 4 years) and was looking for a solution to this problem. While there were similar threads, I wasn't able to figure out what I'm doing wrong.
I have some data that looks like this:
print (fwds)
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP \
Date
2019-10-15 1.47518 -0.503679 0.681473 1.84996 -0.804212 0.626394
2019-10-14 NaN -0.513647 0.684232 NaN -0.815201 0.643280
2019-10-11 1.51515 -0.520474 0.654544 1.84918 -0.812819 0.697584
2019-10-10 1.39085 -0.538651 0.564055 1.72812 -0.846291 0.546696
2019-10-09 1.30827 -0.568942 0.564897 1.63652 -0.896871 0.479307
... ... ... ... ... ... ...
1995-01-09 8.59473 NaN 10.830200 9.59729 NaN 9.407250
1995-01-06 8.58316 NaN 10.851200 9.42043 NaN 9.434480
1995-01-05 8.56470 NaN 10.839000 9.51209 NaN 9.560490
1995-01-04 8.44306 NaN 10.745900 9.51142 NaN 9.507650
1995-01-03 8.58847 NaN NaN 9.38380 NaN 9.611590
The problem is that the data quality is not great and I need to remove outliers on a rolling basis (since these time series have been trending, a static z-score will not work).
I tried a few solutions. One was to compute a rolling z-score and then filter out the large values. However, when I try calculating the z-score, my result is all NaNs:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0, skipna=True).shift(1)
    z = (x - m) / s
    return z
print (fwds)
print (zscore(fwds, 200))
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
Another approach:
r = fwds.rolling(window=200)
large = r.mean() + 4 * r.std()
small = r.mean() - 4 * r.std()
print(fwds[fwds > large])
print(fwds[fwds < small])
returns:
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
for both max and min. Anyone have any idea how to deal with these darn NaNs when calculating rolling stdevs or z-scores?
Any hints appreciated. Thanks!
Edit:
For further clarity, I was hoping to remove things like the spike in the green and brown lines from the chart systematically:
fwds.plot()
Link below: https://i.stack.imgur.com/udu5O.png
Welcome to Stack Overflow! Depending on your use case (and how many extreme values there are), data interpolation should fit the bill.
Since you're looking at forwards (I think), interpolation should be statistically sound unless some of your missing values are the result of massive disruption in the market.
You can use pandas' DataFrame.interpolate to fill in your NaN values with interpolated values.
From the docs:
Filling in NaN in a Series via linear interpolation.
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0 0.0
1 1.0
2 NaN
3 3.0
dtype: float64
>>> s.interpolate()
0 0.0
1 1.0
2 2.0
3 3.0
dtype: float64
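Applied to the frame in the question, it might look like this (a sketch; method='time' requires a DatetimeIndex and weights the interpolation by the actual gaps between dates):
# Sort ascending first, since the data is printed newest-first.
fwds_clean = fwds.sort_index().interpolate(method='time')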
Edit: I just realized you are looking for market dislocations, so you may not want to use linear interpolation, as that will mute the effect of the missing data.
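If you do want the rolling z-score route from the question, here is a minimal sketch (the min_periods value is an arbitrary choice). By default, an integer rolling window only produces a value when all window observations are non-NaN, which is likely why the original attempt returned all NaNs; lowering min_periods lets the rolling mean and std skip the NaNs.
def rolling_zscore(x, window, min_periods=20):
    # Sort past-to-future so the window looks at earlier dates,
    # since the frame is printed newest-first.
    x = x.sort_index()
    r = x.rolling(window=window, min_periods=min_periods)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    return (x - m) / s

z = rolling_zscore(fwds, window=200)
cleaned = fwds.sort_index().mask(z.abs() > 4)  # flagged outliers become NaN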
I have a df like this:
a001 a002 a003 a004 a005
time_axis
2017-02-07 1 NaN NaN NaN NaN
2017-02-14 NaN NaN NaN NaN NaN
2017-03-20 NaN NaN 2 NaN NaN
2017-04-03 NaN 3 NaN NaN NaN
2017-05-15 NaN NaN NaN NaN NaN
2017-06-05 NaN NaN NaN NaN NaN
2017-07-10 NaN 6 NaN NaN NaN
2017-07-17 4 NaN NaN NaN NaN
2017-07-24 NaN NaN NaN 1 NaN
2017-08-07 NaN NaN NaN NaN NaN
2017-08-14 NaN NaN NaN NaN NaN
2017-08-28 NaN NaN NaN NaN 5
And I would like to compute, for each row, a rolling mean over the previous 3 valid values (skipping the all-NaN rows) and save it in another df:
last_3
time_axis
2017-02-07    1      # still only one row so far
2017-02-14    1      # only one valid value (in the first row) -> the average is that value
2017-03-20    1.5    # average over the rows so far (only 2 rows contain values -> (2+1)/2)
2017-04-03    2      # average over the previous non-NaN rows (2017-02-14 excluded) -> (2+3+1)/3
2017-05-15    2      # same reason as the previous row
2017-06-05    2      # same reason
2017-07-10    3.6    # now the considered values are: 2, 3, 6
2017-07-17    4.3    # considered values: 4, 6, 3
2017-07-24    3.6    # considered values: 1, 4, 6
2017-08-07    3.6    # no new values in this row, so again 1, 4, 6
2017-08-14    3.6    # same reason
2017-08-28    3.3    # now the considered values are: 5, 1, 4
I was trying to delete the empty rows in the first dataframe and then apply rolling and mean, but I think it is the wrong approach (df1 in my example already exists):
df2 = df.dropna(how='all')
df1['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=3).mean()
I think you need min_periods=1, so the first rows (which have fewer than three valid rows behind them) still get an average, plus a forward fill to propagate values onto the all-NaN rows:
df2 = df.dropna(how='all')
df['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=1).mean()
df['last_3'] = df['last_3'].ffill()
print (df)
a001 a002 a003 a004 a005 last_3
2017-02-07 1.0 NaN NaN NaN NaN 1.000000
2017-02-14 NaN NaN NaN NaN NaN 1.000000
2017-03-20 NaN NaN 2.0 NaN NaN 1.500000
2017-04-03 NaN 3.0 NaN NaN NaN 2.000000
2017-05-15 NaN NaN NaN NaN NaN 2.000000
2017-06-05 NaN NaN NaN NaN NaN 2.000000
2017-07-10 NaN 6.0 NaN NaN NaN 3.666667
2017-07-17 4.0 NaN NaN NaN NaN 4.333333
2017-07-24 NaN NaN NaN 1.0 NaN 3.666667
2017-08-07 NaN NaN NaN NaN NaN 3.666667
2017-08-14 NaN NaN NaN NaN NaN 3.666667
2017-08-28 NaN NaN NaN NaN 5.0 3.333333
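Note on why this works: assigning df2's shorter series to df['last_3'] aligns on the index, so the all-NaN rows initially receive NaN in last_3, and the subsequent ffill carries the last computed mean forward onto them; that is what produces the repeated 2.000000 and 3.666667 values above.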
I have a dataframe:
temp_old temp_new
Year 2013 2014 2015 2016 2017 2018 2013 2014 2015 2016 2017 2018
Date
2013-01-01 23:00:00 21.587569 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 00:00:00 21.585347 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 01:00:00 21.583472 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.882083
2018-02-05 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.878472
When I plot this df, that's the result I get.
My goal is to show it without the separation by year, so I want to have 5 curves over the January to December range on one chart.
Update (code to plot):
df_sep_by_year.plot(figsize=(15,8))
Simply remove the year from your Date column. I mean instead of 2013-01-01 23:00:00, use 01-01 23:00:00, and adjust your data similarly for the other records.
# remove datetime index
df.reset_index(inplace=True)
# create new column without year, use ':02d' to correct sorting
df['new_date'] = df.Date.apply(lambda x: '{:02d}-{:02d}-{:02d}:00:00'.format(x.month, x.day, x.hour))
# set new index to df
df.set_index('new_date', inplace=True)
# remove old column with datetime
df = df.drop(labels=['Date'], axis=1)
# remove multiindex in columns
df.columns = ['_'.join(str(level) for level in col) for col in df.columns]
# join variables from different years that share the same month and day
df = pd.concat([pd.DataFrame(df[x]).dropna(axis=0, how='any') for x in df], axis=1).dropna(axis=1, how='all')
df.plot()
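An alternative sketch (my own suggestion, not part of the answer above), starting from the original datetime-indexed df: map every timestamp onto a single dummy year so the x-axis stays a real DatetimeIndex instead of strings:
# 2000 is a leap year, so a Feb 29 in the data survives the mapping.
df.index = df.index.map(lambda ts: ts.replace(year=2000))
df.plot()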
I am trying to build a set of dataframes from a folder full of CSV files. I first create a dictionary of dataframes using the following for loop:
mydir = os.getcwd()
pdatahistorypath = os.path.join(mydir, pdatahistoryfolder)
currentcsvfilenames = os.listdir(pdatahistorypath)
dframes = {}
for filey in currentcsvfilenames:
thispath = os.path.join(pdatahistorypath, filey)
sitedata = pd.read_csv(thispath, header=4)
sitedata = sitedata.drop('Unnamed: 16', axis=1) # drops waste column
sitedata['Date'] = pd.to_datetime(sitedata['Date'])
sitedata.index = sitedata['Date'] # reassign the index to the date column
dframes[filey[:-4]] = sitedata
I then pull them into a panel using
mypanel = pd.Panel(dframes) # create panel
From that panel, I pull the oldest and the latest date, round the oldest date down to the nearest 20 minutes, and create a DatetimeIndex for that timespan in 20-minute intervals:
first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1] # the very last date in series
multiplier = (1e9)*60*20 # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']
Then, I am trying to reindex my irregularly timestamped data to the 20 minute interval series I created before, idx
df2 = df.reindex(idx)
Problem is, I am getting the following error
Traceback (most recent call last):
File "C:/Users/ble1usb/Dropbox/Git/ers-dataanalyzzer/pandasdfmaker.py", line 50, in <module>
df2 = df.reindex(idx)#, method=None)#, method='pad', limit=None) # reindex to the datetimeindex built from first/last dates
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2625, in reindex
fill_value, limit, takeable)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2707, in _reindex_index
copy, fill_value)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2723, in _reindex_with_indexers
fill_value=fill_value)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1985, in reindex_indexer
return BlockManager(new_blocks, new_axes)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1001, in __init__
self._verify_integrity()
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 1236, in _verify_integrity
raise AssertionError("Block ref_items must be BlockManager "
AssertionError: Block ref_items must be BlockManager items
In debugging this problem, I have discovered that the following works just fine. I have tried to recreate every difference I can think of, short of the dataframes being created inside the loop:
dframes = {}
dfpath = 'C:\Users\\ble1usb\Dropbox\Git\ers-dataanalyzzer\datahistoryPandas\Naka-1.csv'
sitedata = pd.read_csv(dfpath, header=4)
sitedata = sitedata.drop('Unnamed: 16', axis=1) # drops waste column
sitedata['Date'] = pd.to_datetime(sitedata['Date'])
sitedata.index = sitedata['Date'] # reassign the index to the date column
dframes['Naka-1'] = sitedata
dframes['myOtherSite'] = sitedata[sitedata['Out ppm'] > 3]
mypanel = pd.Panel(dframes)
first_date = mypanel.major_axis[0]
last_date = mypanel.major_axis[-1] # the very last date in series
multiplier = (1e9)*60*20 # round (floor) to 20 minute interval
t3 = first_date.value - first_date.value % multiplier
idx = pd.date_range(t3, last_date, freq="20min")
df = dframes['Naka-1']
df2 = df.reindex(idx)
Here is the output of the previous block of code (I am losing some data to rounding, to be solved later)
>> print df2.tail(15)
Date Status Alarms Present RPM Hours Oil Pres. Out ppm Ratio In Out Inlet psi Bag psi Disch. psi Hi Pres Coolant Temp Comm
2013-12-10 16:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 17:00:00 2013-12-10 17:00:00 Running none 2,820 9,384 53 0 0 469 473 5.56 0.72 268.1 0 1 Normal
2013-12-10 17:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 17:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 18:00:00 2013-12-10 18:00:00 Running none 2,820 9,385 54 0 0 462 470 12.28 0.82 259.1 0 1 Normal
2013-12-10 18:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 18:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:00:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 19:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 20:00:00 2013-12-10 20:00:00 Running none 2,880 9,387 55 0 0 450 456 10.91 0.73 249.9 0 1 Normal
2013-12-10 20:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 20:40:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-12-10 21:00:00 2013-12-10 21:00:00 Running none 2,820 9,388 54 0 0 440 449 8.16 0.62 243.1 0 1 Normal
2013-12-10 21:20:00 NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
So, I know it should be working. I can't think of anything else that would be causing this AssertionError.
Anything I could try?
You should be using resample rather than reindexing with a date_range:
idx = pd.date_range(t3, last_date, freq="20min")
df2 = df.reindex(idx)
Might rather be:
df.resample('20min', 'last')
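For reference, this answer predates pandas 1.0: pd.Panel has since been removed, and the two-argument resample call is the old API. A modern sketch of the same idea:
# Modern spelling of df.resample('20min', 'last'):
df2 = df.resample('20min').last()
# Instead of a Panel, a dict of DataFrames can be combined with concat;
# the dict keys become the outer index level.
combined = pd.concat(dframes)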