Not able to use a key from a merged dataframe - python

I've got two dataframes that both have a date column and an emaX column. When I merge them I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, I get KeyError: 'date'.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL, time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
                         ema10      ema20
date
2020-01-02 11:30:00  3226.5200        NaN
2020-01-02 12:30:00  3229.0927        NaN
2020-01-02 13:30:00  3232.0558        NaN
2020-01-02 14:30:00  3235.0839        NaN
2020-01-02 15:30:00  3239.1668        NaN
...                        ...        ...
2020-03-26 11:30:00  2524.9545  2473.8551
2020-03-26 12:30:00  2533.1755  2483.0279
2020-03-26 13:30:00  2541.2982  2492.0586
2020-03-26 14:30:00  2551.0458  2501.8540
2020-03-26 15:30:00  2565.2866  2513.9983
But when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: 'date':
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
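Judging by the printed output, date ends up as the index of the merged dataframe rather than a regular column, which would explain the KeyError. A minimal sketch of the two usual ways to get at it (mergedDF and the column names are taken from the question):
# option 1: while iterating, the index value is already available as the first loop variable
for index, row in mergedDF.iterrows():
    print(index, row["ema10"], row["ema20"])

# option 2: turn the index back into a real "date" column first
mergedDF = mergedDF.reset_index()
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])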

Related

Can Python Pandas do these tasks?

I have an hourly time series data frame covering eight years (2013-2020). Each year has nine zones, and under each zone there are two columns ("GEN", "LOAD"), as follows:
A ZONE B ZONE ... G ZONE H ZONE I ZONE
date_time GEN LOAD GEN LOAD ... LOAD GEN LOAD GEN LOAD
2013-01-01 00:00:00 725.7 5,859.5 312.2 3,194.7 ... 77.1 706.0 227.1 495.0 861.9
2013-01-01 01:00:00 436.2 450.5 248.0 198.0 ... 865.5 240.7 107.9 640.5 767.3
2013-01-01 02:00:00 464.5 160.2 144.2 068.3 ... 738.7 044.7 32.7 509.3 700.4
2013-01-01 03:00:00 169.9 733.8 268.1 869.5 ... 671.7 649.4 951.3 626.8 652.1
2013-01-01 04:00:00 145.4 553.4 280.2 872.8 ... 761.5 561.0 912.9 552.1 637.3
... ... ... ... ... ... ... ... ... ... ... ...
2020-12-31 19:00:00 450.9 951.7 371.4 516.3 ... 461.7 808.9 471.4 983.7 447.8
2020-12-31 20:00:00 553.0 936.5 848.7 233.9 ... 397.3 978.3 404.3 490.9 233.0
2020-12-31 21:00:00 458.6 735.6 716.8 121.7 ... 385.1 808.0 192.0 131.5 70.1
2020-12-31 22:00:00 515.8 651.6 693.5 142.4 ... 291.4 826.1 16.8 591.9 863.2
2020-12-31 23:00:00 218.6 293.4 448.2 14.2 ... 340.6 435.0 897.4 622.5 768.3
What I want is the following:
1- Detect outliers in each column, i.e. values that lie above or below three times the standard deviation of that column, and flag them in a new column: "A_gen_outliers" for outliers in the "GEN" column under "A ZONE", "A_load_outliers" for outliers in the "LOAD" column under "A ZONE", and so on. This adds 18 new columns.
2- A new column holding the sum of all "GEN" columns.
3- A new column holding the sum of all "LOAD" columns.
4- For each "GEN" column, a new column (proposed name "A_GEN_div" for "A ZONE") holding cell value / maximum value of that "GEN" column for each year: for example 725.7/725.7 = 1 for the first cell, 436.2/725.7 for the second cell, and 218.6/553.0 for the last cell. The same for all "LOAD" columns (proposed name "A_LOAD_div"). This adds another 18 new columns.
The total number of new columns is 18 * 2 + 2.
Thanks in advance.
I think this might help. Note that this will keep the columns MultiIndex. Your points above seem to imply that you want to flatten your MultiIndex; if that is the case, you might want to look at this question.
1:
df.join(df>(3*df.std()), rsuffix='_outlier')
2 and 3:
df.groupby(level=-1, axis=1).sum()
Note that it is not clear what the first level of the columns MultiIndex should be for these new summed columns.
4:
maxima = df.resample('1Y').max()
maxima.index = maxima.index + pd.DateOffset(hours=23)
maxima = maxima.reindex(df.index, method='bfill')
df.join(df.divide(maxima), rsuffix='_div')
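For illustration, here is a small self-contained sketch of the same ideas on made-up data. The zone names are invented, the outlier test measures deviation from the column mean (a slight variant of the one-liner above), and step 4 uses groupby/transform by year as an alternative to the resample/reindex approach; treat it as a sketch rather than a drop-in solution:
import numpy as np
import pandas as pd

# toy stand-in for the real frame: two zones, each with GEN and LOAD, hourly data
idx = pd.date_range('2013-01-01', '2014-12-31 23:00', freq='H')
cols = pd.MultiIndex.from_product([['A ZONE', 'B ZONE'], ['GEN', 'LOAD']])
df = pd.DataFrame(np.random.rand(len(idx), len(cols)) * 1000, index=idx, columns=cols)

# 1: boolean outlier flags, one per original column
#    ("outlier" here = more than 3 standard deviations away from the column mean)
outlier_flags = (df - df.mean()).abs() > 3 * df.std()

# 2 and 3: sum over all GEN columns and over all LOAD columns
totals = df.groupby(level=-1, axis=1).sum()    # columns: GEN, LOAD

# 4: divide each cell by the yearly maximum of its own column
yearly_max = df.groupby(df.index.year).transform('max')
ratios = df / yearly_max                       # same shape and columns as df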

Remove outliers while preserving the timestamps in dataframe

I have data in a dataframe in the format shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
45 2017-04-06 11:35:00 51987.427673 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
It's very clear that row 45 needs to be eliminated since it's an anomaly. There are multiple columns and quite a few rows (100,000+). Now I want to remove the outliers from this, and have been using the code:
drop_df = df.drop(columns=['timestamp'])
drop_df = drop_df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
However, this gives me the data without the timestamps, since I cannot include the timestamp column in the z-score calculation. I do want to preserve the timestamps, but their association with the remaining rows is completely lost when I filter this way. This is shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
How can I achieve that?
It's probably better to explicitly set which columns to use for the z-score calculation:
cols = ['cas_pre', 'fl_rat', ...]
df = df[(np.abs(stats.zscore(df[cols])) < 3).all(axis=1)]
Alternatively, you can drop the timestamp column only in the input to the z-score calculation:
drop_df = df.drop(columns=['timestamp'])
df = df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
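Putting the first option together with the imports it needs (a sketch; df is the frame from the question, and selecting the numeric columns via drop('timestamp') is just one convenient way to do it):
import numpy as np
from scipy import stats

# compute z-scores on the numeric columns only, but keep every column
# (including the timestamp) in the filtered result
num_cols = df.columns.drop('timestamp')
mask = (np.abs(stats.zscore(df[num_cols])) < 3).all(axis=1)
filtered = df[mask]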

Pandas upsampling does not include the 23 hours of last day in year

I have a time series dataframe with dates|weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737 rows, with the last 23 intervals for the 31st of December missing. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.
Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends at that last datetime... I would include a last row before resampling with
df.loc['2018-01-01'] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365)*15})
WEATHER_MAX
2017-01-01 6.255330
2017-01-02 10.804867
2017-01-03 0.001716
2017-01-04 4.534989
2017-01-05 2.201338
... ...
2017-12-27 4.503725
2017-12-28 2.145087
2017-12-29 13.519627
2017-12-30 8.123391
2017-12-31 14.621106
[365 rows x 1 columns]
By repeating on axis=1 and then stacking, you can transform the default range(24) column labels into hourly time offsets.
# repeat, then stack
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
                      index=weather.index).stack()

# combine date and hour
hourly.index = (
    hourly.index.get_level_values(0) +
    pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
WEATHER_MAX
2017-01-01 00:00:00 6.255330
2017-01-01 01:00:00 6.255330
2017-01-01 02:00:00 6.255330
2017-01-01 03:00:00 6.255330
2017-01-01 04:00:00 6.255330
... ...
2017-12-31 19:00:00 14.621106
2017-12-31 20:00:00 14.621106
2017-12-31 21:00:00 14.621106
2017-12-31 22:00:00 14.621106
2017-12-31 23:00:00 14.621106
[8760 rows x 1 columns]
What to do, and the reason for it, are the same as in #RichieV's answer. However, the value to use should not be 0 or some other meaningless value; you need to use valid data actually measured on 2018-01-01. Using a meaningless value degrades the resampled 2017-12-31 data and any results derived from it.
Prepare a valid value for 2018-01-01 at the end of the data.
Call resample.
Delete the 2018-01-01 data after resampling.
You will get 8760 hourly rows for 2017, as sketched below.
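A sketch of those three steps, assuming weather is the daily frame from the question; measured_value is a hypothetical placeholder for the real measurement taken at 2018-01-01 00:00:
# 1. append the value actually measured at 2018-01-01 00:00
#    (measured_value is a hypothetical placeholder)
weather.loc[pd.Timestamp('2018-01-01')] = measured_value

# 2. upsample to hourly
hourly = weather.resample('H').pad()

# 3. drop the 2018 helper row again; the 8760 hourly rows of 2017 remain
hourly = hourly.loc['2017']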
Look at #RichieV's edited answer:
I was misunderstanding the question. My answer was about complementing resample with interpolate and the like (see the Japanese question "I want to do extrapolation (data interpolation) using resample").
If simply repeating the value from 00:00 of the same day is acceptable, that is a different way of thinking about it.

Copy row to another dataframe

I have 2 dataframes with index type DatetimeIndex, and I would like to copy one row to another. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but neither of those attempts worked. Looking at the indexes, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making the datetime index of df equal to that of consumption_year) before I can copy one to the other? If so, could you provide a solution to fix those differences?
Those are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge the two data frames together.
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
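A minimal sketch of that approach, assuming df is the Series shown above (merge needs a named Series or a one-column frame); note that the index differences listed in the question (year, starting hour, length) still have to be reconciled before any rows actually line up:
import pandas as pd

# give the Series a name so the merged column gets a label
pv = df.rename('PVProduction')

# align on the DatetimeIndex; 'outer' keeps rows that exist in only one of the two
merged = pd.merge(pv, consumption_year,
                  left_index=True, right_index=True, how='outer')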

Fill data gaps with average of data from adjacent days

Imagine a data frame with multiple variables measured every 30 min. Every time series inside this data frame has gaps at possibly different positions. These gaps are to be replaced by some kind of running mean, let's say +/- 2 days. For example, if at day 4 07:30 I have missing data, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that it is also possible that, for example, day 5 07:30 is also NaN -- in this case it should be excluded from the average that replaces the missing measurement at day 4 (should be possible with np.nanmean?)
I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df.ix[[i-48, i, i+48], "A"]), but I feel there must be a more pythonic/pandas-y way?
Sample data set:
import numpy as np
import pandas as pd
# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))
# generate some artificial gaps
df.ix["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.ix["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan
print df["2014-01-04 08:00":"2014-01-04 16:00"]
A B C
2014-01-04 08:00:00 0.675720 2.186484 -0.033969
2014-01-04 08:30:00 -0.897217 1.332437 -2.618197
2014-01-04 09:00:00 0.299395 0.837023 1.346117
2014-01-04 09:30:00 0.223051 0.913047 NaN
2014-01-04 10:00:00 NaN 1.395480 NaN
2014-01-04 10:30:00 NaN -0.800921 NaN
2014-01-04 11:00:00 NaN -0.932760 NaN
2014-01-04 11:30:00 0.057219 -0.071280 NaN
2014-01-04 12:00:00 0.215810 -1.099531 NaN
2014-01-04 12:30:00 -0.532563 NaN NaN
2014-01-04 13:00:00 -0.697872 NaN NaN
2014-01-04 13:30:00 -0.028541 NaN NaN
2014-01-04 14:00:00 -0.073426 NaN NaN
2014-01-04 14:30:00 -1.187419 0.221636 NaN
2014-01-04 15:00:00 1.802449 0.144715 NaN
2014-01-04 15:30:00 0.446615 1.013915 -1.813272
2014-01-04 16:00:00 -0.410670 1.265309 -0.198607
[17 rows x 3 columns]
(An even more sophisticated tool would also exclude measurements from the averaging procedure that were themselves created by averaging, but that doesn't necessarily have to be included in an answer, since I believe this may make things too complicated for now.)
/edit: A sample solution that I'm not really happy with:
# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for idx, rows in df.iterrows():
        if np.isnan(df.ix[idx, col]):
            # replace with mean of adjacent days
            df.ix[idx, col] = np.nanmean(df.ix[[idx-48, idx+48], col])
There are two things I don't like about this solution:
If there is a single line missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" each time, no matter whether that is 47, 48 or 49 rows away. Also, it would be good if I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
I would like to get rid of the loops, if that is possible.
This should be a faster and more concise way to do it. The main thing is to use the shift() function instead of the loop. A simple version would be this:
df[ df.isnull() ] = np.nanmean( [ df.shift(-48), df.shift(48) ] )
It turned out to be really hard to generalize this, but this seems to work:
df[df.isnull()] = np.nanmean([df.shift(x).values for x in
                              range(-48*window, 48*(window+1), 48)], axis=0)
I'm not sure, but I suspect there might be a bug in nanmean, and it's also the reason you got missing values yourself: it seems nanmean cannot handle NaNs if you feed it a dataframe. But if I convert to an array (with .values) and use axis=0, then it seems to work.
Check results for window=1:
print df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"]
print df.ix["2014-01-03 12:30":"2014-01-03 14:00", "B"]
print df.ix["2014-01-05 12:30":"2014-01-05 14:00", "B"]
2014-01-04 12:30:00 0.940193 # was nan, now filled
2014-01-04 13:00:00 0.078160
2014-01-04 13:30:00 -0.662918
2014-01-04 14:00:00 -0.967121
2014-01-03 12:30:00 0.947915 # day before
2014-01-03 13:00:00 0.167218
2014-01-03 13:30:00 -0.391444
2014-01-03 14:00:00 -1.157040
2014-01-05 12:30:00 0.932471 # day after
2014-01-05 13:00:00 -0.010899
2014-01-05 13:30:00 -0.934391
2014-01-05 14:00:00 -0.777203
Regarding problem #2, it will depend on your data but if you precede the above with
df = df.resample('30min')
that will give you a row of NaNs for each missing row, and then you can fill them in the same way as all the other NaNs. That's probably the simplest and fastest way if it works.
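A sketch combining the two ideas, assuming 30-minute data as in the question; .asfreq() is used here to materialise the missing rows as NaN (the answer above predates the newer resample API), and window is the number of days to look in each direction:
# make the index strictly regular so one day is always exactly 48 rows,
# with previously missing rows showing up as NaN
df = df.resample('30min').asfreq()

# fill each NaN with the nanmean of the same time of day on adjacent days
window = 2  # +/- 2 days
df[df.isnull()] = np.nanmean(
    [df.shift(x).values for x in range(-48 * window, 48 * (window + 1), 48)],
    axis=0)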
Alternatively, you could do something with groupby. My groupby-fu is weak but to give you the flavor of it, something like:
df.groupby( df.index.hour ).fillna(method='pad')
would correctly deal with the issue of missing rows, but not the other things.
