Resampling dataframe in pandas as a checking operation - python

I have a DataFrame like this:
                        A      B  value
2014-11-14 12:00:00  30.5  356.3    344
2014-11-15 00:00:00  30.5  356.3    347
2014-11-15 12:00:00  30.5  356.3    356
2014-11-16 00:00:00  30.5  356.3    349
...
2017-01-06 00:00:00  30.5  356.3    347
I want to check that the index runs every 12 hours; some data may be missing, so there can be jumps of 24 hours or more. In that case I want to insert NaN in the value column and copy the values from columns A and B.
I thought of using resample:
df = df.resample('12H')
but I don't know how to handle the different columns, or whether this is the right approach.
EDIT: If there is a value missing, for instance in 2015-12-12 12:00:00 I would like to add a row like this:
...
2015-12-12 00:00:00  30.5  356.3    323
2015-12-12 12:00:00  30.5  356.3    NaN   <- add this
2015-12-13 00:00:00  30.5  356.3    347
...

You can use the asfreq method to produce an evenly spaced index every 12 hours, which automatically puts np.nan in every gap. Then you can just forward fill columns A and B.
df1 = df.asfreq('12H')
df1[['A', 'B']] = df1[['A', 'B']].ffill()
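As a quick self-contained check of this approach, here is a minimal sketch with made-up numbers and one 12-hour stamp deliberately missing:

import pandas as pd

idx = pd.to_datetime(['2015-12-11 12:00:00', '2015-12-12 00:00:00',
                      '2015-12-13 00:00:00'])  # 2015-12-12 12:00:00 is missing
df = pd.DataFrame({'A': 30.5, 'B': 356.3, 'value': [330, 323, 347]}, index=idx)

df1 = df.asfreq('12H')                     # inserts the missing stamp with NaNs
df1[['A', 'B']] = df1[['A', 'B']].ffill()  # copy A and B down; value stays NaN
print(df1)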

I would go for simply sorting your dataframe on the index and creating a new column that takes the timestamp from the next row. The current row's timestamp would be called "from" and the next row's timestamp would be called "to".
The next step would be to use the two columns ("from" and "to") to build a column containing every 12-hour timestamp between this row and the next (a range, basically).
The final step would be to "explode" every line into one row per value in that range; see the sketch right below, and How to explode a list inside a Dataframe cell into separate rows
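A rough sketch of that idea, assuming the unnamed timestamp index from the question (the "from", "to" and "stamp" names are just illustrative):

import pandas as pd

df = df.sort_index()
tmp = df.reset_index().rename(columns={'index': 'from'})  # assumes an unnamed index
tmp['to'] = tmp['from'].shift(-1)  # timestamp of the next row

# every 12-hour stamp from this row up to, but excluding, the next row
tmp['stamp'] = [
    pd.date_range(f, t, freq='12H')[:-1] if pd.notna(t) else [f]
    for f, t in zip(tmp['from'], tmp['to'])
]

exploded = tmp.explode('stamp')
# rows created by the explode are the gaps: blank out their value column
exploded.loc[exploded['stamp'] != exploded['from'], 'value'] = float('nan')
result = exploded.set_index('stamp')[['A', 'B', 'value']]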
Hope this helps :)

Related

Pulling start date, end date, and mean quantity for unbalanced dataset

I have a dataset (seen in the image) that consists of cities (column "IBGE"), dates, and quantities (column "QTD"). I am trying to extract three things into new columns: the start date per "IBGE", the end date per "IBGE", and the mean per "code".
Also, before doing so, should I change the index of my dataset?
The panel data is unbalanced, so different "IBGE" values have different start dates, end dates, and means. How could I go about creating a new dataframe with the following information separated into columns? I want the dataframe to look like this:
CODE   Start        End          Mean QTD
10001  2020-01-01   2022-01-01   604
10002  2019-09-01   2021-10-01   1008
10003  2019-02-01   2020-12-01   568
10004  2020-03-01   2021-05-01   223
...    ...          ...          ...
99999  2020-02-01   2022-04-01   9394
I am thinking that maybe a for or while loop could potentially take that info, but I am not sure how to write the code.
Try with groupby and named aggregations:
# convert DATE column to datetime if needed
df["DATE"] = pd.to_datetime(df["DATE"])

output = df.groupby("IBGE").agg(Start=("DATE", "min"),
                                End=("DATE", "max"),
                                Mean_QTD=("QTD", "mean"))
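If you want the code back as a regular column, matching the desired table above, a reset_index() at the end should do it (the rename to CODE is just cosmetic):

output = output.reset_index().rename(columns={"IBGE": "CODE"})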

Python Pandas can do these tasks?

I have a time series dataframe with hourly data for eight years (2013-2020). Each year has nine zones, and under each zone there are two columns ("GEN", "LOAD"), as follows:
A ZONE B ZONE ... G ZONE H ZONE I ZONE
date_time GEN LOAD GEN LOAD ... LOAD GEN LOAD GEN LOAD
2013-01-01 00:00:00 725.7 5,859.5 312.2 3,194.7 ... 77.1 706.0 227.1 495.0 861.9
2013-01-01 01:00:00 436.2 450.5 248.0 198.0 ... 865.5 240.7 107.9 640.5 767.3
2013-01-01 02:00:00 464.5 160.2 144.2 068.3 ... 738.7 044.7 32.7 509.3 700.4
2013-01-01 03:00:00 169.9 733.8 268.1 869.5 ... 671.7 649.4 951.3 626.8 652.1
2013-01-01 04:00:00 145.4 553.4 280.2 872.8 ... 761.5 561.0 912.9 552.1 637.3
... ... ... ... ... ... ... ... ... ... ... ...
2020-12-31 19:00:00 450.9 951.7 371.4 516.3 ... 461.7 808.9 471.4 983.7 447.8
2020-12-31 20:00:00 553.0 936.5 848.7 233.9 ... 397.3 978.3 404.3 490.9 233.0
2020-12-31 21:00:00 458.6 735.6 716.8 121.7 ... 385.1 808.0 192.0 131.5 70.1
2020-12-31 22:00:00 515.8 651.6 693.5 142.4 ... 291.4 826.1 16.8 591.9 863.2
2020-12-31 23:00:00 218.6 293.4 448.2 14.2 ... 340.6 435.0 897.4 622.5 768.3
What I want is the following:
1- Detect outliers in each column, i.e. values more than three standard deviations away from that column's mean, and flag them in a new column: "A_gen_outliers" if the outlier is in the "GEN" column under "A ZONE", "A_load_outliers" if it is in the "LOAD" column under "A ZONE", and so on. That makes 18 new columns.
2- A new column holding the sum of all "GEN" columns.
3- A new column holding the sum of all "LOAD" columns.
4- For each "GEN" column, a new column (proposed name "A_GEN_div") dividing each cell by the maximum of that column within its year: for example 725.7/725.7 = 1 for the first cell, 436.2/725.7 for the second, and 218.6/553.0 for the last. The same for all "LOAD" columns (proposed name "A_Load_div"). That makes another 18 new columns.
In total there are 18*2 + 2 new columns.
Thanks in advance.
I think this might help. Note that this will keep the columns MultiIndex. Your points above seem to imply that you want to flatten your MultiIndex; if so, you might want to look at this question.
1:
# True where a value is more than three standard deviations from its column's mean
df.join((df - df.mean()).abs() > (3 * df.std()), rsuffix='_outlier')
2 and 3:
# sum across zones for each second-level label (GEN, LOAD)
df.groupby(level=-1, axis=1).sum()
Note that it is not clear from the question what the first level of the columns MultiIndex should be for this.
4:
maxima = df.resample('1Y').max()                       # yearly maximum of every column
maxima.index = maxima.index + pd.DateOffset(hours=23)  # label it at the year's last hour
maxima = maxima.reindex(df.index, method='bfill')      # broadcast each year's max to all its hours
df.join(df.divide(maxima), rsuffix='_div')
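For points 2 and 3, a minimal runnable sketch with made-up numbers and the same two-level column layout (note that axis=1 in groupby is deprecated in recent pandas, where df.T.groupby(level=-1).sum().T is the preferred spelling):

import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=3, freq='H')
cols = pd.MultiIndex.from_product([['A ZONE', 'B ZONE'], ['GEN', 'LOAD']])
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=idx, columns=cols)

# one summed GEN column and one summed LOAD column across all zones
print(df.groupby(level=-1, axis=1).sum())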

Not able to use a key from a merged dataframe

I've got two dataframes that both have a date column and an emaX column. When I merge them I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, it returns a KeyError: date.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL,
                                     time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
ema10 ema20
date
2020-01-02 11:30:00 3226.5200 NaN
2020-01-02 12:30:00 3229.0927 NaN
2020-01-02 13:30:00 3232.0558 NaN
2020-01-02 14:30:00 3235.0839 NaN
2020-01-02 15:30:00 3239.1668 NaN
... ... ...
2020-03-26 11:30:00 2524.9545 2473.8551
2020-03-26 12:30:00 2533.1755 2483.0279
2020-03-26 13:30:00 2541.2982 2492.0586
2020-03-26 14:30:00 2551.0458 2501.8540
2020-03-26 15:30:00 2565.2866 2513.9983
But then when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: date:
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
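A likely explanation, as a sketch: in the printed output above, date sits below the column headers, which suggests it is the index of the merged frame rather than a column, so row["date"] raises a KeyError. Reading it from the index (or resetting the index into a column first) should work:

# the merge key stayed in the index, so read the date from there
for index, row in mergedDF.iterrows():
    print(index, row["ema10"], row["ema20"])

# or turn the index back into a regular 'date' column first
mergedDF = mergedDF.reset_index()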

Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?

The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid the for loops (which should be more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])]
                    for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the months and days at the same time.
You need the space to differentiate between, for example, "11 2" and "1 12"; otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) + ' ' + df.index.day.astype(str))
       .isin(sample_days['month'].astype(str) + ' ' + sample_days['day'].astype(str))]
After getting a bit of inspiration from #Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetime to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and #Ben Pap's answer, and sample 100 days from one year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
#Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.

Conditional test within pandas dataframe

Can someone help me with a pandas question? I have a timeseries dataframe such as this:
GOOG AAPL
2010-12-09 16:00:00 591.50 551
2010-12-10 16:00:00 592.21 523
2010-12-13 16:00:00 594.62 578
2010-12-14 16:00:00 594.91 567
2010-12-15 16:00:00 590.30 577
...
I need to loop through each timestamp and test whether AAPL is > 570. If it is, then I want to print the date and the price of AAPL for that entry. Is this possible?
There's no need for any looping; one of the main benefits of pandas being built on numpy is that it can easily operate on whole columns. It's as simple as:
df['AAPL'][df['AAPL'] > 570]
Output:
2010-12-13 16:00:00 578
2010-12-15 16:00:00 577
Name: AAPL, dtype: int64
Ah ha I got it:
What you can do since it is built on top of numpy is this:
my_dataframe[my_dataframe.AAPL > 570]
and you're almost done.
From here you have all the rows where AAPL > 570; now it's just a matter of printing out the values you need:
valid_rows = my_dataframe[my_dataframe.AAPL > 570]
for row in valid_rows.to_records():
    print(row[0], row[2])  # the date (from the index) and the AAPL price
DataFrame.where can also be used for searching the entire frame.
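For example, a minimal sketch of where: it keeps the frame's shape and masks non-matching cells with NaN:

my_dataframe.where(my_dataframe > 570)  # cells <= 570 become NaN; shape is preserved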
I had forgotten that pandas made it extremely easy to reference columns.
