I have a pandas dataset with every row timestamped (Unix time; every row represents a day).
Ex:
Index Timestamp Value
1 1544400000 2598
2 1544572800 2649
3 1544659200 2234
4 1544745600 2204
5 1544832000 1293
Is there a method I can use to subtract every row (from the first column) from the previous row? The purpose is to check whether the interval between rows is always the same, to make sure the dataset isn't skipping a day.
In the example above, the first row skips from the first day straight to the third, giving a 48-hour interval, while all the other rows are 24 hours apart.
I think I could do it using iterrows(), but that seems very costly for large datasets.
--
In case I wasn't clear enough, in the example above:
Column Timestamp:
Row 2 - row 1 = 172800 (48hrs)
Row 3 - row 2 = 86400 (24hrs)
Row 4 - row 3 = 86400 (24hrs) ...
Pandas DataFrames have a diff method that does what you want. Note that the first row of the returned diff will contain NaNs, so you'll want to ignore that in any comparison.
An example would be
import pandas as pd
df = pd.DataFrame({'timestamps': [100, 200, 300, 500]})
# get diff of column (ignoring the first NaN values) and convert to a list
X = df['timestamps'].diff()[1:].tolist()
X.count(X[0]) == len(X) # check if all values are the same, e.g. https://stackoverflow.com/a/3844948/1862861
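Applied to the timestamps from the question, the same pattern directly flags any gap that isn't exactly one day (a sketch using the sample values above; 86400 is the number of seconds in a day):
import pandas as pd
df = pd.DataFrame({'Timestamp': [1544400000, 1544572800, 1544659200, 1544745600, 1544832000]})
# drop the leading NaN, then keep only the gaps that are not exactly one day
intervals = df['Timestamp'].diff()[1:]
bad_gaps = intervals[intervals != 86400]
print(bad_gaps)  # here: index 1, value 172800.0 (the 48-hour jump)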
I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations; however, the remaining cells become NaN. The second line does not work; the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract 1 from the entire column and assign only the leading values back. Note that .loc label slicing is inclusive, so with remainder = 6 rows 0 through 6 are updated (df here is your PopDF):
df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1
Output:
>>> df
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
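If you prefer a mask to slice assignment, the same result can be written with numpy's where (a sketch; it selects by position rather than by label, so it also works with a non-default index):
import numpy as np
# rows at position 0..remainder get Pop - 1; the rest keep Pop unchanged
PopDF['Pop2'] = np.where(np.arange(len(PopDF)) <= remainder, PopDF['Pop'] - 1, PopDF['Pop'])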
In the following pandas DataFrame, the first two columns (Remessas_A and Remessas_A_1d) were given, and I had to find the third (previsao) following the pattern described below. Note that I'm not counting the DataEntrega column as the first, since it is a datetime index.
DataEntrega,Remessas_A,Remessas_A_1d,previsao
2020-07-25,696.0,,
2020-07-26,0.0,,
2020-07-27,518.0,,
2020-07-28,629.0,,
2020-07-29,699.0,,
2020-07-30,660.0,,
2020-07-31,712.0,,
2020-08-01,2.0,-672.348684948797,23.651315051203028
2020-08-02,0.0,-504.2138715410994,-504.2138715410994
2020-08-03,4.0,-91.10009092298037,426.89990907701963
2020-08-04,327.0,194.46620611760167,823.4662061176017
2020-08-05,442.0,220.65451760630847,919.6545176063084
2020-08-06,474.0,-886.140302693952,-226.14030269395198
2020-08-07,506.0,-61.28132269808316,650.7186773019168
2020-08-08,11.0,207.12286256242962,230.77417761363265
2020-08-09,2.0,109.36137834671834,-394.85249319438105
2020-08-10,388.0,146.2428764085755,573.1427854855951
2020-08-11,523.0,-193.02046115081606,630.4457449667857
2020-08-12,509.0,-358.59415822684485,561.0603593794635
2020-08-13,624.0,966.9258406162757,740.7855379223237
2020-08-14,560.0,175.8273195122506,826.5459968141674
2020-08-15,70.0,19.337299248463978,250.11147686209662
2020-08-16,3.0,83.09413535361391,-311.75835784076713
2020-08-17,401.0,-84.67345026550751,488.4693352200876
2020-08-18,526.0,158.53310638454195,788.9788513513276
2020-08-19,580.0,285.99137337700336,847.0517327564669
2020-08-20,624.0,-480.93226226400344,259.85327565832023
2020-08-21,603.0,-194.68412031046182,631.8618765037056
2020-08-22,45.0,-39.23172496101115,210.87975190108546
2020-08-23,2.0,-115.26376570266325,-427.0221235434304
2020-08-24,463.0,10.04635376084557,498.5156889809332
2020-08-25,496.0,-32.44638720124206,756.5324641500856
2020-08-26,600.0,-198.6715680014182,648.3801647550487
2020-08-27,663.0,210.40991269713578,470.263188355456
2020-08-28,628.0,40.32391720053602,672.1857937042416
2020-08-29,380.0,-2.4418918145294626,208.437860086556
2020-08-30,0.0,152.66166068424076,-274.3604628591896
2020-08-31,407.0,18.499558564880928,517.0152475458141
The first 7 values of Remessas_A_1d and previsao are null and will be kept null.
To obtain the first 7 non-null values of previsao, from 2020-08-01 to 2020-08-07, I shifted Remessas_A 7 days ahead and added the shifted Remessas_A to the original Remessas_A_1d:
#res is the name of the dataframe
res['previsao'].loc['2020-08-01':'2020-08-07'] = res['Remessas_A'].shift(7).loc['2020-08-01':'2020-08-07'].add(res['Remessas_A_1d'].loc['2020-08-01':'2020-08-07'])
To find the next 7 values of previsao, from 2020-08-08 to 2020-08-14, I shifted the previsao column 7 days ahead and added the shifted previsao to the original Remessas_A_1d:
res['previsao'].loc['2020-08-08':'2020-08-14'] = res['previsao'].shift(7).loc['2020-08-08':'2020-08-14'].add(res['Remessas_A_1d'].loc['2020-08-08':'2020-08-14'])
To find the next values of previsao, I repeated the last step, moving 7 days ahead each time:
res['previsao'].loc['2020-08-15':'2020-08-21'] = res['previsao'].shift(7).loc['2020-08-15':'2020-08-21'].add(res['Remessas_A_1d'].loc['2020-08-15':'2020-08-21'])
res['previsao'].loc['2020-08-22':'2020-08-28'] = res['previsao'].shift(7).loc['2020-08-22':'2020-08-28'].add(res['Remessas_A_1d'].loc['2020-08-22':'2020-08-28'])
res['previsao'].loc['2020-08-29':'2020-08-31'] = res['previsao'].shift(7).loc['2020-08-29':'2020-08-31'].add(res['Remessas_A_1d'].loc['2020-08-29':'2020-08-31'])
#the last line only spanned 3 days because I reached the end of my dataframe
Instead of doing that by hand, how can I create a function that would take periods=7, Remessas_A and Remessas_A_1d as input and would give previsao as the output?
Not the most elegant code, but this should do the trick:
df["previsao"][df.index <= pd.to_datetime("2020-08-07")] = df["Remessas_A"].shift(7) + df["Remessas_A_1d"]
for d in pd.date_range("2020-08-08", "2020-08-31"):
df.loc[d, "previsao"] = df.loc[d - pd.Timedelta("7d"), "previsao"] + df.loc[d, "Remessas_A_1d"]
Edit: I've assumed you have DataEntrega as the index and as a datetime object. I can post the rest of the code if you need it.
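To get the reusable function the question asks for, one way is to seed the first block from the shifted base column and then chain forward block by block, exactly as in the manual steps. A minimal sketch (the function name and signature are my own; it assumes a datetime-indexed frame whose first periods rows of the diff column are NaN):
import pandas as pd

def build_previsao(df, periods=7, base_col='Remessas_A', diff_col='Remessas_A_1d'):
    # seed: rows periods .. 2*periods-1 come from the shifted base column
    previsao = df[base_col].shift(periods) + df[diff_col]
    # every later row chains on previsao itself
    for i in range(2 * periods, len(df)):
        previsao.iloc[i] = previsao.iloc[i - periods] + df[diff_col].iloc[i]
    return previsao

res['previsao'] = build_previsao(res, periods=7)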
I have a data frame with 2 columns.
The 1st column is a timestamp of every minute.
The 2nd column is a number.
All I want to do is turn the 1st column into a timestamp every 30 minutes, with the 2nd column holding the sum of the 30 numbers from that period.
Power is recorded every minute, but I want to sum it up for every 30 minutes.
Using pandas/Series.resample
Series.resample can help you if you set the timestamp as the index; then use series.resample('30T').sum().
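For instance (a sketch with made-up per-minute data; '30T' means 30-minute bins):
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=120, freq='T')  # 2 hours of minutes
power = pd.Series(np.random.rand(120), index=idx, name='Power_kw')
half_hourly = power.resample('30T').sum()  # 4 rows, each the sum of 30 minutes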
Manual version
You can use cumsum over the series you want to keep.
Then select only the rows at every 30th position (np.arange(0, len(df), 30)).
Then iterate over the dataframe backward and subtract from row n the sum found at row n-1, to keep only the value of the last 30 minutes. Iterating is not very efficient, but since your dataset is 1M rows and you take 1 row every 30 rows, it should be fast (33,333 iterations).
import numpy as np

df['cumsum'] = df['Power_kw'].cumsum()
df_30_min = df.iloc[np.arange(0, len(df), 30)].copy()  # keep every 30th row
col = df_30_min.columns.get_loc('cumsum')
# walk backward so each kept row becomes the sum of its own last 30 minutes
for i in range(len(df_30_min) - 1, 0, -1):
    df_30_min.iloc[i, col] -= df_30_min.iloc[i - 1, col]
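A vectorized alternative, if you'd rather avoid the backward loop entirely, is to group rows in blocks of 30 by integer position (a sketch; the result is indexed by block number rather than by timestamp):
import numpy as np
# each block of 30 consecutive rows is summed in one pass
df_30_min = df.groupby(np.arange(len(df)) // 30)['Power_kw'].sum()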
I've got a dataframe like this
Day,Minute,Second,Value
1,1,0,1
1,2,1,2
1,2,6,2
1,3,1,0
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Sometimes the sensor records an incorrect value, and the reading is added again later with the correct value. For example, here we should delete the second, third, and fourth rows, since they are overridden by row five, which comes from an earlier timestamp. How do I filter out 'bad' rows like those? For the example, the expected output should be:
Day,Minute,Second,Value
1,1,0,1
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Here's pseudocode for an iterative solution:
for row in dataframe:
    for previous_row in rows of dataframe before row:
        if previous_row's timestamp > row's timestamp:
            delete previous_row
I think there should be a vectorized solution, especially for the second loop. I also don't want to modify what I'm iterating over but I'm not sure there is another option other than duplicating the dataframe.
Here is some starter code to work with the example dataframe
import pandas as pd
data = [{'Day':1, 'Minute':1, 'Second':0, 'Value':1},
{'Day':1, 'Minute':2, 'Second':1, 'Value':2},
{'Day':1, 'Minute':2, 'Second':6, 'Value':2},
{'Day':1, 'Minute':3, 'Second':1, 'Value':0},
{'Day':1, 'Minute':2, 'Second':1, 'Value':1},
{'Day':1, 'Minute':2, 'Second':5, 'Value':1},
{'Day':2, 'Minute':0, 'Second':1, 'Value':1},
{'Day':2, 'Minute':0, 'Second':5, 'Value':2}]
df = pd.DataFrame(data)
If you have multiple rows with the same combination of Day, Minute, Second but a different Value, I am assuming you want to retain the last recorded value and discard all the previous ones, considering them "bad".
You can do this simply by using drop_duplicates:
df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
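On the starter df this keeps one row per (Day, Minute, Second), preferring the later record (a quick check):
print(df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last'))
# the duplicated (1, 2, 1) timestamp keeps Value 1 (the later reading), not Value 2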
UPDATE v2:
If you need to retain only the last group of ['Minute', 'Second'] combinations for each day, identify the monotonically increasing Minute groups (since Minute is the larger time unit of the two) and select the group with the max value of Group_Id for each Day:
res = pd.DataFrame()
for _, g in df.groupby('Day'):
    g = g.copy()  # work on a copy to avoid SettingWithCopyWarning
    # a new group starts whenever Minute decreases
    g['Group_Id'] = (g.Minute.diff() < 0).cumsum()
    res = pd.concat([res, g[g['Group_Id'] == g['Group_Id'].max()]])
OUTPUT:
Day Minute Second Value Group_Id
1 2 1 1 1
1 2 5 1 1
2 0 1 1 0
2 0 5 2 0
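For reference, the same idea can be written without the Python loop, using groupby with transform (a sketch):
# Group_Id per day, computed for all rows at once
gid = df.groupby('Day')['Minute'].transform(lambda m: (m.diff() < 0).cumsum())
# keep only the rows whose Group_Id is the per-day maximum
res = df[gid == gid.groupby(df['Day']).transform('max')]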
I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index=pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all the other columns to get the payoffs from above.
I feel like I need to use a for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map them with a dict.
For the comparison, use gt together with sub along axis=0:
d = {'2017Q4': 30, '2018Q1': 40, '2018Q2': 50, '2018Q3': 60, '2018Q4': 70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05, .5, .95])
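For context, strike here is a Series aligned on df's index, one value per row, which is why gt and sub with axis=0 broadcast it across both columns. A quick check (a sketch; the printed values follow from the dict above):
print(strike.head(3))
# 2017-10-01 00:00:00    30
# 2017-10-01 00:30:00    30
# 2017-10-01 01:00:00    30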
Mapping your dataframe's index through a dictionary can be a starting point.
import random
import pandas as pd

a = {2017: 30, 2018: 40}
# given the index used in your example (21936 half-hour rows)
ranint = random.choices([30, 35, 40, 45], k=21936)
df = pd.DataFrame({'values': ranint}, index=index)
# use column assignment (df[...]), not attribute assignment, so the columns are actually created
df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']
The dataframe then looks like:
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]