I am trying to combine two pandas dataframes with different timeframes: the first holds a 1-hour time series and the second a 1-minute time series.
1 hour dataframe
get_time value
0 1599739200 123.10
1 1599742800 136.24
2 1599750000 224.14
1 minute dataframe
get_time value
0 1599739200 2.11
1 1599739260 3.11
2 1599739320 3.12
3 1599742800 4.23
4 1599742860 2.22
5 1599742920 1.11
6 1599746400 7.24
7 1599746460 22.10
8 1599746520 2.13
9 1599750000 5.14
10 1599750060 12.10
11 1599750120 21.30
I want to combine these two dataframes so that the value from the 1-hour dataframe is mapped onto the 1-minute dataframe. If there is no matching 1-hour value, the mapped value should be NaN.
Desired Result:
get_time value 1h mapped value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
Basically, I want to combine the dataframes with this logic:
if (1m_get_time >= 1h_get_time) and (1m_get_time < 1h_get_time + 60 minutes):
    1h mapped value = 1h value
else:
    1h mapped value = NaN
Currently I use a recursive method, but it takes a long time on large datasets. Here are the example dataframes:
import pandas as pd

dfhigh_ = pd.DataFrame({
'get_time' : [1599739200, 1599742800, 1599750000],
'value' : [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
'get_time' : [1599739200, 1599739260, 1599739320, 1599742800, 1599742860, 1599742920, 1599746400, 1599746460, 1599746520, 1599750000, 1599750060, 1599750120],
'value' : [2.11, 3.11, 3.12, 4.23, 2.22, 1.11, 7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})
Floor get_time in dflow_ down to the hour, then use Series.map to map the values from dfhigh_ onto dflow_ based on this floored timestamp:
hr = dflow_['get_time'] // 3600 * 3600
dflow_['mapped_value'] = hr.map(dfhigh_.set_index('get_time')['value'])
get_time value mapped_value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
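If the hourly timestamps were ever irregular (not exactly on the hour), pd.merge_asof with a tolerance would be an alternative. This is only a sketch under that assumption, reusing the frames above:

# match each 1-minute row to the latest hourly row at or before it,
# but only if that hourly row is less than 60 minutes older
merged = pd.merge_asof(
    dflow_.sort_values('get_time'),
    dfhigh_.rename(columns={'value': 'mapped_value'}).sort_values('get_time'),
    on='get_time',
    direction='backward',
    tolerance=3599,  # seconds; stays within the same hour
)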
This should also work (including edge cases), though note that it matches rows on hour-of-day, so it assumes all timestamps fall within a single day:
import pandas as pd
from datetime import datetime
import numpy as np
# rename so the merged column carries the target name
dfhigh_ = dfhigh_.rename(columns={'value': '1h mapped value'})
# outer merge: rows with an exact hourly timestamp get their value,
# all other rows get NaN for '1h mapped value'
df_new = pd.merge(dflow_, dfhigh_, how='outer', on=['get_time'])
df_new['get_time'] = [datetime.fromtimestamp(x) for x in df_new['get_time']]
# propagate each hourly value to the remaining rows of the same hour
for idx, row in df_new.iterrows():
    if not np.isnan(row['1h mapped value']):
        current_hour, current_1h_mapped_value = row['get_time'].hour, row['1h mapped value']
        for sub_idx, sub_row in df_new.loc[(df_new.get_time.dt.hour == current_hour) & np.isnan(df_new['1h mapped value'])].iterrows():
            df_new.loc[sub_idx, '1h mapped value'] = current_1h_mapped_value
CSV table (sample data was provided as an image):
So I have a CSV file with columns such as nodeVolt, Temperature1, temperature2, temperature3, pressure and luminosity. In the temperature columns there are various cells with a wrong value (i.e. 220). I want to replace each such cell with the mean of the previous 10 cells in the same column. This should run dynamically: find all cells with value 220 in that particular column and replace each with the mean of the 10 values before it.
I was able to find the cells containing 220 in that particular column, but I could not take the mean and replace the value.
import pandas as pd
import numpy as np
data = pd.read_csv(r"108e.csv")
data = data.drop(['timeStamp','nodeRSSI','packetID', 'solarPanelVolt', 'solarPanelBattVolt',
'solarPanelCurr','temperature2','nodeVolt','nodeAddress'], axis = 1)
df = pd.DataFrame(data)
df1 = df.loc[lambda df: df['temperature3'] == 220]
print(df1)
for i in df1:
    df1["temperature3"][i] == df["temperature3"][i-11:i-1, 'temperature3'].mean()
Here you go:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"something": 3.37,
"temperature3": [
31.94,
31.93,
31.85,
31.91,
31.92,
31.89,
31.9,
31.94,
32.06,
32.16,
32.3,
220,
32.1,
32.5,
32.2,
32.3,
],
}
)
# replace all 220 values by NaN
df["temperature3"] = df["temperature3"].replace({220: np.nan})
# fill each NaN with a rolling average of the previous 10 rows
# (shifted by one so the bad row itself is excluded)
df["temperature3"] = df["temperature3"].fillna(
df["temperature3"].rolling(10, min_periods=1).mean().shift(1)
)
Result:
something temperature3
0 3.37 31.940
1 3.37 31.930
2 3.37 31.850
3 3.37 31.910
4 3.37 31.920
5 3.37 31.890
6 3.37 31.900
7 3.37 31.940
8 3.37 32.060
9 3.37 32.160
10 3.37 32.300
11 3.37 31.986
12 3.37 32.100
13 3.37 32.500
14 3.37 32.200
15 3.37 32.300
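Applied to your file, this becomes something like the sketch below; the filename and the temperature column names are taken from your post, and every temperature column is treated the same way:

import pandas as pd
import numpy as np

data = pd.read_csv("108e.csv")
for col in ["Temperature1", "temperature2", "temperature3"]:
    # mark the bad readings, then fill each from the 10 preceding rows
    data[col] = data[col].replace({220: np.nan})
    data[col] = data[col].fillna(data[col].rolling(10, min_periods=1).mean().shift(1))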
(Please provide sample data as code next time, not as an image.)
I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like df_so below, with four columns. I would like to add the two columns named below, with the following conditions (the desired output is the dataframe df_so_v2 further down):
Days since activity
* Find the most recent prior row with the same ID, then subtract the DateOf column
* If there is no such prior row, return NA
Chg. Avg. Value
* Condition 1: If Count = 0, NA
* Condition 2: If Count != 0, find the most recent prior row with BOTH the same ID and Count != 0, then take the difference of the Avg. Value column
However, I am building off simple np.where queries like the one below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
import pandas as pd

df_dict = {
    'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
               '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
               '2017-08-01', '2017-08-01', '2017-08-01'],
    'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
    'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
    'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
}
df_so = pd.DataFrame(df_dict)
df_dict_v2 = {
    'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
               '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
               '2017-08-01', '2017-08-01', '2017-08-01'],
    'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
    'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
    'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
    'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
    'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA'],
}
df_so_v2 = pd.DataFrame(df_dict_v2)
Here is the answer to the first part of the question; I need more clarification on the conditions of part 2.
1) Days since activity
* Find the most recent prior row with the same ID, then subtract the dates column
* If there is no such prior row, return NA
First you need to convert the strings to datetimes, then sort the dates in ascending order. Finally, use .transform with diff to find the per-ID difference.
df_dict = {
    'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
               '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
               '2017-08-01', '2017-08-01', '2017-08-01'],
    'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
    'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
    'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
}
df_so = pd.DataFrame(df_dict)

df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
# per-ID difference between consecutive dates
df_so['Days_since_activity'] = df_so.groupby('ID')['DateOf'].transform(pd.Series.diff)
df_so = df_so.sort_index()  # restore the original row order
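If you want plain integer day counts instead of Timedelta values (an optional extra, not part of the question):

# .dt.days turns Timedeltas into floats, with NaN where there was no prior row
df_so['Days_since_activity'] = df_so['Days_since_activity'].dt.days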
Edited based on your comment:
Find the most recent previous day that does not have a count of zero and calculate the difference.
import numpy as np
import pandas as pd

df_dict = {
    'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
               '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
               '2017-08-01', '2017-08-01', '2017-08-01'],
    'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
    'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
    'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
}
df = pd.DataFrame(df_dict)

df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID', 'DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby('ID')['DateOf'].diff()

mask = df.ID != df.ID.shift(1)                # first row of each ID
mask2 = df.groupby('ID').Count.shift(1) == 0  # previous row had Count == 0
df.loc[mask, 'Days_since_activity'] = np.nan
# look two rows back when the immediately preceding row had a zero count
df.loc[mask2, 'Days_since_activity'] = df.groupby('ID')['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby('ID')['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby('ID')['Avg. Value'].diff(2)

# rows with Count == 0 get NaN regardless
conditions = [df['Count'] == 0]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df
New output for easy comparison (still sorted by ID and date; the original index is shown on the left):
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 should be NaT because the most recent previous row has a count of zero and there is nothing earlier to compare against.
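For that stricter rule (skip over every zero-count row, however many occur in a row), here is a sketch building on the df above, which relies on the frame still being sorted by ID and DateOf:

# compute the change only across rows where Count != 0; assigning the
# result back aligns on the index, so zero-count rows and rows with no
# earlier non-zero row for the same ID are left as NaN automatically
nonzero = df[df['Count'] != 0]
df['Chg. Avg Value'] = nonzero.groupby('ID')['Avg. Value'].diff()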