I have a long time series that starts in 1963 and ends in 2013. From 1963 until 2007 it has an hourly sampling period, while after 2007 the sampling interval changes to 5 minutes. Is it possible to resample the data after 2007 so that the entire time series has hourly sampling? A data slice is below.
yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2007, 11, 30, 21, 0, 0, 2400
2007, 11, 30, 22, 0, 0, 2400
2007, 11, 30, 23, 0, 0, 2270
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 0, 10, 0, 2210
2008, 1, 1, 0, 15, 0, 2200
2008, 1, 1, 0, 20, 0, 2200
2008, 1, 1, 0, 25, 0, 2200
2008, 1, 1, 0, 30, 0, 2200
2008, 1, 1, 0, 35, 0, 2200
2008, 1, 1, 0, 40, 0, 2200
2008, 1, 1, 0, 45, 0, 2200
2008, 1, 1, 0, 50, 0, 2200
2008, 1, 1, 0, 55, 0, 2200
2008, 1, 1, 1, 0, 0, 2190
2008, 1, 1, 1, 5, 0, 2190
Thanks!
Give your dataframe proper column names
df.columns = 'year month day hour minute second sl'.split()
Solution
df.groupby(['year', 'month', 'day', 'hour'], as_index=False).first()
year month day hour minute second sl
0 2007 11 30 19 0 0 2180
1 2007 11 30 20 0 0 2310
2 2007 11 30 21 0 0 2400
3 2007 11 30 22 0 0 2400
4 2007 11 30 23 0 0 2270
5 2008 1 1 0 0 0 2210
6 2008 1 1 1 0 0 2190
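first() keeps the on-the-hour reading of each hour. If you would rather average the twelve 5-minute samples within each hour, a minimal variant (aggregating only sl, so the minute and second columns are not averaged as well) could be:
df.groupby(['year', 'month', 'day', 'hour'], as_index=False).agg({'sl': 'mean'})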
Option 2
Here is an option that builds on the column renaming above. We'll use pd.to_datetime to assemble the dates, then use resample. However, you have time gaps, so you will have to address nulls and re-cast dtypes.
df.set_index(
    pd.to_datetime(df.drop(columns='sl'))
).resample('H').first().dropna().astype(df.dtypes)
year month day hour minute second sl
2007-11-30 19:00:00 2007 11 30 19 0 0 2180
2007-11-30 20:00:00 2007 11 30 20 0 0 2310
2007-11-30 21:00:00 2007 11 30 21 0 0 2400
2007-11-30 22:00:00 2007 11 30 22 0 0 2400
2007-11-30 23:00:00 2007 11 30 23 0 0 2270
2008-01-01 00:00:00 2008 1 1 0 0 0 2210
2008-01-01 01:00:00 2008 1 1 1 0 0 2190
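If you would rather keep the full hourly grid and fill the gaps in sl instead of dropping them, here is a sketch (only sensible if your gaps are short enough to interpolate across, which is an assumption about your data):
df.set_index(
    pd.to_datetime(df.drop(columns='sl'))
).resample('H')['sl'].first().interpolate()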
Rename the columns, using a capital M for minute so the two m columns stay distinct:
df.columns = ['yr', 'm', 'd', 'h', 'M', 's', 'sl']
Create a datetime column:
from datetime import datetime as dt
df['dt'] = df.apply(axis=1, func=lambda x: dt(x.yr, x.m, x.d, x.h, x.M, x.s))
Resample:
For pandas < 0.19, set the index first; note that resample needs an aggregation such as first() to return a DataFrame:
df = df.set_index('dt').resample('60T').first().reset_index('dt')
For pandas >= 0.19, resample can take the column directly via on=:
df = df.resample('60T', on='dt').first()
You'd better first append a datetime column to your dataframe. pd.to_datetime can assemble a datetime from separate columns, but only from columns whose names it recognizes, and the sample data has two columns named m, so rename them all first:
df.columns = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
Then set the datetime column as the dataframe index:
df.set_index('datetime', inplace=True)
Now you can apply the resample method on your dataframe at your preferred sampling rate:
df.resample('60T').mean()
Here I used mean to aggregate. You can use another method based on your need.
See the pandas documentation as a reference.
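For instance, to keep the first reading of each hour instead of the hourly mean, or to aggregate only the sl column, variants like these (assuming the datetime index set above) also work:
df.resample('60T').first()
df.resample('60T').agg({'sl': 'mean'})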
I have a real estate dataframe with many outliers and many observations.
I have these variables: total area, number of rooms (if rooms = 0, then it's a studio apartment) and kitchen_area.
A "minimalized" extract from my dataframe:
dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000 },
{'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
{'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
{'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
{'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2','apt3','apt4', 'apt5'])
My target would be to eliminate apt3, because by law, a kitchen area cannot be smaller than 5 square meters in non-studio apartments.
In other words, I would like to eliminate all rows describing apartments which are non-studio (rooms > 0) but have kitchen_area < 5.
I have tried code like this:
df1 = df.drop(df[(df.rooms > 0) & (df.kitchen_area < 5)].index)
But it just eliminated all data from both the kitchen_area and rooms columns according to the multiple conditions I applied.
Clean
mask1 = df.rooms > 0
mask2 = df.kitchen_area < 5
df1 = df[~(mask1 & mask2)]
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
pd.DataFrame.query
df1 = df.query('rooms == 0 | kitchen_area >= 5')
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
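As a side note, query can also reference Python variables with @, which keeps the threshold out of the string (min_kitchen is a hypothetical name for illustration):
min_kitchen = 5
df1 = df.query('rooms == 0 | kitchen_area >= @min_kitchen')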
I have a data frame which currently reads as below:
df_new = pd.DataFrame({'Week':['nan',14, 14, 14, 14, 14],
'Date':['NaT','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05'],
'site 1':['entry',0, 0, 0, 0, 0],
'site 1':['exit',0, 0, 0, 0, 0],
'site 2':['entry',1, 0,50, 7, 0],
'site 2':['exit',10, 0, 7, 19, 0],
'site 3':['entry',0, 100, 14, 9, 0],
'site 3':['exit',0, 0, 7, 0, 0],
'site 4':['entry',0, 0, 0, 0, 0],
'site 4':['exit',0, 0, 0, 0, 0],
'site 5':['entry',0, 0, 0, 0, 0],
'site 5':['exit',15, 0, 25, 0, 80],
})
What I desire, however, is columns indicating exit/entry per site (the columns came from merged Excel headers).
An example of what's desired is below (ignore the actual values, as I typed them out):
df_target = pd.DataFrame({'Week':[14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14],
'Date':['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05'],
'site':['site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 2', 'site 2','site 2','site 2','site 2','site 2'],
'entry/exit':['exit','exit', 'exit', 'entry', 'entry', 'entry', 'entry', 'entry', 'entry', 'exit', 'exit', 'exit', 'exit', 'entry', 'entry'],
'Value':[12 ,1, 0, 50, 7, 0, 12 ,1, 0, 50, 7, 0, 12 ,1, 0]
})
I have tried
df_target = df_new.melt(id_vars=['Week','Date'], var_name="Site", value_name="Value")
but I guess I need to somehow group by the second row too, or treat it as a second header?
First create a MultiIndex from the input DataFrame. (Note that a Python dict cannot hold duplicate keys, so the df_new literal above only kept the last 'exit' list for each site; that is why only exit appears below.)
#if possible
#df = pd.read_csv(file, header=[0,1], index_col=[0,1])
df_new.columns = [df_new.columns, df_new.iloc[0]]
df = df_new.iloc[1:]
print (df.columns)
MultiIndex([( 'Week', 'nan'),
( 'Date', 'NaT'),
('site 1', 'exit'),
('site 2', 'exit'),
('site 3', 'exit'),
('site 4', 'exit'),
('site 5', 'exit')],
)
Then convert the first 2 MultiIndex columns to the index, so DataFrame.unstack can be used for the melting, followed by Series.rename_axis and
Series.reset_index:
df = (df.set_index(df.columns[:2].tolist())
.unstack([0,1])
.rename_axis(['site','entry/exit','Week','Date'])
.reset_index(name='Value'))
print (df)
site entry/exit Week Date Value
0 site 1 exit 14 2020-04-01 0
1 site 1 exit 14 2020-04-02 0
2 site 1 exit 14 2020-04-03 0
3 site 1 exit 14 2020-04-04 0
4 site 1 exit 14 2020-04-05 0
5 site 2 exit 14 2020-04-01 10
6 site 2 exit 14 2020-04-02 0
7 site 2 exit 14 2020-04-03 7
8 site 2 exit 14 2020-04-04 19
9 site 2 exit 14 2020-04-05 0
10 site 3 exit 14 2020-04-01 0
11 site 3 exit 14 2020-04-02 0
12 site 3 exit 14 2020-04-03 7
13 site 3 exit 14 2020-04-04 0
14 site 3 exit 14 2020-04-05 0
15 site 4 exit 14 2020-04-01 0
16 site 4 exit 14 2020-04-02 0
17 site 4 exit 14 2020-04-03 0
18 site 4 exit 14 2020-04-04 0
19 site 4 exit 14 2020-04-05 0
20 site 5 exit 14 2020-04-01 15
21 site 5 exit 14 2020-04-02 0
22 site 5 exit 14 2020-04-03 25
23 site 5 exit 14 2020-04-04 0
24 site 5 exit 14 2020-04-05 80
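If you also want the column order of df_target, finish with a reorder of the names produced above:
df = df[['Week', 'Date', 'site', 'entry/exit', 'Value']]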
I have the following df1 dataframe:
t A
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6
5 23:05 5
6 23:06 4
7 23:07 9
8 23:08 7
9 23:09 10
10 23:10 8
For each t (increments are simplified here; they are not uniformly distributed in real life), I would like to find, if any, the most recent time tr within the previous 5 minutes where A(t) - A(tr) >= 4. I want to get:
t A tr
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6 23:03
5 23:05 5 23:01
6 23:06 4
7 23:07 9 23:06
8 23:08 7
9 23:09 10 23:06
10 23:10 8 23:06
Currently, I can use shift(1) to compare each row to the previous row, like cond = df1['A'] >= df1['A'].shift(1) + 4.
How can I look further back in time?
Assuming your data is continuous by the minute, you can use the usual shift. Shifts of 1 to 4 rows cover the previous 5 minutes, excluding the boundary itself, which matches your expected output:
df1['t'] = pd.to_timedelta(df1['t'].add(':00'))
df = pd.DataFrame({i: df1.A - df1.A.shift(i) >= 4 for i in range(1, 5)})
df1['t'] - pd.to_timedelta('1min') * df.idxmax(axis=1).where(df.any(axis=1))
Output:
0 NaT
1 NaT
2 NaT
3 NaT
4 23:03:00
5 23:01:00
6 NaT
7 23:06:00
8 NaT
9 23:06:00
10 23:06:00
dtype: timedelta64[ns]
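Since the question notes that the real timestamps are not uniformly spaced, here is a minimal brute-force sketch that also handles irregular times. Two assumptions: t holds 'HH:MM' strings as in the sample, and the 5-minute window excludes its left edge, which is what reproduces the blank at index 8 above.
import pandas as pd

def most_recent_tr(df, window='5min', threshold=4):
    # Scan earlier rows newest-first; return, per row, the first (i.e. most
    # recent) earlier time whose A is at least `threshold` below the current A.
    t = pd.to_timedelta(df['t'] + ':00')
    out = []
    for i in range(len(df)):
        hit = None
        for j in range(i - 1, -1, -1):
            # assumption: the window excludes its left edge
            if t.iloc[i] - t.iloc[j] >= pd.Timedelta(window):
                break
            if df['A'].iloc[i] - df['A'].iloc[j] >= threshold:
                hit = df['t'].iloc[j]
                break
        out.append(hit)
    return out

df1['tr'] = most_recent_tr(df1)
This is O(n^2), but for irregular timestamps it avoids any assumption about a fixed sampling step.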
I added a datetime index and used rolling(), which now supports time-based windows beyond simple index-based windows.
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'t' : [
datetime.datetime(2020, 5, 17, 23, 0, 0),
datetime.datetime(2020, 5, 17, 23, 0, 1),
datetime.datetime(2020, 5, 17, 23, 0, 2),
datetime.datetime(2020, 5, 17, 23, 0, 3),
datetime.datetime(2020, 5, 17, 23, 0, 4),
datetime.datetime(2020, 5, 17, 23, 0, 5),
datetime.datetime(2020, 5, 17, 23, 0, 6),
datetime.datetime(2020, 5, 17, 23, 0, 7),
datetime.datetime(2020, 5, 17, 23, 0, 8),
datetime.datetime(2020, 5, 17, 23, 0, 9),
datetime.datetime(2020, 5, 17, 23, 0, 10)
], 'A' : [2,1,2,2,6,5,4,9,7,10,8]}, columns=['t', 'A'])
df1 = df1.set_index('t')
cond = df1['A'] >= df1.rolling('5s')['A'].apply(lambda x: x.iloc[0] + 4)
result = df1[cond]
Gives
                      A
t
2020-05-17 23:00:04   6
2020-05-17 23:00:05   5
2020-05-17 23:00:07   9
2020-05-17 23:00:09  10
2020-05-17 23:00:10   8
I'm trying to implement a simple function that will allow me to iterate back to find a non-null value, and this value will be stored in a new column called prv_djma.
Data
import numpy as np
import pandas as pd

data = {'id_st': [100, 100, 100, 100, 100, 100, 100, 100, 100],
'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018],
'djma': [1000, 2200, 0, 3000, 1000, 0, 2000, 0, 0],
'taux': [np.nan, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 0.7]}
df = pd.DataFrame(data)
df
df['prv_djma'] = df['djma'].shift()
What I'm looking for
The goal is to check N lines back until a non-null djma is found, then put that value in the current line (column prv_djma).
For example, the last line (index 8) has djma=0 and index 7 also has djma=0, so I want to fill prv_djma with the djma of index 6.
Note
The problem I have is at index 8; all other lines are correct.
This is shift, but masking any 0 whose previous value is also 0, so that ffill can carry the last non-zero djma forward:
m = df.djma.eq(0)
df['prv_djma'] = df.djma.shift().mask((m == m.shift()) & m).ffill()
Output:
   id_st  year  djma  taux  prv_djma
0 100 2010 1000 NaN NaN
1 100 2011 2200 0.9 1000.0
2 100 2012 0 1.1 2200.0
3 100 2013 3000 1.2 0.0
4 100 2014 1000 1.3 3000.0
5 100 2015 0 1.4 1000.0
6 100 2016 2000 1.5 0.0
7 100 2017 0 1.6 2000.0
8 100 2018 0 0.7 2000.0
For groups you need to do this separately so that .shift doesn't spill outside of the group.
def get_prv(x):
    m = x.eq(0)
    return x.shift().mask((m == m.shift()) & m).ffill()
df['prv_djma'] = df.groupby('id_st')['djma'].apply(get_prv)
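As a quick check that the shift stays inside each group, here is the same frame with a hypothetical second station appended (id_st 200); using transform rather than apply aligns the per-group result back to the original index:
df2 = pd.concat([df, df.assign(id_st=200)], ignore_index=True)
df2['prv_djma'] = df2.groupby('id_st')['djma'].transform(get_prv)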
I have the following pandas dataframe and baseline value:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    {'yr': 2010, 'month': 0, 'name': 'Johnny', 'total': 50},
{'yr': 2010, 'month': 0, 'name': 'Johnny', 'total': 50},
{'yr': 2010, 'month': 1, 'name': 'Johnny', 'total': 105},
{'yr': 2010, 'month': 0, 'name': 'Zack', 'total': 90}
])
baseline_value = 100
I'm grouping and aggregating the data based on year, month and name. Then I'm calculating the net sum relative to the baseline value:
pt = pd.pivot_table(data=df, index=['yr', 'month', 'name'], values='total', aggfunc=np.sum)
pt['net'] = pt['total'] - baseline_value
print(pt)
total net
yr month name
2010 0 Johnny 100 0
Zack 90 -10
1 Johnny 105 5
How can I restructure this DataFrame so the output looks something like this:
value
yr month name type
2010 0 Johnny Total 100
Net 0
Zack Total 90
Net -10
1 Johnny Total 105
Net 5
Option 1: Reshaping your pivot dataframe pt
Use stack, rename, and to_frame:
pt.stack().rename('value').to_frame()
Output:
value
yr month name
2010 0 Johnny total 100
net 0
Zack total 90
net -10
1 Johnny total 105
net 5
Option 2: using set_index and sum from the original df
Here is another approach starting from your source df, using set_index and sum with the level parameter, then reshaping with stack. Note that eval references local variables with @:
baseline_value = 100
(df.set_index(['yr', 'month', 'name'])
   .sum(level=[0, 1, 2])
   .eval('net = total - @baseline_value', inplace=False)
   .stack()
   .to_frame(name='value'))
Output:
                         value
yr   month name
2010 0     Johnny total    100
                  net        0
           Zack   total     90
                  net      -10
     1     Johnny total    105
                  net        5
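If you need the capitalized Total/Net labels shown in the desired output, a small follow-up sketch using rename on the innermost index level (applied here to the Option 1 result):
out = pt.stack().rename('value').to_frame()
out = out.rename(index={'total': 'Total', 'net': 'Net'}, level=-1)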