I have two Excel files that I'm trying to merge into one using pandas. The first file is a list of dates and times with a subscriber count for each given time and day. The second file has weather information on an hourly basis. I import both files and the data resembles:
df1 =
Date Count
2010-01-02 09:00:00 15
2010-01-02 10:00:00 8
2010-01-02 11:00:00 9
2010-01-02 12:00:00 11
2010-01-02 13:00:00 8
2010-01-02 14:00:00 10
2010-01-02 15:00:00 8
2010-01-02 16:00:00 6
...
df2 =
Date Temp Rel_Hum Pressure Weather
2010-01-01 09:00:00 -5 93 100.36 Snow,Fog
2010-01-01 10:00:00 -5 93 100.36 Snow,Fog
2010-01-02 11:00:00 -6.5 91 100 Snow,Fog
2010-01-03 12:00:00 -7 87 89 Snow,Fog
2010-01-04 13:00:00 -7 87 89 Snow,Fog
2010-01-05 14:00:00 -6.7 88 89 Snow,Fog
2010-01-06 15:00:00 -6.5 89 89 Snow,Fog
2010-01-07 16:00:00 -6 88 90 Snow,Fog
2010-01-08 17:00:00 -6 89 89 Snow,Fog
...
I only need the weather info for the times specified in df1, but df2 contains weather info for a 24-hour period for every day of that month.
Since df1 only contains 2 columns, I've modified df1 to have Temp, Rel_Hum, Pressure, and Weather columns so that it resembles:
Date Count Temp Rel_Hum Pressure Weather
2010-01-02 09:00:00 15 0 0 0 0
2010-01-02 10:00:00 8 0 0 0 0
2010-01-02 11:00:00 9 0 0 0 0
2010-01-02 12:00:00 11 0 0 0 0
2010-01-02 13:00:00 8 0 0 0 0
2010-01-02 14:00:00 10 0 0 0 0
2010-01-02 15:00:00 8 0 0 0 0
2010-01-02 16:00:00 6 0 0 0 0
...
I've managed to test the code I've written on a one-month period, and the problem I'm encountering is that it takes a great deal of time to complete the task. I wanted to know if there is a faster way of going about this:
import pandas as pd
import numpy as np
from datetime import datetime
location = '/home/lukasz/Documents/BUS/HOURLY_DATA.xlsx'
location2 = '/home/lukasz/Documents/BUS/Weather Data/2010-01.xlsx'
df1 = pd.read_excel(location)
df2 = pd.read_excel(location2)
df1.Temp = df1.Temp.astype(float)
df1.Rel_Hum = df1.Rel_Hum.astype(float)
df1.Pressure = df1.Pressure.astype(float)
df1.Weather = df1.Weather.astype(str)
n = len(df2) - len(df1)
for i in range(len(df1)):
    print(df1['Date'][i])
    for j in range(i, i + n):
        date_object = datetime.strptime(df2['Date/Time'][j], '%Y-%m-%d %H:%M')  # the Date/Time column in df2 is a str
        if df1['Date'][i] == date_object:
            df1.set_value(i, 'Temp', df2['Temp'][j])
            df1.set_value(i, 'Dew_Point_Temp', df2['Dew_Point_Temp'][j])
            df1.set_value(i, 'Rel_Hum', df2['Rel_Hum'][j])
            df1.set_value(i, 'Pressure', df2['Pressure'][j])
            df1.set_value(i, 'Weather', df2['Weather'][j])
# print(df1[:5])
df1.to_excel(location, index=False)
Use a combination of reindex to get df2 aligned with df1. Make sure to include the parameter method='ffill' to forward-fill the weather information.
Then use join:
df1.join(df2.set_index('Date').reindex(df1.Date, method='ffill'), on='Date')
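As a minimal, self-contained sketch of that one-liner (toy data; it assumes df2's Date column is already parsed as datetimes and sorted):
import pandas as pd
df1 = pd.DataFrame({
    'Date': pd.date_range('2010-01-02 09:00', periods=4, freq='H'),
    'Count': [15, 8, 9, 11],
})
df2 = pd.DataFrame({
    'Date': pd.date_range('2010-01-02 08:30', periods=10, freq='H'),
    'Temp': [-5.0, -5.0, -6.5, -7.0, -7.0, -6.7, -6.5, -6.0, -6.0, -5.5],
})
# reindex aligns df2 to df1's timestamps; method='ffill' picks the most
# recent earlier weather row when there is no exact match
aligned = df2.set_index('Date').reindex(df1['Date'], method='ffill')
merged = df1.join(aligned, on='Date')
print(merged)
This replaces the nested Python loops with a single vectorised alignment, which is where the speedup comes from.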
I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
I want that, for every month in which the value -9999 is repeated more than 175 times, those values get changed to NaN.
Imagine that we have this other dataframe with the number of times the value is repeated per month:
date values
0 2013-01 200
1 2013-02 0
2 2013-03 2
3 2013-04 181
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
In this case the months of January and April exceed the stipulated value, so that first dataframe should become:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
I imagined creating a list with tolist() of the months in which the value appears more than 175 times, and then a condition like if df["values"]==-9999 and df["date"] in list_with_months to change the values.
You can do this using a transform call, which counts the number of missing values per month within the same dataframe. Then you create a new column conditionally on that count:
import numpy as np
MISSING = -9999
THRESHOLD = 175
# Create a month column
df['month'] = df['date'].dt.to_period('M')
# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
    df.groupby('month')['values']
      .transform(lambda d: (d == MISSING).sum())
)
# If value is MISSING and number of missing is above THRESHOLD, replace with NaN, otherwise keep original values
df['new_value'] = np.where(
    (df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
    np.nan,
    df['values']
)
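To sanity-check the logic, here is a toy frame run through the same steps (column names from the question; THRESHOLD lowered so the small example actually triggers):
import pandas as pd
import numpy as np
MISSING = -9999
THRESHOLD = 2  # lowered from 175 for this tiny example
df = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00',
                            '2013-01-01 02:00', '2013-02-01 00:00']),
    'values': [MISSING, MISSING, MISSING, 0.0],
})
df['month'] = df['date'].dt.to_period('M')
df['n_missing'] = df.groupby('month')['values'].transform(lambda d: (d == MISSING).sum())
df['new_value'] = np.where((df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
                           np.nan, df['values'])
print(df)  # the three January rows become NaN (3 > 2); February keeps 0.0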
Let's say I have the DataFrame below. How would I get an extra column 'flag' with 1s where a day has an Age bigger than 90, but only if that happens on 2 consecutive days (48 h in this case)? The output should contain 1s on 2 or more days, depending on how many days the condition is met. The dataset is much bigger, but I put a small portion here so you get the idea.
Age
Dates
2019-01-01 00:00:00 29
2019-01-01 01:00:00 56
2019-01-01 02:00:00 82
2019-01-01 03:00:00 13
2019-01-01 04:00:00 35
2019-01-01 05:00:00 53
2019-01-01 06:00:00 25
2019-01-01 07:00:00 23
2019-01-01 08:00:00 21
2019-01-01 09:00:00 12
2019-01-01 10:00:00 15
2019-01-01 11:00:00 9
2019-01-01 12:00:00 13
2019-01-01 13:00:00 87
2019-01-01 14:00:00 9
2019-01-01 15:00:00 63
2019-01-01 16:00:00 62
2019-01-01 17:00:00 52
2019-01-01 18:00:00 43
2019-01-01 19:00:00 77
2019-01-01 20:00:00 95
2019-01-01 21:00:00 79
2019-01-01 22:00:00 77
2019-01-01 23:00:00 5
2019-01-02 00:00:00 78
2019-01-02 01:00:00 41
2019-01-02 02:00:00 10
2019-01-02 03:00:00 10
2019-01-02 04:00:00 88
2019-01-02 05:00:00 19
This would be the desired output:
Dates Age flag
0 2019-01-01 00:00:00 29 1
1 2019-01-01 01:00:00 56 1
2 2019-01-01 02:00:00 82 1
3 2019-01-01 03:00:00 13 1
4 2019-01-01 04:00:00 35 1
5 2019-01-01 05:00:00 53 1
6 2019-01-01 06:00:00 25 1
7 2019-01-01 07:00:00 23 1
8 2019-01-01 08:00:00 21 1
9 2019-01-01 09:00:00 12 1
10 2019-01-01 10:00:00 15 1
11 2019-01-01 11:00:00 9 1
12 2019-01-01 12:00:00 13 1
13 2019-01-01 13:00:00 87 1
14 2019-01-01 14:00:00 9 1
15 2019-01-01 15:00:00 63 1
16 2019-01-01 16:00:00 62 1
17 2019-01-01 17:00:00 52 1
18 2019-01-01 18:00:00 43 1
19 2019-01-01 19:00:00 77 1
20 2019-01-01 20:00:00 95 1
21 2019-01-01 21:00:00 79 1
22 2019-01-01 22:00:00 77 1
23 2019-01-01 23:00:00 5 1
24 2019-01-02 00:00:00 78 0
25 2019-01-02 01:00:00 41 0
26 2019-01-02 02:00:00 10 0
27 2019-01-02 03:00:00 10 0
28 2019-01-02 04:00:00 88 0
29 2019-01-02 05:00:00 19 0
The dates are the index of the dataframe and are incremented by 1 h.
thanks
You can first compare the column with Series.gt, then group by DatetimeIndex.date and check whether there is at least one True per group using GroupBy.transform with GroupBy.any, cast the mask to integers for the True/False to 1/0 mapping, and combine it with the previous answer:
df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='5H', periods=24))
#for test 1H timestamp use
#df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='H', periods=24 * 5))
df.loc[pd.Timestamp('2019-01-02 01:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-03 02:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-05 19:00:00'), 'Age'] = 95
#print (df)
#for test 48 consecutive values change N = 48
N = 10
s = df['Age'].gt(90)
s1 = (s.groupby(df.index.date).transform('any'))
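# shift+cumsum assigns a common group id to each consecutive run of identical s1 values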
g1 = s1.ne(s1.shift()).cumsum()
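# a run is flagged when it spans at least N rows and lies in days where s1 is True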
df['flag'] = (s.groupby(g1).transform('size').ge(N) & s1).astype(int)
print (df)
Age flag
2019-01-01 00:00:00 10 0
2019-01-01 05:00:00 10 0
2019-01-01 10:00:00 10 0
2019-01-01 15:00:00 10 0
2019-01-01 20:00:00 10 0
2019-01-02 01:00:00 95 1
2019-01-02 06:00:00 10 1
2019-01-02 11:00:00 10 1
2019-01-02 16:00:00 10 1
2019-01-02 21:00:00 10 1
2019-01-03 02:00:00 95 1
2019-01-03 07:00:00 10 1
2019-01-03 12:00:00 10 1
2019-01-03 17:00:00 10 1
2019-01-03 22:00:00 10 1
2019-01-04 03:00:00 10 0
2019-01-04 08:00:00 10 0
2019-01-04 13:00:00 10 0
2019-01-04 18:00:00 10 0
2019-01-04 23:00:00 10 0
2019-01-05 04:00:00 10 0
2019-01-05 09:00:00 10 0
2019-01-05 14:00:00 10 0
2019-01-05 19:00:00 95 0
Apparently, this could be a solution to the first version of the question: how to add a column whose row values are 1 if at least one of the rows with the same date (y-m-d) has an Age value greater than 90.
import pandas as pd
df = pd.DataFrame({
    'Dates': ['2019-01-01 00:00:00',
              '2019-01-01 01:00:00',
              '2019-01-01 02:00:00',
              '2019-01-02 00:00:00',
              '2019-01-02 01:00:00',
              '2019-01-03 02:00:00',
              '2019-01-03 03:00:00'],
    'Age': [29, 56, 92, 13, 1, 2, 93],
})
df.set_index('Dates', inplace=True)
df.index = pd.to_datetime(df.index)
df['flag'] = df.index.normalize()  # the calendar date; .day alone would conflate equal day numbers across months
df['flag'] = df.flag.isin(df['flag'][df['Age'] > 90]).astype(int)
It returns:
Age flag
Dates
2019-01-01 00:00:00 29 1
2019-01-01 01:00:00 56 1
2019-01-01 02:00:00 92 1
2019-01-02 00:00:00 13 0
2019-01-02 01:00:00 1 0
2019-01-03 02:00:00 2 1
2019-01-03 03:00:00 93 1
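Note that building the flag from the normalized timestamp rather than from DatetimeIndex.day keeps dates in different months distinct; with .day alone, e.g. 2019-01-01 and 2019-02-01 would end up sharing the same flag.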
I have two data frames like the following: data frame A has datetimes down to the minute, while data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I'd like to merge the two on the basis of date and hour, so that every row in dataframe A gets filled on the basis of a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there a more efficient way to perform the same operation with the help of pandas' time-series/date functionality?
Use map if you only need to append one column from B to A, with floor to set minutes and seconds (where present) to 0:
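# build a lookup from B's floored hour to its count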
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
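(The first row stays NaN because B has no 11:00 entry for 2018-09-30: map does exact matching on the floored hour, so a missing hour in B yields NaN rather than the nearest earlier value.)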
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
I would like to add zero values to a pandas dataframe where data has not been recorded, using an hourly timestamp.
I.e. I would like the output to be:
DataFrame: quantity
created_at
2018-01-21 14:00:00 0
...
2018-01-22 12:00:00 0
2018-01-22 13:00:00 0
2018-01-22 14:00:00 31
In the code below, when I reindex, the values in the quantity column are set to NaN.
How can I keep the existing values but add hourly time indexes with zero values where they are missing?
import pandas as pd
from datetime import datetime, timedelta
data = {'date_time': ['2018-01-22 14:47:05.486877'],
'quantity': [31]}
df = pd.DataFrame(data, columns = ['date_time', 'quantity'])
df.index = df['date_time']
del df['date_time']
df.index = pd.to_datetime(df.index)
#want to sum data by hour
df = df.resample('H').sum()
#set minutes etc to zero for indexing
current_date = datetime.now().replace(microsecond=0,second=0,minute=0)
d2 = current_date - timedelta(hours = 24)
all_times = pd.date_range(d2, current_date, freq = "H")
#ensure index format is exactly the same as df (may be unnecessary?)
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M:%S'))
#this sets everything to NaN and wipes existing quantity data
df = df.reindex(all_times)
df = df.fillna(0)
Any ideas?
I think you need to convert the datetimes to hours with floor and to change the range for reindex - e.g. ±24 hours from the current datetime if necessary; it mainly depends on your current_date and DatetimeIndex. Also note that the index has to stay a DatetimeIndex for reindex to find matches - the strftime map in your code converts it to strings, which is why everything becomes NaN:
import pandas as pd
data = {'date_time': ['2018-01-22 14:47:05.486877'],
'quantity': [31]}
df = pd.DataFrame(data, columns = ['date_time', 'quantity'])
#print (df)
df.date_time = pd.to_datetime(df.date_time)
df = df.set_index('date_time')
df = df.resample('H').sum()
current_date = pd.Timestamp.now()  # pd.datetime was removed in newer pandas
print (current_date)
2018-01-22 10:31:37.663110
all_times = pd.date_range(current_date - pd.Timedelta(hours = 24),
                          current_date + pd.Timedelta(hours = 24), freq = "H").floor('H')
#print (all_times)
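# fill_value=0 writes zeros for the newly added hours in one step, so no separate fillna is needed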
df = df.reindex(all_times, fill_value=0)
print (df)
quantity
2018-01-21 10:00:00 0
2018-01-21 11:00:00 0
2018-01-21 12:00:00 0
2018-01-21 13:00:00 0
2018-01-21 14:00:00 0
2018-01-21 15:00:00 0
2018-01-21 16:00:00 0
2018-01-21 17:00:00 0
2018-01-21 18:00:00 0
2018-01-21 19:00:00 0
2018-01-21 20:00:00 0
2018-01-21 21:00:00 0
2018-01-21 22:00:00 0
2018-01-21 23:00:00 0
2018-01-22 00:00:00 0
2018-01-22 01:00:00 0
2018-01-22 02:00:00 0
2018-01-22 03:00:00 0
2018-01-22 04:00:00 0
2018-01-22 05:00:00 0
2018-01-22 06:00:00 0
2018-01-22 07:00:00 0
2018-01-22 08:00:00 0
2018-01-22 09:00:00 0
2018-01-22 10:00:00 0
2018-01-22 11:00:00 0
2018-01-22 12:00:00 0
2018-01-22 13:00:00 0
2018-01-22 14:00:00 31
2018-01-22 15:00:00 0
2018-01-22 16:00:00 0
2018-01-22 17:00:00 0
2018-01-22 18:00:00 0
2018-01-22 19:00:00 0
2018-01-22 20:00:00 0
2018-01-22 21:00:00 0
2018-01-22 22:00:00 0
2018-01-22 23:00:00 0
2018-01-23 00:00:00 0
2018-01-23 01:00:00 0
2018-01-23 02:00:00 0
2018-01-23 03:00:00 0
2018-01-23 04:00:00 0
2018-01-23 05:00:00 0
2018-01-23 06:00:00 0
2018-01-23 07:00:00 0
2018-01-23 08:00:00 0
2018-01-23 09:00:00 0
2018-01-23 10:00:00 0
I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days, but if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = # of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) &
                              (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and possibly several data points in one day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, my current code running speed:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
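Here '30d' is a time-based window measured back from each row's timestamp, and min_periods=5 yields NaN whenever fewer than 5 points fall inside it. One caveat: the window includes the current row, whereas the original code looked strictly at past days, so results can differ slightly at the boundaries.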
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
    '30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
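pd.rolling_median was removed in later pandas releases; on a modern version the closest equivalent is the rolling method (output column name illustrative). Note this counts 30 observations rather than 30 calendar days, so the time-aware '30d' window above remains the better match for data with gaps:
df['var30_median'] = df['var'].rolling(30, min_periods=5).median()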