Pandas reindex datetimeindex keeping existing values - python

I would like to add zero values to a pandas DataFrame where data has not been recorded, using an hourly timestamp.
I.e. I would like the output to be:
DataFrame: quantity
created_at
2018-01-21 14:00:00 0
...
2018-01-22 12:00:00 0
2018-01-22 13:00:00 0
2018-01-22 14:00:00 31
In the code below, when I reindex, the values in the quantity column are set to NaN.
How can I keep the existing values but add hourly time indexes with zero values where they are missing?
import pandas as pd
from datetime import datetime, timedelta

data = {'date_time': ['2018-01-22 14:47:05.486877'],
        'quantity': [31]}
df = pd.DataFrame(data, columns=['date_time', 'quantity'])
df.index = df['date_time']
del df['date_time']
df.index = pd.to_datetime(df.index)
# want to sum data by hour
df = df.resample('H').sum()
# set minutes etc. to zero for indexing
current_date = datetime.now().replace(microsecond=0, second=0, minute=0)
d2 = current_date - timedelta(hours=24)
all_times = pd.date_range(d2, current_date, freq="H")
# ensure index format is exactly the same as df (may be unnecessary?)
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M:%S'))
# this sets everything to NaN and wipes existing quantity data
df = df.reindex(all_times)
df = df.fillna(0)
Any ideas?

Reindexing wipes your data because the strftime call converts the index to strings, which no longer match the DatetimeIndex you pass to reindex. I think you need to convert the datetimes to hours with floor, and change the range for reindex if necessary - e.g. +/- 24 hours from the current datetime - it mainly depends on current_date and the DatetimeIndex:
data = {'date_time': ['2018-01-22 14:47:05.486877'],
        'quantity': [31]}
df = pd.DataFrame(data, columns=['date_time', 'quantity'])
#print (df)
df.date_time = pd.to_datetime(df.date_time)
df = df.set_index('date_time')
df = df.resample('H').sum()
current_date = pd.Timestamp.now()  # pd.datetime.now() was removed in pandas 1.0
print (current_date)
2018-01-22 10:31:37.663110
all_times = pd.date_range(current_date - pd.Timedelta(hours=24),
                          current_date + pd.Timedelta(hours=24), freq="H").floor('H')
#print (all_times)
df = df.reindex(all_times, fill_value=0)
print (df)
quantity
2018-01-21 10:00:00 0
2018-01-21 11:00:00 0
2018-01-21 12:00:00 0
2018-01-21 13:00:00 0
2018-01-21 14:00:00 0
2018-01-21 15:00:00 0
2018-01-21 16:00:00 0
2018-01-21 17:00:00 0
2018-01-21 18:00:00 0
2018-01-21 19:00:00 0
2018-01-21 20:00:00 0
2018-01-21 21:00:00 0
2018-01-21 22:00:00 0
2018-01-21 23:00:00 0
2018-01-22 00:00:00 0
2018-01-22 01:00:00 0
2018-01-22 02:00:00 0
2018-01-22 03:00:00 0
2018-01-22 04:00:00 0
2018-01-22 05:00:00 0
2018-01-22 06:00:00 0
2018-01-22 07:00:00 0
2018-01-22 08:00:00 0
2018-01-22 09:00:00 0
2018-01-22 10:00:00 0
2018-01-22 11:00:00 0
2018-01-22 12:00:00 0
2018-01-22 13:00:00 0
2018-01-22 14:00:00 31
2018-01-22 15:00:00 0
2018-01-22 16:00:00 0
2018-01-22 17:00:00 0
2018-01-22 18:00:00 0
2018-01-22 19:00:00 0
2018-01-22 20:00:00 0
2018-01-22 21:00:00 0
2018-01-22 22:00:00 0
2018-01-22 23:00:00 0
2018-01-23 00:00:00 0
2018-01-23 01:00:00 0
2018-01-23 02:00:00 0
2018-01-23 03:00:00 0
2018-01-23 04:00:00 0
2018-01-23 05:00:00 0
2018-01-23 06:00:00 0
2018-01-23 07:00:00 0
2018-01-23 08:00:00 0
2018-01-23 09:00:00 0
2018-01-23 10:00:00 0
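A slightly simpler variant (a minimal sketch under the same setup; whether you need the +/- 24 hour window depends on your data): floor the current timestamp before building the range, so every label is already hour-aligned and no separate fillna step is needed:
import pandas as pd

# flooring first keeps the whole generated range hour-aligned
now = pd.Timestamp.now().floor('H')
all_times = pd.date_range(now - pd.Timedelta(hours=24), now, freq='H')
# fill_value=0 fills the missing hours in one step
df = df.reindex(all_times, fill_value=0)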

Related

Measure different between timestamps using conditions - python

I'm trying to measure the difference between timestamps using certain conditions. Using the data below, for each unique ID I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have a different combination of items, for example A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
                   'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
                   })
df['Item'] = ['A','B','D','A','D','A']
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                .reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
Set ID as the index and let index alignment handle the subtraction between the D and A rows:
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']
output:
ID
10 1 days 01:00:00
20 0 days 01:30:00
30 NaT
dtype: timedelta64[ns]
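To attach this back onto the original frame, as in the intended output, a small follow-up sketch (assuming at most one A row and one D row per ID):
diff = df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']
# map the per-ID result onto each row, keeping it only where Item is 'D'
df['diff'] = df['ID'].map(diff).where(df['Item'] == 'D')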
Older answer (written against a previous revision of the question's data, hence the different sample values):
The issue is your fillna; you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                #.fillna('-') # the issue is here
                .reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]

pandas calculate duration between datetime but not considering specific time range

For clarity, here is an MRE:
df = pd.DataFrame(
    {"id": [1,2,3,4],
     "start_time": ["2020-06-01 01:00:00", "2020-06-01 01:00:00", "2020-06-01 19:00:00", "2020-06-02 04:00:00"],
     "end_time": ["2020-06-01 14:00:00", "2020-06-01 18:00:00", "2020-06-02 10:00:00", "2020-06-02 16:00:00"]
    })
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])
df["sub_time"] = df["end_time"] - df["start_time"]
this outputs:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 13:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 17:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 15:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
but when the start_time ~ end_time span includes any part of the 00:00:00~03:59:59 range, I want to ignore that portion (not count it in sub_time).
So instead of the output above I would get:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 10:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 14:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 11:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
row 0: starting at 01:00:00, do not count until 04:00:00; then 04:00:00 ~ 14:00:00 is a 10 hour period.
row 2: count 19:00:00 ~ 24:00:00 and 04:00:00 ~ 10:00:00, which gives 11:00:00 in the sub_time column.
Any suggestions?
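One possible approach (a minimal sketch; active_duration is a hypothetical helper, and it assumes the blocked window is always the daily 00:00-04:00 range): walk over each day the interval touches and subtract the overlap with that day's blocked window:
import pandas as pd

def active_duration(start, end, block_start=0, block_end=4):
    # subtract any overlap with the daily [block_start, block_end) hour window
    total = end - start
    blocked = pd.Timedelta(0)
    day = start.normalize()
    while day < end:
        b0 = day + pd.Timedelta(hours=block_start)
        b1 = day + pd.Timedelta(hours=block_end)
        overlap = min(end, b1) - max(start, b0)
        if overlap > pd.Timedelta(0):
            blocked += overlap
        day += pd.Timedelta(days=1)
    return total - blocked

df["sub_time"] = [active_duration(s, e) for s, e in zip(df["start_time"], df["end_time"])]
This reproduces the expected sub_time values above for all four sample rows.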

How do I display a subset of a pandas dataframe?

I have a dataframe df that contains datetimes for every hour of a day from 2003-02-12 to 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want those dates excluded for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
And now put the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
You can try dropping rows on a conditional, for example by matching a pattern in the date string and removing the matching index labels:
datesIdontLike = df[df['colname'] == <stringPattern>].index
df.drop(datesIdontLike, inplace=True)  # with inplace=True, drop returns None, so don't assign the result
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
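For instance, a concrete sketch of that idea against the sample frame above (the column name 'dates' is assumed):
# build a mask for Dec 24-31 and Jan 1, then drop those index labels
mm_dd = df['dates'].dt.strftime('%m-%d')
mask = (mm_dd >= '12-24') | (mm_dd == '01-01')
df = df.drop(df.index[mask])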
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then set it strictly less than '12-24' AND greater than or equal to `01-02`
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series where we replace the year with a static value (2000, for example). Let date be the column that stores the date; we can generate such a column as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following data for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless what the year is, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).

How can I organize data hour-by-hour and set the missing values to zeros?

I played games several times a day and I got a score each time. I would like to reorganize the data hour-by-hour, and set the missing values to zero.
Here is the original data:
import pandas as pd
df = pd.DataFrame({
    'Time': ['2017-01-01 08:45:00', '2017-01-01 09:11:00',
             '2017-01-01 11:40:00', '2017-01-01 14:05:00',
             '2017-01-01 21:00:00'],
    'Score': range(1, 6)})
It looks like this:
Score Time
0 1 2017-01-01 08:45:00
1 2 2017-01-01 09:11:00
2 3 2017-01-01 11:40:00
3 4 2017-01-01 14:05:00
4 5 2017-01-01 21:00:00
How can I get a new dataframe like this:
day Hour Score
2017-01-01 00:00:00 0
...
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
...
2017-01-01 21:00:00 5
2017-01-01 22:00:00 0
2017-01-01 23:00:00 0
Many thanks!
You can use resample with an aggregate function like sum, then fillna and convert to int with astype, but first add rows with the first and last DateTime values so the resample covers the whole day:
df.loc[-1, 'Time'] = '2017-01-01 00:00:00'
df.loc[-2, 'Time'] = '2017-01-01 23:00:00'
df['Time'] = pd.to_datetime(df['Time'])
df = df.resample('H', on='Time').sum().fillna(0).astype(int)
print (df)
Score
Time
2017-01-01 00:00:00 0
2017-01-01 01:00:00 0
2017-01-01 02:00:00 0
2017-01-01 03:00:00 0
2017-01-01 04:00:00 0
2017-01-01 05:00:00 0
2017-01-01 06:00:00 0
2017-01-01 07:00:00 0
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
2017-01-01 17:00:00 0
2017-01-01 18:00:00 0
2017-01-01 19:00:00 0
2017-01-01 20:00:00 0
2017-01-01 21:00:00 5
2017-01-01 22:00:00 0
2017-01-01 23:00:00 0
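An alternative sketch that avoids the sentinel rows (assuming you want exactly the 24 hours of 2017-01-01): resample first, then reindex onto a full-day range, as in the reindex question at the top of this page:
df['Time'] = pd.to_datetime(df['Time'])
hourly = df.resample('H', on='Time').sum()
# extend to the full day, filling the missing hours with zero
full_day = pd.date_range('2017-01-01', periods=24, freq='H')
hourly = hourly.reindex(full_day, fill_value=0)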

Merging two dataframes only at specific times

I have two excel files that I'm trying to merge into one using pandas. The first file is a list of times and dates with a subscriber count for that given time and day. The second file has weather information on an hourly basis. I import both files and the data resembles:
df1=
Date Count
2010-01-02 09:00:00 15
2010-01-02 10:00:00 8
2010-01-02 11:00:00 9
2010-01-02 12:00:00 11
2010-01-02 13:00:00 8
2010-01-02 14:00:00 10
2010-01-02 15:00:00 8
2010-01-02 16:00:00 6
...
df2 =
Date Temp Rel_Hum Pressure Weather
2010-01-00 09:00:00 -5 93 100.36 Snow,Fog
2010-01-01 10:00:00 -5 93 100.36 Snow,Fog
2010-01-02 11:00:00 -6.5 91 100 Snow,Fog
2010-01-03 12:00:00 -7 87 89 Snow,Fog
2010-01-04 13:00:00 -7 87 89 Snow,Fog
2010-01-05 14:00:00 -6.7 88 89 Snow,Fog
2010-01-06 15:00:00 -6.5 89 89 Snow,Fog
2010-01-07 16:00:00 -6 88 90 Snow,Fog
2010-01-08 17:00:00 -6 89 89 Snow,Fog
...
I only need the weather info for the times that are specified in df1, but df2 contains weather info for a 24 hour period for every day of that month.
Since df1 only contains 2 columns, I've modified df1 to have Temp, Rel_Hum, Pressure and Weather columns so that it resembles:
Date Count Temp Rel_Hum Pressure Weather
2010-01-02 09:00:00 15 0 0 0 0
2010-01-02 10:00:00 8 0 0 0 0
2010-01-02 11:00:00 9 0 0 0 0
2010-01-02 12:00:00 11 0 0 0 0
2010-01-02 13:00:00 8 0 0 0 0
2010-01-02 14:00:00 10 0 0 0 0
2010-01-02 15:00:00 8 0 0 0 0
2010-01-02 16:00:00 6 0 0 0 0
...
I've managed to test the code I've written on a one month period, and the problem is that it takes a great deal of time to complete. Is there a faster way of going about this?
import pandas as pd
import numpy as np
from datetime import datetime

location = '/home/lukasz/Documents/BUS/HOURLY_DATA.xlsx'
location2 = '/home/lukasz/Documents/BUS/Weather Data/2010-01.xlsx'
df1 = pd.read_excel(location)
df2 = pd.read_excel(location2)
df1.Temp = df1.Temp.astype(float)
df1.Rel_Hum = df1.Rel_Hum.astype(float)
df1.Pressure = df1.Pressure.astype(float)
df1.Weather = df1.Weather.astype(str)
n = len(df2) - len(df1)
for i in range(len(df1)):
    print(df1['Date'][i])
    for j in range(i, i + n):
        # the date column in df2 is a str
        date_object = datetime.strptime(df2['Date/Time'][j], '%Y-%m-%d %H:%M')
        if df1['Date'][i] == date_object:
            df1.set_value(i, 'Temp', df2['Temp'][j])
            df1.set_value(i, 'Dew_Point_Temp', df2['Dew_Point_Temp'][j])
            df1.set_value(i, 'Rel_Hum', df2['Rel_Hum'][j])
            df1.set_value(i, 'Pressure', df2['Pressure'][j])
            df1.set_value(i, 'Weather', df2['Weather'][j])
# print(df1[:5])
df1.to_excel(location, index=False)
Use reindex with method='ffill' to align df2 with df1, forward filling the weather information, then use join:
df1.join(df2.set_index('Date').reindex(df1.Date, method='ffill'), on='Date')
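A possible alternative sketch: pd.merge_asof does the same most-recent-match alignment in a single call (both frames must be sorted by Date first):
merged = pd.merge_asof(df1.sort_values('Date'),
                       df2.sort_values('Date'),
                       on='Date')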
