Merge two columns in Pandas - python

I have the following Pandas DataFrame:
date at weight status buy_ts sell_ts
--- ------------------- ------ ------------ -------- ------------------- -------------------
0 2010-01-03 00:00:00 1.4286 7 buy 2010-01-04 01:47:00 nan
1 2010-01-03 00:00:00 1.4288 7 buy 2010-01-04 00:00:00 nan
2 2010-01-03 00:00:00 1.4289 7 buy 2010-01-04 00:00:00 nan
3 2010-01-04 00:00:00 1.442 25 buy 2010-01-05 00:00:00 nan
4 2010-01-05 00:00:00 1.4422 15 sell nan 2010-01-06 14:03:00
5 2010-01-05 00:00:00 1.4423 15 sell nan 2010-01-06 14:03:00
6 2010-01-05 00:00:00 1.4424 15 sell nan 2010-01-06 14:03:00
7 2010-01-06 00:00:00 1.4403 18 sell nan 2010-01-07 00:04:00
8 2010-01-06 00:00:00 1.4404 18 sell nan 2010-01-07 00:05:00
9 2010-01-06 00:00:00 1.4405 18 sell nan 2010-01-08 08:54:00
10 2010-01-07 00:00:00 1.4313 26 buy 2010-01-08 00:07:00 nan
11 2010-01-07 00:00:00 1.4314 26 buy 2010-01-08 00:07:00 nan
12 2010-01-07 00:00:00 1.4316 26 sell nan 2010-01-08 00:10:00
buy_ts and sell_ts contain Python datetime.datetime objects.
I would like to create a new column called merged_ts which contains the datetime.datetime object from buy_ts or sell_ts (when one column has a value the other is always nan, so both columns are never populated at once).

Use combine_first:
df['merged_ts'] = df['buy_ts'].combine_first(df['sell_ts'])
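A minimal sketch of how this behaves, on a small made-up frame (the three rows below are illustrative and are not the asker's full data). combine_first keeps the values of buy_ts and fills its missing entries from sell_ts:

```python
import pandas as pd

# Illustrative sample: exactly one of buy_ts / sell_ts is set per row.
df = pd.DataFrame({
    "status": ["buy", "sell", "buy"],
    "buy_ts": pd.to_datetime(["2010-01-04 01:47", None, "2010-01-08 00:07"]),
    "sell_ts": pd.to_datetime([None, "2010-01-06 14:03", None]),
})

# Take buy_ts where present, otherwise fall back to sell_ts.
df["merged_ts"] = df["buy_ts"].combine_first(df["sell_ts"])
print(df[["status", "merged_ts"]])
```

If both columns could ever be populated on the same row, the value from buy_ts would win, so the order of the two operands matters in that case.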

Related

Pandas - Replace Duplicates with Nan and Keep Row

How do I replace duplicates within each group with NaNs while keeping the rows?
I need to keep all rows rather than remove them, and keep only the first occurrence of each value.
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
'2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
'value': [10,10,10,10,12,12,12,12],
'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 10 Jackie
2 2019-01-01 02:00:00 10 Jackie
3 2019-01-01 03:00:00 10 Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 12 Zoop
6 2019-09-01 04:00:00 12 Zoop
7 2019-09-01 05:00:00 12 Zoop
Desired Dataframe:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
Edit:
Duplicated values should only be dropped within the same date, regardless of frequency. So if the value 10 shows up twice on Jan-1 and three times on Jan-2, it should show up only once on Jan-1 and once on Jan-2.
I assume you check for duplicates on the value and ID columns, and additionally on the calendar date taken from the date column:
import numpy as np # needed for np.nan
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = np.nan
Out[269]:
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
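To check the "once per day" rule from the edit, here is a small sketch on made-up data where the same value appears on two different days. The helper column name day and the use of Series.mask are my own choices for illustration, not part of the answer above:

```python
import pandas as pd

# Illustrative data: value 10 appears twice on Jan-1 and twice on Jan-2.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-01 00:00", "2019-01-01 01:00",
        "2019-01-02 00:00", "2019-01-02 05:00",
    ]),
    "value": [10, 10, 10, 10],
    "ID": ["Jackie"] * 4,
})

# Flag every repeat of (value, ID, calendar day) after its first occurrence.
dup = df.assign(day=df["date"].dt.normalize()).duplicated(["value", "ID", "day"])

# mask() replaces the flagged entries with NaN (the column becomes float).
df["value"] = df["value"].mask(dup)
print(df)
```

The first row of each day keeps its value, and the later rows of the same day become NaN.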
As @Trenton suggests, you may use pd.NA to avoid importing numpy.
(Note: as @rafaelc suggests, this link explains the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA
Out[273]:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 <NA> Jackie
2 2019-01-01 02:00:00 <NA> Jackie
3 2019-01-01 03:00:00 <NA> Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 <NA> Zoop
6 2019-09-01 04:00:00 <NA> Zoop
7 2019-09-01 05:00:00 <NA> Zoop
This works if the dataframe is sorted, as in your example:
import numpy as np # to be used for np.nan
df['duplicate'] = df['value'].shift(1) # create a duplicate column
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate'] \
else x['value'], axis=1) # conditional replace
df = df.drop('duplicate', axis=1) # drop helper column
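The same neighbour comparison can be written without apply or a helper column. This is only a sketch on a made-up Series. Like the code above, it compares each row with the one directly before it, so it removes consecutive repeats only and ignores ID and date:

```python
import pandas as pd

# Illustrative values; the final 10 is not adjacent to the earlier 10s.
s = pd.Series([10, 10, 12, 12, 10], name="value")

# Keep a value only where it differs from the previous row, NaN elsewhere.
deduped = s.where(s != s.shift())
print(deduped)
```

Note that the last 10 survives because its predecessor is 12, which is why the data has to be sorted for this approach to give the desired result.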
Group on the dates and take the first observed value (not necessarily the first when sorted by time), then merge the result back to the original dataframe.
df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
>>> df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis". The solution isn't provided.
I have an excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data Sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx') # raw string so the backslashes are not treated as escapes
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5 min frequency so that I can interpolate the NaN later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
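One thing to watch in the output above: with '%d/%m/%Y' the date 4/1/2018 is read as 4 January. The question's date_range runs over April 2018, so the data may be month-first. The sketch below assumes month-first dates and a space between date and time; both are assumptions about the asker's file, and the sample rows are made up:

```python
import pandas as pd

# Illustrative rows spanning two dates with the same clock time.
df1 = pd.DataFrame({
    "Date": ["4/1/2018", "4/1/2018", "4/2/2018"],
    "Time": ["12:05 AM", "12:10 AM", "12:05 AM"],
    "Temp": [30.6, None, 31.0],
})

# Assumption: month comes first, so 4/1/2018 means 1 April 2018.
df1["DateTime"] = pd.to_datetime(
    df1["Date"] + " " + df1["Time"], format="%m/%d/%Y %I:%M %p"
)

# A combined timestamp is unique even when clock times repeat across days.
print(df1["DateTime"].is_unique)
```

Building the index from date and time together also avoids repeated index values, which is the usual cause of the "cannot reindex from a duplicate axis" error.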
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx') # raw string so the backslashes are not treated as escapes
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
or use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
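A small sketch of the difference between the two calls, on made-up readings with one missing 5-minute slot. I use '5min' here, which means the same as '5T' and is the spelling newer pandas versions prefer:

```python
import pandas as pd

# Illustrative readings: the 00:15 slot is missing.
idx = pd.to_datetime(["2018-04-01 00:05", "2018-04-01 00:10", "2018-04-01 00:20"])
ts = pd.DataFrame({"Temp": [30.6, 30.5, 30.7]}, index=idx)

regular = ts.asfreq("5min")                 # inserts 00:15 with NaN
filled = ts.asfreq("5min", method="ffill")  # inserts 00:15 with the 00:10 value

# Time-weighted interpolation of the gap left by the plain asfreq call.
interpolated = regular.interpolate(method="time")
print(regular, filled, interpolated, sep="\n\n")
```

Plain asfreq leaves the new rows empty, which is what you want if the next step is interpolation. method='ffill' fills them immediately and leaves nothing to interpolate.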
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this. The Date column was converted to datetime, but the Time column is stored as plain objects (dtype object), so the two have to be combined explicitly. Below, a proper index is created by use of a lambda function.
rawidx=rawpd.apply(lambda r : pd.Timestamp.combine(r['Date'],r['Time']),axis=1) # pd.datetime was removed in pandas 2.0
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a file ready for interpolation
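The template-and-update steps above can be sketched end to end without the Excel file. The data is typed in by hand to mirror the printed rawpd, and the names raw and target are placeholders of mine. The last line shows reindex as a possible shortcut that should produce the same table, assuming the raw index has no duplicates:

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for rawpd, already indexed by the combined timestamp.
raw = pd.DataFrame(
    {"Col1": [1.0, 2.0, np.nan, np.nan, 5.0],
     "Col2": [10.0, np.nan, 10.0, 10.0, 10.0]},
    index=pd.to_datetime([
        "2018-04-01 01:00", "2018-04-01 01:05", "2018-04-01 01:10",
        "2018-04-01 01:20", "2018-04-01 01:30",
    ]),
)

# Blank template on a regular 5-minute grid.
grid = pd.date_range("2018-04-01 01:00", periods=7, freq="5min")
target = pd.DataFrame(np.nan, index=grid, columns=["Col1", "Col2"])

# update() copies the non-NaN values of raw into the matching rows, in place.
target.update(raw)

# Possible shortcut: align raw directly onto the grid.
shortcut = raw.reindex(grid)
print(target)
```

The rows for 01:15 and 01:25 stay NaN in both results, ready for interpolation.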
I got it to work. Thank you, everyone, for your time. Here is the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)

Combine dataframes result with a DatetimeIndex index

I have a pandas dataframe with random values at every minute.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,30,size=20), index=pd.date_range("20180101", periods=20, freq='T'))
df
0
2018-01-01 00:00:00 21
2018-01-01 00:01:00 21
2018-01-01 00:02:00 23
2018-01-01 00:03:00 18
2018-01-01 00:04:00 3
2018-01-01 00:05:00 11
2018-01-01 00:06:00 3
2018-01-01 00:07:00 4
2018-01-01 00:08:00 5
2018-01-01 00:09:00 25
2018-01-01 00:10:00 15
2018-01-01 00:11:00 11
2018-01-01 00:12:00 29
2018-01-01 00:13:00 22
2018-01-01 00:14:00 7
2018-01-01 00:15:00 13
2018-01-01 00:16:00 26
2018-01-01 00:17:00 7
2018-01-01 00:18:00 26
2018-01-01 00:19:00 15
Now I need to create a new column in the dataframe df that "reflects" the mean() over a window of 2 periods at a coarser frequency (5 minutes).
df2 = df.resample('5T').sum().rolling(2).mean()
df2
0
2018-01-01 00:00:00 NaN
2018-01-01 00:05:00 67.0
2018-01-01 00:10:00 66.0
2018-01-01 00:15:00 85.5
Here is the problem: I need to somehow "map" the values of the 5-minute frame back onto the 1-minute frame.
I should get something like:
0 new_column
2018-01-01 00:00:00 21 NaN
2018-01-01 00:01:00 21 NaN
2018-01-01 00:02:00 23 NaN
2018-01-01 00:03:00 18 NaN
2018-01-01 00:04:00 3 NaN
2018-01-01 00:05:00 11 67.0
2018-01-01 00:06:00 3 67.0
2018-01-01 00:07:00 4 67.0
2018-01-01 00:08:00 5 67.0
2018-01-01 00:09:00 25 67.0
2018-01-01 00:10:00 15 66.0
2018-01-01 00:11:00 11 66.0
2018-01-01 00:12:00 29 66.0
2018-01-01 00:13:00 22 66.0
2018-01-01 00:14:00 7 66.0
2018-01-01 00:15:00 13 85.5
2018-01-01 00:16:00 26 85.5
2018-01-01 00:17:00 7 85.5
2018-01-01 00:18:00 26 85.5
2018-01-01 00:19:00 15 85.5
I am using pandas 0.23.4
You can just use:
df['new_column'] = df2[0].repeat(5).values
with 5 being your resampling factor. This assumes every 5-minute bin contains exactly 5 rows of df.
You can pd.concat both dataframes and forward-fill:
df3=pd.concat([df,df2],axis=1).ffill()
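Another option, sketched here on deterministic data instead of random values so the numbers can be checked by hand, is to align the 5-minute result onto the 1-minute index with reindex and method='ffill'. This variant is my own suggestion, not one of the answers above. It does not depend on each bin having exactly 5 rows:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data: values 0..19, one per minute.
idx = pd.date_range("2018-01-01", periods=20, freq="min")
df = pd.DataFrame({0: np.arange(20)}, index=idx)

# 5-minute sums are 10, 35, 60, 85; the 2-period rolling mean of those
# is NaN, 22.5, 47.5, 72.5.
df2 = df.resample("5min").sum().rolling(2).mean()

# Each minute takes the value of the most recent 5-minute label.
df["new_column"] = df2[0].reindex(df.index, method="ffill")
print(df)
```

Because the fill is based on index labels, the first five minutes keep the NaN of the 00:00 bin, matching the desired output in the question.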

How to merge two pandas time series objects with different date time indices?

I have two disjoint time series objects, for example
-ts1
Date Price
2010-01-01 1800.0
2010-01-04 1500.0
2010-01-08 1600.0
2010-01-09 1400.0
Name: Price, dtype: float64
-ts2
Date Price
2010-01-02 2000.0
2010-01-03 2200.0
2010-01-05 2010.0
2010-01-07 2100.0
2010-01-10 2110.0
How I could merge the two into a single time series that should be sorted on date? like
-ts3
Date Price
2010-01-01 1800.0
2010-01-02 2000.0
2010-01-03 2200.0
2010-01-04 1500.0
2010-01-05 2010.0
2010-01-07 2100.0
2010-01-08 1600.0
2010-01-09 1400.0
2010-01-10 2110.0
Use pandas.concat (or DataFrame.append, which was removed in pandas 2.0) to join the two together, then DataFrame.sort_values on the Date column, and finally DataFrame.reset_index with parameter drop=True for a default index:
df3 = pd.concat([df1, df2]).sort_values('Date').reset_index(drop=True)
Alternative:
df3 = df1.append(df2).sort_values('Date').reset_index(drop=True)
print (df3)
Date Price
0 2010-01-01 1800.0
1 2010-01-02 2000.0
2 2010-01-03 2200.0
3 2010-01-04 1500.0
4 2010-01-05 2010.0
5 2010-01-07 2100.0
6 2010-01-08 1600.0
7 2010-01-09 1400.0
8 2010-01-10 2110.0
EDIT:
If these are Series, the solution is simpler:
s3= pd.concat([s1, s2]).sort_index()
You can set the index of each to 'Date' and use combine_first
ts1.set_index('Date').combine_first(ts2.set_index('Date')).reset_index()
Date Price
0 2010-01-01 1800.0
1 2010-01-02 2000.0
2 2010-01-03 2200.0
3 2010-01-04 1500.0
4 2010-01-05 2010.0
5 2010-01-07 2100.0
6 2010-01-08 1600.0
7 2010-01-09 1400.0
8 2010-01-10 2110.0
If these were Series in the first place, then you could simply
ts1.combine_first(ts2)
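For Series with disjoint date indices, both approaches give the same result. A sketch using the values from the question, with the Series built by hand:

```python
import pandas as pd

ts1 = pd.Series(
    [1800.0, 1500.0, 1600.0, 1400.0],
    index=pd.to_datetime(["2010-01-01", "2010-01-04", "2010-01-08", "2010-01-09"]),
    name="Price",
)
ts2 = pd.Series(
    [2000.0, 2200.0, 2010.0, 2100.0, 2110.0],
    index=pd.to_datetime(
        ["2010-01-02", "2010-01-03", "2010-01-05", "2010-01-07", "2010-01-10"]
    ),
    name="Price",
)

# Stack the two Series, then order by date.
ts3 = pd.concat([ts1, ts2]).sort_index()

# Union of both indices; ts1 wins wherever both have a value.
combined = ts1.combine_first(ts2)
print(ts3)
```

The two differ only when the indices overlap: concat keeps both rows for a shared date, while combine_first keeps the value from ts1.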

Resample python list with pandas

Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Next I do this:
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print("Now daily summary")
print(daily_summary)
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' column to actual pd.Timestamp values. It looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp':
df = df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
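To always end up with exactly 24 hourly points for the queried day, one option is to reindex the hourly result onto a fixed 24-hour grid. The sketch below uses the two data points from the question and assumes the day of interest is 2016-11-15 (UTC). The grid step is written as '60min', which is equivalent to hourly and avoids differences in how pandas versions spell the hour alias:

```python
import pandas as pd

# The two points from the question (epoch milliseconds, value).
m3hstream = [(1479218009000, 109), (1479287368000, 84)]

df = pd.DataFrame(m3hstream, columns=["Timestamp", "Value"])
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="ms")

# Hourly means over whatever range the data happens to cover.
hourly = df.set_index("Timestamp")["Value"].resample("60min").mean()

# Assumption: the queried day is 2016-11-15; build its 24 hourly slots.
day = pd.date_range("2016-11-15", periods=24, freq="60min")
daily_summary = hourly.reindex(day)
print(daily_summary)
```

Hours without data come back as NaN and can then be filled with ffill, bfill, or interpolate as described above.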
