Filling NaN values from another dataframe based on a condition - python

I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and <= the EOL of a record in DF1, that row in DF2 must be assigned that record's LINE, DEVICE and TAPE_FILE.
So every point will be assigned a LINE, DEVICE and TAPE_FILE according to which SOL/EOL interval in DF1 its UTC_TIME falls within.
I'm trying to use the numpy where function for each column like this
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row
for point in points:
    if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']:
        return df1['DEVICE']

Try with merge_asof:
#convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(
    df2[["ID", "UTC_TIME"]], df1,
    left_on="UTC_TIME", right_on="SOL"
).drop(["SOL", "EOL"], axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12
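One caveat: merge_asof with the default direction="backward" only matches each UTC_TIME against the most recent SOL; it never checks the EOL bound. If DF1 can contain gaps between lines, a small post-check along these lines (a sketch, assuming the column names shown above) leaves points outside every SOL/EOL window as NaN:
#both sides must be sorted on their merge keys
matched = pd.merge_asof(
    df2.sort_values("UTC_TIME"),
    df1.sort_values("SOL"),
    left_on="UTC_TIME",
    right_on="SOL",
)
#blank out assignments for points that fall after the matched line's EOL
outside = matched["UTC_TIME"] > matched["EOL"]
matched.loc[outside, ["LINE", "DEVICE", "TAPE_FILE"]] = pd.NA
output = matched.drop(columns=["SOL", "EOL"])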

Related

pandas.Series.apply() lambda function to count data-frame column values with conditions

This post follows on from another one I posted which can be found here:
use groupby() and for loop to count column values with conditions
I am working with the same data again:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date + timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
Like in the previous post, I first created a pd.Series with the 1st day of every month in the entire history of the data
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I now want to do is count the number of rows in the data-frame where the df["start_date"] value is less than the 1st day of each month in the series and the df["end_date"] value is greater than that same day.
I would think I could apply a lambda function or use np.logical_and on the dates series to obtain the output I am after; the logic would look something like this:
#only obtain those rows with end dates
inactives = df[df["end_date"].isnull() == False]
dates.apply(
lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["cancel_date"] > x]).count()
)
or like this:
dates.apply(
lambda x: np.logical_and(
inactives[inactives["start_date"] < x,
inactives[inactives["cancel_date"] > x]]
).sum())
The resulting output would look like this:
month_first  count
 2015-01-01     10
 2015-02-01     25
 2015-03-01     45
Correct, we can use apply with a lambda for this. First, we create our list of first days of each month. Here we use freq="MS" to generate the start of each month within the defined interval.
new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})
This will result in this table:
month_first
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-04-01
4 2015-05-01
.. ...
89 2022-06-01
90 2022-07-01
91 2022-08-01
92 2022-09-01
93 2022-10-01
[94 rows x 1 columns]
Then we apply the lambda function below. For each date in our range, we compare it against inactives: start_date must be earlier and end_date later. The & operator combines the two boolean comparisons row by row, and sum() adds up the resulting True values to give the count.
new_df["count"] = new_df["month_first"].apply(
lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())
This will result in this table:
month_first count
0 2015-01-01 0
1 2015-02-01 4
2 2015-03-01 9
3 2015-04-01 14
4 2015-05-01 19
.. ... ...
89 2022-06-01 25
90 2022-07-01 22
91 2022-08-01 19
92 2022-09-01 13
93 2022-10-01 13
[94 rows x 2 columns]
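If the date range or inactives grows large, the per-row apply can be replaced by a single broadcast comparison in NumPy. This is only a sketch, reusing the new_df and inactives objects defined above:
starts = inactives["start_date"].to_numpy()
ends = inactives["end_date"].to_numpy()
months = new_df["month_first"].to_numpy()
#compare every month start against every row at once -> (n_months, n_rows) boolean matrix
mask = (starts < months[:, None]) & (ends > months[:, None])
new_df["count"] = mask.sum(axis=1)
The result should match the apply version row for row.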

Query for one dataframe row based on row in another dataframe & compare values

So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: What I am trying to accomplish is to query the df1 'base records' to find the row whose timestamp is the most recent relative to the df2 'simulation records' row, where the "Name" & "Time" columns match exactly.
Part 2: Then I want to use an if-then check to determine whether a value in the simulation record row falls within a range built from two values in the base record row, and return a boolean.
low range = df1['Po']-df1['Ref']
high range = df1['Po']+df1['Ref']
if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score"
otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
df1 (base records) may have more or fewer rows than df2 (simulation records)
df1 has more columns than df2
some columns in df1 share a name with columns in df2 but hold different values
ideally I want to be able to slice both dataframes so that the if-then check only sees the two rows used in the comparison
only the most recent df1 base record is needed to compare to each df2 simulation record
previously accomplished this in Google Sheets with an IF/QUERY combination formula dragged down the entire sheet (which I want to replace with Python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
Use pandas.DataFrame.reindex; its method parameter offers 'nearest', which needs an index where distance can be computed (e.g., strings cannot).
Or use merge_asof; its direction parameter also offers 'nearest'.
Method 1:
reindex() with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
As you can see, this is df2.join(df1), which joins DataFrame objects by index.
Because df1 is reindexed with method='nearest', each df2 row is effectively joined to the df1 row with the nearest Timestamp index.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof() with direction='nearest'
This approach does not operate on the index, so we don't have to set the Timestamp column as the index. However, merge_asof needs both objects sorted by the merge key (in this case we merge on the Timestamp column).
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
As you can see, this is pd.merge_asof(df2, df1),
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Frankly speaking, working on indexed things would be faster if you have a large dataset.
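One more hedge: direction='nearest' picks the closest base record in either direction. If "most recent" in the question should be read as the latest base record at or before each simulation timestamp, direction='backward' is the closer fit, e.g.:
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='backward')
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
Both frames are already sorted by Timestamp above, so this drops in as a one-line change.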
Method 2 (on multiple keys)
I modified df1, adding rows with a different Name and Time:
df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
'Name':['trial','trial','trial','non-trial','trial','trial'],
'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
'Po':[5, 6, 4, 1, 2, 3],
'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
Important:
merge_asof can only merge on a single key, so any additional keys have to be handled with by=.
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept
Because you don't provide code to construct the dataframe, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp so that the date alone sits in one column. This is done so we can join the dataframes on the date, i.e. use set_index("date_column") on both dataframes and an inner join to keep only the rows whose date is present in both. Finally, you can use apply() to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
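For concreteness, the join step described above might look like the sketch below (hypothetical names: a derived date_column on each frame, plus the Po, Ref and Sim columns from the question); the apply() line above then adds the Sim Score column to the joined frame:
#assumes at most one timestamp per day in each frame
df1_by_date = df1.assign(date_column=pd.to_datetime(df1['Timestamp']).dt.normalize()).set_index('date_column')
df2_by_date = df2.assign(date_column=pd.to_datetime(df2['Timestamp']).dt.normalize()).set_index('date_column')
#inner join keeps only dates present in both dataframes
df_joined = df2_by_date.join(df1_by_date[['Po', 'Ref']], how='inner')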
You can do it via pandasql:
But note that you had better add a unique key to one of the columns (e.g. a trial number).
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
ORDER BY Date ASC
''')
Also, to work with datetime values in sqldf you need to alias your Timestamp column as a date in the query. If you only need, say, the first/earliest 5 results, add LIMIT 5 at the end of the query.
If you need to get closest date in df2 to df1 try this:
from pandasql import sqldf
df3 = sqldf('''
SELECT df1.Timestamp AS Date1, df2.Timestamp AS Date2,
df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
and Date1 <= Date2
group by Date2
ORDER BY Date1 ASC
''')

Finding the beginning and end dates of when a sequence of values occurs in Pandas

I have a dataframe with an index column and another column that marks whether or not an event occurred on that day with a 1 or 0.
If an event occurred it typically happened continuously for a prolonged period of time. They'll typically mark whether or not a recession occurred, so it'd likely be 60-180 straight days that would be marked with a 1 before going to 0 again.
What I need to do is find the dates that mark the beginning and end of each sequence of 1's.
Here's some quick sample code:
dates = pd.date_range(start='2010-01-01', end='2015-01-01')
nums = np.random.normal(50, 5, 1827)
df = pd.DataFrame(nums, index=dates, columns=['Nums'])
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)
With the example dataframe, the value 1 occurs for the months of March and December, so ideally I'd have a list that reads [2010-03-01, 2010-03-31, 2010-12-01, 2010-12-30, ......, 2015-12-01, 2015-12-30].
I know I could find these values by using a for-loop, but that seems inefficient. I tried using groupby as well, but couldn't find anything that gave the results that I wanted.
Not sure if there's a pandas or numpy method to search an index for the appropriate conditions or not.
Let's try this, using DataFrameGroupBy.idxmin + DataFrameGroupBy.idxmax
# group-by on month, year & aggregate on date
g = (
    df.assign(day=df.index.day)
      .groupby([df.index.month, df.index.year]).day
)
# create mask of max date & min date for each (month, year) combination
mask = df.index.isin(g.idxmin()) | df.index.isin(g.idxmax())
# apply previous mask with month filter..
df.loc[mask & (df.index.month.isin([3,12])), 'Recession'] = 1
print(df[df['Recession'] == 1])
Nums Recession
2010-03-01 45.698168 1.0
2010-03-31 47.969167 1.0
2010-12-01 49.388595 1.0
2010-12-31 46.689064 1.0
2011-03-01 50.120603 1.0
2011-03-31 58.379980 1.0
2011-12-01 53.745407 1.0
...
...
I would use diff to find the periods: diff makes it possible to spot where the series switches from one state to the other, so the indices found can be split into two parts, the starts and the ends.
Depending whether the data starts with a recession or not:
locs = (df.Recession.diff().fillna(0) != 0).values.nonzero()[0]
if df.Recession.iloc[0] == 0:
    start = df.index[locs[::2]]
    end = df.index[locs[1::2] - 1]
else:
    start = df.index[locs[::2] - 1]
    end = df.index[locs[1::2]]
If the data started with a recession already, up to you if you want to include the first date as a start or not, the above does not include it.
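To get the flat [start, end, start, end, ...] list that the question describes, the two indexes can simply be interleaved, for example:
boundaries = [d for pair in zip(start, end) for d in pair]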
From what I understand, you need to find the first value in each sequence? If so, we can use groupby with cumsum to label each consecutive run, and cumcount to number the rows within each group.
df["keyGroup"] = (
df.groupby(df["Recession"].ne(df["Recession"].shift()).cumsum()).cumcount() + 1
)
df[df['keyGroup'].eq(1)]
Nums Recession keyGroup
2010-01-01 51.944742 0 1
2010-03-01 54.809271 1 1
2010-04-01 52.632831 0 1
2010-12-01 55.863695 1 1
2011-01-01 52.944778 0 1
2011-03-01 58.164943 1 1
2011-04-01 49.590640 0 1
2011-12-01 47.884919 1 1
2012-01-01 44.128065 0 1
2012-03-01 54.846231 1 1
2012-04-01 51.312064 0 1
2012-12-01 46.091171 1 1
2013-01-01 49.287102 0 1
2013-03-01 54.727874 1 1
2013-04-01 53.163730 0 1
2013-12-01 42.373602 1 1
2014-01-01 43.822791 0 1
2014-03-01 51.203125 1 1
2014-04-01 54.322415 0 1
2014-12-01 44.052536 1 1
2015-01-01 53.438015 0 1
You can access .index to get the values (wrap it in list() if you need an actual list).
df[df['keyGroup'].eq(1)].index
DatetimeIndex(['2010-01-01', '2010-03-01', '2010-04-01', '2010-12-01',
'2011-01-01', '2011-03-01', '2011-04-01', '2011-12-01',
'2012-01-01', '2012-03-01', '2012-04-01', '2012-12-01',
'2013-01-01', '2013-03-01', '2013-04-01', '2013-12-01',
'2014-01-01', '2014-03-01', '2014-04-01', '2014-12-01',
'2015-01-01'],
dtype='datetime64[ns]', name='date', freq=None)
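If you also need the last day of each run (so every recession comes out as a start/end pair), the same run-labelling trick can be pushed a little further. A sketch, assuming the df built in the question:
runs = df["Recession"].ne(df["Recession"].shift()).cumsum()
rec = df[df["Recession"].eq(1)]
#first and last index label of each run of 1's
spans = (
    rec.groupby(runs[rec.index])
       .apply(lambda g: (g.index.min(), g.index.max()))
       .tolist()
)
spans then holds one (start, end) tuple per recession.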

Join and sum on subset of rows in a dataframe

I have a pandas dataframe which stores date ranges and some associated columns:
date_start date_end ... lots of other columns ...
1 2016-07-01 2016-07-02
2 2016-07-01 2016-07-03
3 2016-07-01 2016-07-04
4 2016-07-02 2016-07-07
5 2016-07-05 2016-07-06
and another dataframe of Pikachu sightings indexed by date:
pikachu_sightings
date
2016-07-01 2
2016-07-02 4
2016-07-03 6
2016-07-04 8
2016-07-05 10
2016-07-06 12
2016-07-07 14
For each row in the first df I'd like to calculate the sum of pikachu_sightings within that date range (i.e., date_start to date_end) and store it in a new column. So I would end up with a df like this (numbers left in for clarity):
date_start date_end total_pikachu_sightings
1 2016-07-01 2016-07-02 2 + 4
2 2016-07-01 2016-07-03 2 + 4 + 6
3 2016-07-01 2016-07-04 2 + 4 + 6 + 8
4 2016-07-02 2016-07-07 4 + 6 + 8 + 10 + 12 + 14
5 2016-07-05 2016-07-06 10 + 12
If I was doing this iteratively I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range and perform a sum on it - but this is way too slow for my dataset:
for range in ranges.itertuples():
    sightings_in_range = sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)]
    sum_sightings_in_range = sightings_in_range["pikachu_sightings"].sum()
    ranges.set_value(range.Index, 'total_pikachu_sightings', sum_sightings_in_range)
This is my attempt at using pandas, but fails because the length of the two dataframes does not match (and even if they did, there's probably some other flaw in my approach):
range["total_pikachu_sightings"] =
sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)
["pikachu_sightings"].sum()
I'm trying to understand what the general approach/design should look like as I'd like to aggregate with other functions too, sum just seems like the easiest for an example. Sorry if this is an obvious question - I'm new to pandas!
A sketch of a vectorized solution:
Start with a p as in piRSquared's answer.
Make sure the date_ columns have datetime64 dtypes, i.e.:
df['date_start'] = pd.to_datetime(df.date_start)
df['date_end'] = pd.to_datetime(df.date_end)
Then calculate cumulative sums:
psums = p.cumsum()
and
result = psums.asof(df.date_end) - psums.asof(df.date_start)
It's not yet the end, though: asof returns the last available value, so it will sometimes include the exact start date and sometimes not (depending on your data), and you have to adjust for that. (If the date frequency is daily, then moving the index of p an hour backwards, e.g. by -pd.Timedelta(1, 'h'), and then adding p.asof(df.date_start) might do the trick.)
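Put together, that sketch might read like this (assuming p is the sightings as a Series with a sorted datetime index, and that every date_start actually appears in that index, which sidesteps the boundary adjustment above):
psums = p.cumsum()
df["total_pikachu_sightings"] = (
    psums.asof(df["date_end"]).to_numpy()
    - psums.asof(df["date_start"]).to_numpy()
    + p.asof(df["date_start"]).to_numpy()   #add the start day's own count back in
)
On the sample data this should reproduce the 6, 12, 20, 54, 22 from the apply-based answer below.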
First make sure that pikachu_sightings has a datetime index and is sorted.
p = pikachu_sightings.squeeze() # force into a series
p.index = pd.to_datetime(p.index)
p = p.sort_index()
Then make sure your date_start and date_end are datetime.
df.date_start = pd.to_datetime(df.date_start)
df.date_end = pd.to_datetime(df.date_end)
Then its simply
df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)
0 6
1 12
2 20
3 54
4 22
dtype: int64

Simplest way to find the difference between two dates in pandas

I'm trying to find the difference between two dates in a multi index data frame that is the result of a pivot table operation.
The data frame contains three columns. The first is a measurement the second is the end date and the third is the start date.
I've been able to successfully add a third multi-index column to the data frame, but only by setting every cell to zero, e.g. Pt["min"]["start_date"] = 0. When I try to subtract the two dates I get a string error, and appending .dt.days to each column results in an error as well.
What is the simplest way to find the difference in days between two dates in a multi index pandas data frame?
You can select Multiindex in columns by tuples and subtract columns:
print (df)
a
meas end start
0 7 2015-04-05 2015-04-01
1 8 2015-04-07 2015-04-02
2 9 2015-04-14 2015-04-04
#if dtypes not datetime
df[('a','end')] = pd.to_datetime(df[('a','end')])
df[('a','start')] = pd.to_datetime(df[('a','start')])
df[('a','diff')] = df[('a','end')] - df[('a','start')]
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4 days
1 8 2015-04-07 2015-04-02 5 days
2 9 2015-04-14 2015-04-04 10 days
If need output in days:
df[('a','diff')] = (df[('a','end')] - df[('a','start')]).dt.days
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4
1 8 2015-04-07 2015-04-02 5
2 9 2015-04-14 2015-04-04 10
