Pandas create column from another dataframe if certain conditions match - python

Hey all, I am trying to create a new column in a dataframe based on whether certain conditions are met. The end goal is to have, in a new column, every row where the condition is unoccupied, as long as the building, floor, and location match and the time is greater than the occupied time.
Sample CSV File
I tried looking at this beforehand but I don't believe that it fits what I am trying to do. Other Stack Overflow Post
Would love to get pointed into the right direction for this.
Current code that I am playing around with (I also attempted this with a loop, but I no longer have that code to post):
import pandas as pd
import numpy as np
from IPython.display import display
df = pd.read_csv("/Users/username/Desktop/test.csv")
df2 = pd.DataFrame()
df2['Location'] = df.Location
df2['Type'] = df.Type
df2['Floor'] = df.Floor
df2['Building'] = df.Building
df2['Time'] = df['Date/Time']
df2['Status'] = df['Status']
df2 = df[~df['Condition'].isin(['Unoccupied'])]
df2['Went Unoccupied'] = np.where((df2['Location']==df['Location'])&(df2['Time'] < df['Date/Time']))

The OP tried to add the unoccupied time for each row that has Condition == "Occupied". The data seems to be well sorted and alternates between occupied and unoccupied, so we shift the dataset backward and create a new column, time_of_next_row. Then we query for the rows where df.Condition == "Occupied".
df["time_of_next_row"] = df.shift(-1)["Date/Time"]
df_occ = df[df.Condition == "Occupied"]
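A fuller sketch of that idea (assuming the column names from the question — Building, Floor, Location, Condition, Date/Time — and that each Occupied row is immediately followed by its matching Unoccupied row):
import pandas as pd

df = pd.read_csv("/Users/username/Desktop/test.csv")
df["Date/Time"] = pd.to_datetime(df["Date/Time"])

# sort so rows for the same place are adjacent and in time order
df = df.sort_values(["Building", "Floor", "Location", "Date/Time"])

# pull the next row's condition and time onto the current row
df["next_condition"] = df["Condition"].shift(-1)
df["time_of_next_row"] = df["Date/Time"].shift(-1)

# only trust the shifted values when the next row describes the same place
same_place = (
    (df["Building"] == df["Building"].shift(-1))
    & (df["Floor"] == df["Floor"].shift(-1))
    & (df["Location"] == df["Location"].shift(-1))
)

# keep Occupied rows whose next row is the matching Unoccupied event;
# time_of_next_row is then the moment the room went unoccupied
df_occ = df[
    (df["Condition"] == "Occupied")
    & same_place
    & (df["next_condition"] == "Unoccupied")
].rename(columns={"time_of_next_row": "Went Unoccupied"})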


Populating new Dataframe in Python taking too long, need to remove explicit recursive loops to improve performance

I am building a Python code to analyze the growth of COVID-19 across different nations. I am using the OWID database to get the latest values each time the code is run:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
data.to_csv('covid.csv')
data
OWID provides not just the CSV file but also XLSX and JSON formats; the JSON even has a nested structure. Might that help with efficiency?
I am trying to create a new DataFrame with the country names as the column headings and a date range containing all the listed dates as the index:
import datetime

data['date'] = pd.to_datetime(data['date'])
buffer = 30
cases = pd.DataFrame(columns=data['location'].drop_duplicates(),
                     index=pd.date_range(start=data['date'].min() - datetime.timedelta(buffer),
                                         end=data['date'].max()))
deaths = pd.DataFrame(columns=data['location'].drop_duplicates(),
                      index=pd.date_range(start=data['date'].min() - datetime.timedelta(buffer),
                                          end=data['date'].max()))
I am doing differentials on the values so I need to make sure each consecutive element is at equal time-steps (1 day).
The database does not have all the dates within the date range for most countries; many even have data missing for dates in the middle of the range. All I could think of was using explicit loops to populate the new dataframe:
import itertools

location = data['location'].drop_duplicates()
date_range = pd.date_range(data['date'].min(), data['date'].max())
for l, t in itertools.product(location, date_range):
    c = data.loc[(data['location'] == l) & (data['date'] == t), 'total_cases']
    d = data.loc[(data['location'] == l) & (data['date'] == t), 'total_deaths']
    if c.size != 0:
        cases[l][t] = c.iloc[0]
    if d.size != 0:
        deaths[l][t] = d.iloc[0]
This gets the job done, but it takes more than 20 minutes to complete on my fairly good PC. I know there is some way to do this without explicit loops, but I am new to Python.
Here is a faster implementation.
The key functions are pivot and reindex.
You can use the interpolate function for smarter filling of NaN values.
import pandas as pd
filename = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
df = pd.read_csv(
    filename,
    # parse dates while reading
    parse_dates=["date"],
    # select subset of columns
    usecols=["date", "location", "total_cases", "total_deaths"],
)
locations = df["location"].unique()
date_range = pd.date_range(df["date"].min(), df["date"].max())
# select needed columns and reshape
cases = (
    # turn location into columns and date into index
    pd.pivot(df, index="date", columns="location", values="total_cases")
    # fill missing dates
    .reindex(date_range)
    # fill missing locations
    .reindex(columns=locations)
    # fill NaN in total_cases using the previous value
    # another good option is .interpolate("time")
    .ffill()
    # sort columns
    .sort_index(axis=1)
)
# use the same logic for `total_deaths`
...
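To apply the same logic to total_deaths without repeating the whole pipeline, the steps can be factored into a small helper; a minimal sketch (the function name reshape_metric is my own, not from the original answer):
def reshape_metric(df, value_column, date_range, locations):
    """Pivot one metric into a (date x location) table with a complete index."""
    return (
        pd.pivot(df, index="date", columns="location", values=value_column)
        .reindex(date_range)           # fill missing dates
        .reindex(columns=locations)    # fill missing locations
        .ffill()                       # carry the last known cumulative value forward
        .sort_index(axis=1)
    )

cases = reshape_metric(df, "total_cases", date_range, locations)
deaths = reshape_metric(df, "total_deaths", date_range, locations)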

Is there a way to loop through pandas dataframe and drop windows of rows dependent on condition?

Problem Summary - I have a dataframe of ~10,000 rows. Some rows contain data aberrations that I would like to get rid of, and those aberrations are tied to observations made at certain temperatures (one of the data columns).
What I've tried - My thought is that the easiest way to remove the rows of bad data is to loop through the temperature intervals, find the maximum index that is less than each temperature-interval observation, and use the df.drop function to get rid of a window of rows around that index. Between every temperature interval at which bad data is observed, I reset the index of the dataframe. However, it is completely unstable: sometimes it nearly works, other times it throws KeyErrors. I think my problem may be in working with the dataframe "in place," but I don't see another way to do it.
Example Code:
Here is an example with a synthetic dataframe and a function that uses the same principles that I've tried. Note that I've tried different renditions with .loc and .iloc (commented out below).
#Create synthetic dataframe
import pandas as pd
import numpy as np
temp_series = pd.Series(range(25, 126, 1))
temp_noise = np.random.rand(len(temp_series))*3
df = pd.DataFrame({'temp':(temp_series+temp_noise), 'data':(np.random.rand(len(temp_series)))*400})
#calculate length of original and copy the original because the function works in place
before_length = len(df)
df_dup = df.copy()
temp_intervals = [50, 70, 92.7]
window = 5
From here, run a function based on the dataframe (df), the temperature observations (temp_intervals) and the window size (window):
def remove_window(df, intervals, window):
    '''Loop through the temperature intervals to define a window of indices
    around given temperatures in the dataframe to drop. Drop the window of
    indices in place and reset the index prior to moving to the next interval.
    '''
    for temp in intervals[0:len(intervals)]:
        #Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        #Define window of indices to remove from the df
        drop_indices = list(range(cent_index - window, cent_index + window))
        #Use df.drop
        df.drop(drop_indices, inplace=True)
        df.reset_index(drop=True)
    return df
So, is this a problem with the function I've defined, or is there a problem with df.drop?
Thank you,
Brad
It can be tricky to repeatedly delete parts of the dataframe and keep track of what you're doing. A cleaner approach is to keep track of which rows you want to delete within the loop, but only delete them outside of the loop, all at once. This should also be faster.
import numpy as np

def remove_window(df, intervals, window):
    # Create a Boolean array indicating which rows to keep
    keep_row = np.repeat(True, len(df))
    for temp in intervals[0:len(intervals)]:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Define window of indices to remove from the df
        keep_row[range(cent_index - window, cent_index + window)] = False
    # Delete all unwanted rows at once, outside the loop
    df = df[keep_row]
    df.reset_index(drop=True, inplace=True)
    return df
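As a quick usage check on the synthetic dataframe from the question (reusing df_dup, temp_intervals, window, and before_length; the exact rows dropped depend on the random noise):
df_clean = remove_window(df_dup, temp_intervals, window)
# each of the 3 intervals drops a window of 2 * 5 = 10 rows,
# so roughly 30 rows disappear if the windows don't overlap
print(before_length, len(df_clean))  # e.g. 101 71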

PYTHON: Filtering a dataset and truncating a date

I am fairly new to python, so any help would be greatly appreciated. I have a dataset that I need to filter down to specific events. For example, I have a column with dates and I need to know what dates are in the current month and have happened within the past week. The column is called POS_START_DATE with dates formatted like 2019-01-27T00:00:00-0500. I need to truncate that date and compare it to the previous week. No luck so far.
Here is my code so far:
## import data package
import datetime

## assign date variables
today = datetime.date.today()
six_day = datetime.timedelta(days=6)

## Create week parameter
week = today + six_day

## Statement to extract recent job movements
if fields.POS_START_DATE < week and fields.POS_START_DATE > today:
    out1 += in1
Here is a sample of the table:
Sample Table
I am looking for the same table filtered down to only rows that happened within one week. The bottom of the sample table(not shown) will have dates in this month. I'd like the final output to only show those rows, and any other rows in the current month of November.
I am not quite sure what your expected output is, but this will help you create an extra column that can be used as a flag for the cases that fulfill the condition in your if-statement:
import numpy as np
fields['flag_1'] = np.where(((fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)),1,0)
This will generate an extra column in your dataframe with a 1 for the cases that meet the criteria you stated. Finally, you can perform this calculation to get the total number of cases that actually met the criteria:
total_cases = fields['flag_1'].sum()
Edit:
If you need to filter the data with only the cases that meet the criteria you can either use pandas filtering with the original if-statement (without creating the extra flag field) like this:
df_filtered = fields[(fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)]
Or, if you created the flag, then much simpler:
df_filtered = fields[fields['flag_1'] == 1]
Both should work to generate a new dataframe, with only the cases that match your criteria.
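One caveat worth adding: POS_START_DATE values like 2019-01-27T00:00:00-0500 are strings with a UTC offset, so they need to be parsed (and, per the question, truncated to the day) before any of the comparisons above will work. A minimal sketch of one way to do that with pandas, assuming the timestamps sit at midnight as in the sample:
import pandas as pd

# parse the ISO strings (with their -0500 offset) into timezone-aware timestamps,
# then strip the timezone and truncate to midnight so only the date is compared
fields['POS_START_DATE'] = (
    pd.to_datetime(fields['POS_START_DATE'], utc=True)
      .dt.tz_localize(None)   # drop the timezone after converting to UTC
      .dt.normalize()         # truncate the time-of-day to 00:00:00
)

today = pd.Timestamp.today().normalize()
week = today + pd.Timedelta(days=6)
df_filtered = fields[(fields['POS_START_DATE'] > today) & (fields['POS_START_DATE'] < week)]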

beginner panda change row data based upon code

I'm a beginner in pandas and Python, trying to learn them.
I would like to iterate over pandas rows to apply simple coded logic.
Instead of fancy mapping functions, I just want simple coded logic.
That way I can easily adapt it later for other coded-logic rules as well.
In my dataframe dc,
I'd like to check if column AgeUnknown == 1 (or > 0).
If so, it should move the value of column Age to AgeUnknown,
and then set Age to 0.0.
I tried various combinations of the code below, but it won't work.
# using a row reference #########
for index, row in dc.iterrows():
    r = row['AgeUnknown']
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Another attempt:
for index in dc.index:
    r = dc.at[index, 'AgeUnknown'].[0]  # also tried .sum here
    if (r > 0):
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Also tried:
if (dc[index, 'Age'] > 0  # wasn't allowed either
Why isn't this working? As far as I understood, a dataframe should be addressable like the above.
I realize you requested a solution involving iterating the df, but I thought I'd provide one that I think is more traditional.
A non-iterating solution to your problem is something like this: 1) get all the indexes that meet your criteria, then 2) set those indexes of the df to what you want.
# indexes where column AgeUnknown is >0
inds = dc[dc['AgeUnknown'] > 0].index.tolist()
# copy the Age column into AgeUnknown at those indexes
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
# change the Age to 0 at those indexes
dc.loc[inds, 'Age'] = 0
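A quick self-contained check of that approach (the tiny dc below is made-up sample data, not from the question):
import pandas as pd

dc = pd.DataFrame({'Age': [34.0, 50.0, 0.0], 'AgeUnknown': [0.0, 1.0, 0.0]})

inds = dc[dc['AgeUnknown'] > 0].index.tolist()
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
dc.loc[inds, 'Age'] = 0

print(dc)
#     Age  AgeUnknown
# 0  34.0         0.0
# 1   0.0        50.0
# 2   0.0         0.0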

What is the most efficient way to count the number of instances occurring within a time frame in python?

I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These CSVs are thousands of rows, though, so while the way I have it set up currently works, it takes forever. This is what I have:
numSurgeries = [0 for shift in range(len(df.Date))]
for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
    for shift in range(len(df.DateTime)):
        if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift+1]:
            numSurgeries[shift] += 1
So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.
EDIT:
Example of OR data file
Example of df data file
Without example data, it's not absolutely clear what you want. But this should help you vectorise:
import numpy as np

numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                              (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift+1]))
                for shift in range(len(df.Date))}
The output is a dictionary mapping integer shift to numSurgeries.
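If df.DateTime is sorted, another option (a sketch of an alternative technique, not part of the original answer) is to let pandas bin all events at once with pd.cut, which avoids the Python-level loop over shifts entirely:
import pandas as pd

# bin every event into a shift interval in one vectorised pass;
# right=False makes each interval [start, next_start), matching >= and <
binned = pd.cut(OR['PATIENT_IN_ROOM_DTTM'], bins=df.DateTime, right=False)

# count events per interval, keeping the bins in chronological order
numSurgeries = binned.value_counts(sort=False)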
As mentioned above, it is hard to answer without example data.
However, a boolean mask sounds fitting. See Select dataframe rows between two dates.
Create a date mask from the shift; we'll call the start and end dates start_shift and end_shift, respectively. These should be in datetime format.
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
Locate all values in df that fit this range.
df_shift = df.loc[date_mask]
Count the number of instances in the new df_shift.
num_surgeries = len(df_shift.index)
Cycle through all shifts.
def count_shifts(df, start_shift, end_shift):
    date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
    df_shift = df.loc[date_mask]
    num_surgeries = len(df_shift.index)
    return num_surgeries

# iterate through results_df (one row per shift, assuming it carries
# start_shift and end_shift columns) and apply the function to every row
results_df['num_surgeries'] = results_df.apply(
    lambda row: count_shifts(df, row['start_shift'], row['end_shift']), axis=1)
Also, remember to name variables according to the PEP 8 style guide! camelCase is not recommended in Python.
