Multiple sets of criteria to subset python dataframe with Pandas - python

I have a small data set that I am trying to filter out to create an even smaller dataframe. The issue I'm having is I don't know how to get the sets of criteria nested inside one another to work correctly.
The code below is the closest I have been able to get. It should be looking in the larger data frame with columns for 'material', 'waterbody', and 'mosscat'; and only returning those that satisfy the combination of stone, marsh, and average OR brick, river, average.
dfy = dfx[
(dfx['material']=='stone') &
(dfx['waterbody']=='marsh') &
(dfx['mosscat']=='average'),
(dfx['material']=='brick') &
(dfx['waterbody']=='river') &
(dfx['mosscat']=='average')]

Making your code readable makes it easier to see your mistakes. I edited your question to add whitespace, now let's edit some more to fix it:
isStoneMarsh = (dfx['material'] == 'stone') & (dfx['waterbody'] == 'marsh')
isBrickRiver = (dfx['material'] == 'brick') & (dfx['waterbody'] == 'river')
isAverage = dfx['mosscat'] == 'average'
dfy = dfx[(isStoneMarsh | isBrickRiver) & isAverage]
I lifted isAverage up a level so it's only evaluated once.
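A minimal runnable sketch of the same pattern on a small made-up frame (the sample rows are assumptions; only the column names come from the question). Each comparison sits in its own parentheses because & and | bind more tightly than ==.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
dfx = pd.DataFrame({
    "material":  ["stone", "brick", "stone", "brick", "wood"],
    "waterbody": ["marsh", "river", "river", "river", "marsh"],
    "mosscat":   ["average", "average", "average", "heavy", "average"],
})

# One boolean Series per criterion; parentheses around every comparison.
is_stone_marsh = (dfx["material"] == "stone") & (dfx["waterbody"] == "marsh")
is_brick_river = (dfx["material"] == "brick") & (dfx["waterbody"] == "river")
is_average = dfx["mosscat"] == "average"

# Either material/waterbody pair, and in both cases an average moss category.
dfy = dfx[(is_stone_marsh | is_brick_river) & is_average]
print(dfy)
```

Naming each mask keeps the nesting visible: the OR joins the two pairs, and the shared condition is applied once on the outside.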

Related

how to make changes to a existing column based on multiple conditions in python csv

So I am working on data processing and I want to make changes to a column called "temp_coil" based on the condition in other columns.
This is what I wrote, and it gives me an error.
df.temp_coil[df.outdoor_temperature < 20 & df.cooling_state == 1 & df.temp_coil == 0] = 4500
I want the answer to be 4500 if outdoor_temperature column is less than 20, cooling_state column is 1 and temp_coil column itself is 0. All the conditions need to be met.
When I run it, I get a type error and a value error.
If this is already answered please let me know, I couldn't find an example that matched my problem.
You must wrap each comparison in parentheses, because & binds more tightly than < and ==. Without them, Python evaluates 20 & df.cooling_state first. For your example, the correct code would be:
df.temp_coil[(df.outdoor_temperature < 20) & (df.cooling_state == 1) & (df.temp_coil == 0)] = 4500
Or:
df.temp_coil[df.outdoor_temperature.lt(20) & df.cooling_state.eq(1) & df.temp_coil.eq(0)] = 4500
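Both forms above assign through chained indexing (df.temp_coil[...] = ...), which can raise SettingWithCopyWarning and, as far as I know, does not update the frame at all when pandas' Copy-on-Write mode is enabled. A sketch of the same update as a single .loc call, on made-up sample values:

```python
import pandas as pd

# Hypothetical sample values; only the column names come from the question.
df = pd.DataFrame({
    "outdoor_temperature": [15, 25, 10, 18],
    "cooling_state":       [1, 1, 0, 1],
    "temp_coil":           [0, 0, 0, 300],
})

# All three conditions must hold; each comparison is parenthesized.
mask = (
    (df["outdoor_temperature"] < 20)
    & (df["cooling_state"] == 1)
    & (df["temp_coil"] == 0)
)

# Select rows and column in one indexing step, then assign.
df.loc[mask, "temp_coil"] = 4500
print(df)
```

Selecting the rows and the column in one .loc call means the assignment always targets the original frame rather than a temporary object.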

Define new column based on matching values between multiple columns in two dataframes

I'm currently trying to define a class label for a dataset I'm building. I have two different datasets that I need to consult, with df_port_call being the one that will ultimately contain the class label.
The conditions in the if statements need to be satisfied for the row to receive a class label of 1. Basically, if a row exists in df_deficiency that matches the if statement conditions listed below, the Class column in df_port_call should get a label of 1. But I'm not sure how to vectorize this and the loop is running very slowly (will take about 8 days to terminate). Any assistance here would be great!
df_port_call["Class"] = 0
for index, row in tqdm(df_port_call.iterrows()):
    for index_def, row_def in df_deficiency.iterrows():
        if row['MMSI'] == row_def['Primary VIN'] or row['IMO'] == row_def['Primary VIN'] or row['SHIP NAME'] == row_def['Vessel Name']:
            if row_def['Inspection Date'] >= row['ARRIVAL IN USA (UTC)'] and row_def['Inspection Date'] <= row['DEPARTURE (UTC)']:
                row['Class'] = 1
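Without input data and the expected outcome, it's difficult to answer. However, you can use something like this with np.where. Note the parentheses around the three | conditions: & binds more tightly than |, so without them the date check would apply only to the ship-name match.
import numpy as np

df_port_call['Class'] = np.where(
    (df_port_call['MMSI'].eq(df_deficiency['Primary VIN'])
     | df_port_call['IMO'].eq(df_deficiency['Primary VIN'])
     | df_port_call['SHIP NAME'].eq(df_deficiency['Vessel Name']))
    & df_deficiency['Inspection Date'].between(df_port_call['ARRIVAL IN USA (UTC)'],
                                               df_port_call['DEPARTURE (UTC)']),
    1, 0)
Adapt this to your code. It compares the two frames row by row on their index, so as written it only applies if the frames are aligned.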
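The np.where expression compares the two frames row by row, whereas the original loop asks whether any deficiency row matches a port call. One possible alternative (not the answerer's method) is to merge once per key pair and check the date window on the merged rows. This is a sketch on made-up data; the sample values, the helper column call_id, and the assumption that the key columns share a dtype are all mine.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df_port_call = pd.DataFrame({
    "MMSI": [111, 222, 333],
    "IMO": [9001, 9002, 9003],
    "SHIP NAME": ["ALPHA", "BRAVO", "CHARLIE"],
    "ARRIVAL IN USA (UTC)": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "DEPARTURE (UTC)": pd.to_datetime(["2020-01-10", "2020-02-10", "2020-03-10"]),
})
df_deficiency = pd.DataFrame({
    "Primary VIN": [111, 9002, 555],
    "Vessel Name": ["X", "Y", "CHARLIE"],
    "Inspection Date": pd.to_datetime(["2020-01-05", "2020-05-01", "2020-03-02"]),
})

# call_id (hypothetical helper column) remembers which port call each merged row came from.
calls = df_port_call.assign(call_id=df_port_call.index)

matched = set()
key_pairs = [("MMSI", "Primary VIN"), ("IMO", "Primary VIN"), ("SHIP NAME", "Vessel Name")]
for call_key, def_key in key_pairs:
    # Inner merge keeps only (port call, deficiency) pairs that agree on this key.
    pairs = calls.merge(df_deficiency, left_on=call_key, right_on=def_key, how="inner")
    in_window = pairs["Inspection Date"].between(
        pairs["ARRIVAL IN USA (UTC)"], pairs["DEPARTURE (UTC)"]
    )
    matched.update(pairs.loc[in_window, "call_id"])

# 1 if any deficiency matched on any key inside the port-call window, else 0.
df_port_call["Class"] = df_port_call.index.isin(list(matched)).astype(int)
print(df_port_call[["SHIP NAME", "Class"]])
```

Each merge replaces one branch of the or in the loop, so the work is done by three joins instead of a nested row-by-row scan.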

Pandas filter based on aggregate values

I am using the data found here: Kaggle NFL Data. I am attempting to filter the data based on the number of pass attempts per player.
I read all the data into the variable all_nfl_data. I would then like to do this:
all_pass_plays = all_nfl_data[all_nfl_data.PlayType == 'Pass']
passers_under_100 = all_pass_plays.groupby('Passer').transform('size') <= 100
I cannot figure out how to correctly filter based on the above logic. I am trying to filter for players which have less than 100 pass attempts in total. The goal is to filter the full dataframe based on this number, not just return the player names themselves. Appreciate the help :)
You can do this with isin (staying as close to your code as possible):
all_pass_plays = all_nfl_data[all_nfl_data.PlayType == 'Pass']
passers_under_100 = all_pass_plays.groupby('Passer').size() <= 100
afterfilterdf = all_nfl_data[all_nfl_data['Passer'].isin(passers_under_100[passers_under_100].index)]
An alternative in one line (this one filters all_pass_plays rather than the full frame):
passers_under_100 = all_pass_plays.groupby('Passer').filter(lambda x: x['Passer'].size <= 100)
Corresponding documentation : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html
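A small runnable sketch of both routes on made-up data. The sample rows and the threshold of 2 (standing in for 100) are assumptions so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
all_nfl_data = pd.DataFrame({
    "PlayType": ["Pass", "Pass", "Pass", "Run", "Pass", "Run"],
    "Passer":   ["A", "A", "A", None, "B", None],
})
max_attempts = 2  # stands in for the 100 used in the question

all_pass_plays = all_nfl_data[all_nfl_data["PlayType"] == "Pass"]

# Route 1: count attempts per passer, then filter the FULL frame with isin.
attempts = all_pass_plays.groupby("Passer").size()
low_volume = attempts[attempts <= max_attempts].index
filtered_all = all_nfl_data[all_nfl_data["Passer"].isin(low_volume)]

# Route 2: transform('size') gives a per-row count aligned to all_pass_plays.
per_row_count = all_pass_plays.groupby("Passer")["PlayType"].transform("size")
filtered_pass = all_pass_plays[per_row_count <= max_attempts]

print(filtered_all)
print(filtered_pass)
```

The isin route works on the full frame because it filters by passer name; the transform route yields a mask that is only aligned to the frame it was computed from.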

Pandas create column from another dataframe if certain conditions match

Hey all, I am trying to create a new column in a dataframe based on whether certain conditions are met. The end goal is to have, in one column, all rows where the condition is unoccupied, as long as the building, floor, and location match and the time is greater than the occupied time.
Sample CSV File
I tried looking at this beforehand but I don't believe that it fits what I am trying to do. Other Stack Overflow Post
Would love to get pointed into the right direction for this.
The current code that I am playing around with (I also attempted this with a loop, but I no longer have that code to post):
import numpy as np
import pandas as pd
from IPython.display import display
df = pd.read_csv("/Users/username/Desktop/test.csv")
df2 = pd.DataFrame()
df2['Location'] = df.Location
df2['Type'] = df.Type
df2['Floor'] = df.Floor
df2['Building'] = df.Building
df2['Time'] = df['Date/Time']
df2['Status'] = df['Status']
df2 = df[~df['Condition'].isin(['Unoccupied'])]
df2['Went Unoccupied'] = np.where((df2['Location']==df['Location'])&(df2['Time'] < df['Date/Time']))
The OP wants to add, to each row that has Condition == "Occupied", the time at which the location went unoccupied. The data appears to be sorted and to alternate between occupied and unoccupied. So we shift the Date/Time column back by one row into a new column, time_of_next_row, and then keep only the rows where df.Condition == "Occupied".
df["time_of_next_row"] = df["Date/Time"].shift(-1)
df_occ = df[df.Condition == "Occupied"]
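A variant sketch that also respects the building, floor, and location match from the question by shifting within each group. The sample rows are made up, and sorting by Date/Time within each location is an assumption about how the real file should be ordered.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Building": ["A", "A", "A", "A"],
    "Floor": [1, 1, 1, 1],
    "Location": ["101", "101", "102", "102"],
    "Condition": ["Occupied", "Unoccupied", "Occupied", "Unoccupied"],
    "Date/Time": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 09:30",
        "2021-01-01 08:15", "2021-01-01 10:00",
    ]),
})

keys = ["Building", "Floor", "Location"]

# Order events in time within each location, then pull the next event's
# timestamp up one row, never crossing from one location into another.
df = df.sort_values(keys + ["Date/Time"])
df["time_of_next_row"] = df.groupby(keys)["Date/Time"].shift(-1)

# Keep the occupied rows; time_of_next_row is when that location went unoccupied.
df_occ = df[df["Condition"] == "Occupied"]
print(df_occ)
```

Grouping before the shift means the last row of one location gets a missing value instead of borrowing the first timestamp of the next location.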

Python pandas loop efficient through two dataframes with different lengths

I have two dataframes with different lengths (df, df1). They share one similar label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum over these dates and add the result as a new column in the first dataframe.
I now use a loop, but it is rather slow and has to perform this operation for all 7 days in a week. Is there a way to get better performance? I tried multiple solutions but keep getting the error that I cannot use the equals sign for two dataframes with different lengths. Help would really be appreciated! Here is an example of what is working, but with rather bad performance.
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.reindex(df.collonr.unique())
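A runnable sketch of the filter-then-group approach, ending with the new column on the first frame as the question asks. The sample values and the column name aantal_colli_sum are made up, and filling missing numbers with 0 is an assumption chosen to mirror the empty-sum result of the original loop.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question's code.
df = pd.DataFrame({"collonr": [10, 20, 30]})
df1 = pd.DataFrame(
    {
        "afleverdag": [1, 1, 2, 1, 1],
        "ind_init_actie": ["N", "N", "N", "J", "N"],
        "aantal_colli": [5, 7, 100, 50, 3],
    },
    index=pd.Index([10, 10, 10, 20, 30], name="collonr"),
)
x1 = 1  # the delivery day being processed

# Filter once, outside any loop.
dftmp = df1[(df1["afleverdag"] == x1) & (df1["ind_init_actie"] == "N")]

# One grouped sum per collo number (the index of df1).
totals = dftmp["aantal_colli"].groupby(dftmp.index).sum()

# Look up each collonr in the totals; numbers with no matching rows get 0.
df["aantal_colli_sum"] = df["collonr"].map(totals).fillna(0).astype(int)
print(df)
```

For all seven days, the same idea extends by grouping on both the index and afleverdag instead of repeating the filter per day.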
