I have a pandas dataframe where I want to loop through each row of duplicate patient IDs, and grab the latest CONDITION_STOP value.
Below is a screenshot of the dataframe:
Case 1:
For example, for Patient = '90008189-5c4f-f24d-ecc4-e41919d547e1' (rows 21-22), I expect to get back CONDITION_DESCRIPTION = 'COVID-19', because the CONDITION_STOP value in row 22 is greater (later) than the CONDITION_STOP value in row 21.
Case 2:
The situation is a little different when CONDITION_STOP is 'NaT' for a Patient.
For example, for Patient = '0a47e17e-70a1-c91b-9f50-b06804878e2b' (rows 28-29), I expect to get back the row with CONDITION_DESCRIPTION = 'COVID-19' where CONDITION_STOP is 'NaT'.
Thus far, I have tried using drop_duplicates, which works in case 1 but not in case 2.
pd_covid_conditions1 = pd_covid_conditions.sort_values(by=['CONDITION_STOP'], ascending=False)\
    .drop_duplicates(subset=['PATIENT'], keep='first')
The result I get is this (expected):
However, when I try to run it for the 'NaT' scenario, the result is not as expected (I am expecting it to return CONDITION_DESCRIPTION = 'COVID-19'):
It doesn't appear that the drop_duplicates() function will work for my scenario. Can anyone advise on what I should change?
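One possible fix (a sketch, assuming a 'NaT' CONDITION_STOP means the condition is still ongoing and should therefore count as the latest record): tell sort_values to put NaT values first instead of last, so drop_duplicates keeps them.

pd_covid_conditions1 = (
    pd_covid_conditions
    .sort_values(by=['CONDITION_STOP'], ascending=False, na_position='first')  # NaT rows first
    .drop_duplicates(subset=['PATIENT'], keep='first')
)

By default sort_values uses na_position='last', which pushes NaT rows to the bottom so keep='first' discards them; with na_position='first', a patient with a NaT row keeps it (case 2), while patients with only real dates still keep the latest one (case 1).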
I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
    if row == 'x':

df['Asst #1'].apply(assortment_crunch)
my attempt doesn't really account for the need to iterate over all of the "asst" columns and how to assign that column to the newly created one.
Here's a super fast ("vectorized") one-liner:
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
.to_numpy() - converts the boolean mask DataFrame into a NumPy array, i.e. one True/False mask per row
asst_cols.columns[mask] - for each row mask, picks out the names of the columns marked True, which ', '.join then glues into one string
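For instance, on a small made-up frame (the SKU labels and the three Asst # columns below are purely illustrative):

import pandas as pd

df = pd.DataFrame({
    'SKU': ['A', 'B'],
    'Asst #1': ['x', ''],
    'Asst #2': ['', 'x'],
    'Asst #3': ['x', 'x'],
})

asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
print(df['All Assortment'].tolist())  # ['Asst #1, Asst #3', 'Asst #2, Asst #3']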
I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        # append the column *name* when this row is marked with an 'x'
        if row[column_name] == 'x':
            assortments_in_row.append(column_name)
    return ", ".join(assortments_in_row)

# axis=1 so the function receives each row rather than each column
df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)
I think this will work though I haven't tested it.
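As a quick sanity check of this version on made-up data (two hypothetical SKUs; the dict comprehension just creates all 50 Asst columns empty first), reusing make_all_assortments_cell from above:

import pandas as pd

df = pd.DataFrame({'SKU': ['A', 'B'],
                   **{f'Asst #{i}': ['', ''] for i in range(1, 51)}})
df.loc[0, ['Asst #1', 'Asst #3']] = 'x'
df.loc[1, 'Asst #2'] = 'x'

df['All Assortments'] = df.apply(make_all_assortments_cell, axis=1)
print(df['All Assortments'].tolist())  # ['Asst #1, Asst #3', 'Asst #2']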
I have a dataframe with 4 columns: "Date" (in string format), "Hour" (in string format), "Energia_Attiva_Ingresso_Delta" and "Energia_Attiva_Uscita_Delta".
Obviously, for every date there are multiple hours. I'd like to calculate a column for the overall dataframe, but on a daily basis: the function's calculation must be carried out separately for every single date.
So, I thought to iterate over the values of the date column and filter the dataframe with .loc, then pass the filtered df to the function. In the function I have to re-filter the df with .loc (for the purpose of the calculation).
Here's the code I wrote. As you can see, in the function I need to operate iteratively on the row with the maximum value of 'Energia_Ingresso_Delta'; to do so I use the .loc function again:
# function
def optimize(df):
    min_index = np.argmin(df.Margine)
    max_index = np.argmax(df.Margine)
    Energia_Prelevata_Da_Rete = df[df.Margine < 0]['Margine'].sum().round(1)
    Energia_In_Eccesso = df[df.Margine > 0]['Margine'].sum().round(1)
    carico_medio = (Energia_In_Eccesso / df[df['Margine'] < 0]['Margine'].count()).round(1)
    while Energia_In_Eccesso != 0:
        max_index = np.argmax(df.Energia_Ingresso_Delta)
        df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] = df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] + carico_medio
        Energia_In_Eccesso = (Energia_In_Eccesso - carico_medio).round(1)
# Call the function with a "partial dataframe". The dataframe is called "prova"
for items in list(prova.Data.unique()):
    optimize(prova.loc[[items]])
But I keep getting this error:
"None of [Index(['2021-05-01'], dtype='object')] are in the [index]"
Can someone help me? :)
Thanks in advance
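A likely cause (an assumption, since only the error message is shown): prova.loc[[items]] looks '2021-05-01' up in the row index rather than in the Data column, hence "None of [...] are in the [index]". A minimal sketch of a fix, selecting each day's rows with a boolean mask instead:

for items in prova.Data.unique():
    optimize(prova.loc[prova.Data == items].reset_index(drop=True))

The reset_index(drop=True) matters because np.argmax returns a positional index while df.loc looks up labels; on a filtered slice that keeps its original labels, the two would not line up.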
Consider a function (apply_this_function) that will be applied to a dataframe:
# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}

# create a throw-away dataframe
df_throwaway = df.copy()

def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row

df_throwaway.apply(apply_this_function, axis=1)  # axis=1 is important to use the row itself
In df_throwaway.apply(...), where does the function get the "passed_row" parameter? Or, what value is this function taking? My assumption is that, by the structure of apply(), the function takes values from row i starting at 1?
I am referring to the information obtained here.
When you apply a function to a DataFrame with axis=1, then
this function is called for each row from the source DataFrame
and by convention its parameter is called row.
In your case this function returns (from each call) the original
row (actually a Series object), with 2 new elements added.
Then the apply method collects these rows and concatenates them,
and the result is a DataFrame with 2 new columns.
You wrote takes values from row i starting at 1. I would change it to
takes values from each row.
Writing starting at 1 can lead to misunderstandings, since when your
DataFrame has a default index, its values start from 0 (not from 1).
In addition, I would like to propose 2 corrections to your code:
Create your DataFrame by passing data (your code sample does not
contain the creation of df):
df_throwaway = pd.DataFrame(data)
Define your function as:
def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row
i.e.:
name the parameter just row (everybody knows that this row
has been passed by the apply method),
instead of datetime.datetime.now() use pd.Timestamp.now(),
i.e. a native pandasonic type and its method.
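Putting both corrections together, a minimal runnable sketch:

import pandas as pd

data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
df_throwaway = pd.DataFrame(data)

def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row

# each call receives one row as a Series; apply stitches the returned rows back into a DataFrame
df_throwaway = df_throwaway.apply(apply_this_function, axis=1)
print(df_throwaway)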
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
What should I do?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0
SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]

for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
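Note the design difference from the original attempt: this version mutates the existing DataFrames in place (column assignment plus drop(..., inplace=True)) instead of rebinding Sample[1] to the new objects returned by assign and query, which is why the changes persist on my_df1_Group and my_df2_Group outside the loop.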