Beginner Python: how to update a function to run multiple arguments through it

I have created a function that builds a pandas DataFrame with a new column combining the first/middle/last name of an employee. I am then calling the function based on the index (EmployeeID). I am able to run this function successfully for one employee, but I am having trouble updating the function so it can run multiple EmployeeIDs at once. Let's say I wanted to run 3 employee IDs through the function. How would I update this function to allow for that?
def getFullName(EmpID):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    if EmpID in df.index:
        rec = df.loc[EmpID, 'EmployeeName']
        print(rec)
    else:
        print("UNKNOWN")

In general, if you want an argument to be able to consist of one or more records, you can use a list or tuple to represent it.
In practice for this example, because Python is dynamically typed and because the .loc indexer of pandas DataFrames can also take a list of values, you don't have to change anything. Just pass a list of employee IDs as EmpID.
Without knowing what the EmpIDs look like, it is hard to give a concrete example.
But you can try it out, by calling your function with
getFullName(EmpID)
and with
getFullName([EmpID, EmpID])
The first call should print the record once and the second call should print it twice. You can replace EmpID with any working ID (see df.index).
The documentation linked above has some minimal examples to play around with.
PS: There is a bit of danger in passing a list to .loc. If you pass an EmpID that does not exist, pandas currently only gives a warning (in future versions it will raise a KeyError). For any unknown EmpID it will create a new row in the result with NaNs as values. From the documentation example:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df.loc[['viper', 'sidewinder']]
Will return
            max_speed  shield
viper               4       5
sidewinder          7       8
Calling it with missing indices:
print(df.loc[['viper', 'does not exist']])
Will produce
                max_speed  shield
viper                 4.0     5.0
does not exist        NaN     NaN

You could collect the EmpIDs in a list.
empID_list = [empID01, empID02, empID03]
Then you would need to use a for loop:
for empID in empID_list:
    doStuff()
Or you can just call your function inside the for loop:
for empID in empID_list:
    getFullName(empID)

Let's say you have this list of employee IDs:
empIDs = [empID1, empID2, empID3]
You need to then pass this list as an argument instead of a single employee ID.
def getFullName(empIDs):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    for EmpID in empIDs:
        if EmpID in df.index:
            rec = df.loc[EmpID, 'EmployeeName']
            print(rec)
        else:
            print("UNKNOWN")

One way or another, the if EmpID in df.index: check will need to be rewritten. I suggest you pass a list called employee_ids as the input, then do the following (the first two lines wrap a single ID in a list, which is only needed if you still want to be able to pass a single ID):
if not isinstance(employee_ids, list):
    employee_ids = [employee_ids]  # this ensures you can still pass single IDs
rec = df.reindex(employee_ids).EmployeeName.dropna()
In the old days, df.loc would accept missing labels (filling the result with NaNs), but in recent versions it raises an error. reindex will give you a row for every ID in employee_ids, with NaN as the value if the ID wasn't in the index. We therefore select the column EmployeeName and then drop the missing values with dropna.
Now, the only thing left to do is handle the output. The result has a (boolean) attribute called empty, which can be used to check whether any IDs were found. Otherwise we'll want to print the values of rec, which is a Series.
Thus:
def getFullName(employee_ids):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    if not isinstance(employee_ids, list):
        employee_ids = [employee_ids]  # this ensures you can still pass single IDs
    rec = df.reindex(employee_ids).EmployeeName.dropna()
    if rec.empty:
        print("UNKNOWN")
    else:
        print(rec.values)
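For illustration, here is how the function above might then be called; the IDs below are hypothetical and would need to be replaced with values that actually appear in Employees.xls:
getFullName(1001)                # a single ID still works thanks to the isinstance check
getFullName([1001, 1002, 9999])  # prints the names found; an unknown ID like 9999 is dropped by dropna
getFullName([9999])              # nothing found at all, so this prints UNKNOWN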
(as an aside, you may like to know that a Python convention is to use snake_case for function and variable names and CamelCase for class names)


Pandas Apply function referencing column name

I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
    if row == 'x':
        ...

df['Asst #1'].apply(assortment_crunch)
My attempt doesn't really account for the need to iterate over all of the "Asst" columns, or for how to assign the result to the newly created column.
Here's a super fast ("vectorized") one-liner:
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
to_numpy() - converts the boolean mask DataFrame into a numpy array, so iterating over it yields one row mask at a time
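As a minimal, self-contained sketch of the same idea (the data below is made up, with only two assortment columns instead of fifty):
import pandas as pd

df = pd.DataFrame({
    'SKU': ['A', 'B', 'C'],
    'Asst #1': ['x', None, 'x'],
    'Asst #2': [None, 'x', 'x'],
})

asst_cols = df.filter(like='Asst #')                      # just the assortment columns
df['All Assortment'] = [', '.join(asst_cols.columns[mask])
                        for mask in asst_cols.eq('x').to_numpy()]
print(df['All Assortment'].tolist())
# ['Asst #1', 'Asst #2', 'Asst #1, Asst #2']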
I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        if row[column_name] == 'x':                 # row[column_name] is a single cell here
            assortments_in_row.append(column_name)  # keep the assortment column's name
    return ", ".join(assortments_in_row)

df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)  # axis=1 passes one row at a time
I think this will work though I haven't tested it.

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows (to be defined), as well as the percentage this quantity makes up in the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 such that df[filter_f1] gives back df unchanged, and that could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object that spans everything, i.e. slice(None), acts as an object that selects all indexes in an indexable object (it is what a bare : turns into under the hood). So df[slice(None)] selects all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
...  # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
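A quick, self-contained check of this behaviour (the data is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})
select_all = slice(None)              # selects the same rows as df[:]
print(df[select_all].equals(df))      # True

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))

my_function(df, select_all, 'col')    # prints 6 and 1.0 -- nothing was filtered out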
This is a way to select all rows:
df[range(0, len(df))]
So is
df[:]
But I haven't figured out a way to pass : as an argument.
There's an indexer called loc on pandas DataFrames that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
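For example, with some made-up data the pattern looks like this:
import pandas as pd

df = pd.DataFrame({'name': ['Brian', 'Anna'], 'price': [600, 300]})

df2 = df.loc[df['price'] > 500]   # keeps only the rows where the condition is True
total = df2['price'].sum()
print(total)                      # 600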

Find if a column in a dataframe has neither NaN nor None

I have gone through all posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help in writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
Also it seems isna is not invoked with any argument and returns a dataframe of boolean values (see link). You can iterate through both dataframes to determine whether each value is valid.
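A quick sanity check of is_valid (assuming the definition above and import math; the timestamp string is hypothetical):
print(is_valid(None))                    # False
print(is_valid(float('nan')))            # False
print(is_valid('2021-03-01T10:15:00Z'))  # True -- math.isnan raises TypeError on strings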
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
You should also be able to apply the check to the entire column at once (e.g. with notna) instead of looping row by row.
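As a sketch of that vectorized idea (the DataFrame and timestamp format below are hypothetical stand-ins for df_with_job_name):
import re
from datetime import datetime

import numpy as np
import pandas as pd

df_with_job_name = pd.DataFrame({'startTime': ['2021-03-01T10:15:00Z', None, np.nan]})

valid = df_with_job_name['startTime'].notna()            # False for both None and NaN
for start_time in df_with_job_name.loc[valid, 'startTime']:
    start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
    print(start_time_formatted)                          # 2021-03-01 10:15:00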

Drop Pandas DataFrame rows according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you are wanting to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion from the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset your initial data and use boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)

Why do I get a Series inside an apply/assign function in pandas? I want to use each value to look up a dict

I have a dict of countries and population:
population_dict = {"Germany": 1111, .... }
In my df (sort_countries) I have a column called 'country' and I want to add another column called 'population' from the dictionary above (matching 'country' with 'population'):
population_df = sort_countries.assign(
    population=lambda x: population_dict[x["country"]], axis=1)
population_df.head()
which gives the error: TypeError: 'Series' objects are mutable, thus they cannot be hashed.
Why is x["country"] a Series when I would imagine it should return just the name of the country.
This bit of pandas always confuses me. In my lambdas I would expect x to be a row and I just select the country from that row. Instead len(x["country"]) gives me 192 (the number of my countries, the whole Series).
How else can I match them using lambdas and not a separate function?
Note that x["country"] is a Series, albeit a single element one, this cannot be used to index the dictionary. If you want just the value associated with it, use x["country"].item().
However, a better approach tailor made for this kind of thing is using df.map:
population_df["population"] = population_df["country"].map(population_dict)
map will automatically take each value in population_df["country"] as a key and look up the corresponding value in population_dict.
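A small self-contained illustration (the numbers are made up):
import pandas as pd

population_dict = {"Germany": 1111, "France": 2222}
population_df = pd.DataFrame({"country": ["Germany", "France", "Spain"]})

population_df["population"] = population_df["country"].map(population_dict)
print(population_df)
#   country  population
# 0  Germany      1111.0
# 1   France      2222.0
# 2    Spain         NaN   <- countries missing from the dict become NaN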
Also:
population_df["population"] = population_df.apply(lambda x: population_dict[x["country"]], axis=1)
works.
Or:
population_df["population"] = population_df[["country"]].applymap(lambda x: population_dict[x])
