I have these 4 functions that I use to modify a dataframe in place (intentionally without returning anything).
The first 3 functions work perfectly fine: the dataframe gets modified according to the function. The 4th function (drop_na), however, doesn't seem to work.
It's supposed to drop all rows with NA in the specified column, but the dataframe is unchanged after the call, and no error is raised when I run the function. Any ideas why this happens and how to fix it (without return, if possible)?
Thanks!
def composite_key(dframe, new_key, key1, key2):
    dframe[new_key] = dframe[key1]+"-"+dframe[key2].astype(str)
def drop_col(dframe, colnames):
    dframe = dframe.drop_duplicates(subset=colnames, keep='first')
def split_column(dframe, arg: list):
    dframe[arg[0]] = dframe[arg[1]].str.split(',', n=-1, expand=True).loc[:, :(len(arg[0])-1)]
def drop_na(dframe, colname):
    dframe = dframe.loc[dframe[colname].notna()]
Usually, to drop NA values in a specific column, you use dropna with the subset argument:
df.dropna(subset=['col_name'])
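A likely reason the original drop_na has no visible effect is that `dframe = ...` inside the function only rebinds the local name, so the caller's dataframe is never touched. The following is a minimal sketch, assuming an in-place modification is acceptable; the sample data is made up for illustration.

```python
import pandas as pd

def drop_na(dframe, colname):
    # inplace=True mutates the object the caller passed in,
    # instead of rebinding the local name to a new dataframe
    dframe.dropna(subset=[colname], inplace=True)

# Hypothetical sample data
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})
drop_na(df, "a")
remaining_index = df.index.tolist()
```

The same reasoning applies to any function that assigns a new object to its parameter name: either mutate the object in place or return the new object and reassign it at the call site.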
I have a dataframe with 8 columns. I would like to turn the code below (I tested that it works on a single column) into a function that I can map/apply over all 8 columns.
all_adj_noun = []
for i in range(len(bigram_df)):
    if len([bigram_df['adj_noun'][i]]) >= 1:
        for j in range(len(bigram_df['adj_noun'][i])):
            all_adj_noun.append(bigram_df['adj_noun'][i][j])
However, when I tried to define it as a function, the code returns an empty result even though the column is not empty.
def combine_bigrams(df_name, col_name):
    all_bigrams = []
    for i in range(len(df_name)):
        if len([df_name[col_name][i]]) >= 1:
            for j in range(len(df_name[col_name][i])):
                return all_bigrams.append(df_name[col_name][i][j])
I call the function with
combine_bigrams(bigram_df, 'adj_noun')
Is there anything I am doing wrong here?
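The problem is that you are returning the result of .append, which is None. The return statement also exits the function during the first iteration that reaches it, so the loop never gets to the remaining rows.
However, there is a more concise way to do this. To get a list with all the values present in the column, you can leverage Series.agg: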
col_name = 'adj_noun'
all_bigrams = bigram_df[col_name].agg(sum)
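If you prefer to keep a function, here is a minimal sketch that returns the list once after collecting everything. It uses itertools.chain.from_iterable rather than agg(sum), which is a substitution on my part, and the sample dataframe is made up for illustration.

```python
from itertools import chain

import pandas as pd

def combine_bigrams(df_name, col_name):
    # Flatten the per-row lists into one list and return it once, at the end
    return list(chain.from_iterable(df_name[col_name]))

# Hypothetical sample data: each cell holds a list of (adjective, noun) tuples
bigram_df = pd.DataFrame(
    {"adj_noun": [[("big", "dog")], [], [("red", "car"), ("old", "house")]]}
)
all_bigrams = combine_bigrams(bigram_df, "adj_noun")
```

Empty lists contribute nothing, so no length check is needed before flattening.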
I am trying to apply the following code (minimal example) to my DataFrame with 2 million rows, but for some reason .apply passes more than one row to the function, which breaks my code. I'm not sure what changed, but the code did run before.
def function(row):
    return [row[clm1], row[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function,axis=1)
Does anyone have an idea, or has anyone had a similar issue?
Important: without swifter everything works fine, but it is too slow due to the number of rows.
This should work:
def function(row_different_name):
    return [row_different_name[clm1], row_different_name[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function,axis=1)
Try changing the name of the function parameter row to some other name.
Based on this previous answer, what you are trying to do should work if you change it like this:
def function(row):
    return [row.swifter[clm1], row.swifter[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.apply(function, axis=1, result_type='expand')
This is because apply on a column (a Series) lacks the result_type argument, while apply on a DataFrame has it.
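To illustrate what result_type='expand' does, here is a minimal sketch with plain pandas (no swifter). The sample frame is made up, and clm1/clm2 are assumed to be column labels.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"clm1": [1, 2, 3], "clm2": [10, 20, 30], "other": [0, 0, 0]})

def function(row):
    # Each call receives one row as a Series indexed by column label
    return [row["clm1"], row["clm2"]]

# 'expand' turns each returned list into separate columns (labelled 0 and 1)
expanded = df.apply(function, axis=1, result_type="expand")
expanded.columns = ["clm1", "clm2"]
```

Without result_type='expand', the same call would give a single Series whose values are lists.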
axis=1 applies the function to each row, so every call receives one row with the column labels as its index. Is that what you want? If not, try removing axis=1.
I have created a function that builds a pandas dataframe with a new column combining the first/middle/last name of an employee. I then call the function with a value from the dataframe index (EmployeeID). I am able to run this function successfully for one employee, but I am having trouble updating it to handle multiple EmployeeIDs at once. Let's say I wanted to run 3 employee IDs through the function. How would I update this function to allow for that?
def getFullName(EmpID):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'] ,na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    if EmpID in df.index:
        rec=df.loc[EmpID,'EmployeeName']
        print(rec)
    else:
        print("UNKNOWN")
In general, if you want an argument to be able to consist of one or more records, you can use a list or tuple to represent it.
In practice, for this example, because Python is dynamically typed and because the .loc indexer of pandas DataFrames also accepts a list of labels, you don't have to change anything. Just pass a list of employee IDs as EmpID.
Without knowing what the EmpIDs look like, it is hard to give an example.
But you can try it out, by calling your function with
getFullName(EmpID)
and with
getFullName([EmpID, EmpID])
The first call should print the record once and the second call should print the record twice. You can replace EmpID with any working ID (see df.index).
The documentation I linked above has some minimal examples to play around with.
PS: There is a bit of danger in passing a list to .loc. In older pandas versions, passing an EmpID that does not exist only triggers a warning, and the result gets a new row with NaNs as values for each unknown EmpID. Since pandas 1.0, a KeyError is raised instead. From the documentation example:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df.loc[['viper', 'sidewinder']]
Will return
            max_speed  shield
viper               4       5
sidewinder          7       8
Calling it with missing index labels:
print(df.loc[['viper', 'does not exist']])
In older pandas versions this produces
                max_speed  shield
viper                 4.0     5.0
does not exist        NaN     NaN
You could put the EmpIDs in a list.
empID_list = [empID01, empID02, empID03]
Then you would need to use a for loop:
for empID in empID_list:
    doStuff()
Or just call your function inside the for loop.
for empID in empID_list:
    getFullName(empID)
Let's say you have this list of employee IDs:
empIDs = [empID1, empID2, empID3]
You need to then pass this list as an argument instead of a single employee ID.
def getFullName(empIDs):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'] ,na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    for EmpID in empIDs:
        if EmpID in df.index:
            rec=df.loc[EmpID,'EmployeeName']
            print(rec)
        else:
            print("UNKNOWN")
One way or another the if EmpID in df.index: will need to be rewritten. I suggest you pass a list called employee_ids as the input, then do the following (the first two lines are to wrap a single ID in a list, it is only needed if you still want to be able to pass a single ID):
if not isinstance(employee_ids, list):
    employee_ids = [employee_ids] # this ensures you can still pass single IDs
rec=df.reindex(employee_ids).EmployeeName.dropna()
In the old days, df.loc would accept missing labels and just not return anything, but in recent versions it raises an error. reindex will give you a row for every ID in employee_ids, with NaN as the value if the ID wasn't in the index. We therefore select the column EmployeeName and then drop the missing values with dropna.
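As a small illustration of the reindex behaviour described above, here is a sketch with made-up data. The IDs and names are hypothetical, and the `missing` list is an extra I added to show how unknown IDs could be reported.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe read from Employees.xls
df = pd.DataFrame(
    {"EmployeeName": ["Doe, Jane A", "Roe, Richard B"]},
    index=pd.Index([101, 102], name="EmployeeID"),
)

employee_ids = [101, 999]  # 999 is not in the index

# reindex returns one row per requested ID; unknown IDs get NaN,
# which dropna then removes
rec = df.reindex(employee_ids).EmployeeName.dropna()
found = rec.tolist()

# IDs that were requested but are absent from the index
missing = [emp_id for emp_id in employee_ids if emp_id not in df.index]
```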
Now, the only thing left to do is handle the output. A Series has a (boolean) attribute called empty, which can be used to check whether any IDs were found. If it is not empty, we'll want to print the values of rec, which is a Series.
Thus:
def getFullName(employee_ids):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'] ,na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    if not isinstance(employee_ids, list):
        employee_ids = [employee_ids] # this ensures you can still pass single IDs
    rec=df.reindex(employee_ids).EmployeeName.dropna()
    if rec.empty:
        print("UNKNOWN")
    else:
        print(rec.values)
(As an aside, you may like to know that a Python convention is to use snake_case for function and variable names and CamelCase for class names.)
I have an Excel file:
Test_Case  Value
Case_1     0.988532846
Case_2     0.829241525
Case_3     0.257209267
Case_4     0.871698313
Case_5     0.63913665
With pandas, I have seen that we can get a column like this:
import pandas as pd
myExcelFile = "data.xlsx"
readExcelFile = pd.read_excel(myExcelFile, sheet_name=0, index=0)
testCaseColumn = readExcelFile.Test_Case
result :
0 Case_1
1 Case_2
2 Case_3
3 Case_4
4 Case_5
The name of the column can change, and I would like to create a function with two arguments to get the column I want:
def getColumn(readExceFile, columnName):
    return readExceFile.columnName
I would like to know how I can use the column name passed in columnName as the attribute of my readExceFile parameter.
Thanks for your help
You can use getattr.
def getColumn(readExceFile, columnName):
    return getattr(readExceFile, columnName)
Since your_dataframe.column_name only works for column names without spaces, and you've mentioned that the column name can change, you can select the column with your_dataframe.loc[:,'column_name'] instead (see Alexander Cécile's comment).
On the other hand, if your dataset always has the same structure (n columns, the first one with some categorical data, the second one with values, etc.), then you can also select it directly by position with your_dataframe.iloc[:,0], with 0 being the position of your first column of interest.
Finally, if you really need a separate function (besides the two options I've mentioned) that returns exactly the same output, then you may use this:
def get_column(your_dataframe, column_name):
    return your_dataframe.loc[:,column_name]
... which is a highly non-Pythonic way of writing the code (see the Zen of Python).
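For a quick check of the point about spaces in column names, here is a minimal sketch with made-up data. It uses plain bracket indexing, which for a single label selects the same column as .loc[:, column_name].

```python
import pandas as pd

def get_column(dataframe, column_name):
    # Bracket indexing accepts any column label, including ones with spaces,
    # which attribute access (dataframe.column_name) cannot express
    return dataframe[column_name]

# Hypothetical sample data with a space in one column name
df = pd.DataFrame({"Test Case": ["Case_1", "Case_2"], "Value": [0.98, 0.82]})
test_cases = get_column(df, "Test Case").tolist()
```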
I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them contain None or NaN values. I need help writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time=df_with_job_name.loc[index,'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
The error that I am getting is
    if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
Of course, you'll have to import math.
Also, it seems isna is not invoked with any argument and returns a dataframe of boolean values (see link). You can iterate through both dataframes to determine whether a value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
Please check my use of row and column indices against your data; the lookup above uses the row label index and the column label 'startTime'. Also, you should be able to apply an any operation to the entire row at once.
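As a further option, the top-level pd.isna function accepts a single scalar, so the check can be done per value without building the boolean map first. This is a minimal sketch with made-up timestamps; the ISO-like string format with a trailing 'Z' is an assumption based on the regex in the question.

```python
import re
from datetime import datetime

import pandas as pd

# Hypothetical sample data: one valid timestamp, one None, one NaN
df_with_job_name = pd.DataFrame(
    {"startTime": ["2019-03-01T08:15:30Z", None, float("nan")]}
)

formatted = []
for index, row in df_with_job_name.iterrows():
    start_time = row["startTime"]
    if pd.isna(start_time):
        # pd.isna returns True for both None and NaN scalars
        formatted.append(None)
        continue
    # Split on non-digits; the trailing 'Z' leaves an empty last element,
    # which [:-1] drops before the parts are converted to int
    parts = re.split(r"[^\d]", start_time)[:-1]
    formatted.append(datetime(*map(int, parts)))
```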