Pandas Apply function referencing column name - python

I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
After one Liner:
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
if row == 'x':
df['Asst #1'].apply(assortment_crunch):
My attempt doesn't account for the need to iterate over all of the "Asst" columns, or for how to assign the matching column names to the newly created column.

Here's a fast approach (the comparison against 'x' is vectorized; only the final join loops over rows):
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
to_numpy() - converts the boolean mask DataFrame into a 2D NumPy array; iterating over it yields one boolean mask per row
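To see the masking step in action, here is a minimal sketch on made-up data. The SKU values and the three assortment columns are assumptions for illustration only; the real frame has 50 such columns.

```python
import pandas as pd

# Hypothetical sample data: three SKUs and three assortment columns
df = pd.DataFrame({
    "SKU": ["A1", "B2", "C3"],
    "Asst #1": ["x", None, None],
    "Asst #2": ["x", "x", None],
    "Asst #3": [None, "x", None],
})

# Keep only the assortment columns
asst_cols = df.filter(like="Asst #")

# 2D boolean array: one row of True/False flags per SKU
mask = asst_cols.eq("x").to_numpy()

# Use each row mask to pick the matching column names and join them
df["All Assortment"] = [", ".join(asst_cols.columns[row_mask]) for row_mask in mask]

print(df[["SKU", "All Assortment"]])
```

A SKU with no 'x' in any column ends up with an empty string rather than NaN.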

I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    # Collect the names of the assortment columns marked with 'x' in this row
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        if row[column_name] == 'x':
            assortments_in_row.append(column_name)
    return ", ".join(assortments_in_row)
# axis=1 passes each row (not each column) to the function
df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)
I think this will work though I haven't tested it.

Related

Conditional Strip / Replace based on length of string

I need to remove the space from a Dataframe of UK postcodes, but only those that contain seven characters.
      Client Postcode        lat      long
4             CF1 1DA  51.479690 -3.182190
42640         CF951AF  51.481196 -3.171039
Is it possible to add a len() element to:
df['Client Postcode'] = df1['Client Postcode'].str.replace(" ","")
Here are two ways to conditionally change or create a new column:
First, numpy.where -
this function lets you return value x or y depending on a condition. In your case, return either the original postcode or the postcode without " " depending on the number of characters.
import numpy as np
condition = df1['Client Postcode'].str.len() == 7
df1['Client Postcode Clean'] = np.where(condition, df1['Client Postcode'].str.replace(" ", ""), df1['Client Postcode'])
You can use this method to either create a new column (like I did above) or change the original column.
Another way would be to use pandas slicing. You can use the loc accessor to find the rows you want to change and overwrite them.
condition = df1['Client Postcode'].str.len()==7
df1.loc[condition, 'Client Postcode'] = df1.loc[condition, 'Client Postcode'].str.replace(" ","")
This method is harder to use to create a new column as it will return NaNs for rows that do not satisfy the condition.
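Both approaches can be compared side by side in a small sketch. The third postcode and the column names "Via where" and "Via loc" are assumptions added for illustration; they are not in the original data.

```python
import numpy as np
import pandas as pd

# Two postcodes from the question plus a hypothetical eight-character one
df1 = pd.DataFrame({"Client Postcode": ["CF1 1DA", "CF951AF", "CF10 3AT"]})

# True where the postcode is exactly seven characters long
is_seven = df1["Client Postcode"].str.len() == 7

# Space-free version of every postcode (literal replacement, no regex)
stripped = df1["Client Postcode"].str.replace(" ", "", regex=False)

# Approach 1: np.where picks the stripped value only where the condition holds
df1["Via where"] = np.where(is_seven, stripped, df1["Client Postcode"])

# Approach 2: copy the column first, then overwrite only the matching rows
df1["Via loc"] = df1["Client Postcode"]
df1.loc[is_seven, "Via loc"] = stripped[is_seven]

print(df1)
```

Copying the column before using loc is one way to avoid the NaNs mentioned above when the result should go into a new column.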
Just to offer up one more alternative, one could iterate through the dataframe and scrub the post code as in the following code snippet.
import pandas as pd
df = pd.DataFrame([['CF1 1DA', 51.479690, -3.182190], ['CF951AF', 51.481196, -3.171039]], columns=['Client Postcode', 'Lat.', 'Long.'])
# Visit every row and strip the space only from seven-character postcodes
for i in df.index:
    if len(df.at[i, 'Client Postcode']) == 7:
        df.at[i, 'Client Postcode'] = df.at[i, 'Client Postcode'].replace(" ", "")
print(df)
Hope that helps.
Regards.

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
'header0' : [55,12,13,14,15],
'header1' : [21,22,23,24,25],
'header2' : [31,32,55,34,35],
'header3' : [41,42,43,44,45],
'header4' : [51,52,53,54,33]
}
index_list = {
0:'index0',
1:'index1',
2:'index2',
3:'index3',
4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So for eg, if I want the values of 55, this code will return: header0, index0, header2, index2 in some format. They could be list or tuple or print, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which would return a tuple of the column and index, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
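As a self-contained sketch of the np.where approach, the snippet below rebuilds the question's DataFrame and wraps the lookup in a small helper. The function name `locate` is hypothetical and only used here for illustration.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question, with the same index labels
df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=[f"index{i}" for i in range(5)],
)


def locate(frame, value):
    """Return (index label, column label) pairs for every cell equal to value."""
    # np.where on a 2D boolean array returns the row and column positions
    rows, cols = np.where(frame.eq(value).to_numpy())
    # Translate positions back into labels
    return list(zip(frame.index[rows], frame.columns[cols]))


matches = locate(df, 55)
print(matches)
```

A value that does not occur anywhere simply yields an empty list.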
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
To test for a specific value instead, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
The code below does iterate, but only over the columns rather than the whole DataFrame. For each column it uses .any() to check whether the desired value is present, then uses .loc to select the matching rows and prints their index labels.
wanted_value = 55
for col in df.columns:
    matches = df[col].eq(wanted_value)
    if matches.any():
        print("row:", *df.loc[matches].index, ' col', col)

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
                                                #with Sample[1] in the for doesn't work
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it neither adds the column Bad_Row to my df nor modifies my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see whether any of the 'Value' entries in your input data meet your specified criterion.
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
        return 1
    else:
        return 0
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
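As an alternative to the row-wise apply, the same filtering can be sketched without a helper function by computing each group's maximum once and keeping only the groups at or below the threshold. This is a different technique from the answers above, shown under the assumption that "bad" means "contains a Value greater than my_max"; the names `group_max`, `good_groups` and `cleaned` are made up for the example.

```python
import pandas as pd

my_df1 = pd.DataFrame(
    [[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]],
    columns=["Group", "Value"],
)
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=["Group", "Group_Value"])
my_max = 16

# Largest Value per group in the element-level data
group_max = my_df1.groupby("Group")["Value"].max()

# Groups whose largest Value does not exceed the threshold
good_groups = group_max[group_max <= my_max].index

# Keep only those groups in the aggregated frame (returns a new DataFrame)
cleaned = my_df1_Group[my_df1_Group["Group"].isin(good_groups)]

print(cleaned)
```

Because this returns a new DataFrame instead of modifying one in place, the result has to be assigned back (for example into a list or dict of frames) when it is used inside a loop.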

How to search for specific text within a Pandas dataframe column?

I want to identify all rows in my Pandas DataFrame (loaded from a CSV file) where the 'Notes' column mentions the word 'exercise'. Once those rows are identified, I want to create a new column called 'ExerciseDay' that holds a 1 if the condition was met and a 0 if it was not. I am having trouble because the 'Notes' column can contain long strings (e.g. 'Exercise, Morning Workout,Alcohol Consumed, Coffee Consumed') and I still want to detect 'exercise' even when it appears inside a longer string.
I tried the function below in order to identify all text that contains the word 'exercise' in the 'Notes' column. No rows are selected when I use this function and I know it is likely because of the * operator but I want to show the logic. There is probably a much more efficient way to do this but I am still relatively new to programming and python.
def IdentifyExercise(row):
    if row['Notes'] == '*exercise*':
        return 1
    elif row['Notes'] != '*exercise*':
        return 0
JoinedTables['ExerciseDay'] = JoinedTables.apply(lambda row : IdentifyExercise(row), axis=1)
Convert boolean Series created by str.contains to int by astype:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise').astype(int)
For a case-insensitive match:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise', case=False).astype(int)
You can also use np.where:
JoinedTables['ExerciseDay'] = \
    np.where(JoinedTables['Notes'].str.contains('exercise'), 1, 0)
Another way would be:
JoinedTables['ExerciseDay'] =[1 if "exercise" in x else 0 for x in JoinedTables['Notes']]
(Probably not the fastest solution)
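Here is a small runnable sketch of the str.contains approach. The sample notes are made up, and na=False is an added assumption so that rows with a missing note count as 0 instead of raising an error in astype(int).

```python
import pandas as pd

# Hypothetical notes, including a missing value and mixed capitalization
JoinedTables = pd.DataFrame({
    "Notes": [
        "Exercise, Morning Workout, Coffee Consumed",
        "Alcohol Consumed",
        None,
        "light exercise",
    ]
})

# case=False ignores capitalization; na=False treats missing notes as no match
JoinedTables["ExerciseDay"] = (
    JoinedTables["Notes"]
    .str.contains("exercise", case=False, na=False)
    .astype(int)
)

print(JoinedTables)
```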

For Looping error in pyspark

I am facing the following problem:
I have a list which I need to compare with the elements of a column in a dataframe(acc_name). I am using the following looping function but it only returns me 1 record when it should provide me 30.
Using Pyspark
bs_list = ['AC_E11','AC_E12','AC_E13','AC_E135','AC_E14','AC_E15','AC_E155','AC_E157',
           'AC_E16','AC_E163','AC_E165','AC_E17','AC_E175','AC_E180','AC_E185', 'AC_E215','AC_E22','AC_E225','AC_E23','AC_E23112','AC_E235','AC_E245','AC_E258','AC_E25','AC_E26','AC_E265','AC_E27','AC_E275','AC_E31','AC_E39','AC_E29']
for i in bs_list:
    bs_acc1 = (acc\
        .filter(i == acc.acc_name)
        .select(acc.acc_name,acc.acc_description)
        )
The elements of bs_list are a subset of the acc_name column. I am trying to create a new DataFrame with two columns, acc_name and acc_description, containing only the rows whose acc_name appears in bs_list.
Please let me know where I am going wrong?
That's because every time the loop filters on i, it creates a new DataFrame and rebinds bs_acc1 to it. So at the end you only see the row for the last value in bs_list, i.e. the row for 'AC_E29'.
One way to fix it is to repeatedly union the result with itself, so that previous results also remain in the DataFrame, like this:
# create an empty dataframe; supply a schema appropriate to your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name,acc.acc_description)
    )
A better way is to avoid the loop altogether:
from pyspark.sql.functions import *
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
You can also transform bs_list to dataframe with column acc_name and then just do join to acc dataframe.
from pyspark.sql import Row
bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(**{'acc_name':x})).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
