How to search for specific text within a Pandas dataframe column?


I want to identify all rows in my Pandas DataFrame (loaded from a CSV file) where the 'Notes' column mentions the word 'exercise'. Once those rows are identified, I want to create a new column called 'ExerciseDay' that holds a 1 if the condition was met and a 0 if it was not. I am having trouble because the 'Notes' column can contain long strings (e.g. 'Exercise, Morning Workout, Alcohol Consumed, Coffee Consumed') and I still want to detect 'exercise' when it appears inside a longer string.
I tried the function below to identify all text that contains the word 'exercise' in the 'Notes' column. No rows are selected when I use it, and I know that is likely because of the * operator, but I want to show the logic. There is probably a much more efficient way to do this, but I am still relatively new to programming and Python.
def IdentifyExercise(row):
    if row['Notes'] == '*exercise*':
        return 1
    elif row['Notes'] != '*exercise*':
        return 0

JoinedTables['ExerciseDay'] = JoinedTables.apply(lambda row : IdentifyExercise(row), axis=1)

Convert the boolean Series created by str.contains to int with astype:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise').astype(int)
For a case-insensitive match:
JoinedTables['ExerciseDay'] = (JoinedTables['Notes'].str.contains('exercise', case=False)
                               .astype(int))
If 'Notes' has missing values, also pass na=False to str.contains so that astype(int) does not fail on them.

You can also use np.where (after import numpy as np):
JoinedTables['ExerciseDay'] = \
    np.where(JoinedTables['Notes'].str.contains('exercise'), 1, 0)

Another way would be:
JoinedTables['ExerciseDay'] =[1 if "exercise" in x else 0 for x in JoinedTables['Notes']]
(Probably not the fastest solution)
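To tie the answers above together, here is a minimal, self-contained sketch. The sample rows are hypothetical stand-ins for the asker's JoinedTables, and the na=False argument is an assumption about how missing notes should be treated (as "no exercise").

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's JoinedTables.
JoinedTables = pd.DataFrame({
    "Notes": [
        "Exercise, Morning Workout, Coffee Consumed",
        "Alcohol Consumed",
        None,                          # a day with no notes at all
        "light exercise after work",
    ]
})

# case=False ignores capitalisation; na=False counts missing notes as
# "no match", so the boolean result can be cast to int safely.
JoinedTables["ExerciseDay"] = (
    JoinedTables["Notes"]
    .str.contains("exercise", case=False, na=False)
    .astype(int)
)

print(JoinedTables["ExerciseDay"].tolist())
```

The list-comprehension variant would raise a TypeError on the missing row, because `"exercise" in None` is not a valid test; str.contains with na=False avoids that.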

Related

Conditional Strip / Replace based on length of string

I need to remove the space from a Dataframe of UK postcodes, but only those that contain seven characters.
      Client Postcode        lat       long
4             CF1 1DA  51.479690  -3.182190
42640         CF951AF  51.481196  -3.171039
Is it possible to add a len() element to:
df['Client Postcode'] = df1['Client Postcode'].str.replace(" ","")

Here are two ways to conditionally change or create a new column:
First, numpy.where: this function returns value x or y depending on a condition. In your case, it returns either the postcode without " " or the original postcode, depending on the number of characters.
condition = df1['Client Postcode'].str.len()==7
df1['Client Postcode Clean'] = np.where(condition, df1['Client Postcode'].str.replace(" ", ""), df1['Client Postcode'])
You can use this method to either create a new column (like I did above) or change the original column.
Another way would be to use pandas slicing. You can use the loc accessor to find the rows you want to change and overwrite them.
condition = df1['Client Postcode'].str.len()==7
df1.loc[condition, 'Client Postcode'] = df1.loc[condition, 'Client Postcode'].str.replace(" ","")
This method is harder to use to create a new column as it will return NaNs for rows that do not satisfy the condition.
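The two approaches above can be checked side by side on a small frame. This is only a sketch: the three postcodes are hypothetical sample values, and Series.mask is shown as an additional option that the answer itself does not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical sample postcodes (7, 7 and 8 characters long).
df1 = pd.DataFrame({"Client Postcode": ["CF1 1DA", "CF951AF", "CF10 3AT"]})

postcodes = df1["Client Postcode"]
condition = postcodes.str.len() == 7
stripped = postcodes.str.replace(" ", "", regex=False)

# np.where picks the stripped value where the condition holds,
# and the original value everywhere else.
df1["via_where"] = np.where(condition, stripped, postcodes)

# Series.mask does the same without leaving pandas:
# values are replaced only where the condition is True.
df1["via_mask"] = postcodes.mask(condition, stripped)

print(df1)
```

Only the first row changes: it is seven characters long and contains a space. The second row is seven characters but has no space to remove, and the third is eight characters so it is left alone.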

Just to offer up one more alternative, one could iterate through the dataframe and scrub each postcode individually, as in the following code snippet.
import pandas as pd
df = pd.DataFrame([['CF1 1DA', 51.479690, -3.182190], ['CF951AF', 51.481196, -3.171039]], columns=['Client Postcode', 'Lat.', 'Long.'])
# Visit every row label and rewrite only the postcodes of length 7
for i in df.index:
    if len(df.at[i, 'Client Postcode']) == 7:
        df.at[i, 'Client Postcode'] = df.at[i, 'Client Postcode'].replace(" ", "")
print(df)
Hope that helps.
Regards.

Pandas Apply function referencing column name

I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
if row == 'x':
df['Asst #1'].apply(assortment_crunch):
my attempt doesn't really account for the need to iterate over all of the "asst" columns and how to assign that column to the newly created one.

Here's a fast solution (the comparison with 'x' is vectorized; only the final join loops over the rows):
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
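to_numpy() - converts the mask dataframe into a 2-D boolean NumPy array; iterating over it yields one mask per row, which is used to pick the matching column names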
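A small worked example may make the explanation above easier to follow. The frame below is hypothetical and has only three assortment columns instead of fifty.

```python
import pandas as pd

# Hypothetical frame; the real data has Asst #1 to Asst #50.
df = pd.DataFrame({
    "SKU": ["A100", "B200", "C300"],
    "Asst #1": ["x", None, None],
    "Asst #2": ["x", "x", None],
    "Asst #3": [None, "x", None],
})

asst_cols = df.filter(like="Asst #")      # only the assortment columns
mask = asst_cols.eq("x").to_numpy()       # 2-D boolean array, one row per SKU

# Each row of the mask selects the column names that hold an "x".
df["All Assortment"] = [", ".join(asst_cols.columns[row]) for row in mask]

print(df[["SKU", "All Assortment"]])
```

A SKU with no "x" in any column ends up with an empty string.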

I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF with axis=1 to get access to each row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        # row[column_name] is a single cell, so compare it directly
        if row[column_name] == 'x':
            # store the column name, not the 'x' itself
            assortments_in_row.append(column_name)
    return ", ".join(assortments_in_row)

# axis=1 passes each row (rather than each column) to the function
df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)
I think this will work though I haven't tested it.

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So for eg, if I want the values of 55, this code will return: header0, index0, header2, index2 in some format. They could be list or tuple or print, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which would return a tuple of the column and index, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.

Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
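For reuse, the np.where approach can be wrapped in a small helper. This is a sketch: the function name locate is hypothetical, and it uses df.eq(value) rather than df == value so that a value of a different type (for example a string searched for in numeric columns) simply yields no matches.

```python
import numpy as np
import pandas as pd

def locate(df, value):
    """Return (index label, column label) pairs for every cell equal to value."""
    rows, cols = np.where(df.eq(value).to_numpy())
    return list(zip(df.index[rows], df.columns[cols]))

# The frame from the question.
df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=[f"index{i}" for i in range(5)],
)

print(locate(df, 55))   # two matches
print(locate(df, 99))   # no match gives an empty list
```

Because np.where scans the whole array at once, duplicates are all reported and no explicit Python loop over cells is needed.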

There is a function in pandas that gives duplicate rows. Note that it flags whole rows that repeat an earlier row, not the position of a particular value.
duplicate = df[df.duplicated()]
print(duplicate)

Use DataFrame.unstack to get a Series with a MultiIndex, then filter the duplicated values with Series.duplicated and keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicated values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
To look up one specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()

The code below does use a loop, but it iterates only over the columns rather than over every cell. For each column it uses .any() to check whether the desired value occurs at all, then uses loc to select the matching rows and prints their index labels.
wanted_value = 55
for col in df.columns:
    matches = df[col].eq(wanted_value)
    if matches.any():
        print("row:", *df.loc[matches].index, ' col', col)

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows to be defined ; and the percentage this quantity makes up in the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).

A Python slice object, i.e. slice(None), selects every index of an indexable object, so df[slice(None)] is the same as df[:] and selects all rows in the DataFrame. (slice(-1) would leave out the last row.) Note that a slice cannot be combined with boolean filters using &. You can store it in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
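Since the question also asks for a "no-op" filter that can be combined with other filters through &, an all-True boolean Series is one option. The sketch below uses a hypothetical two-column frame and a variant of my_function that returns its results instead of printing them, so they are easy to inspect.

```python
import pandas as pd

def my_function(df, filter_to_apply, col):
    """Return the filtered sum and its share of the unfiltered sum."""
    my_sum = df.loc[filter_to_apply, col].sum()
    return my_sum, my_sum / df[col].sum()

# Hypothetical sample data.
df = pd.DataFrame({"group": ["a", "b", "a"], "amount": [10, 20, 30]})

# An all-True mask aligned with df's index keeps every row...
keep_all = pd.Series(True, index=df.index)
# ...and, unlike a slice, it can be combined with other boolean filters.
is_a = df["group"] == "a"

print(my_function(df, keep_all, "amount"))          # all rows
print(my_function(df, keep_all & is_a, "amount"))   # same rows as is_a alone
```

This does the same job as df.index.isin(df.index) from the question, but states the intent more directly.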

This is a way to select all rows:
df.iloc[range(0, len(df))]
and so is this:
df[:]
The : in df[:] is shorthand for slice(None), so slice(None) is what you would pass as an argument.

There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
# The filter can be something like df['price'] > 500 or df['name'] == 'Brian',
# basically anything that returns a boolean for each row
total = df2['ColumnToSum'].sum()

Add new column to Pandas dataframe using conditional values from another column

I would like to add a new column retailer_relationship, to my dataframe.
I would like each row value of this new column to be 'TRUE' if the retailer column value starts with any of the items in the list list_of_relationships, and 'FALSE' otherwise.
What I've tried:
list_of_relationships = ("retailer1","retailer2","retailer3")
for i in df.index:
    for relationship in list_of_relationships:
        if df.iloc[i]['retailer'].str.startswith(relationship):
            df.at[i, 'retailer_relationship'] = "TRUE"
        else:
            df.at[i, 'retailer_relationship'] = "FALSE"

You can use a regular expression combining the ^ special character, which specifies the beginning of the string, with a group that matches any element of list_of_relationships. The group is needed so that ^ applies to every alternative rather than only the first, and re.escape keeps any special characters in the names literal:
import re
regex = re.compile('^(?:' + '|'.join(map(re.escape, list_of_relationships)) + ')')
df['retailer_relationship'] = df['retailer'].str.contains(regex).map({True: 'TRUE', False: 'FALSE'})
Since you want the literal strings 'TRUE' and 'FALSE', we can then use map to convert the booleans to strings.
An alternative method that is slightly faster, though I don't think that'll matter:
df['retailer_relationship'] = df['retailer'].str.contains(regex).transform(str).str.upper()

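See if this works for you. It would help to share a sample of your df or some dummy data representing it. Note that isin only matches values that are exactly equal to an item in the list, not values that merely start with one.
df['retailer_relationship'] = False
df.loc[df['retailer'].isin(list_of_relationships), 'retailer_relationship'] = True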

You can still use startswith in pandas; recent versions accept a tuple of prefixes. The result is a boolean column:
df['retailer_relationship'] = df['retailer'].str.startswith(tuple(list_of_relationships))
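Putting the pieces together, the sketch below produces the literal 'TRUE'/'FALSE' strings the question asks for. The sample retailer names are hypothetical, and the code assumes a pandas version in which Series.str.startswith accepts a tuple of prefixes (1.5 or later, to the best of my knowledge).

```python
import pandas as pd

list_of_relationships = ("retailer1", "retailer2", "retailer3")

# Hypothetical sample data.
df = pd.DataFrame(
    {"retailer": ["retailer1 north", "other retailer2", "retailer3", "shop"]}
)

# True where the value starts with any of the prefixes in the tuple.
starts = df["retailer"].str.startswith(list_of_relationships)

# Map the booleans to the literal strings requested in the question.
df["retailer_relationship"] = starts.map({True: "TRUE", False: "FALSE"})

print(df)
```

The second row is 'FALSE' because "retailer2" appears in the middle of the value rather than at the start, which is the difference between startswith and a plain substring search.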
