Keep part of string based on certain characters in a DataFrame column - python

I know there have been a lot of questions around this topic but I didn't find any that described my problem. I have a df, with a specific column that looks like this:
colA
['drinks/coke/diet', 'food/spaghetti']
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']
['drinks/coke/diet', 'drinks/coke']
...
The values of colA are a string, NOT a list. What I want to achieve is a new column where I only keep the parts of the value that contain 'coke'. Coke can be repeated any number of times in the string, and can be in any place. The values between quotes don't always contain an equal number of segments separated by /.
So the result should look like this:
colA colB
['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza'] 'drinks/coke'
['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
...
I've tried calling a function:
import json
df['coke'] = df['colA'].apply(lambda secties: [s for s in json.loads(colA) if 'coke' in s], meta=str)
But this one keeps throwing errors that I don't know how to solve.

You could split on comma and explode to create a Series. Then use str.contains to create a boolean mask that you could use to filter the items that contain the word "coke". Finally join the strings back across indices:
s = df['colA'].str.split(',').explode()
df['colB'] = s[s.str.contains('coke')].groupby(level=0).apply(','.join).str.strip('[]')
Output:
colA colB
0 ['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
1 ['drinks/water', 'drinks/tea', 'drinks/coke', ... 'drinks/coke'
2 ['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
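Runnable end to end on the sample data, a small sketch of this approach might look like the following (assuming colA holds plain strings that merely look like lists):
import pandas as pd

df = pd.DataFrame({'colA': [
    "['drinks/coke/diet', 'food/spaghetti']",
    "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
    "['drinks/coke/diet', 'drinks/coke']",
]})

# split the raw string on commas, then explode to one fragment per row
s = df['colA'].str.split(',').explode()
# keep fragments mentioning 'coke', re-join per original row, drop stray brackets
df['colB'] = s[s.str.contains('coke')].groupby(level=0).apply(','.join).str.strip('[]')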

Try parsing the string into a list and then making the check for coke on each item. Note that json.loads would reject the single-quoted values shown above, so ast.literal_eval is used here instead; something like this:
import ast
df['coke'] = df['colA'].apply(lambda secties: [s for s in ast.literal_eval(secties) if 'coke' in s])

Related

Pandas; Trying to split a string in a column with | , and then list all strings, removing all duplicates

I'm working on a data frame for a made-up TV show. In this dataframe are the columns "Season","EpisodeTitle","About","Ratings","Votes","Viewership","Duration","Date","GuestStars","Director","Writers", with rows indexed by ascending numerical values.
In this data frame, my problem relates to two columns: 'Writers' and 'Viewership'. In the Writers column, some of the rows have multiple writers, separated with " | ". In the Viewership column, each row has a float value between 1 and 23, with a max of 2 decimal places.
Here's a condensed example of the data frame I'm working with. I am trying to filter the "Writers" column, and then determine the total average viewership for each individual writer:
df = pd.DataFrame({'Writers' : ['John Doe','Jennifer Hopkins | John Doe','Ginny Alvera','Binny Glasglow | Jennifer Hopkins','Jennifer Hopkins','Sam Write','Lawrence Fieldings | Ginny Alvera | John Doe','John Doe'], 'Viewership' : ['3.4','5.26','22.82','13.5','4.45','7.44','9']})
The solution I came up with to split the column strings:
df["Writers"]= df["Writers"].str.split('|', expand=False)
This does split the string, but in some cases will leave whitespace before and after commas. I need the whitespace removed, and then I need to list all writers, but only list each writer once.
Second, for each individual writer, I would like to have columns stating their total average viewership, or a list of each writer, stating what their total average viewership was for all episodes they worked on:
["John Doe : 15" , "Jennifer Hopkins : 7.54" , "Lawrence Fieldings : 3.7"]
This is my first post here, I really appreciate any help!
# I believe in newer versions of pandas you can split cells into multiple rows like this
# reference: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows
df2 = df.assign(Writers=df.Writers.str.split('|')).explode('Writers').reset_index(drop=True)
# to remove whitespace, just use this:
# it strips whitespace at the beginning and end of every cell in that column
df2['Writers'] = df2['Writers'].str.strip()
# if you want to remove duplicates, do a groupby;
# this combines (sums) duplicates, and you can use any other mathematical
# aggregation function as well (e.g. replace sum() by mean())
df2.groupby(['Writers']).sum()
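For the total average viewership the question asks about, a minimal end-to-end sketch might look like this (a shortened, hypothetical sample; it assumes the Viewership values are wrapped in a list of the same length as Writers and cast to float):
import pandas as pd

df = pd.DataFrame({
    'Writers': ['John Doe', 'Jennifer Hopkins | John Doe', 'Ginny Alvera',
                'Binny Glasglow | Jennifer Hopkins'],
    'Viewership': [3.4, 5.26, 22.82, 13.5],
})

# one row per writer, whitespace stripped
df2 = df.assign(Writers=df['Writers'].str.split('|')).explode('Writers')
df2['Writers'] = df2['Writers'].str.strip()
# average viewership per unique writer
print(df2.groupby('Writers')['Viewership'].mean())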

Remove words in each row in a column of dataframe from another list of words in a column of another dataframe

I want to subtract or remove the words in one dataframe from another dataframe in each row.
This is the main table/columns of a pyspark dataframe.
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
Below is another pyspark dataframe. The words in this dataframe need to be removed from the main table above, in column cust_text, wherever they occur in each row. For example, 'want' will be removed from every row of the first dataframe wherever it shows up.
+-------+
|column1|
+-------+
| want|
|because|
| need|
| hello|
| a|
| have|
| go|
+-------+
This can be done in pyspark or pandas. I have tried googling a solution using Python, PySpark and pandas, but I am still not able to remove the words from the main table based on the single-column table.
The result should look like this:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
If you want to remove just the word in the corresponding line of df2, you could do that as follows, but it will probably be slow for large data sets, because it can only partially use fast C implementations:
# define your helper function to remove the string
def remove_string(ser_row):
    return ser_row['cust_text'].replace(ser_row['remove'], '')

# create a temporary column with the string to remove in the first dataframe
df1['remove'] = df2['column1']
df1['cust_text'] = df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)
The result looks like:
Out[145]:
0 hi fine i to go
1 i need lines hold
2 i have the 60 packs
3 can you teach
dtype: object
If, however, you want to remove all words in your df2 column from every row, you need to do it differently. Unfortunately str.replace does not help here with plain strings, unless you want to call it for every line in your second dataframe. So if your second dataframe is not too large, you can build one regular expression and make use of str.replace:
import re
replace = re.compile(r'\b(' + '|'.join(df2['column1']) + r')\b')
df1['cust_text'].str.replace(replace, '', regex=True)
The output is:
Out[184]:
0 hi fine i to
1 i lines hold
2 i the 60 packs
3 can you teach
Name: cust_text, dtype: object
If you don't like the repeated spaces that remain, you can just perform something like:
df1['cust_text'].str.replace(replace, '', regex=True).str.replace(r'\s{2,}', ' ', regex=True)
Addition: what if not only the text without the words is relevant, but the removed words themselves as well? How can we get the words which were replaced? Here is one attempt, which works if one character can be identified that never appears in the text. Let's assume this character is a #; then you could do (on the original column values, before replacement):
# enclose each keywords in #
ser_matched= df1['cust_text'].replace({replace: r'#\1#'}, regex=True)
# now remove the rest of the line, which is unmatched
# this is the part of the string after the last occurrence
# of a #
ser_matched= ser_matched.replace({r'^(.*)#.*$': r'\1', '^#': ''}, regex=True)
# and if you like your keywords to be in a list, rather than a string
# you can split the string at last
ser_matched.str.split(r'#+')
This solution would be specific to pandas. If I understand your challenge correctly, you want to remove all words from column cust_text that occur in column1 of the second DataFrame. Let's give the corresponding DataFrames the names df1 and df2. This is how you would do it:
for i in range(len(df1)):
    sentence = df1.loc[i, "cust_text"]
    for j in range(len(df2)):
        delete_word = df2.loc[j, "column1"]
        if delete_word in sentence:
            sentence = sentence.replace(delete_word, "")
    df1.loc[i, "cust_text"] = sentence
I have assigned variables to certain data points in these dataframes (sentence and delete_word), but that is just for the sake of understanding. You could easily condense this code by a few lines by skipping that.
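Since the question also mentions PySpark, a sketch of the same regex idea on the Spark side could look like this (assuming df2 is small enough to collect to the driver; the column names follow the question):
from pyspark.sql import functions as F

# collect the small word list to the driver and build one alternation pattern
words = [row['column1'] for row in df2.collect()]
pattern = r'\b(' + '|'.join(words) + r')\b'

# remove the words, then collapse the doubled spaces left behind
df1 = (df1
       .withColumn('cust_text', F.regexp_replace('cust_text', pattern, ''))
       .withColumn('cust_text', F.regexp_replace('cust_text', r'\s{2,}', ' ')))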

Python Pandas regex outputting NaN

I have a pandas dataframe column with characters like this (supposed to be a dictionary but became strings after scraping into a CSV):
{"id":307,"name":"Drinks","slug":"food/drinks"...
I'm trying to extract the values for "name", so in this case it would be "Drinks".
The code I have right now (shown below) keeps outputting NaN for the entire dataframe.
df['extracted_category'] = df.category.str.extract('("name":*(?="slug"))')
What's wrong with my regex? Thanks!
Better to convert it into a dataframe; you can use eval and pd.Series for that, like:
# sample dataframe
df
category
0 {"id":307,"name":"Drinks","slug":"food/drinks"}
df.category.apply(lambda x : pd.Series(eval(x)))
id name slug
0 307 Drinks food/drinks
Or convert just the string to a dictionary using eval:
df['category'] = df.category.apply(eval)
df.category.str["name"]
0 Drinks
Name: category, dtype: object
Hi @Ellie, check also this approach:
x = {"id":307,"name":"Drinks","slug":"food/drinks"}
result = [(key, value) for key, value in x.items() if key.startswith("name")]
print(result)
[('name', 'Drinks')]
So, firstly, the outermost parentheses in ("name":*(?="slug")) need to go, because they delimit the first group, and the extracted value would then be equal to that group, which is not where the value of 'name' lies.
A simpler regex to try would be "name":"(\w*)" (note: make sure to keep the part of the regex that you want extracted inside the parentheses). This regex looks for the following string:
"name":"
and extracts all the word characters that follow it ((\w*)), stopping at the next double quotation mark.
You can test your regex at: https://regex101.com/
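As a quick sanity check, here is a minimal sketch of that pattern (assuming the scraped strings live in a category column, as in the question):
import pandas as pd

df = pd.DataFrame({'category': ['{"id":307,"name":"Drinks","slug":"food/drinks"}']})
# expand=False returns a Series instead of a one-column DataFrame
df['extracted_category'] = df['category'].str.extract(r'"name":"(\w*)"', expand=False)
print(df['extracted_category'])
# 0    Drinks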

Pattern Match in List of Strings, Create New Column in pandas

I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains, with the list values concatenated together with "|" for alternation, but I am a bit stumped on how to create a column holding the exact match. Any tips or tricks appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
.findall('|'.join(product_name))
.str[0])
Result:
>>> df
id product_name_extract product_name_mapped
0 1 00012CDN 12CDN
1 2 14311121NDC 21NDC
2 3 NDC37ba 37ba
3 4 47CD27 7CD2
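Runnable end to end on the question's sample data, that might look like:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'product_name_extract': ['00012CDN', '14311121NDC', 'NDC37ba', '47CD27']})
product_name = ['12CDN', '21NDC', '37ba', '7CD2']

# findall returns every alternation hit per row; .str[0] keeps the first one
df['product_name_mapped'] = (df['product_name_extract']
                             .str.findall('|'.join(product_name))
                             .str[0])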

How to split a column with string values like '[title:item][title2:item]'...etc into a dictionary with pandas

I am trying to clean some data in a dataframe. In particular a column that displays like this:
0 [Bean status:Whole][Type of Roast:Medium][Coff...
1 [Type of Roast:Espresso][Coffee Type:Blend]
2 [Bean status:Whole][Type of Roast:Dark][Coffee...
3 [Bean status:Whole][Type of Roast:Light][Coffe...
4 NaN
5 [Roaster:Little City][Type of Roast:Light][Cof...
Name: options, dtype: object
My goal is to split this into four columns and assign the corresponding value to the columns to look something like this:
Roaster Bean Status Type of Roast Coffee Type
0 NaN Whole Medium Blend
1 NaN NaN Espresso Blend
..
5 Littl... Whole Light Single Origin
I've tried df['options'].str.split('[', expand=True) but it is not suitable because the options are not always present or in the same position.
My thoughts were to try to split the strings into a dictionary and store that dictionary in a new dataframe, then join the two dataframes together. However, I'm getting lost trying to store the column into a dictionary. I tried doing this: https://www.fir3net.com/Programming/Python/python-split-a-string-into-a-dictionary.html like so:
roasts = {}
roasts = dict(x.split(':') for x in df['options'][0].split('[]'))
print(roasts)
and I get this error:
ValueError: dictionary update sequence element #0 has length 4; 2 is required
I tried investigating what was going on here by storing to a list instead:
s = ([x.split(':') for x in df['options'][0].split('[]')])
print(s)
[['[Bean status', 'Whole][Type of Roast', 'Medium][Coffee Type', 'Blend]']]
So I see the code is not splitting the string up how I would like, and have played around substituting a single bracket into those various locations without proper results.
Is it possible to get this column into a dictionary or will I have to resort to regex?
Using AmiTavory's sample data
df = pd.DataFrame(dict(options=[
'[Bean status:Whole][Type of Roast:Medium]',
'[Type of Roast:Espresso][Coffee Type:Blend]'
]))
Combination of re.findall and str.split
import re
import pandas as pd
pd.DataFrame([
dict(
x.split(':')
for x in re.findall(r'\[(.*?)\]', v)
)
for v in df.options
])
Bean status Coffee Type Type of Roast
0 Whole NaN Medium
1 NaN Blend Espresso
You might use
df.options.apply(
lambda s: pd.Series({e.split(':')[0]: e.split(':')[1] for e in s[1: -1].split('][')}))
Example
df = pd.DataFrame(dict(options=[
'[Bean status:Whole][Type of Roast:Medium]',
'[Type of Roast:Espresso][Coffee Type:Blend]'
]))
>>> df.options.apply(
lambda s: pd.Series({e.split(':')[0]: e.split(':')[1] for e in s[1: -1].split('][')}))
Bean status Coffee Type Type of Roast
0 Whole NaN Medium
1 NaN Blend Espresso
Explanation
Say you start with a string like
s = '[Bean status:Whole][Type of Roast:Medium]'
Then
s[1: -1]
removes the first and last brackets.
Then,
split('][')
splits the dividers
Then,
e.split(':')[0]: e.split(':')[1]
for each of the splits, maps the first part to the second part.
Finally, create a Series from this.
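Putting those steps together on a single string, a quick interactive check might look like:
s = '[Bean status:Whole][Type of Roast:Medium]'
inner = s[1:-1]            # 'Bean status:Whole][Type of Roast:Medium'
parts = inner.split('][')  # ['Bean status:Whole', 'Type of Roast:Medium']
d = {e.split(':')[0]: e.split(':')[1] for e in parts}
# {'Bean status': 'Whole', 'Type of Roast': 'Medium'}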
