Remove digits from a list of strings in pandas column - python

I have this pandas dataframe
  Tokens
0  ['rice', 'XXX', '250g']
1  ['beer', 'XXX', '750cc']

All the tokens in a row ('rice', 'XXX' and '250g') are in the same list of strings, in the same column. I want to remove the digits, but because they appear together with other characters (as in '250g'), they cannot simply be filtered out as whole tokens.
I have tried this code:
def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]

df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()
but it only joined the strings, and I clearly do not want to do that.
My desired output:
  Tokens
0  ['rice', 'XXX', 'g']
1  ['beer', 'XXX', 'cc']

This is possible using pandas methods, which are vectorised and therefore more efficient than looping.
import pandas as pd
df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)
    .groupby(level=0)
    .agg(list)
)
#            Tokens
# 0  [rice, XXX, g]
# 1  [beer, XXX, cc]
Here we use:

- pandas.Series.explode to convert the Series of lists into one row per element
- pandas.Series.str.replace to replace occurrences of \d+ (one or more digits 0-9) with "" (nothing)
- pandas.Series.groupby to group the Series by index (level=0) and put the elements back into lists (.agg(list))
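To make the pipeline concrete, here is the same toy dataframe with each intermediate step bound to a name (a sketch of the steps above, not new functionality; variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

exploded = df["Tokens"].explode()                       # one token per row, original index repeated
cleaned = exploded.str.replace(r"\d+", "", regex=True)  # digit runs removed from each token
result = cleaned.groupby(level=0).agg(list)             # tokens regrouped by original row index
print(result.tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```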

Here's a simple solution:

df = pd.DataFrame({'Tokens': [['rice', 'XXX', '250g'],
                              ['beer', 'XXX', '750cc']]})

def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])

def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]

df["Tokens"] = df.Tokens.apply(remove_digits)

You can use to_list + re.sub in order to update your original dataframe.
import re

for index, lst in enumerate(df['Tokens'].to_list()):
    lst = [re.sub(r'\d+', '', i) for i in lst]
    df.at[index, 'Tokens'] = lst  # .at can assign a list to a single cell
print(df)
Output:
Tokens
0 [rice, XXX, g]
1 [beer, XXX, cc]
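A hedged alternative sketch that avoids mutating the dataframe row by row: apply the same re.sub inside a list comprehension (same toy data as in the question):

```python
import re
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
# one pass per row; each token gets its digit runs stripped
df["Tokens"] = df["Tokens"].apply(lambda lst: [re.sub(r"\d+", "", s) for s in lst])
print(df["Tokens"].tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```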

Related

Using regex sub on a df.column with apply in pandas df

I have a df as such:
data = [{'Employees at store': 18, 'store': 'Mikes&Carls P#rlor', 'hair cut inch': '$3'},
        {'Employees at store': 5, 'store': 'Over-Top', 'hair cut inch': '$9'}]
df = pd.DataFrame(data)
df

and have

df = df.apply(lambda x: x.astype(str).str.lower().str.replace(' ', '_')
              if isinstance(x, object)
              else x)

working for replacing spaces with underscores. I know that you can chain these per How to replace multiple substrings of a string?.
I also know that the linked approach replaces the exact string, not a subpart of it, having tried:

df = df.apply(lambda x: x.astype(str).str.lower()
              .str.replace(' ', '_')
              .str.replace('&', 'and')
              .str.replace('#', 'a') if isinstance(x, object) else x)

I think I have to use re.sub and do something like re.sub('[^a-zA-Z0-9_ \n\.]', '', my_str), but I can't figure out how to build it into my apply(lambda ...) function.
You can pass a callable to str.replace. Use a dictionary with the list of replacements and use the get method:
maps = {' ': '_', '&': 'and', '#': 'a'}
df['store'].str.replace('[ &#]', lambda m: maps.get(m.group(), ''), regex=True)
output:
0 MikesandCarls_Parlor
1 Over-Top
Name: store, dtype: object
applying on all (string) columns
cols = df.select_dtypes('object').columns
maps = {' ': '_', '&': 'and', '#': 'a', '$': '€'}
df[cols] = df[cols].apply(lambda col: col.str.replace('[ &#$]', lambda m: maps.get(m.group(), ''), regex=True))
output:
   Employees at store                 store hair cut inch
0                  18  MikesandCarls_Parlor            €3
1                   5              Over-Top            €9
replacement per column
cols = df.select_dtypes('object').columns
maps = {'store': {' ': '_', '&': 'and', '#': 'a', '$': '€'},
'hair cut inch': {'$': '€'}
}
df[cols] = df[cols].apply(lambda col: col.str.replace('[ &#$]',
                                                      lambda m: maps.get(col.name, {}).get(m.group(), ''),
                                                      regex=True))

pandas: exact match does not work in an if AND condition

I have two dataframes as follows:
data = {'First': [['First', 'value'], ['second', 'value'], ['third', 'value', 'is'], ['fourth', 'value', 'is']],
        'Second': ['noun', 'not noun', 'noun', 'not noun']}
df = pd.DataFrame(data, columns=['First', 'Second'])

and

data2 = {'example': ['First value is important', 'second value is important too',
                     'it us good to know', 'Firstap is also good', 'aplsecond is very good']}
df2 = pd.DataFrame(data2, columns=['example'])
and I have written the following code to filter out sentences from df2 when there is a match in df for the first word of the sentence, but only if the corresponding 'Second' column contains the word 'noun'. So there are two conditions.
def checker():
    result = []
    for l in df2.example:
        df['first_unlist'] = [','.join(map(str, l)) for l in df.First]
        if df.first_unlist.str.match(pat=l.split(' ', 1)[0]).any() and df.Second.str.match('noun').any():
            result.append(l)
    return result
However, I realized that I get ['First value is important', 'second value is important too'] as the output when I run the function, which shows that the second condition (the 'noun' filter) does not work. My desired output would be ['First value is important'].
I have also tried .str.contains() and .eq(), but I still got the same output.
I would suggest filtering df before trying to match:

def checker():
    result = []
    for l in df2.example:
        first_unlist = [x[0] for x in df.loc[df.Second == 'noun', 'First']]
        if l.split(' ')[0] in first_unlist:
            result.append(l)
    return result

checker()
['First value is important']
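For comparison, a sketch of the same two-condition filter done with pandas operations instead of a Python loop (the set of allowed first words is built once; variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({"First": [["First", "value"], ["second", "value"],
                             ["third", "value", "is"], ["fourth", "value", "is"]],
                   "Second": ["noun", "not noun", "noun", "not noun"]})
df2 = pd.DataFrame({"example": ["First value is important", "second value is important too",
                                "it us good to know", "Firstap is also good",
                                "aplsecond is very good"]})

# first list-elements of rows tagged exactly 'noun'
nouns = {lst[0] for lst in df.loc[df["Second"] == "noun", "First"]}
# first word of each sentence, tested for membership
mask = df2["example"].str.split(" ").str[0].isin(nouns)
print(df2.loc[mask, "example"].tolist())  # ['First value is important']
```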

Find a column name and retaining certain string in that entire column values

I would like to format the "status" column in a csv and retain the string inside single quotes adjoining a comma ('sometext',).
Example:
Input
As in rows 2 & 3: if more than one value is found in any column's values, they should be concatenated with a pipe symbol (|), e.g. Phone|Charger.
The expected output should be pasted into the same status column, like below.
My attempt (not working):

import pandas as pd
import re

df = pd.read_csv("test projects.csv")
scol = df.columns.get_loc("Status")
statusRegex = re.compile("'\t',"?"'\t',")
mo = statusRegex.search(scol.column)
Let's say you have df as:

df = pd.DataFrame([[[{'a': '1', 'b': '4'}]], [[{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]]], columns=['pr'])

df:
                                             pr
0                        [{'a': '1', 'b': '4'}]
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]

df['comb'] = df.pr.apply(lambda x: '|'.join([i['a'] for i in x]))

df:
                                             pr comb
0                        [{'a': '1', 'b': '4'}]    1
1  [{'a': '1', 'b': '2'}, {'a': '3', 'b': '5'}]  1|3
import pandas as pd

# simplified mock data
df = pd.DataFrame(dict(
    value=[23432] * 3,
    Status=[
        [{'product.type': 'Laptop'}],
        [{'product.type': 'Laptop'}, {'product.type': 'Charger'}],
        [{'product.type': 'TV'}, {'product.type': 'Remote'}]
    ]
))

# make a method to do the desired formatting / extraction of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe

# save to desired column
df['output'] = df['Status'].apply(da_piper)  # apply the method to the Status col
Additional help: you do not need to use read_excel, since csv is not an Excel format. It is comma-separated values, which is a standard format. In this case you can just do this:
import pandas as pd

# make a method to do the desired formatting / extraction of data
def da_piper(cell):
    """extracts product.type and concatenates with a pipe"""
    vals = [_['product.type'] for _ in cell]  # get only the product.type values
    return '|'.join(vals)  # join them with a pipe

# read csv to dataframe
df = pd.read_csv("test projects.csv")

# apply method and save to desired column
df['Status'] = df['Status'].apply(da_piper)  # apply the method to the Status col
Thank you all for the help and suggestions. Please find the final working code:

import re
import pandas as pd

df = pd.read_csv('test projects.csv')
rows = len(df['input'])

def get_values(value):
    m = re.findall(r"'(.+?)'", value)
    word = ""
    for mm in m:
        if 'value' not in str(mm):
            if 'autolabel_strategy' not in str(mm):
                if 'String Matching' not in str(mm):
                    word += mm + "|"
    return str(word).rsplit('|', 1)[0]

al_lst = []
ans_lst = []
for r in range(rows):
    auto_label = df['autolabeledValues'][r]
    answers = df['answers'][r]
    al = get_values(auto_label)
    ans = get_values(answers)
    al_lst.append(al)
    ans_lst.append(ans)

df['a'] = al_lst
df['b'] = ans_lst
df.to_csv("Output.csv", index=False)

Python - counting non alphanumeric characters in a Pandas dataframe

I am trying to count the occurrences of characters in a column in a Pandas DataFrame. For example, I want to know in total how many times the character A appears in the column. The problem occurs when there's a non-alphanumeric character.
Here's a minimum reproducible example:
import pandas as pd
df = pd.DataFrame(data = ['AA', 'BA', 'ABA'], columns = ['col1'])
charset = set("".join(list(df['col1'])))
print(charset)
This is the set of characters in the column:
{'B', 'A'}
for char in charset:
    print(char, ' ', sum(df['col1'].str.count(char)))
This is the number of times each character appears in the column:
B 2
A 5
Trying the same again, except with a few non-alphanumeric characters:
df2 = pd.DataFrame(data = ['AA+', 'BA', 'ABA('], columns = ['col1'])
charset = set("".join(list(df2['col1'])))
print(charset)
As expected, the set of characters:
{'(', 'B', '+', 'A'}
However trying to count the characters now fails:
for char in charset:
    print(char, ' ', sum(df2['col1'].str.count(char)))
error: missing ), unterminated subpattern at position 0
Is there some way to escape the non-alphanumeric characters, or otherwise get the counts I am looking for?
Because the input to Series.str.count is a regex, it is possible to use re.escape. From the docs:
pat : str
Valid regular expression.
df2 = pd.DataFrame(data = ['AA+', 'BA', 'ABA('], columns = ['col1'])

# list is not necessary
charset = set("".join(df2['col1']))
print(charset)
{'(', 'B', 'A', '+'}

import re

for char in charset:
    # use pandas sum
    print(char, ' ', df2['col1'].str.count(re.escape(char)).sum())

(   1
B   2
A   5
+   1
Slightly extending what you have already done, you can use a conditional dictionary comprehension to check that each character in charset is an ASCII letter:
from string import ascii_letters

>>> {char: df['col1'].str.count(char).sum() for char in charset
...  if char in ascii_letters}
{'B': 2, 'A': 5}
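Another option, sketched here as an assumption rather than part of either answer: collections.Counter counts every character in one pass with no regex involved, so escaping never comes up:

```python
from collections import Counter

import pandas as pd

df2 = pd.DataFrame(data=["AA+", "BA", "ABA("], columns=["col1"])
# plain string counting over the concatenated column; no regex, nothing to escape
counts = Counter("".join(df2["col1"]))
print(dict(counts))  # {'A': 5, '+': 1, 'B': 2, '(': 1}
```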

Delete specific string from array column python dataframe

I'm trying to remove the string '$A' from the list elements in column a, but the code below doesn't seem to work. It attempts to replace '$A' with an empty string (which has no effect); I would actually prefer to just delete that string.
df = pd.DataFrame({'a': [['$A','1'], ['$A', '3','$A'],[]], 'b': ['4', '5', '6']})
df['a'] = df['a'].replace({'$A': ''}, regex=True)
print(df['a'])
replace doesn't check inside the list element, you'll have to use loops/apply in this case:
df['a'] = df.a.apply(lambda x: [s for s in x if s != '$A'])
df
# a b
#0 [1] 4
#1 [3] 5
#2 [] 6
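If the goal is replacement (an empty string in place of '$A') rather than deletion, here is a sketch using re.escape, which neutralizes the '$' regex anchor that also defeated the original replace call:

```python
import re

import pandas as pd

df = pd.DataFrame({"a": [["$A", "1"], ["$A", "3", "$A"], []], "b": ["4", "5", "6"]})
# re.escape turns '$A' into '\$A' so the dollar sign is matched literally,
# and apply reaches inside each list where Series.replace does not
df["a"] = df["a"].apply(lambda lst: [re.sub(re.escape("$A"), "", s) for s in lst])
print(df["a"].tolist())  # [['', '1'], ['', '3', ''], []]
```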
