Using regex sub on a df.column with apply in pandas df

I have a df as such:
data= [{'Employees at store': 18, 'store': 'Mikes&Carls P#rlor', 'hair cut inch':'$3'},
{'Employees at store': 5, 'store': 'Over-Top', 'hair cut inch': '$9'}]
df = pd.DataFrame(data)
df
and have
df = df.apply(lambda x: x.astype(str).str.lower().str.replace(' ', '_')
              if isinstance(x, object)
              else x)
working for replacing spaces with underscores. I know that you can chain these, per How to replace multiple substrings of a string? .
I also know that the linked approach replaces the exact string, not a subpart of it, having tried:
df = df.apply(lambda x: x.astype(str).str.lower()
              .str.replace(' ', '_')
              .str.replace('&', 'and')
              .str.replace('#', 'a') if isinstance(x, object) else x)
I think I have to use re.sub and do something like this re.sub('[^a-zA-Z0-9_ \n\.]', '', my_str)
and can't figure out how to build it into my apply(lambda...) function.

You can pass a callable to str.replace. Use a dictionary with the list of replacements and use the get method:
maps = {' ': '_', '&': 'and', '#': 'a'}
df['store'].str.replace('[ &#]', lambda m: maps.get(m.group(), ''), regex=True)
output:
0 MikesandCarls_Parlor
1 Over-Top
Name: store, dtype: object
applying on all (string) columns
cols = df.select_dtypes('object').columns
maps = {' ': '_', '&': 'and', '#': 'a', '$': '€'}
df[cols] = df[cols].apply(lambda col: col.str.replace('[ &#$]', lambda m: maps.get(m.group(), ''), regex=True))
output:
Employees at store store hair cut inch
0 18 MikesandCarls_Parlor €3
1 5 Over-Top €9
replacement per column
cols = df.select_dtypes('object').columns
maps = {'store': {' ': '_', '&': 'and', '#': 'a', '$': '€'},
'hair cut inch': {'$': '€'}
}
# look up the per-column mapping by column name (col is a Series, so use col.name)
df[cols] = df[cols].apply(lambda col: col.str.replace('[ &#$]',
                                      lambda m: maps.get(col.name, {}).get(m.group(), ''),
                                      regex=True))
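
The callable-replacement idea above also works with plain re.sub, which is what the question was reaching for. The following is a minimal sketch, not part of the original answer: the helper name replace_chars is hypothetical, and the pattern is built from the dictionary keys (an assumption made here so that the pattern and the mapping cannot drift apart).

```python
import re

# Replacement table taken from the answer above.
maps = {' ': '_', '&': 'and', '#': 'a', '$': '€'}

# Build the pattern from the keys; re.escape keeps metacharacters
# such as '$' literal.
pattern = re.compile('|'.join(re.escape(key) for key in maps))

def replace_chars(text):
    """Return `text` with every key of `maps` replaced by its value."""
    # Every match is guaranteed to be a key, so a direct lookup is safe.
    return pattern.sub(lambda m: maps[m.group()], text)

print(replace_chars('Mikes&Carls P#rlor'))  # MikesandCarls_Parlor
print(replace_chars('$3'))                  # €3
```

On a string column, a function like this could presumably be applied with Series.map instead of str.replace; that variant is not tested here.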

Related

Pandas converting Column of Lists to Column of Text Data Pre-Processing

I have a data set that looks like this:
sentiment
text
positive
['chewy', 'what', 'dhepburn', 'said']
neutral
['chewy', 'plus', 'you', 've', 'added']
and I want to convert it to this:
sentiment
text
positive
chewy what dhepburn said
neutral
chewy plus you ve added
I basically want to convert the 'text' column, which is made up of lists, into a column of text.
I've done multiple versions of this code:
def joinr(words):
    return ','.join(words)
#df['text'] = df.apply(lambda row: joinr(row['text']), axis=1)
#df['text'] = df['text'].apply(lambda x: ' '.join([x]))
df['text'] = df['text'].apply(joinr)
and I keep getting something that resembles this:
sentiment
text
positive
['c h e w y', 'w h a t', 'd h e p b u r n', 's a i d']
neutral
['c h e w y', 'p l u s', 'y o u', 'v e', 'a d d e d']
This is a part of data pre-processing for an ML model. I'm working in Google Colab (similar to Jupyter Notebook).

I believe your problem is the axis=1; you don't need that.
data = {
'sentiment' : ['positive', 'neutral'],
'text' : ["['chewy', 'what', 'dhepburn', 'said']", "['chewy', 'plus', 'you', 've', 'added']"]
}
df = pd.DataFrame(data)
df['text'] = df['text'].apply(lambda x : x.replace('[', '')).apply(lambda x : x.replace(']', '')).apply(lambda x : x.replace("'", ''))
df['text'] = df['text'].apply(lambda x : x.split(', '))  # split on ', ' so the words carry no leading space
df['text'] = df['text'].agg(' '.join)
df

Use join:
df['test'].str.join(' ')
Demonstration:
df = pd.DataFrame({'test': [['chewy', 'what', 'dhepburn', 'said']]})
df['test'].str.join(' ')
Output:
0 chewy what dhepburn said
Name: test, dtype: object
Based on the comment:
#Preparing data
string = """sentiment text
positive ['chewy', 'what', 'dhepburn', 'said']
neutral ['chewy', 'plus', 'you', 've', 'added']"""
data = [x.split('\t') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])
#Solution
df['text'].apply(lambda x: eval(x)).str.join(' ')
Also, you can use more simply:
df['text'].str.replace(r"\[|\]|'|,", '', regex=True)
Output:
0 chewy what dhepburn said
1 chewy plus you ve added
Name: text, dtype: object

If you have a string representation of a list you can use:
from ast import literal_eval
df['text'] = df['text'].apply(lambda x: ' '.join(literal_eval(x)))
If really you just want to remove the brackets and commas, use a regex:
df['text'] = df['text'].str.replace('[\[\',\]]', '', regex=True)
Output:
sentiment text
0 positive chewy what dhepburn said
1 neutral chewy plus you ve added
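
To make the literal_eval step concrete without a DataFrame, here is a small sketch on plain strings. The variable names rows and joined are illustrative only. literal_eval parses Python literals only, which is why it is usually preferred over eval for data like this.

```python
from ast import literal_eval

# Each element is the *string representation* of a list, as in the question.
rows = ["['chewy', 'what', 'dhepburn', 'said']",
        "['chewy', 'plus', 'you', 've', 'added']"]

# Parse each string back into a real list, then join its words with spaces.
joined = [' '.join(literal_eval(row)) for row in rows]

print(joined)  # ['chewy what dhepburn said', 'chewy plus you ve added']
```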

Remove digits from a list of strings in pandas column

I have this pandas dataframe
0 Tokens
1: 'rice', 'XXX', '250g'
2: 'beer', 'XXX', '750cc'
All tokens here, 'rice', 'XXX' and '250g' are in the same list of strings, also in the same column
I want to remove the digits, but because they are attached to other characters (as in '250g'),
the digits are not being removed.
I have tried this code:
def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]
df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()
but it only joined the strings, and I clearly do not want to do that.
My desired output:
0 Tokens
1: 'rice', 'XXX', 'g'
2: 'beer', 'XXX', 'cc'

This is possible using pandas methods, which are vectorised and so more efficient than looping.
import pandas as pd
df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)
    .groupby(level=0)
    .agg(list)
)
# Tokens
# 0 [rice, XXX, g]
# 1 [beer, XXX, cc]
Here we use:
pandas.Series.explode to convert the Series of lists into rows
pandas.Series.str.replace to replace occurrences of \d+ (one or more digits 0-9) with "" (nothing)
pandas.Series.groupby to group the Series by index (level=0) and put them back into lists (.agg(list))

Here's a simple solution -
df = pd.DataFrame({'Tokens':[['rice', 'XXX', '250g'],
['beer', 'XXX', '750cc']]})
def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])
def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]
df["Tokens"] = df.Tokens.apply(remove_digits)
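
A possible variation on the per-string helper above is str.translate with a deletion table, sketched below. The name strip_digits is hypothetical. One assumption to note: string.digits covers only ASCII 0-9, whereas str.isdigit() also accepts other Unicode digit characters, so the two are not strictly equivalent.

```python
import string

# Translation table that deletes the ASCII digits 0-9.
DELETE_DIGITS = str.maketrans('', '', string.digits)

def strip_digits(tokens):
    """Return a new list with the digits removed from each token."""
    return [token.translate(DELETE_DIGITS) for token in tokens]

print(strip_digits(['rice', 'XXX', '250g']))   # ['rice', 'XXX', 'g']
print(strip_digits(['beer', 'XXX', '750cc']))  # ['beer', 'XXX', 'cc']
```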

You can use to_list + re.sub in order to update your original dataframe.
import re
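for index, lst in enumerate(df['Tokens'].to_list()):
    lst = [re.sub(r'\d+', '', i) for i in lst]
    # .at sets a single cell, so the list is stored as one value
    # (assumes the default 0..n-1 index, since enumerate supplies the label)
    df.at[index, 'Tokens'] = lst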
print(df)
Output:
Tokens
0 [rice, XXX, g]
1 [beer, XXX, cc]

Splitting a string after certain characters?

I will be given a string, and I need to split it every time that it has an "|", "/", "." or "_"
How can I do this fast? I know how to use the command split, but is there any way to give more than 1 split condition to it? For example, if the input given was
Hello test|multiple|36.strings/just36/testing
I want the output to give:
"['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']"

Use a regular expression with the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
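
One behaviour worth knowing with the character-class approach: adjacent or trailing delimiters produce empty strings. The sketch below uses a made-up input to show this, and one possible way around it (a + quantifier plus a filter). Whether empty fields should be dropped depends on the data, so treat this as an option rather than a fix.

```python
import re

s3 = 'a//b_.c|'  # made-up input with adjacent and trailing delimiters

# One delimiter per split: empty strings appear between neighbours.
print(re.split(r'[/_|.]', s3))   # ['a', '', 'b', '', 'c', '']

# Treat a run of delimiters as one, then drop any remaining empties.
parts = [p for p in re.split(r'[/_|.]+', s3) if p]
print(parts)                     # ['a', 'b', 'c']
```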

Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']

Forming an array from items in list of lists

I am trying to create an array from data in a list of lists.
ac_name = 'ac'
dat = [['ab=55', 'ac=25', 'db =57', 'dc =44'],
['ab=75','ac =12', 'cg =11', 'pt =95'],
['ab=17', 'ac=62'],
['ab=97', 'aa=501', 'dc=12', 'dd=19']]
So I want to get a list that looks like this
ac = ['ac=25','ac=12','ac=62','']
and from this get
ac_values = [25,12,62,'']
All in all I want to convert dat into one large array.
I know this doesnt work because it is going through every item, so the output is however many elements there are in dat.
ac = []
for d in dat:
    for c in d:
        if ac_name in c:
            ac.append(c)
        else:
            ac.append('')

As I mentioned in a comment, your else block is inside the nested loop, which means that an empty string is appended for every item that does not match the condition. Instead, you can use a flag to record whether the if block ran in the inner loop, and append an empty string to the final result only when it did not.
In [6]: ac = []
   ...: for d in dat:
   ...:     flag = True
   ...:     for c in d:
   ...:         if ac_name in c:
   ...:             ac.append(c)
   ...:             flag = False
   ...:     if flag:
   ...:         ac.append('')
   ...:
In [7]: ac
Out[7]: ['ac=25', 'ac =12', 'ac=62', '']
But since this is not a very Pythonic way of dealing with the problem, you can instead use a generator expression and the next() function, as follows, to create a dictionary from the expected result. In this case you can easily access the keys or values as well.
In [19]: result = dict((ind, next((i for i in d if i.startswith(ac_name)), '=').split('=')[1]) for ind, d in enumerate(dat))
In [20]: result
Out[20]: {0: '25', 1: '12', 2: '62', 3: ''}
In [21]: result.keys() # shows number of sub-lists in your original list
Out[21]: dict_keys([0, 1, 2, 3])
In [22]: result.values()
Out[22]: dict_values(['25', '12', '62', ''])

ac_name = 'ac'
datas = [['ab=55', 'ac=25', 'db =57', 'dc =44'],
['ab=75','ac =12', 'cg =11', 'pt =95'],
['ab=17', 'ac=62'],
['ab=97', 'aa=501', 'dc=12', 'dd=19'],
['ab=55', 'ac=25', 'db =57', 'dc =44'],
['ab=75','ac =12', 'cg =11', 'pt =95'],
['ab=17', 'ac=62'],
['ab=97', 'aa=501', 'dc=12', 'dd=19']]
lst = []
for i, data in enumerate(datas):
    for d in data:
        if ac_name in d:
            lst.append(d.split('=')[-1])
    if i == len(lst):
        lst.append('')
print(lst)
Output
['25', '12', '62', '', '25', '12', '62', '']

You can use itertools.chain to flatten your list of lists. Then use a list comprehension to filter and split elements as required.
from itertools import chain
res = [int(i.split('=')[-1]) for i in chain.from_iterable(dat) \
if i.startswith('ac')]
print(res)
[25, 12, 62]

There are many ways to do this as folks have shown. Here is one way using list comprehension and higher order functions:
In [14]: ["" if not kv else kv[0].split('=')[-1].strip() for kv in [list(filter(lambda x: x.startswith(ac_name), xs)) for xs in datas]]
Out[14]: ['25', '12', '62', '']
If an exact key "ac" is desired, can use regular expressions too:
import re
p = re.compile(re.escape(ac_name) + r'\s*=')  # key, optional spaces, then '='
["" if not kv else kv[0].split('=')[-1].strip() for kv in [list(filter(lambda x: p.match(x), xs)) for xs in datas]]
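
Exact-key matching can also be sketched without a regular expression by parsing each item into a key/value pair first. This is an alternative to the approach above, not a restatement of it. The helper names parse_row and column are hypothetical, and the sketch assumes every item contains an '=' sign.

```python
dat = [['ab=55', 'ac=25', 'db =57', 'dc =44'],
       ['ab=75', 'ac =12', 'cg =11', 'pt =95'],
       ['ab=17', 'ac=62'],
       ['ab=97', 'aa=501', 'dc=12', 'dd=19']]

def parse_row(row):
    """Turn ['ab=55', 'ac =12'] into {'ab': '55', 'ac': '12'}."""
    pairs = (item.split('=', 1) for item in row)
    return {key.strip(): value.strip() for key, value in pairs}

def column(rows, name):
    """Collect the value stored under `name` in each row, or '' if absent."""
    values = (parse_row(row).get(name, '') for row in rows)
    # Convert purely numeric strings to int; leave everything else as is.
    return [int(v) if v.isdigit() else v for v in values]

ac_values = column(dat, 'ac')
print(ac_values)          # [25, 12, 62, '']
print(column(dat, 'dc'))  # [44, '', '', 12]
```

Because the lookup is by dictionary key, a key such as 'ac' cannot accidentally match an item like 'acx=1'.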

After some puzzling, I found a possible solution
Process each element in each sublist individually: if it contains 'ac', then strip the 'ac=' part. If not, just return an empty string ''.
Then concatenate all elements in each sublist using string.join(). This will return a list of strings with either the number string, e.g. '25', or an empty string.
Finally, conditionally convert each string to integer if possible. Else just return the (empty) string.
ac = [int(cell_string) if cell_string.isdigit() else cell_string for cell_string in
      [''.join([cell.split('=')[1] if ac_name in cell else '' for cell in row]) for row in dat]]
Output:
[25, 12, 62, '']
edit:
If you want to extend it to multiple column names, e.g.:
col_name = ['ac', 'dc']
Then just extend this:
cols = [[int(cell_string) if cell_string.isdigit() else cell_string for cell_string in
         [''.join([cell.split('=')[1] if name in cell else '' for cell in row]) for row in dat]] for name in col_name]
Output:
[[25, 12, 62, ''], [44, '', '', 12]]

Try this:
ac_name = 'ac'
ac = []
ac_values = []
for value in dat:
    found = False
    for item in value:
        if ac_name in item:
            ac.append(item)
            ac_values.append(item.split('=')[-1])
            found = True
    if not found:
        ac.append(' ')
        ac_values.append(' ')
print(ac)
print(ac_values)
Output:
['ac= 25', 'ac = 12', 'ac=62', ' ']
[' 25', ' 12', '62', ' ']

This will work for any length of ac_name:
ac_name = 'ac'
ac = []
ac_values=[]
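for i in dat:
    found = False
    for j in i:
        # startswith works for a key of any length
        if j.startswith(ac_name):
            ac.append(j)
            # take the text after '='; int() tolerates surrounding spaces
            ac_values.append(int(j.split('=')[-1]))
            found = True
    if not found:
        ac.append("")
        ac_values.append("")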
print(ac)
print(ac_values)

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343 abw34324 abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.

import re
s='abcd2343 abw34324 abc3243-23A'
re.split('(\d+)',s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall('\d*\D+',s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split('\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall('\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall('\d+|\D+', s) instead of re.split('(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall('\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']

This function handles float and negative numbers as well.
def separate_number_chars(s):
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']

import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A and will output abc for the letters group and 3243-23A for the_rest.
Since you said they are all on individual lines, you'll need to pass one line at a time as input.
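
Processing one line at a time with the pattern above might look like the following sketch. The function name split_codes is hypothetical, and a multi-line string stands in for the text file, whose actual name is not given in the question. Lines that do not start with a letter are skipped here, which is an assumption about the desired behaviour.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
CODE_RE = re.compile(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$")

def split_codes(lines):
    """Return (letters, the_rest) for every line that matches the pattern."""
    result = []
    for line in lines:
        m = CODE_RE.match(line.strip())
        if m:  # skip blank or non-matching lines
            result.append((m.group('letters'), m.group('the_rest')))
    return result

sample = "abcd2343\nabw34324\nabc3243-23A"  # stands in for the file contents
codes = split_codes(sample.splitlines())
print(codes)  # [('abcd', '2343'), ('abw', '34324'), ('abc', '3243-23A')]
```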

def firstIntIndex(string):
    result = -1
    for k in range(0, len(string)):
        if (bool(re.match(r'\d', string[k]))):
            result = k
            break
    return result
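
The function above returns the index of the first digit. A shorter sketch of the same idea uses re.search, whose match object exposes that index through start(), and then slices the string there. The name split_at_first_digit is hypothetical, and returning an empty second part when no digit is found is an assumption.

```python
import re

def split_at_first_digit(code):
    """Split `code` into (before first digit, from first digit onward)."""
    m = re.search(r'\d', code)
    if m is None:
        # No digit at all: everything belongs to the first part.
        return code, ''
    return code[:m.start()], code[m.start():]

print(split_at_first_digit('abcd2343'))     # ('abcd', '2343')
print(split_at_first_digit('abc3243-23A'))  # ('abc', '3243-23A')
```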

To partition on the first digit
parts = re.split('(\d.*)','abcd2343') # => ['abcd', '2343', '']
parts = re.split('(\d.*)','abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split('(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is in an individual line then instead of s.split( ) use s.splitlines().

Try this code; it splits only at the first space that is followed by a digit:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)',text, 1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
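
To see what the two pieces of that call contribute, the sketch below runs the same pattern with and without a split limit. The lookahead (?=\d) requires a digit after the space without consuming it, and maxsplit=1 stops after the first such space. The variable names are illustrative only.

```python
import re

text = "MARIA APARECIDA 99223-2000 / 98450-8026"

# A space, followed by (but not including) a digit.
boundary = re.compile(r' (?=\d)')

first_only = boundary.split(text, maxsplit=1)
every = boundary.split(text)

print(first_only)  # ['MARIA APARECIDA', '99223-2000 / 98450-8026']
print(every)       # ['MARIA APARECIDA', '99223-2000 /', '98450-8026']
```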
