Multiple character replacement in a column - python

I am trying to replace several different characters with a single character. I can do it with multiple lines of code, but I was wondering if there is something like this to do it in a single line?
df['Column'].str.replace(['_','-','/'], ' ')
I could write 3 separate str.replace() calls and change those characters one by one, but I don't think that would be efficient.

Pandas' DataFrame str.replace takes a regex pattern or a string as its first argument, so you can provide a regex to replace multiple patterns at once.
code:
import pandas as pd
check_df = pd.DataFrame({"Column":["abc", "A_bC", "A_b-C/d"]})
check_df['Column'].str.replace("_|-|/", " ", regex=True)
Output:
0 abc
1 A bC
2 A b C d
Name: Column, dtype: object

You can use a regular expression with an alternating group:
df['Column'].str.replace(r"_|-|/", " ", regex=True)
| means "either of these".
Or you can use str.maketrans to build a translation table and pass it to .str.translate:
df['Column'].str.translate(str.maketrans(dict.fromkeys("_-/", " ")))
Note that this only works for single-character translations.
If the characters are produced dynamically, e.g. collected in a list chars, you can build the regex with "|".join(re.escape(c) for c in chars) for the first way (escaping each character before joining, so the "|" separators keep their meaning), and use "".join(chars) for the second way. re.escape handles regex metacharacters: for example, if "$" is to be replaced, it is the end-of-string anchor in regexes, so it has to be written as "\$", which re.escape takes care of.
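For instance, a quick sketch of the dynamic case (the chars list and the sample value here are made up for illustration):
import re
import pandas as pd
chars = ["_", "-", "/", "$"]  # characters to collapse into spaces
df = pd.DataFrame({"Column": ["A_b-C/d$e"]})
# regex route: escape each character, then join with "|"
pattern = "|".join(re.escape(c) for c in chars)
print(df['Column'].str.replace(pattern, " ", regex=True)[0])  # A b C d e
# translate route: one translation table for all characters
table = str.maketrans(dict.fromkeys("".join(chars), " "))
print(df['Column'].str.translate(table)[0])  # A b C d e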

You could use a character class [/_-] listing the characters that you want to replace.
Note that if there are multiple consecutive delimiter characters and you replace each of them with a space, you will get gaps of spaces. If you don't want that, repeat the character class with a + to match 1 or more characters and replace the whole match with a single space.
If you don't want the leading and trailing spaces either, you can add .str.strip()
Example
import pandas as pd
df = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df['Column'] = df['Column'].str.replace(r"[/_-]", ' ', regex=True)
print(df)
print("\n---------v2---------\n")
df_v2 = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df_v2['Column'] = df_v2['Column'].str.replace(r"[/_-]+", ' ', regex=True).str.strip()
print(df_v2)
Output
        Column
0   a  b c   d
1    a      b

---------v2---------

    Column
0  a b c d
1      a b

Related

Python: string not splitting correctly at "|||" substring

I have a column in a Pandas DataFrame that stores long strings, in which different chunks of information are separated by "|||".
This is an example:
"intermediation|"mechanical turk"|precarious "public policy" ||| intermediation|"mechanical turk"|precarious high-level
I need to split this column into multiple columns, each column containing the string between the separators "|||".
However, while running the following code:
df['query_ids'].str.split('|||', n=5, expand = True)
What I get are splits done at every single character, like this:
0 1 2 3 4 5
0 " r e g ulatory capture"|"political lobbying" policy-m...
I suspect it's because "|" is a Python operator, but I cannot think of a suitable workaround.
You need to escape |:
df['query_ids'].str.split(r'\|\|\|', n=5, expand=True)
or to pass regex=False:
df['query_ids'].str.split('|||', n=5, expand=True, regex=False)
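A quick check on a made-up value (not the question's data) shows the regex=False route splitting on the literal "|||":
import pandas as pd
s = pd.Series(['first chunk ||| second chunk ||| third chunk'])
print(s.str.split('|||', n=5, expand=True, regex=False).values.tolist())
# [['first chunk ', ' second chunk ', ' third chunk']]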

Replace only specified occurrences of string - Python Regex

I am playing around with the re module in Python. What I am stuck on is that I want to replace only a specified occurrence of a substring.
for example
import re
string = "aabbaabbaabbabbaabbaa"
#I want to replace only 3rd time 'bb' appeared in the string with white space
string = re.sub("bb"," ",string,3) #if iI do this all first 3 occurrences got replaced
print(string)
output
aa aa aa aabbaabbaa
Any idea how to replace only the 3rd occurrence,
so the output would look like this:
aabbaabbaa aabbaabbaa
This may not be the perfect way but it is a solution:
string = re.sub('bb',' ',string, 3)
string = re.sub(' ','bb',string,2)
This is just an alternative solution I can think of.
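For example, on the string from the question (just a quick check; note this trick assumes the text contains no other spaces):
import re
string = "aabbaabbaabbabbaabbaa"
string = re.sub('bb', ' ', string, count=3)  # blank out the first 3 occurrences
string = re.sub(' ', 'bb', string, count=2)  # put the first 2 back
print(string)  # aabbaabbaa abbaabbaa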
Modify the regex so it only matches the third occurrence?
re.sub(r'(.*?bb.*?bb.*?)bb', r'\1 ', string, 1)
This could be extended to a large number of repetitions like r'(.*?(bb.*?){9999})bb'
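As a more general take (a sketch of an illustrative helper, not from the original answers), you can also locate all matches with re.finditer and rebuild the string around the n-th one:
import re
def replace_nth(pattern, repl, text, n):
    # Replace only the n-th non-overlapping occurrence of `pattern` in `text`.
    matches = list(re.finditer(pattern, text))
    if len(matches) < n:
        return text
    m = matches[n - 1]
    return text[:m.start()] + repl + text[m.end():]
string = "aabbaabbaabbabbaabbaa"
print(replace_nth("bb", " ", string, 3))  # aabbaabbaa abbaabbaa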

Extract first word for each row in a column under multiple conditions

I have a dataset that contains a column of strings. It looks like
df.a=[['samsung/windows','mobile unknown','chrome/android']].
I am trying to obtain the first word of each row to replace the current string, e.g. [['samsung','mobile','chrome']].
I applied:
df.a=df.a.str.split().str.get(0)
this gives me the first word, but the part after "/" is still attached
df.a=[words.split("/")[0] for words in df.a]
this only splits the strings that contain "/"
Can I get the expected result using one line?
Use re.findall() and keep only the alphanumeric part:
import re
df['a'] = df['a'].apply(lambda x : re.findall(r"[\w']+",x)[0])
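On the sample values this gives (a quick check, building the column from the list in the question):
import re
import pandas as pd
df = pd.DataFrame({'a': ['samsung/windows', 'mobile unknown', 'chrome/android']})
df['a'] = df['a'].apply(lambda x: re.findall(r"[\w']+", x)[0])
print(df['a'].tolist())  # ['samsung', 'mobile', 'chrome']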
You can pass regex syntax directly to the split function to split on / or ' ' using the pipe character |, but this solution only works if those are the only delimiters in your data:
dfa=pd.Series(['samsung/windows','mobile unknown','chrome/android'])
dfa.str.split(r'/| ')
0 [samsung, windows]
1 [mobile, unknown]
2 [chrome, android]
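To keep just the first token from each split (not shown above, but it follows directly), index into the resulting lists with .str[0]:
dfa.str.split(r'/| ').str[0]
0 samsung
1 mobile
2 chrome
dtype: object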
The pandas function extract does exactly what you want:
Extract capture groups in the regex pat as columns in a DataFrame
df['a'].str.extract(r"(\w+)", expand=True)
# 0
# 0 samsung
# 1 mobile
# 2 chrome

How to perform str.strip in dataframe and save it with inplace=true?

I have a dataframe with n columns, and I want to strip the strings in one of the columns. I was able to do it, but I want the change to be reflected in the original dataframe.
Dataframe: data
Name
0 210123278414410005
1 101232784144610006
2 210123278414410007
3 21012-27841-410008
4 210123278414410009
After stripping:
Name
0 10005
1 10006
2 10007
3 10008
4 10009
5 10010
I tried the below code and strip was successful
data['Name'].str.strip().str[13:]
However if I check dataframe, the strip is not reflected.
I am looking for something like inplace parameter.
String methods (the attributes of the .str attribute on a Series) always return a new Series; you can't use them for in-place changes. Your only option is to assign the result back to the same column:
data['Name'] = data['Name'].str.strip().str[13:]
You could instead use the Series.replace() method with a regular expression, and inplace=True:
data['Name'].replace(r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z', r'\1', regex=True, inplace=True)
The regular expression above matches up to 13 characters after leading whitespace, and ignores trailing whitespace and any other characters beyond the first 13 after whitespace is removed. It produces the same output as .str.strip().str[:13], but makes the changes in place.
The pattern is using a negative look-behind to make sure that the final \s* pattern matches all whitespace elements at the end before selecting between 0 and 13 characters of what remains. The \A and \Z anchors make it so the whole string is matched, and the (?s) at the start switches the . pattern (dot, any character except newlines) to include newlines when matching; this way an input value like ' foo\nbar ' is handled correctly.
Put differently, the \A\s* and (?<!\s)\s*\Z patterns act like str.strip() does, matching all whitespace at the start and end, respectively, and no more. The (.{,13}).* pattern matches everything in between, with the first 13 characters of those (or fewer, if there are not enough characters to match after stripping) captured as a group. That one group is then used as the replacement value.
And because . doesn't normally match \n characters, the (?s) flag at the start tells the regex engine to match newline characters anyway. We want all characters to be included after stripping, not just everything except newlines.
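As a quick check on made-up values (a sketch, not part of the original answer), the regex route gives the same result as chaining .str.strip().str[:13]:
import pandas as pd
s = pd.Series(["  210123278414410005  ", " foo\nbar "])
print(s.replace(r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z', r'\1', regex=True).tolist())
# ['2101232784144', 'foo\nbar']
print(s.str.strip().str[:13].tolist())
# ['2101232784144', 'foo\nbar']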
data['Name'].str.strip().str[13:] returns the transformed column, but it does not change the data inside the dataframe in place. You should write:
data['Name'] = data['Name'].str.strip().str[13:]
to write the transformed data to the Name column.
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found that the str functions for pandas Series are usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter: it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and we'll consistently get the last 5 characters instead!
As per yatu's comment: you should reassign the Series with stripped values to the original column.
data['Name'] = data['Name'].str.strip().str[13:]
It is interesting to note that pandas DataFrames are backed by NumPy arrays underneath, so another option is to perform the operation at the NumPy level.
Here is the example I had in mind:
import numpy as np
import pandas as pd
df = pd.DataFrame([['210123278414410005', '101232784144610006']])
dfn = df.to_numpy(copy=False)  # the underlying numpy array
df = pd.DataFrame(np.frompyfunc(lambda s: s[13:], 1, 1)(dfn))
print(df)  # 10005  10006
This doesn't answer your question directly, but it is just another option (it creates a new dataframe from the numpy array, though).

Python: replace whole word dictionary values in pandas df with dictionary key

Question:
I need to match and replace whole words in the pandas df column 'message' with the dictionary value. Is there any way I can do this within the df["column"].replace command, or do I need to find another way to replace whole words?
Background:
In my pandas data frame I have a column of text messages that contain English first names (the dictionary keys), which I'm trying to replace with the dictionary value "FirstName". The specific column in the data frame looks like this, where you can see "tommy" as a single name.
tester.df["message"]
message
0 what do i need to do
1 what do i need to do
2 hi tommy thank you for contacting app ...
3 hi tommy thank you for contacting app ...
4 hi we are just following up to see if you read...
The dictionary is created from a list I extracted from the 2000 census database. It has many different first names that could match inside other text, including 'al' or 'tom', and if I'm not careful the value "FirstName" could end up everywhere across the pandas df message column:
import requests
#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)
str1 = ','.join(list1)
str1 = (str1.lower())
#turn into dictionary with "First Name" as value
str1 = dict((el, 'FirstName') for el in str1)
Now I want to replace whole words within the DF column "message" that match the dictionary keys with the 'FirstName' value. Unfortunately, when I do the following it replaces the text in messages wherever it matches, even partial matches like "al" or "tom" inside other words.
In [254]: tester["message"].replace(str1, regex = True)
Out[254]:
0 wFirstNamet do i neFirstName to do
1 wFirstNamet do i neFirstName to do
2 hi FirstNameFirstName tFirstName you for conFi...
3 hi FirstNameFirstName tFirstName you for conFi...
4 hi we are just followFirstNameg up to FirstNam...
Name: message, dtype: object
Any help matching and replacing the whole key with value is appreciated!
Update / attempt to fix 1: I tried adding some regular expression features to match whole words only.
I tried adding a word-boundary marker to each word in the extracted list from which the dictionary is constructed. Unfortunately the single backslashes get turned into double backslashes, and the keys no longer match during the dictionary key -> value replace.
#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
l = requests.get('https://deron.meranda.us/data/popular-last.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#add regex before
string = 'r"\\'
endstring = '\\b'
list1 = [ string + x + endstring for x in list1]
#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)
str1 = ','.join(list1)
str1 = (str1.lower())
##if we do print(str1) it shows one backslash
##turn to list ...but print() doesn't let us have one backslash anymore
str1 = [x.strip() for x in str1.split(',')]
#turn to dictionary with "firstname"
str1 = dict((el, 'FirstName') for el in str1)
And then when I try to match and replace with the updated dictionary keys containing the word-boundary regular expressions, I get a bad escape error:
tester["message"].replace(str1, regex = True)
" Traceback (most recent call last):
error: bad escape \j "
This might be the right direction, but the backslash to double backslash conversion seems to be tricky...
First you need to prepare the list of names such that each pattern matches the name preceded by either the beginning of the string (^) or whitespace (\s) and followed by either whitespace or the end of the string ($). Then you need to make sure to preserve the preceding and following whitespace (via backreferences). Assuming you have a list first_names which contains all the first names that should be replaced:
replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}
Let's take a look at the regex:
(        # Start group.
  ^|\s   # Match either beginning of string or whitespace.
)        # Close group.
{}       # This is where the actual name will be inserted.
(
  $|\s   # Match either end of string or whitespace.
)
And the replacement regex:
\1 # Backreference; whatever was matched by the first group.
FirstName
\2 # Backreference; whatever was matched by the second group.
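A minimal end-to-end sketch (the names list and message values here are made up, not the census data from the question) showing the dictionary being applied with Series.replace and regex=True:
import pandas as pd
first_names = ['tommy', 'al', 'tom']  # hypothetical; the real list comes from the census file
messages = pd.Series([
    "hi tommy thank you for contacting app",
    "what do i need to do",
    "totally normal text",
])
replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}
print(messages.replace(replacement_dict, regex=True).tolist())
# ['hi FirstName thank you for contacting app', 'what do i need to do', 'totally normal text']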
