Replacing multiple chars in string with another character in Python - python

I have a list of strings. I want to check whether each string contains certain characters and, if it does, replace those characters with another character.
I have something like below:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
word = 'Ad{min > HR'
for c in list(word):
    if c in invalid_chars:
        word = word.replace(c, '_')
print(word)
>>> Ad_min_>_HR
I am trying to convert this into a function using a list comprehension, but I am getting strange characters...
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([word.replace(ch, char_replace) for ch in list(word) if ch in checklist])
print(replace_chars(word, invalid_chars))
>>> Ad_min > HRAd{min_>_HRAd{min_>_HR

Try this general pattern:
''.join([ch if ch not in invalid_chars else '_' for ch in word])
For the complete function:
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([ch if ch not in checklist else char_replace for ch in word])
Note: there is no need to wrap the string word in list(); a string is already iterable.
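A minimal sketch of the same comprehension, assuming the checklist is first converted to a set (an addition that is not part of the answer above) so that each membership test does not scan the whole list:

```python
def replace_chars(word, checklist, char_replace='_'):
    """Return word with every character found in checklist replaced."""
    invalid = set(checklist)  # assumption: set lookup instead of a list scan
    return ''.join(char_replace if ch in invalid else ch for ch in word)


invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
result = replace_chars('Ad{min > HR', invalid_chars)
print(result)  # Ad_min_>_HR
```

Each character is inspected exactly once, so the result has the same length as the input, unlike the version in the question, which joined several full copies of the word.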

This might be a good use for str.translate(). You can turn your invalid_chars into a translation table with str.maketrans() and apply wherever you need it:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
invalid_table = str.maketrans({k:'_' for k in invalid_chars})
word = 'Ad{min > HR'
word.translate(invalid_table)
Result:
'Ad_min_>_HR'
This is especially useful if you need to apply the translation to several strings. It is also more efficient, because you don't loop through the entire invalid_chars list for every character, which is what happens if you use if x in invalid_chars inside a loop.
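As a rough illustration of reusing one table for several strings (the sample words below are made up for the example):

```python
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
# Build the table once; dict.fromkeys maps every invalid character to '_'.
invalid_table = str.maketrans(dict.fromkeys(invalid_chars, '_'))

words = ['Ad{min > HR', 'a=b;c', 'clean']  # hypothetical sample inputs
cleaned = [w.translate(invalid_table) for w in words]
print(cleaned)  # ['Ad_min_>_HR', 'a_b_c', 'clean']
```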

This is easier with regex. You can search for a whole group of characters with a single substitution call. It should perform better too.
>>> import re
>>> re.sub(f"[{re.escape(''.join(invalid_chars))}]", "_", word)
'Ad_min_>_HR'
The code in the f-string builds a regex pattern that looks like this
>>> pattern = f"[{re.escape(''.join(invalid_chars))}]"
>>> print(repr(pattern))
'[\\ ,;\\{\\}\\(\\)\\\n\\\t=]'
>>> print(pattern)
[\ ,;\{\}\(\)\
\ =]
That is, a regex character set containing each of your invalid chars. (The backslash escaping ensures that none of them are interpreted as a regex control character, regardless of which characters you put in invalid_chars.) If you had specified them as a string in the first place, the ''.join() would not be required.
You can also compile the pattern (using re.compile()) if you need to re-use it on multiple words.
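A short sketch of the compiled variant, assuming the invalid characters are given as a single string and using made-up sample words:

```python
import re

invalid_chars = ' ,;{}()\n\t='  # given as a string, so no join is needed
invalid_re = re.compile(f"[{re.escape(invalid_chars)}]")

words = ['Ad{min > HR', 'x(y)']  # hypothetical sample inputs
cleaned = [invalid_re.sub('_', w) for w in words]
print(cleaned)  # ['Ad_min_>_HR', 'x_y_']
```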

Related

How to remove punctuation from a string [duplicate]

This question already has answers here:
Best way to strip punctuation from a string
One of the projects I've been working on is a word counter, and to build it I have to remove all punctuation from a string.
I tried using the split method to split at punctuation, but this makes the resulting list very odd (instead of one word per item, an item can contain 5 words). I then tried keeping a list or a string of punctuation characters and using a for loop to eliminate them, but neither approach worked:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
for i in content_string.lower():
    if i in punctuation:
        i = i.replace[i," "]
    else:
        i = i
It says that
"TypeError: 'type' object is not subscriptable"
This message appears both when using a string or using a list.
You are mixing up parentheses and square brackets.
list and replace are callables, so their arguments are passed in parentheses.
Also, try to describe your algorithm in words, for example:
For each forbidden character, remove it from my content (replace it with a space).
Here is an implementation you can start with:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
for i in punctuation:
    content_string = content_string.replace(i, " ")
To create a list, use l = [...], not l = list[...]. Functions and methods (such as str.replace) are called with parentheses, not square brackets. That said, you can use re.sub to do this in a simpler way:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'] # '(', ')' not `()`
import re
new_string = re.sub('|'.join(map(re.escape, punctuation)), '', content_string)
print(new_string)
Output:
This is a test to see whether or not the code can eliminate punctuation
Your error
"TypeError: 'type' object is not subscriptable"
comes from the line
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
To define a list, either use square brackets [ ] without list, or, if you do use list, call it with parentheses (although here converting a list into a list is redundant).
# both options work, but the second one is redundant
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'])
Notice that the last element '()' must be split into two elements, '(' and ')'.
Now, to achieve what you want concisely, use a list comprehension with a conditional expression:
''.join([i if i not in punctuation else ' ' for i in content_string])
Result:
'This  is a test  to see  whether  or not  the code can eliminate punctuation'
Notice that, as in your code, this does not remove the punctuation symbols but replaces them with spaces, so a doubled space appears wherever a punctuation mark was followed by a space.
There are multiple bugs in the code.
First one:
The list call is unnecessary here (and list is a built-in type, not a keyword).
If you wanted to use it, you would need to call it with parentheses () around the already defined list.
BAD punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
BETTER punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '()'])
But simply defining the list with regular [] syntax would be enough, and also more efficient than a list() call.
Second one:
You will not be able to replace parentheses with the if i in punctuation: check.
This is because '()' is a two-character string while you are iterating over single characters, so '(' or ')' is compared with '()' and never matches.
A possible fix: add the parentheses to the punctuation list separately, as single characters.
Third, not really a bug but a redundant else branch:
else:
    i = i
This serves no purpose whatsoever; you should drop the else branch.
Fourth, the most apparent bug:
In your for loop you are rebinding the variable i, which refers to a single character from the string you are iterating over, so the original string never changes. Strings are immutable, so to change characters in place you first have to turn the string into a list; enumerate then gives you the index of each character:
chars = list(content_string.lower())
for i, char in enumerate(chars):
    if char in punctuation:
        chars[i] = ' '
content_string = ''.join(chars)
Anyway, the goal you are trying to achieve can come down to a one-liner, using a list comprehension and a string join on the resulting list afterwards:
content_string = ''.join([char if char not in punctuation else ' ' for char in content_string.lower()])
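For the word counter described in the question, here is a minimal sketch. It assumes that the ASCII characters in string.punctuation are the ones to strip (the question lists fewer) and that collections.Counter is an acceptable way to count; neither choice comes from the answers above.

```python
import string
from collections import Counter

content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"

# The third argument of str.maketrans lists characters to delete.
remove_punct = str.maketrans('', '', string.punctuation)
words = content_string.lower().translate(remove_punct).split()
counts = Counter(words)

print(len(words))      # 14
print(counts['test'])  # 1
```

Calling split() with no argument also collapses runs of whitespace, so deleting the punctuation (rather than replacing it with spaces) does not produce empty words.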

Python: delete all characters before the first letter in a string

After a thorough search I could find how to delete all characters before a specific letter but not before any letter.
I am trying to turn a string from this:
" This is a sentence. #contains symbol and whitespace
To this:
This is a sentence. #No symbols or whitespace
I have tried the following code, but strings such as the first example still appear.
for ch in ['\"', '[', ']', '*', '_', '-']:
    if ch in sen1:
        sen1 = sen1.replace(ch,"")
Not only does this fail to delete the double quote in the example for some unknown reason but also wouldn't work to delete the leading whitespace as it would delete all of the whitespace.
Thank you in advance.
Instead of just removing whitespace, to remove every character before the first letter, do this:
# s is your string
pos = len(s)  # if there is no letter at all, the result is empty
for i, x in enumerate(s):
    if x.isalpha():  # True if it's a letter
        pos = i  # position of the first letter
        break
new_str = s[pos:]
import re
s = " sthis is a sentence"
r = re.compile(r'.*?([a-zA-Z].*)')
print(r.findall(s)[0])
Strip all leading whitespace and punctuation:
>>> import string
>>> text.lstrip(string.punctuation + string.whitespace)
'This is a sentence. #contains symbol and whitespace'
Or, an alternative, find the first character that is an ascii letter. For example:
>>> pos = next(i for i, x in enumerate(text) if x in string.ascii_letters)
>>> text[pos:]
'This is a sentence. #contains symbol and whitespace'
This is a very basic version; i.e. it uses syntax that beginners in Python will easily understand.
your_string = "1324 $$ '!' '' # this is a sentence."
while len(your_string) > 0 and your_string[0] not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz":
    your_string = your_string[1:]
print(your_string)
#prints "this is a sentence."
Pros: Simple, no imports
Cons: The while loop could be avoided if you feel comfortable using list comprehensions.
Also, the string that you're comparing to could be simpler using regex.
Drop everything up to the first alpha character.
import itertools as it
s = " - .] * This is a sentence. #contains symbol and whitespace"
"".join(it.dropwhile(lambda x: not x.isalpha(), s))
# 'This is a sentence. #contains symbol and whitespace'
Alternatively, iterate the string and test if each character is in a blacklist. If true strip the character, otherwise short-circuit.
def lstrip(s, blacklist=" "):
    for c in s:
        if c in blacklist:
            s = s.lstrip(c)
            continue
        return s
    return s  # every character was in the blacklist
lstrip(s, blacklist='\"[]*_-. ')
# 'This is a sentence. #contains symbol and whitespace'
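You can use re.sub, anchoring the pattern at the start of the string so that only the leading non-letters are removed:
import re
text = " This is a sentence. #contains symbol and whitespace"
re.sub(r"^[^a-zA-Z]+", "", text)
The call has the form re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH).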
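Several of the answers above do not say what happens when the string contains no letter at all. A small sketch of one way to handle that case, assuming an empty result is wanted and that non-ASCII letters should count as letters (str.isalpha() accepts them); the function name is made up for the example:

```python
def drop_before_first_letter(text):
    """Return text starting at its first letter, or '' if it has none."""
    # next() with a default avoids StopIteration when no letter is found.
    pos = next((i for i, ch in enumerate(text) if ch.isalpha()), len(text))
    return text[pos:]


sample = '"  This is a sentence. #contains symbol and whitespace'
print(drop_before_first_letter(sample))
# This is a sentence. #contains symbol and whitespace
```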

Escaping regex unicode string in Python

I have a user defined string.
I want to use it in a regex with one small improvement: an apostrophe in the string should match any of three apostrophe characters instead of just one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin text, but for Unicode the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
Looks like it finds backslashes in both strings and then substitutes APOSTROPHES
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"
What I understand is that you want to create a regular expression that matches a given word with any kind of apostrophe.
A regex that matches any apostrophe can be defined as a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
EDIT: If your word contains another kind of apostrophe, you can normalize it first, like this:
word = re.sub(APOSTROPHES_REGEX, "'", word)
To build the regex, escape the word (it may contain characters that are special in a regex, such as punctuation). Before Python 3.7, re.escape also turned the single quote "'" into an escaped single quote, r"\'"; since Python 3.7 it leaves the quote unchanged.
Undo that escaping if it is present, then replace the apostrophe with your apostrophe regex:
import re
word_regex = re.escape(word)
word_regex = word_regex.replace(r"\'", "'").replace("'", APOSTROPHES_REGEX)
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: in Python 3, str patterns are matched as Unicode by default, so the re.UNICODE flag is only needed in Python 2, where it affects character classes such as r"\w".
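A sketch of an alternative that does not depend on how re.escape treats the apostrophe: test each raw character first and escape it afterwards. The helper name and constants below are made up for the example.

```python
import re

APOSTROPHES = "'\u2019\u02bc"
APOSTROPHE_CLASS = '[' + re.escape(APOSTROPHES) + ']'


def apostrophe_insensitive_pattern(word):
    """Compile a pattern matching word with any of the three apostrophes."""
    # The membership test runs on the raw character, never on the
    # backslashes that re.escape may add.
    parts = [APOSTROPHE_CLASS if ch in APOSTROPHES else re.escape(ch)
             for ch in word]
    return re.compile(''.join(parts))


pattern = apostrophe_insensitive_pattern("п'ять")
print(bool(pattern.fullmatch("п\u2019ять")))  # True
print(bool(pattern.fullmatch("пять")))        # False
```

This avoids the problem in the question, where the comprehension iterated over an already escaped string and treated every backslash as an apostrophe.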

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This treats ' .,()_ as one pattern that has to match as a whole sequence, not as a set of alternative delimiters. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, instead of writing every delimiter as a separate alternative joined with |, you can use []s to state that everything inside should be treated as "match one of these".
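To make the difference concrete, here is a small comparison on the string from the question (the variable names are only for illustration):

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# Without brackets the pattern must match as one sequence, which never
# occurs in row, so nothing is split.
no_class = re.split(r"' .,()_", row)

# With brackets, each listed character is a delimiter on its own.
with_class = re.split(r"[' .,()_]", row)
words = [piece for piece in with_class if piece]  # drop empty strings

print(no_class)  # ['feature.append(freq_and_feature(text, freq))']
print(words)     # ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```

Adjacent delimiters such as ", " or "))" produce empty strings in the raw result, which is why the sketch filters them out.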
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches one or more characters that are either non-word characters (\W, equivalent to [^a-zA-Z0-9_] for ASCII text) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
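The trimming step mentioned above could look like this; it is a sketch using the same sample string:

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Strip delimiter runs at both ends first, so re.split yields no
# empty strings and no filtering is needed afterwards.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
words = re.split(r'[\W_]+', trimmed)
print(words)  # ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```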
You can try this:
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the last piece (here the '\n' left after the final delimiter)
print(parts)
This results in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

How can I simplify this function?

Is there any way to simplify this function? Specifically, I'd like to rewrite it with fewer lines of indentation.
# split string (first argument) at location of separators (second argument, should be a string)
def split_string(text, separators):
    text = ' ' + text + ' '
    words = []
    word = ""
    for char in text:
        if char not in separators:
            word += char
        else:
            if word:
                words.append(word)
            word = ""
    if not words:
        words.append(text)
    return words
Try using re.split, for example:
re.split('[%s]' % re.escape(separators), text)
The [] creates a regular expression character class to split on; re.escape keeps characters such as ] or ^ in separators from being read as regex syntax.
Your code seems to produce
>>> split_string("foo.,.bar", ".,")
[' foo']
but your comment says
split_string("foo.,.bar", ".,") will return ["foo", "bar"]
Assuming the comment is what's intended, then I'd use itertools.groupby (I hate using regexes):
from itertools import groupby
def splitter(text, separators):
    grouped = groupby(text, lambda c: c in separators)
    return [''.join(g) for k,g in grouped if not k]
which gives
>>> splitter("foo.,.bar", ".,")
['foo', 'bar']
groupby returns an iterator over consecutive terms grouped by some function -- in this case, lambda c: c in separators -- of the terms.
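To see what groupby produces before the separator runs are filtered out, this sketch prints every run together with its key:

```python
from itertools import groupby

text, separators = "foo.,.bar", ".,"

# Each run of consecutive characters sharing the same key becomes one group.
runs = [(is_sep, ''.join(chars))
        for is_sep, chars in groupby(text, lambda c: c in separators)]
print(runs)  # [(False, 'foo'), (True, '.,.'), (False, 'bar')]
```

Keeping only the groups whose key is False leaves the words.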
You should use the split() method. Taken from the official documentation:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string.
If maxsplit is given, at most maxsplit splits are done (thus, the list will have
at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no
limit on the number of splits (all possible splits are made).
If sep is given, consecutive delimiters are not grouped together and are deemed
to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
The sep argument may consist of multiple characters (for example,
'1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a
specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is applied:
runs of consecutive whitespace are regarded as a single separator, and the result
will contain no empty strings at the start or end if the string has leading or
trailing whitespace. Consequently, splitting an empty string or a string consisting
of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
You can do:
myString = "Some-text-here"
splitWords = myString.split("-")
The above code returns a list of the separated words. I used "-" as the delimiter; you may use any delimiter you like. With no argument, split() separates on whitespace, like this:
myString = "Some text here"
splitWords = myString.split()
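Note that str.split accepts only one delimiter string at a time. For several single-character separators, as in the original function, a sketch based on re.split might look like this (it assumes separators is a non-empty string and that empty pieces should be dropped):

```python
import re


def split_string(text, separators):
    """Split text at any character in separators, dropping empty pieces."""
    # Assumption: separators is non-empty; '[]+' would not be a valid pattern.
    pattern = '[' + re.escape(separators) + ']+'
    return [piece for piece in re.split(pattern, text) if piece]


print(split_string("foo.,.bar", ".,"))  # ['foo', 'bar']
```

This version has a single level of indentation inside the function, which is what the question asked for.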
