Delete unknown special character - python

s = "____Ç_apple___ _______new A_____"
print(re.sub(r'[^0-9a-zA-Z]\s+$', '', s))
# result: ____Ç_apple___ _______new A_____  (nothing matched, so nothing was removed)
print(re.sub(r'[^0-9a-zA-Z]', '', s))
# result: applenewA
The final result I want is:
apple new A
but I cannot get it. I want to delete Ç and _ while keeping the spaces and the English letters.

Since you want to consolidate multiple spaces into one space, and then remove characters that are not words or spaces, you should do it in two separate regex substitutions:
print(re.sub(r'[^0-9a-zA-Z ]+', '', re.sub(r'\s+', ' ', s)))
This outputs:
apple new A
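Putting that together as a runnable snippet, using the string from the question:

```python
import re

s = "____Ç_apple___ _______new A_____"

# First collapse whitespace runs into a single space, then strip
# everything that is not a letter, digit, or space.
result = re.sub(r'[^0-9a-zA-Z ]+', '', re.sub(r'\s+', ' ', s))
print(result)  # apple new A
```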

You want 'apple new A' for the result, right?
import re

s = "____Ç_apple___ _______new A_____"
result = re.sub(r'[^a-zA-Z\s]+', '', s)  # removes Ç and _ (note: a | inside [] would match a literal pipe, so leave it out)
result = ' '.join(result.split())        # collapses any runs of whitespace -> 'apple new A'
print(result)

General regex about english words, numbers, length and specific symbols

I am working on an NLP task and I want to do a general cleaning for a specific purpose that isn't important to explain further.
I want a function that
Remove non English words
Remove words that are full in capital
Remove words that have '-' in the beginning or the end
Remove words that have length less than 2 characters
Remove words that have only numbers
For example if I have the following string
'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
the output should be
'George play 123_134 _123pantis'
The function that I have created already is the following:
def clean(text):
    # remove words that aren't English words (isn't working)
    #text = re.sub(r'^[a-zA-Z]+', '', text)
    # remove words that are fully in capitals
    text = re.sub(r'(\w*[A-Z]+\w*)', '', text)
    # remove words that start with or contain - (isn't working)
    text = re.sub(r'(\s)-\w+', '', text)
    # remove words with length less than 2 characters (is working)
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # remove words with only numbers (isn't working)
    text = re.sub(r'[0-9]+', '', text)
    return text
The output is
- play _ foot-ball _ _pantis ελλαδα
which is not what I need. Thank you very much for your time and help!
You can do this in a single re.sub call.
Search using this regex:
(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*
and replace with empty string.
RegEx Demo
Code:
import re
s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
r = re.sub(r'(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*', '', s)
print(r)
# George play 123_134 _123pantis
Online Code Demo
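As a more readable (if less compact) alternative, the same rules can be applied token by token in plain Python. This is a sketch that treats any hyphenated token as removable (which matches the expected output, where both foot-ball and -wants disappear) and uses str.isascii() (Python 3.7+) as a stand-in for "non-English":

```python
def clean(text):
    def keep(word):
        if not word.isascii():   # drop non-English (here: non-ASCII) words
            return False
        if word.isupper():       # drop fully capitalised words
            return False
        if '-' in word:          # drop hyphenated words, e.g. foot-ball, -wants
            return False
        if len(word) <= 2:       # drop very short words
            return False
        if word.isdigit():       # drop number-only words
            return False
        return True
    return ' '.join(w for w in text.split() if keep(w))

s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
print(clean(s))  # George play 123_134 _123pantis
```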

Pandas remove non-alphanumeric characters from string column

With pandas in a Jupyter notebook, I would like to delete everything that is not a letter: hyphens, special characters, etc. For example:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
change with:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
df['firstname'] = df['firstname'].str.replace(r'!', '')
df['firstname'] = df['firstname'].str.replace(r'^', '')
df['firstname'] = df['firstname'].str.replace(r'|', '')
df['firstname'] = df['firstname'].str.replace(r'§', '')
df['firstname'] = df['firstname'].str.replace(r':', '')
df['firstname'] = df['firstname'].str.replace(r')', '')
......
......
df
It seems to work, but on more heavily populated columns I always miss some characters.
Is there a way to eliminate all non-text characters completely and keep only the word or words in the column? In the example I used firstname to illustrate the idea, but it would also apply to columns containing whole words.
Thanks!
P.S. It should also handle encoded text such as emoticons.
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
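As a self-contained example with the data from the question (using + so that each run of unwanted characters collapses into a single space before trimming):

```python
import pandas as pd

df = pd.DataFrame({'firstname': ['joe-down§', 'lucash brown_ :)', '^antony', 'mary|'],
                   'birthday_date': ['02-12-1990', '06-09-1980', '11-02-1987', '14-12-2002']})

# Replace each run of non-alphanumeric characters with one space, then trim.
df['firstname'] = (df['firstname']
                   .str.replace(r'[^a-zA-Z0-9]+', ' ', regex=True)
                   .str.strip())
print(df['firstname'].tolist())  # ['joe down', 'lucash brown', 'antony', 'mary']
```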
Try the below; it works on the names used in the post:
first_names = ['joe-down§', 'lucash brown_', '^antony', 'mary|']
clean_names = []
keep = {'-', ' '}
for name in first_names:
    clean_names.append(''.join(c if c not in keep else ' '
                               for c in name if c.isalnum() or c in keep))
print(clean_names)
output
['joe down', 'lucash brown', 'antony', 'mary']

python bypass re.finditer match when searched words are in a defined expression

I have a list of words (find_list) that I want to find in a text, and a list of expressions containing those words (scape_list) that I want to skip when they appear in the text.
I can find all the words in the text using this code:
find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'\W{}\W'.format(word)
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        final_list.append(word_.group(0))
The final_list is:
[' name ', ' name ', ' name ', ' Name.', ' small ', ' Small ', ' Small ']
Is there a way to bypass expressions listed in scape_list and obtain a final_list like this one:
[' name ', ' name ', ' Name.', ' small ']
final_list and scape_list are always being updated. So I think that regex is a good approach.
You can capture the word before and after the find_list word in the regex, then check that neither combination is present in the scape_list. I have added comments where I have changed the code. (Also consider changing scape_list to a set if it can become large in future.)
find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'(\w*\W)({})(\W\w*)'.format(word)  # change the regex to capture adjacent words
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        if ((word_.group(1) + word_.group(2)).strip().lower() not in scape_list
                and (word_.group(2) + word_.group(3)).strip().lower() not in scape_list):  # added this condition
            final_list.append(word_.group(2))  # changed here
print(final_list)
['name', 'name', 'Name', 'small']
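An alternative sketch that avoids the neighbour-word bookkeeping: build one alternation that tries the scape_list phrases first and only captures a bare find_list word. Escape phrases are then consumed by the regex engine without being collected. (Matches come out in text order rather than grouped per word.)

```python
import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = ("My name is Klaus and my middle name is Smith. I work for a small "
        "company. The company name is Small Software. Small Software sells "
        "Software Name.")

# The escape phrases come first in the alternation, so they win at any
# position where they apply; only the bare words are captured in group 1.
pattern = re.compile(
    r'(?:{})|\b({})\b'.format(
        '|'.join(map(re.escape, scape_list)),
        '|'.join(map(re.escape, find_list))),
    re.IGNORECASE)

final_list = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(final_list)  # ['name', 'name', 'small', 'Name']
```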

Split text but include pattern in the first splitted part

This looks obvious, but I couldn't find anything similar. I want to split some text so that the pattern used as the split condition stays at the end of the preceding part.
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
By the way, I am only interested in an re solution, not an nltk library that does it with other methods.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you want to also strip the leading spaces from the second two sentences, you can have \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or, (with Python 3.7+ or with the regex module) use a zero width lookbehind that will split immediately after a .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
That will split the same even if there is no space after the ..
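For instance, with no space after a period (Python 3.7+ for the zero-width split):

```python
import re

some_text = "Hi there.It's a nice weather."
# Split immediately after each '.', keeping the period attached.
parts = re.split(r'(?<=\.)', some_text)
print(parts)  # ['Hi there.', "It's a nice weather.", '']
```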
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautiful day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautiful day.']
re explanation:
(?<=\.) => positive lookbehind: a . must precede the position for the next token to match.
\s => matches a whitespace character (space, tab, newline).
| => alternation: attempts the expression on its left first, then the one on its right.
\. => matches a literal period.
(?=[A-Z]) => lookahead: the period only matches if the next character is a capital letter.
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']

Regex noob question: getting several words/sentences from one line, max separation being 1 whitespace?

I'm not terribly familiar with Python regex, or regex in general, but I'm hoping to demystify it all a bit more with time.
My problem is this: given a string like ' Apple Banana Cucumber Alphabetical Fruit Whoops' (with runs of spaces between the items), I'm trying to use python's re.findall function to produce a list that looks like this: my_list = [' Apple', ' Banana', ' Cucumber', ' Alphabetical Fruit', ' Whoops']. In other words, I'm trying to find a regex expression that can [look for a bunch of whitespace followed by some non-whitespace], and then check if there is a single space with some more non-whitespace characters after that.
This is the expression I've written that gets me close, but not quite (my_string holds the input string):
re.findall(r"\s+\S+\s{1}\S*", my_string)
Which results in:
[' Apple ', ' Banana ', ' Cucumber ', ' Alphabetical Fruit']
I think this result makes sense. It first finds the whitespace, then some non-whitespace, but then it looks for at least one whitespace (which leaves out 'Whoops'), and then looks for any number of other non-whitespace characters (which is why there's no space after 'Alphabetical Fruit'). I just don't know what character combination would give me the intended result.
Any help would be hugely appreciated!
-WW
You can do:
\s+\w+(?:\s\w+)?
\s+\w+ matches one or more whitespace characters, followed by one or more of [A-Za-z0-9_]
(?:\s\w+)? is an optional (?, zero or one) non-capturing group ((?:)) that matches a whitespace (\s) followed by one or more of [A-Za-z0-9_] (\w+). Essentially this is to match Fruit in Alphabetical Fruit.
Example:
In [701]: text = ' Apple Banana Cucumber Alphabetical Fruit Whoops'
In [702]: re.findall(r'\s+\w+(?:\s\w+)?', text)
Out[702]:
[' Apple',
' Banana',
' Cucumber',
' Alphabetical Fruit',
' Whoops']
Your pattern almost works already; just make the second part (the 'compound word' part) optional, and keep the group non-capturing so that re.findall returns the full matches rather than only the group's contents:
\s+\S+(?:\s\S+)?
https://regex101.com/r/Ua8353/3/
(fixed \s{1} per #heemayl)
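A quick check of the corrected pattern (with explicit runs of spaces between items, since a single space is what marks a compound word):

```python
import re

text = '   Apple    Banana    Cucumber    Alphabetical Fruit    Whoops'
# Non-capturing group, so findall returns the whole match of each item.
matches = re.findall(r'\s+\S+(?:\s\S+)?', text)
print(matches)
# ['   Apple', '    Banana', '    Cucumber', '    Alphabetical Fruit', '    Whoops']
```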
