Regex - get all digits and letters (with non roman chars) [duplicate] - python

I need to use regex to strip punctuation at the start and end of a word. It seems like regex would be the best option for this. I don't want punctuation removed from words like 'you're', which is why I'm not using .replace().

You don't need a regular expression for this task. Use str.strip with string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'
>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"

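If you still prefer a regex, a minimal sketch of the same idea is to anchor the pattern at both ends of the word. It assumes that "punctuation" means any character that \w does not match, so underscores are kept and non-ASCII punctuation such as « » is stripped. strip_edges and EDGE_PUNCT are hypothetical names chosen for illustration.

```python
import re

# Runs of non-word characters at the very start or the very end of the string.
EDGE_PUNCT = re.compile(r"^\W+|\W+$")

def strip_edges(word):
    """Remove leading and trailing non-word characters, keeping inner ones."""
    return EDGE_PUNCT.sub("", word)

sentence = "Hello, world. I'm a boy, you're a girl."
print(" ".join(strip_edges(w) for w in sentence.split()))
# Hello world I'm a boy you're a girl
```

Because the pattern only matches at the edges, the apostrophe in "you're" is left alone.
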
I think this function is a concise way to remove punctuation from a list of words (note that it also removes punctuation inside a word, such as the apostrophe in "you're"):
import re
def remove_punct(words):
    new_words = []
    for word in words:
        w = re.sub(r'[^\w\s]', '', word)  # remove everything except word characters and whitespace
        w = re.sub(r'_', '', w)  # \w also matches underscores, so remove them separately
        new_words.append(w)
    return new_words

If you insist on using regex, I recommend this solution:
import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me
Note also that you can pass your own punctuation marks:
my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
            '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_',
            '`', '{', '|', '}', '~', '»', '«', '“', '”']
punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19

You can drop standalone punctuation tokens from a text file or a string as follows (no regular expression is needed):
new_data = []
with open('/home/rahul/align.txt', 'r') as f:
    all_words = f.read().split()
# You can add and remove punctuation characters as you like
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
# To process a string instead of a file, split it into all_words, e.g.
# all_words = "My name$## is . rahul -~".split()
for word in all_words:
    # Keep the word unless it occurs as a substring of punctuations,
    # so only tokens made up purely of punctuation are dropped
    if word not in punctuations:
        new_data.append(word)
print(new_data)
Hope this helps!

Related

How do I find the first element of intersection of a string from a list?

For example, I want to get "," printed given the following string and list because it's the first character of the string that appears in my list of characters.
my_list = [',', '.', ';', ':']
my_string = "Hello world, I am a programmer."
The whole intersection list would be ',' and '.' with ',' being the first and therefore what I want to print
I've tried the following code, but is there a shorter way to do it?
my_list = [',', '.', ';', ':']
my_string = "Hello world, I am a programmer."
my_set = set(my_string).intersection(my_list)
my_list2 = [my_string.find(i) for i in my_set]
my_list2.sort()
num1 = my_list2[0]
print(my_string[num1])
From what I understand, you want the first character of the string that is one of the characters you specify. If so, could you do something like this?
my_chars = [',', '.', ';', ':']
my_string = "Hello world, I am a programmer."
for char in my_string:
    if char in my_chars:
        print(char)
        break
You can use next:
my_set = {',', '.', ';', ':'}
my_string = "Hello world, I am a programmer."
output = next((x for x in my_string if x in my_set), '')
print(output) # ,
If there are no common characters, it returns '' (an empty string).
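For comparison, here is a sketch of the same lookup with the re module, which builds a character class from the list and lets re.search find the leftmost match. It assumes every element of the list is a single character; first_of is a hypothetical helper name.

```python
import re

def first_of(chars, text):
    """Return the first character of text that is in chars, or '' if none."""
    if not chars:
        return ""  # an empty character class "[]" would be an invalid pattern
    # Escape each character so that ']' or '^' cannot break the character class.
    pattern = "[" + "".join(re.escape(c) for c in chars) + "]"
    match = re.search(pattern, text)
    return match.group(0) if match else ""

print(first_of([',', '.', ';', ':'], "Hello world, I am a programmer."))  # ,
```

Like the next() version, it returns an empty string when nothing matches.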

Join split words and punctuation correctly

So I have this list:
list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']
I want to join the list into one string so that some punctuation attaches to the preceding word while other punctuation does not.
I am currently using:
list1 = " ".join(list1)
re.sub(r' (?=\W)', '', list1)
This attaches every punctuation mark to the previous element:
hi there! i work for Spencer& Co
But I want:
hi there! i work for Spencer & Co
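I personally avoid regular expressions, since plain logic is easier for me to understand. Here is a short solution you could use for your example:
list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']
attach_to_previous = {'!', '?', '.', ','}  # assumed set of marks that join the previous word
output = ""
for part in list1:
    if output and part not in attach_to_previous:
        output += " "
    output += part
print(output)  # hi there! i work for Spencer & Co
The loop adds a separating space before every element except the first one and the marks in attach_to_previous; that set is an assumption, so adjust it to the punctuation you need.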
You could use a negated character set in your look-ahead and list the special character(s) that should keep their space (here list1 is the already joined string):
>>> re.sub(r' (?=[^\w&])', '', list1) # & is excluded, so the space before it stays
'hi there! i work for Spencer & Co'
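Put together as a runnable sketch, the look-ahead approach might look like this. join_tokens and its keep_spaced parameter are hypothetical names, and the character class additionally excludes whitespace so that a space is never removed in front of another space.

```python
import re

def join_tokens(tokens, keep_spaced="&"):
    """Join tokens with spaces, then glue punctuation to the previous word.

    Characters listed in keep_spaced keep the space in front of them.
    """
    joined = " ".join(tokens)
    # Remove a space when the next character is neither a word character,
    # nor whitespace, nor one of the keep_spaced characters.
    pattern = r" (?=[^\w\s" + re.escape(keep_spaced) + "])"
    return re.sub(pattern, "", joined)

list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']
print(join_tokens(list1))  # hi there! i work for Spencer & Co
```

Passing more characters in keep_spaced (for example "&+") keeps those surrounded by spaces as well.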

Split a string in Python on spaces, punctuation marks, Unicode characters, etc.

I want to split a string like this:
string = '[[he (∇((comesΦf→chem,'
on spaces, punctuation marks, and Unicode characters. The output I expect looks like this:
out = ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
I am using
re.findall(r"\w+|\W", string, re.UNICODE)
for this case, but it returned the following output:
output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']
How can I solve this problem?
Without using regexes, and assuming words only contain ASCII letters:
from string import ascii_letters
from itertools import groupby
LETTERS = frozenset(ascii_letters)
def is_alpha(char):
    return char in LETTERS
def split_string(text):
    for key, tokens in groupby(text, key=is_alpha):
        if key:  # Found letters, join them and yield a word
            yield ''.join(tokens)
        else:  # not letters, just yield the single tokens
            yield from tokens
Example result:
In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
If you are using a Python version older than 3.3, you can replace yield from tokens with:
for token in tokens:
    yield token
If you are on Python 2, keep in mind that split_string expects a unicode string.
Note that by modifying the is_alpha function you can define different kinds of grouping. For example, to treat all Unicode letters as letters, set is_alpha = str.isalpha (or unicode.isalpha in Python 2):
In [3]: is_alpha = str.isalpha
In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']
Note that 'comesΦf', which was split before, is now a single token.
Hope this helps.
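If you only need the words, you can split on runs of non-word characters. The output below is from Python 2 with a byte string; in Python 3, Φ counts as a word character, so the result is ['', 'he', 'comesΦf', 'chem', '']:
In [33]: string = '[[he (∇((comesΦf→chem,'
In [34]: re.split(r'\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']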
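If a regex is preferred, the following Python 3 sketch reproduces the expected output from the question. It assumes, as the groupby answer does, that a word is a run of ASCII letters and that every other character (spaces included) becomes its own token; TOKEN and tokenize are hypothetical names.

```python
import re

# One run of ASCII letters, or any single other character (spaces included).
TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]")

def tokenize(text):
    """Split text into ASCII words and single non-letter characters."""
    return TOKEN.findall(text)

print(tokenize('[[he (∇((comesΦf→chem,'))
# ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
```

In Python 3 the string is already Unicode, so ∇, Φ and → come out as single characters rather than as separate bytes.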

How to remove specific symbols from a text using python?

I have a string like this:
string = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
These are the symbols I want to remove from my string:
!, @, #, %, ^, &, *, (, ), _, +, =, `, /
What I have tried is:
listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
exceptionals = set(chr(e) for e in listofsymbols)
string.translate(None,exceptionals)
The error is:
an integer is required
Please help me do this!
Try this (Python 2, where str.translate takes the characters to delete as its second argument):
>>> my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> my_str.translate(None, '!@#%^&*()_+=`/')
'This is my text of 2013-02-11,  it contained characters like this Exceptional'
Also, please refrain from giving variables names that shadow built-ins or standard library modules.
How about this? I've also renamed string to s so that it doesn't shadow the standard library module string.
>>> s = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
>>> print(''.join([i for i in s if i not in listofsymbols]))
This is my text of 2013-02-11,  it contained characters like this Exceptional
Another proposal, easily extended to more complex filter criteria or other input types (the built-in filter replaces itertools.ifilter, which no longer exists in Python 3):
def is_valid(c):
    return c not in "!@#%^&*()_+=`/"
# my_string is the input text from the question
print("".join(filter(is_valid, my_string)))
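In Python 3, str.translate no longer accepts None plus a string of characters to delete. A minimal sketch of the equivalent builds a deletion table with str.maketrans; remove_symbols and delete_table are hypothetical names, and the symbol list is the one from the question.

```python
symbols = "!@#%^&*()_+=`/"
# The third argument of str.maketrans lists characters to delete.
delete_table = str.maketrans("", "", symbols)

def remove_symbols(text):
    """Return text with every character from symbols removed."""
    return text.translate(delete_table)

s = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
print(remove_symbols(s))
# This is my text of 2013-02-11,  it contained characters like this Exceptional
```

Note the two spaces left behind where the "&" was removed; collapse them afterwards if that matters.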

python regular expression

I am a newbie to Python. I have an array of words, and I need to check whether each word contains any special characters or digits. If it does, I have to skip that word. How should I do this?
Does it have to be a regular expression? If not, you can use the isalpha() string method.
My reading of the problem is that you want to discard any words that contain non-alphabetical characters. Try the following:
>>> array = ['hello', 'hello2', '?hello', '?hello2']
>>> filtered = list(filter(str.isalpha, array))
>>> print(filtered)
['hello']
You could also write it as a list comprehension:
>>> filtered = [word for word in array if word.isalpha()]
>>> print(filtered)
['hello']
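If it does have to be a regular expression, a sketch along these lines should work. It assumes that only ASCII letters count as valid, which is stricter than str.isalpha (that method also accepts non-ASCII letters); LETTERS_ONLY and only_letters are hypothetical names.

```python
import re

# A word qualifies only if it consists entirely of ASCII letters.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

def only_letters(words):
    """Keep the words made up solely of ASCII letters."""
    return [w for w in words if LETTERS_ONLY.fullmatch(w)]

print(only_letters(['hello', 'hello2', '?hello', '?hello2']))  # ['hello']
```

fullmatch requires the whole word to match, so a single digit or symbol anywhere in the word is enough to skip it.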
If there are only a few characters you want to exclude, use a blacklist; otherwise use a whitelist.
import string
abadword = """aaaa
bbbbb"""
words = ["oneGoodWord", "a,bc", abadword, "xx\n", '123', "gone", "tab\ttab", "theEnd.", "anotherGoodWord"]
bad = list(string.punctuation)  # string.punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
bad += ['\n', '\t', '1']  # add some more characters you don't want
bad += ['one']  # redundant: skip() compares single characters, so the string 'one' can never match
print(bad)  # ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\n', '\t', '1', 'one']
bad = set(bad)
def skip(word):
    # True when the word has no characters in common with the bad set
    return len(set(word) & bad) == 0
print("good words:")
print(list(filter(skip, words)))  # prints ['oneGoodWord', 'gone', 'anotherGoodWord']
