python regular expression - python

I am newbie to python. I have an array of words and each word has to be checked to see whether it contains any special characters or digits. If it contains so then i have to skip that word. How should i do this?

Does it have to be a regular expression? If not, you can use the isalpha() string method.

My reading of the problem is that you want to discard any words that contain non-alphabetical characters. Try the following:
>>> array = ['hello', 'hello2', '?hello', '?hello2']
>>> filtered = filter(str.isalpha, array)
>>> print filtered
['hello']
You could also write it as a list comprehension:
>>> filtered = [word for word in array if word.isalpha()]
>>> print filtered
['hello']

If there are only a few characters you want to exclude then use a blacklist, otherwise use a white list.
import string
abadword="""aaaa
bbbbb"""
words=["oneGoodWord", "a,bc",abadword, "xx\n",'123',"gone", "tab tab", "theEnd.","anotherGoodWord"]
bad=list(string.punctuation) #string.punctuation='!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
bad+=['\n','\t','1'] #add some more characters you don't want
bad+=['one'] #this is redundant as in function skip set(word) becomes a set of word's characters. 'one' cannot match a character.
print bad #bad = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\n', '\t', '1', 'one']
bad=set(bad)
def skip(word):
return len(set(word) & bad)==0 #word has no characters in common with bad word
print "good words:"
print filter(skip,words) #prints ['oneGoodWord', 'gone', 'anotherGoodWord']

Related

Partitioning a string with multiple delimiters

I know partition() exists, but it only takes in one value, I'm trying to partition around various values:
for example say I wanted to partition around symbols in a string:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle variable amounts of partitioning.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
You could re.split by \W and use (...) to keep the delimiters, then remove empty parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this will split after every special character; if you want to keep certain groups of special characters together, e.g. == or <=, you should test those first with |.
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']

Regex - get all digits and letters (with non roman chars) [duplicate]

I need to use regex to strip punctuation at the start and end of a word. It seems like regex would be the best option for this. I don't want punctuation removed from words like 'you're', which is why I'm not using .replace().
You don't need regular expression to do this task. Use str.strip with string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'
>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"
I think this function will be helpful and concise in removing punctuation:
import re
def remove_punct(text):
new_words = []
for word in text:
w = re.sub(r'[^\w\s]','',word) #remove everything except words and space
w = re.sub(r'_','',w) #how to remove underscore as well
new_words.append(w)
return new_words
If you persist in using Regex, I recommend this solution:
import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me
Note also that you can pass your own punctuation marks:
my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
'/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_',
'`', '{', '|', '}', '~', '»', '«', '“', '”']
punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19
You can remove punctuation from a text file or a particular string file using regular expression as follows -
new_data=[]
with open('/home/rahul/align.txt','r') as f:
f1 = f.read()
f2 = f1.split()
all_words = f2
punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
# You can add and remove punctuations as per your choice
#removing stop words in hungarian text and english text and
#display the unpunctuated string
# To remove from a string, replace new_data with new_str
# new_str = "My name$## is . rahul -~"
for word in all_words:
if word not in punctuations:
new_data.append(word)
print (new_data)
P.S. - Do the identation properly as per required.
Hope this helps!!

How to query python list with multiple characters in each one

I know the title is a little messy and if someone wants to fix it, they are more than welcome to.
Anyways, I'm having trouble querying a python list with multiple values, I have looked on other Stackoverflow questions and none of seem to match what I'm looking for.
So, this is the code I have so far, its supposed to use a for loop statement, so that it goes through each character and then uses and if in statements to check whether a character in the user input matches anything in the list.
In my example, it only uses symbols, but hopefully that shouldn't be much of a problem
Anyways here is the code
string = input("What symbol character would you like to check")
symbols=[' ', '!', '#', '$', '%', '&', '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~',"'"]
def symbol():
for string in symbols:
if string in symbols:
return True
elif string not in symbols:
return False
if symbol():
print('ok')
if not symbol():
print('What Happened?')
*Update, I also need solution to be able to accept letters and numbers as well as the symbols.
For example, if user enters !a, that it will still detect the '!' and evaluate to True.
If you loop over the input string, you should be able to get the solution you're looking for. How about something like this?
input_string = raw_input("What symbol character would you like to check? ")
symbols = [' ', '!', '#', '$', '%', '&', '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~',"'"]
def symbol(input_str):
for char in input_str:
if char in symbols:
return True
return False
if symbol(input_string):
print('ok')
else:
print('What Happened?')
raw_input() avoids some trouble. Before I changed that, I was getting an unexpected EOF while parsing error. I changed the names of your variables to help a bit and avoid potential conflicts.
By moving the return False line outside the for loop, it lets the loop check every character in the input string first. If it checks every one, and nothing matches, then it will default to returning False.
Also, you have two calls to symbol() in your question, which I don't think is necessary. One if can check for a True return value. Lacking that, we move to the else statement and can safely know that the function returned False.

Split a string in python with spaces and punctuations mark , unicode characters , etc.

I want to split string like this:
string = '[[he (∇((comesΦf→chem,'
based on spaces, punctuation marks also unicode characters. I mean, what I expect in output is in following mode:
out= ['[', '[', 'he',' ', '(','∇' , '(', '(', 'comes','Φ', 'f','→', 'chem',',']
I am using
re.findall(r"[\w\s\]+|[^\w\s]",String,re.unicode)
for this case, but it returned following output:
output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']
Please tell me how can i solve this problem.
Without using regexes and assuming words only contain ascii characters:
from string import ascii_letters
from itertools import groupby
LETTERS = frozenset(ascii_letters)
def is_alpha(char):
return char in LETTERS
def split_string(text):
for key, tokens in groupby(text, key=is_alpha):
if key: # Found letters, join them and yield a word
yield ''.join(tokens)
else: # not letters, just yield the single tokens
yield from tokens
Example result:
In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
If you are using a python version less than 3.3 you can replace yield from tokens with:
for token in tokens: yield token
If you are on python2 keep in mind that split_string accepts a unicode string.
Note that modifying the is_alpha function you can define different kinds of grouping. For example if you wanted to considered all unicode letters as letters you could do: is_alpha = str.isalpha (or unicode.isalpha in python2):
In [3]: is_alpha = str.isalpha
In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']
Note the 'comesΦf' that before was splitted.
Hope i halp.
In [33]: string = '[[he (∇((comesΦf→chem,'
In [34]: re.split('\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']

How to remove specific symbols from a text using python?

I have a string like this:
string = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
These are the symbols I want to remove from my String.
!, #, #, %, ^, &, *, (, ), _, +, =, `, /
What I have tried is:
listofsymbols = ['!', '#', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
exceptionals = set(chr(e) for e in listofsymbols)
string.translate(None,exceptionals)
The error is:
an integer is required
Please help me doing this!
Try this
>>> my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> my_str.translate(None, '!##%^&*()_+=`/')
This is my text of 2013-02-11, it contained characters like this Exceptional
Also, please refrain from naming variables that are already built-in names or part of the standard library.
How about this? I've also renamed string to s to avoid it getting mixed up with the built-in module string.
>>> s = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> listofsymbols = ['!', '#', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
>>> print ''.join([i for i in s if i not in listofsymbols])
This is my text of 2013-02-11, it contained characters like this Exceptional
Another proposal, easily expandable to more complex filter criteria or other input data type:
from itertools import ifilter
def isValid(c): return c not in "!##%^&*()_+=`/"
print "".join(ifilter(isValid, my_string))

Categories