Python re adding space when splitting

I am using the following regex to split a phrase passed in as a string, into a list of words.
Because there might be non-ASCII letters, I'm using the Unicode flag. This works great in most cases:
phrase = 'hey look out'
word_list = re.split(r'[\W_]+', unicode(phrase, 'utf-8').lower(), flags=re.U)
>>> word_list
[u'hey', u'look', u'out']
But, if the phrase is a sentence that ends with a period like this, it will create a blank value in the list:
phrase = 'hey, my spacebar_is broken.'
>>> word_list
[u'hey', u'my', u'spacebar', u'is', u'broken', u'']
My work around is to use
re.split(r'[\W_]+', unicode(phrase.strip('.'), 'utf-8').lower(), flags=re.U)
but I wanted to know if there is a way to solve it within the regex itself.

\W matches non-word characters. Since . is a non-word character, the string is split on it, and since there is nothing after the period, you get an empty string. If you want to avoid this, you'll need to either strip the separator characters off the ends of the string
phrase = re.sub(r'^[\W_]+|[\W_]+$', '', phrase)
or filter the resulting array to remove empty strings.
word_list = [word for word in word_list if word]
Alternatively, you can get the words by matching them directly rather than splitting:
words = re.findall(r'[^\W_]+', phrase)
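For instance, running both approaches on the problem phrase (a quick sketch in Python 3, where str is already Unicode, so the unicode() call isn't needed):

```python
import re

phrase = 'hey, my spacebar_is broken.'

# Splitting on non-word characters leaves an empty string after the final period:
split_words = re.split(r'[\W_]+', phrase.lower())
print(split_words)   # ['hey', 'my', 'spacebar', 'is', 'broken', '']

# Matching the words directly avoids the empty string entirely:
found_words = re.findall(r'[^\W_]+', phrase.lower())
print(found_words)   # ['hey', 'my', 'spacebar', 'is', 'broken']
```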

Related

Search through a list of strings for a word that has a variable character

Basically, I start with a word like "brand", replace a single character in it with an underscore, and try to find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started by using re.sub to substitute a letter for the underscore. But I'm really lost on where to go next. I only want words that differ by this underscore, either without it or with it replaced by a letter. Like if the word "under" were run through the list, I wouldn't want it to return "understood" or "thunder", just a single-character difference. Any ideas would be great!
I tried replacing the character with every letter in the alphabet first, then checking whether that word is in the dictionary, but that took such a long time. I really want to know if there's a faster way.
import re
import string

dictionary = open("Scrabble.txt").read().split('\n')

# After replacing one character of the word with "_", find the words in the
# dictionary that match the pattern by trying every letter in its place.
def matches_for(word):
    new = []
    for letter in string.ascii_lowercase:
        candidate = re.sub('_', letter, word)
        if candidate in dictionary:
            new.append(candidate)
    return new
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to do it inside a function.
string = 'band brand bland cat dand bant bramd branding blandisher'
word = 'brand'
new = []
for n, letter in enumerate(word):
    pattern = word[:n] + r'\w?' + word[n+1:]
    new.extend(re.findall(pattern, string))
new = list(set(new))
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using regex to do what you're looking for. In every iteration we take one letter out of "brand" and make the algorithm look for any word that matches. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a word with b, then any character may or may not appear, and then 'and'.
Then it adds to the list all words that match.
Finally I remove duplicates with list(set(new))
Edit: forgot to add the string variable.
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
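As a quick illustration of how this version behaves, here is a sketch with a small in-memory word list standing in for Scrabble.txt (the list contents are made up for the example):

```python
import re

# Toy word list standing in for Scrabble.txt (made up for this example).
dictionary = ['band', 'brand', 'bland', 'bandit', 'thunder', 'cat']

pattern = "b_and".replace('_', '.?') + r'\b'  # becomes b.?and\b

# re.match anchors at the start of each word; \b rejects longer words
# like "bandit" that merely start with a match.
matching_words = [word for word in dictionary if re.match(pattern, word)]
print(matching_words)   # ['band', 'brand', 'bland']
```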

Match a word ending with pattern in a sentence

How do I find word(s) in a sentence that end with a pattern, using regex?
I have list of patterns I want to match within a sentence
For example
my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
Result should return only words that end with items from my_list
['Someone','dothis'] only
since I do not want to match onesome or thisis
You can end your pattern with the word boundary metacharacter \b. It will match anything that is not a word character, including the end of the string. To get the whole word back from re.findall, rather than just the suffix, also prefix the pattern with \w*; so, in that specific case, the pattern would be \w*(?:one|this)\b.
To actually create a regex from your my_list variable, assuming that no reserved characters are present, you can do:
import re
def words_end_with(sentence, my_list):
    return re.findall(r"\w*(?:{})\b".format("|".join(my_list)), sentence)
If you're using Python 3.6+, you can also use an f-string, to do this formatting inside the string itself:
import re
def words_end_with(sentence, my_list):
    return re.findall(fr"\w*(?:{'|'.join(my_list)})\b", sentence)
See https://www.regular-expressions.info/wordboundaries.html
You can use the following pattern:
\b(\w+(one|this))\b
It says match whole words within word boundaries (\b...\b), and within whole words match one or more word characters (\w+) followed by the literal one or this ((one|this)).
https://regex101.com/r/UzhnSw/1/
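One note on this pattern: since it contains two capturing groups, re.findall returns (whole word, suffix) tuples, so take the first element of each:

```python
import re

sentence = 'Someone dothis onesome thisis'

# Each tuple holds (whole word, matched suffix); the word is group 1.
matches = re.findall(r'\b(\w+(one|this))\b', sentence)
print(matches)                  # [('Someone', 'one'), ('dothis', 'this')]
print([m[0] for m in matches])  # ['Someone', 'dothis']
```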

How can I delete specific words from a Python string without trimming them from other words?

I would like to know how to remove specific words from a string in Python without deleting them from other words they are part of.
For example if I want to remove 'is' from the following sentence:
s = 'isabelle is in Paris'
The .replace() method deletes the 'is' in 'isabelle' and in 'Paris':
s = 'isabelle is in Paris'
s.replace('is', '')
It gives me abelle in Par, but I want isabelle in Paris. Is there a way to delete only 'is'?
I tried s.replace(' is ', '') with a space on each side of 'is', but in that case 'is' is not removed from the string s = 'Isabelle is, as you know, in Paris'.
Thank you
Use a regular expression instead of replacing an ordinary string. You can then use \b in the regexp to match word boundaries.
import re
s = re.sub(r'\bis\b', '', s)
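A small caveat: deleting just the word leaves a double space behind. Folding any preceding whitespace into the match (one possible tweak, not the only one) avoids that:

```python
import re

s = 'Isabelle is, as you know, in Paris'

# Removing only the word leaves the spaces on both sides of it:
naive = re.sub(r'\bis\b', '', s)
print(naive)   # 'Isabelle , as you know, in Paris'

# Consuming any whitespace before the word keeps the spacing clean:
better = re.sub(r'\s*\bis\b', '', s)
print(better)  # 'Isabelle, as you know, in Paris'
```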

Removing any pattern (word or regex) defined in a list from a string

I have a list
forbidden_patterns=['Word1','Word2','Word3','\d{4}']
and a string :
string1="This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"
What is the way to remove from string1 all the patterns and words defined in forbidden_patterns, so that it ends up as:
clean_string="This is a list that I'd like to be readable"
The \d{4} is to remove the year pattern which in this case is 2016
List comprehensions are very welcome.
Here you are:
import re

forbidden_patterns = ['Word1', 'Word2', 'Word3', r'\d{4}']
string = "This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"

for pattern in forbidden_patterns:
    string = ''.join(re.split(pattern, string))

print(string)
Essentially, this code goes through each of the patterns in forbidden_patterns, splits string using that particular pattern as a delimiter (which removes the delimiter, in this case the pattern, from the string), and joins it back together into a string for the next pattern.
EDIT
To get rid of the extra spaces, put the following line as the first line in the for-loop:
string = ''.join(re.split(r'\b{} '.format(pattern), string))
This line checks to see if the pattern is a whole word, and if so, removes that word and one of the spaces. Make sure that this line goes above string = ''.join(re.split(pattern, string)), which is "less specific" than this line.
import re

new_string = string1
for pattern in forbidden_patterns:
    new_string = re.sub(pattern, '', new_string)

Your new_string would then be the one you want, though removing the standalone patterns leaves you with double spaces, as in This is  a list that I'd like to  be readable
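If the leftover double spaces matter, one possible approach (a sketch, not the only way) is to remove all the patterns in a single pass by joining them into one alternation, then collapse whitespace runs afterwards:

```python
import re

forbidden_patterns = ['Word1', 'Word2', 'Word3', r'\d{4}']
string1 = "This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"

# One alternation removes every pattern in a single re.sub pass:
stripped = re.sub('|'.join(forbidden_patterns), '', string1)

# Collapse any resulting runs of whitespace into single spaces:
clean_string = re.sub(r'\s+', ' ', stripped).strip()
print(clean_string)   # "This is a list that I'd like to be readable"
```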

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non-alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots, but it does not seem to be working.
Thanks!
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following Python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    lineWords = find_words(line.lower())
    if lineWords:
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong in the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows characters that are not spaces, such as punctuation. So this will allow a match if, after the matched word, the line ends or there is a non-word character, in other words no letters, numbers, or underscores (if you don't want word_ to match word, you will have to change \W to [a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is composed of two parts itself which matches either (?<=[^\w./-]) or (?<=^). The second one allows a match if the line begins before the word to be matched. We cannot use (?<=[^\w./-]|^) because python's lookbehind from re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward, I think, but if there's anything you'd like more elaboration on, I can give more details.
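As a quick sanity check (run per line, as in the question's loop, so ^ here means start of line), the proposed pattern behaves like this on two of the sample inputs:

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

# Hyphenated words and words followed by punctuation are kept:
print(find_words('On-line maintenance, other.'))  # ['On-line', 'maintenance', 'other']

# A leading hyphen fails the look-behind, so nothing matches:
print(find_words('-Publisher'))                   # []
```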
This is not purely a regex task, because you cannot detect valid words with a regex alone. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and then checking that all of the items exist in your dictionary. For example:
import re
words = re.split(r'\W+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'\W+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    # do stuff
But if you want to use your own word lists, you need to change your regex, because splitting on \S+ is incorrect; instead, use re.split on non-word characters as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'\W+', word)
            if all(i in all_words for i in words if i):
                print word
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-arounds that explicitly state the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by space, quote or parens, but e.g. not a number (using double-negation instead of positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
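A quick check of the pattern on a sentence mixing wanted tokens with one of the unwanted base64-style strings (the sample text here is made up for illustration):

```python
import re

pattern = r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
text = 'We offer maintenance, support and On-line help. MFgwCgYEVQgBAQ== is excluded.'

# The '=' after the base64-style token satisfies neither look-around,
# so that token never matches; ordinary and hyphenated words do.
result = re.findall(pattern, text)
print(result)
```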
