Extracting whole words - python

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully someone can help me out.
Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.
Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.
However I'd also be happy to just be able to get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].
I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') would return 'pi' when I want it to return nothing). I tried wrapping the alpha part as a term ((?:[a-z]+)) but it didn't help.
More detail: The data is an email database, so it's mostly plain English with normal numbers, but occasionally there's rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.

If you restrict yourself to ASCII letters, then use (with the re.I option set)
\b[a-z]+\b
\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
To also allow other non-ASCII letters, you can use something like this:
\b[^\W\d_]+\b
which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.
[^\W\d_] is a negated character class: it allows any word character except digits and underscore, which leaves only letters (including non-ASCII ones).
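As a quick illustration, here is a minimal Python 3 sketch on a made-up sample string:
import re

text = "pie21 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA café"

# ASCII letters only, case-insensitive
print(re.findall(r'\b[a-z]+\b', text, re.I))   # ['http', 'foo', 'com']

# Letters only, but Unicode-aware (no digits or underscore)
print(re.findall(r'\b[^\W\d_]+\b', text))      # ['http', 'foo', 'com', 'café']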

Are you familiar with word boundaries (\b)? You can extract words by placing \b around the sequence and matching the alphabet within:
\b([a-zA-Z]+)\b
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
You can read about the \b sequence, and others, in the Python manual.
EDIT Also, if you want to rule out a number following or preceding the match, you can use a negative look-ahead/look-behind:
(?!\d) # negative look-ahead for numbers
(?<!\d) # negative look-behind for numbers
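To see why the lookarounds alone were not enough on the asker's example, here is a minimal Python 3 sketch (the sample strings are taken from the question):
import re

# The lookarounds alone let the engine backtrack to a shorter match:
print(re.findall(r'(?<!\d)[a-z]+(?!\d)', 'pie21'))      # ['pi']

# Anchoring the run with \b makes the match all-or-nothing instead:
print(re.findall(r'\b[a-z]+\b', 'pie21'))               # []
print(re.findall(r'\b[a-z]+\b', 'http://foo.com'))      # ['http', 'foo', 'com']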

What about:
import re
yourString="pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString)))
Note that:
split explodes your string into potential candidates => returns a list of "potential words"
set makes a uniqueness filter => turns the list into a set, removing entries that appear more than once. This step is not mandatory.
filter reduces the number of candidates: it takes an iterable, applies a test function to each element, and returns the elements that pass the test. In our case, the test function is "anonymous"
lambda: an anonymous function taking an item and checking whether it is a word (upper- or lowercase letters only)
EDIT : added some explanations
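Run against the sample string above, it should produce something like this (a quick interactive sketch; sorted is added only because sets are unordered):
>>> sorted(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))
['com', 'foo', 'http', 'pie']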

Sample code (Python 2; the (?u) flag makes \b Unicode-aware):
import re
print re.search(ur'(?u)ривет\b', ur'Привет')    # matches: 'ривет' ends at a word boundary
print re.search(ur'(?u)\bривет\b', ur'Привет')  # None: 'ривет' is preceded by the word character 'П'
or
s = ur"abcd ААБВ"
import re
rx1 = re.compile(ur"(?u)АБВ")
rx2 = re.compile(ur"(?u)АБВ\b")
rx3 = re.compile(ur"(?u)\bАБВ\b")
print rx1.findall(s)   # finds АБВ (no boundary required)
print rx2.findall(s)   # finds АБВ (the string ends right after it)
print rx3.findall(s)   # finds nothing: АБВ is preceded by the word character А

Related

Returning text when text contains key element using regex

I am writing a function in python using regex that should return text when an element of that text is matched but the outputs I'm getting aren't as expected and I'm not sure what is going wrong.
My function is as below:
import re

def latin_ish_words(text):
    latin = re.findall('tion|ex|ph|ost', text, re.I)
    return latin
When I pass latin_ish_words("This functions as expected") it returns the elements 'tion' and 'ex' rather than 'functions' and 'expected'.
If someone could tell me where I've gone wrong, I'd be most appreciative!
Many thanks,
Andrew
The function returns the matching text - and that's what you saw. If you want to match whole words that contain those fragments, your pattern should state that.
I think \w*(?:tion|ex|ph|ost)\w* should help you find what you're expecting (repetition with * is greedy by default, so the surrounding word characters are picked up automatically).
Let's look at the modifications:
\w - matches a "word-character" (letters in upper- or lowercase, digits or underscore)
* - previous pattern needs to match between zero and unlimited times
(?:...) - a non-capturing group wrapping the original alternatives (tion|ex|ph|ost)
So basically we're just allowing word characters before and after. If you want to be stricter and only accept letters, use [A-Za-z]* instead of \w*.
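Put together, a short sketch of the corrected function (just one way to apply the pattern above):
import re

def latin_ish_words(text):
    # Expand each match to the whole word around the Latin-ish fragment
    return re.findall(r'\w*(?:tion|ex|ph|ost)\w*', text, re.I)

print(latin_ish_words("This functions as expected"))
# ['functions', 'expected']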

Remove continuous occurrence of vowels together in a string using Python

I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that the junk characters come as a bunch of vowels knit together. Now, I need to remove any word that has a space before and after and contains only vowels (like oeee, aouu, etc...) with a length of 2 or more. How do I achieve this in python?
Currently, I built a tuple of replacement words like ((" oeee "," "),(" aouu "," ")) and send it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, 0, re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware

Inverse regex match on group in Python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
should print should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
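Here is a quick Python 3 sketch of the last two approaches on the example string (the zero-length matches that the * quantifiers allow are simply filtered out):
import re

a = "should print nŌt thìs"

# Negative-lookahead version
print([m for m in re.findall(r"\b(?!\w*[À-ǚ])\w*", a) if m])   # ['should', 'print']

# Negated-character-class version
print([m for m in re.findall(r"\b[^À-ǚ\s]*\b", a) if m])       # ['should', 'print']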
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...     if any(['WITH' in ud.name(letter) for letter in word])]
['Cá', 'Lá']
Or use 'WITH' not in ... to reverse the filter.

How to use regex to match Uppercase words that are non-consecutive duplicates of lowercase words else where in a file

I am a noob to regex and need some help. I have a basic understanding of regex character classes, anchors, and lookarounds but this specific use-case is proving difficult for me.
I am trying to parse an output of a script that runs monthly that outputs a list of users for inventory purposes. I want to use regex to parse the file to match these conditions:
Regex to find lowercase words that are repeated as uppercase words. These uppercase words are not on the same line, are non-consecutive occurrences, and can be on new lines or elsewhere in the file.
I need a regex that is able to display the duplicate upper case matches
Another regex that deletes the duplicate upper case matches
Here is an example of the file output I am trying to parse:
"hello","2018-11-19","unitelife"
"world","2018-11-09","unitelife"
"foo","2018-11-16","unitelife"
"bar","2018-10-05","unitelife"
"hello123","2018-09-06","unitelife"
"HELLO123","2018-11-18","unitelife"
"FOO","2018-11-20","unitelife"
"WOWMUCHHAPPY","2018-10-20","unitelife"
"suchjoy","2017-11-28","unitelife"
The desired matches I am looking for are:
HELLO123
FOO
I have tried a number of combinations suggested by the referenced URL discussions below but nothing seems to work for me. Either I am trying incorrect combinations or this feature is not possible. In addition, most topics are discussing consecutive words or letters/characters.
Regex for non-consecutive upper-case words
Regex for non-consecutive upper-case words PART DEUX
PCRE Regex non-consecutive repeating
Regex for non-consecutive uppercase with lowercase
Regex non-consecutive chars
I want to apologize in advance if this was discussed before or if it has been categorized incorrectly. Please let me know rather than downvoting it, so that I may edit, close, or reclassify the question to meet the standards.
Sincerely,
A humbly committed student
You can use the pattern
(?sm)^"([a-z\d]+)"(?i)(?=(?:[^\n]*\n)+?"(?=\1")(?-i)[A-Z\d]+")
https://regex101.com/r/nM3iBH/2
The idea is, start by being case-sensitive (no i), and match the lower-cased string at the beginning of the line inside the quotes. Then, enable the case-insensitive flag, and lookahead for the same string at the beginning of the line inside quotes. Re-enable case sensitivity, and match that same string again inside the lookahead, permitting only upper-case letters (and digits).
Note that this (like any regex algorithm for this problem) has O(N^2) complexity, because with any match, you need to check the entire remaining substring for the (uppercased) match.
Also note the use of "s rather than the \bs you were thinking of - using "s instead is more precise given this sort of input and will result in significantly fewer steps overall.
The pattern is quite strict, for the sake of reducing steps. Separated out:
(?sm)^"([a-z\d]+)" - Initial flags, capture quoted lower-cased word at the beginning of a line
(?i) - Enable case-insensitivity so that the future \1 backreference will work properly
(?=(?:[^\n]*\n)+?"(?=\1")(?-i)[A-Z\d]+") Big lookahead for:
(?:[^\n]*\n)+? - Lazily match whole lines (runs of non-newline characters, each followed by a newline), as few as needed
"(?=\1") - Backreference the original matched word inside quotes at the beginning of a line
(?-i) - Re-enable case-sensitivity so the capitalization can be checked
[A-Z\d]+" - Match capital letters and digits, followed by a "
In order to replace the capital word line with the empty string, instead of using a big lookahead, match everything between the lower-case word and the upper-case word in a group (so you have two groups, the lower-case word, and everything following it up until the upper-case word), then match the upper-case word, and replace with only the first two groups (thereby replacing out the upper-case word's line):
(?sm)^("[a-z\d]+")(?i)((?:[^\n]*\n)*[^\n]*)\n(?=\1)(?-i)"[A-Z\d]+"[^\n]*
replace with
\1\2
(or the equivalent in your environment)
https://regex101.com/r/nM3iBH/3
Note that if you have overlapping matches, you'll have to do this iteratively until no matches remain.
Here's a solution using just bash, with no regular expressions:
> cat filter.sh
#!/bin/bash
declare -A lower=()
declare -A upper=()
while IFS= read -r line; do
    eval "words=( $(tr ',' ' ' <<< "$line") )"
    for w in "${words[@]}"; do
        [[ "${w^^}" = "$w" ]] && upper["$w"]=1 || lower["$w"]=1
    done
done
for u in "${!upper[@]}"; do
    exists=${lower["${u,,}"]+foo}
    [[ -n "$exists" ]] && echo "$u"
done
There are a couple tricks I'm using here.
First, I'm using associative arrays to weed out repeats. For instance, if "HELLO123" appears multiple times in the file, it will only be counted once.
Secondly, I'm parsing the CSV by using tr to replace the commas with spaces, and then using eval to parse the string into an array, taking advantage of the fact that the individual words are always enclosed in double quotes.
Lastly, I'm using [[ "${w^^}" = "$w" ]] as a test to check if a word contains all uppercase letters. The ${w^^} syntax is a bash-ism that converts a variable to uppercase. I also use ${u,,} in the second loop, which converts $u to lowercase. Note that if you have a word with a mix of uppercase and lowercase letters, it will be counted as a lowercase word. If that doesn't match your expectations, you could change the logic.
The first loop just reads from stdin, splits each line into individual words, and then categorizes each word as either uppercase or lowercase. Again, I'm using associative arrays so that each word (regardless of case) only gets counted once.
The second loop just loops through the keys of the upper associative array (${!upper[@]}), which are all the uppercase words encountered in the input. For each word, it checks to see if the matching lowercase word was also encountered. The ${lower["${u,,}"]+foo} syntax is just checking to see if the lowercase word exists in the lower array. The foo part is just an arbitrary string. You could just as well use bar or exists or abc. This is how you check for the existence of a key in an associative array in bash. If the key exists in the array, then the expression will evaluate to "foo", otherwise it will evaluate to an empty string. That's what the subsequent -n test is checking for.
Example:
> cat input.txt
"hello","2018-11-19","unitelife"
"world","2018-11-09","unitelife"
"foo","2018-11-16","unitelife"
"bar","2018-10-05","unitelife"
"hello123","2018-09-06","unitelife"
"HELLO123","2018-11-18","unitelife"
"FOO","2018-11-20","unitelife"
"WOWMUCHHAPPY","2018-10-20","unitelife"
"suchjoy","2017-11-28","unitelife"
> cat input.txt | ./filter.sh
FOO
HELLO123
NOTE: Please don't use eval in production code. It is prone to all sorts of abuse and mishaps due to unexpected things appearing in the input. For example, consider what would happen if you inserted the following line into the input:
"); rm -rf *; foo=("
Then eval would end up evaluating the string "words=(); rm -rf *; foo=()". Definitely not good. I only used eval here as a quick and dirty way to parse the CSV. There are much better (and safer) ways to parse CSV in bash. The point of this solution is the use of associative arrays to keep track of uppercase and lowercase words, while filtering out duplicates.
Edit: Also note that FOO and HELLO123 appear out of order in the output. This is due to the fact that associative arrays don't store the keys in the order in which you created them. So when you do ${!hash[@]}, it's a crapshoot what order the keys will be in. If that's an issue for you, you could keep a separate regular array to preserve the order.
Not with regex, but using awk's toupper() and tolower() functions:
$ awk -F, '{lower=tolower($1)} lower==$1 {a[$1];next}
toupper($1)==$1 && lower in a{print $1}' file
"HELLO123"
"FOO"
If the field is lowercase, add it to the set; if it is uppercase and its lowercase form is in the set, print it.
Here the order matters (the lowercase entry should appear before the uppercase one), as in your example. If that's not guaranteed, you'd need to convert this to a two-pass version. It's also easy to strip the quotes if they're not needed.
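For completeness, the same two-pass idea sketched in Python (not from any of the answers above; the input.txt name and the column layout are assumed to match the sample in the question):
import csv

lower_seen, upper_seen = set(), []
with open("input.txt", newline="") as f:
    for row in csv.reader(f):
        name = row[0]
        if name.isupper():
            upper_seen.append(name)       # candidate duplicates, file order preserved
        else:
            lower_seen.add(name.lower())  # everything else counts as a lowercase word

print([u for u in upper_seen if u.lower() in lower_seen])
# ['HELLO123', 'FOO']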

The most elegant way to find n words in String with the particular word

There is a big string and I need to find all substrings containing exactly N words (if it is possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it turned out to be more complex than I expected at first. Is there an elegant solution around that I've just overlooked?
Upd
By word we mean everything separated by \b
N parameter indicates how many words on each side of the 'find' should be
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue is if there are less than the desired amount of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to the greediness of repetition. Note that in any case, if you want multiple results, matches can never overlap. So with the first pattern you will only get the first match if two occurrences of find are too close to each other, whereas with the second you won't get n words before the second find (the ones that were already consumed will be missing). In particular, if two occurrences of find are closer together than n, so that the second find is already part of the first match, you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.
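A small sketch of how this could be wrapped into the function from the question (the parameterization is mine, not from the answer; re.escape guards against regex metacharacters in the search word):
import re

def find_sub(big_string, word, N=2):
    # Allow up to N words on each side, tolerating the edges of the string
    pattern = r'(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}' % (N, re.escape(word), N)
    return re.findall(pattern, big_string)

big_string = "The most elegant way to find n words in String with the particular word"
print(find_sub(big_string, 'find', N=2))
# ['way to find n words']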
