Simple Negative Lookahead [duplicate] - python

I get this:
import re;
print(re.findall(r"q(?=u)", "qqq queen quick qeel"))
# ['q', 'q']  (for queen and quick)
But I don't get this:
import re;
print(re.findall(r"q(?!=u)", "qqq queen quick qeel"))
# ['q', 'q', 'q', 'q', 'q', 'q']  (every q matches)
I expected only 4 qs to match, because the negative lookahead should see that in the word qeel, for example, the letter after the q is not u, while in queen and quick it is.
What gives?

It should be:
import re
print(re.findall(r"q(?!u)", "qqq queen quick qeel"))
# ---^---
# ['q', 'q', 'q', 'q']
Without the =, that is. With the =, the pattern (?!=u) asserts that the q is not followed by the literal text =u, which is true for all of your qs here. In general, a positive lookahead is written (?=...) whereas a negative one is just (?!...).
Sidenote: the ; is not needed at the end of the line unless you want to write everything on one line, which is not considered "Pythonic" but is totally valid:
import re; print(re.findall(r"q(?!u)", "qqq queen quick qeel"))
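To see the difference between the two patterns more directly, here is a small sketch that records where each one matches. The sample string "q=u qu q" is made up for illustration (it is not from the question); it contains one q followed by =u, one followed by u, and one at the end of the string.

```python
import re

# Hypothetical sample text: q followed by "=u", q followed by "u", and a final q.
text = "q=u qu q"

# (?!=u) rejects a q only when the literal text "=u" comes next.
typo_hits = [m.start() for m in re.finditer(r"q(?!=u)", text)]

# (?!u) rejects a q only when a "u" comes next.
correct_hits = [m.start() for m in re.finditer(r"q(?!u)", text)]

print(typo_hits)     # [4, 7]
print(correct_hits)  # [0, 7]
```

The pattern with the stray = skips only the q at index 0, while the intended pattern skips only the q in "qu" at index 4.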


How to edit individual character formats in string (beyond just upper())?

I am trying to create a sort of version of Wordle in python (just for practice).
I am having difficulty communicating to the player which letters in their guess match (or closely match) the letters in the target word.
I can highlight matches (i.e. where the letter is in the right place) using uppercase, but I don't know how to differentiate between letters which have a match somewhere in the target word and letters which do not appear at all. The relevant code is below:
def compare_words(word, guess):
    W = list(word)  # need to turn the strings into lists to compare each part
    G = list(guess)
    print(W)  # printing just to track the two words
    print(G)
    result = []  # define an empty list for our results
    for i in range(len(word)):
        if guess[i] == word[i]:
            result.append(guess[i].upper())
        elif guess[i] in word:
            result.append(guess[i])
        else:
            result.append(" ")
    print(result)
    return result
# note, previous functions ensure the length of the "word" and "guess" are the same and are single words without digits
x = compare_words("slide","slips")
['s', 'l', 'i', 'd', 'e']
['s', 'l', 'i', 'p', 's']
['S', 'L', 'I', ' ', 's']
As you can see, the direct matches are uppercase, the other matches are unchanged and the "misses" are left out. This is not what I want, as usually the whole guess is spat back out with font changes or colours to indicate the matches.
I have looked into bolding and colours, but that is all applied at the point of printing. I need something built into the list itself, but I am unsure if I can do this. Any ideas?
Cheers
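One possible direction, offered only as a sketch: keep the letter and its status together in the list as (letter, status) pairs, and decide how to display them at print time. The status names "exact", "present" and "absent", the STYLES table and the render() helper are all invented here for illustration. The matching rule is the same simple `in` check as in the question, so it does not handle repeated letters the way the real game does.

```python
def compare_words(word, guess):
    """Return one (letter, status) pair per letter of the guess."""
    result = []
    for target_letter, guess_letter in zip(word, guess):
        if guess_letter == target_letter:
            status = "exact"      # right letter, right place
        elif guess_letter in word:
            status = "present"    # letter occurs somewhere in the target
        else:
            status = "absent"     # letter does not occur at all
        result.append((guess_letter, status))
    return result


# Hypothetical display styles; colours or bold could be substituted here.
STYLES = {"exact": "[{}]", "present": "({})", "absent": " {} "}


def render(result):
    """Turn the (letter, status) pairs into a printable string."""
    return "".join(STYLES[status].format(letter) for letter, status in result)


pairs = compare_words("slide", "slips")
print(pairs)
print(render(pairs))  # [s][l][i] p (s)
```

Because the status travels with each letter, the whole guess is kept and the formatting decision stays separate from the comparison.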

Match all [A-Z] but not duplicates [duplicate]

I need to match all uppercase letters in a string, but not duplicates of the same letter. In Python I've been using
from re import compile
regex = compile('[A-Z]')
variables = regex.findall('(B or P) and (P or not Q)')
That returns ['B', 'P', 'P', 'Q'], but I need ['B', 'P', 'Q'].
Thanks in advance!
You can use negative lookahead with a backreference to avoid matching duplicates:
re.findall(r'([A-Z])(?!.*\1.*$)', '(B or P) and (P or not Q)')
This returns:
['B', 'P', 'Q']
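If order matters, deduplicate the variables list from the question instead:
print(sorted(set(variables), key=variables.index))
Or if you have the more_itertools package (unique_everseen returns an iterator, so wrap it in list):
from more_itertools import unique_everseen as u
print(list(u(variables)))
Or on Python 3.7+ (and CPython 3.6), where dicts preserve insertion order:
print(list(dict.fromkeys(variables)))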
Or OrderedDict:
from collections import OrderedDict
print(list(OrderedDict.fromkeys(variables)))
All of these produce:
['B', 'P', 'Q']
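One difference between the two families of answers is worth a quick check: the lookahead pattern keeps the last occurrence of each letter, while the dict-based deduplication keeps the first. The input string below is made up to show this (in the question's own string both approaches happen to agree).

```python
import re

# Hypothetical input in which P appears both before and after B.
text = "P or B or P"

# A letter matches only if the same letter does not appear again later,
# so the LAST occurrence of each letter is kept.
last_kept = re.findall(r"([A-Z])(?!.*\1.*$)", text)

# dict.fromkeys keeps the FIRST occurrence of each letter.
first_kept = list(dict.fromkeys(re.findall(r"[A-Z]", text)))

print(last_kept)   # ['B', 'P']
print(first_kept)  # ['P', 'B']
```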

Filter words from a file based on their number of syllables

I need to identify complex words in a .txt file.
I am trying to use nltk, but I get an error saying that no such module exists.
Complex words are words in the text that contain more than two syllables.
I would use Pyphen. This module has a Pyphen class used for hyphenation. One of its methods, positions(), returns the positions in a word where it can be split:
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> p.positions('exclamation')
[2, 5, 7]
If the word "exclamation" can be split in three places, it has four syllables, so you just need to keep the words with more than one split position (that is, more than two syllables).
. . .
But I noticed you tagged it as an nltk question. I'm not experienced with NLTK myself, but the question suggested by @Jules has a nice suggestion in this respect: use the cmudict corpus. It gives you a list of pronunciations of a word in American English:
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> pronunciations = d['exasperation']
>>> pronunciations
[['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']]
Luckily, our first word has only one pronunciation. It is represented as a list of strings, each one representing a phoneme:
>>> phonemes = pronunciations[0]
>>> phonemes
['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']
Note that vowel phonemes have a number at the end, indicating stress:
Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
So, we just need to count the number of phonemes with digits at the end:
>>> vowels = [ph for ph in phonemes if ph[-1].isdigit()]
>>> vowels
['EH2', 'AE2', 'ER0', 'EY1', 'AH0']
>>> len(vowels)
5
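Putting the counting step to work on the original problem could look roughly like the sketch below. The helper names count_syllables and complex_words are invented for this example, and the tiny sample dictionary is hand-written to stand in for cmudict.dict() so the snippet runs without NLTK or its data; with the real corpus you would pass d instead. Only the first listed pronunciation is used, which is a simplifying assumption.

```python
def count_syllables(phonemes):
    """Count the phonemes that end in a stress digit (the vowel sounds)."""
    return sum(1 for ph in phonemes if ph[-1].isdigit())


def complex_words(words, pronunciations, min_syllables=3):
    """Return the words whose first pronunciation has at least min_syllables."""
    found = []
    for word in words:
        variants = pronunciations.get(word.lower())
        if not variants:
            continue  # word is not in the dictionary, skip it
        if count_syllables(variants[0]) >= min_syllables:
            found.append(word)
    return found


# Hand-written stand-in for cmudict.dict(), in the same format.
sample = {
    "exasperation": [['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']],
    "natural": [['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']],
    "queen": [['K', 'W', 'IY1', 'N']],
}

print(complex_words(["exasperation", "natural", "queen", "zzz"], sample))
# ['exasperation', 'natural']
```

Words missing from the dictionary are skipped silently here; depending on your text you may prefer to collect them and handle them some other way, for instance with Pyphen.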
. . .
Not sure which is the best option but I guess you can work your problem out from here.

Is there a better way to check for vowels in the first position of a word?

I'm trying to check for a vowel as the first character of a word. For my code I currently have this:
if first == 'a' or first == 'e' or first == 'i' or first == 'o' or first == 'u':
I was wondering whether there is a better way to do this check, or is this the best and most efficient way?
You can try it like this, using the in operator:
if first.lower() in 'aeiou':
or, better:
if first.lower() in ('a', 'e', 'i', 'o', 'u'):
It is better to create a set of vowels, like this
>>> vowels = set('aeiouAEIOU')
>>> vowels
set(['a', 'A', 'e', 'i', 'o', 'I', 'u', 'O', 'E', 'U'])
and then check if first is one of them like this
>>> if first in vowels:
...
Note: The problem with the
if first in 'aeiouAEIOU':
approach is that, if your input is wrong, for example if first is 'ae', the test still succeeds, because in performs a substring check on strings.
>>> first = 'ae'
>>> first in 'aeiouAEIOU'
True
But ae is clearly not a vowel.
Improvement:
If it is just a one-time job, where you don't care to create a set beforehand, then you can use if first in 'aeiouAEIOU': itself, but check the length of first beforehand, like this
>>> first = 'ae'
>>> len(first) == 1 and first in 'aeiouAEIOU'
False
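The set-based check can be wrapped in a small helper, sketched below. The function name starts_with_vowel and the constant VOWELS are made up for this example; the helper takes the whole word and looks only at its first character, so the substring problem described above cannot occur, and an empty string simply returns False.

```python
# Both cases are included so no call to lower() is needed.
VOWELS = frozenset("aeiouAEIOU")


def starts_with_vowel(word):
    """Return True if word is non-empty and its first character is a vowel."""
    return bool(word) and word[0] in VOWELS


print(starts_with_vowel("apple"))   # True
print(starts_with_vowel("Echo"))    # True
print(starts_with_vowel("banana"))  # False
print(starts_with_vowel(""))        # False
```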
Here is the regex approach:
from re import match
if match(r'^[aieou]', first):
...
This regular expression will match if the first character of "first" is a vowel.
If your function returns a boolean value, then the simplest way is
bool(first.lower() in 'aeiou')
or, since in already evaluates to a bool,
return first.lower() in 'aeiou'

How to find double occurrence of a letter in a word [duplicate]

I have a string:
s = 'bubble'
How can I use a regular expression to get a list like:
['b', 'u', 'bb', 'l', 'e']
I want to find single as well as double occurrences of a letter.
This should do it:
import re
[m.group(0) for m in re.finditer(r'(.)\1*', s)]
For 'bubbles' this returns:
['b', 'u', 'bb', 'l', 'e', 's']
For 'bubblesssss' this returns:
['b', 'u', 'bb', 'l', 'e', 'sssss']
You really have two questions. The first question is how to split the string, the second is how to filter the result.
The splitting takes advantage of backreferences in a pattern. In this case we'll construct a pattern that will find one or two occurrences of a letter, then construct a list from the search results. The \1 in the code block refers to the first parenthesized expression.
import re
pattern = re.compile(r'(.)\1?')
s = "bubble"
result = [x.group() for x in pattern.finditer(s)]
print(result)
To filter the list stored in result you could use a list comprehension that filters on length.
filtered_result = [x for x in result if len(x) == 2]
print(filtered_result)
You could just get the set of duplications directly by tweaking the regular expression.
pattern2 = re.compile(r'(.)\1')
result2 = [x.group() for x in pattern2.finditer(s)]
print(result2)
The output from running the above is:
['b', 'u', 'bb', 'l', 'e']
['bb']
['bb']
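The two answers behave the same on 'bubble' but differ on longer runs, because \1* consumes the whole run while \1? takes at most one repeat. A short comparison on the 'bubblesssss' input mentioned above (the variable names are just for this sketch):

```python
import re

s = "bubblesssss"

# \1* matches a letter plus ALL of its immediate repeats.
full_runs = [m.group(0) for m in re.finditer(r"(.)\1*", s)]

# \1? matches a letter plus AT MOST ONE immediate repeat.
at_most_pairs = [m.group(0) for m in re.finditer(r"(.)\1?", s)]

print(full_runs)      # ['b', 'u', 'bb', 'l', 'e', 'sssss']
print(at_most_pairs)  # ['b', 'u', 'bb', 'l', 'e', 'ss', 'ss', 's']
```

Which one is appropriate depends on whether a run of five letters should count as one group or be broken into pairs.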
