Python Regex: Apostrophes only when placed within letters, not as quotation marks

Define each word to be the longest contiguous sequence of alphabetic characters (or just letters), including up to one apostrophe if that apostrophe is sandwiched between two letters. My attempt:
[a-z]+[a-z/'?a-z]*[a-z$]
It doesn't match the single letter 'a'.

Something like this should work
[a-zA-Z]*(?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z])[a-zA-Z]*
This matches 0 or more letters [a-zA-Z]*, followed by either an apostrophe surrounded by two letters or a single letter (?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z]), then 0 or more letters [a-zA-Z]*.
For just lowercase letters:
[a-z]*(?:[a-z]\'[a-z]|[a-z])[a-z]*
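As a quick check of the pattern above, here is a minimal sketch that runs it through re.findall. The sample sentence and the name WORD are made up for illustration; they do not come from the question.

```python
import re

# Pattern from the answer above; WORD is an illustrative name.
WORD = re.compile(r"[a-zA-Z]*(?:[a-zA-Z]'[a-zA-Z]|[a-zA-Z])[a-zA-Z]*")

# Hypothetical sample text mixing apostrophes and single quotation marks.
text = "He said 'don't go', it's a dog."

words = WORD.findall(text)
print(words)
# ['He', 'said', "don't", 'go', "it's", 'a', 'dog']
```

The quotation marks around 'don't go' are dropped because every match has to begin and end with a letter, while the apostrophes inside don't and it's are kept.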

I'd use:
^(?:[a-z]+|[a-z]+'[a-z]+)$
with re.IGNORECASE
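Because this pattern is anchored with ^ and $, it validates a token that has already been split out rather than extracting words from a longer text. A minimal sketch, assuming the tokens below as test input (is_word and WHOLE_WORD are illustrative names):

```python
import re

# Anchored pattern from the answer above, compiled case-insensitively.
WHOLE_WORD = re.compile(r"^(?:[a-z]+|[a-z]+'[a-z]+)$", re.IGNORECASE)

def is_word(token):
    """Return True if the whole token is a word (illustrative helper)."""
    return WHOLE_WORD.match(token) is not None

# Hypothetical tokens: the last three have an apostrophe at an edge.
tokens = ["a", "Don't", "'quoted'", "dogs'", "it's'"]
results = {token: is_word(token) for token in tokens}
for token, ok in results.items():
    print(token, ok)
```

Only a and Don't pass; the other tokens have an apostrophe that is not between two letters.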

You seem to misunderstand the character class notation. The stuff between [ and ] is a set of characters, any one of which can match. Listing the same character multiple times has no effect, and basically all characters except ], \ and - (and an initial ^ for negation) simply match themselves, i.e. lose their regex special meaning. In your pattern, the /, ? and $ inside the brackets are therefore literal characters.
Let's rephrase your requirement. You want a letter [a-z] repeated one or more times +, optionally followed by an apostrophe and another sequence of letters.
[a-z]+('[a-z]+)?
In some regex dialects, you might prefer the non-capturing opening parenthesis (?: instead of plain (.
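In Python, the choice between ( and (?: matters with re.findall, which returns the group's text instead of the whole match when the pattern contains a capturing group. A small sketch with a made-up sample string:

```python
import re

text = "don't stop"  # hypothetical sample

# With a capturing group, findall returns the group's text, not the whole match.
capturing = re.findall(r"[a-z]+('[a-z]+)?", text)
print(capturing)      # ["'t", '']

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
print(non_capturing)  # ["don't", 'stop']
```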

Related

Match Pattern based on multiple special characters

Regex to match more than one special characters after a string
I am trying to come up with a regex that matches the following cases, in order of importance:
1) String plus 2 or more special characters followed by some word
2) String plus 1 special character followed by some word
3) String (and no special characters) followed by some word
I am able to match all three cases with the regex below,
re.compile(r'keyword\W*\s*(\S*)', re.IGNORECASE|re.MULTILINE|re.UNICODE)
but it does not differentiate between the scenarios after the keyword.
For example, taking keyword as the string above:
If I have the string 'keyword-+blah', it should match case 1 only.
If I have the string 'keyword-blah', it should match case 2 only.
If I have the string 'keywordblah', it should match case 3 only.
You could use a character class to specify which chars you consider to be special. Then use the quantifier {0,2} to match 0, 1 or 2 repetitions of it.
The \w+ that follows matches one or more word characters.
Note that \S matches any non-whitespace char, so it would also match - or +.
keyword[+-]{0,2}\w+
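To tell the three scenarios apart, one option is to capture the special characters and count them. This is only a sketch: the capture groups, the classify helper, and the assumption that + and - are the only special characters are additions for illustration, not part of the answer.

```python
import re

# Same pattern as above, with capture groups added for illustration.
PATTERN = re.compile(r"keyword([+-]{0,2})(\w+)", re.IGNORECASE)

def classify(text):
    """Return (case, word), or None if there is no match.

    Case 1: two special characters, case 2: one, case 3: none.
    classify is a hypothetical helper, not part of the original answer.
    """
    m = PATTERN.search(text)
    if m is None:
        return None
    specials, word = m.groups()
    case = {2: 1, 1: 2, 0: 3}[len(specials)]
    return case, word

for s in ["keyword-+blah", "keyword-blah", "keywordblah"]:
    print(s, classify(s))
```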

Regex to get non-alphanumeric strings between alphanumeric strings

Let's say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+*& %$^ %%$
which includes the trailing non-alphanumeric characters, which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which includes the characters a, N, c and S.
If you don't mind including the _ character as alpha-numeric data, you can extract all your non-alpha-numeric-data with this:
import re
some_string = "A+*&N%$^S%%$"
result = re.findall(r'\b\W+\b', some_string) # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each match one character, so if your alpha-numeric string (between the text you want) is exactly one character, then what you think should be the next match won't match.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
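The difference is easy to see on the same string. The sketch below compares a consuming \w with the zero-width \b; the lookaround variant at the end is an alternative that is not from the answer, added for the case where _ should not count as alphanumeric.

```python
import re

some_string = "A+*&N%$^S%%$"

# \w consumes the letter N, so it is no longer available to start the next match.
consuming = re.findall(r"\w(\W+)\w", some_string)
print(consuming)   # ['+*&']

# \b is zero-width, so N can border both runs.
boundary = re.findall(r"\b\W+\b", some_string)
print(boundary)    # ['+*&', '%$^']

# Alternative (not from the answer): lookarounds with explicit classes,
# which keeps _ out of the "alphanumeric" set.
lookaround = re.findall(r"(?<=[0-9a-zA-Z])[^0-9a-zA-Z]+(?=[0-9a-zA-Z])", some_string)
print(lookaround)  # ['+*&', '%$^']
```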
The only problem with your second attempt is the location of the + quantifier: it should be inside the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex except that it treats underscores _ as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # also treats _ as a word character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']
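Why the position of + matters in Python: when a capturing group is repeated, only its last iteration is kept, and re.findall returns the group's text. A short sketch on the question's string:

```python
import re

s = "Alpha+*&Numeric%$^String%%$"

# + outside the group: the group is repeated, and only its last iteration is kept.
outside = re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z]", s)
print(outside)  # ['&', '^']

# + inside the group: the group captures the whole run.
inside = re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)
print(inside)   # ['+*&', '%$^']
```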

Split string at capital letter but only if no whitespace

Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually one name, but the code splits it according to the 'split-at-capital-letter' rule.
How can I prevent a split at a capital letter when the preceding character is whitespace?
2) Special characters not handled well
The code used cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
You can use positive and negative lookbehinds and just list the umlauts explicitly:
>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results
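A different way to get the same list is to split instead of match. The sketch below is not from the answer; it uses re.split on a zero-width position and assumes Python 3.7 or later, where re.split can split on empty matches.

```python
import re

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"

# Split at every position that follows a non-whitespace character
# and precedes an uppercase letter (umlauts listed explicitly, as above).
names = re.split(r"(?<=\S)(?=[A-ZÄÖÜ])", string)
print(names)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

The (?<=\S) lookbehind fails at the start of the string and after a space, so no split happens before the first name or before Berg.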
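This also works for the example. Note that the range [a-ü] covers every code point from a to ü, which includes non-letters such as { and ÷:
import re
string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# Try the two-word form first, then fall back to a single word.
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
print(re.findall(pattern, string))
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']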

Python, Regular Expression: How to remove letter.letter(a.b) from string?

How can I remove letter-dot-letter combinations (for example F.B) from a string in Python? I tried using a regex:
abre = re.sub(r"\b\w+\.\w+#",'',abre)
but it does not remove these sequences; it just returns the same unchanged string. I also tried removing all dots and then removing words shorter than 2 letters, but in that case I lose real words.
What I have: C.P.A. Certification Program, Accounting
What I want to get: Certification Program, Accounting
The length of the sequence is not always known and the letters are also unknown.
You seem to want to remove words that consist of dot-separated uppercase letters.
Use
abre = re.sub(r"\b(?:[A-Z]\.)+(?!\w)",'',abre)
To also remove trailing whitespace, you may add \s* at the end. If there must be at least two letters, replace + with {2,}.
Details:
\b - leading word boundary
(?:[A-Z]\.)+ - one or more sequences of
  [A-Z] - an uppercase ASCII letter
  \. - a dot
(?!\w) - not followed by a word char
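The two variations mentioned above (\s* and {2,}) can be tried as follows. The first input is the string from the question; the second sentence is a hypothetical example chosen to show where + and {2,} differ.

```python
import re

abre = "C.P.A. Certification Program, Accounting"

# Pattern from the answer, plus \s* to swallow the whitespace after the abbreviation.
cleaned = re.sub(r"\b(?:[A-Z]\.)+(?!\w)\s*", "", abre)
print(cleaned)  # Certification Program, Accounting

# Hypothetical sentence where a lone capital letter is followed by a dot.
other = "Plan B. for C.P.A. exams"
loose = re.sub(r"\b(?:[A-Z]\.)+(?!\w)\s*", "", other)
strict = re.sub(r"\b(?:[A-Z]\.){2,}(?!\w)\s*", "", other)
print(loose)    # Plan for exams
print(strict)   # Plan B. for exams
```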
If the sequence to remove is a fixed, known string, you can use str.replace:
>>> string="rgoa.bwtg.rgqra.bergeg"
>>> string.replace("a.b", "")
'rgowtg.rgqrergeg'

Regex to match a string with 2 capital letters only

I want to write a regex which will match a string only if the string consists of two capital letters.
I tried [A-Z]{2}, [A-Z]{2, 2} and [A-Z][A-Z], but these also match a string like 'CAS', while I want a match only if the whole string is two capital letters, like 'CA'.
You could use anchors:
^[A-Z]{2}$
^ matches the beginning of the string, while $ matches its end.
Note that one of your attempts, [A-Z]{2, 2}, should actually be [A-Z]{2,2} (without the space) to mean the same thing as the others.
You need to add word boundaries,
\b[A-Z]{2}\b
Explanation:
\b Matches between a word character and a non-word character.
[A-Z]{2} Matches exactly two capital letters.
\b Matches between a word character and a non-word character.
You could try:
\b[A-Z]{2}\b
\b matches a word boundary.
Try:
^[A-Z][A-Z]$
Just added start and end points for the string.
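The anchored and word-boundary variants answer slightly different questions, which a short sketch can show. The sample strings are made up, and re.fullmatch is included as an alternative that none of the answers mention: in Python, $ also matches just before a trailing newline, while re.fullmatch requires the entire string to match.

```python
import re

# Anchors: the whole string must be exactly two capital letters.
anchored_ca = bool(re.search(r"^[A-Z]{2}$", "CA"))
anchored_cas = bool(re.search(r"^[A-Z]{2}$", "CAS"))
print(anchored_ca, anchored_cas)   # True False

# Word boundaries: two-letter words anywhere in a longer text.
bounded = re.findall(r"\b[A-Z]{2}\b", "CAS is not CA or NY")
print(bounded)                     # ['CA', 'NY']

# $ tolerates one trailing newline; re.fullmatch does not.
dollar_newline = bool(re.search(r"^[A-Z]{2}$", "CA\n"))
full_newline = bool(re.fullmatch(r"[A-Z]{2}", "CA\n"))
print(dollar_newline, full_newline)  # True False
```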
