Python, Regular Expression: How to remove letter.letter(a.b) from string? - python

How can I remove a combination of letter-dot-letter (for example F.B) from a string in Python? I tried using a regex:
abre = re.sub(r"\b\w+\.\w+#",'',abre)
but it does not remove these sequences; it just returns the same unchanged string. I also tried removing all dots and then removing words shorter than 2 letters, but in that case I lose real words.
What I have: C.P.A. Certification Program, Accounting
What I want to get: Certification Program, Accounting
The length of the sequence is not always known and the letters are also unknown.

You seem to want to remove words that consist of dot-separated uppercase letters.
Use
abre = re.sub(r"\b(?:[A-Z]\.)+(?!\w)",'',abre)
See the regex demo. To also remove a trailing whitespace, you may add \s* at the end. If there must be at least two letters, replace + with {2,}.
Details:
\b - leading word boundary
(?:[A-Z]\.)+ - one or more sequences of
[A-Z] - an uppercase ASCII letter
\. - a dot
(?!\w) - not followed with a word char
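As a quick check of the pattern described above, here is a minimal sketch that applies it to the sample from the question. The helper name strip_abbreviations is hypothetical, and the \s* suffix is the optional trailing-whitespace variant mentioned in the answer.

```python
import re

# Pattern from the answer, with \s* appended to also drop trailing whitespace.
ABBREVIATION = re.compile(r"\b(?:[A-Z]\.)+(?!\w)\s*")

def strip_abbreviations(text):
    """Remove runs of dot-separated uppercase letters such as 'C.P.A.'.

    Hypothetical helper name, used for illustration only.
    """
    return ABBREVIATION.sub("", text)

print(strip_abbreviations("C.P.A. Certification Program, Accounting"))
# Certification Program, Accounting
```

Note that every letter must be followed by a dot for this pattern to match, so a bare F.B (no trailing dot) is left untouched.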

You can use str.replace if the exact sequence is known in advance:
>>> string="rgoa.bwtg.rgqra.bergeg"
>>> string.replace("a.b", "")
'rgowtg.rgqrergeg'

Related

Python Regex: Apostrophes only when placed within letters, not as quotation marks

define each word to be the longest contiguous sequence of alphabetic characters (or just letters), including up to one apostrophe if that apostrophe is sandwiched between two letters.
[a-z]+[a-z/'?a-z]*[a-z$]
It doesn't match the letter 'a'.
Something like this should work
[a-zA-Z]*(?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z])[a-zA-Z]*
Match 0 or more letters [a-zA-Z]*, followed by either an apostrophe surrounded by 2 letters or a single letter (?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z]), then match 0 or more letters [a-zA-Z]*.
For just lowercase letters
[a-z]*(?:[a-z]\'[a-z]|[a-z])[a-z]*
I'd use:
^(?:[a-z]+|[a-z]+'[a-z]+)$
with re.IGNORECASE
You seem to misunderstand the character class notation. The stuff between [ and ] is a list of characters to match. It does not make sense to list the same character multiple times, and basically all characters except ], - and \ (and an initial ^ for negation) simply match themselves, i.e. lose their regex special meaning.
Let's rephrase your requirement. You want an alphabetic [a-z] repeated one or more times +, optionally followed by an apostrophe and another sequence of alphabetics.
[a-z]+('[a-z]+)?
In some regex dialects, you might prefer the non-capturing opening parenthesis (?: instead of plain (.
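The following sketch tries the rephrased pattern with Python's re.findall. The non-capturing form matters here because findall returns group contents when the pattern has a capturing group. The function name find_words and the sample sentence are made up for illustration.

```python
import re

# One or more letters, optionally followed by one apostrophe and more letters.
# (?:...) is non-capturing, so findall returns the whole match.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?", re.IGNORECASE)

def find_words(text):
    """Return words, keeping an apostrophe only when it sits between letters."""
    return WORD.findall(text)

print(find_words("Don't say 'hello' to a stranger"))
# ["Don't", 'say', 'hello', 'to', 'a', 'stranger']
```

The quotation marks around hello are dropped because neither has letters on both sides, and the single-letter word a is still matched.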

Regex to find and list any three characters enclosed dashes and last match in the string

My regex finds three letters enclosed in dashes but only returns the first one in the string
(?:-)([A-Z]{3})+?(?:-)
I am trying to figure out a regex that finds all three-letter groups enclosed in dashes only, thus ignoring the first one, ABC
ABC-FOUR-ONE-FIVE-TWO
Can there be a regex that lists only ONE and TWO (matches all except the first one)?
You may use
re.findall(r'-([A-Z]{3})(?![^-])', text)
Or, its equivalent
re.findall(r'-([A-Z]{3})(?=-|$)', text)
See the regex demo and Python demo
Pattern details
- - a hyphen
([A-Z]{3}) - Capturing group 1: three uppercase letters
(?=-|$) / (?![^-]) - match (but do not consume) a - or end of string position.
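Both variants can be run against the sample string from the question. This is a small sketch, and the two function names are hypothetical.

```python
import re

text = "ABC-FOUR-ONE-FIVE-TWO"

def three_letter_groups(s):
    # Lookahead variant: the group must be followed by '-' or end of string.
    return re.findall(r"-([A-Z]{3})(?=-|$)", s)

def three_letter_groups_alt(s):
    # Negative-lookahead variant: the next character must not be a non-dash.
    return re.findall(r"-([A-Z]{3})(?![^-])", s)

print(three_letter_groups(text))      # ['ONE', 'TWO']
print(three_letter_groups_alt(text))  # ['ONE', 'TWO']
```

Because the lookahead does not consume the trailing dash, that dash stays available as the leading dash of the next group.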
Try something like this (-[A-Za-z]{3}(-|$)) (tested it at https://regex101.com/)
This regex says: Match a dash, then 3 [A-Za-z] characters and then finally the "-" character or "end of string"

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it in such a way that attached numbers are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
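To see how the two patterns behave on the string from the question, here is a short sketch using the standard re module. The constant names and the tokens helper are hypothetical, and the comparison is only checked on this one sample.

```python
import re

# Pattern from this answer and the shorter pattern it is compared with.
LETTER_DIGIT_WORDS = r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+"
AT_LEAST_ONE_LETTER = r"\d*[^\W\d_][^\W_]*"

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

def tokens(pattern, s):
    """Return all matches of pattern in the lowercased string."""
    return re.findall(pattern, s.lower())

print(tokens(LETTER_DIGIT_WORDS, text))
print(tokens(AT_LEAST_ONE_LETTER, text))
# both print ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
```

The purely numeric chunk 64664646 is skipped by both patterns because it contains no letter.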
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: \W matches any character that is not alphanumeric or _ (it is the negation of \w, which matches any alphanumeric character plus _, commonly equivalent to [a-zA-Z0-9_]), so [^\W] matches the same characters as \w.
It proves useful here for composing:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
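The effect of the word boundaries around an underscore can be illustrated with a small sketch. The input string is a made-up example, not taken from the question.

```python
import re

WITH_BOUNDARIES = r"\b\d*[^\W\d_][^\W_]*\b"
WITHOUT_BOUNDARIES = r"\d*[^\W\d_][^\W_]*"

sample = "foo_bar x1"  # hypothetical input containing an underscore

# '_' is a word character, so there is no \b between 'foo' and '_':
# the bounded pattern cannot match either half of 'foo_bar'.
print(re.findall(WITH_BOUNDARIES, sample))     # ['x1']

# Without boundaries, '_' simply acts as a delimiter.
print(re.findall(WITHOUT_BOUNDARIES, sample))  # ['foo', 'bar', 'x1']
```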
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by swapping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.

Regex to check if it is exactly one single word

I am basically trying to match a string against a pattern (wildcard match).
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, if there are patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and succeeded by exactly one word having no dots
key.* - 'key.' precedes exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex meaning exactly one word ([a-zA-Z0-9_]). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
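import sre_parse  # internal module; deprecated since Python 3.11

def convert_to_regexp(pattern):
    # Escape every regex special character except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join('\\' + c if c in special_characters else c for c in pattern)
    # Replace the wildcard with "one word"
    return safe_pattern.replace('*', '\\w+')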
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
Plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.
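The conversion rules above can be sketched with the documented re.escape and re.fullmatch functions, which avoid spelling out ^ and $ by hand. The function names wildcard_to_regex and wildcard_match are hypothetical, and this is one possible approach rather than the method used in the other answers.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a wildcard pattern such as '*.key' into a regex string.

    Literal pieces are escaped with re.escape, and every '*' becomes \\w+.
    """
    return r"\w+".join(re.escape(part) for part in pattern.split("*"))

def wildcard_match(pattern, text):
    """True if the whole text matches the wildcard pattern."""
    # fullmatch anchors at both ends, like wrapping the regex in ^...$
    return re.fullmatch(wildcard_to_regex(pattern), text) is not None

print(wildcard_match("*.key", "door.key"))               # True
print(wildcard_match("*.key", "brown.door.key"))         # False
print(wildcard_match("*.key.*", "brown.key.door"))       # True
print(wildcard_match("*.key.*", "brown.iron.key.door"))  # False
```

Splitting on '*' first means the wildcard never passes through re.escape, so no special case is needed to keep it unescaped.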
^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word, meaning a string that consists only of non-whitespace characters from start to end. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.
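The difference between the two character choices can be seen in a short sketch. The test strings are made up for illustration.

```python
import re

NOT_A_DOT = r"[^.]*\.key"  # "anything but a period", zero or more times
WORD_CHARS = r"\w+\.key"   # one or more word characters

def matches(regex, text):
    """True if the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

# [^.]* accepts whitespace and even an empty "word"; \w+ accepts neither.
for candidate in ["door.key", "front door.key", ".key"]:
    print(candidate, matches(NOT_A_DOT, candidate), matches(WORD_CHARS, candidate))
```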

Split string at capital letter but only if no whitespace

Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually 1 name but the code splits according to the 'split-at-capital-letter' rule.
How do I ensure it does not split at a capital letter if the preceding character is whitespace?
2) Special characters not handled well
The code used cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
You can use positive and negative lookbehind and just list the Umlauts explicitly:
>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results
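The same lookaround idea can also be written as a split rather than a findall. The sketch below shows that, plus a regex-free variant that relies on str.isupper() so the umlauts do not have to be listed. Both function names are hypothetical, and this is an alternative formulation rather than the answer's own code.

```python
import re

def split_names_regex(s):
    # Split at each position where an uppercase letter follows a
    # non-whitespace character. Splitting on an empty match needs Python 3.7+.
    return re.split(r"(?<=\S)(?=[A-ZÄÖÜ])", s)

def split_names_unicode(s):
    # Same rule without a regex; str.isupper() covers any Unicode uppercase letter.
    parts, start = [], 0
    for i in range(1, len(s)):
        if s[i].isupper() and not s[i - 1].isspace():
            parts.append(s[start:i])
            start = i
    parts.append(s[start:])
    return parts

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
print(split_names_regex(string))
print(split_names_unicode(string))
# both print ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```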
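This works for the sample string (note that the range [a-ü] also covers some non-letter characters such as { and ÷):
import re
string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']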
