Python regex, fetch names from a string - python

I have a string in the form of:
"[NUM : NAME : NUM]: [NUM : NAME : NUM]:..."
I want to be able to extract all the NAMEs out of this string. The NAME can have any character, ranging from alphabet to punctuation symbols and numbers. NUM is only in the form of [0-9]+
I tried issuing this command:
re.findall(r"\[[0-9]+\:([.]+)\:[0-9]+\]", string)
But instead of giving what I requested, it would bunch up a few [NUM : NAME : NUM]s into the [.]+ group, which is also correct in terms of this regex, but not what I need.
Any help would be much appreciated.

Try this:
re.findall(r"\[[0-9]+\:(.+?)\:[0-9]+\]", string)
Adding ? after the + makes the quantifier non-greedy. By default + is greedy: it takes as many characters as possible while still allowing the overall pattern to match. With the ?, it takes as few characters as possible while still matching.
The above will work if there are no spaces between num, :, and name.
If there are spaces then use:
re.findall(r"\[[0-9]+ \: (.+?) \: [0-9]+\]", string)
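To see the difference between the greedy and non-greedy quantifier, here is a minimal sketch. The sample string is made up for illustration and assumes no spaces around the colons.

```python
import re

# Hypothetical sample input with no spaces around the colons.
sample = "[1:foo:2]:[3:bar:4]"

# Greedy: .+ runs to the last ":NUM]" it can reach, swallowing several entries.
greedy = re.findall(r"\[[0-9]+:(.+):[0-9]+\]", sample)

# Non-greedy: .+? stops at the first ":NUM]" that completes a match.
lazy = re.findall(r"\[[0-9]+:(.+?):[0-9]+\]", sample)

print(greedy)  # ['foo:2]:[3:bar']
print(lazy)    # ['foo', 'bar']
```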

The first problem is that you have enclosed . inside a character class. There it loses its special meaning and matches only a literal dot (.).
Secondly, you are not accounting for the spaces after the numbers in your string.
Thirdly, you need a reluctant (non-greedy) quantifier on the .+ in the center. So, replace ([.]+) with (.+?).
Fourthly, you don't need to escape your colons (:).
You can try this:
>>> re.findall(r'\[[0-9]+[ ]*:(.+?):[ ]*[0-9]+\]', string)
[' NAME ', ' NAME ']
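If the surrounding spaces should not end up in the captured names, one possible variation is to put \s* on both sides of each colon. This is a sketch on a made-up sample string, not the only way to do it.

```python
import re

# Hypothetical sample input with irregular spacing around the colons.
sample = "[12 : Alice : 3]: [4:Bob-7! :56]"

# \s* on both sides of each colon keeps the padding out of the captured group.
pattern = re.compile(r"\[\s*[0-9]+\s*:\s*(.+?)\s*:\s*[0-9]+\s*\]")

names = pattern.findall(sample)
print(names)  # ['Alice', 'Bob-7!']
```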

Related

Replacing specific characters after a string match

I'm looking to replace specific characters in numbers I'm extracting but I cannot figure out how to do so.
Here "," is the float separator and (' or .) are thousands separators. I can match this way :
>>> myString = "I buy 456'123,45 then 45.654 then 123. I'm 30."
>>> re.findall(r"(?:\d+)(?:['|.]\d+)+(?:[,]\d+)?", myString)
["456'123,45", '45.654']
I want to remove all thousands separators from my string to get this:
>>> newString
"I buy 456123,45 then 45654 then 123. I'm 30."
I'm pretty sure I need to use groups and subgroups to replace what I want, but I don't know how to deal with groups when "()+" is present. The number can also be very long (e.g. 123'456'789'123'456'789,123).
Thanks
You may use re.sub with
(?<=\d)['.](?=\d)
and replace with an empty string.
Details
(?<=\d) - (positive lookbehind) a digit must appear immediately to the left of the current location
['.] - a single quote or a dot
(?=\d) - (positive lookahead) a digit must appear immediately to the right of the current location.
Python:
re.sub(r"(?<=\d)['.](?=\d)", "", myString)
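As a quick check, the lookaround pattern can be run against the strings from the question. The variable names below are only for illustration.

```python
import re

my_string = "I buy 456'123,45 then 45.654 then 123. I'm 30."

# Remove ' or . only when a digit sits on both sides of it.
thousands_sep = re.compile(r"(?<=\d)['.](?=\d)")
new_string = thousands_sep.sub("", my_string)

print(new_string)  # I buy 456123,45 then 45654 then 123. I'm 30.

# The long number from the question: every separator between digits is dropped.
long_number = thousands_sep.sub("", "123'456'789'123'456'789,123")
print(long_number)  # 123456789123456789,123
```

Because the lookarounds consume nothing, no groups are needed and the length of the number does not matter.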

Remove spaces from string after and before letter

I have quite a few strings that look like this: "a name / another name / something else".
I want to get to this: "a name/another name/something else".
Basically removing the spaces before and after the forward slashes only (not between the words themselves).
I know nothing about programming but I looked and found that this can be done with Python and Regex. I was a bit overwhelmed though with the amount of information I found.
You can use the pattern:
(?:(?<=\/) | (?=\/))
(?: Non capturing group.
(?<=\/) Lookbehind for /.
| OR
(?=\/) Positive lookahead for /.
) Close non capturing group.
Python snippet:
import re
s = 'a name / another name / something else'  # avoid shadowing the built-in str
print(re.sub(r'(?:(?<=\/) | (?=\/))', '', s))
Prints:
a name/another name/something else
There's no need for regex here, since you're simply replacing a string of literals.
s = "a name / another name / something else"
print(s.replace(" / ", "/"))
Here is an answer without using regex that I feel is easier to understand
string = "a name / another name / something else"
edited = "/".join([a.strip() for a in string.split("/")])
print(edited)
output:
a name/another name/something else
.join() joins the elements of a sequence with a given separator
.strip() removes leading and trailing whitespace
.split() splits the string into tokens at the given character
This pattern matches any amount of whitespace surrounding a / so that it can be removed. I think the regex is relatively easy to understand:
\s*([\/])\s*
It has a capturing group that matches the forward slash (that's the middle part). The \s* parts match whitespace (zero or more characters).
You can then replace each match with just a / to get rid of all the surrounding whitespace.
str1 being your string:
re.sub(r"\s*([\/])\s*", "/", str1)
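When the spacing around the slashes is uneven, a plain " / " replacement no longer fits, but a \s*/\s* pattern and the split/strip approach both still work. A small sketch on a made-up input:

```python
import re

# Hypothetical input with uneven spacing (including none) around the slashes.
text = "a name  /another name /   something else"

# Regex version: swallow any whitespace on either side of each slash.
with_regex = re.sub(r"\s*/\s*", "/", text)

# Non-regex version: split on the slash, trim each piece, and rejoin.
with_split = "/".join(part.strip() for part in text.split("/"))

print(with_regex)  # a name/another name/something else
print(with_split)  # a name/another name/something else
```

Note that the split/strip version also trims whitespace at the very start and end of the string, which the regex version leaves alone.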
Use the following code to remove all spaces before and after the / character:
import re
s = 'a name / another name / something else'  # avoid shadowing the built-in str
s = re.sub(r'(?:(?<=\/)\s*|\s*(?=\/))', '', s)

Regex to check if it is exactly one single word

I am trying to match strings against a wildcard pattern.
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, given patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
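A possible variation on the conversion above: sre_parse is an internal module (deprecated since Python 3.11), so the sketch below uses the documented re.escape instead, and re.fullmatch to require that the whole string matches. The helper names wildcard_to_regex and wildcard_match are made up for this example.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a '*'-wildcard pattern into a compiled regex (hypothetical helper)."""
    # Escape each literal piece, then join the pieces with \w+ where '*' stood.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(r"\w+".join(parts))

def wildcard_match(pattern, text):
    # fullmatch requires the whole string to match, like ^...$ anchors.
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match("*.key", "door.key"))               # True
print(wildcard_match("*.key", "brown.door.key"))         # False
print(wildcard_match("*.key.*", "brown.key.door"))       # True
print(wildcard_match("*.key.*", "brown.iron.key.door"))  # False
```

Splitting on '*' before escaping avoids having to special-case the star in the escaping step.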
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word meaning a string of characters that start and end with non-spaced string. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

What does the following re code mean?

I don't understand this code in Python. I know what the function does but I don't get how and what those \\ do in the replacement string.
import re
caps="([A-Z])"
pre="(Mr|mr|Mr|St|st|ST|Mrs|MRS|mrs|Ms|MS|ms|Dr|DR|dr|miss|Miss|MISS)[\.\.\.]"
def tokenize_sentence(text):
    text=re.sub(" ?"+pre,"\\1<dot>",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    print(text)
tokenize_sentence("Mr. Ansh sahajpal A.B.C.")
The output is:
Mr<dot> Ansh sahajpal A<prd>B<prd>C<prd>
It would be great if you could explain what the \\1, \\2 and \\3 do in the sub calls. I also changed the input to Mr... Ansh sahajpal and made the following change in line 5:
text=re.sub(" ?"+pre,"\\1<dot>\\2<dot>",text)
I thought it would replace the first . and the second ., but it gave me an error.
Any explanation will do.
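\1, \2, etc. refer to the capturing groups (the subexpressions enclosed in parentheses) of the regular expression: \1 is the text matched by the first group, \2 by the second, and so on. In the replacement string they mark where that captured text is inserted. In a normal (non-raw) string literal, "\\1" is just the two characters \1; writing r"\1" is equivalent.
Your pattern pre contains only one group, so \\2 refers to a group that does not exist, which is why your modified line raised an error.
Here is an example:
s = 'an example word:cat!!'
print(re.sub(r'(word:)\w+', r'\1dog', s))
which matches word: followed by one or more alphanumeric characters, re-inserts the captured word: via \1, and yields:
an example word:dog!!
In this case, (word:) is the captured group, and \w+ matches the run of alphanumeric characters (including "_") that gets replaced. Note that \0 does not refer to the whole match in Python; use \g<0> for that.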
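To make the group numbering concrete, here is a small sketch. The patterns are simplified versions of the ones in the question (a shorter title list, written as raw strings), so treat them as illustrative rather than as the original code.

```python
import re

text = "Mr. Ansh sahajpal A.B.C."

# One group in the pattern, so only \1 is available in the replacement.
titles = re.sub(r"(Mr|Mrs|Dr)[.]", r"\1<dot>", text)
print(titles)  # Mr<dot> Ansh sahajpal A.B.C.

# Three groups, so \1, \2 and \3 each re-insert one captured letter.
initials = re.sub(r"([A-Z])[.]([A-Z])[.]([A-Z])[.]", r"\1<prd>\2<prd>\3<prd>", text)
print(initials)  # Mr. Ansh sahajpal A<prd>B<prd>C<prd>

# \g<0> stands for the whole match.
whole = re.sub(r"[A-Z][.]", r"<\g<0>>", "A.B.")
print(whole)  # <A.><B.>

# Referring to a group the pattern does not define raises re.error.
try:
    re.sub(r"(Mr|Mrs|Dr)[.]", r"\1<dot>\2<dot>", text)
    raised = False
except re.error as exc:
    raised = True
    print("error:", exc)
```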

Correct usage of regex lookahead in python

I have a string that is supposed to list some dollar amounts and looks like this:
4000.05 . 5.200000000 300.650000 2000 .
It is ultimately supposed to look like this:
4000.05 5200000000 300650000 2000
with all non-decimal periods removed. I am attempting to use this regex to remove all periods that are not followed by two numbers and then a non-numeric character:
re.sub(".(?!([0-9])?!([0-9])?=([0-9]))","",f)
but this ends up emptying the entire string. How can I accomplish this?
First of all, a dot is a meta-character in regex, that matches any character. You need to escape it. Or put in a character class, where meta-characters don't have any special meaning. Of course you need to escape the closing brackets ], which will otherwise be taken as the end of character class.
Secondly your negative look-ahead is flawed.
Try something like this:
re.sub(r'[.](?![0-9]{2}\W)',"",s)
You need something like this.
string = '4000.05 . 5.200000000 300.650000 2000 .'
print(re.sub(r'[.](?![0-9]{2}\D)', '', string))
The regular expression:
[.] any character of: '.'
(?! look ahead to see if there is not:
[0-9]{2} any character of: '0' to '9' (2 times)
\D match non-digits (all but 0-9)
) end of look-ahead
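One caveat with a lookahead ending in \D or \W: it needs a character after the two digits, so an amount at the very end of the string (for example "7.25") would lose its period. A possible variation, sketched below, nests a second negative lookahead so that "two digits not followed by another digit" also holds at the end of the string. The whitespace cleanup step is an extra added here to reach the exact output shown in the question.

```python
import re

amounts = "4000.05 . 5.200000000 300.650000 2000 ."

# Drop a period unless exactly two digits follow it (then a non-digit or the end).
keep_decimals = re.compile(r"[.](?![0-9]{2}(?![0-9]))")
no_periods = keep_decimals.sub("", amounts)

# Collapse the leftover runs of spaces and trim the ends.
cleaned = " ".join(no_periods.split())

print(cleaned)  # 4000.05 5200000000 300650000 2000

# An amount at the end of the string keeps its decimal period.
print(keep_decimals.sub("", "7.25"))  # 7.25
```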
