I have a regex that picks up an address from OCR data.
The regex looks for 'Address', 'address', etc. and picks up the characters that follow, up to the first occurrence of four consecutive digits.
The captured text includes special characters, which need to be stripped.
I run a loop to remove all the special characters, as seen in the code below. However, I need the code to drop only the special characters that appear before the first alphanumeric character.
Input (from OCR on an image):
Service address β_Unit8-10 LEWIS St, BERRI,SA 5343
possible_addresses = list(re.findall('Address(.* \d{4}|)|address(.*\d{4})|address(.*)', data))
address = str(possible_addresses[0])
for k in address.split("\n"):
    if k != ['(A-Za-Z0-9)']:
        address_2 = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
Got now:
address : β_Unit 8 - 10 LEWIS ST, BERRI SA 5343
address_2 : Unit 8 10 LEWIS ST BERRI SA 5343
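[\W_]* consumes the special characters between "address" and the first alphanumeric character, so they stay out of the capture group. In Python 3, \W is Unicode-aware and treats β as a word character, so the re.A (ASCII) flag is needed for it to be skipped.
import re
data = 'Service address β_Unit8-10 LEWIS St, BERRI,SA 5343'
# [\W_]* skips the leading junk; (.*?\d{4}) lazily captures up to the first run of four digits
possible_addresses = re.search(r'address[\W_]*(.*?\d{4})', data, re.I | re.A)
address = possible_addresses[1]
print('Address : ' + address)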
Address : Unit8-10 LEWIS St, BERRI,SA 5343
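If the address has already been captured and only the leading junk needs to go, a single anchored substitution may be enough. The following is a minimal sketch rather than the answer's method: the helper name strip_leading_specials is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits, as in the question's own re.sub call.

```python
import re

def strip_leading_specials(text):
    """Drop everything before the first ASCII letter or digit."""
    # ^ anchors the match at the start, so special characters
    # that appear later in the text are left untouched.
    return re.sub(r'^[^A-Za-z0-9]+', '', text)

address = 'β_Unit8-10 LEWIS St, BERRI,SA 5343'
print(strip_leading_specials(address))
```

Because the character class is spelled out instead of using \W, the result does not depend on whether the pattern is compiled in Unicode or ASCII mode.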
I'm guessing that the expression we wish to design here should match everything from address to a four-digit zip code, excluding some defined characters such as _. Let's start with a simple expression with the i (case-insensitive) flag, such as:
(address:)[^_]*|[^_]*\d{4}
Any character that we want to exclude goes inside the negated class [^_]. For instance, to also exclude !, the expression becomes:
(address:)[^_!]*|[^_!]*\d{4}
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(address:)[^_]*|[^_]*\d{4}"
test_str = ("Address: Unit 8 - 10 LEWIS ST, BERRI SA 5343 \n"
"Address: Got now: __Unit 8 - 10 LEWIS ST, BERRI SA 5343\n"
"aDDress: Unit 8 10 LEWIS ST BERRI SA 5343\n"
"address: any special chars here !##$%^&*( 5343\n"
"address: any special chars here !##$%^&*( 5343")
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I need to parse a line of text, separate it into parts, and add them to a list, which I was able to do with the help of re.compile('regexp'). The problem is that some of the text does not match the pattern, and I need to know where it is, how to detect it, and what it is, so that I can show an error.
The code matches and extracts everything correctly, but I also need to collect the 12 and the 32, which do not match the regexp.
import re
str = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 3 c -3 ltr'
a=re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')
r=a.findall(str)
print (r)
Initial String:
str= '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
List parsed correctly:
['455c', '2v', '12tv', ' v', '0.5b', '-3b', '-b', '+b', '-3li', ' b', '-0.5b', '3 c', '-3 ltr']
List that I also need, along with any other unmatched string, e.g. (/%&$%):
[12, 32, 1]
My guess is that, since we do not want the bare digits mixed in with the rest, we could start with a simple expression:
\b([\d]{1,}\s)\b|([\w+-.]+)
with two parts:
\b([\d]{1,}\s)\b
are our undesired digits, and
([\w+-.]+)
has our desired outputs.
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"\b([\d]{1,}\s)\b|([\w+-.]+)"
test_str = "12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 3 c -3 ltr"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I solved this myself by removing each correctly parsed token from the initial string, which leaves only the difference; I then split that to get it as a list.
import re
str = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
a=re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')
r=a.findall(str)
print (r)
errors = str
# remove the first occurrence of each matched token; what remains did not match
for t in r:
    errors = errors.replace(t, '', 1)
errors = errors.split()
print(errors)
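The replace-based approach relies on each matched token being found again as a substring. An alternative sketch, under the assumption that the same pattern is kept, walks the match positions with finditer and collects whatever lies between consecutive matches. The names TOKEN and split_matched_unmatched are hypothetical.

```python
import re

# Same pattern as in the question.
TOKEN = re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')

def split_matched_unmatched(text):
    """Return (matched tokens, whitespace-separated unmatched pieces)."""
    matched, unmatched = [], []
    last_end = 0
    for m in TOKEN.finditer(text):
        # Anything between the previous match and this one was not matched.
        unmatched.extend(text[last_end:m.start()].split())
        matched.append(m.group())
        last_end = m.end()
    # Trailing text after the final match.
    unmatched.extend(text[last_end:].split())
    return matched, unmatched

line = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
matched, unmatched = split_matched_unmatched(line)
print(unmatched)
```

Since m.start() and m.end() are available at each step, the same loop could also record where each unmatched piece sits in the line for the error message.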
lrgstPlace = features[0]
strLrgstPlace = str(lrgstPlace)
longtide = re.match("r(lat=)([\-\d\.]*)",strLrgstPlace)
print (longtide)
This is how my features list looks like
Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3)
Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)
Why can't the regex match anything? It just gives me "None" as a result.
I think you meant to put the r outside the quotes:
r"(lat=)([\-\d\.]*)"
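One more thing is worth checking here: re.match only matches at the very beginning of the string, and the feature text begins with "Feature(", so even the corrected pattern returns None when used with re.match. A minimal sketch using re.search instead; feature_text is a stand-in built from the first sample line in the question.

```python
import re

# Stand-in for str(features[0]), copied from the question's sample output.
feature_text = ("Feature(place='28km S of Cliza, Bolivia', long=-65.8913, "
                "lat=-17.8571, depth=358.34, mag=6.3)")

pattern = r"(lat=)([\-\d\.]*)"  # the r prefix sits outside the quotes

anchored = re.match(pattern, feature_text)   # only tries position 0
found = re.search(pattern, feature_text)     # scans the whole string

print(anchored)
print(found.group(2))
latitude = float(found.group(2))
```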
Your original expression works once the r is outside the quotes. If we only want to extract the lat numbers, we can modify it slightly:
(?:lat=)([0-9\.\-]+)(?:,)
where ([0-9\.\-]+) would capture our desired lat, and we wrap it with two non-capturing groups:
(?:lat=)
(?:,)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:lat=)([0-9\.\-]+)(?:,)"
test_str = "Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3) Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.
The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings whose bracketed text starts with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
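Since the question hoped to avoid lookarounds, a plain capture group is another option: the brackets are part of the overall match but only the group is returned. This is a small sketch with a hypothetical helper name, text_in_brackets; the last line shows the semicolon pattern from above on a made-up string.

```python
import re

def text_in_brackets(s):
    """Return the text inside the first [...] pair, or None if there is none."""
    # \[ and \] match the literal brackets; the group captures what lies between.
    m = re.search(r"\[([^\]]+)\]", s)
    return m.group(1) if m else None

print(text_in_brackets("PW[Yasui Chitetsu];"))

# Lookarounds do not consume the delimiters, so neighbouring fields
# can share a semicolon and still both be found.
print(re.findall(r"(?<=;)([^;]+)(?=;)", "a;b;c;d"))
```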
My text is:
my_text = """ ["supra","value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="}};</script> """
I want to extract the value, which is:
ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==
I've tried this:
extract_posted_data = re.search(r'(\"value\": \")(\w*)', my_text)
print (extract_posted_data.group(2))
and this is what I received:
ddad7f1eada3c52c66cmh6ZG8tf
It isn't extracting the complete value.
Thanks
The hyphen (-) is not included in \w, and neither is =.
You'll need to use [\w=-]* instead of \w*.
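Applied to the text from the question, that change might look like the sketch below. The \s* after the colon is an assumption added here so the pattern tolerates an optional space; it is not part of the original attempt.

```python
import re

my_text = (' ["supra","value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph'
           '-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="}};</script> ')

# [\w=-]* extends \w with '=' and '-'; a hyphen placed last in the class is literal.
match = re.search(r'"value":\s*"([\w=-]*)', my_text)
value = match.group(1) if match else None
print(value)
```

The match stops at the closing double quote because that character is not in the class.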
The regex that you're looking for is r"\"value\":\"(\S+)\"" and the required string for you is available in group(1) of the match
Here's a live link to the regex with your test string to test. Regex101 also has code generators that you could use to generate the required python code and test.
https://regex101.com/r/p2N524/1
import re
regex = r"\"value\":\"(\S+)\""
test_str = "[\"supra\",\"value\":\"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==\"}};</script> "
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Result
Match 1 was found at 9-123: "value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="
Group 1 found at 18-122: ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==
Given a name string, I want to validate a few basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and is of length < 40.
I know the latter can be accomplished via regex but is there a unicode way to accomplish the first? Are there any text processing libraries I can leverage?
You should be able to check this using the Unicode Character classes in regex.
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of whitespace character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
You may want to add more like \p{Sc} to match currency symbols, etc.
But to be able to take advantage of this, you need to use the regex module (an alternative to the standard re module) that supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re
regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! πΊπ» \nWow cool song! πΊπ»Wow cool song! πΊπ»Wow cool song! πΊπ»\n")
matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
PS: .NET Regex gives you some more options like \p{IsGreek}.
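If installing the third-party regex module is not an option, the standard library's unicodedata module can approximate the same check. The sketch below swaps in that technique: it accepts letters and combining marks from any script, rejects digits and symbols such as emoji, and enforces the length limit from the question. The helper name is_plausible_name and the small punctuation allow-list are assumptions for illustration, and the check does not verify that a name sticks to a single script.

```python
import unicodedata

MAX_NAME_LENGTH = 40  # from the question: length must be < 40
ALLOWED_EXTRAS = {" ", "-", "'", "."}  # hypothetical allow-list, adjust as needed

def is_plausible_name(name):
    """Return True if name is short enough and made of letters in any script."""
    if not name or len(name) >= MAX_NAME_LENGTH:
        return False
    for ch in name:
        if ch in ALLOWED_EXTRAS:
            continue
        # Categories starting with L are letters, M are combining marks.
        # Digits are N*, emoji and other symbols are S*, so both are rejected.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False
    return True

print(is_plausible_name("Yasui Chitetsu"))
print(is_plausible_name("Bob \U0001F37A"))
```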