Strip special characters in front of the first alphanumeric character


I have a regex that picks up an address from data.
The regex looks for 'Address | address | etc.' and captures the characters that follow, up to the first occurrence of 4 digits together.
The captured text includes special characters, which need to be stripped.
I run a loop to exclude all the special characters, as seen in the code below, but I need the code to drop only the special characters that appear in front of the first alphanumeric character.
Input (from OCR on an image):
Service address —_Unit8-10 LEWIS St, BERRI,SA 5343
possible_addresses = list(re.findall('Address(.* \d{4}|)|address(.*\d{4})|address(.*)', data))
address = str(possible_addresses[0])
for k in address.split("\n"):
    if k != ['(A-Za-Z0-9)']:
        address_2 = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
Current output:
address : —_Unit 8 - 10 LEWIS ST, BERRI SA 5343
address_2 : Unit 8 10 LEWIS ST BERRI SA 5343

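[\W_]* matches the special characters (any run of non-word characters or underscores) that follow "address", so they stay outside the capture group.
import re
data = 'Service address —_Unit8-10 LEWIS St, BERRI,SA 5343'
# Raw string so that \W and \d reach the regex engine unchanged.
possible_addresses = re.search(r'address[\W_]*(.*?\d{4})', data, re.I)
address = possible_addresses[1]
print('Address :', address)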
Address : Unit8-10 LEWIS St, BERRI,SA 5343
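If the address has already been captured together with its leading junk (as in the question's address variable), the same character class can be anchored at the start of the string so that only what precedes the first alphanumeric character is removed. This is a minimal sketch, not the answerer's code; the helper name strip_leading_specials is hypothetical, and \u2014 stands for the dash from the OCR output.

```python
import re

def strip_leading_specials(text):
    """Drop every non-alphanumeric character before the first letter or digit."""
    # ^ anchors at the start; [\W_]+ is a run of non-word characters or underscores.
    return re.sub(r'^[\W_]+', '', text)

raw = '\u2014_Unit8-10 LEWIS St, BERRI,SA 5343'
print(strip_leading_specials(raw))
# Unit8-10 LEWIS St, BERRI,SA 5343
```

Special characters inside the address (the hyphen and the commas) are left alone because the pattern can only match at position 0.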

I'm guessing that the expression should capture everything from address to a four-digit postcode, excluding certain characters such as _. Let's start with a simple expression with the i (case-insensitive) flag:
(address:)[^_]*|[^_]*\d{4}
Any character that we do not want goes inside the negated class [^_]. For instance, to also exclude !, the expression becomes:
(address:)[^_!]*|[^_!]*\d{4}
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(address:)[^_]*|[^_]*\d{4}"
test_str = ("Address: Unit 8 - 10 LEWIS ST, BERRI SA 5343 \n"
"Address: Got now: __Unit 8 - 10 LEWIS ST, BERRI SA 5343\n"
"aDDress: Unit 8 10 LEWIS ST BERRI SA 5343\n"
"address: any special chars here !##$%^&*( 5343\n"
"address: any special chars here !##$%^&*( 5343")
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Related

Python regexp obtain not matched content

I need to parse a line of text, separate it into parts, and add them to a list, which I was able to do with the help of re.compile('regexp'). The problem is that some of the text does not match the pattern, and I need to know where it is, how to detect it, and what it is, so that I can show an error.
The code matches and filters everything correctly; what I still need is to pick out the 12 and the 32, which do not match the regexp.
import re
str = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 3 c -3 ltr'
a=re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')
r=a.findall(str)
print (r)
Initial string:
str= '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
List parsed correctly:
['455c', '2v', '12tv', ' v', '0.5b', '-3b', '-b', '+b', '-3li', ' b', '-0.5b', '3 c', '-3 ltr']
List that I need as well, including any other unmatched string, e.g. (/%&$%):
[12, 32, 1]

If the goal is to separate the standalone numbers from the rest, we could start with a simple expression:
\b([\d]{1,}\s)\b|([\w+-.]+)
It has two parts. The first group,
\b([\d]{1,}\s)\b
captures the undesired standalone numbers, and the second group,
([\w+-.]+)
captures the desired tokens.
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"\b([\d]{1,}\s)\b|([\w+-.]+)"
test_str = "12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 3 c -3 ltr"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

I've solved this myself by removing the correctly parsed tokens from the initial string, which leaves the difference; I then split it to get a list.
import re
text = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
a = re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')
r = a.findall(text)
print(r)
errors = text
for t in r:
    # Remove one occurrence of each matched token; whatever remains was not matched.
    errors = errors.replace(t, '', 1)
errors = errors.split()
print(errors)
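The question also asks where the unmatched text is, which the replace-based approach does not record. One possible sketch, assuming the same pattern as above, walks the gaps between consecutive match spans and reports each leftover token with its offset. The names TOKEN and unmatched_tokens are hypothetical.

```python
import re

TOKEN = re.compile(r'[+-]?[0-9]*\.[0-9]+\s*[a-z]+|[+-]?[0-9]*\s*[a-z]+')

def unmatched_tokens(text):
    """Return (offset, token) pairs for non-whitespace runs that no TOKEN match covers."""
    found = []
    pos = 0
    spans = [m.span() for m in TOKEN.finditer(text)]
    # Sentinel span so the text after the last match is inspected as well.
    spans.append((len(text), len(text)))
    for start, end in spans:
        # Everything between the previous match and this one is a gap.
        for gap in re.finditer(r'\S+', text[pos:start]):
            found.append((pos + gap.start(), gap.group()))
        pos = end
    return found

line = '12 32 455c 2v 12tv v 0.5b -3b -b+b-3li b-0.5b 1 3 c -3 ltr'
print(unmatched_tokens(line))
# [(0, '12'), (3, '32'), (46, '1')]
```

Because the offsets refer to the original string, they can be used directly in an error message.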

RegEx for extracting latitude in a string

lrgstPlace = features[0]
strLrgstPlace = str(lrgstPlace)
longtide = re.match("r(lat=)([\-\d\.]*)",strLrgstPlace)
print (longtide)
This is what my features list looks like:
Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3)
Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)
Why can't the regex match anything? It just gives me "None" as a result.

I think you meant to put the r outside the quotes:
r"(lat=)([\-\d\.]*)"
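Moving the r prefix is necessary but may not be sufficient on its own: re.match only tries to match at the very start of the string, and the text here starts with Feature(, not lat=. A minimal sketch using re.search with the same pattern; the variable names are hypothetical and the sample string is taken from the question.

```python
import re

feature_repr = ("Feature(place='28km S of Cliza, Bolivia', long=-65.8913, "
                "lat=-17.8571, depth=358.34, mag=6.3)")

# re.match only tries position 0, where the text is "Feature(", so it fails.
anchored = re.match(r"(lat=)([\-\d\.]*)", feature_repr)

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(r"(lat=)([\-\d\.]*)", feature_repr)
latitude = float(found.group(2)) if found else None

print(anchored)   # None
print(latitude)   # -17.8571
```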

Your original expression works fine; we might modify it slightly if we only wish to extract the lat values:
(?:lat=)([0-9\.\-]+)(?:,)
where ([0-9\.\-]+) would capture our desired lat, and we wrap it with two non-capturing groups:
(?:lat=)
(?:,)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:lat=)([0-9\.\-]+)(?:,)"
test_str = "Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3) Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Regex for Text Between Brackets and Text Between Semicolons

I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.

The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], the group would contain only ul McCartne and the full match would be aul McCartney.
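The cause is that [^(PW\[)] is a character class: it matches one single character that is not (, P, W, [ or ), and does not mean "anything except the prefix PW[". A short sketch that runs the attempted pattern on both names makes the effect visible; the variable names are hypothetical.

```python
import re

# The pattern from the question: one excluded-set character, a greedy group,
# then one character that is not a closing bracket.
attempt = re.compile(r"[^(PW\[)](.*)[^\]]")

for text in ("PW[Yasui Chitetsu]", "PW[Paul McCartney]"):
    m = attempt.search(text)
    print(repr(m.group(0)), repr(m.group(1)))
# 'Yasui Chitetsu' 'asui Chitets'
# 'aul McCartney' 'ul McCartne'
```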
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
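A quick sketch of the semicolon variant in use. The sample record is made up for illustration. Since the lookarounds do not consume the semicolons, neighbouring fields can share a delimiter, but text before the first or after the last semicolon is not returned.

```python
import re

between_semicolons = re.compile(r"(?<=;)([^;]+)(?=;)")

# Hypothetical input: only text with a semicolon on both sides is captured.
record = "PW[Yasui Chitetsu];first field;second field;trailing"
print(between_semicolons.findall(record))
# ['first field', 'second field']
```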

how can i extract values inside quotes using regex?

my text is
my_text = """ ["supra","value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="}};</script> """
i want to extract the value which is
ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==
i've tried this
extract_posted_data = re.search(r'(\"value\": \")(\w*)', my_text)
print (extract_posted_data.group(2))
and this is what I received:
ddad7f1eada3c52c66cmh6ZG8tf
It isn't extracting the complete value.
Thanks

The hyphen (-) is not included in \w, and neither is =.
You'll need to use [\w=-]* instead of \w*.
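A minimal sketch of that change applied to the text from the question. The \s* after the colon is an addition of mine so that the pattern tolerates an optional space there; it is not part of the original attempt.

```python
import re

my_text = ''' ["supra","value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="}};</script> '''

# A hyphen placed last inside the class is a literal hyphen, not a range.
extract_posted_data = re.search(r'"value":\s*"([\w=-]*)', my_text)
print(extract_posted_data.group(1))
```

The capture stops at the closing quote because " is not in the class.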

The regex that you're looking for is r"\"value\":\"(\S+)\"", and the string you need is available in group(1) of the match.
Here's a live link to the regex with your test string. Regex101 also has code generators that you can use to generate the required Python code and test it.
https://regex101.com/r/p2N524/1
import re
regex = r"\"value\":\"(\S+)\""
test_str = "[\"supra\",\"value\":\"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==\"}};</script> "
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
Result
Match 1 was found at 9-123: "value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="
Group 1 found at 18-122: ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==

Python - Basic validation of international names?

Given a name string, I want to validate a few basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and is of length < 40.
I know the latter can be accomplished via regex but is there a unicode way to accomplish the first? Are there any text processing libraries I can leverage?

You should be able to check this using the Unicode Character classes in regex.
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of invisible character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
You may want to add more like \p{Sc} to match currency symbols, etc.
But to be able to take advantage of this, you need to use the regex module (an alternative to the standard re module) that supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re
regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! 🕺🏻 \nWow cool song! 🕺🏻Wow cool song! 🕺🏻Wow cool song! 🕺🏻\n")
matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
PS: .NET Regex gives you some more options like \p{IsGreek}.
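Note that the pattern above finds runs of 40 or more allowed characters, and \w also accepts digits. For the two conditions in the question, a rough alternative that stays within the standard library is to check each character's Unicode general category with unicodedata.category. This is a sketch under my own assumptions, not the answerer's method: the function name, the set of allowed separators, and the decision to accept any letter or combining mark are all illustrative, and it checks "is a letter" rather than "belongs to a specific script".

```python
import unicodedata

# Assumed separators that commonly appear in names; adjust as needed.
ALLOWED_SEPARATORS = {" ", "-", "'", "\u2019", "."}

def is_plausible_name(name, max_length=40):
    """Return True if name is non-blank, shorter than max_length, and made only
    of letters, combining marks, and the allowed separators."""
    if not name.strip() or len(name) >= max_length:
        return False
    for ch in name:
        if ch in ALLOWED_SEPARATORS:
            continue
        # General categories starting with "L" are letters, "M" are combining marks.
        # Digits ("Nd") and emoji (typically "So") are rejected here.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False
    return True

print(is_plausible_name("Jos\u00e9 Garc\u00eda"))   # True
print(is_plausible_name("R2D2"))                    # False
print(is_plausible_name("Cool \U0001F57A"))         # False
```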

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
