My text is:
my_text = """ ["supra","value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="}};</script> """
I want to extract the value, which is:
ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==
I've tried this:
extract_posted_data = re.search(r'(\"value\": \")(\w*)', my_text)
print(extract_posted_data.group(2))
and this is what I received:
ddad7f1eada3c52c66cmh6ZG8tf
It isn't extracting the complete value.
Thanks
The hyphen (-) is not included in \w, and neither is =, so \w* stops at the first hyphen.
You'll need to use [\w=-]* instead of \w*.
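A minimal sketch of that fix, assuming a shortened, made-up stand-in for the page source (the real value is longer). The `\s*` after the colon is an addition here so the pattern tolerates an optional space.

```python
import re

# Shortened, hypothetical stand-in for the text in the question.
my_text = ' ["supra","value":"ddad7f1e-nLt1A596-ceBld_x=="}};</script> '

# [\w=-] accepts word characters plus '=' and '-'; the '-' is placed last
# in the class so it is read literally rather than as a range.
match = re.search(r'"value":\s*"([\w=-]*)', my_text)
value = match.group(1) if match else None
print(value)
```

Checking `match` before calling `group` avoids an AttributeError when the key is absent.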
The regex you're looking for is r"\"value\":\"(\S+)\""; the string you need is in group(1) of the match.
Here's a live link to the regex with your test string. Regex101 also has a code generator that you can use to produce the Python code below and test it.
https://regex101.com/r/p2N524/1
import re
regex = r"\"value\":\"(\S+)\""
test_str = "[\"supra\",\"value\":\"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==\"}};</script> "
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Result
Match 1 was found at 9-123: "value":"ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A=="
Group 1 found at 18-122: ddad7f1eada3c52c66cmh6ZG8tf-nLt1A596b7URouAxiT1JKph-ceBld-ISJapdG6bKrE1kvru158hLUBx2GdzABc6PHP-gNbnD8A==
I have a dataframe with a column containing the string representation of a list of ObjectIds, e.g.:
"[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"
I want to convert it from a string literal to a Python list of just the ids, like:
['5d28938629fe749c7c12b6e3', '5caf4522a30528e3458b4579']
json.loads and ast.literal_eval both fail because the string contains ObjectId.
Here is a regex for this: https://regex101.com/r/m5rW2q/1
You can click on the code generator there to get, for example:
import re
regex = r"ObjectId\('(\w+)'\)"
test_str = "[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
output:
Match 1 was found at 1-37: ObjectId('5d28938629fe749c7c12b6e3')
Group 1 found at 11-35: 5d28938629fe749c7c12b6e3
Match 2 was found at 39-75: ObjectId('5caf4522a30528e3458b4579')
Group 1 found at 49-73: 5caf4522a30528e3458b4579
for your example:
import re
regex = r"ObjectId\('(\w+)'\)"
test_str = "[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"
matches = re.finditer(regex, test_str, re.MULTILINE)
[i.groups()[0] for i in matches]
output:
['5d28938629fe749c7c12b6e3', '5caf4522a30528e3458b4579']
The full documentation for the re module is here: https://docs.python.org/3/library/re.html
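The same pattern can be written more compactly with re.findall, which returns the captured group directly when the pattern has exactly one group. This is only a sketch; the helper name extract_object_ids is made up for illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
OBJECT_ID_RE = re.compile(r"ObjectId\('(\w+)'\)")

def extract_object_ids(text):
    """Return the ids found in a string such as "[ObjectId('...'), ...]"."""
    # With a single capturing group, findall returns a list of that group.
    return OBJECT_ID_RE.findall(text)

test_str = "[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"
ids = extract_object_ids(test_str)
print(ids)
```

For a dataframe column, a helper like this could presumably be applied row by row with Series.apply, for instance my_df["my_column"].apply(extract_object_ids), where the dataframe and column names are placeholders.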
You can also use str.replace, which strips the ObjectId( and ) wrappers but still leaves you with a string:
a = "[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"
a.replace('ObjectId(', '').replace(")","")
#Output:
"['5d28938629fe749c7c12b6e3', '5caf4522a30528e3458b4579']"
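Since the result of replace is still a string, one way to finish the conversion is to pass the cleaned string to ast.literal_eval, which can parse it once the ObjectId wrappers are gone. A small sketch, assuming the ids themselves never contain parentheses:

```python
import ast

a = "[ObjectId('5d28938629fe749c7c12b6e3'), ObjectId('5caf4522a30528e3458b4579')]"

# Strip the wrappers; what remains is a plain list literal in string form.
cleaned = a.replace("ObjectId(", "").replace(")", "")

# literal_eval safely evaluates the list literal into a real Python list.
ids = ast.literal_eval(cleaned)
print(ids)
```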
Locate the rows, split on the single quote ('), and select items 1 and 3 of the resulting list (the [0] picks the row labelled 0):
my_df.loc[my_df["my_column"].str.contains("ObjectId"),"my_column"].str.split("'")[0][1:4:2]
This gives a list of exactly two elements:
['5d28938629fe749c7c12b6e3', '5caf4522a30528e3458b4579']
I'm using Python 3.7. I want to extract the value between "q=" and the following "&" in a URL's query string. I have this code:
href = span.a['href']
print("href:" + href)
matchObj = re.match( r'q=(.*?)\&', href, re.M|re.I)
if matchObj:
    criteria = matchObj.group(1)
but despite the fact that my href is this
href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB
matchObj is always None and the subsequent lines aren't evaluated. What do I need to do to fix my regex?
You can use the urllib.parse module.
Example:
import urllib.parse as urlparse
url = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"
data = urlparse.urlparse(url)
print(urlparse.parse_qs(data.query)['q'][0])
Output:
bet i won t get one share
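The same idea can be wrapped in a small helper that also copes with a missing parameter. This is a sketch under assumptions: the helper name get_query_param is hypothetical, and the href is a shortened version of the one in the question.

```python
from urllib.parse import urlsplit, parse_qs

def get_query_param(href, name, default=None):
    """Return the first value of query parameter `name`, or `default`."""
    query = urlsplit(href).query
    # parse_qs maps each key to a list of values and decodes '+' as a space.
    values = parse_qs(query).get(name)
    return values[0] if values else default

# Shortened version of the href from the question.
href = "/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch"
criteria = get_query_param(href, "q")
print(criteria)
```

Unlike indexing with ['q'], the .get call returns the default instead of raising KeyError when the parameter is absent.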
You're using the wrong function if you want to match in the middle of the string.
re.match only matches at the start of the string. From the documentation:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Use re.search here instead.
import re
href = 'href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB'
print("href:" + href)
matchObj = re.search( r'q=(.*?)\&', href, re.M|re.I)
if matchObj:
    criteria = matchObj.group(1)
    print(criteria)
Output:
bet+i+won+t+get+one+share
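The difference between the two functions can be seen side by side on a shortened version of the href (an assumption made here to keep the example readable):

```python
import re

# Shortened version of the href from the question.
href = "/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch"
pattern = r"q=(.*?)&"

# re.match anchors at position 0, where the string starts with "/search",
# so there is nothing to match and it returns None.
anchored = re.match(pattern, href)

# re.search scans forward and finds "q=" in the middle of the string.
found = re.search(pattern, href)
criteria = found.group(1) if found else None

print(anchored)
print(criteria)
```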
Here, we can use a simple expression with left and right boundaries, such as:
&q=(.+?)&
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"&q=(.+?)&"
test_str = "href:/search?hl=en-US&q=bet+i+won+t+get+one+share&tbm=isch&tbs=simg:CAQSkwEJyapBtj9kKiIahwELEKjU2AQaAAwLELCMpwgaYgpgCAMSKMILxAufFcsLnBWeFZsVnRWABMcPsCKgLaMtoi2hLZ0tqziiI6w4uSQaMG01mL5LQ62s4q5ZMf-Wetz68lCkHfrFOOKs2CELzQJlPjHIMzmlp2Ny-a5t7hZbiCAEDAsQjq7-CBoKCggIARIEXLNODAw&sa=X&ved=0ahUKEwjThcCx59ziAhWKHLkGHfWjDs4Q2A4ILCgB"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
lrgstPlace = features[0]
strLrgstPlace = str(lrgstPlace)
longtide = re.match("r(lat=)([\-\d\.]*)",strLrgstPlace)
print (longtide)
This is what my features list looks like:
Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3)
Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)
Why can't the regex match anything? It just gives me None as a result.
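I think you meant to put the r outside the quotes:
r"(lat=)([\-\d\.]*)"
Note that re.match only matches at the start of the string, and lat= appears in the middle of it, so you will also need re.search instead of re.match.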
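Putting both points together, a minimal sketch might look like the following. The number pattern used here is a slightly stricter variant chosen for illustration, not the original expression, and the feature is represented by its string form as in the question.

```python
import re

# String form of one feature, as shown in the question.
features_text = ("Feature(place='28km S of Cliza, Bolivia', long=-65.8913, "
                 "lat=-17.8571, depth=358.34, mag=6.3)")

# Raw-string prefix sits outside the quotes; the group captures an optional
# minus sign, digits, and an optional fractional part.
lat_pattern = r"lat=(-?\d+(?:\.\d+)?)"

# re.match anchors at the start ("Feature(...") and therefore finds nothing.
anchored = re.match(lat_pattern, features_text)

# re.search scans the whole string and finds lat= in the middle.
found = re.search(lat_pattern, features_text)
latitude = float(found.group(1)) if found else None

print(anchored)
print(latitude)
```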
Your expression itself is fine once the r prefix is outside the quotes; we can modify it slightly if we only want to extract the lat numbers:
(?:lat=)([0-9\.\-]+)(?:,)
where ([0-9\.\-]+) would capture our desired lat, and we wrap it with two non-capturing groups:
(?:lat=)
(?:,)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:lat=)([0-9\.\-]+)(?:,)"
test_str = "Feature(place='28km S of Cliza, Bolivia', long=-65.8913, lat=-17.8571, depth=358.34, mag=6.3) Feature(place='12km SSE of Volcano, Hawaii', long=-155.2005, lat=19.3258333, depth=6.97, mag=5.54)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.
The Problems with What You've Tried
There are a few problems with what you've tried. [^(PW\[)] is a character class, so it matches one character that is not (, P, W, [ or ), rather than skipping the literal prefix PW[:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
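Both cases can be tried in a few lines. The first pattern below is an alternative without lookarounds (a capturing group between literal brackets), shown here as one possible simplification; the second is the semicolon regex from above, run on a made-up sample string.

```python
import re

# Alternative without lookarounds: match the brackets literally and
# capture only what is between them.
name_match = re.search(r"\[([^\]]+)\]", "PW[Yasui Chitetsu];")
name = name_match.group(1) if name_match else None

# Lookarounds do not consume the delimiters, so a semicolon can close one
# match and open the next. The sample string is hypothetical.
between = re.findall(r"(?<=;)[^;]+(?=;)", "a;b;c;d")

print(name)
print(between)
```

Because the delimiters are not consumed, adjacent fields are all found; the first and last fields are skipped since they are not enclosed by semicolons on both sides.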
Given a name string, I want to validate a few basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and is of length < 40.
I know the latter can be accomplished via regex, but is there a Unicode way to accomplish the first? Are there any text processing libraries I can leverage?
You should be able to check this using the Unicode Character classes in regex.
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of invisible character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
You may want to add more like \p{Sc} to match currency symbols, etc.
But to be able to take advantage of this, you need to use the regex module (an alternative to the standard re module) that supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re
regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! 🕺🏻 \nWow cool song! 🕺🏻Wow cool song! 🕺🏻Wow cool song! 🕺🏻\n")
matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
PS: .NET Regex gives you some more options like \p{IsGreek}.
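If installing the regex module is not an option, a rough approximation of the two conditions in the question is possible with the standard library's unicodedata module. This is only a sketch under assumptions: the function name, the set of extra allowed characters, and the decision to accept any letter or combining mark are all choices made here for illustration. It does not check that the letters belong to one particular script.

```python
import unicodedata

# Hypothetical set of non-letter characters tolerated in a name.
ALLOWED_EXTRAS = {" ", "-", "'", "."}

def is_plausible_name(name, max_len=40):
    """Return True if `name` is non-empty, shorter than `max_len`, and
    made only of letters, combining marks, and a few separators."""
    if not name or len(name) >= max_len:
        return False
    for ch in name:
        if ch in ALLOWED_EXTRAS:
            continue
        # General categories starting with 'L' are letters and 'M' are
        # combining marks; digits ('Nd') and emoji ('So') are rejected.
        if not unicodedata.category(ch).startswith(("L", "M")):
            return False
    return True

print(is_plausible_name("Yasui Chitetsu"))
print(is_plausible_name("John3"))
```

Note that len counts code points rather than user-perceived characters, so names with many combining marks reach the limit sooner.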