what regular expression can extract data I need? - python

I have a string
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
I would like to extract the number 528341191030 between the first two \u sequences. I tried this:
m = re.search('\?id\u\d+d(\d+?)\u', url)
if m:
    print m.group(1)
But it doesn't work. What is wrong with my solution?

Are you sure you need regex?
Here is a solution using split:
url.split("\u")[1].split("d")[-1]
'528341191030'
As for what is wrong with your regex: "\" is a special character, so you have to write "\\" to match a literal backslash (that is, "\\\u" instead of "\u"):
m = re.search('\?id\\\u\d+d(\d+?)\\\u', url)
if m:
    print m.group(1)
Gives: 528341191030
Docs:
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
Or use raw string notation:
m = re.search(r"\?id\\u\d+d(\d+?)\\u", url)
if m:
    print m.group(1)
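In Python 3 a bare '\u' in an ordinary string literal starts a Unicode escape, so the snippets above would not compile there as written. A minimal sketch of the same idea for Python 3, assuming the text really contains literal backslash-u sequences (for example because it was read from a file); the helper name extract_id is made up for illustration:

```python
import re

# Raw string: the backslashes stay literal, as if the text had been read from a file.
url = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

def extract_id(text):
    """Return the id digits, or None if the pattern is absent."""
    # In a raw pattern, \\ matches exactly one literal backslash.
    m = re.search(r'\?id\\u003d(\d+)\\u', text)
    return m.group(1) if m else None

print(extract_id(url))
```

Using raw strings for both the data and the pattern keeps the number of backslashes on screen equal to the number the regex engine sees.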

Well, you could always try this (not super elegant, but it works):
first = url.find('\\u') + 2  # index just past the first "\u"
m = ""
for i in range(first, len(url)):
    if url[i:i + 2] == '\\u':  # stop at the next "\u"
        break
    m += url[i]
    if url[i] == 'd':  # drop the "003d" prefix collected so far
        m = ""
print(m)

A better way is to parse the URL and read the query-string values. In Python 3 the \u003d and \u0026 escapes in this string literal are decoded to = and &, so the standard URL parser can be applied directly:
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
import urllib.parse as urlparse
print(urlparse.parse_qs(urlparse.urlparse(url).query))
print(urlparse.parse_qs(urlparse.urlparse(url).query)['id'])
Output:
{'id': ['528341191030'], 'ns': ['1'], 'abbucket': ['0']}
['528341191030']
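If the string holds literal \u003d sequences rather than already-decoded characters (as it does in Python 2, or when the text comes from a file), the escapes could be decoded first and the result handed to the URL parser. A minimal Python 3 sketch, assuming the input is plain ASCII; the function name query_values is hypothetical:

```python
from urllib.parse import urlparse, parse_qs

# Raw string: simulates text that still contains the six-character escapes.
raw = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

def query_values(text):
    """Decode literal \\uXXXX escapes, then parse the query string."""
    # The unicode_escape codec turns \u003d into '=' and \u0026 into '&'.
    decoded = text.encode('ascii').decode('unicode_escape')
    return parse_qs(urlparse(decoded).query)

print(query_values(raw)['id'])
```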

Related

Split by comma and how to exclude comma from quotes in split

python 2.7 code
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = cStr.split(',')
print newStr # -> ['"aaaa"','"bbbb"','"ccc','ddd"' ]
But I want this result:
result = ['"aaaa"','"bbbb"','"ccc,ddd"']
The solution using re.split() function:
import re
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = re.split(r',(?=")', cStr)
print newStr
The output:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
,(?=") - positive lookahead assertion; it ensures that the delimiter , is followed by a double quote "
Try the csv module:
import csv
cStr = '"aaaa","bbbb","ccc,ddd"'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([cStr], delimiter=',', quotechar='"'))[0] ]
print newStr
Check Python parse CSV ignoring comma with double-quotes
Using a regex, try this:
import re
cStr = '"aaaa","bbbb","ccc,ddd"'
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
split_result = COMMA_MATCHER.split(cStr)
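The lookahead in that pattern accepts a comma only when an even number of quote characters follows it up to the end of the string, which means the comma sits outside any quoted field. A small sketch of that behaviour on the question's input; the helper name split_outside_quotes is invented here, and balanced quotes are assumed:

```python
import re

# A comma qualifies only if the rest of the string holds an even number of quotes.
COMMA_OUTSIDE_QUOTES = re.compile(r""",(?=(?:[^"']*["'][^"']*["'])*[^"']*$)""")

def split_outside_quotes(text):
    """Split on commas that are not enclosed in single or double quotes."""
    return COMMA_OUTSIDE_QUOTES.split(text)

print(split_outside_quotes('"aaaa","bbbb","ccc,ddd"'))
```

The comma inside "ccc,ddd" is followed by a single quote character, so the lookahead fails there and the field stays in one piece.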
pyparsing has a builtin expression, commaSeparatedList:
cStr = '"aaaa","bbbb","ccc,ddd"'
import pyparsing as pp
print(pp.commaSeparatedList.parseString(cStr).asList())
prints:
['"aaaa"', '"bbbb"', '"ccc,ddd"']
You can also add a parse-time action to strip those double-quotes (since you probably just want the content, not the quotation marks too):
csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(cStr).asList())
gives:
['aaaa', 'bbbb', 'ccc,ddd']
A regex works better in this case.
re.findall('".*?"', cStr) returns exactly what you need.
The asterisk is a greedy quantifier: if you used '".*"', it would return the longest possible match, i.e. everything between the very first and the very last double quote. The question mark makes it non-greedy, so '".*?"' returns the shortest possible match.
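The difference between the greedy and the non-greedy form can be seen side by side in a short sketch that uses the question's input:

```python
import re

cStr = '"aaaa","bbbb","ccc,ddd"'

# Greedy: .* runs to the last quote, so the whole string is one match.
greedy = re.findall(r'".*"', cStr)

# Non-greedy: .*? stops at the nearest closing quote, one match per field.
lazy = re.findall(r'".*?"', cStr)

print(greedy)
print(lazy)
```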
I liked Mark de Haan's solution but I had to rework it, as it removed the quote characters (although they were needed) and therefore an assertion in his example failed. I also added two additional parameters to deal with different separators and quote characters.
def tokenize(string, separator=',', quote='"'):
    """
    Split a comma separated string into a List of strings.
    Separator characters inside the quotes are ignored.
    :param string: A string to be split into chunks
    :param separator: A separator character
    :param quote: A character to define beginning and end of the quoted string
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list = []
    chunk = ''
    in_quotes = False
    for character in string:
        if character == separator and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        else:
            chunk += character
            if character == quote:
                in_quotes = False if in_quotes else True
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_tokenizer():
    string = '"aaaa","bbbb","ccc,ddd"'
    expected = ['"aaaa"', '"bbbb"', '"ccc,ddd"']
    actual = tokenize(string)
    assert expected == actual
It is always better to use existing libraries when you can, but I was struggling to get my specific use case to work with all the above answers, so I wrote my own for Python 3.9 (it will probably work back to 3.6, and removing the type hints will get you to 2.x compatibility).
from typing import List

def separate(string) -> List[str]:
    """
    Split a comma separated string into a List of strings.
    Resulting list elements are trimmed of double quotes.
    Commas inside double quotes are ignored.
    :param string: A string to be split into chunks
    :return: A list of strings, one element for every chunk
    """
    comma_separated_list: List[str] = []
    chunk: str = ''
    in_quotes: bool = False
    for character in string:
        if character == ',' and not in_quotes:
            comma_separated_list.append(chunk)
            chunk = ''
        elif character == '"':
            in_quotes = False if in_quotes else True
        else:
            chunk += character
    comma_separated_list.append(chunk)
    return comma_separated_list
And the tests...
def test_separator():
    string = '"aaaa","bbbb","ccc,ddd"'
    # separate() strips the double quotes from every chunk
    expected = ['aaaa', 'bbbb', 'ccc,ddd']
    actual = separate(string)
    assert expected == actual
You can first split the string on ", then filter out the empty strings and the ',' pieces, and finally put the quotes back. This may be the simplest way:
['"%s"' % s for s in cStr.split('"') if s and s != ',']
You need a parser. You can build your own, or you may be able to press one of the library ones into service. In this case, json could be (ab)used.
import json
cStr = '"aaaa","bbbb","ccc,ddd"'
jstr = '[' + cStr + ']'
result = json.loads(jstr)  # ['aaaa', 'bbbb', 'ccc,ddd']
result = ['"' + r + '"' for r in result]  # ['"aaaa"', '"bbbb"', '"ccc,ddd"']
This is not a standard module, you have to install it via pip, but as an option try tssplit:
In [3]: from tssplit import tssplit
In [4]: tssplit('"aaaa","bbbb","ccc,ddd"', quote='"', delimiter=',')
Out[4]: ['aaaa', 'bbbb', 'ccc,ddd']

In PyParsing, how to stop a Regex from consuming the entire string

I'm trying to write a function parse such that, for example,
assert parse("file://foo:bar.txt:r+") == ("foo:bar.txt", "r+")
The string consists of a fixed prefix file://, followed by a file name (which can consist of one or more of any character), followed by a colon and a string representing access flags.
Here is one implementation using regular expressions:
import re
def parse(string):
    SCHEME = r"file://" # File prefix
    PATH_PATTERN = r"(?P<path>.+)" # One or more of any character
    FLAGS_PATTERN = r"(?P<flags>[rwab+0-9]+)" # The letters r, w, a, b, a '+' symbol, or any digit
    FILE_RESOURCE_PATTERN = SCHEME + PATH_PATTERN + r":" + FLAGS_PATTERN + r"$" # The full pattern including the end of line character
    tokens = re.match(FILE_RESOURCE_PATTERN, string).groupdict()
    return tokens['path'], tokens['flags']
I would prefer to use PyParsing, however, because it typically gives more detailed error messages if the string doesn't match the expression (rather than re.match which simply returns None), and I would eventually like to make the flags optional.
Following Paul McGuire's answer in python regex in pyparsing, I made the following attempt:
from pyparsing import Word, alphas, nums, StringEnd, Regex, FollowedBy, Suppress, Literal
def parse(string):
    scheme = Literal("file://")
    path = Regex(".+")
    flags = Word(alphas + nums + "+")
    expression = Suppress(scheme) + (~(Suppress(":") + flags + StringEnd()) + path("path") + Suppress(":") + flags("flags") + StringEnd())
    tokens = expression.parseString(string)
    return tokens['path'], tokens['flags']
In the second part of the expression, I'm basically trying the negative lookahead (~suffix + path + suffix), where suffix is ":" + flags + StringEnd(). However, when trying to parse "file://foo:bar.txt:r+", I run into the following error:
pyparsing.ParseException: Expected ":" (at char 21), (line:1, col:22)
Since the string is 21 characters long, I take this to mean that the Regex has 'consumed' the entire string, so that the suffix can no longer be found.
How can I fix the parse method using pyparsing?
Try this:
import re
s = "file://foo:bar.txt:r+"
path, flag = re.match(r'.*//(.*):(.*)$', s).groups()
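The question also mentions wanting optional flags and something more informative than a bare None on failure. This is not a pyparsing solution, only a regex sketch of those two points under the question's own flag alphabet; the names FILE_RESOURCE and parse_resource are hypothetical:

```python
import re

# Lazy path, then an optional ":<flags>" suffix anchored at the end of the string.
FILE_RESOURCE = re.compile(r"file://(?P<path>.+?)(?::(?P<flags>[rwab+0-9]+))?$")

def parse_resource(string):
    """Return (path, flags); flags is None when no flag suffix is present."""
    match = FILE_RESOURCE.match(string)
    if match is None:
        # Raise a descriptive error instead of failing later on None.
        raise ValueError("expected 'file://<path>[:<flags>]', got %r" % (string,))
    return match.group("path"), match.group("flags")

print(parse_resource("file://foo:bar.txt:r+"))
print(parse_resource("file://foo.txt"))
```

Making the flags optional introduces an ambiguity that no parser can resolve on its own: an input such as "file://foo:b" is read as path "foo" with flags "b", because the trailing part happens to look like valid flags.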

The elegant way to replace specific characters in Python

I have strings that are unpredictable in terms of character content, but I know that every string contains exactly one character '*'.
How can I replace the two characters after the '*' with a string that is not hard-coded? The replacement is actually a calculated checksum converted into a string:
checksum_str = str(hex(csum).lstrip('0x'))
You want something like:
star_pos = my_string.find('*')
my_string = my_string[:star_pos] + '*' + checksum_str + my_string[star_pos + 3:]
You can do it with a regular expression:
import re
my_string = re.sub(r'(?<=\*)..', checksum_str, my_string, count=1)
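One detail worth checking in the question's snippet: lstrip('0x') strips every leading '0' and 'x' character, so a checksum of 0 becomes an empty string and a checksum of 5 becomes a single digit. The sketch below wraps both answers in functions and zero-pads the checksum instead; it assumes a one-byte checksum, and the function names and the sample string are made up for illustration:

```python
import re

def insert_checksum(text, csum):
    """Replace the two characters after the single '*' using slicing."""
    checksum_str = format(csum & 0xFF, '02x')  # assumption: one byte, two hex digits
    star_pos = text.index('*')                 # raises ValueError if '*' is missing
    return text[:star_pos + 1] + checksum_str + text[star_pos + 3:]

def insert_checksum_re(text, csum):
    """Same replacement using a regular expression with a lookbehind."""
    checksum_str = format(csum & 0xFF, '02x')
    return re.sub(r'(?<=\*)..', checksum_str, text, count=1)

print(insert_checksum('$GPGGA,123*00', 0x4F))
print(insert_checksum_re('a*xxb', 5))
```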

getting string between 2 characters in python

I need to get certain words out from a string in to a new format. For example, I call the function with the input:
text2function('$sin (x)$ is an function of x')
and I need to put them into a StringFunction:
StringFunction(function, independent_variables=[vari])
where I need to get just 'sin (x)' for function and 'x' for vari. So it would look like this finally:
StringFunction('sin (x)', independent_variables=['x'])
The problem is that I can't seem to obtain function and vari. I have tried:
start = string.index(start_marker) + len(start_marker)
end = string.index(end_marker, start)
return string[start:end]
and
r = re.compile('$()$')
m = r.search(string)
if m:
    lyrics = m.group(1)
and
send = re.findall('$([^"]*)$',string)
All of these seem to give me nothing. Am I doing something wrong? All help is appreciated. Thanks.
Tweeky way!
>>> char1 = '('
>>> char2 = ')'
>>> mystr = "mystring(123234sample)"
>>> print mystr[mystr.find(char1)+1 : mystr.find(char2)]
123234sample
$ is a special character in regex (it denotes the end of the string). You need to escape it:
>>> re.findall(r'\$(.*?)\$', '$sin (x)$ is an function of x')
['sin (x)']
If you want to cut a string between two identical characters (e.g. !234567890!), you can use:
line_word = line.split('!')
print (line_word[1])
You need to start searching for the second character beyond start:
end = string.index(end_marker, start + 1)
because otherwise it'll find the same character at the same location again:
>>> start_marker = end_marker = '$'
>>> string = '$sin (x)$ is an function of x'
>>> start = string.index(start_marker) + len(start_marker)
>>> end = string.index(end_marker, start + 1)
>>> string[start:end]
'sin (x)'
For your regular expressions, the $ character is interpreted as an anchor, not the literal character. Escape it to match the literal $ (and look for things that are not $ instead of not "):
send = re.findall(r'\$([^$]*)\$', string)
which gives:
>>> import re
>>> re.findall(r'\$([^$]*)\$', string)
['sin (x)']
The regular expression $()$ wouldn't match anything between the parentheses anyway, even if you did escape the $ characters, because the group itself is empty.
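The question asks for two values, the function text and its variable, while the answers above only recover the first. A minimal sketch that returns both, assuming the variable is the single name inside the parentheses; the function name split_function is hypothetical:

```python
import re

def split_function(text):
    """Return (function, variable) from text such as '$sin (x)$ is ...'."""
    # Text between the first pair of literal $ signs.
    found = re.search(r'\$(.*?)\$', text)
    if found is None:
        raise ValueError('no $...$ expression found')
    function = found.group(1)
    # Assumption: the independent variable is the name inside the parentheses.
    inner = re.search(r'\(\s*(\w+)\s*\)', function)
    variable = inner.group(1) if inner else None
    return function, variable

print(split_function('$sin (x)$ is an function of x'))
```

The two results could then be passed along as StringFunction(function, independent_variables=[variable]), as described in the question.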

Why is the return value of an empty python regexp search a match?

When passing an empty string to a regular expression object, the result of a search is a match object and not None. Shouldn't it be None, since there is nothing to match?
import re
m = re.search("", "some text")
if m is None:
    print "Returned None"
else:
    print "Return a match"
Incidentally, using the special symbols ^ and $ yields the same result.
An empty pattern matches at any position in the string.
Check this:
>>> import re
>>> re.search("", "ffff")
<_sre.SRE_Match object at 0xb7166410>
>>> re.search("", "ffff").start()
0
>>> re.search("$", "ffff").start()
4
Adding $ doesn't yield the same result. Match is at the end, because it is the only place it can be.
Look at it this way: Everything you searched for was matched, therefore the search was successful and you get a match object.
Rather than comparing m with None, you can simply test the truth value of m (a match object is always truthy, and None is falsy):
if m:
    print "Found a match"
else:
    print "No match"
Also, the empty pattern matches the empty string at every position in the string, not only at the start.
Those regular expressions are successfully matching 0 literal characters.
All strings can be thought of as containing an infinite number of empty strings between the characters:
'Foo' = '' + '' + ... + 'F' + '' + ... + '' + 'oo' + '' + ...
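That picture can be checked directly. A short sketch, using the 'Foo' example from above:

```python
import re

text = "Foo"

# One empty match before each character and one at the end: len(text) + 1 in total.
empty_matches = re.findall("", text)
print(len(empty_matches))

# search() reports the first of them, a zero-width match at position 0.
first = re.search("", text)
print(first.span(), repr(first.group()))

# "$" is also zero-width, but it can only match at the end of the string.
print(re.search("$", text).span())
```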
