Get javascript variable with python - python

Banging my head here..
I am trying to parse the HTML source for the entire contents of the JavaScript variable 'ListData' with a regex; the contents start with the declaration var ListData = and end with };.
I found a solution which is similar:
Fetch data of variables inside script tag in Python or Content added from js
But I am unable to get it to match the entire regex.
Code:
# Need the ListData object
pat = re.compile('var ListData = (.*?);')
string = """QuickLaunchMenu == null) QuickLaunchMenu = $create(UI.AspMenu,
null, null, null, $get('QuickLaunchMenu')); } ExecuteOrDelayUntilScriptLoaded(QuickLaunchMenu, 'Core.js');
var ListData = { "Row" :
[{
"ID": "159",
"PermMask": "0x1b03cc312ef",
"FSObjType": "0",
"ContentType": "Item"
};
moretext;
moretext"""
#Returns NoneType instead of match object
print(type(pat.search(string)))
Not sure what is going wrong here. Any help would be appreciated.

In your regex, (.*?); part matches any 0+ chars other than line break chars up to the first ;. If there is no ; on the line, you will have no match.
Based on the fact that your expected match ends with the first }; at the end of a line, you may use
'(?sm)var ListData = (.*?)};$'
Here,
(?sm) - enables re.S (it makes . match any char, including line breaks) and re.M (it makes ^ and $ match at the start and end of each line, not just of the whole string) modes
var ListData = - a literal substring
(.*?) - Group 1: any 0+ chars, as few as possible, up to the first...
};$ - }; at the end of a line
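As a quick check, the pattern above can be run against a shortened version of the string from the question. This is only a sketch: the sample text is abbreviated, and the names html, match and list_data are chosen here for illustration. Note that group 1 stops just before the final }; so the captured text does not include that closing brace.

```python
import re

# Abbreviated sample based on the string in the question
html = """ExecuteOrDelayUntilScriptLoaded(QuickLaunchMenu, 'Core.js');
var ListData = { "Row" :
[{
"ID": "159",
"ContentType": "Item"
};
moretext;
moretext"""

# (?s) lets . cross line breaks; (?m) lets $ match at the end of each line
pat = re.compile(r'(?sm)var ListData = (.*?)};$')
match = pat.search(html)
list_data = match.group(1) if match else None
print(list_data)
```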

Related

Extract email addresses from academic curly braces format

I have a file where each line contains a string that represents one or more email addresses.
Multiple addresses can be grouped inside curly braces as follows:
{name.surname, name2.surname2}#something.edu
Which means both addresses name.surname#something.edu and name2.surname2#something.edu are valid (this format is commonly used in scientific papers).
Moreover, a single line can also contain curly brackets multiple times. Example:
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
results in:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
Any suggestion on how I can parse this format to extract all email addresses? I'm trying with regexes but I'm currently struggling.
Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).
pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:
import pyparsing as pp
LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}#')
# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '#'
                    + email_part('trailing'))
# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
    return ["{}#{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)
# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '#' + email_part)
# the complete list parser looks for a comma-delimited list of compressed
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)
pyparsing parsers come with a runTests method to test your parser against various test strings:
tests = """\
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
# just a plain email
plain_old_bob#uni.elsewhere
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
"""
email_list_parser.runTests(tests)
Prints:
# original test string
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com']
# a tricky email containing a quoted string
{x.y, z.k}#edu.com, "{a, b}"#domain.com
['x.y#edu.com', 'z.k#edu.com', '"{a, b}"#domain.com']
# just a plain email
plain_old_bob#uni.elsewhere
['plain_old_bob#uni.elsewhere']
# mixed list of plain and compressed emails
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, plain_old_bob#uni.elsewhere
['a.b#uni.somewhere', 'c.d#uni.somewhere', 'e.f#uni.somewhere', 'x.y#edu.com', 'z.k#edu.com', 'plain_old_bob#uni.elsewhere']
DISCLOSURE: I am the author of pyparsing.
Note
I'm more familiar with JavaScript than Python, and the basic logic is the same regardless (the difference is syntax), so I've written my solutions here in JavaScript. Feel free to translate to Python.
The Issue
This question is a bit more involved than a simple one-line script or regular expression, but depending on the specific requirements you may be able to get away with something rudimentary.
For starters, parsing an e-mail is not trivially boiled down to a single regular expression. This website has several examples of regular expressions that will match "many" e-mails, but explains the trade-offs (complexity versus accuracy) and goes on to include the RFC 5322 standard regular expression that should theoretically match any e-mail, followed by a paragraph on why you shouldn't use it. However, even that regular expression assumes that a domain name taking the form of an IP address can only consist of a tuple of four integers ranging from 0 to 255; it doesn't allow for IPv6.
Even something as simple as:
{a, b}#domain.com
Could get tripped up because technically according to the e-mail address specification an e-mail address can contain ANY ASCII characters surrounded by quotes. The following is a valid (single) e-mail address:
"{a, b}"#domain.com
To accurately parse an e-mail would require that you read the characters one letter at a time and build a finite state machine to track whether you are within a double-quote, within a curly brace, before the #, after the #, parsing a domain name, parsing an IP, etc. In this way you could tokenize the address, locate your curly brace token, and parse it independently.
Something Rudimentary
Regular expressions are not the way to go for 100% accuracy and support for all e-mails, *especially* if you want to support more than one e-mail on a single line. But we'll start with them and try to build from there.
You've probably tried a regular expression like:
/\{(([^,]+),?)+\}\#(\w+\.)+[A-Za-z]+/
Match a single opening curly brace...
Followed by one or more instances of:
    One or more non-comma characters...
    Followed by zero or one commas
Followed by a single closing curly brace...
Followed by a single #
Followed by one or more instances of:
    One or more "word" characters...
    Followed by a single .
Followed by one or more alpha characters
This should match something roughly of the form:
{one, two}#domain1.domain2.toplevel
This handles validation; next is the issue of extracting all valid e-mails. Note that we have two nested sets of parentheses in the name portion of the e-mail address: (([^,]+),?). This causes a problem for us. In most regular expression engines, a repeated capture group keeps only its last match. Consider what happens when I run this in JavaScript using my Chrome developer console:
var regex = /\{(([^,]+),?)+\}\#(\w+\.)+[A-Za-z]+/
var matches = "{one, two}#domain.com".match(regex)
Array(4) [ "{one, two}#domain.com", " two", " two", "domain." ]
Well that wasn't right. It found two twice, but didn't find one once! To fix this, we need to eliminate the nesting and do this in two steps.
var regexOne = /\{([^}]+)\}\#(\w+\.)+[A-Za-z]+/
var matches = "{one, two}#domain.com".match(regexOne)
Array(3) [ "{one, two}#domain.com", "one, two", "domain." ]
Now we can use the match and parse that separately:
// Note: It's important that this be a global regex (the /g modifier) since we expect the pattern to match multiple times
var regexTwo = /([^,]+,?)/g
var nameMatches = matches[1].match(regexTwo)
Array(2) [ "one,", " two" ]
Now we can trim these and get our names:
nameMatches = nameMatches.map(name => name.replace(/[, ]/g, ""))
nameMatches
Array(2) [ "one", "two" ]
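For readers working in Python, the same two-step idea might look like the sketch below. It is not a translation of the exact regexes above: the function name expand_braced is made up for illustration, the domain is captured with a non-capturing repeat so the whole domain survives, and the names are split with str.split instead of a second regex. It assumes a single braced group per input and no quoted names.

```python
import re

def expand_braced(address):
    """Expand '{one, two}#domain.com' into one address per name."""
    # Step 1: capture everything between the braces, plus the whole domain
    match = re.search(r"\{([^}]+)\}#((?:\w+\.)+[A-Za-z]+)", address)
    if match is None:
        return []
    # Step 2: split the captured names on commas and trim whitespace
    names = [name.strip() for name in match.group(1).split(",")]
    domain = match.group(2)
    return [f"{name}#{domain}" for name in names]

print(expand_braced("{one, two}#domain.com"))
```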
For constructing the "domain" part of the e-mail, we'll need similar logic for everything after the #, since this has a potential for repeats the same way the name part had a potential for repeats. Our final code (in JavaScript) may look something like this (you'll have to convert to Python yourself):
function getEmails(input)
{
    var emailRegex = /([^#]+)\#(.+)/;
    var emailParts = input.match(emailRegex);
    var name = emailParts[1];
    var domain = emailParts[2];
    var nameList;
    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name];
    }
    return nameList.map(name => `${name}#${domain}`);
}
Multi-email Lines
This is where things start to get tricky, and we need to accept a little less accuracy if we don't want to build a full on lexer / tokenizer. Because our e-mails contain commas (within the name field) we can't accurately split on commas -- unless those commas aren't within curly braces. With my knowledge of regular expressions, I don't know if this can be easily done. It may be possible with lookahead or lookbehind operators, but someone else will have to fill me in on that.
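On the lookahead question: one possible approach, sketched here in Python, is a negative lookahead that refuses to split on a comma when a closing brace follows before any other brace. The function name split_outside_braces is hypothetical, and the sketch assumes braces are balanced, never nested, and never appear inside quoted names.

```python
import re

def split_outside_braces(line):
    # Split on a comma only if it is NOT followed by non-brace characters
    # and then a "}", i.e. only if the comma sits outside a {...} group.
    return re.split(r",\s*(?![^{}]*\})", line)

print(split_outside_braces("{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com"))
```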
What can be easily done with regular expressions, however, is finding a block of text that starts at a # and runs up to the next comma. Something like: #[^#{]+?,
In the string a#b.com, c#d.com this would match the entire phrase #b.com, - but the important thing is that it gives us a place to split our string. The tricky bit is then finding out how to split your string here. Something along the lines of this will work most of the time:
var emails = "a#b.com, c#d.com"
var matches = emails.match(/#[^#{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(2) [ "a", " c#d.com" ]
split[0] = split[0] + matches[0] // Add back in what we split on
This has a potential bug should you have two e-mails in the list with the same domain:
var emails = "a#b.com, c#b.com, d#e.com"
var matches = emails.match(/#[^#{]+?,/g)
var split = emails.split(matches[0])
console.log(split) // Array(3) [ "a", " c", " d#e.com" ]
split[0] = split[0] + matches[0]
console.log(split) // Array(3) [ "a#b.com", " c", " d#e.com" ]
But again, without building a lexer / tokenizer we're accepting that our solution will only work for most cases and not all.
However since the task of splitting one line into multiple e-mails is easier than diving into the e-mail, extracting a name, and parsing the name: we may be able to write a really stupid lexer for just this part:
var inBrackets = false
var emails = "{a, b}#c.com, d#e.com"
var split = []
var lastSplit = 0
for (var i = 0; i < emails.length; i++)
{
    if (inBrackets && emails[i] === "}")
        inBrackets = false;
    if (!inBrackets && emails[i] === "{")
        inBrackets = true;
    if (!inBrackets && emails[i] === ",")
    {
        split.push(emails.substring(lastSplit, i))
        lastSplit = i + 1 // Skip the comma
    }
}
split.push(emails.substring(lastSplit))
console.log(split)
Once again, this won't be a perfect solution because an e-mail address may exist like the following:
","#domain.com
But, for 99% of use cases, this simple lexer will suffice and we can now build a "usually works but not perfect" solution like the following:
function getEmails(input)
{
    var emailRegex = /([^#]+)\#(.+)/;
    var emailParts = input.match(emailRegex);
    var name = emailParts[1];
    var domain = emailParts[2];
    var nameList;
    if (/\{.+\}/.test(name))
    {
        // The name takes the form "{...}"
        var nameRegex = /([^,]+,?)/g;
        var nameParts = name.match(nameRegex);
        nameList = nameParts.map(name => name.replace(/\{|\}|,| /g, ""));
    }
    else
    {
        // The name is not surrounded by curly braces
        nameList = [name];
    }
    return nameList.map(name => `${name}#${domain}`);
}
function splitLine(line)
{
    var inBrackets = false;
    var split = [];
    var lastSplit = 0;
    for (var i = 0; i < line.length; i++)
    {
        if (inBrackets && line[i] === "}")
            inBrackets = false;
        if (!inBrackets && line[i] === "{")
            inBrackets = true;
        if (!inBrackets && line[i] === ",")
        {
            split.push(line.substring(lastSplit, i));
            lastSplit = i + 1;
        }
    }
    split.push(line.substring(lastSplit));
    return split;
}
var line = "{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com";
var emails = splitLine(line);
var finalList = [];
for (var i = 0; i < emails.length; i++)
{
    finalList = finalList.concat(getEmails(emails[i]));
}
console.log(finalList);
// Outputs: [ "a.b#uni.somewhere", "c.d#uni.somewhere", "e.f#uni.somewhere", "x.y#edu.com", "z.k#edu.com" ]
If you want to try and implement the full lexer / tokenizer solution, you can look at the simple / dumb lexer I built as a starting point. The general idea is that you have a state machine (in my case I only had two states: inBrackets and !inBrackets) and you read one letter at a time but interpret it differently based on your current state.
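Since the question was asked for Python, here is one possible translation of the two-state approach above. It is a loose sketch and not a line-by-line port: the names split_line, get_emails and expand_line are chosen here, str.partition and str.split replace the regexes, and it shares the same limitation as the JavaScript version (quoted names containing commas or braces are not handled).

```python
def split_line(line):
    """Split a line on commas that are not inside curly braces."""
    parts, in_braces, start = [], False, 0
    for i, ch in enumerate(line):
        if ch == "{":
            in_braces = True
        elif ch == "}":
            in_braces = False
        elif ch == "," and not in_braces:
            parts.append(line[start:i])
            start = i + 1  # skip the comma
    parts.append(line[start:])
    # Drop surrounding whitespace and any empty pieces
    return [p.strip() for p in parts if p.strip()]

def get_emails(entry):
    """Expand '{a, b}#dom' into ['a#dom', 'b#dom']; plain addresses pass through."""
    name, _, domain = entry.partition("#")
    if name.startswith("{") and name.endswith("}"):
        names = [n.strip() for n in name[1:-1].split(",")]
    else:
        names = [name]
    return [f"{n}#{domain}" for n in names]

def expand_line(line):
    return [email for entry in split_line(line) for email in get_emails(entry)]

print(expand_line("{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com"))
```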
a quick solution using re:
test with one text line:
import re
line = '{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, {z.z, z.a}#edu.com'
com = re.findall(r'(#[^,\n]+),?', line) #trap #xx.yyy
adrs = re.findall(r'{([^}]+)}', line) #trap all inside { }
result=[]
for i in range(len(adrs)):
    s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
    result=result+s.split(',')
for r in result:
    print(r)
output in list result:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
z.z#edu.com
z.a#edu.com
test with a text file:
import io
data = io.StringIO(u'''\
{a.b, c.d, e.f}#uni.somewhere, {x.y, z.k}#edu.com, {z.z, z.a}#edu.com
{a.b, c.d, e.f}#uni.anywhere
{x.y, z.k}#adi.com, {z.z, z.a}#du.com
''')
result=[]
import re
for line in data:
    com = re.findall(r'(#[^,\n]+),?', line)
    adrs = re.findall(r'{([^}]+)}', line)
    for i in range(len(adrs)):
        s = re.sub(r',\s*', com[i] + ',', adrs[i]) + com[i]
        result = result + s.split(',')
for r in result:
    print(r)
output in list result:
a.b#uni.somewhere
c.d#uni.somewhere
e.f#uni.somewhere
x.y#edu.com
z.k#edu.com
z.z#edu.com
z.a#edu.com
a.b#uni.anywhere
c.d#uni.anywhere
e.f#uni.anywhere
x.y#adi.com
z.k#adi.com
z.z#du.com
z.a#du.com

Python Regex returning a list of all occurrences of block data

I needed some help regarding finding a block of text from a text file.
The text file is a structured one.
From the file, I want to extract blocks of data that start with a given string and end with a } (a closing curly bracket with no whitespace, followed by \r\n).
Example -:
ABCD = XYZAHFJKBKFF
{
DATAFIELD1 = "TYPE1"
{
VALUE = 1
VALUE = 2
VALUE = 3
}
DATAFIELD1 = "TYPE2"
{
VALUE = 5
VALUE = 6
VALUE = 7
}
}
pattern = re.compile(r"ABCD.*}",re.DOTALL)
fafs = re.findall(pattern, data)
This one does give me the result, but not as a list even if I use a for loop like
for letters in re.findall(pattern, data):
    print(letters)
What I want to get is a list of all the blocks of data between the "ABCD" and "}".
There can be many occurrences and I want to get all of them in an iterable format or as a list.
Can someone please help me with this?
Here, try this, it does what it sounds like you want:
txt = """ABCD = XYZAHFJKBKFF
{
sdfsd
sd
fsd
fsd
fsd
fsd
fsd
fsd
(This can be anything and including most common characters)
(This may Include Curly, Round brackets as well)
}"""
import re
pattern = re.compile(r".*")
_fafs = re.findall(pattern, txt[txt.index("ABCD")+4:txt.rindex("}")])
fafs = [faf for faf in _fafs if faf != ""]
for letters in fafs:
    print(letters)
The fafs = [faf for faf in _fafs if faf != ""] line is to remove the empty string items that appeared. Also, if you want to strip whitespace from the chunks of data, then replace that with fafs = [faf.strip(" \t\n\r") for faf in _fafs if faf != ""], and add any other whitespace characters (besides spaces, tabs, and two different newlines) into the str that is the argument to strip.
Oh, and replace txt's string literal with a call to whatever will procure the data you wish to parse. And if you want a starting flag other than "ABCD", then replace txt.index("ABCD")+4 with txt.index(flag)+len(flag)
And alternatively to _fafs = re.findall(pattern, txt[txt.index("ABCD")+4:txt.rindex("}")]):
ind = txt.index(flag)+len(flag)
_fafs = re.findall(pattern, txt[ind:txt.index("}", ind)])
BUT that'll stop at the first close brace after the starting flag.
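If the file holds several such blocks and the blocks contain nested braces, as in the question's example, neither rindex nor the first closing brace gives the right end point. One way around that is to count braces. The following is only a sketch under stated assumptions: the function name extract_blocks is made up, every block is assumed to start at the flag text, and braces inside quoted values are not treated specially.

```python
def extract_blocks(data, flag="ABCD"):
    """Return each block from `flag` to the brace closing its first '{'."""
    blocks = []
    pos = data.find(flag)
    while pos != -1:
        depth = 0
        end = None
        for i in range(pos, len(data)):
            ch = data[i]
            if ch == "{":
                depth += 1
            elif ch == "}" and depth > 0:
                depth -= 1
                if depth == 0:       # outermost brace of this block closed
                    end = i + 1
                    break
        if end is None:              # unbalanced braces: stop rather than guess
            break
        blocks.append(data[pos:end])
        pos = data.find(flag, end)   # look for the next block after this one
    return blocks

sample = 'ABCD = X\n{\nDATAFIELD1 = "TYPE1"\n{\nVALUE = 1\n}\n}\nABCD = Y\n{\nVALUE = 2\n}\n'
for block in extract_blocks(sample):
    print(block)
    print("---")
```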

Python: Can't turn string into JSON

For the past few hours, I've been fighting to get a string into a JSON dict. I've tried everything from json.loads(... which throws an error:
requestInformation = json.loads(entry["request"]["postData"]["text"])
//throws this error
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
to stripping out the slashes using a medley of re.sub('\\','',mystring) ,mystring.sub(... to no effect. My problem string looks like so
'{items:[{n:\\'PackageChannel.GetUnitsInConfigurationForUnitType\\',ps:[{n:\\'unitType\\',v:"ActionTemplate"}]}]}'
The origin of this string is that it's a HAR dump from Google Chrome. I think those backslashes are from it being escaped somewhere along the way because the bulk of the HAR file doesn't contain them, but they do appear commonly in any field labeled "text".
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
EDIT I eventually gave up on turning the text above into JSON and instead opted for regex. Sometimes the slashes showed up, sometimes they didn't based on what I was viewing the text in and that made it difficult to work with.
the json module wants a string where the keys are also wrapped in double quotes
so the string below would work:
mystring = '{"items":[{"n":"PackageChannel.GetUnitsInConfigurationForUnitType", "ps":[{"n":"unitType","v":"ActionTemplate"}]}]}'
myjson = json.loads(mystring)
This function should remove the double backslashes and put double quotes around your keys.
import json, re
def make_jsonable(mystring):
    # we'll use this regex to find any key that doesn't contain any of: {}[]'",
    key_regex = r'([,\[{](\s+)?[^"{},\[\]]+(\s+)?:)'
    mystring = re.sub(r"\\", "", mystring) # remove any backslashes
    mystring = re.sub("'", '"', mystring) # replace single quotes with doubles
    match = re.search(key_regex, mystring)
    while match:
        start_index = match.start(0)
        end_index = match.end(0)
        print(mystring[start_index+1:end_index-1].strip())
        mystring = '%s"%s"%s'%(mystring[:start_index+1], mystring[start_index+1:end_index-1].strip(), mystring[end_index-1:])
        match = re.search(key_regex, mystring)
    return mystring
I couldn't directly test it on the first string you wrote, the double/single quotes don't match up, but on the one in the last code sample it works.
You'll need an r prefix before the JSON string literal (a raw string), or you'll have to replace every \ with \\
This works:
import json
validasst_json = r'''{
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
}'''
txt = json.loads(validasst_json)
print(txt["postData"]['mimeType'])
print(txt["postData"]['text'])
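Once the outer HAR JSON is decoded, the inner "text" value is still not valid JSON because its keys are unquoted and some strings use single quotes. A rough sketch of one way to repair it follows; the function name loosely_quoted_to_dict is hypothetical, and it assumes keys are plain identifiers and that no string value contains an apostrophe or something that looks like ",key:".

```python
import json
import re

def loosely_quoted_to_dict(text):
    # ASSUMPTION: no value contains an apostrophe, so this swap is safe
    text = text.replace("'", '"')
    # Quote bare identifiers that follow "{", "[" or "," and precede ":"
    text = re.sub(r'([{\[,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    return json.loads(text)

# Raw string so that \" reaches the JSON parser unchanged
har_entry = json.loads(r'''{
"postData": {
"mimeType": "application/json",
"text": "{items:[{n:'PackageChannel.GetUnitsInConfigurationForUnitType',ps:[{n:'unitType',v:\"Analysis\"}]}]}"
}
}''')
request = loosely_quoted_to_dict(har_entry["postData"]["text"])
print(request["items"][0]["n"])
```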

Regular expression in python

I am trying to match/sub the following line
line1 = '# Some text\n'
But avoid match/sub lines like this
'# Some text { .blah}\n'
So, in other words, a # followed by any number of words, spaces and numbers (no punctuation) and then the end of the line.
line2 = re.sub(r'# (\P+)$', r'# \1 { .text}', line1)
Puts the contents of line1 into line2 unchanged.
(I read somewhere that \P means everything except punctuation)
line2 = re.sub(r'# (\w*\d*\s*)+$', r'# \1 { .text}', line1)
Whereas the above gives
'# { .text}'
Any help is appreciated
Thanks
Tom
Your regex is a bit weird; expanded, it looks like
r"# ([a-zA-Z0-9_]*[0-9]*[ \t\n\r\f\v]*)+$"
Things to note:
It is not anchored to the beginning of the string, meaning it would match
print("Important stuff!") # Very important
The \d* is redundant, because it is already captured by \w*
Looking at your example, it seems you should be less worried about punctuation; the only thing you cannot have is a curly-brace ({).
Try
import re
def add_text(txt):
    # [^{\n] keeps each match on a single line and rejects lines containing {
    return re.sub(r"^#([^{\n]*)$", r"#\1 { .text }", txt, flags=re.M)
text = "# Some text\n# More text { .blah}\nprint('abc') # but not me!\n# And once again"
print("===before===")
print(text)
print("\n===after===")
print(add_text(text))
which gives
===before===
# Some text
# More text { .blah}
print('abc') # but not me!
# And once again
===after===
# Some text { .text }
# More text { .blah}
print('abc') # but not me!
# And once again { .text }
If you only want lines which start with a # and continue with alphanumeric values, spaces and _, you want this:
/^#[\w ]+$/gm
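The pattern above is written in JavaScript-style /.../gm notation. In Python the same idea could be sketched as follows, with re.M standing in for the m flag and re.sub replacing every match by default (the g flag). The function name add_class and the sample text are made up for illustration.

```python
import re

def add_class(text):
    # [\w ] does not match "\n", so each line is tested on its own
    return re.sub(r"^#([\w ]+)$", r"#\1 { .text}", text, flags=re.M)

sample = "# Some text\n# Some text { .blah}\n# Hello, world\n"
print(add_class(sample))
```

Only the first line gets the suffix here: the second contains braces and the third contains a comma, and neither is allowed by [\w ].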

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that sometimes a string within the result also contains commas, which renders the entire approach unusable.
My next attempt was using regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, UUIDs, (signed) integers and (signed) decimals.
Am I using a wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Python's built-in csv module should do a good job here. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can do ASSERTIONS, this will get you on the right track.
This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match. But your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, you have fixup stipulated in your example).
The two main parsing regexes are these:
strips the delimiter quotes too, and only $2 contains data; use in a while loop, global context
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
my preferred one, does not strip quotes, only captures $1, can use to capture in an array or in a while loop, global context
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)
use strict; use warnings;
my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';
my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;
my $rxExpanded = qr/
    (?!}$)           # ASSERT ahead: NOT a } plus end
    (?:^{?|,)        # Boundary: Start of string plus { OR comma
    \s*              # 0 or more whitespace
    ( ".*?" | .*?)   # Capture "Quoted" or non quoted data
    \s*              # 0 or more whitespace
    (?=,|}$)         # Boundary ASSERT ahead: Comma OR } plus end
/x;
my ($newstring, $sucess) = ('[', 0);
for my $field ($str =~ /$rx/g)
{
    my $tmp = $field;
    $sucess = 1;
    if ( $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
        $tmp = "'$tmp'";
    }
    $newstring .= "$tmp,";
}
if ( $sucess ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
It seemed that the CSV approach was the easiest to implement:
def parsePsycopgSQLArray(input):
    import csv
    import io
    input = input.strip("{")
    input = input.strip("}")
    buffer = io.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')
    return next(reader) #There can only be one row
if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print(result)
Thanks for the responses, they were most helpful!
Improved upon Dirk's answer. This handles escape characters better as well as the empty array case. One less strip call as well:
import csv
from io import StringIO
def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python
    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return next(reader)
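The csv-based functions above return every field as a string, while the question asked for integers and decimals as numbers. A possible extension is sketched below; the names parse_pg_array and _convert are made up here. It assumes a flat (non-nested) array, and it cannot tell a quoted "123" from an unquoted 123, so both come back as numbers.

```python
import csv
from io import StringIO

def _convert(field):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

def parse_pg_array(val):
    val = val.strip("{}")
    if not val:
        return []
    # skipinitialspace drops the blank that may follow a delimiter
    reader = csv.reader(StringIO(val), delimiter=",", quotechar='"',
                        escapechar="\\", skipinitialspace=True)
    return [_convert(field) for field in next(reader)]

text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
print(parse_pg_array(text))
```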
