How to parse a simple quoted string (handling escapes) - python

Note: this should be REALLY simple, I know. It's not. Or I'm dumb. But I tried hard.
What I want to do is simple. I have a string, there are strings inside it, separated with ,, quoted with '. I want to parse them. Consider the presence of \'s and \\s.
I want to do it in the simplest, most elegant and concise way, obviously.
Now, on to some failed tries:
"I know, I'll use json!" No. JSON uses ". Too bad.
Mmmh, a regex? This looks like looking for trouble, but... Oh God my eyes those regexes I got from the Internet! At least do they... Nope, no support of escapes.
shlex! The Python standard library always has a solution! See below my failed attempt.
Current status: sobbing, writing a parser.
Test input: 'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''
Test output: xx'x,x\x"xx\'\, 1, 2, xx'x'
def split(s):
    import shlex
    lex = shlex.shlex(s, posix=True)
    lex.whitespace = ','
    lex.whitespace_split = True
    lex.commenters = ''
    return list(lex)

Made it. I've looked into csv before but I needed to customize it heavily. Here is the function
def parse_quoted_strings_list(s):
    import csv
    return next(csv.reader([s],
                           skipinitialspace=True,
                           quoting=csv.QUOTE_NONNUMERIC,
                           escapechar='\\',
                           doublequote=False,
                           quotechar="'"))
And here are the tests
>>> test = r"""'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''"""
>>> for item in parse_quoted_strings_list(test):
...     print(item)
...
xx'x,x\x"xx\'\
1.0
2.0
xx'x'
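The 1.0 and 2.0 in that output come from csv.QUOTE_NONNUMERIC, which converts every unquoted field to a float. Here is a minimal variant, assuming you would rather keep unquoted fields as plain strings. The function name parse_quoted_strings_keep_text is made up for this sketch; everything else is the same csv.reader configuration as above.

```python
import csv

def parse_quoted_strings_keep_text(s):
    """Split one comma-separated line of '-quoted, backslash-escaped fields."""
    reader = csv.reader(
        [s],
        skipinitialspace=True,
        escapechar='\\',
        doublequote=False,
        quotechar="'",
    )
    # Without QUOTE_NONNUMERIC, unquoted fields such as 1 and 2 stay strings.
    return next(reader)

test = r"""'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''"""
fields = parse_quoted_strings_keep_text(test)
for field in fields:
    print(field)
```

If you do want numbers, converting the unquoted fields yourself afterwards gives you control over int versus float.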

Related

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####
j = input('> ')  # ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So my question is: how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) The constant ## at the head and tail of the pattern fixes where the key phrase starts and ends; the parentheses () between them form capture group 1. (2) Inside the group, .*? matches any characters lazily (non-greedily), so each match stops at the nearest closing ## instead of running on to the last one.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print(f'yay! You submitted "{keyphrase_content}" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginner's guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time.
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character, as many as you can, as long as two more '#' signs remain to close the match. Then stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them." Because .* is greedy, the capture runs up to the last '##' in the string, so ##a## ##b## yields the single capture a## ##b.
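To see the difference between the greedy and the lazy form on one input, here is a small comparison. The sample sentence is made up for illustration.

```python
import re

text = "a ##comment## does ##this##"

# Greedy: .* runs to the LAST closing ##, so there is one long capture.
greedy = re.findall(r'##(.*)##', text)

# Lazy: .*? stops at the NEAREST closing ##, so each tag is captured separately.
lazy = re.findall(r'##(.*?)##', text)

print(greedy)
print(lazy)
```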
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    # Compiled patterns expose `findall` (there is no `find_all`)
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default:
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
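For anyone who wants a head start on that exercise, one possible sketch follows. The function name make_keyphrase_regex and its parameters are my own invention; re.escape is the standard-library call that handles the special characters.

```python
import re

def make_keyphrase_regex(open_tag, close_tag=None):
    """Build a regex that captures the text between two literal tags."""
    if close_tag is None:
        close_tag = open_tag
    # re.escape neutralizes characters such as '*' or '?' inside the tags.
    return re.compile(re.escape(open_tag) + r'(.*?)' + re.escape(close_tag))

star_regex = make_keyphrase_regex('**')
found = star_regex.findall('call **this** and **that one**')

bracket_regex = make_keyphrase_regex('[[', ']]')
bracketed = bracket_regex.findall('see [[a?]] here')
```

Escaping the tags rather than the whole pattern keeps the lazy capture group in the middle working as a real regex.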
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that test will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+: letters, digits, or underscore), then 2 trailing #s. Note that re.match only matches at the start of the string; use re.search to find the pattern anywhere. From there, your strip code looks fine and you should be able to grab it.

Latex command substitution using regexp in python

I wrote a very ugly script to parse some lines of LaTeX in Python and do string substitution. I'm here because I want to write something to be proud of, and learn :P
More specifically, I'd like to change:
\ket{(.*)} into |(.*)\rangle
\bra{(.*)} into \langle(.*)|
To this end, I wrote a very very ugly script. The intended use is to do a thing like this:
cat file.tex | python script.py > new_file.tex
So what I did is the following. It's working, but it's not nice at all, and I'm wondering if you could give me a suggestion; even a link to the right command to use is OK. Note that I use recursion because once I have found the first "\ket{" I know that I want to replace the first occurring "}" (i.e. I'm sure there are no other subcommands within "\ket{"). But again, it's not the right way of parsing LaTeX.
import re
import sys

def recursion_ket(string_input, string_output=""):
    match = re.search(r"\\ket{", string_input)
    if not match:
        return string_input
    else:
        string_output = re.sub(r"\\ket{", '|', string_input, count=1)
        # "\rangle" is not a raw string, so "\r" is a carriage return here
        # (this is the second problem described below)
        string_output_second = re.sub(r"}", "\rangle", string_output.split('|', 1)[1], count=1)
        # put back the '|' that split() removed
        string_output = string_output.split('|', 1)[0] + '|' + string_output_second
        string_output = recursion_ket(string_output, string_output)
        return string_output

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        content = f.readlines()
    new = []
    for line in content:
        new.append(recursion_ket(line))
    z = open(sys.argv[2], 'w')
    for i in new:
        z.write(i.replace("\r", '\\r').replace("\b", '\\b'))
    z.write("")
Which I know is very ugly. And it's definitely not the right way of doing it. Probably it's because I come from perl, and I'm not used to python regexp.
First problem: is it possible to use regexp to substitute just the "border" of a matching string, and leave the inside as it is? I want to leave the content of \command{xxx} as it is.
Second problem: the \r. Apparently, when I try to print on the terminal or in a file each string, I need to make sure \r is not interpreted as carriage return. I have tried to use the automatic escape, but it's not what I need. It escapes the \n with another \ and this is not what I want.
To answer your questions,
First problem: You can use (named) groups
Second problem: In Python3, you can use r"\btree" to deal with the backslash gracefully.
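To make both points concrete, here is a rough sketch of the group-based substitution. It assumes, as the question does, that the argument of \ket{...} or \bra{...} contains no nested braces; the names KET, BRA and convert_bra_ket are just placeholders.

```python
import re

# Assumption from the question: no nested braces inside the argument.
KET = re.compile(r"\\ket\{([^{}]*)\}")
BRA = re.compile(r"\\bra\{([^{}]*)\}")

def convert_bra_ket(line):
    # The group keeps the inside untouched; only the "border" is replaced.
    # A function replacement is used so the backslashes in the output are
    # not reinterpreted as escape sequences by re.sub.
    line = KET.sub(lambda m: "|" + m.group(1) + r"\rangle", line)
    line = BRA.sub(lambda m: r"\langle" + m.group(1) + "|", line)
    return line

converted = convert_bra_ket(r"\bra{\phi} and \ket{\psi}")
print(converted)
```

Because every literal here is a raw string, \r and \b never turn into a carriage return or a backspace, so no post-processing with replace() is needed.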
Using a latex parser like github.com/alvinwan/TexSoup, we can simplify the code a bit. I know OP has asked for regex, but if OP is tool-agnostic, a parser would be more robust.
Nice Function
We can abstract this into a replace function
def replaceTex(soup, command, replacement):
    for node in soup.find_all(command):
        node.replace(replacement.format(args=node.args))
Then, use this replaceTex function in the following way
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> replaceTex(soup, 'bra', r"\langle{args[0]}|")
>>> replaceTex(soup, 'ket', r"|{args[0]}\rangle")
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol
Demo
Here's a self-contained demonstration, based on TexSoup:
>>> from TexSoup import TexSoup
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> soup
\section{hello} text \bra{(.)} haha \ket{(.)}lol
>>> soup.ket.replace(r"|{args[0]}\rangle".format(args=soup.ket.args))
>>> soup.bra.replace(r"\langle{args[0]}|".format(args=soup.bra.args))
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol

how to remove or translate multiple strings from strings?

I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. I can do it one by one, but it always fails with '\n'. Is there an efficient way to remove or translate all these characters and line-break symbols?
Since my sentences are not long, I do not want to use dictionary methods like in earlier questions.
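Maybe you could use regex to find all the characters that you want to replace
import re
s = s.strip()
# Remove the literal backslash-n pairs first, while the backslashes are still there
s = s.replace("\\n", "")
r = re.compile(r"[\[\]()\\\"',]")
s = re.sub(r, '', s)
print(s)
The "\n" needs separate handling: it is a backslash followed by the letter n, so it has to be replaced before the regex strips the backslashes.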
If the string is a valid Python expression, you can use literal_eval from the ast module to turn it into a list of tuples, and then process every tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
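As a quick illustration of what literal_eval gives you, here is the same one-liner run on a shortened stand-in for the question's string. The sample data is mine, not the original, and it is a complete, well-formed list.

```python
from ast import literal_eval

# Shortened sample in the same shape as the question's string:
# a list of 1-tuples whose text ends in an escaped newline.
your_string = r"""[("He was helpful.\n",), ('Great instructor\n',), ('Likes art\n',)]"""

rows = literal_eval(your_string)  # a real list of tuples, escapes decoded
cleaned = ' '.join(row[0].strip() for row in rows)
print(cleaned)
```

literal_eval decodes the \n escapes into real newlines, which strip() then removes, so no manual character deletion is needed.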
If not then you can use this:
import re
def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # remove literal backslash-n pairs, quotes and stray backslashes
        yield re.sub(r'\\n|["\'\\]', '', part).strip(', ')
''.join(get_part_string(your_string))

How to get the setting string (like buildpath = /home/build ) using regular expressions in python?

My string looks like this (Xcode project settings):
ARCHS_UNIVERSAL_IPHONE_OS = armv7 armv7s
AVAILABLE_PLATFORMS = iphonesimulator macosx iphoneos
BUILD_COMPONENTS = headers build
BUILD_DIR = /home/projects/build
BUILD_ROOT = /home/ohter/build
I want to get the string "/home/projects/build", which has "BUILD_DIR = " at its head and a \n at the end.
I want to use regular expressions in Python. I have read a lot of documentation about regular expressions, but I can't get the hang of them quickly. Can anybody give me a tip?
Regular expressions can be very attractive, but in general you want to avoid them whenever possible. They tend to be relatively difficult to comprehend, prone to error, and inflexible.
Instead, I would suggest something closer to this (assuming you're reading this string in from a file):
def grab_parameter(filename, parameter):
    with open(filename) as source:
        for line in source:
            if line.startswith(parameter):
                # split only on the first '=' in case the value contains one
                return line.split('=', 1)[1].strip()
print(grab_parameter('my_settings.txt', 'BUILD_DIR'))
>>> /home/projects/build
This way, your usage becomes very flexible, and if you decided to grab another variable you could do it easily, for instance:
print(grab_parameter('my_settings.txt', 'BUILD_COMPONENTS'))
>>> headers build
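If the settings are already in memory as one string, the same idea can be sketched without a file. This version, with the made-up name parse_settings, collects every NAME = value line into a dict so that several settings can be looked up after a single pass.

```python
def parse_settings(text):
    """Turn 'NAME = value' lines into a dict; lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition('=')
        if sep:  # partition found an '=' on this line
            settings[name.strip()] = value.strip()
    return settings

sample = """ARCHS_UNIVERSAL_IPHONE_OS = armv7 armv7s
BUILD_COMPONENTS = headers build
BUILD_DIR = /home/projects/build
BUILD_ROOT = /home/ohter/build
"""

settings = parse_settings(sample)
print(settings['BUILD_DIR'])
```

Looking names up by exact key also avoids the prefix problem of startswith, where a hypothetical setting named BUILD_DIR_EXTRA would match a search for BUILD_DIR.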
You COULD do this with regular expressions, by using a regex like this:
r'BUILD_DIR\s*=\s*(.*)'
But as you can see, it's a pretty unattractive prospect. It's unclean, unintuitive, and generally very brittle (for instance, writing a backreference as '\1' instead of r'\1' silently puts an SOH character, \x01, into your output).
If you REALLY want to use regex (which you shouldn't), you could do it like so:
import re
def grab_build_dir(filename):
    re_pattern = re.compile(r'BUILD_DIR\s*=\s*(.*)')
    with open(filename) as source:
        for line in source:
            match = re_pattern.match(line)
            if match:
                return match.group(1).strip()
print(grab_build_dir('my_settings.txt'))
>>> /home/projects/build

Python Regular Expression findall with variable

I am trying to use re.findall with lookbehind and lookahead to extract data. The regular expression works fine when I am not using a raw_input variable, but I need users to be able to input a variety of different search terms.
Here is the current code:
me = re.findall(r"(?<='(.+)'+variable+'(.+)')(.*?)(?='(.+)+variable+(.+)')", raw)
As you can see, I am trying to pull out strings between one search term.
However, each time I use this type of formatting, I get a fixed-width error. Is there any way around this?
I have also tried the following formats with no success.
variable = raw_input('Term? ')
'.*' + variable + '.*'
and
'.*%s.*' % (variable, )
and
'.*{0}.*'.format(variable)
and
'.*{variable}.*'.format(variable=variable)
I'm not sure if this is what you mean, but it may get you started. As far as I understand your question, you don't need lookaheads or lookbehinds. This is for Python 2.x (won't work with Python 3):
>>> import re
>>> string_to_search = 'fish, hook, swallowed, reeled, boat, fish'
>>> entered_by_user = 'fish'
>>> search_regex = r"{0}(.+){0}".format(entered_by_user)
>>> match = re.search(search_regex, string_to_search)
>>> if match:
...     print "result:", match.group(1).strip(' ,')
...
result: hook, swallowed, reeled, boat
If you really want the last 'fish' in the result as in your comment above, then just remove the second {0} from the format() string.
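For Python 3, roughly the same approach could look like the sketch below. The helper name text_between is hypothetical. The one real change is re.escape, which makes the user's term match literally even if it contains regex metacharacters.

```python
import re

def text_between(term, haystack):
    """Return the text between the first and last occurrence of term."""
    escaped = re.escape(term)  # treat the user's input as literal text
    match = re.search(f"{escaped}(.+){escaped}", haystack)
    return match.group(1).strip(" ,") if match else None

result = text_between("fish", "fish, hook, swallowed, reeled, boat, fish")
special = text_between("(*)", "(*) a+b (*)")
print(result)
print(special)
```

Without re.escape, a search term such as (*) would be read as a regex and raise an error instead of matching.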
This fixes the interpolation part: prefix the pattern with rf (a raw f-string, Python 3.6+) and put the variable between {}. The lookbehind still has to be fixed-width for the pattern to compile.
me = re.findall(rf"(?<='(.+)'+{variable}+'(.+)')(.*?)(?='(.+)+{variable}+(.+)')", raw)
You can also add as many different variables as you wish:
import re
text = "regex is the best"
var1 = "is the"
var2 = "best"
yes = re.findall(rf"regex {var1} {var2}", text)
print(yes)
['regex is the best']
The way lookbehind is usually implemented (including its Python implementation) has an inherent limitation that you are unfortunately running into: lookbehinds cannot be variable-length. The "Important Notes About Lookbehind" section here explains why. I think you should be able to do the regex without a lookbehind, though.
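One way to do that, sketched under the assumption that the goal is "everything between consecutive occurrences of the search term": let the pattern consume the leading term normally and keep only a lookahead for the trailing one. The function name between_occurrences is made up.

```python
import re

def between_occurrences(term, raw):
    """Return the chunks of raw that sit between consecutive occurrences of term."""
    escaped = re.escape(term)
    # The leading term is matched normally instead of in a lookbehind.
    # The trailing term stays in a lookahead, so it is not consumed and
    # can serve as the leading term of the next match.
    return re.findall(f"{escaped}(.*?)(?={escaped})", raw)

pieces = between_occurrences("fish", "fish, hook, fish, boat, fish")
print(pieces)
```

Lookahead has no fixed-width restriction in Python's re module, so the variable-length term is fine there.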
