LaTeX command substitution using regexp in Python

I wrote a very ugly script to parse some lines of LaTeX in Python and do string substitution. I'm here because I want to write something to be proud of, and to learn :P
More specifically, I'd like to change:
\ket{(.*)} into |(.*)\rangle
\bra{(.*)} into \langle(.*)|
To this end, I wrote a very very ugly script. The intended use is to do a thing like this:
cat file.tex | python script.py > new_file.tex
So what I did is the following. It works, but it is not nice at all, and I'm wondering if you could give me a suggestion; even a link to the right command to use would be fine. Note that I use recursion because once I have found the first "\ket{" I know that I want to replace the first occurring "}" (i.e. I'm sure there are no other subcommands within "\ket{"). But again, it's not the right way of parsing LaTeX.
import re
import sys
def recursion_ket(string_input, string_output=""):
    match = re.search(r"\\ket{", string_input)
    if not match:
        return string_input
    else:
        string_output = re.sub(r"\\ket{", '|', string_input, 1)
        string_output_second = re.sub(r"}", "\rangle", string_output.split('|', 1)[1], 1)
        string_output = string_output.split('|', 1)[0] + string_output_second
        string_output = recursion_ket(string_output, string_output)
        return string_output
if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        content = f.readlines()
    new = []
    for line in content:
        new.append(recursion_ket(line))
    z = open(sys.argv[2], 'w')
    for i in new:
        z.write(i.replace("\r", '\\r').replace("\b", '\\b'))
    z.write("")
Which I know is very ugly. And it's definitely not the right way of doing it. Probably it's because I come from Perl, and I'm not used to Python regexps.
First problem: is it possible to use regexp to substitute just the "border" of a matching string, and leave the inside as it is? I want to leave the content of \command{xxx} as it is.
Second problem: the \r. Apparently, when I try to print on the terminal or in a file each string, I need to make sure \r is not interpreted as carriage return. I have tried to use the automatic escape, but it's not what I need. It escapes the \n with another \ and this is not what I want.

To answer your questions,
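First problem: Yes. Put the part you want to keep in a (named) group and refer to that group in the replacement.
Second problem: In Python 3, you can use a raw string such as r"\btree" so that the backslash is not treated as the start of an escape sequence.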
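As a rough illustration of both points, here is a minimal sketch using only the standard re module. The names KET, BRA and convert are made up for this example, and the patterns assume (as the question does) that the argument of \ket{...} or \bra{...} contains no nested braces.

```python
import re

# Hypothetical patterns: the body is captured in a named group and may not
# contain braces, matching the question's "no nested subcommands" assumption.
KET = re.compile(r"\\ket\{(?P<body>[^{}]*)\}")
BRA = re.compile(r"\\bra\{(?P<body>[^{}]*)\}")

def convert(line):
    # A function replacement avoids escape processing of the replacement text,
    # and the raw strings keep \r and \l from being read as escape sequences.
    line = KET.sub(lambda m: "|" + m.group("body") + r"\rangle", line)
    line = BRA.sub(lambda m: r"\langle" + m.group("body") + "|", line)
    return line

print(convert(r"\ket{\psi} and \bra{\phi}"))  # |\psi\rangle and \langle\phi|
```

To match the intended pipe usage, the same function could be applied to each line read from sys.stdin and the result written to sys.stdout.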
Using a latex parser like github.com/alvinwan/TexSoup, we can simplify the code a bit. I know OP has asked for regex, but if OP is tool-agnostic, a parser would be more robust.
Nice Function
We can abstract this into a replace function
def replaceTex(soup, command, replacement):
    for node in soup.find_all(command):
        node.replace(replacement.format(args=node.args))
Then, use this replaceTex function in the following way
>>> from TexSoup import TexSoup
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> replaceTex(soup, 'ket', r"|{args[0]}\rangle")
>>> replaceTex(soup, 'bra', r"\langle{args[0]}|")
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol
Demo
Here's a self-contained demonstration, based on TexSoup:
>>> from TexSoup import TexSoup
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> soup
\section{hello} text \bra{(.)} haha \ket{(.)}lol
>>> soup.ket.replace(r"|{args[0]}\rangle".format(args=soup.ket.args))
>>> soup.bra.replace(r"\langle{args[0]}|".format(args=soup.bra.args))
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol

Related

Find and replace string in between characters

I need to find a string in a file and reformat it.
String to find format:
[title](link)
example:
[template application](https://stackoverflow.com/sample/base-app)
I would like to change that to an HTML link:
<a href="link">title</a>
example:
<a href="https://stackoverflow.com/sample/base-app">template application</a>
What is the best way to do it?
I am thinking of regular expressions, but I have no clue how to achieve that. Is there a simple way?
You could take advantage of the fact that the sub function receives another function as a parameter for replacement:
import re
line = '[template application](https://stackoverflow.com/sample/base-app)'
def repl(match):
    return '<a href="{}">{}</a>'.format(match.group(2), match.group(1))
result = re.sub(r'\[(.+?)\]\((https?.+?)\)', repl, line)
print(result)
Output
<a href="https://stackoverflow.com/sample/base-app">template application</a>
The pattern r'\[(.+?)\]\((https?.+?)\)' captures everything between brackets, followed by something that looks like a link (it starts with http) in parentheses. Notice that you must escape the brackets and parentheses because they have a special meaning inside a regex.
Or, as suggested by @JonClements, you could use:
re.sub(r'\[(.+?)\]\((https?.+?)\)', r'<a href="\2">\1</a>', line)
instead of the repl function.
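If the file contains several such links, the same idea extends naturally. The following is only a sketch: the name md_links_to_html, the sample text, and the stricter URL pattern are assumptions for illustration, and html.escape is added so that characters like & in the URL come out as valid HTML.

```python
import html
import re

# Assumed pattern: [title](http(s)://url) where the URL has no spaces or ')'.
LINK = re.compile(r'\[(.+?)\]\((https?://[^\s)]+)\)')

def md_links_to_html(text):
    def repl(match):
        title, url = match.group(1), match.group(2)
        # Escape both parts so the result is safe to embed in HTML.
        return '<a href="{}">{}</a>'.format(html.escape(url, quote=True),
                                            html.escape(title))
    return LINK.sub(repl, text)

sample = "See [docs](https://example.com/a?x=1&y=2) and [home](http://example.com)."
print(md_links_to_html(sample))
```

A whole file could be handled by reading its contents into a string and passing that string to the function.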
You can reach your desired result using re as Daniel said, but if you don't want to use regex, you can do this with str.split:
line = '[template application](https://stackoverflow.com/sample/base-app)'
link = line.split('(')[1][:-1]
title = line.split(']')[0][1:]
result = '<a href="{}">{}</a>'.format(link, title)
If you are using Python 3.6 or higher:
result = f'<a href="{link}">{title}</a>'

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####
j = input('> ')  # ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) The literal ## at the head and tail delimit capture group 1, which is the part between the parentheses (). (2) Inside the group, .*? matches any characters lazily (non-greedily), so each match stops at the nearest closing ## instead of running on to the last one in the string.
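The difference the ? makes is easiest to see side by side. This small sketch (the sample text is made up) runs the greedy and the lazy version of the pattern over the same string:

```python
import re

text = "a ##comment## does something like ##this##"

# Greedy: .* grabs as much as possible, so one match spans first ## to last ##.
greedy = re.findall(r'##(.*)##', text)
# Lazy: .*? stops at the nearest closing ##, giving one match per pair.
lazy = re.findall(r'##(.*?)##', text)

print(greedy)  # ['comment## does something like ##this']
print(lazy)    # ['comment', 'this']
```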
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginner's guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means: "search through the string until you find two '#' signs. Right after those, start capturing zero or more of any character, as many as possible. Then back off until the captured text is followed by two more '#' signs. If that works, return successfully and hold onto the entire match (from the first '#' to the last '#'). Also, hold onto the captured characters in case the programmer asks for just them." Because .* is greedy, a string with several ##...## pairs produces one long match that runs from the first ## to the last.
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default:
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
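One possible take on that exercise, as a sketch rather than a definitive answer: re.escape from the standard library does the escaping of special characters for you. The helper name make_keyphrase_regex is invented here, and it assumes the same literal tag opens and closes a keyphrase.

```python
import re

def make_keyphrase_regex(tag):
    # Hypothetical helper: build a lazy "tag ... tag" pattern for a literal tag.
    # re.escape neutralises characters such as '*' and '?' in the tag.
    escaped = re.escape(tag)
    return re.compile(escaped + r'(.*?)' + escaped)

def parse_keyphrases(text, keyphrase_regex=make_keyphrase_regex('##')):
    return keyphrase_regex.findall(text)

print(parse_keyphrases("a ##comment## like ##this##"))                    # ['comment', 'this']
print(parse_keyphrases("is **this** bold?", make_keyphrase_regex('**')))  # ['this']
```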
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that test will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+ matches letters, digits and underscores), then 2 trailing #s. Note that \w does not match spaces, so a phrase of several words needs a different pattern. From there, your strip code looks fine and you should be able to grab it.

How to parse a simple quoted string (handling escapes)

Note: this should be REALLY simple, I know. It's not. Or I'm dumb. But I tried hard.
What I want to do is simple. I have a string, there are strings inside it, separated with ,, quoted with '. I want to parse them. Consider the presence of \'s and \\s.
I want to do it in the most simple, elegant and concise way, obviously.
Now, on to some failed tries:
"I know, I'll use json!" No. JSON uses ". Too bad.
Mmmh, a regex? This looks like looking for trouble, but... Oh God my eyes those regexes I got from the Internet! At least do they... Nope, no support of escapes.
shlex! The Python standard library always has a solution! See below my failed attempt.
Current status: sobbing, writing a parser.
Test input: 'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''
Test output: xx'x,x\x"xx\'\, 1, 2, xx'x'
def split(s):
    import shlex
    lex = shlex.shlex(s, posix=True)
    lex.whitespace = ','
    lex.whitespace_split = True
    lex.commenters = ''
    return list(lex)
Made it. I've looked into csv before but I needed to customize it heavily. Here is the function
def parse_quoted_strings_list(s):
    import csv
    return next(csv.reader([s],
                           skipinitialspace=True,
                           quoting=csv.QUOTE_NONNUMERIC,
                           escapechar='\\',
                           doublequote=False,
                           quotechar="'"))
And here are the tests
>>> test = r"""'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''"""
>>> for item in parse_quoted_strings_list(test):
...     print(item)
...
xx'x,x\x"xx\'\
1.0
2.0
xx'x'
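For comparison, the hand-written parser mentioned in the question does not have to be long. The following is a minimal sketch, not the author's code: the name parse_quoted_list is made up, a backslash is assumed to escape whatever character follows it, whitespace is not stripped, and unquoted values are returned as strings rather than converted to numbers.

```python
def parse_quoted_list(s):
    # Hypothetical single-pass parser for comma-separated, '-quoted fields.
    fields, current, in_quotes, i = [], [], False, 0
    while i < len(s):
        ch = s[i]
        if ch == '\\' and i + 1 < len(s):
            # Backslash escape: keep the next character literally.
            current.append(s[i + 1])
            i += 2
            continue
        if ch == "'":
            in_quotes = not in_quotes       # opening or closing quote
        elif ch == ',' and not in_quotes:
            fields.append(''.join(current))  # field separator
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append(''.join(current))
    return fields

sample = r"""'xx\'x,x\\x"xx\\\'\\',1,2,'xx\'x\''"""
for field in parse_quoted_list(sample):
    print(field)
```

The csv version is shorter and better tested; a loop like this mainly helps when the escaping rules need to differ from what csv supports.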

Python Regular Expression findall with variable

I am trying to use re.findall with look-behind and look-ahead to extract data. The regular expression works fine when I am not using a raw_input variable, but I need users to be able to input a variety of different search terms.
Here is the current code:
me = re.findall(r"(?<='(.+)'+variable+'(.+)')(.*?)(?='(.+)+variable+(.+)')", raw)
As you can see, I am trying to pull out strings between one search term.
However, each time I use this type of formatting, I get a fixed-width error. Is there any way around this?
I have also tried the following formats with no success.
variable = raw_input('Term? ')
'.*' + variable + '.*'
and
'.*%s.*' % (variable, )
and
'.*{0}.*'.format(variable)
and
'.*{variable}.*'.format(variable=variable)
I'm not sure if this is what you mean, but it may get you started. As far as I understand your question, you don't need lookaheads or lookbehinds. This is for Python 2.x (won't work with Python 3):
>>> import re
>>> string_to_search = 'fish, hook, swallowed, reeled, boat, fish'
>>> entered_by_user = 'fish'
>>> search_regex = r"{0}(.+){0}".format(entered_by_user)
>>> match = re.search(search_regex, string_to_search)
>>> if match:
...     print "result:", match.group(1).strip(' ,')
...
result: hook, swallowed, reeled, boat
If you really want the last 'fish' in the result as in your comment above, then just remove the second {0} from the format() string.
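An rf-string (Python 3.6+) lets you interpolate the variable directly into the pattern:
me = re.findall(rf"(?<='(.+)'+{variable}+'(.+)')(.*?)(?='(.+)+{variable}+(.+)')", raw)
Note that this only takes care of the interpolation. The look-behind still contains (.+), so re will keep rejecting it as not fixed-width.
You can add as many different variables as you wish: prefix the pattern with rf and put each variable between {}.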
import re
text = "regex is the best"
var1 = "is the"
var2 = "best"
yes = re.findall(rf"regex {var1} {var2}", text)
print(yes)
['regex is the best']
The way lookbehind is usually implemented (including its Python implementation) has an inherent limitation that you are unfortunately running into: lookbehinds cannot be variable-length. The "Important Notes About Lookbehind" section here explains why. I think you should be able to do the regex without a lookbehind, though.
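One way to do without the look-behind, sketched under the assumption that the goal is the text between consecutive occurrences of a user-supplied term: match the term itself, capture what follows, and only peek at the next occurrence with a look-ahead. The helper name between_term and the sample string are made up; re.escape makes sure the user's input is treated as literal text.

```python
import re

def between_term(term, raw):
    # Hypothetical helper: text found between consecutive occurrences of term.
    escaped = re.escape(term)  # user input is literal text, not regex syntax
    # The closing term sits in a look-ahead so it can also open the next match.
    pattern = re.compile(escaped + r"(.*?)(?=" + escaped + r")", re.DOTALL)
    return pattern.findall(raw)

raw = 'fish, hook, swallowed, fish, reeled, boat, fish'
print(between_term('fish', raw))  # [', hook, swallowed, ', ', reeled, boat, ']
```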

Regular expression for string between two strings?

Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!
The document (which to make clear, is a long HTML page) I'm searching has a whole bunch of strings in it (inside a JavaScript function) that look like this:
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};
I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862
To get the links, I know I need to start with:
re.matchall(regexp, doc_string)
But what should regexp be?
The answer to your question depends on what the rest of the string looks like. If the lines are all of the form link: '<URL>'}; then you can do it very simply with plain string slicing:
myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )
(If you have one string with multiple lines like that, you can just split the string into lines first.)
If it is a bit more complex though, using regular expressions is fine. One example that just looks for the URL inside the quotes would be:
import re
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""
print( re.findall( "'([^']+)'", myDoc ) )
Depending on how the whole string looks, you might have to include the link: as well:
print( re.findall( "link: '([^']+)'", myDoc ) )
I'd start with:
regexp = "'([^']+)'"
And check if it works okay. I mean, if the only condition is that the string is on one line between single quotes, it should be good as it is.
Use a few simple splits
>>> s="link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
...     if "/" in i:
...         print(i)
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>
