Find and replace string in between characters - python

I need to find a string in a file and reformat it.
String to find format:
[title](link)
example:
[template application](https://stackoverflow.com/sample/base-app)
I would like to change that to HTML link:
title
example:
template application
What is the best way to do it?
I am thinking regular expressions but I have no clue to how achieve that. Is there a simple way?

You could take advantage of the fact that the sub function receives another function as a parameter for replacement:
import re
line = '[template application](https://stackoverflow.com/sample/base-app)'
def repl(match):
return '{}'.format(match.group(2), match.group(1))
result = re.sub('\[(.+?)\]\((https?.+?)\)', repl, line)
print(result)
Output
template application
The pattern '\[(.+?)\]\((https?.+?)\)' captures everything between brackets followed by something link like (starts with http), notice that you must escape brackets and parenthesis because they have a special meaning inside a regex.
Or as suggested by #JonClements, you could use:
re.sub('\[(.+?)\]\((https?.+?)\)', r'\1', line)
instead of the repl function.

You can reach your desired result using re as Daniel said, But if you dont want to use regex, you can do this with str.split:
line = '[template application](https://stackoverflow.com/sample/base-app)'
link = line.split('(')[1][:-1]
title = line.split(']')[0][1:]
result = '{}'.format(link, title)
if you are using Python 3.6 or higher:
result = f'{title}'

Related

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so even if i have s3://tester/folder/anotherone/test.pdf I am getting the entire path after s3://tester/
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But i get an error saying that it out of index.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
Below re method should work no matter S3 variable, returning all after third / in string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]

Latex command substitution using regexp in python

I wrote a very ugly script in order to parse some rows of latex in python and doing string substitution. I'm here because I'm want to write something to be proud of, and learn :P
More specifically, I'd like to change:
\ket{(.*)} into |(.*)\rangle
\bra{(.*)} into \langle(*)|
To this end, I wrote a very very ugly script. The intended use is to do a thing like this:
cat file.tex | python script.py > new_file.tex
So what I did is the following. It's working, but is not nice at all and I'm wondering if you could give me a suggestion, even a link to the right command to use is ok. Note that I do recursion because when I have found the first "\ket{" i know that I want to replace the first occuring "}" (i.e. I'm sure there are no other subcommands within "\ket{"). But again, it's not the right way of parsing latex.
def recursion_ket(string_input, string_output=""):
match = re.search("\ket{", string_input)
if not match:
return string_input
else:
string_output = re.sub(r"\\ket{", '|', string_input, 1)
string_output_second =re.sub(r"}", "\rangle", stringa_output.split('|', 1)[1], 1)
string_output = string_output.split('|', 1)[0]+string_output_second
string_output=recursion_ket(string_output, string_output)
return string_output
if __name__ == '__main__':
with open(sys.argv[1]) as f:
content=f.readlines()
new=[]
for line in content:
new.append(ricorsione_ket(line))
z=open(sys.argv[2], 'w')
for i in new:
z.write(i.replace("\r", '\\r').replace("\b", '\\b'))
z.write("")
Which I know is very ugly. And it's definitely not the right way of doing it. Probably it's because I come from perl, and I'm not used to python regexp.
First problem: is it possible to use regexp to substitute just the "border" of a matching string, and leave the inside as it is? I want to leave the content of \command{xxx} as it is.
Second problem: the \r. Apparently, when I try to print on the terminal or in a file each string, I need to make sure \r is not interpreted as carriage return. I have tried to use the automatic escape, but it's not what I need. It escapes the \n with another \ and this is not what I want.
To answer your questions,
First problem: You can use (named) groups
Second problem: In Python3, you can use r"\btree" to deal with the backslash gracefully.
Using a latex parser like github.com/alvinwan/TexSoup, we can simplify the code a bit. I know OP has asked for regex, but if OP is tool-agnostic, a parser would be more robust.
Nice Function
We can abstract this into a replace function
def replaceTex(soup, command, replacement):
for node in soup.find_all(command):
node.replace(replacement.format(args=node.args))
Then, use this replaceTex function in the following way
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> replaceTex('bra', r"|{args[0]}\rangle")
>>> replaceTex('ket', r"\langle{args[0]}|")
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol
Demo
Here's a self-contained demonstration, based on TexSoup:
>>> import TexSoup
>>> soup = TexSoup(r"\section{hello} text \bra{(.)} haha \ket{(.)}lol")
>>> soup
\section{hello} text \bra{(.)} haha \ket{(.)}lol
>>> soup.ket.replace(r"|{args[0]}\rangle".format(args=soup.ket.args))
>>> soup.bra.replace(r"\langle{args[0]}|".format(args=soup.bra.args))
>>> soup
\section{hello} text \langle(.)| haha |(.)\ranglelol

regex error : raise error, v # invalid expression

I have a variable device='A/B/C/X1' that is commented out in another file. There can be multiple instances of the same device such as 'A/B/C/X1#1', ..#2 and so on. All of these devices are commented out in another file with a prefix *.
I want to remove the * but not affect similar devices like 'A/B/C/X**10**'.
I tried using regex to simply substitute a pattern using the following line of code, but I'm getting an InvalidExpression error.
line=re.sub('^*'+device+'#',device+'#',line)
Please help.
You need to escape the asterisk since it has a meaning in regex syntax:
line=re.sub(r'^\*'+device+'#',device+'#',line).
Escaping the variables you use to construct the regex is also always a good idea:
line=re.sub(r'^\*'+re.escape(device)+'#',device+'#',line)
def replace_all(text):
if text in ['*']:
text = text.replace(text, '')
return text
my_text = 'adsada*asd*****dsa*****'
result = "".join(map(replace_all, my_text))
print result
Or
import re
my_text = 'adsada*asd*****dsa*****'
print (re.sub('\*', '', my_text))

python regex on variable

Please help with my regex problem
Here is my string
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
The source_resource is in the source may end with & or with .[for example].
So far,
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have used the text here. Rather using text, how can i use source_resource variable with & or . to find this out.
If the goal is to extract the pf_rd_m value (which it apparently is as you are using regex.findall), than I'm not sure regex are the easiest solution here:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
You also have to escape the .
pattern=re.compile(source_resource + '[&\.]')
You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource="pf_rd_m=ATVPDKIKX0DER"
regex_string = source_resource + "[&\.]"
regex = re.compile(regex_string)
print regex.findall(source_and)
print regex.findall(source_dot)
>>> ['pf_rd_m=ATVPDKIKX0DER&']
['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
Just take note that I modified your regular expression: the . is a special symbol and needs to be escaped, as is the + (I just assumed the string will only occur once, which makes the use of + unnecessary).

Regular expression for string between two strings?

Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!
The document (which to make clear, is a long HTML page) I'm searching has a whole bunch of strings in it (inside a JavaScript function) that look like this:
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};
I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862
To get the links, I know I need to start with:
re.matchall(regexp, doc_sting)
But what should regexp be?
The answer to your question depends on how the rest of the string may look like. If they are all like this link: '<URL>'}; then you can do it very simple using simple string manipulation:
myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )
(If you just have one string with multiple lines by that, you can just split the string into lines.)
If it is a bit more complex though, using regular expressions are fine. One example that just looks for the url inside of the quotes would be:
myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""
print( re.findall( "'([^']+)'", myDoc ) )
Depending on how the whole string looks, you might have to include the link: as well:
print( re.findall( "link: '([^']+)'", myDoc ) )
I'd start with:
regexp = "'([^']+)'"
And check if it works okay - I mean, if the only condition is that string is in one line between '', it should be good as it is.
Use a few simple splits
>>> s="link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
... if "/" in i:
... print i
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>

Categories