Getting a string between two characters in Python

I need to extract certain words from a string and put them into a new format. For example, I call the function with the input:
text2function('$sin (x)$ is an function of x')
and I need to put them into a StringFunction:
StringFunction(function, independent_variables=[vari])
where I need to get just 'sin (x)' for function and 'x' for vari. So it would look like this finally:
StringFunction('sin (x)', independent_variables=['x'])
The problem is, I can't seem to obtain function and vari. I have tried:
start = string.index(start_marker) + len(start_marker)
end = string.index(end_marker, start)
return string[start:end]
and
r = re.compile('$()$')
m = r.search(string)
if m:
    lyrics = m.group(1)
and
send = re.findall('$([^"]*)$',string)
All of these seem to give me nothing. Am I doing something wrong? All help is appreciated. Thanks.

A quick, hacky way:
>>> char1 = '('
>>> char2 = ')'
>>> mystr = "mystring(123234sample)"
>>> print(mystr[mystr.find(char1)+1 : mystr.find(char2)])
123234sample

$ is a special character in regex (it denotes the end of the string). You need to escape it:
>>> re.findall(r'\$(.*?)\$', '$sin (x)$ is an function of x')
['sin (x)']
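To get both pieces the question asks for, the escaped pattern can be combined with a second search for the name inside the parentheses. This is only a minimal sketch: the helper name split_function_text is hypothetical, and it assumes the independent variable is the single identifier between parentheses.

```python
import re

def split_function_text(text):
    """Return (function, variable) from text like '$sin (x)$ is ...'."""
    # \$ matches a literal dollar sign; .*? stops at the next one.
    match = re.search(r'\$(.*?)\$', text)
    if match is None:
        return None
    function = match.group(1)
    # Assumption: the independent variable is the identifier in parentheses.
    var_match = re.search(r'\(\s*(\w+)\s*\)', function)
    variable = var_match.group(1) if var_match else None
    return function, variable

print(split_function_text('$sin (x)$ is an function of x'))
```

The two returned values could then be passed on as the function string and the independent_variables entry.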

If you want to cut a string between two identical characters (e.g. !234567890!), you can use split:
line_word = line.split('!')
print(line_word[1])
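The split approach raises an IndexError when the delimiter is missing, and it cannot tell a closed pair from a single stray delimiter. A small sketch of a guarded variant (the function name between_identical is made up for illustration):

```python
def between_identical(line, delimiter):
    """Return the text between the first two delimiters, or None."""
    # maxsplit=2 yields at most three parts: before, between, after.
    parts = line.split(delimiter, 2)
    if len(parts) < 3:
        # Fewer than two delimiters were found.
        return None
    return parts[1]

print(between_identical('!234567890!', '!'))
print(between_identical('$sin (x)$ is an function of x', '$'))
```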

The search for the second character has to start beyond the first one. Because start already includes len(start_marker), it points just past the opening marker, so searching from start is enough:
end = string.index(end_marker, start)
Searching from the position of the opening marker itself would find the same character at the same location again. With both markers set to '$':
>>> start_marker = end_marker = '$'
>>> string = '$sin (x)$ is an function of x'
>>> start = string.index(start_marker) + len(start_marker)
>>> end = string.index(end_marker, start)
>>> string[start:end]
'sin (x)'
For your regular expressions, the $ character is interpreted as an anchor, not as the literal character. Escape it to match a literal $ (and look for things that are not $ instead of not "):
send = re.findall(r'\$([^$]*)\$', string)
which gives:
>>> import re
>>> re.findall(r'\$([^$]*)\$', string)
['sin (x)']
The regular expression $()$ wouldn't match anything between the parentheses even if you did escape the $ characters, because the group () is empty.
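When the markers are not known in advance, re.escape can do the escaping so that characters such as $ or ( are always treated literally. A hedged sketch; the helper name between is an assumption for illustration:

```python
import re

def between(text, start_marker, end_marker):
    """Return the text between the first start/end marker pair, or None."""
    # re.escape makes markers like '$' or '(' match literally.
    pattern = re.escape(start_marker) + r'(.*?)' + re.escape(end_marker)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

print(between('$sin (x)$ is an function of x', '$', '$'))
print(between('mystring(123234sample)', '(', ')'))
```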

Related

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search(r'\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
...     i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
    names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
    for name in names:
        # Drop trailing characters until the last one is a digit
        while not name[-1].isdigit():
            name = name[:-1]
        print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
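The question's code stops before the slicing step, so the cause of the stray _A cannot be confirmed from what is shown. For comparison, here is a sketch of how the reversed-string idea can be completed; the function name truncate_after_last_digit is hypothetical.

```python
import re

def truncate_after_last_digit(s):
    """Cut s right after its last digit; return s unchanged if it has none."""
    # The first digit in the reversed string is the last digit in s.
    match = re.search(r'\d', s[::-1])
    if match is None:
        return s
    # match.start() counts how many trailing characters follow the last digit.
    return s[:len(s) - match.start()]

print(truncate_after_last_digit('111_Joe_Smith_2010_Assessment'))
```

The key step is converting the position found in the reversed string back into an index in the original string before slicing.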

What regular expression can extract the data I need?

I have a string
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
I would like to extract the number 528341191030 between the first two \u. I tried this:
m = re.search('\?id\u\d+d(\d+?)\u', url)
if m:
    print m.group(1)
But it doesn't work. What is wrong with my solution?
Are you sure you need regex?
Here is a solution using split:
url.split("\u")[1].split("d")[-1]
'528341191030'
In terms of what is wrong with your regex, "\" is a special character, so you should use "\\" for a backslash (so "\\\u" instead of "\u"):
m = re.search('\?id\\\u\d+d(\d+?)\\\u', url)
if m:
    print m.group(1)
Gives: 528341191030
Docs:
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal.
Or, use raw string notation:
m = re.search(r"\?id\\u\d+d(\d+?)\\u", url)
if m:
    print m.group(1)
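Under Python 3, \u inside an ordinary string literal starts a Unicode escape, so the snippets above need raw strings on both sides. A minimal sketch, assuming the data really contains literal backslash-u sequences (which is why the URL is written as a raw string here):

```python
import re

# Assumption: the text holds literal "\u003d" sequences, not decoded "=" signs.
url = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

# In the raw pattern, \\u matches a literal backslash followed by "u".
m = re.search(r'\?id\\u\d+d(\d+?)\\u', url)
item_id = m.group(1) if m else None
print(item_id)
```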
Well, you could always try this (not super elegant but works):
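first = url.find('\\u') + 2  # skip past the first backslash-u
prev = 'u'
m = ""
for i in range(first, len(url)):
    if prev == '\\' and url[i] == 'u':
        m = m[:-1]  # drop the backslash that was already appended
        break
    m += url[i]
    if url[i] == 'd':
        m = ""  # discard the hex digits of the escape (003d)
    prev = url[i]  # remember the previous character for the next pass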
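A better way is to parse the URL and read the query string values. In Python 3 the \u003d and \u0026 escapes in this literal are decoded to = and &, which is why the query string parses cleanly: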
url = '//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'
import urllib.parse as urlparse
print ( urlparse.parse_qs(urlparse.urlparse(url).query) )
print ( urlparse.parse_qs(urlparse.urlparse(url).query)['id'] )
Output:
{'id': ['528341191030'], 'ns': ['1'], 'abbucket': ['0']}
['528341191030']
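If the URL arrives as text that still contains literal backslash-u sequences (for example, read from a file or a page source), the escapes can be decoded first and the result parsed the same way. This is a sketch under that assumption; the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: the escapes are literal characters in the data (raw string).
raw_url = r'//item.taobao.com/item.htm?id\u003d528341191030\u0026ns\u003d1\u0026abbucket\u003d0#detail'

# Turn \u003d into "=" and \u0026 into "&" (fine for ASCII-only input).
decoded = raw_url.encode('ascii').decode('unicode_escape')

query = parse_qs(urlsplit(decoded).query)
item_id = query['id'][0]
print(decoded)
print(item_id)
```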

Regex match everything between special tag

I have the following string that I need to parse to get the values of anything inside the defined \$ tags.
For example, given the string
The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$
I want to parse whatever is in between the \$ tags, so that the result will contain both equations:
'f(x) = x^2'
'g(x) = x^(4/2) '
I tried something like re.compile(r'\\\$(.)*\\$'), but it didn't work.
You almost got it, just missing a backslash and a question mark (so it stops as soon as it finds the second \$ and doesn't match the longest string possible): r'\\\$(.*?)\\\$'
>>> import re
>>> pattern = r'\\\$(.*?)\\\$'
>>> data = r"The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"
>>> re.findall(pattern, data)
['f(x) = x^2', 'g(x) = x^(4/2) ']
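The effect of the question mark is easiest to see by running both variants on the same input. A short sketch (the variable names greedy and lazy are just for illustration):

```python
import re

data = r"The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"

# Greedy: .* runs to the last \$ in the string, giving one long match.
greedy = re.findall(r'\\\$(.*)\\\$', data)

# Lazy: .*? stops at the nearest closing \$, giving one match per equation.
lazy = re.findall(r'\\\$(.*?)\\\$', data)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern returns a single item spanning both equations, while the lazy one returns the two equations separately.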
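This regex also fits (written as a /.../g literal):
/\\\$.{0,}\\\$/g
/ - begin
\\\$ - escaped: \$
. - any character between
{0,} - at least 0 chars (any number of chars, actually)
\\\$ - escaped: \$
/ - end
g - global search
Note that .{0,} is greedy, so with several tagged expressions in one string it matches from the first \$ to the last one; .{0,}? stops at the nearest closing tag.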
This works:
import re
regex = r'\\\$(.*)\\\$'
r = re.compile(regex)
print(r.match(r"\$f(x) = x^2\$").group(1))
print(r.match(r"\$g(x) = x^(4/2) \$").group(1))

python prefix string with backslash

I am looking for a way to prefix strings in Python with a single backslash, e.g. "]" -> "\]". Since "\" is not a valid string in Python, the simple
mystring = '\' + mystring
won't work. What I am currently doing is something like this:
mystring = r'\###' + mystring
mystring = mystring.replace('###', '')
While this works most of the time, it is not elegant and can also cause problems for strings containing "###" or whatever the "filler" is set to. Is there a better way of doing this?
You need to escape the backslash with a second one, to make it a literal backslash:
mystring = "\\" + mystring
Otherwise it thinks you're trying to escape the ", which in turn means you have no quote to terminate the string.
Ordinarily, you can use raw string notation (r'string'), but that won't work when the backslash is the last character.
The difference between print(a) and just a:
>>> a = 'hello'
>>> a = '\\' + a
>>> a
'\\hello'
>>> print(a)
\hello
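The doubled backslash only appears in the source code and in the repr; the string itself holds a single backslash. A small sketch that makes this checkable (the helper name prefix_with_backslash is made up for illustration):

```python
def prefix_with_backslash(s):
    # '\\' in source code is a one-character string: a single backslash.
    return '\\' + s

prefixed = prefix_with_backslash(']')
print(prefixed)        # the characters the string actually contains
print(repr(prefixed))  # how Python would write it as a literal
print(len(prefixed))   # the backslash counts as one character
```

For the "]" example this prints \], then '\\]', then 2.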
Python strings have a feature called escape characters. These allow you to do special things inside a string, such as showing a quote (" or ') without closing the string you're typing.
See this table
So when you typed
mystring = '\' + mystring
the \' is an escaped apostrophe, meaning that your string now has an apostrophe in it, meaning it isn't actually closed, which you can see because the rest of that line is coloured.
To type a backslash, you must escape one, which is done like this:
>>> aBackSlash = '\\'
>>> print(aBackSlash)
\
You should escape the backslash as follows:
mystring = "\\" + mystring
This is because if you do '\' it will end up escaping the second quotation. Therefore to treat the backslash literally, you must escape it.
Examples
>>> s = 'hello'
>>> s = '\\' + s
>>> print(s)
\hello
Your case
>>> mystring = 'it actually does work'
>>> mystring = '\\' + mystring
>>> print(mystring)
\it actually does work
As a different way of approaching the problem, have you considered string formatting?
r'\%s' % mystring
or:
r'\{}'.format(mystring)

Extracting part of string in parenthesis using python

I have a csv file with a column with strings. Part of the string is in parentheses. I wish to move the part of string in parentheses to a different column and retain the rest of the string as it is.
For instance: I wish to convert:
LC(Carbamidomethyl)RLK
to
LCRLK Carbamidomethyl
Regex solution
If you only have one parentheses group in your string, you can use this regex:
>>> import re
>>> a = "LC(Carbamidomethyl)RLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK Carbamidomethyl'
>>> a = "LCRLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK' # works with no parentheses too
Regex decomposed:
(.*) #! capture the beginning of the string
\( # match the opening parenthesis
(.+) #! capture the content inside the parentheses
\) # match the closing parenthesis
(.*) #! capture everything after
---------------
\g<1>\g<3> \g<2> # write the captures back in the desired order
String manipulation solution
A faster solution for huge data sets is:
begin, end = a.find('('), a.find(')')
if begin != -1 and end != -1:
    a = a[:begin] + a[end+1:] + " " + a[begin+1:end]
The process is to get the positions of parentheses (if there's any) and cut the string where we want. Then, we concatenate the result.
Performance of each method
It's clear that the string manipulation is the fastest method:
>>> timeit.timeit("re.sub('(.*)\((.+)\)(.*)', '\g<1>\g<3> \g<2>', a)", setup="a = 'LC(Carbadidomethyl)RLK'; import re")
15.214869976043701
>>> timeit.timeit("begin, end = a.find('('), a.find(')') ; b = a[:begin] + a[end+1:] + ' ' + a[begin+1:end]", setup="a = 'LC(Carbamidomethyl)RL'")
1.44008207321167
Multi parentheses set
When a string contains several parenthesized groups, the same steps can be repeated in a loop:
>>> a = "DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK"
>>> while True:
...     begin, end = a.find('('), a.find(')')
...     if begin != -1 and end != -1:
...         a = a[:begin] + a[end+1:] + " " + a[begin+1:end]
...     else:
...         break
...
>>> a
'DRCKPVNTFVHESLADVQAVCSQKNVACK Carbamidomethyl Carbamidomethyl'
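For the multi-group case, a regex-based variant can collect every parenthesized part in one pass and keep the two results separate, which may be convenient when they go into different CSV columns. A sketch only; the function name split_modifications is hypothetical, and it assumes the parentheses are not nested.

```python
import re

def split_modifications(peptide):
    """Return (sequence without parentheses, list of parenthesized parts)."""
    # Assumption: groups are not nested, so [^)]* captures one whole group.
    mods = re.findall(r'\(([^)]*)\)', peptide)
    bare = re.sub(r'\([^)]*\)', '', peptide)
    return bare, mods

bare, mods = split_modifications(
    'DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK')
print(bare)
print(' '.join(mods))
```

Joining bare and the items of mods with spaces reproduces the single-string format shown above.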
