Compare & manipulate strings with python - python

I've written an XML parser in Python and have just added functionality to read a further script from a different directory.
I've got two args, first is the path where I'm parsing XML. Second is a string in another XML file which I want to match with the first path;
arg1 = \work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator
path = calculators/2012/example/calculator
How can I compare the two strings to identify that they both reference the same thing? Also, how can I strip calculator from either string so I can store that & use it?
edit
Just had a thought. I have used a Regex to get the year out of the path already with year = re.findall(r"\.(\d{4})\.", path) following a problem Python has with numbers when converting the path to an import statement.
I could obviously split the strings and use a regex to match the path as a pattern in arg1 but this seems a long way round. Surely there's a better method?

Here I am assuming you are actually talking about strings, and not file paths, for which @mgilson's suggestion is better.
How can I compare the two strings to identify that they're both
referencing the same thing
Well, first you need to define what you mean by "the same thing".
At first glance it seems that if the first string ends with the second string once its forward slashes are replaced by backslashes, you have a match.
arg1 = r'\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator'
arg2 = r'calculators/2012/example/calculator'
>>> arg1.endswith(arg2.replace('/','\\'))
True
and also, how can I strip calculator from
either string so I can store that & use it?
You also need to decide if you want to strip the first calculator, the last calculator or any occurrence of calculator in the string.
If you just want to extract the last component after the separator, then it's simply:
>>> arg2.split('/')[-1]
'calculator'
Now to get the original string back, without the last bit:
>>> '/'.join(arg2.split('/')[:-1])
'calculators/2012/example'

check out os.path.samefile:
http://docs.python.org/library/os.path.html#os.path.samefile
and os.path.dirname:
http://docs.python.org/library/os.path.html#os.path.dirname
or maybe os.path.basename (I'm not sure what part of the string you want to keep).
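A rough sketch of how these suggestions might apply to the strings in the question. Note that os.path.samefile needs both paths to exist on disk, so for plain strings a comparison of path components may be closer to what is wanted. The helper name same_tail is hypothetical, and pathlib's pure path classes are an assumption of this sketch, chosen so the backslash-separated string is parsed the same way on any platform.

```python
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

arg1 = r"\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator"
path = "calculators/2012/example/calculator"


def same_tail(full, tail):
    """Return True if the components of `tail` are the last components of `full`.

    `same_tail` is a hypothetical helper, not part of the standard library.
    """
    full_parts = PureWindowsPath(full).parts   # backslash-separated string
    tail_parts = PurePosixPath(tail).parts     # forward-slash-separated string
    if not tail_parts:
        return False
    return full_parts[-len(tail_parts):] == tail_parts


print(same_tail(arg1, path))      # do both strings end in the same components?
print(posixpath.basename(path))   # last component of the path string
print(posixpath.dirname(path))    # everything before the last component
```

posixpath is used instead of os.path here only so that the forward-slash string is split the same way regardless of the operating system.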

Here, try this:
arg1 = r"\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator"
path = "calculators/2012/example/calculator"
# Normalise both strings to backslashes before comparing
arg1 = arg1.replace("/", "\\")
path = path.replace("/", "\\")
if arg1.endswith(path) or path.endswith(arg1):
    print("Match")
That should work for your needs. Cheers :)

Related

Is there a way to strip the end of a string until a certain character is reached?

I'm working on a side project for myself and have stumbled on an issue that I'm not sure how to solve. I have a URL, for argument's sake let's say https://stackoverflow.com/xyz/abc. I'm attempting to strip the end of the URL so that I am only left with https://stackoverflow.com/xyz/.
Initially I tried to use the strip function and specify a length/position to remove up to, but realized that the other URLs I'm working with are not the same length (i.e. URL 1 = /xyz/abc, URL 2 = /xyz/abcd).
Is there any advice for achieving this? I looked into using the regular expression operations in Python, but was unsure how to apply them to this use case. Ideally I would like to write a function that starts from the end of the string and strips away all characters until the first '/' is reached. Any advice would be appreciated.
Thanks
Why not just use rfind, which starts from the end?
>>> string = 'https://stackoverflow.com/xyz/abc'
>>> string = string[:string.rfind('/')+1]
>>> print(string)
https://stackoverflow.com/xyz/
And if you don't want the character either (the / in this case), simply remove the +1.
Keep in mind however that this only works if the string actually contains the character you are looking for.
If you want to protect against this, you will have to use the following:
string = 'https://stackoverflow.com/xyz/abc'
idx = string.rfind('/')
if idx != -1:
    string = string[:idx+1]
Unless, obviously, you do want to end up with an empty string in case the character is not found.
Then the first example works just fine.
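Since the question asks for a function, here is a minimal sketch of one. It swaps rfind for str.rpartition, which splits at the last occurrence of the separator; the function name strip_to_last_slash and the choice to return the input unchanged when no separator is present are assumptions of this sketch.

```python
def strip_to_last_slash(url, sep="/"):
    """Drop everything after the last `sep`, keeping the separator itself.

    Hypothetical helper: if `sep` does not occur, the input is returned unchanged.
    """
    head, found, _tail = url.rpartition(sep)
    # rpartition returns ('', '', url) when sep is missing, so `found` is empty
    return head + found if found else url


print(strip_to_last_slash("https://stackoverflow.com/xyz/abc"))
print(strip_to_last_slash("https://stackoverflow.com/xyz/abcd"))
```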
If you don't want to use regex, you can combine split() and join().
lol = 'https://stackoverflow.com/xyz/abc'
splt= lol.split('/')[:-1]
'/'.join(splt)
output
'https://stackoverflow.com/xyz'

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so even if i have s3://tester/folder/anotherone/test.pdf I am getting the entire path after s3://tester/
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But I get an error saying that the index is out of range.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/', path, maxsplit=3)[-1])
Output:
test.pdf
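For context on the error in the question: str.partition always returns a 3-tuple (before, separator, after), so index 3 can never exist. A small sketch of the split approach wrapped in a function follows; the name after_nth_slash and the ValueError for inputs with too few slashes are assumptions of this sketch.

```python
def after_nth_slash(s, n=3):
    """Return everything after the n-th '/' in `s` (hypothetical helper)."""
    parts = s.split("/", n)          # at most n splits, so at most n + 1 pieces
    if len(parts) <= n:
        raise ValueError("expected at least %d '/' characters in %r" % (n, s))
    return parts[n]


# partition only ever yields three items, which is why [3] raised IndexError
print(len("s3://tester/test.pdf".partition("/")))

print(after_nth_slash("s3://tester/test.pdf"))
print(after_nth_slash("s3://tester/folder/anotherone/test.pdf"))
```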
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
The re approach below should work whatever the bucket name is, returning everything after the third / in the string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]
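As an alternative to counting slashes, the standard library's URL parser can separate the bucket from the rest. This is a swapped-in technique rather than any of the answers above, and the helper name split_s3_uri is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key...' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    # netloc holds the part between '//' and the next '/', path holds the rest
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://tester/folder/anotherone/test.pdf")
print(bucket)
print(key)
```

Because the bucket is taken from the parsed URL rather than hard-coded, this does not depend on the bucket name staying the same.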

How to extract string between key substring and "/" with regex?

I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, and subject_ID is always there. I am trying to write a regex that extracts only the file part of the string. The ?subject_ID part is always present, but I don't know how to safely get the file.
My current regex looks like (.*[\/]).*\?_subject_ID
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile('(.*[\/]).*\?_subject_ID')
file_re.search(url)
This finds the right string, but I still can't extract the file name.
Printing _.group(1) gets me /path/to/. What's the next step that gets me the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
You may use the non-regex approach here, here is a snippet showing how to leverage urlparse and os.path to parse the URL like yours:
from urllib.parse import urlparse  # Python 3; on Python 2 use "from urlparse import urlparse"
import os.path
path = urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
It's pretty simple, really. Just match a / before and ?_subject_ID after:
([^/?]*)\?_subject_ID
The [^/?]* (as opposed to .*) is there because otherwise it would match the part before the slash, too. The ? inside the character class is a literal question mark, so it needs no escaping.
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*)/([^/?]*)\?_subject_ID
It's basically the same as the one before, but with the first bit captured instead of ignored.
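A short runnable sketch of the two-group idea, using the URL from the question. The exact pattern below is an assumption of this sketch: it matches the literal `?_subject_ID` as it appears in the question's URL, and it puts an explicit `/` between the two groups so the directory and file name land in separate captures.

```python
import re

url = "/path/to/file?_subject_ID_SOMEOTHERSTRING"

# Group 1: everything up to the last '/', group 2: the name before '?_subject_ID'
pattern = re.compile(r"([^?]*)/([^/?]*)\?_subject_ID")

match = pattern.search(url)
if match:
    directory, filename = match.groups()
    print(directory)
    print(filename)
```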

Iterating through python string array gives unexpected output

I was debugging some Python code and, like any beginner, I'm using print statements. I narrowed down the problem to:
paths = ("../somepath") #is this not how you declare an array/list?
for path in paths:
    print(path)
I was expecting the whole string to be printed out, but only . is. Since I planned on expanding it anyway to cover more paths, it appears that
paths = ("../somepath", "../someotherpath")
fixes the problem and correctly prints out both strings.
I'm assuming the initial version treats the string as an array of characters (or maybe that's just the C++ in me talking) and just prints out the characters?
I'd still like to know why this happens.
("../somepath")
is nothing but a string wrapped in parentheses. So, it is the same as "../somepath". Since Python's for loop can iterate over any iterable, and a string happens to be an iterable, it prints one character at a time.
To create a tuple with one element, put a comma at the end:
("../somepath",)
If you want to create a list, you need to use square brackets, like this
["../somepath"]
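The difference can be checked directly. This is a small illustrative sketch; the variable names are made up for the example.

```python
just_a_string = ("../somepath")      # parentheses alone do not make a tuple
one_item_tuple = ("../somepath",)    # the trailing comma does
one_item_list = ["../somepath"]

print(type(just_a_string).__name__)
print(type(one_item_tuple).__name__)
print(type(one_item_list).__name__)

# Iterating the plain string yields single characters
first_three = [ch for ch in just_a_string][:3]
print(first_three)

# Iterating the tuple or the list yields the whole path once
for path in one_item_tuple:
    print(path)
```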
paths = ["../somepath","abc"]
This way you can create a list. Now your code will work.
paths = ("../somepath", "../someotherpath") worked because it formed a tuple, which is essentially an immutable list.
I tested it, and the output is one character per line,
so the string is printed character by character.
To get what you want you need
paths = ['../somepath']
for path in paths:
    print(path)

Print raw string from variable? (not getting the answers)

I'm trying to find a way to print a string in raw form from a variable. For instance, if I add an environment variable to Windows for a path, which might look like 'C:\\Windows\Users\alexb\', I know I can do:
print(r'C:\\Windows\Users\alexb\')
But I can't put an r in front of a variable. For instance:
test = 'C:\\Windows\Users\alexb\'
print(rtest)
Clearly would just try to print rtest.
I also know there's
test = 'C:\\Windows\Users\alexb\'
print(repr(test))
But this returns 'C:\\Windows\\Users\x07lexb'
as does
test = 'C:\\Windows\Users\alexb\'
print(test.encode('string-escape'))
So I'm wondering if there's any elegant way to make a variable holding that path print RAW, still using test? It would be nice if it was just
print(raw(test))
But it's not.
I had a similar problem and stumbled upon this question, and now know, thanks to Nick Olson-Harris' answer, that the solution lies with changing the string.
Two ways of solving it:
Get the path you want using native python functions, e.g.:
test = os.getcwd() # In case the path in question is your current directory
print(repr(test))
This makes it platform independent and it now works with .encode. If this is an option for you, it's the more elegant solution.
If your string is not a path, define it in a way compatible with python strings, in this case by escaping your backslashes:
test = 'C:\\Windows\\Users\\alexb\\'
print(repr(test))
In general, to make a raw string out of a string variable, I use this:
string = "C:\\Windows\Users\alexb"
raw_string = r"{}".format(string)
output:
'C:\\\\Windows\\Users\\alexb'
You can't turn an existing string "raw". The r prefix on literals is understood by the parser; it tells it to ignore escape sequences in the string. However, once a string literal has been parsed, there's no difference between a raw string and a "regular" one. If you have a string that contains a newline, for instance, there's no way to tell at runtime whether that newline came from the escape sequence \n, from a literal newline in a triple-quoted string (perhaps even a raw one!), from calling chr(10), by reading it from a file, or whatever else you might be able to come up with. The actual string object constructed from any of those methods looks the same.
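A small sketch that illustrates this point: the r prefix only affects how a literal is read, so the resulting string objects are indistinguishable, and passing an existing string through a raw format string leaves it unchanged. The variable names are illustrative only.

```python
# A raw literal and an escaped literal produce the same two-character string
raw_literal = r"\n"
escaped_literal = "\\n"
print(raw_literal == escaped_literal, len(raw_literal))

# A newline is a newline, however it was produced
print("\n" == chr(10))

# Formatting through a raw *literal* does nothing to the value being inserted
s = "C:\\Windows\\Users\\alexb"
print((r"%s" % s) == s)
print(r"{}".format(s) == s)
```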
I know I'm too late with this answer, but for people reading this, I found a much easier way of doing it:
myVariable = 'This string is supposed to be raw \\'
print(r'%s' %myVariable)
Try this, depending on what type of output you want. Sometimes you may not need the single quotes around the printed string.
test = "qweqwe\n1212as\t121\\2asas"
print(repr(test)) # output: 'qweqwe\n1212as\t121\\2asas'
print( repr(test).strip("'")) # output: qweqwe\n1212as\t121\\2asas
Get rid of the escape characters before storing or manipulating the raw string:
You could change any backslashes of the path '\' to forward slashes '/' before storing them in a variable. The forward slashes don't need to be escaped:
>>> import os
>>> mypath = os.getcwd().replace('\\','/')
>>> os.path.exists(mypath)
True
Just simply use r'string'. Hope this will help you as I see you haven't got your expected answer yet:
test = 'C:\\Windows\\Users\\alexb\\'
rawtest = r'%s' %test
I have a variable assigned to a big, complex pattern string for use with the re module. It is concatenated with a few other strings, and in the end I want to print it, then copy it and check it on regex101.com.
But when I print it in interactive mode I get a double backslash: '\\w'
As @Jimmynoarms said:
The solution for Python 3.x:
print(r'%s' % your_variable_pattern_str)
Your particular string won't work as typed because the backslash at the end escapes the closing quotation mark, so the string never closes.
Maybe I'm wrong on that one, because I'm still very new to Python (if so, please correct me), but after changing it slightly to adjust for that, the repr() function will do the job of reproducing any string stored in a variable as a raw string.
You can do it two ways:
>>> print("C:\\Windows\\Users\\alexb\\")
C:\Windows\Users\alexb\
>>> print(r"C:\\Windows\Users\alexb\\")
C:\\Windows\Users\alexb\\
Store it in a variable:
test = "C:\\Windows\\Users\\alexb\\"
Use repr():
>>> print(repr(test))
'C:\\Windows\\Users\\alexb\\'
or string replacement with %r
print("%r" % test)
'C:\\Windows\\Users\\alexb\\'
The string will be reproduced with single quotes though, so you would need to strip those off afterwards.
To turn a variable to raw str, just use
rf"{var}"
r is raw and f is f-str; put them together and boom it works.
Replace each backslash with a forward slash:
re.sub(r"\\", "/", x)
This does the trick
>>> repr(string)[1:-1]
Here is the proof
>>> repr("\n")[1:-1] == r"\n"
True
And it can be easily extrapolated into a function if need be
>>> raw = lambda string: repr(string)[1:-1]
>>> raw("\n")
'\\n'
I wrote a small function, but it works for me:
def conv(strng):
    k=strng
    k=k.replace('\a','\\a')
    k=k.replace('\b','\\b')
    k=k.replace('\f','\\f')
    k=k.replace('\n','\\n')
    k=k.replace('\r','\\r')
    k=k.replace('\t','\\t')
    k=k.replace('\v','\\v')
    return k
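For comparison, Python 3 ships a unicode_escape codec that performs a similar conversion without listing each character by hand. This is a sketch of an alternative, not a drop-in replacement for the function above: it also doubles existing backslashes and writes characters such as the bell as \x07 rather than \a. The helper name show_escapes is hypothetical.

```python
def show_escapes(text):
    """Return `text` with control characters written as escape sequences.

    Hypothetical helper built on the standard unicode_escape codec (Python 3).
    """
    return text.encode("unicode_escape").decode("ascii")


print(show_escapes("a\tb\n"))
print(show_escapes("C:\\Windows"))
print(show_escapes("\a"))
```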
Here is a straightforward solution.
address = r'C:\Windows\Users\local'
directory = "r'" + address + "'"
print(directory)
r'C:\Windows\Users\local'
