how to handle '../' in python? - python

i need to strip ../something/ from a url
eg. strip ../first/ from ../first/bit/of/the/url.html where first can be anything.
what's the best way to achieve this?
thanks :)

You can simply split the path twice at the official path separator (os.sep, and not '/') and take the last bit:
>>> s = "../first/bit/of/the/path.html"
>>> s.split(os.sep, 2)[-1]
'bit/of/the/path.html'
This is also more efficient than splitting the path completely and stringing it back together.
Note that this code does not complain when the path contains fewer than 3+ path elements (for instance, 'file.html' yields 'file.html'). If you want the code to raise an exception if the path is not of the expected form, you can just ask for its third element (which is not present for paths that are too short):
>>> s.split(os.sep, 2)[2]
This can help detect some subtle errors.

EOL has given a nice and clean approach however I could not resist giving a regex alternative to it:)
>>> import re
>>> m=re.search('^(\.{2}\/\w+/)(.*)$','../first/bit/of/the/path.html')
>>> m.group(1)
'../first/'

Related

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so even if i have s3://tester/folder/anotherone/test.pdf I am getting the entire path after s3://tester/
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But i get an error saying that it out of index.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
Below re method should work no matter S3 variable, returning all after third / in string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code getting to my goal, but seems like I'm doing it the hard way.
Is there is a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lyrics.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
you can use Path from pathlib to extract the final path component, without its suffix:
from path import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
outout:
'lead'
As #wwii said in its comment, you should use os.path.splitext which is especially designed to separate filenames from their extension and str.split/str.rsplit which are especially designed to cut strings at a character. Using thoses functions there is several ways to achieve what you want.
Unlike #wwii, I would start by discarding the extension:
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, with the maxsplit argument or selecting the last (or the second index) of the resulting list (according to what method have been used). Every following line are equivalent (in term of functionality at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as previous line except it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # split only one time from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as previous line except it select the second chunks (which is the last since only one split occured)
The best is probably one of the two last solutions since it will not do useless splits.
Why is this answer better than others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a filename from its extension, os.path.splitext could be more efficient.
Using a slice with rfind works but is does not clearly express the code intention and it is not so readable.
Using endswith('.pdf') is OK if you are sure you will never use anything else than PDF. If one day you use a .txt, you will have to rework your code.
I love regex but in this case it suffers from the same caveheats than the 2 two previously discussed solutions: no clear intention, not very readable and you will have to rework it if one day you use an other extension.
Using splitext clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far more clear than a arbitrary looking slice.
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
But, no error checking in here. Will fail if there is no "_" in the string.
You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search('_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

How to extract string between key substring and "/" with regex?

I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
the path/to/file part changes depends on situation, and subject_ID is always there. I try to write a regex that extract only file part of the string. Using ?subject_ID is definite, but I don't know how to safely get the file
My current regex looks like (.*[\/]).*\?_subject_ID
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile('(.*[\/]).*\?_subject_ID')
file_re.search(url)
this will find the right string, but I still can't extract the file name
printing _.group(1) will get me /path/to/. What's the next step that gets me the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
You may use the non-regex approach here, here is a snippet showing how to leverage urlparse and os.path to parse the URL like yours:
import urlparse
path = urlparse.urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
import os.path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
See the IDEONE demo
It's pretty simple, really. Just match a / before and ?subject_ID after:
([^/?]*)\?subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*)([^/?]*)\?subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.

How to extract a substring from inside a string?

Let's say I have a string /Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherthing I want to extract just the '0-1-2-3-4-5' part. I tried this:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print str[str.find("-")-1:str.find("-")]
But, the result is only 0. How to extract just the '0-1-2-3-4-5' part?
Use os.path.basename and rsplit:
>>> from os.path import basename
>>> name = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> number, tail = basename(name).rsplit('-', 1)
>>> number
'0-1-2-3-4-5'
You're almost there:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print str[str.find("-")-1:str.rfind("-")]
rfind will search from the end. This assumes that no dashes appear anywhere else in the path. If it can, do this instead:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
str = os.path.basename(str)
print str[str.find("-")-1:str.rfind("-")]
basename will grab the filename, excluding the rest of the path. That's probably what you want.
Edit:
As pointed out by #bradley.ayers, this breaks down in the case where the filename isn't exactly described in the question. Since we're using basename, we can omit the beginning index:
print str[:str.rfind("-")]
This would parse '/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing' as '10-1-2-3-4-5'.
This works:
>>> str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> str.split('/')[-1].rsplit('-', 1)[0]
'0-1-2-3-4-5'
Assuming that what you want is just what's between the last '/' and the last '-'. The other suggestions with os.path might make better sense (as long as there is no OS confusion over what a a proper path looks like)
you could use re:
>>> import re
>>> ss = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> re.search(r'(?:\d-)+\d',ss).group(0)
'0-1-2-3-4-5'
While slightly more complicated, it seems like a solution similar to this might be slightly more robust...

split twice in the same expression?

Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# that instruction give me hello.txt
Name = inFile.name.split("/") [-1]
# that one give me the name I want - just hello
Name1 = Name.split(".") [0]
Is there any chance to simplify that doing the same job in just one expression?
You can get what you want platform independently by using os.path.basename to get the last part of a path and then use os.path.splitext to get the filename without extension.
from os.path import basename, splitext
pathname = "/adda/adas/sdas/hello.txt"
name, extension = splitext(basename(pathname))
print name # --> "hello"
Using os.path.basename and os.path.splitext instead of str.split, or re.split is more proper (and therefore received more points then any other answer) because it does not break down on other platforms that use different path separators (you would be surprised how varried this can be).
It also carries most points because it answers your question for "one line" precisely and is aesthetically more pleasing then your example (even though that is debatable as are all questions of taste)
Answering the question in the topic rather than trying to analyze the example...
You really want to use Florians solution if you want to split paths, but if you promise not to use this for path parsing...
You can use re.split() to split using several separators by or:ing them with a '|', have a look at this:
import re
inFile = "/adda/adas/sdas/hello.txt"
print re.split('\.|/', inFile)[-2]
>>> inFile = "/adda/adas/sdas/hello.txt"
>>> inFile.split('/')[-1]
'hello.txt'
>>> inFile.split('/')[-1].split('.')[0]
'hello'
if it is always going to be a path like the above you can use os.path.split and os.path.splitext
The following example will print just the hello
from os.path import split, splitext
path = "/adda/adas/sdas/hello.txt"
print splitext(split(path)[1])[0]
For more info see https://docs.python.org/library/os.path.html
I'm pretty sure some Regex-Ninja*, would give you a more or less sane way to do that (or as I now see others have posted: ways to write two expressions on one line...)
But I'm wondering why you want to do split it with just one expression?
For such a simple split, it's probably faster to do two than to create some advanced either-or logic. If you split twice it's safer too:
I guess you want to separate the path, the file name and the file extension, if you split on '/' first you know the filename should be in the last array index, then you can try to split just the last index to see if you can find the file extension or not. Then you don't need to care if ther is dots in the path names.
*(Any sane users of regular expressions, should not be offended. ;)

Categories