How to extract a substring from inside a string? - python

Let's say I have a string /Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing and I want to extract just the '0-1-2-3-4-5' part. I tried this:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print str[str.find("-")-1:str.find("-")]
But, the result is only 0. How to extract just the '0-1-2-3-4-5' part?

Use os.path.basename and rsplit:
>>> from os.path import basename
>>> name = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> number, tail = basename(name).rsplit('-', 1)
>>> number
'0-1-2-3-4-5'

You're almost there:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print str[str.find("-")-1:str.rfind("-")]
rfind will search from the end. This assumes that no dashes appear anywhere else in the path. If they can, do this instead:
import os
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
str = os.path.basename(str)
print str[str.find("-")-1:str.rfind("-")]
basename will grab the filename, excluding the rest of the path. That's probably what you want.
Edit:
As pointed out by @bradley.ayers, this breaks down when the filename isn't exactly as described in the question. Since we're using basename, we can omit the beginning index:
print str[:str.rfind("-")]
This would parse '/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing' as '10-1-2-3-4-5'.
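A minimal sketch of the same idea in Python 3 syntax, assuming the wanted part is everything in the filename before its last dash. The function name extract_prefix is made up for illustration, and rpartition stands in for the rfind slice so that a filename without any dash is reported instead of being silently truncated:

```python
import os.path


def extract_prefix(path):
    """Return the part of the filename that precedes its last '-'.

    Hypothetical helper; raises ValueError if the filename has no dash.
    """
    filename = os.path.basename(path)            # drop the directory part
    head, sep, _tail = filename.rpartition('-')  # split at the last dash
    if not sep:
        raise ValueError('no dash in filename: %r' % filename)
    return head


print(extract_prefix('/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'))
print(extract_prefix('/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing'))
```

The variable is also no longer called str, so the built-in type stays usable.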

This works:
>>> str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> str.split('/')[-1].rsplit('-', 1)[0]
'0-1-2-3-4-5'
This assumes that what you want is just what's between the last '/' and the last '-'. The other suggestions with os.path might make better sense (as long as there is no OS confusion over what a proper path looks like).

You could use re:
>>> import re
>>> ss = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> re.search(r'(?:\d-)+\d',ss).group(0)
'0-1-2-3-4-5'
While slightly more complicated, a regex like this may be more robust, since it matches the digit-dash sequence itself rather than relying on where it sits in the string.
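One caveat: the pattern above matches single digits only, so a name starting with 10-1-2 would come back as 0-1-2. A possible variation, assuming the numbers may have several digits and live in the filename rather than in the directories (the names NUMBER_RUN and find_number_run are invented for this sketch):

```python
import os.path
import re

# Hypothetical pattern: one or more digits, then any number of "-digits" groups.
NUMBER_RUN = re.compile(r'\d+(?:-\d+)*')


def find_number_run(path):
    """Return the first dash-separated run of numbers in the filename, or None."""
    match = NUMBER_RUN.search(os.path.basename(path))
    return match.group(0) if match else None


print(find_number_run('/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing'))
```

Searching only the basename keeps digits in directory names such as Apath1 from matching first.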

Related

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to strip s3://tester/, so that even if I have s3://tester/folder/anotherone/test.pdf I get the entire path after s3://tester/.
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But I get an error saying that the index is out of range.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
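With [-1], a string that has fewer than three slashes still returns something (for 's3://tester' it returns 'tester'). A small sketch that checks for that case, assuming "everything after the third '/'" is the rule; after_third_slash is a made-up name:

```python
def after_third_slash(url):
    """Return everything after the third '/' in url (hypothetical helper)."""
    parts = url.split('/', 3)  # at most 3 splits, so at most 4 pieces
    if len(parts) < 4:
        raise ValueError('expected at least three slashes in %r' % url)
    return parts[3]


print(after_third_slash('s3://tester/test.pdf'))
print(after_third_slash('s3://other-bucket/folder/anotherone/test.pdf'))
```

Because maxsplit is 3, any further slashes stay inside the last piece, so nested folders are kept intact.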
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
The re method below should work whatever the bucket name is, returning everything after the third / in the string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]
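Not from the answers above, but as a possible alternative in Python 3: the standard library's urllib.parse.urlsplit treats the part after // as the network location, so the bucket name never needs to be known. This is a sketch under the assumption that the key contains no '?' or '#', which urlsplit would split off as query and fragment; object_key is a hypothetical name:

```python
from urllib.parse import urlsplit


def object_key(url):
    """Return the path part of the URL without its leading slash."""
    parts = urlsplit(url)           # e.g. scheme='s3', netloc='tester'
    return parts.path.lstrip('/')   # '/folder/test.pdf' -> 'folder/test.pdf'


print(object_key('s3://tester/folder/anotherone/test.pdf'))
```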

How to check if a given pathname matches a given regular expression in Python

I'm trying to figure a way out to compare each directory path against a given regular expression to find out if it matches that pattern or not.
I have the following list of paths
C:\Dir
C:\Dir\data
C:\Dir\data\file1
C:\Dir\data\file2
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
I only want to print those paths that match the following pattern:
C:\Dir\*\match1
where "*" can replace zero or more directory levels and match1 can be either the name of a file or directory.
I figured out that re.match() would help me out with this but I'm having a hard time trying to figure out how to define the pattern and the one I came up with (pasted below) doesn't work at all. item will contain the path in quotes
re.match("((C:\\)(Dir)\\(.*)\\(match1))",item)
Can someone please help me out with this task ?
You could go for:
^C:\\Dir\\.+?match1.*
See a demo on regex101.com.
In Python, this would be:
import re
rx = re.compile(r'C:\\Dir\\.+?match1.*')
files = [r'C:\Dir', r'C:\Dir\data', r'C:\Dir\data\file1', r'C:\Dir\data\file2', r'C:\Dir\data\match1\file1', r'C:\Dir\data\match1\file2']
filtered = [match.group(0)
            for file in files
            for match in [rx.match(file)]
            if match]
print(filtered)
Or, if you like filter() with a lambda:
filtered = list(filter(lambda x: rx.match(x), files))
Your regexp is:
^C:\\Dir\\.*match1
Explanation:
C:\\Dir\\ is the literal start of your path
.* matches any other characters in the path
match1 is the explicit name of whatever follows (file or dir)
Since I don't yet have the reputation to comment, I'll remark here.
The solution proposed by @Jan works for the particular list of paths in question, but has a few problems if applied as a general solution. If the list of paths is as follows:
>>> print paths
C:\Dir
C:\Dir\data
C:\Dir\match1
C:\Dir\data\file1
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
C:\Dir\data\abcmatch1def\file3
C:\Dir\data\file1\match12
C:\Dir\data\file1\match1
>>>
the pattern r'C:\\Dir\\.+?match1.*' fails to match "C:\Dir\match1" and produces false positives, namely "C:\Dir\data\abcmatch1def\file3" and "C:\Dir\data\file1\match12".
Proposed solution:
>>> import re
>>> for line in paths.splitlines():
...     if re.match(r"C:\\Dir.*\\match1(\\|$)", line):
...         print line
...
C:\Dir\match1
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
C:\Dir\data\file1\match1
>>>
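The same check as a Python 3 script, with one tightening that is my own variation rather than part of the answer: the middle of the pattern only accepts whole directory levels, so a sibling such as C:\Directory\match1 (added to the list for illustration) is rejected, whereas Dir.* would accept it.

```python
import re

paths = [
    r'C:\Dir',
    r'C:\Dir\data',
    r'C:\Dir\match1',
    r'C:\Dir\data\file1',
    r'C:\Dir\data\match1\file1',
    r'C:\Dir\data\match1\file2',
    r'C:\Dir\data\abcmatch1def\file3',
    r'C:\Dir\data\file1\match12',
    r'C:\Dir\data\file1\match1',
    r'C:\Directory\match1',  # extra case, not in the original list
]

# "C:\Dir", then zero or more whole directory levels, then a component
# named exactly "match1" that either ends the path or is followed by "\".
pattern = re.compile(r'C:\\Dir(?:\\[^\\]+)*\\match1(?:\\|$)')

matching = [p for p in paths if pattern.match(p)]
for p in matching:
    print(p)
```

Windows paths are usually case-insensitive, so passing re.IGNORECASE to re.compile may be appropriate depending on the data.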

How to use regex in python in getting a string between two characters?

I have this as my input
content = 'abc.zip'\n
I want to take abc out of it. How do I do that using regex in Python?
Edit:
No, this is not a homework question. I am trying to automate something and I am stuck at a certain point; I want to make the automation generic for any zip file I have.
os.system('python unzip.py -z data/ABC.zip -o data/')
After I take in the zip file, I unzip it.
I am planning to make it generic by getting the filename from the directory the zip file was put in and then passing that file name to the command above to unzip it.
As I implied in my comment, regular expressions are unlikely to be the best tool for the job (unless there is some artificial restriction on the problem, or it is far more complex than your example). The standard string and/or path libraries provide functions which should do what you are after. To better illustrate how these work, I'll use the following definition of content instead:
>>> content = 'abc.def.zip'
If it's a file, and you want the name and extension:
>>> import os.path
>>> filename, extension = os.path.splitext(content)
>>> print filename
abc.def
>>> print extension
.zip
If it is a string, and you want to remove the substring 'abc':
>>> noabc = content.replace('abc', '')
>>> print noabc
.def.zip
If you want to break it up on each occurrence of a period:
>>> broken = content.split('.')
>>> print broken
['abc', 'def', 'zip']
If it has multiple periods, and you want to break it on the first or last one:
>>> broken = content.split('.', 1)
>>> print broken
['abc', 'def.zip']
>>> broken = content.rsplit('.', 1)
>>> print broken
['abc.def', 'zip']
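For the use case in the question (getting the name of a zip file without directory or extension), the pieces above might be combined as in this sketch. It is written in Python 3 syntax, and zip_stem is a made-up helper name:

```python
import os.path


def zip_stem(path):
    """Return the filename without its directory or final extension."""
    stem, _extension = os.path.splitext(os.path.basename(path))
    return stem


print(zip_stem('data/ABC.zip'))
print(zip_stem('abc.def.zip'))
# Input read from a file or the console may carry a trailing newline,
# so strip it before handing the name on.
print(zip_stem('abc.zip\n'.strip()))
```

Only the last extension is removed, so abc.def.zip gives abc.def rather than abc.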
Edit: Changed the regexp to match for "content = 'abc.zip\n'" instead of the string "abc.zip".
import re
# Matching for "content = 'abc.zip\n'"
matches = re.match(r"content = '(?P<filename>.*)\.zip\n'$", "content = 'abc.zip\n'")
matches = matches.groupdict()
print(matches)
# Matching for "abc.zip"
matches = re.match(r"(?P<filename>.*)\.zip$", "abc.zip")
matches = matches.groupdict()
print(matches)
Output:
{'filename': 'abc'}
{'filename': 'abc'}
This will print the matches of everything before .zip. You can access everything like a regular dictionary.
If you're trying to break up parts of a path, you may find the os.path module to be useful. It has nice abstractions with clear semantics that are easy to use.

how to handle '../' in python?

I need to strip ../something/ from a URL,
e.g. strip ../first/ from ../first/bit/of/the/url.html, where first can be anything.
What's the best way to achieve this?
Thanks :)
You can simply split the path twice at the separator and take the last bit. For a URL the separator is always '/'; for a filesystem path, use os.sep instead:
>>> s = "../first/bit/of/the/path.html"
>>> s.split('/', 2)[-1]
'bit/of/the/path.html'
This is also more efficient than splitting the path completely and stringing it back together.
Note that this code does not complain when the path contains fewer than 3 path elements (for instance, 'file.html' yields 'file.html'). If you want the code to raise an exception when the path is not of the expected form, you can just ask for its third element (which is not present for paths that are too short):
>>> s.split('/', 2)[2]
This can help detect some subtle errors.
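Wrapped up as a function for the URL case from the question, this might look like the sketch below. The name strip_leading_segments and its count parameter are invented here, and a ValueError replaces the bare IndexError to make the failure easier to read:

```python
def strip_leading_segments(url, count=2):
    """Drop the first `count` '/'-separated segments of a relative URL."""
    parts = url.split('/', count)  # split only as often as needed
    if len(parts) <= count:
        raise ValueError('%r has fewer than %d leading segments' % (url, count))
    return parts[count]


print(strip_leading_segments('../first/bit/of/the/url.html'))
```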
EOL has given a nice and clean approach; however, I could not resist giving a regex alternative to it :)
>>> import re
>>> m = re.search(r'^(\.{2}/\w+/)(.*)$', '../first/bit/of/the/path.html')
>>> m.group(1)
'../first/'
>>> m.group(2)
'bit/of/the/path.html'

split twice in the same expression?

Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# this instruction gives me hello.txt
Name = inFile.split("/")[-1]
# this one gives me the name I want - just hello
Name1 = Name.split(".")[0]
Is there any way to simplify that and do the same job in just one expression?
You can get what you want platform independently by using os.path.basename to get the last part of a path and then use os.path.splitext to get the filename without extension.
from os.path import basename, splitext
pathname = "/adda/adas/sdas/hello.txt"
name, extension = splitext(basename(pathname))
print name # --> "hello"
Using os.path.basename and os.path.splitext instead of str.split or re.split is more proper (and therefore received more points than any other answer) because it does not break down on other platforms that use different path separators (you would be surprised how varied this can be).
It also carries most points because it answers your question for "one line" precisely and is aesthetically more pleasing than your example (even though that is debatable, as are all questions of taste).
Answering the question in the topic rather than trying to analyze the example...
You really want to use Florian's solution if you want to split paths, but if you promise not to use this for path parsing...
You can use re.split() to split on several separators by OR-ing them with a '|'. Have a look at this:
import re
inFile = "/adda/adas/sdas/hello.txt"
print re.split(r'\.|/', inFile)[-2]
>>> inFile = "/adda/adas/sdas/hello.txt"
>>> inFile.split('/')[-1]
'hello.txt'
>>> inFile.split('/')[-1].split('.')[0]
'hello'
If it is always going to be a path like the above, you can use os.path.split and os.path.splitext.
The following example will print just hello:
from os.path import split, splitext
path = "/adda/adas/sdas/hello.txt"
print splitext(split(path)[1])[0]
For more info see https://docs.python.org/library/os.path.html
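On Python 3.4 and later, the standard pathlib module offers the same result through attributes. This is an addition to the answers above rather than something they proposed; PurePosixPath is used so the sketch behaves the same on every platform, while pathlib.Path would be the usual choice for real files:

```python
from pathlib import PurePosixPath

path = PurePosixPath("/adda/adas/sdas/hello.txt")

print(path.name)    # last path component, with extension
print(path.stem)    # last path component, without its final extension
print(path.suffix)  # the final extension, including the dot
```

As with splitext, only the last extension is removed, so archive.tar.gz has the stem archive.tar.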
I'm pretty sure some Regex-Ninja* could give you a more or less sane way to do that (or, as I now see others have posted, ways to write two expressions on one line...).
But I'm wondering why you want to split it with just one expression?
For such a simple split, it's probably faster to do two splits than to create some advanced either-or logic. If you split twice it's safer too:
I guess you want to separate the path, the file name and the file extension. If you split on '/' first, you know the filename should be in the last array index; then you can try to split just that last element to see whether you can find the file extension or not. Then you don't need to care if there are dots in the path names.
*(Any sane users of regular expressions should not be offended. ;)
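To illustrate the point about dots in path names, here is a small comparison on a made-up path that has a dot in a directory name and no file extension. Splitting on both separators at once picks the wrong element, while splitting twice still finds the file name:

```python
import re

# Hypothetical input: dot in a directory name, file without extension.
path = "/adda/v1.2/sdas/hello"

# One expression: split on '.' and '/' together, take the second-to-last piece.
one_step = re.split(r'\.|/', path)[-2]

# Two steps: isolate the filename first, then look for an extension in it only.
filename = path.split('/')[-1]
two_step = filename.split('.')[0]

print(one_step)  # sdas
print(two_step)  # hello
```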
