Python windows path regex - python

I've spent the last two hours figuring this out. I have this string:
C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png
I am interested in getting \\Livier_11.png but it seems impossible for me. How can I do this?

I'd strongly recommend using Python's pathlib module. It's part of the standard library and is designed to handle file paths. Some examples:
>>> from pathlib import Path
>>> p = Path(r"C:\Users\Bob\.luxshop\jeans\diesel-qd\images\Livier_11.png")
>>> p
WindowsPath('C:/Users/Bob/.luxshop/jeans/diesel-qd/images/Livier_11.png')
>>> p.name
'Livier_11.png'
>>> p.parts
('C:\\', 'Users', 'Bob', '.luxshop', 'jeans', 'diesel-qd', 'images', 'Livier_11.png')
>>> # construct a path from parts
...
>>> Path(r"C:\some_folder", "subfolder", "file.txt")
WindowsPath('C:/some_folder/subfolder/file.txt')
>>> p.exists()
False
>>> p.is_file()
False
>>>
Edit:
If you want to use regex, this should work:
>>> s = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
>>> import re
>>> match = re.match(r".*(\\.*)$", s)
>>> match.group(1)
'\\Livier_11.png'
>>>
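One caveat with the pathlib session above: Path only gives you a WindowsPath when the code runs on Windows. A minimal sketch, assuming the code may also run on Linux or macOS, uses PureWindowsPath, which parses Windows-style paths on any platform without touching the file system. The variable names are illustrative.

```python
from pathlib import PureWindowsPath

raw = r"C:\Users\Bob\.luxshop\jeans\diesel-qd\images\Livier_11.png"

# PureWindowsPath understands backslash separators on every platform.
p = PureWindowsPath(raw)

file_name = p.name                 # last path component
with_separator = "\\" + p.name     # same, with the leading backslash from the question

print(file_name)
print(with_separator)
```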

You can use this
^.*(\\\\.*)$
Explanation
^ - Anchor to start of string.
.* - Matches any character except a newline, zero or more times (greedy).
(\\\\.*) - Capturing group. Matches \\ followed by any character except a newline, zero or more times.
$ - Anchor to end of string.
P.S. For this kind of task you should use the standard library instead of a regex.
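Whether the pattern needs \\\\ or \\ depends on whether the backslashes in the data are really doubled. A small sketch of the difference, assuming the two cases below (variable names are illustrative):

```python
import re

# Raw string: each backslash pair below is two real characters,
# which is the case the four-backslash pattern is written for.
doubled = r"C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
last_doubled = re.match(r"^.*(\\\\.*)$", doubled).group(1)

# Normal string literal: each escaped pair is ONE real backslash,
# so the regex only has to match a single backslash.
single = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
last_single = re.match(r"^.*(\\.*)$", single).group(1)

print(last_doubled)  # two backslashes, then the file name
print(last_single)   # one backslash, then the file name
```

In both cases the greedy .* consumes as much as it can, so the capturing group starts at the last separator.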

If you can clearly say that "\\" is a delimiter (does not appear in any string except to separate the strings) then you can say:
path = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
spl = path.split("\\")  # split the string on backslashes
your_wanted_string = spl[-1]  # the last element is the file name
Please note this is a very simple way to do it and not always the best way! If you need to do this often or if something important depends on it use a library!
If you are just learning to code then this is easier to understand.
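If only the last component is needed, a variation on the same idea is str.rpartition, which splits once at the last occurrence of the separator. This is a sketch under the same assumption that the backslash is only ever a delimiter:

```python
path = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"

# rpartition splits once, at the LAST occurrence of the separator,
# and returns (text before, separator, text after).
head, sep, tail = path.rpartition("\\")

print(tail)        # the file name
print(sep + tail)  # the file name with its leading backslash
```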

Related

Python Regex searching

I want to use regex to search in a file for this expression:
time:<float> s
I only want to get the float number.
I'm learning about regex, and this is what I did:
astr = 'lalala time:1.5 s\n'
p = re.compile(r'time:(\d+).*(\d+)')
m = p.search(astr)
Well, I get time:1.5 from m.group(0)
How can I directly just get 1.5 ?
I'm including some extra Python-specific material since you said you're learning regex. As already mentioned, the simplest regex for this would be \d+\.\d+, used with the various functions described below.
Something that threw me off with python initially was getting my head around the return types of various re methods and when to use group() vs. groups().
There are several methods you might use:
re.match()
re.search()
re.findall()
match() will only return an object if the pattern is found at the beginning of the string.
search() will find the first occurrence of the pattern and stop.
findall() will find everything in the string.
The return type for match() and search() is a match object, or None if a match isn't found. However, the return type for findall() is a list. These different return types have ramifications for how you get the values out of your match.
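The difference between the three functions and their return types can be sketched as follows. The sample text is made up for illustration:

```python
import re

text = 'abc 123 def 123'
pattern = re.compile(r'123')

at_start = pattern.match(text)       # None: '123' is not at the beginning
first = pattern.search(text)         # match object for the first '123'
everything = pattern.findall(text)   # list of every matched string

print(at_start)
print(first.group(), first.start())
print(everything)
```

Because match() and search() can return None, it is usually worth checking the result before calling group() on it.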
Both match and search expose the group() and groups() methods for retrieving your matches. But when using findall you'll want to iterate through your list or pull a value out by index. So using findall:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.findall(search_me)
>>> for match in matches: print(match)
...
123
If you're using search() or match(), you'll want to use .group() or groups() to retrieve your match depending on how you've set up your regular expression.
From the documentation, "The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are."
Therefore, if you have no groups in your regex, as shown in the following example, you get an empty tuple back:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.search(search_me)
>>> print(matches.groups())
()
Adding a "group" to your regular expression enables you to use this:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'(123)')
>>> matches = easy.search(search_me)
>>> print(matches.groups())
('123',)
You don't have to specify groups in your regex. group(0) or group() will return the entire match even if you don't have anything in parentheses in your expression; group() defaults to group(0).
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.search(search_me)
>>> print(matches.group(0))
123
If you are using parentheses, you can use group() to retrieve specific groups and subgroups.
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'((1)(2)(3))')
>>> matches = easy.search(search_me)
>>> print(matches.group(1))
123
>>> print(matches.group(2))
1
>>> print(matches.group(3))
2
>>> print(matches.group(4))
3
I'd like to point out as well that you don't have to compile your regex unless you care to for reasons of usability and/or readability. It usually won't noticeably improve performance, since the re module caches recently compiled patterns.
>>> import re
>>> search_me = '123abc'
>>> # easy = re.compile(r'123')
>>> # matches = easy.search(search_me)
>>> matches = re.search(r'123', search_me)
>>> print(matches.group())
123
Hope this helps! I found sites like debuggex helpful while learning regex. (Although sometimes you have to refresh those pages; I was banging my head for a couple hours one night before I realized that after reloading the page my regex worked just fine.) Lately I think you're served just as well by throwing sandbox code into something like wakari.io, or an IDE like PyCharm, etc., and observing the output. http://www.rexegg.com/ is also a good site for general regex knowledge.
You could create another group for that. I would also change the regex slightly to allow for numbers that don't have a decimal separator.
re.compile(r'time:((\d+)(\.?(\d+))?)')
Now you can use group(1) to capture the match of the floating point number.
I think the regex you actually want is something more like:
re.compile(r'time:(\d+\.\d+)')
or even:
re.compile(r'time:(\d+(?:\.\d+)?)') # This one will capture integers too.
Note that I've put the entire time into one group. I've also escaped the ., which otherwise matches any character in a regex.
Then, you'd get 1.5 from m.group(1) -- m.group(0) is the entire match. m.group(1) is the first submatch (parenthesized grouping), m.group(2) is the second grouping, etc.
example:
>>> import re
>>> p = re.compile(r'time:(\d+(?:\.\d+)?)')
>>> p.search('time:34')
<_sre.SRE_Match object at 0x10fa77d50>
>>> p.search('time:34').group(1)
'34'
>>> p.search('time:34.55').group(1)
'34.55'
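Putting the pieces together for the original question, here is a minimal sketch that shows why the first attempt split the number and how to get a float out. The log_text value is a hypothetical stand-in for the file contents:

```python
import re

astr = 'lalala time:1.5 s\n'

# Original attempt: '.' is unescaped, so '.*' swallows the decimal point
# and the number ends up split across two groups.
original = re.search(r'time:(\d+).*(\d+)', astr)
print(original.groups())

# Single group around the whole number; the fractional part is optional.
p = re.compile(r'time:(\d+(?:\.\d+)?)')
m = p.search(astr)
value = float(m.group(1)) if m else None
print(value)

# Hypothetical multi-line file contents: with exactly one capturing group,
# findall returns the captured strings directly.
log_text = 'run a time:1.5 s\nrun b time:20 s\n'
all_times = [float(t) for t in p.findall(log_text)]
print(all_times)
```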

Python Regular Expression searching backwards

I need to extract a string from a directory like this:
my_new_string = "C:\\Users\\User\\code\\Python\\final\\mega_1237665428090192022_cts.ascii"
ID = '1237665428090192022'
m = re.match(r'.*(\b\w+%s)(?<!.{%d})' % (ID, -1), my_new_string)
if m: print m.group(1)
I need to extract 'mega' from the above my_new_string. At the moment the above just gets mega_1237665428090192022 so how do I get it to ignore the ID number?
To be honest I don't understand how these expressions work, even after consulting documentation.
What does the r' do? And how does the ?<!.{%d} work?
edit: Thanks guys!
There are a couple of ways to do this, although I'm not sure you necessarily need a regex here. Here are some options:
>>> import os.path
>>> my_new_string = "C:\\Users\\User\\code\\Python\\final\\mega_1237665428090192022_cts.ascii"
>>> os.path.basename(my_new_string)
'mega_1237665428090192022_cts.ascii'
>>> basename = os.path.basename(my_new_string)
>>> basename.split('_')[0]
'mega'
>>> import re
>>> re.match(r'[A-Za-z]+', basename).group()
'mega'
I don't think you are looking for a negative lookahead assertion or a negative lookbehind assertion. If anything, you want to match if numbers DO follow. For example, something like this:
>>> re.match(r'.*?(?=[_\d])', basename).group()
'mega'
The r simply makes a raw string (so that you don't need to constantly escape backslashes, for example).
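A short sketch of what the r prefix changes. The pattern here is only an illustration, not one from the question:

```python
import re

plain = "\\d+\\.\\d+"   # normal literal: every backslash has to be escaped
raw = r"\d+\.\d+"       # raw literal: backslashes are kept exactly as typed

same_pattern = (plain == raw)
newline_len = len("\n")      # one character: a real newline
raw_len = len(r"\n")         # two characters: a backslash and the letter n

print(same_pattern, newline_len, raw_len)
print(re.search(raw, "version 3.14").group())
```

Both spellings produce the same pattern string; the raw form is simply easier to read and harder to get wrong.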
>>> m = re.match(r'.*\b(\w+)_(%s)(?<!.{%d})' % (ID, -1), my_new_string)
>>> m.groups()
('mega', '1237665428090192022')
>>> m.group(1)
'mega'
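One thing to watch with the os.path approach: os.path.basename only treats backslashes as separators when the code runs on Windows. A hedged sketch that works on any platform uses ntpath, the Windows flavour of os.path from the standard library, and then a plain capturing group instead of a lookbehind. The name prefix is illustrative:

```python
import ntpath
import re

my_new_string = "C:\\Users\\User\\code\\Python\\final\\mega_1237665428090192022_cts.ascii"
ID = '1237665428090192022'

# ntpath splits on backslashes regardless of the current operating system.
basename = ntpath.basename(my_new_string)

# Capture the word characters that come before '_<ID>'.
# re.escape guards against IDs that contain regex metacharacters.
m = re.match(r'(\w+?)_%s' % re.escape(ID), basename)
prefix = m.group(1) if m else None

print(basename)
print(prefix)
```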

Regex to retrieve the last few characters of a string

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string that follows id=
The following regex didn't seem to work in python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
Your regular expression
(.*?)
will not work because it matches zero or more characters, as few as possible (because of the ?). Since nothing follows it in the pattern, it is satisfied with matching nothing at all. So, you have the following choices of regex
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
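The three variants can be compared directly. This is a small sketch using the URL from the question; the variable names are illustrative:

```python
import re

sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"

# Lazy with nothing after it: the group is allowed to match zero characters.
lazy = re.search("id=(.*?)", sampleURL).group(1)

# Greedy: the group takes everything up to the end of the line.
greedy = re.search("id=(.*)", sampleURL).group(1)

# Lazy but anchored: '$' forces the group to extend to the end of the string.
anchored = re.search("id=(.*?)$", sampleURL).group(1)

print(repr(lazy))
print(greedy)
print(anchored)
```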
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print(data.split("id=", 1)[-1])
Output
com.lima.doodlejump
If you really have to use RegEx, you can do like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print(re.search("id=(.*)", data).group(1))
Output
com.lima.doodlejump
I'm surprised that nobody has mentioned urlparse yet...
>>> import urlparse  # Python 2 module; Python 3 moved it to urllib.parse
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).
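In Python 3 the same functions live in urllib.parse. A minimal sketch, where the extra hl=en parameter is a hypothetical addition to show how a plain split can go wrong:

```python
from urllib.parse import urlparse, parse_qs

s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
app_id = parse_qs(urlparse(s).query)["id"][0]

# Hypothetical URL with one more query parameter after id=
longer = s + "&hl=en"
by_split = longer.split("id=")[-1]                    # drags the extra parameter along
by_parse = parse_qs(urlparse(longer).query)["id"][0]  # still just the id value

print(app_id)
print(by_split)
print(by_parse)
```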
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.

Python regex alternation

I'm trying to find all links on a webpage in the form of "http://something" or "https://something". I made a regex and it works:
L = re.findall(r"http://[^/\"]+/|https://[^/\"]+/", site_str)
But, is there a shorter way to write this? I'm repeating ://[^/\"]+/ twice, probably without any need. I tried various stuff, but it doesn't work. I tried:
L = re.findall(r"http|https(://[^/\"]+/)", site_str)
L = re.findall(r"(http|https)://[^/\"]+/", site_str)
L = re.findall(r"(http|https)(://[^/\"]+/)", site_str)
It's obvious I'm missing something here or I just don't understand python regexes enough.
You are using capturing groups, and .findall() alters behaviour when you use those (it'll only return the contents of capturing groups). Your regex can be simplified, but your versions will work if you use non-capturing groups instead:
L = re.findall(r"(?:http|https)://[^/\"]+/", site_str)
You don't need to escape the double quote if you use single quotes around the expression, and you only need to vary the s in the expression, so s? would work too:
L = re.findall(r'https?://[^/"]+/', site_str)
Demo:
>>> import re
>>> example = '''
... "http://someserver.com/"
... "https://anotherserver.com/with/path"
... '''
>>> re.findall(r'https?://[^/"]+/', example)
['http://someserver.com/', 'https://anotherserver.com/']
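To see how capturing groups change what findall returns, here is a small sketch that runs the attempts from the question against the demo input. The variable names are illustrative:

```python
import re

site_str = '"http://someserver.com/" "https://anotherserver.com/with/path"'

# One capturing group: findall returns only that group's text.
one_group = re.findall(r'(http|https)://[^/"]+/', site_str)

# Two capturing groups: findall returns a tuple per match.
two_groups = re.findall(r'(http|https)(://[^/"]+/)', site_str)

# Non-capturing group: findall returns the whole match again.
non_capturing = re.findall(r'(?:http|https)://[^/"]+/', site_str)

print(one_group)
print(two_groups)
print(non_capturing)
```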

Python matching some characters into a string

I'm trying to extract/match data from a string using regular expression but I don't seem to get it.
I want to extract the i386 from the following string (the text between the last - and .iso):
/xubuntu/daily/current/lucid-alternate-i386.iso
This should also work in case of:
/xubuntu/daily/current/lucid-alternate-amd64.iso
And the result should be either i386 or amd64 given the case.
Thanks a lot for your help.
You could also use split in this case (instead of regex):
>>> str = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> str.split(".iso")[0].split("-")[-1]
'i386'
split gives you a list of elements on which your string got 'split'. Then using Python's slicing syntax you can get to the appropriate parts.
If you will be matching several of these lines, using re.compile() and saving the resulting regular expression object for reuse is more efficient.
>>> import re
>>> s1 = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> s2 = "/xubuntu/daily/current/lucid-alternate-amd64.iso"
>>> pattern = re.compile(r'^.+-(.+)\..+$')
>>> m = pattern.match(s1)
>>> m.group(1)
'i386'
>>> m = pattern.match(s2)
>>> m.group(1)
'amd64'
r"/([^-]*)\.iso/"
The bit you want will be in the first capture group.
First off, let's make our life simpler and only get the file name.
>>> os.path.split("/xubuntu/daily/current/lucid-alternate-i386.iso")
('/xubuntu/daily/current', 'lucid-alternate-i386.iso')
Now it's just a matter of catching all the letters between the last dash and the '.iso'.
The expression should be without the leading and trailing slashes.
import re
line = '/xubuntu/daily/current/lucid-alternate-i386.iso'
rex = re.compile(r"([^-]*)\.iso")
m = rex.search(line)
print(m.group(1))
Yields 'i386'
reobj = re.compile(r"(\w+)\.iso$")
match = reobj.search(subject)
if match:
    result = match.group(1)
else:
    result = ""
Subject contains the filename and path.
>>> import os
>>> path = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> file, ext = os.path.splitext(os.path.split(path)[1])
>>> processor = file[file.rfind("-") + 1:]
>>> processor
'i386'
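The approaches above can be wrapped into small helpers. This is a sketch, not a definitive implementation: the function names are made up, and it assumes the architecture is always the text after the last dash in the file name.

```python
import re
from pathlib import PurePosixPath


def arch_from_iso(path):
    """Return the text between the last '-' and the extension of the file name."""
    stem = PurePosixPath(path).stem      # e.g. 'lucid-alternate-i386'
    # If the stem has no '-', rpartition returns the whole stem as the last item.
    return stem.rpartition("-")[2]


def arch_from_iso_re(path):
    """Regex variant: capture what sits between the last '-' and a trailing '.iso'."""
    m = re.search(r"-([^-/]+)\.iso$", path)
    return m.group(1) if m else ""


for p in ("/xubuntu/daily/current/lucid-alternate-i386.iso",
          "/xubuntu/daily/current/lucid-alternate-amd64.iso"):
    print(arch_from_iso(p), arch_from_iso_re(p))
```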
