How to stop a Python regular expression being too greedy

I'm trying to match (in Python) the show name and season/episode numbers from tv episode filenames in the format:
Show.One.S01E05.720p.HDTV.x264-CTU.mkv
and
Show.Two.S08E02.HDTV.XviD-LOL.avi
My regular expression:
(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})
matches correctly on Show Two giving me Show Two, 08 and 02. However the 720 in Show One means I get back 7 and 20 for season/episode.
If I remove the ? after [XxEe] then it matches both types but I want that range to be optional for filenames where the episode identifier isn't included.
I've tried using ?? to stop the [XxEe] match being greedy as listed in the python docs re module section but this has no effect.
How can I capture the series name section and the season/episode section while ignoring the rest of the string?

Make the first group non-greedy:
import re
p = re.compile(r'(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})')
print(p.findall("Game.of.Thrones.S01E05.720p.HDTV.x264-CTU.mkv"))
# [('Game.of.Thrones', '01', '05')]
print(p.findall("Entourage.S08E02.HDTV.XviD-LOL.avi"))
# [('Entourage', '08', '02')]
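Note the ? following the + in the first group.
Explanation:
The first group is greedy and consumes too much of the filename, so the engine only backtracks as far as the last position where the rest of the pattern fits, which is the 720. Making the group lazy lets the rest of the pattern match at the earliest possible position instead. (Not a great example, by the way; I would have changed the names, as they sound a bit too warez-y. ;-) )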
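To see the difference on the filename from the question, here is a minimal sketch that runs the greedy and the lazy version of the pattern side by side. The names GREEDY, LAZY and parse are illustrative only and do not come from the original posts.

```python
import re

# Illustrative names; the patterns are the ones from the question and the answer.
GREEDY = re.compile(r'(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})')
LAZY = re.compile(r'(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})')

def parse(pattern, filename):
    """Return (show, season, episode) for the first match, or None."""
    m = pattern.search(filename)
    return (m.group('show'), m.group('season'), m.group('episode')) if m else None

name = "Show.One.S01E05.720p.HDTV.x264-CTU.mkv"
greedy_result = parse(GREEDY, name)  # show group swallows "S01E05"; 720 is split into 7 and 20
lazy_result = parse(LAZY, name)      # stops at the first place the rest of the pattern fits
print(greedy_result)
print(lazy_result)
```

The greedy version reproduces the 7 and 20 described in the question, while the lazy version returns Show.One, 01 and 05.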

Try making the show group lazy, with a ? after the +:
(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})

Add a dot at the end of the regex (the trailing \.):
(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})\.
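A minimal sketch of the trailing-dot variant, run on the two filenames from the question. ANCHORED and parsed are illustrative names, not part of the original answer.

```python
import re

# Greedy show group, but the episode digits must be followed by a literal dot.
ANCHORED = re.compile(
    r'(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})\.'
)

filenames = [
    "Show.One.S01E05.720p.HDTV.x264-CTU.mkv",
    "Show.Two.S08E02.HDTV.XviD-LOL.avi",
]
parsed = [ANCHORED.search(f).groupdict() for f in filenames]
for info in parsed:
    print(info)
```

This works here because 720 is followed by p rather than a dot. A purely numeric token such as .1080. would still be picked up as season 10, episode 80, so the lazy show group is probably the more robust fix.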

Related

Extract values in name=value lines with regex

I'm sorry for asking, because there are similar questions around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a regex in Python 3.9.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() returns) are
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*):
import re
# txt is assumed to hold the input lines as a single string
re.findall(r'profile\d+\.name=(.*)', txt)
# output: ['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need a regex; str.split works fine here.
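The code for the split approach is missing from the answer, so the following is only a sketch of what it might look like. The variable names txt and values are assumptions; the sample lines are copied from the question.

```python
# Sample input copied from the question.
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once at the first "=" and keep the right-hand side.
values = [line.split("=", 1)[1] for line in txt.splitlines() if "=" in line]
print(values)
```

Passing 1 as the maxsplit argument keeps any further = signs inside the value intact.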
Try removing the ? after the .* in your pattern. With nothing following the group, the lazy (.*?) is satisfied by matching an empty string.
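A small sketch of why the lazy group comes back empty. The re.MULTILINE flag is an assumption (it lets ^ match at the start of every line, as the multiline flag does on regex101), and the two-line txt sample is shortened from the question.

```python
import re

txt = "profile2.name=share2\nprofile8.name=share8"

# Lazy group at the end of the pattern: matches as little as possible, i.e. nothing.
lazy = re.findall(r'^profile[0-9]\.name=(.*?)', txt, flags=re.MULTILINE)
# Greedy group: runs to the end of each line.
greedy = re.findall(r'^profile[0-9]\.name=(.*)', txt, flags=re.MULTILINE)
print(lazy)
print(greedy)
```

The lazy version yields one empty string per line, while the greedy version yields the values.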

Extract part of string according to pattern using regular expression Python

I have files that follow a specific format, which look something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples of somewhat similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # time_date, e.g. 0800_20180102
print(match.group(2)) # time, e.g. 0800
print(match.group(3)) # date, e.g. 20180102
Note that for such tasks the group operator () inside the regexp is really helpful: it lets you access substrings of a bigger pattern without having to match each one individually (which can be much more ambiguous than matching the larger pattern).
Groups are numbered from 1 upward, left to right by their opening parenthesis, and group 0 is the whole match.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]
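The split approach can be wrapped in a small helper, sketched below. The name split_timestamp is hypothetical, and the sketch assumes every filename contains at least three underscores; anything else raises a ValueError when unpacking.

```python
def split_timestamp(filename):
    """Return (time, date) from a name shaped like prefix_time_date_rest."""
    # maxsplit=3 keeps any further underscores in the last field intact.
    _prefix, time_part, date_part, _rest = filename.split("_", 3)
    return time_part, date_part

names = ["test_0800_20180102_filepath.csv", "anotherone_0800_20180101_hello.csv"]
stamps = [split_timestamp(n) for n in names]
print(stamps)
```

Splitting once and unpacking avoids calling split twice on the same string.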

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
To get the count, call len() on the list returned by findall. The code will be:
import re
string = 'hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result = re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)', string)
print("result =", result)  # all the occurrences found, as a list
print("len(result) =", len(result))  # total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: findall evaluates the regular expression and, when the pattern contains groups, returns the groups rather than the whole match. I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
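As a rough illustration of how grouping changes what findall returns, the sketch below runs the one-group pattern and the two variants above on the examples from the question. The variable names and the single-line text sample are assumptions.

```python
import re

text = "hello 7Out how 99In are 123May"

# One group: a flat list of the captured strings.
one_group = re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)', text)
# Several groups: a list of tuples, one entry per group.
two_groups = re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)', text)
three_groups = re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))', text)

print(one_group)
print(two_groups)
print(three_groups)
print(len(one_group))  # number of occurrences
```

The count from len(one_group) is the value you would then add to your existing df.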
For background on regular expressions, see the Python re documentation and the Tutorials Point tutorial; an online regex tester is useful for experimenting with the matching characters.

Can I combine these two regexes into a single regex? (Find `that` in `string` if `this` is anywhere in `string`)

As input I have a series of long strings, which may or may not have the pattern(s) I'm looking for. The strings that have the pattern(s) will have an identifier(s) somewhere in the string, but not necessarily directly preceding the pattern(s). Currently I'm using this logic to find what I'm looking for:
droid_name = re.compile("(r2-d2|c-3po)")
location = re.compile("pattern_of_numbered_sectors_where_theyre_located")
find_droid = re.findall(location, string) if re.match(droid_name, string) else not_the_droids_youre_looking_for
r2-d2 and c-3po won't be the same length.
Can I combine this logic into a single regex? Thanks!
EDIT:
I'm looking for a one-line solution because I have a number of different types of information that I want to extract from various strings, so I'm using a dictionary with the regexes. So, something like this:
regexes = {
    'droid location': re.compile("droid_location_pattern"),
    'jedi name': re.compile("jedi_name_pattern"),
    'tatooine phone number': re.compile("tatooine_phone_pattern"),
}
def analyze(some_string):
    for key, regex in regexes.items():
        data = re.findall(regex, some_string)
        if data:
            for data_item in data:
                send_to_mysql(label=key, info=data_item)
EDIT:
Some sample strings are below.
Valid numbers will have the pattern: 9XXXX, which may also be written as 9XXX-X
I don't want to match the number 92222:
[Darth Vader]: Hey babe, I'm chilling in the Death Star. Where are you?
[Padme Amidala]: At the Galactic Senate, can't talk.
[Darth Vader]: Netflix and chill?
[Padme Amidala]: Call me later on my burner phone, the number is: 92222.
Here, I want to match the number 97777, because the string contains r2-d2:
[communique yoda:palpatine] spotted luke skywalker i have.
[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.
[communique yoda:palpatine] location 97777 you must go.
Another possible match because the string contains c-3po:
root#palpatine$ at-at start --target c-3po --location 9777-7
AT-AT startup sequence...
[Error] fuel reserves low, aborting startup. Goodbye.
Don't want to match:
https://members.princessleiapics.com?username=stormtrooper&password=96969
Well, this highly depends on your actual strings. Assuming that c-3po or r2-d2 will always be before the desired location number (am I correct here?) you could use for both your examples the following regex:
(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)
# looks for c-3po or r2-d2 literally
# start a positive lookahead
# which consumes every character zero or unlimited times
# looks for a word boundary
# and captures a five digit number with or without a dash
# looks for a word boundary afterwards and close the lookahead
Be aware that this only works in DOTALL mode (i.e. the dot matches newline characters as well). You can confirm it on regex101 by pasting in your other strings.
Additional thought: it might be better to check whether the strings c-3po or r2-d2 occur in the chunks using normal Python string functions and, if so, match the desired location number with the following regex:
\b(9\d\d\d-?\d)\b
# same as above without the lookahead
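Both approaches can be sketched in Python as follows. The function and constant names are hypothetical, and the sample strings are shortened from the question. As in the answer, the lookahead variant assumes the droid name appears before the location number.

```python
import re

# Pattern from the answer; re.DOTALL lets ".*" cross line breaks.
DROID_LOCATION = re.compile(r'(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)', re.DOTALL)
LOCATION = re.compile(r'\b(9\d\d\d-?\d)\b')

def location_via_lookahead(text):
    """Single-regex variant: returns one location, or None."""
    m = DROID_LOCATION.search(text)
    return m.group(1) if m else None

def locations_via_substring_check(text):
    """Two-step variant: plain substring test first, then the simple regex."""
    if "c-3po" in text or "r2-d2" in text:
        return LOCATION.findall(text)
    return []

yoda = ("[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.\n"
        "[communique yoda:palpatine] location 97777 you must go.")
padme = "[Padme Amidala]: Call me later on my burner phone, the number is: 92222."
print(location_via_lookahead(yoda), locations_via_substring_check(yoda))
print(location_via_lookahead(padme), locations_via_substring_check(padme))
```

Because .* is greedy, the lookahead variant captures the last matching number after the droid name, whereas the two-step variant returns every number in the string regardless of position.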

regex does not match only upper case letters, despite being instructed to do so

I'm making a script to crawl through a web page and find all upper case names, equalling a number (ex. DUP_NB_FUNC=8). The part where my regular expression has to match only upper case letters however, does not seem to be working properly.
value = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", input)
|tc_apb_conf_00.v:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Desired output should look something like the above. However, I am getting:
|tc_apb_conf_00.v:-:=1" name="viewport"/>
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Based on the input I can see it's finding a match starting at =1. However, I don't understand why, as I've put only A-Z in the regex range. I'd really appreciate some help clearing this up.
This should help:
[A-Z0-9_]+(?==\d).{2,}
or
\b[A-Z0-9_]*(?==\d).{2,}\b
That said, your regex is more complicated than it needs to be; given the requirement above, I suggest this:
[A-Z0-9_]+=\d+
Instead of using
(?==\d).{2,}
(a lookahead requiring = followed by a digit, then any two or more characters), you can just use
=\d+
Try this.
value = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", input)
You want the uppercase character class to match at least once, which means you want the + quantifier rather than *, which matches zero or more times.
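A minimal sketch of the difference between * and +. The page string is a made-up sample modelled on the output shown in the question, not the real page.

```python
import re

# Hypothetical two-line sample: an HTML fragment and one of the wanted names.
page = 'initial-scale=1" name="viewport"/>\nDUP_NB_FUNC=8'

# "*" allows the name part to be empty, so a bare "=1..." also matches.
star = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", page)
# "+" requires at least one character from the class before the "=".
plus = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", page)
# The simplified pattern suggested above.
simple = re.findall(r"[A-Z0-9_]+=\d+", page)

print(star)
print(plus)
print(simple)
```

With *, the match can start directly at =1 and .{2,} then consumes the rest of that line, which is where the stray viewport text comes from.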
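I suggest you define your pattern and check each input line against it:
import re
# "-" goes last in the class so it is a literal hyphen rather than an invalid range
value = re.compile(r"[A-Z0-9_:.-]+=\d+")
for i in tlist:
    # match() anchors at the start of i; use search() to find the name anywhere in the line
    jee = value.match(i)
    if jee is not None:
        print(i)
tlist contains your input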
