regex - how to recognise a pattern until a second one is found - python

I have a file, named a particular way. Let's say it's:
tv_show.s01e01.episode_name.avi
it's the standard way a video file of a tv show's episode is named on the net. The pattern is quite the same all over the web, so I want to extract some information from a file named this way. Basically I want to get:
the show's title;
the season number s01;
the episode number e01;
the extension.
I'm using a Python 3 script to do so. This test file is pretty simple because all I have to do is this:
import re

def acquire_info(f="tv_show.s01e01.episode_name.avi"):
    tvshow_title = title_p.match(f).group()
    numbers = numbers_p.search(f).group()
    season_number = numbers.split("e")[0].split("s")[1]
    ep_number = numbers.split("e")[1]
    return [tvshow_title, season_number, ep_number]

if __name__ == '__main__':
    # re.I stands for the option "ignorecase"
    # "_" is included in the class so that "tv_show" is matched in full
    title_p = re.compile(r"^[a-z_]+", re.I)
    numbers_p = re.compile(r"s\d{1,2}e\d{1,2}", re.I)
    print(acquire_info())
and the output is as expected ['tv_show', '01', '01']. But what if my file name is like this other one? some.other.tv.show.s04e05.episode_name.avi.
How can I build a regex that gets all the text BEFORE the "s\d{1,2}e\d{1,2}" pattern is found?
P.S. I didn't put in the example the code to get the extension, I know, but that's not my problem so it does not matter.

Try this:
show_p = re.compile(r"(.*)\.s(\d*)e(\d*)")
show_p.match(x).groups()
where x is your string.
Edit: I forgot to include the extension; here is the revision:
show_p = re.compile(r"^(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
show_p.match(x).groups()
And here is the test result:
>>> show_p = re.compile(r"^(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
>>> x="tv_show.s01e01.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '01', '01', 'avi')
>>> x="tv_show.s2e1.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '2', '1', 'avi')
>>> x='some.other.tv.show.s04e05.episode_name.avi'
>>> show_p.match(x).groups()
('some.other.tv.show', '04', '05', 'avi')
>>>

Here is one option, use capturing groups to extract all of the info you want in one step:
>>> show_p = re.compile(r'(.*?)\.s(\d{1,2})e(\d{1,2})')
>>> show_p.match('some.other.tv.show.s04e05.episode_name.avi').groups()
('some.other.tv.show', '04', '05')

I'm not a Python expert, but if it can do named captures, something general like this might work (Python spells named groups (?P<name>...) rather than (?<name>...)):
^(?P<Title>.+)\.s(?P<Season>\d{1,2})e(?P<Episode>\d{1,2})\..*?(?P<Extension>[^.]+)$
If there are no named groups, just use normal groups.
A problem could occur if the title has a .s2e1. part that masks the real season/episode part. That would require more logic. The regex above assumes that the title, season, episode and extension all exist, and that the s/e part is the one farthest to the right.
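A minimal sketch of the named-group approach in Python, written with the (?P<name>...) spelling that the re module expects. The names EPISODE_RE and parse_episode are hypothetical, and the pattern assumes the title, season/episode marker and extension are all present:

```python
import re

# Hypothetical pattern name; named groups use Python's (?P<name>...) syntax.
EPISODE_RE = re.compile(
    r"^(?P<title>.+)\.s(?P<season>\d{1,2})e(?P<episode>\d{1,2})"
    r"\..*?(?P<extension>[^.]+)$",
    re.IGNORECASE,
)

def parse_episode(filename):
    """Return a dict of the named groups, or None if the name does not match."""
    m = EPISODE_RE.match(filename)
    return m.groupdict() if m else None

print(parse_episode("some.other.tv.show.s04e05.episode_name.avi"))
```

Returning None for a non-matching name lets the caller decide how to handle files that do not follow the convention.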

Related

How can I use variables on a variables loaded from text file? [duplicate]

I am looking for either a technique or a templating system for Python for formatting output as simple text. It needs to be able to iterate through multiple lists or dicts. It would be nice to be able to define the template in a separate file (like output.templ) instead of hardcoding it into the source code.
As a simple example of what I want to achieve, we have the variables title, subtitle and list:
title = 'foo'
subtitle = 'bar'
list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
And running it through a template, the output would look like this:
Foo
Bar
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
How to do this? Thank you.
You can use the standard library string module and its Template class.
Having a file foo.txt:
$title
$subtitle
$list
And the processing of the file (example.py):
from string import Template

d = {
    'title': 'This is the title',
    'subtitle': 'And this is the subtitle',
    'list': '\n'.join(['first', 'second', 'third'])
}

with open('foo.txt', 'r') as f:
    src = Template(f.read())

result = src.substitute(d)
print(result)
Then run it:
$ python example.py
This is the title
And this is the subtitle
first
second
third
There are quite a number of template engines for Python: Jinja, Cheetah, Genshi etc. You won't make a mistake with any of them.
If you prefer to use something shipped with the standard library, take a look at the format string syntax. By default it is not able to format lists like in your output example, but you can handle this with a custom Formatter which overrides the convert_field method.
Suppose your custom formatter cf uses the conversion code l to format lists; then this should produce your given example output:
cf.format("{title}\n{subtitle}\n\n{list!l}", title=title, subtitle=subtitle, list=list)
Alternatively you could preformat your list using "\n".join(list) and then pass this to your normal template string.
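A minimal sketch of such a custom formatter, assuming the conversion code l should join the items with newlines. The class name ListFormatter and the variable days are hypothetical; convert_field is the string.Formatter hook mentioned above:

```python
from string import Formatter

class ListFormatter(Formatter):
    """Formatter with a custom 'l' conversion that joins an iterable with newlines."""

    def convert_field(self, value, conversion):
        if conversion == "l":
            return "\n".join(str(item) for item in value)
        # Fall back to the standard conversions (!s, !r, !a).
        return super().convert_field(value, conversion)

cf = ListFormatter()
title = "Foo"
subtitle = "Bar"
days = ["Monday", "Tuesday", "Wednesday"]
result = cf.format("{title}\n{subtitle}\n\n{days!l}",
                   title=title, subtitle=subtitle, days=days)
print(result)
```

The template string could just as well be read from a separate file before being passed to cf.format.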
If you want arbitrary prefixes/suffixes to identify your variables, you can simply use re.sub with a lambda expression:
from pathlib import Path
import re

def tpl(fn: Path, v: dict[str, str]) -> str:
    text = fn.with_suffix('.html').read_text()
    return re.sub("(<!-- (.+?) -->)", lambda m: v[m[2].lower()], text)

html = tpl(Path(__file__), {
    'title': 't',
    'body': 'b'
})
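The same re.sub idea can be tried without any file on disk. This is a sketch only: the function name render and the sample markup are hypothetical, and unlike the version above it leaves unknown markers in place rather than raising KeyError:

```python
import re

def render(template: str, values: dict[str, str]) -> str:
    """Replace each <!-- NAME --> marker with values[name] (lower-cased lookup)."""
    def lookup(match: re.Match) -> str:
        key = match[1].lower()
        # Leave unknown markers untouched instead of raising KeyError.
        return values.get(key, match[0])
    return re.sub(r"<!-- (.+?) -->", lookup, template)

page = "<h1><!-- TITLE --></h1><p><!-- BODY --></p><!-- FOOTER -->"
print(render(page, {"title": "t", "body": "b"}))
```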

Regex Contains No Capture Groups?

I've got a series of malformed JSON data, and I need to use regex to get the data I need out of it. Then I need to use regex again to pull out a specific part of that data, i.e. the main category; in the example below it's 'games'.
Part 1 works, the second part does not.
I've limited experience with Python, and next to no experience with Regex.
Final Output: games
I'm getting the error:
ValueError: pattern contains no capture groups
The series of data contains information formatted like this:
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}
The Python call I'm using is this:
First I extract the slug from the JSON.
ksdata.cat_slug_raw = ksdata.category.str.extract(r'"slug":"(.+?)"', expand=False)
Then I try to keep only the part before the /
ksdata.cat_slug = ksdata.cat_slug_raw.str.extract('^[^/]+(?=/)', expand=False)
I'd really appreciate some help with where I'm going wrong...and if you think my solution as a whole sux please tell me :)
You can use ast.literal_eval:
s = '{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}'
import ast
final_data = ast.literal_eval(s)
Output:
{'name': 'Playing Cards', 'color': 51627, 'slug': 'games/playing cards', 'parent_id': 12, 'urls': {'web': {'discover': 'http://www.kickstarter.com/discover/categories/games/playing%20cards'}}, 'position': 4, 'id': 273}
Based on an amended suggestion from TomSitter I used
ksdata.cat_slug_raw.str.split('/').str[0]
This was the simplest way to get around it.
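For reference, the ValueError arises because str.extract needs at least one capture group, and '^[^/]+(?=/)' has none; wrapping the wanted part in parentheses, as in '^([^/]+)/', gives it one. The sketch below shows the same idea with the standard library only, assuming each record parses as JSON the way the sample above does. The helper name main_category is hypothetical:

```python
import json
import re

record = '{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}'

def main_category(raw):
    """Parse one record and return the part of its slug before the first '/'."""
    slug = json.loads(raw)["slug"]
    # The parentheses form the capture group that the original pattern lacked.
    m = re.match(r"^([^/]+)/", slug)
    # A slug without '/' is assumed to be a top-level category already.
    return m.group(1) if m else slug

print(main_category(record))
```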

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
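A variant of the same field-based idea that keys on the year instead of a known resolution, so it does not depend on '1080p' being present. This is a sketch under the assumption that the release year is a 19xx or 20xx field that directly follows the title; the function name movie_title is hypothetical:

```python
import re

def movie_title(release_name):
    """Join the dot-separated fields that come before the release year."""
    fields = release_name.split(".")
    for i, field in enumerate(fields):
        # Skip index 0 so a title that starts with a year-like number is kept.
        if i > 0 and re.fullmatch(r"(?:19|20)\d{2}", field):
            return " ".join(fields[:i])
    # No year field found: fall back to the whole name with dots replaced.
    return " ".join(fields)

print(movie_title("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(movie_title("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```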
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming there is always a four-digit year or a four-digit resolution notation within the movie's file name, a simple solution is to replace the unwanted parts matched by this pattern:
"(?:\.|\d{4,4}.+$)"
with a blank, and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar

Python: help composing regex pattern

I'm just learning python and having a problem figuring out how to create the regex pattern for the following string
"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."
I'm trying to extract the data between the begin: and :end for n iterations without getting duplicate data. I've attached my current attempt.
for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
the output is:
begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46
and I want it to be:
32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46
Thank you for any help.
.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
With a modified pattern, forcing single quotes to be present at the start/end of the match:
>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).
Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
Another option to Blckknght and Tim Pietzcker's is
re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)
Instead of choosing non-greedy extensions, you use [^X] to mean "any character but X" for some X.
The advantage is that it's more rigid: there's no way to get the delimiter in the result, so
'begin:33,13:134:2:2006-11-31 T 11:46:end'
would not match, whereas it would for Blckknght and Tim Pietzcker's. For this reason, it's also probably faster on edge cases. This is probably unimportant in real-world circumstances.
The disadvantage is that it's more rigid, of course.
I suggest choosing whichever one makes more intuitive sense, because both methods work.
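The difference between the two styles can be checked on the edge case mentioned above. A small sketch (the names LAZY, STRICT, good and bad are hypothetical) that runs both patterns on a well-formed record and on the record with one field too many:

```python
import re

# Lazy quantifiers: each field is "as little as possible", delimiters allowed inside.
LAZY = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")
# Negated character classes: a field can never contain its own delimiter.
STRICT = re.compile(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end")

good = "begin:32,12:1:2005-10-30 T 10:45:end"
bad = "begin:33,13:134:2:2006-11-31 T 11:46:end"  # one field too many

print(LAZY.findall(good))
print(STRICT.findall(good))
print(LAZY.findall(bad))    # still matches; the extra field leaks into group 4
print(STRICT.findall(bad))  # no match at all
```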

What's the best way to format a phone number in Python?

If all I have is a string of 10 or more digits, how can I format this as a phone number?
Some trivial examples:
555-5555
555-555-5555
1-800-555-5555
I know those aren't the only ways to format them, and it's very likely I'll leave things out if I do it myself. Is there a python library or a standard way of formatting phone numbers?
For a library: phonenumbers (pypi, source)
Python version of Google's common library for parsing, formatting, storing and validating international phone numbers.
The readme is insufficient, but I found the code well documented.
Your examples seem to be formatted in groups of three digits, except for the last group of four. You can write a simple function that uses the thousands separator and then appends the last digit:
>>> def phone_format(n):
...     return format(int(n[:-1]), ",").replace(",", "-") + n[-1]
...
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555555")
'555-555-5555'
>>> phone_format("18005555555")
'1-800-555-5555'
Here's one adapted from utdemir's solution and this solution that will work with Python 2.6, as the "," formatter is new in Python 2.7.
import re

def phone_format(phone_number):
    # Keep only the digits, then hyphenate all but the last digit in groups of three
    clean_phone_number = re.sub(r'[^0-9]+', '', phone_number)
    formatted_phone_number = re.sub(r"(\d)(?=(\d{3})+(?!\d))", r"\1-", "%d" % int(clean_phone_number[:-1])) + clean_phone_number[-1]
    return formatted_phone_number
You can use the function clean_phone() from the library DataPrep. Install it with pip install dataprep.
>>> import pandas as pd
>>> from dataprep.clean import clean_phone
>>> df = pd.DataFrame({'phone': ['5555555', '5555555555', '18005555555']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
        3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
         phone     phone_clean
0      5555555        555-5555
1   5555555555    555-555-5555
2  18005555555  1-800-555-5555
More verbose, one dependency, but guarantees consistent output for most inputs and was fun to write:
import re

def format_tel(tel):
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")     # remove leading +1 or 1
    tel = re.sub("[ ()-]", '', tel) # remove space, (), -
    assert(len(tel) == 10)
    tel = f"{tel[:3]}-{tel[3:6]}-{tel[6:]}"
    return tel
Output:
>>> format_tel("1-800-628-8737")
'800-628-8737'
>>> format_tel("800-628-8737")
'800-628-8737'
>>> format_tel("18006288737")
'800-628-8737'
>>> format_tel("1800-628-8737")
'800-628-8737'
>>> format_tel("(800) 628-8737")
'800-628-8737'
>>> format_tel("(800) 6288737")
'800-628-8737'
>>> format_tel("(800)6288737")
'800-628-8737'
>>> format_tel("8006288737")
'800-628-8737'
Without magic numbers; ...if you're not into the whole brevity thing:
def format_tel(tel):
    AREA_BOUNDARY = 3     # 800.6288737
    SUBSCRIBER_SPLIT = 6  # 800628.8737
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")     # remove leading +1, or 1
    tel = re.sub("[ ()-]", '', tel) # remove space, (), -
    assert(len(tel) == 10)
    tel = (f"{tel[:AREA_BOUNDARY]}-"
           f"{tel[AREA_BOUNDARY:SUBSCRIBER_SPLIT]}-{tel[SUBSCRIBER_SPLIT:]}")
    return tel
A simple solution might be to start at the back and insert the hyphen after four numbers, then do groups of three until the beginning of the string is reached. I am not aware of a built in function or anything like that.
You might find this helpful:
http://www.diveintopython3.net/regular-expressions.html#phonenumbers
Regular expressions will be useful if you are accepting user input of phone numbers. I would not use the exact approach followed at the above link. Something simpler, like just keeping the digits and discarding everything else, is probably easier and just as good.
Also, inserting commas into numbers is an analogous problem that has been solved efficiently elsewhere and could be adapted to this problem.
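The back-to-front grouping described above (the last four digits first, then groups of three until the start of the string) can be sketched as follows. The function name group_from_back is hypothetical, and the input is assumed to already be a plain string of digits:

```python
def group_from_back(digits):
    """Hyphenate a digit string: last four digits, then groups of three."""
    if not digits.isdigit():
        raise ValueError("expected a string of digits")
    groups = [digits[-4:]]
    rest = digits[:-4]
    while rest:
        # Peel off up to three digits from the right on each pass.
        groups.append(rest[-3:])
        rest = rest[:-3]
    return "-".join(reversed(groups))

for number in ("5555555", "5555555555", "18005555555"):
    print(group_from_back(number))
```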
In my case, I needed to get a phone pattern like "*** *** ***" by country.
So I reused the phonenumbers package in our project:
from phonenumbers import country_code_for_region, format_number, PhoneMetadata, PhoneNumberFormat, parse as parse_phone
import re

def get_country_phone_pattern(country_code: str):
    mobile_number_example = PhoneMetadata.metadata_for_region(country_code).mobile.example_number
    formatted_phone = format_number(parse_phone(mobile_number_example, country_code), PhoneNumberFormat.INTERNATIONAL)
    without_country_code = " ".join(formatted_phone.split()[1:])
    return re.sub(r"\d", "*", without_country_code)

get_country_phone_pattern("KG") # *** *** ***
