Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!

Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
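The fixed-field idea can be made less dependent on the literal '1080p' by looking for the first field that looks like a release year instead. This is only a minimal sketch: it assumes the year is of the form 19xx or 20xx and is never the first field, and title_from_fields is a hypothetical helper name, not something from the answer.

```python
import re

# Assumption: release years fall between 1900 and 2099.
YEAR = re.compile(r"(?:19|20)\d{2}")

def title_from_fields(name):
    """Join the dot-separated fields that come before the first year-like field."""
    fields = name.split(".")
    # Start at index 1 so a title that begins with a number (e.g. "2012") is kept.
    for i, field in enumerate(fields[1:], start=1):
        if YEAR.fullmatch(field):
            return " ".join(fields[:i])
    # No year-like field found: fall back to the whole name with spaces.
    return " ".join(fields)

print(title_from_fields("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(title_from_fields("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```

The two calls print "Star Wars Episode 4 A New Hope" and "Interstellar". A field such as 1080p is not mistaken for a year because fullmatch requires the whole field to be four digits.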

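The following regex would work, assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Explanation: .* greedily matches the title, and the lookahead (?=.\d{4}.*\d{3,}p) requires that what follows is any character, a four-digit year and, further on, a resolution of three or more digits followed by p. The lookahead consumes nothing, so only the title (still with its dots) is matched.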
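For reference, here is one way the lookahead pattern could be used from Python. It is a sketch rather than the answerer's own code: the first dot inside the lookahead is escaped so that it only matches a literal period (a small change from the pattern as posted), and title_from_regex is a hypothetical helper name.

```python
import re

# Lookahead pattern from the answer, with the dot before the year escaped.
TITLE = re.compile(r".*(?=\.\d{4}.*\d{3,}p)")

def title_from_regex(name):
    """Return the title with dots turned into spaces, or None if the layout differs."""
    m = TITLE.match(name)
    if m is None:
        return None
    # The match still contains the dots between the words of the title.
    return m.group(0).replace(".", " ")

print(title_from_regex("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(title_from_regex("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
print(title_from_regex("Iron Man 3"))
```

The last call prints None because the name has no year and resolution, so the caller can decide how to handle names that do not follow the assumed layout.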

\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)

Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by
(?:\.|\d{4}.+$)
with a blank and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test1).strip()
res2 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar

Related

How to use regex to extract citation/reference context?

I'm working with quotations/citations/references. In particular, given a text, I would like to extract citations and references, and a context for each of them. In my project, context is defined as the string of at most 10 characters to the left or right of a quote/citation/reference.
This is my code:
# some toy text
text = 'Once upon a time a cat says «gross!». A long story you can check here (ref. 11). People witnessed the scene [...]'
quoting_pattern = '\([^\(]*\)|„[^„]*"|<<.*>>|«[^«]*»|“[^“]*”|‹[^‹]*›|"[^"]*"|›[^›]*‹|»[^»]*«'
context_pattern = ".{0,100}(?:" + quoting_pattern + ").{0,100}"
# get all quotations
quotations = re.findall(r'{}'.format(quoting_pattern), text, re.DOTALL)
# get all contexts
contexts = re.findall(r'{}'.format(context_pattern), text, re.DOTALL)
for i, q in enumerate(quotations):
    print(q, contexts[i])
My expected result is this one:
"«gross!»", " cat says «gross!». A long s"
"(ref. 11)", "heck here (ref. 11)"
However, I got an error: IndexError: list index out of range.
Even if "«gross!»" and "(ref. 11)" are extracted in the 'quotations' variable and I'm able to extract the context for "«gross!»", I can't find any context for "(ref. 11)".
Why does this happen? How can I solve this issue?
Thanks in advance

Try re.finditer. Match objects have .start() and .end() methods you can use to get context:
import re
text = "Once upon a time a cat says «gross!». A long story you can check here (ref. 11). People witnessed the scene [...]"
pat = re.compile(r"«[^»]*»|\"[^\"]*\"|\([^)]*\)")
for m in pat.finditer(text):
    ctx = text[max(m.start() - 10, 0) : min(m.end() + 10, len(text))]
    print(m.group(0), ctx)
Prints:
«gross!» cat says «gross!». A long s
(ref. 11) heck here (ref. 11). People w
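As for why the original code raised IndexError: re.findall returns non-overlapping matches, and each context match greedily takes up to 100 characters on either side of a quotation. With the toy text, a single context match covers the whole string, so contexts has one element while quotations has two. Slicing the context out of the original text, as above, avoids that. A minimal sketch wrapping the idea in a function follows; quotes_with_context and its width parameter are illustrative names, and the pattern covers only three of the quote styles from the question.

```python
import re

# Guillemets, straight double quotes and parentheses only (a subset of the question's pattern).
QUOTE = re.compile(r"«[^»]*»|\"[^\"]*\"|\([^)]*\)")

def quotes_with_context(text, width=10):
    """Return (quotation, context) pairs; context adds up to `width` characters on each side."""
    results = []
    for m in QUOTE.finditer(text):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # Contexts are cut from the original text, so they may overlap freely.
        results.append((m.group(0), text[start:end]))
    return results

text = ("Once upon a time a cat says «gross!». A long story you can "
        "check here (ref. 11). People witnessed the scene [...]")
for quote, ctx in quotes_with_context(text):
    print(repr(quote), repr(ctx))
```

Because every quotation produces exactly one pair, the two lists can no longer get out of step.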

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print b
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use regular expressions; if there is another, more efficient and faster way, please point it out.
I also want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.

Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
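Applied to the string from the question, the same non-regex idea might look like the sketch below. It uses str.partition to cut at the keyword and assumes that the year is the last whitespace-separated token, as in the sample; split_citation is a hypothetical helper name.

```python
def split_citation(s, keyword=" CiteSeerX"):
    """Return (text before keyword, last token after it), or None if the keyword is absent."""
    title, sep, rest = s.partition(keyword)
    if not sep:
        return None  # keyword not found
    tokens = rest.split()
    # Assumption: the year is the final token of the string.
    year = tokens[-1] if tokens else None
    return title, year

col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"
print(split_citation(col4))
```

This prints ('May god bless our families studied.', '2004'). If more metadata follows the year, the last-token assumption no longer holds and a pattern-based approach is the safer choice.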

If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # We have a match; unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
Update:
For the case where metadata follows the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # We have a match; return the (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')

How to return a word in a string if it starts with a certain character? (Python)

I'm building a reddit bot for practice that converts US dollars into other commonly used currencies, and I've managed to get the conversion part working fine, but now I'm a bit stuck trying to pass the characters that directly follow a dollar sign to the converter.
This is sort of how I want it to work:
def run_bot():
    subreddit = r.get_subreddit("randomsubreddit")
    comments = subreddit.get_comments(limit=25)
    for comment in comments:
        comment_text = comment.body
        # If comment contains a string that starts with '$'
        # Pass the rest of the 'word' to a variable
So for example, if it were going over a comment like this:
"I bought a boat for $5000 and it's awesome"
It would assign '5000' to a variable that I would then put through my converter
What would be the best way to do this?
(Hopefully that's enough information to go off, but if people are confused I'll add more)

You could use the re.findall function.
>>> import re
>>> re.findall(r'\$(\d+)', "I bought a boat for $5000 and it's awesome")
['5000']
>>> re.findall(r'\$(\d+(?:\.\d+)?)', "I bought two boats for $5000 $5000.45")
['5000', '5000.45']
OR
>>> s = "I bought a boat for $5000 and it's awesome"
>>> [i[1:] for i in s.split() if i.startswith('$')]
['5000']

If you are dealing with prices as floating-point numbers, you can use this:
import re
s = "I bought a boat for $5000 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s)
print(matches)  # ['5000']
s2 = "I bought a boat for $5000.52 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s2)
print(matches)  # ['5000.52']
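Since the bot needs a number to feed into the converter, the captured strings still have to be converted. A minimal sketch, assuming amounts are written without thousands separators; dollar_amounts is a hypothetical helper name, and Decimal is used here (rather than float) only to avoid rounding surprises with money.

```python
import re
from decimal import Decimal

# A dollar sign followed by digits, with an optional fractional part.
AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)")

def dollar_amounts(text):
    """Return every $-prefixed amount in text as a Decimal."""
    return [Decimal(value) for value in AMOUNT.findall(text)]

print(dollar_amounts("I bought two boats for $5000 and $5000.45"))
```

This prints [Decimal('5000'), Decimal('5000.45')]. A comment without any dollar amount simply yields an empty list, so the bot can skip it.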

Output surrounded by [' and '] - How to stop?

I am pulling information down from an RSS feed. Due to further analysis, I don't particularly want to use the likes of Beautiful Soup or feedparser. The explanation is out of scope for this question.
The output text comes out wrapped in [' and ']. For example:
Title:
['The Morning Download: Apple Stumbles but Mobile Soars']
Published:
['Tue, 28 Jan 2014 13:09:04 GMT']
Why is this output like this? How do I stop this?
try:
    #This is the RSS Feed that is being scraped
    page = 'http://finance.yahoo.com/rss/headline?s=aapl'
    yahooFeed = opener.open(page).read()
    try:
        items = re.findall(r'<item>(.*?)</item>', yahooFeed)
        for item in items:
            # Prints the title
            title = re.findall(r'<title>(.*?)</title>', item)
            print "Title:"
            print title
            # Prints the Date / Time Published
            print "Published:"
            datetime = re.findall(r'<pubDate>(.*?)</pubDate>', item)
            print datetime
            print "\n"
    except Exception, e:
        print str(e)
I am grateful for any criticism, advice and best-practice information.
I'm a Java / Perl programmer still getting used to Python, so pointers to any great resources you know of are greatly appreciated.

Use re.search instead of re.findall; re.findall always returns a list of all matches.
datetime = re.search(r'<pubDate>(.*?)</pubDate>', item).group(1)
The difference is that re.findall returns a list (Python's array data structure) of all matches, while re.search returns only the first match found.
If nothing matches, re.search returns None, so to handle that as well:
match = re.search(r'<pubDate>(.*?)</pubDate>', item)
if match is not None:
    datetime = match.group(1)
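The difference is easy to see on a small sample. The sketch below uses a made-up item string built from the values shown in the question, not a real feed response, and first_group is a hypothetical helper that wraps the None check.

```python
import re

# Made-up sample item, assembled from the title and date shown in the question.
item = ("<title>The Morning Download: Apple Stumbles but Mobile Soars</title>"
        "<pubDate>Tue, 28 Jan 2014 13:09:04 GMT</pubDate>")

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else default

print(re.findall(r"<title>(.*?)</title>", item))   # a list, hence the [' and ']
print(first_group(r"<title>(.*?)</title>", item))  # a plain string
print(first_group(r"<author>(.*?)</author>", item, default="(none)"))
```

Printing a list shows its brackets and quotes; printing the string returned by group(1) does not.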

Error: match word in file

There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n",count1, line
    ltext1 = line.split(" ")
    for i,text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
The code printed:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I am trying to match words from "test_tweet1.txt" against "Personal.txt".
Expected result:
Tony
Romo
Any suggestions?

Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    # Get full names rather than partial names, skipping blank lines
    people = [name for name in p.read().split('\n') if name]
with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print(person)

You need to split rpopular_person so that it matches whole words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
This gives:
Tony
The
The reason Romo isn't showing up is that your line split produces "Romo." with a period. Maybe you should look for each entry of popular_person in the lines instead of the other way around, something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1, 1):
    print "\n", count1, line
    for person in popular_person:
        # Skip empty entries, which would match every line
        if person and person in line:
            print person
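One caveat with a plain in test is that it matches substrings, so a name like "Thor" would also be found inside "Thorough". A possible refinement, shown here as a sketch and not as part of either answer, is to require that the name is not glued to other word characters. The lookarounds (?<!\w) and (?!\w) are used instead of \b so that names ending in punctuation, such as "T.I.", still work. find_people is a hypothetical helper, and the short people list stands in for the contents of Personal.txt.

```python
import re

def find_people(tweet, people):
    """Return the names from `people` that occur in `tweet` as whole names."""
    found = []
    for person in people:
        # re.escape keeps characters such as '.' and '(' in names literal.
        pattern = r"(?<!\w)" + re.escape(person) + r"(?!\w)"
        if re.search(pattern, tweet):
            found.append(person)
    return found

people = ["The Undertaker", "Thor", "T.I.", "Tom Cruise", "Tony Romo"]
tweet = ("#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who "
         "likes to share the ball with everyone. Including the other team.")
print(find_people(tweet, people))
```

This prints ['Tony Romo']. For a very long list of names, compiling each pattern once up front would avoid repeating that work for every tweet.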

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
