Regular Expressions: Matching Song names in Python 3 - python

I'm currently working on a project to parse data from a music database and I'm creating a search function using regular expressions in python (version 3.5.1).
I would like to create a regular expression that matches song names, both songs with nothing following the name and songs with feature details, but not songs whose name merely contains the given song's name (examples may help illustrate my point):
What I'd like to match:
Work
Work (ft. Drake)
What I would NOT like to match:
Work it
Workout
My current regular expression is ' /Work(\s(\w+)?/ ' but this matches all 4 example cases.
Can someone help me figure out an expression to accomplish this?

Personally, I'd go with something like
^Work(?:\s+\(.+\))?$
which will match your two provided test cases, but not the two you want to avoid. If you want to make it a bit more specific about matching the featured artist, you can go with something like
^Work(?:\s+\((?:ft\.|featuring).+\))?$
Which will still match your two cases, but will only match stuff in the brackets that starts with "ft." or "featuring".
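As a quick check, here is a minimal sketch that runs the second pattern against the four titles from the question. The helper name song_pattern is made up for illustration; it passes the song name through re.escape so that titles containing regex metacharacters are treated literally, and the dot after "ft" is escaped so it only matches a literal period.

```python
import re

def song_pattern(name):
    """Build a pattern for `name`, optionally followed by "(ft. ...)" or "(featuring ...)"."""
    # re.escape keeps characters such as "." or "(" in a title literal.
    return re.compile(
        r"^" + re.escape(name) + r"(?:\s+\((?:ft\.|featuring).+\))?$"
    )

titles = ["Work", "Work (ft. Drake)", "Work it", "Workout"]
work = song_pattern("Work")

# Keep only the titles the pattern accepts.
matched = [title for title in titles if work.match(title)]
print(matched)
```

Because the pattern is anchored with ^ and $, anything after the name that is not a bracketed feature credit makes the match fail.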

Related

Modify the data found between two recurring patterns in a multi-line string

I have a multi-line string of around 10000-40000 characters (the length changes with the data returned by an API). In this string, there are a number of tables (they are part of the string, but formatted in a way that makes them look like a table). The tables always follow a repeating pattern. The pattern looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents in html on a locally hosted webpage, and I want to have the heading of the tables displayed in a specific way (think color, font size). For that, I'm using the python regex module to identify the pattern, but I'm failing to do so due to inexperience in using the re module. To modify the part that I need modified, I'm using the below piece of code:
re.sub(r'\={78}.*\-{78}',some_replacement_string, complete_multi_line_string)
But the above piece of code is not giving me the output I require, since it is not matching the pattern properly (I'm sure the mistake is in the pattern I'm asking re.sub to match).
However:
re.sub(r'\-{78}',some_replacement_string, complete_multi_line_string)
is working as it's returning the string with the replacement, but the slight problem here is that there are multiple ------------------------------------------------------------------------------s in the code that I do not want modified. Please help me out here. If it is helpful, the output that I'm wanting is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there are newlines (\n) after each ============================================================================== line, each <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span> line and each ------------------------------------------------------------------------------ line, if that is helpful in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 = and 78 - in all the occurrences.
You should try using the following version:
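re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Match on lazy dot .*? instead of greedy dot .*, to ensure that we don't match across header sections
Match with the re.S flag, so that .*? will match across newlines
Capture the two rule lines and the heading in separate groups, and write the newlines back in the replacement so that each stays on its own line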
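To see the effect on something small, here is a sketch with a made-up two-table string (the table contents are invented for illustration). It also includes a stray dashed rule inside the first table's data to check that only the rule directly under a heading is involved in a match.

```python
import re

RULE_EQ = "=" * 78
RULE_DASH = "-" * 78

# Invented sample data: two tables, plus one extra dashed rule in the body.
text = "\n".join([
    "intro line",
    RULE_EQ,
    "NAME   AGE",
    RULE_DASH,
    "alice  30",
    RULE_DASH,  # stray rule that should be left alone
    RULE_EQ,
    "CITY   ZIP",
    RULE_DASH,
    "paris  75001",
])

# Lazy .*? stops at the first dashed rule after each "=" rule.
heading = re.compile(r"(={78})\n(.*?)\n(-{78})", flags=re.S)
result = heading.sub(r"\1\n<span>\2</span>\n\3", text)

print(result.count("</span>"))
```

The count of closing tags should equal the number of tables, which is a convenient sanity check when running this over the real API output.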

Remove dynamic time and name combinations using regex

I am unsuccessfully trying to use regex to remove time stamps and names from the online conversations I am processing.
The pattern I am trying to remove looks like this: [08:03:16] Name:
It is randomly distributed throughout the conversation instances.
The Name portion of the pattern can be lower or uppercase and can contain multiple names, e.g. Dave, adam Jons, Wei-Xing.
I am using the following regex:
[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
It comes from the question "Find names with Regular Expression", but it only removes names outside the timestamp pattern shown above (and only works for some of the names that follow a timestamp).
I have been looking through SO for a while now to find something that might help me but nothing has worked across all examples so far.
That looks a lot more complicated than it has to be - might be easier to match the timestamp format, then match characters up until the next : is found (assuming that names can't have :s in them):
\[(?:\d{2}:){2}\d{2}\] [^:]+:
https://regex101.com/r/5i4HId/1
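Applied in Python, that could look like the sketch below. The sample conversation line is invented, and the whitespace clean-up at the end is an extra step of mine rather than part of the pattern.

```python
import re

# "[hh:mm:ss] " followed by everything up to and including the next colon.
SPEAKER_PREFIX = re.compile(r"\[(?:\d{2}:){2}\d{2}\] [^:]+:")

line = "[08:03:16] adam Jons: hello there [08:03:20] Wei-Xing: hi"

# Drop every timestamp-and-name prefix, then collapse leftover whitespace.
without_prefixes = SPEAKER_PREFIX.sub("", line)
cleaned = " ".join(without_prefixes.split())
print(cleaned)
```

Since the name part is just "anything that is not a colon", the case of the name, hyphens and multiple words do not need special handling.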

Python - Regular expression without order

I want to extract for example 2 entities from a sentence. eg:
str1 = 'i am tom and i have a car'
I want to extract the word 'tom' or 'jack' as name if it exists.
I also want to extract the word 'car' or 'bike' as property if it exists.
Now I can simply write 2 regular expressions:
re.search(r"(?P<name>tom|jack)", s).group('name')
re.search(r"(?P<property>car|bike)", s).group('property')
But I wonder if I can combine these two together.
The problem is that I cannot know the order of name and property in advance. So the following code
re.search(r"(?P<name>tom|jack).*(?P<property>car|bike)", s)
does not work for:
str2 = 'i have a car and i am tom'
I tried simply combining the two possible orders
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property>car|bike).*(?P<name>tom|jack)))", s2)
but it gives me a "redefinition of group name" error unless I change it to
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property2>car|bike).*(?P<name2>tom|jack)))", s2)
Question
How can I write a regular expression to extract tom/jack as name and car/bike as property without considering the order?
Moreover
I don't want to simply list all the possible orders, because there would be too many cases if I want to extract n kinds of entities.
Yes, it's possible, but only with lookarounds; otherwise the characters are consumed and the engine does not go back to search the earlier part of the string again.
\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))
Every pattern in a regex must match for the overall match to succeed. If some of the patterns are not mandatory, make them optional.
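Here is a small sketch of both variants in Python. The names BOTH_REQUIRED, EACH_OPTIONAL and extract are made up for illustration, and the optional form (each lookahead body wrapped in an optional group) is one possible reading of "make them optional", not the only one.

```python
import re

# Lookaheads anchored at the start: each one scans the whole sentence
# independently, so the order of the entities does not matter.
BOTH_REQUIRED = re.compile(
    r"\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))"
)

# Wrapping each lookahead body in an optional group lets a missing
# entity come back as None instead of failing the whole match.
EACH_OPTIONAL = re.compile(
    r"\A(?=(?:.*(?P<name>tom|jack))?)(?=(?:.*(?P<property>car|bike))?)"
)

def extract(pattern, sentence):
    """Return the named groups as a dict, or None if the pattern fails."""
    match = pattern.search(sentence)
    return match.groupdict() if match else None

first = extract(BOTH_REQUIRED, "i am tom and i have a car")
second = extract(BOTH_REQUIRED, "i have a car and i am tom")
missing = extract(BOTH_REQUIRED, "i have a bike")
partial = extract(EACH_OPTIONAL, "i have a bike")
print(first, second, missing, partial)
```

Adding an n-th entity means adding one more lookahead, so the pattern grows linearly instead of with the number of possible orders. Note that these alternations match substrings, so a word such as "cart" would also yield "car" unless word boundaries (\b) are added.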

regex regarding symbols in urls

I want to replace consecutive repeated symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I notice that this might also replace symbols in URLs that appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
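For what it's worth, since Python 3.5 re.sub replaces unmatched groups in the replacement template with an empty string, so the "unmatched group" complaint should no longer appear there. A replacement function sidesteps the problem altogether. The sketch below is one way to write it, assuming a Python 3 version whose re module accepts the \s* form of the repeat; the names DEDUP and collapse_repeats are made up, and the URL part is the same rough heuristic as above, not a complete URL matcher.

```python
import re

DEDUP = re.compile(
    r"""
    # Either match a URL (protocol, or a leading www./ftp.) and keep it as is
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # or match a symbol repeated one or more times (whitespace allowed between)
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

def _keep_url_or_collapse(match):
    """Return the URL untouched, or a single copy of the repeated symbol."""
    url = match.group("URL")
    return url if url is not None else match.group("rpt")

def collapse_repeats(text):
    return DEDUP.sub(_keep_url_or_collapse, text)

sample = "this is a dog??? see http://example.com/this--is-a-page.html !!"
print(collapse_repeats(sample))
```

The URL alternative comes first, so the doubled hyphen inside the address is consumed as part of the URL before the repeated-symbol branch gets a chance to see it.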

Regex to match Domain.CCTLD

Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com doesn't get matched, but google.com does. However, this gets complicated with stuff like .co.uk, CCTLDs. Does anyone know a solution? Thanks in advance.
EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.
It sounds like you are looking for the information available through the Public Suffix List project.
A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.
There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
suffixes = parse_suffix_list("suffix_list.txt")
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
def is_domain(d):
    for suffix in suffixes:
        # Require a dot before the suffix so "xcom" does not count as ending in "com"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False
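For the multiple-subdomain case from the question's edit (john.doe.google.co.uk), a variation is to look for the longest matching suffix and keep exactly one label in front of it. The sketch below is a rough illustration only: SUFFIXES is a tiny hand-picked subset rather than the real list, the function names are made up, and the wildcard and exception rules that the real Public Suffix List contains are not handled.

```python
# Tiny illustrative subset; a real program would load the full list.
SUFFIXES = {"com", "org", "eu", "uk", "co.uk"}

def registrable_domain(hostname, suffixes=SUFFIXES):
    """Return the suffix plus one label, or None if no suffix applies."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is found before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # i == 0 means the hostname itself is a bare public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

def is_atomic_domain(hostname):
    """True when the hostname has no labels beyond its registrable domain."""
    return registrable_domain(hostname) == hostname.lower().rstrip(".")

print(registrable_domain("john.doe.google.co.uk"))
print(is_atomic_domain("docs.google.com"))
```

Checking the longest suffix first also keeps a bare suffix such as co.uk from being reported as a domain under uk.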
I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):
tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i
I don't think it's possible to properly differentiate between a real two part TLD and a subdomain without knowing the actual list of TLDs (ie: you could always construct a subdomain that looks like a TLD if you knew how the regex worked.)
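A rough Python equivalent of the Ruby idea might look like this. The TLDS list is a short illustrative sample, not a complete list, and the names are made up. Longer suffixes are placed first in the alternation and each one is passed through re.escape so that the dots are literal.

```python
import re

# Short illustrative sample; the real list would come from a TLD source.
TLDS = ["com", "co.uk", "eu", "org"]

# Longest suffixes first, with dots escaped.
alternation = "|".join(
    re.escape(tld) for tld in sorted(TLDS, key=len, reverse=True)
)

# One label (letters, digits, inner hyphens), a dot, then a known suffix.
DOMAIN_RE = re.compile(
    r"^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:" + alternation + r")$",
    re.IGNORECASE,
)

for host in ["google.com", "docs.google.com", "google.co.uk"]:
    print(host, bool(DOMAIN_RE.match(host)))
```

The label part cannot contain a dot, which is what rejects subdomains; as noted above, the result is only as good as the suffix list behind it.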
