Remove dynamic time and name combinations using regex - python

I am unsuccessfully trying to use regex to remove time stamps and names from the online conversations I am processing.
The pattern I am trying to remove looks like this: [08:03:16] Name:
It is randomly distributed throughout the conversation instances.
The Name portion of the pattern can be lower or uppercase and can contain multiple names, e.g. Dave, adam Jons, Wei-Xing.
I am using the following regex:
[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
I took it from the question "Find names with Regular Expression", but it only removes names outside the timestamp pattern shown above (and only works for some of the names inside it).
I have been looking through SO for a while now to find something that might help me but nothing has worked across all examples so far.

That looks a lot more complicated than it has to be - might be easier to match the timestamp format, then match characters up until the next : is found (assuming that names can't have :s in them):
\[(?:\d{2}:){2}\d{2}\] [^:]+:
https://regex101.com/r/5i4HId/1
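To show how the pattern above can be applied, here is a minimal Python sketch that strips every "[hh:mm:ss] Name:" prefix with re.sub. The sample conversation and the helper name strip_speaker_tags are made up for illustration, and the optional trailing space in the pattern is an addition that is not part of the answer's regex.

```python
import re

# Timestamp in square brackets, a space, then everything up to the next colon.
# The optional trailing space is an assumption, added so the remaining text
# does not start with a blank.
SPEAKER_TAG = re.compile(r"\[(?:\d{2}:){2}\d{2}\] [^:]+: ?")

def strip_speaker_tags(text):
    """Remove every '[hh:mm:ss] Name:' prefix from a conversation string."""
    return SPEAKER_TAG.sub("", text)

# Hypothetical sample using the name styles mentioned in the question.
sample = (
    "[08:03:16] Dave: hello there\n"
    "[08:03:20] adam Jons: hi Dave\n"
    "[08:04:01] Wei-Xing: morning all"
)
cleaned = strip_speaker_tags(sample)
print(cleaned)
```

Because [^:]+ accepts any character except a colon, a timestamp that is not followed by a name would cause the match to run on to the next colon in the text, so it may be worth restricting the character class if that can happen in your data.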

Related

Modify the data found between two recurring patterns in a multi-line string

I have a multi-line string of around 10000-40000 characters (the length changes with the data returned by an API). In this string, there are a number of tables (they are part of the string, but formatted in a way that makes them look like tables). The tables always follow a repeating pattern. The pattern looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents in html on a locally hosted webpage, and I want to have the heading of the tables displayed in a specific way (think color, font size). For that, I'm using the python regex module to identify the pattern, but I'm failing to do so due to inexperience in using the re module. To modify the part that I need modified, I'm using the below piece of code:
re.sub(r'\={78}.*\-{78}',some_replacement_string, complete_multi_line_string)
But the above piece of code is not giving me the output I require, since it is not matching the pattern properly (I'm sure the mistake is in the pattern I'm asking re.sub to match).
However:
re.sub(r'\-{78}',some_replacement_string, complete_multi_line_string)
is working as it's returning the string with the replacement, but the slight problem here is that there are multiple ------------------------------------------------------------------------------s in the code that I do not want modified. Please help me out here. If it is helpful, the output that I'm wanting is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there is a newline (\n) after each ============================================================================== line, each <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span> line and each ------------------------------------------------------------------------------ line, if that helps in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 = and 78 - in all the occurrences.
You should try using the following version:
re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Match on lazy dot .*? instead of greedy dot .*, to ensure that we don't match across header sections
Match with the re.S flag, so that .*? will match across newlines
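As a quick check of the lazy-dot approach, the sketch below runs the substitution on a small made-up string with two tables. The table contents and variable names are hypothetical, and the replacement writes the \n characters back so the rules and the heading stay on separate lines.

```python
import re

RULE_EQ = "=" * 78
RULE_DASH = "-" * 78

# Hypothetical stand-in for the API output: two tables back to back.
text = "\n".join([
    RULE_EQ, "NAME   QTY", RULE_DASH, "apple  3",
    RULE_EQ, "CITY   ZIP", RULE_DASH, "Oslo   0150",
])

# Lazy .*? stops at the first dashed rule after each '=' rule; re.S lets it
# cross newlines if a heading spans more than one line. The \n in the
# replacement keeps the rules and the heading on separate lines.
pattern = re.compile(r"(={78})\n(.*?)\n(-{78})", flags=re.S)
result = pattern.sub(r"\1\n<span>\2</span>\n\3", text)

# One closing tag per table heading.
print(result.count("</span>"))
```

Only the line between a row of = and the following row of - is wrapped, so dashed rules that appear elsewhere in the string are left alone.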

Text between delimiters starting from the end of the string

I'm really new to Python and programming in general, and to practice I'm doing projects that tackle problems from my day-to-day work, so please excuse me if this is a silly question. I'm trying to combine a group of files located in a remote folder into a single monthly file based on the date. I've already combined files based on date, so I think I can do that part, but I'm having trouble with the regex to pick the date out of the file name. The string with the file path is as follows:
\\machinename123\main folder\subfolder\2021-01-24.csv
The file name always has the same format since it comes from an automated process; only the date in the name changes. I was trying to pick the date out of this string with a regex that selects the text between the last \ of the string and the . before the extension, so that I get 2021-01-24 as a result. At my level, though, regexes are like witchcraft and I don't really know what I'm doing. After a few hours of trial and error, the closest I have got is (?:[0-9\-]), but this selects all the numbers in the string, including the ones in the machine name. I also don't know why it works the way it does (for example, I know from testing that the ?: works, but I don't understand the theory behind it, so I couldn't replicate it in the future).
How can I make it ignore the other numbers on the string, or more specifically pick only the text between the last \ and the . from the csv, xlsx or whatever the format is?
I'd like the former option better, since it would allow me to learn how to make it do what I need it to do and not get the result by coincidence.
Thanks for any help
Use re.search() to find a pattern: <4 digits>-<2 digits>-<2 digits>.
import re
s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
# group(0) is the whole matched text, here the date
m = re.search(r'\d{4}-\d{2}-\d{2}', s).group(0)
You can use the following regex, that summarizes the structure of your full path:
import re
# \\ at the start, then one or more "folder\" segments, then the date and the .csv extension
filename_regex = re.compile(r'^\\\\(?:[\w ]+\\)+(\d{4}-\d{2}-\d{2})\.csv$')
m = filename_regex.match(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
if m is not None:
    print(f'File found with date: {m.groups()[0]}')
else:
    print('Filename not of interest')
The output will be:
File found with date: 2021-01-24
filename_regex accepts a string that starts with \\, followed by one or more folder names (alphanumeric characters, underscores and spaces), each ending in \, and whose final part consists of 4 digits, a minus, 2 digits, another minus, 2 digits and the string .csv. The regular expression used here to match the date is very simple, but you can use a stricter one if you prefer.
Another, simpler approach is to use the ntpath module to extract the file name from the full path and apply the regular expression only to the file name:
import ntpath
import re
filename_regex = re.compile(r'^(\d{4}-\d{2}-\d{2})\.csv$')
filename = ntpath.basename(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
m = filename_regex.match(filename)
Here m.groups()[0] holds the same date as before.
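The question also asks, more generally, how to pick the text between the last \ and the . of the extension, whatever the extension is. A minimal sketch of that idea follows; the helper name file_stem is hypothetical. Reading the pattern left to right: [^\\]+ is one or more characters that are not a backslash, the parentheses capture that run, \. is a literal dot, [^.\\]+ is the extension, and $ anchors everything to the end of the string. Plain parentheses capture what they match, whereas (?:...) only groups without capturing.

```python
import re

# Capture the run of non-backslash characters that sits directly before the
# final ".extension" at the end of the string.
STEM = re.compile(r"([^\\]+)\.[^.\\]+$")

def file_stem(path):
    """Return the text between the last backslash and the last dot, or None."""
    m = STEM.search(path)
    return m.group(1) if m else None

path = r"\\machinename123\main folder\subfolder\2021-01-24.csv"
print(file_stem(path))
```

The digits in the machine name are ignored because the $ anchor forces the match to end at the end of the string, and the captured part is not allowed to contain a backslash.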

Python- Regular express without order

I want to extract, for example, 2 entities from a sentence, e.g.:
str1 = 'i am tom and i have a car'
I want to extract the word 'tom' or 'jack' as name, if it exists.
I also want to extract the word 'car' or 'bike' as property, if it exists.
Now I can simply write 2 regular expressions:
re.search(r"(?P<name>tom|jack)", str1).group('name')
re.search(r"(?P<property>car|bike)", str1).group('property')
But I wonder if I can combine these two together.
The problem is that I don't know the order of name and property in advance. So the following code
re.search(r"(?P<name>tom|jack).*(?P<property>car|bike)", str1)
does not work for:
str2 = 'i have a car and i am tom'
I tried simply combining the two possible orders:
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property>car|bike).*(?P<name>tom|jack)))", str2)
It gives me a "redefinition of group name" error unless I change it to
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property2>car|bike).*(?P<name2>tom|jack)))", str2)
Question
How can I write a regular expression to extract tom/jack as name and car/bike as property regardless of their order?
Moreover
I don't want to simply list all the possible orders, because there would be too many cases if I want to extract n kinds of entities.
Yes, it's possible, but only with lookaheads; otherwise the characters are consumed and the engine does not go back to look for the other entity earlier in the string.
\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))
Every lookahead in this regex has to succeed for the overall match to succeed. If an entity is not mandatory, make its pattern optional.
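Since the question mentions n kinds of entities, the lookahead idea can be generated from a table instead of being written by hand. The following is a minimal sketch under that assumption; ENTITIES and build_pattern are hypothetical names, and the optional variant simply wraps each lookahead body in (?:...)? so that a missing entity leaves its group unset.

```python
import re

# Hypothetical entity table: group name -> words that count as that entity.
ENTITIES = {"name": ["tom", "jack"], "property": ["car", "bike"]}

def build_pattern(entities, required=True):
    """Build one lookahead per entity so the order in the text is irrelevant."""
    parts = []
    for label, words in entities.items():
        alternatives = "|".join(re.escape(word) for word in words)
        scan = f".*(?P<{label}>{alternatives})"
        # A required entity must be found; an optional one may match nothing.
        parts.append(f"(?={scan})" if required else f"(?=(?:{scan})?)")
    return re.compile(r"\A" + "".join(parts))

strict = build_pattern(ENTITIES)
lenient = build_pattern(ENTITIES, required=False)

print(strict.search("i am tom and i have a car").groupdict())
print(strict.search("i have a car and i am tom").groupdict())
print(strict.search("i have a bike"))
print(lenient.search("i have a bike").groupdict())
```

As in the original patterns, the words are matched as plain substrings, so "tom" would also be found inside "tomorrow"; adding \b around the alternatives is one way to tighten that if needed.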

Regular Expressions: Matching Song names in Python 3

I'm currently working on a project to parse data from a music database and I'm creating a search function using regular expressions in python (version 3.5.1).
I would like to create a regular expression that matches a song name on its own (nothing following the name) and the song name followed by feature details, but not songs whose names merely contain the given song's name (examples may help illustrate my point):
What I'd like to match:
Work
Work (ft. Drake)
What I would NOT like to match:
Work it
Workout
My current regular expression is ' /Work(\s(\w+)?/ ' but this matches all 4 example cases.
Can someone help me figure out an expression to accomplish this?
Personally, I'd go with something like
^Work(?:\s+\(.+\))?$
which will match your two provided test cases, but not the two you want to avoid. If you want to make it a bit more specific regarding matching who the artist is, you can go with something like
^Work(?:\s+\((?:ft\.|featuring).+\))?$
which will still match your two cases, but will only match text in the brackets that starts with "ft." or "featuring".
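For a search function, the title would normally come from user input rather than being hard-coded. Here is a minimal sketch of both patterns built around an arbitrary title; the helper song_pattern and the extra test title "Work (Remix)" are hypothetical, and re.escape is used on the assumption that titles may contain regex metacharacters. String concatenation is used instead of f-strings so that it also runs on Python 3.5.

```python
import re

def song_pattern(title, featuring_only=False):
    """Match the bare title, optionally followed by one parenthesised suffix."""
    inner = r"(?:ft\.|featuring).+" if featuring_only else r".+"
    return re.compile("^" + re.escape(title) + r"(?:\s+\(" + inner + r"\))?$")

titles = ["Work", "Work (ft. Drake)", "Work it", "Workout", "Work (Remix)"]

any_suffix = song_pattern("Work")
feat_only = song_pattern("Work", featuring_only=True)

# Titles accepted when any parenthesised suffix is allowed.
print([t for t in titles if any_suffix.match(t)])
# Titles accepted when the suffix must start with "ft." or "featuring".
print([t for t in titles if feat_only.match(t)])
```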

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
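To compare the two suggestions, the sketch below runs both on a few made-up inputs. The test strings other than 'acaaasda' are hypothetical; the patterns themselves are the ones given in the answers.

```python
import re

# Approach 1: duplicate the shared part and OR the two shapes together.
common = "c.*"
either = re.compile("^(a{common}asda|b{common})$".format(common=common))

# Approach 2: put the "must end in asda" check in a lookahead right after
# the leading 'a', so the named group is only set on that branch.
lookahead = re.compile(r"^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$")

for text in ["acaaasda", "bcxyz", "acxyz"]:
    m1 = either.match(text)
    m2 = lookahead.match(text)
    # Columns: input, approach 1 matched, approach 2 matched, value of <pie>.
    print(text, bool(m1), bool(m2), m2.group("pie") if m2 else None)
```

Both versions accept a string starting with a only when it ends in asda, and accept a string starting with b regardless of its ending; only the lookahead version exposes the pie group.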
