I want to extract, for example, two entities from a sentence, e.g.:
s = 'i am tom and i have a car'
I want to extract the word 'tom' or 'jack' as the name, if it exists.
I also want to extract the word 'car' or 'bike' as the property, if it exists.
I can simply write two regular expressions:
re.search(r"(?P<name>tom|jack)", s).group('name')
re.search(r"(?P<property>car|bike)", s).group('property')
But I wonder if I can combine these two together.
The problem is that I do not know the order of the name and the property, so the following code
re.search(r"(?P<name>tom|jack).*(?P<property>car|bike)", s)
does not work for:
s2 = 'i have a car and i am tom'
I tried simply combining the two possible orders:
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property>car|bike).*(?P<name>tom|jack)))", s2)
but it gives me a "redefinition of group name" error unless I change it to
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property2>car|bike).*(?P<name2>tom|jack)))", s2)
Question
How can I write a regular expression to extract tom/jack as the name and car/bike as the property regardless of their order?
Moreover
I don't want to simply list all the possible orders, because there might be too many cases if I want to extract n kinds of entities.
Yes, it's possible, but only inside lookaheads; otherwise the characters are consumed and the engine does not go back to the start for a new search.
\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))
Every pattern in a regex must match for the overall match to succeed. If some of the patterns are not mandatory, make them optional.
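A minimal Python sketch of the lookahead idea, with both entities made optional as suggested above. The helper name extract_entities and the inner optional groups are my own illustration, not part of the original answer. Because .* is greedy, each group captures the last occurrence in the sentence, and 'tom' would also be found inside a longer word unless \b word boundaries are added.

```python
import re

# Each lookahead scans the whole string from position 0, so order does not matter.
# The inner (?:...)? makes an entity optional: its group is None when absent.
PATTERN = re.compile(
    r"\A(?=(?:.*(?P<name>tom|jack))?)"
    r"(?=(?:.*(?P<property>car|bike))?)"
)

def extract_entities(sentence):
    """Return a dict of the named groups; missing entities map to None."""
    return PATTERN.search(sentence).groupdict()

print(extract_entities("i am tom and i have a car"))
print(extract_entities("i have a bike and i am jack"))
print(extract_entities("i have a car"))
```

Adding an n-th entity only means appending one more lookahead, so the pattern grows linearly instead of listing every possible order.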
I'm trying to replace {x;y} patterns in a text corpus with "x or y", except that the number of elements is variable, so sometimes there will be 3 or more elements, e.g. {x;y;z} (the maximum is 9).
I'm trying to do this with regex, but I'm not sure how to make the replacement depend on the number of elements present. For example, if I use a regex with a variable component like the following
part = '(;[\w\s]+)'
regex = '\(([\w\s]+);([\w\s]+){}?\)'.format(part)
re.sub(regex, r'\1 or \2 or \3', text)
I will sometimes get an additional 'or' (and more if I increase the number of variable elements) when there are only 2 elements present in the braces, which I don't want. The alternative is to do this many times with different number of variable parts but the code would be very clunky. I'm wondering if there are any ways I could achieve this with regex methods? Would appreciate any ideas.
I'm using python3.5 with spyder.
The scenario is just a bit too much for a regular search-and-replace action, so I would recommend passing in a function to dynamically generate the replacement string.
import re
text = 'There goes my {cat;dog} playing in the {street;garden}.'
def replacer(m):
    return m.group(1).replace(';', ' or ')
output = re.sub(r'\{((\w;?)*\w)\}', replacer, text)
print(output)
Output:
There goes my cat or dog playing in the street or garden.
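The answer's pattern only allows single words separated by semicolons. As a hedged variation, the sketch below splits on ';' inside the replacement function, so any number of elements and elements containing spaces are handled. The function name join_alternatives and the sample sentence are hypothetical.

```python
import re

def join_alternatives(text, joiner=" or "):
    """Replace every {a;b;...} group with its elements joined by `joiner`."""
    def replacer(match):
        # Split the brace contents and trim stray spaces around each element.
        parts = [p.strip() for p in match.group(1).split(";")]
        return joiner.join(parts)
    # [^{}]* matches anything between a pair of braces, spaces included.
    return re.sub(r"\{([^{}]*)\}", replacer, text)

sample = "My {cat;dog;pet fish} is in the {street; garden}."
print(join_alternatives(sample))
```

Since the joining happens in Python rather than in the replacement template, no extra 'or' appears when a group has only two elements.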
I am unsuccessfully trying to use regex to remove time stamps and names from the online conversations I am processing.
The pattern I am trying to remove looks like this: [08:03:16] Name:
It is randomly distributed throughout the conversation instances.
The Name portion of the pattern can be lower or uppercase and can contain multiple names, e.g. Dave, adam Jons, Wei-Xing.
I am using the following regex:
[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
From Find names with Regular Expression, but this only removes names outside the timestamp example provided above (and only works for some names in the timestamps).
I have been looking through SO for a while now to find something that might help me but nothing has worked across all examples so far.
That looks a lot more complicated than it has to be - might be easier to match the timestamp format, then match characters up until the next : is found (assuming that names can't have :s in them):
\[(?:\d{2}:){2}\d{2}\] [^:]+:
https://regex101.com/r/5i4HId/1
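A short sketch of how that pattern could be applied with re.sub in Python. The trailing \s* is my addition so that the space after the colon is removed as well; the function name and the sample conversation are made up for illustration.

```python
import re

# Timestamp in square brackets, a space, then everything up to the next colon.
# The trailing \s* (an addition to the answer's pattern) also eats the space after it.
SPEAKER_PREFIX = re.compile(r"\[(?:\d{2}:){2}\d{2}\] [^:]+:\s*")

def strip_speaker_prefixes(conversation):
    """Remove every '[hh:mm:ss] Name:' prefix from the conversation text."""
    return SPEAKER_PREFIX.sub("", conversation)

sample = "[08:03:16] Dave: hello there [08:03:20] adam Jons: hi"
print(strip_speaker_prefixes(sample))
```

This still relies on the assumption stated above that names never contain a colon.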
I have files that follow a specific format, which looks something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
import re
your_string = "test_0800_20180102_filepath.csv"
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1))  # time_date >> 0800_20180102
print(match.group(2))  # time >> 0800
print(match.group(3))  # date >> 20180102
Note that for such tasks the group operator () inside the regexp is really helpful, it allows you to access certain substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
Groups are numbered from 1 up to the number of groups in the pattern, and group 0 is the whole match. The numbers are assigned from left to right, in the order the opening parentheses appear in your pattern.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]
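Since the two fields represent a timestamp, the extracted strings could also be turned into a datetime object. The sketch below assumes the time is HHMM and the date is YYYYMMDD, which matches the examples but is not stated in the question; parse_file_timestamp is a hypothetical helper name.

```python
import re
from datetime import datetime

def parse_file_timestamp(filename):
    """Return a datetime built from the _HHMM_YYYYMMDD_ part, or None if absent."""
    # Assumption: exactly 4 digits of time followed by 8 digits of date.
    match = re.search(r"_(\d{4})_(\d{8})_", filename)
    if match is None:
        return None
    time_part, date_part = match.groups()
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M")

print(parse_file_timestamp("test_0800_20180102_filepath.csv"))
```

Returning None for names that do not fit the format avoids the AttributeError that calling .group() on a failed match would raise.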
I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I could avoid this if I could find out the exact part of the string that pattern.sub is substituting at the moment it does so,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> around each match without replacing every match with the same instance (case insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay, here are two quick ways. The second loop is definitely the way to go: it uses re.sub (as someone else commented too). Bear in mind that it replaces every match with the lowercase search term.
import re
word = "port"
with open("testing.txt", "r") as f:
    # This loop is case-sensitive.
    for line in f:
        newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
        print(newline)
    f.seek(0)  # rewind, otherwise the second loop has nothing left to read
    # This loop is case-insensitive.
    pattern = re.compile(word, re.IGNORECASE)
    for line in f:
        newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
        print(newline)
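To keep the original casing of each match, as the question asks, the replacement can refer back to the matched text with \g<0> instead of inserting the search term. This is a small sketch; the function name highlight and the sample string are my own, and re.escape is added on the assumption that the search word should be treated literally.

```python
import re

def highlight(word, text):
    """Wrap every case-insensitive occurrence of `word` in red bold tags."""
    # re.escape guards against regex metacharacters in the search word.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # \g<0> is the text that was actually matched, so its case is preserved.
    return pattern.sub(r'<b><font color="red">\g<0></font></b>', text)

print(highlight("port", "Port and SUPPORT"))
```

A single sub call handles all occurrences in one pass, so already inserted tags are never matched again and the nested-tag problem from the question does not arise.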
I'm currently working on a project to parse data from a music database and I'm creating a search function using regular expressions in python (version 3.5.1).
I would like to create a regular expression that matches the song name on its own and the song name followed by feature details, but not other songs whose names merely contain the given song's name (examples may help illustrate my point):
What I'd like to match:
Work
Work (ft. Drake)
What I would NOT like to match:
Work it
Workout
My current regular expression is ' /Work(\s(\w+)?/ ' but this matches all 4 example cases.
Can someone help me figure out an expression to accomplish this?
Personally, I'd go with something like
^Work(?:\s+\(.+\))?$
which will match your two provided test cases, but not the two you want to avoid. If you want to make it a bit more specific about the feature details, you can go with something like
^Work(?:\s+\((?:ft\.|featuring).+\))?$
which will still match your two cases, but will only match text in the parentheses that starts with "ft." or "featuring".
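A quick Python check of the stricter pattern against the four examples from the question. The helper title_pattern is hypothetical; it builds the same expression for any title, escapes the title with re.escape, and escapes the dot in "ft." so that it matches a literal period only.

```python
import re

def title_pattern(title):
    """Build a pattern matching `title` alone or followed by '(ft. ...)' details."""
    return re.compile(
        r"^" + re.escape(title) + r"(?:\s+\((?:ft\.|featuring).+\))?$"
    )

candidates = ["Work", "Work (ft. Drake)", "Work it", "Workout"]
matches = [c for c in candidates if title_pattern("Work").match(c)]
print(matches)
```

The ^ and $ anchors are what reject "Work it" and "Workout": nothing other than the optional parenthesised part may follow the title.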