Replace more than one pattern in Python

I have reviewed various links but all showed how to replace multiple words in one pass. However, instead of words I want to replace patterns e.g.
RT #amrightnow: "The Real Trump" Trump About You" Watch Make #1
https:\/\/t.co\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016
https:\/\/t.co\u2026
When I perform the following two commands on the above text, I get the desired output:
result = re.sub(r"http\S+","",sent)
result1 = re.sub(r"#\S+","",result)
This way I am removing all the URLs and hashtags (the # handles) from the tweet. The output will be something like the following:
>>> result1
'RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Trump #trump2016 '
Could someone let me know what is the best way to do it? I will be basically reading tweets from a file. I want to read each tweet and replace these handlers and urls with blanks.

You need the regex "or" operator which is the pipe |:
re.sub(r"http\S+|#\S+","",sent)
If you have a long list of patterns that you want to remove, a common trick is to use join to create the regular expression:
import re
# Raw strings keep the backslashes in \S intact.
to_match = [r'http\S+',
            r'#\S+',
            'something_else_you_might_want_to_remove']
re.sub('|'.join(to_match), '', sent)
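Since the tweets will be read from a file, the joined pattern can be compiled once and applied line by line. This is only a minimal sketch, assuming one tweet per line; the names PATTERNS, STRIP_RE and clean_tweets, and the whitespace squeeze at the end, are illustrative additions and not part of the answer above.

```python
import re

# Patterns to strip; raw strings keep the backslashes intact.
PATTERNS = [r"http\S+", r"#\S+"]

# Build the alternation once and reuse it for every tweet.
STRIP_RE = re.compile("|".join(PATTERNS))


def clean_tweets(lines):
    """Yield each tweet with URLs and hashtags removed."""
    for line in lines:
        cleaned = STRIP_RE.sub("", line)
        # Collapse the runs of spaces left behind by the removals.
        yield " ".join(cleaned.split())


sample = ['RT #amrightnow: "The Real Trump" https://t.co/j58e8aacrE #tcot #pjnet']
print(list(clean_tweets(sample)))
```

An open file object can be passed instead of the sample list, because iterating over a file yields one line at a time.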

You can use an "or" pattern by separating the patterns with |:
import re
s = u'RT #amrightnow: "The Real Trump" Trump About You" Watch Make #1 https:\/\/t.co\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016 https:\/\/t.co\u2026'
result = re.sub(r"http\S+|#\S+", "", s)
print(result)
Output
RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Tru mp #trump2016
See the subsection '|' in the regular expression syntax documentation.

Related

How can I capture all sentences in a file with the format of (name): (sentence)\n(name):

I have files of transcripts where the format is
(name): (sentence)\n (<-- There can be multiples of this pattern)
(name): (sentence)\n
(sentence)\n
and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names in the file, but I need it to be generic.
utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)
Python 3.6 using re. Or if anyone knows how to do this using spacy, that would be a great help, thanks.
I want to just grab the \n after an empty statement, and put it in its own string. And I suppose I will just have to grab the tape information given at the end of this, for example, since I can't think of a way to distinguish if the line is part of someone's speech or not.
Also sometimes, there's more than one word between start of line and colon.
Mock data:
CRO: How far are you from the World Trade Center, how many blocks, about? Three or
four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
You can use a lookahead expression that looks for the same pattern of a name at the beginning of a line and is followed by a colon:
s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)
This outputs:
[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
('CALLER', ''),
('CRO', "You're welcome. Thank you.\n"),
('OPERATOR', 'Bye.\n'),
('CRO', 'Bye.\n'),
('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
('OPERATOR NEWELL', 'blah blah.\n'),
('GUY IN DESK', 'I speak words!')]
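If only the spoken text is needed, the (speaker, text) tuples from the pattern above can be post-processed. The following is a rough sketch that reuses the same regex; the names SPEAKER_TURN and utterances are made up for illustration, and it assumes that empty turns such as "CALLER:" should be dropped and that line breaks inside a turn can be folded into spaces.

```python
import re

# Same pattern as in the answer above, compiled once.
SPEAKER_TURN = re.compile(
    r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
    flags=re.MULTILINE | re.DOTALL,
)


def utterances(transcript):
    """Return (speaker, text) pairs, skipping turns with no text."""
    turns = []
    for speaker, text in SPEAKER_TURN.findall(transcript):
        # Fold newlines and repeated spaces into single spaces.
        text = " ".join(text.split())
        if text:
            turns.append((speaker, text))
    return turns


sample = "CRO: Three or\nfour blocks?\nCALLER:\nOPERATOR NEWELL: Bye.\n"
print(utterances(sample))
```

Note that any continuation line that itself contains a colon would be treated as a new speaker, so transcripts with such lines need extra handling.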
You never gave us mock data, so I used the following for testing purposes:
name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.
We can try matching using the following pattern:
^\S+:\s+((?:(?!^\S+:).)+)
This can be explained as:
^\S+:\s+ match the name, followed by colon, followed by one or more space
((?:(?!^\S+:).)+) then match and capture everything up until the next name
Note that this handles the edge case of the final sentence: no further name follows it, so the negative lookahead never blocks the match and all remaining content is captured.
Code sample:
import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)
['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']

Find a substring in block of text, unless it is part of another substring

I was looking for an efficient way to find a substring between two expressions, unless the expression is a part of another.
For example:
Once upon a time, in a time far far away, dogs ruled the world. The End.
If I was searching for the substring between time and end, I would receive:
in a time far far away, dogs ruled the world. The
or
far far away, dogs ruled the world. The
I want to ignore if time is a part of Once upon a time. I didn't know if there was a pythonic method without using crazy for loops and if/else cases.
This is possible in regex by using a negative lookahead:
>>> import re
>>> s = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
>>> pattern = r'time((?:(?!time).)*)End'
>>> re.findall(pattern, s)
[' far far away, dogs ruled the world. The ']
With multiple matches:
>>> s = 'a time b End time c time d End time'
>>> re.findall(pattern, s)
[' b ', ' d ']
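The same idea can be wrapped in a helper that builds the pattern for arbitrary delimiters. This is a sketch only: the function name between is hypothetical, and re.escape is added on the assumption that the delimiters are literal strings rather than regex fragments.

```python
import re


def between(start, end, text):
    """Return the text between start and end, using the last start before each end."""
    s, e = re.escape(start), re.escape(end)
    # (?:(?!start).)* consumes one character at a time, but only while
    # the start delimiter does not begin at the current position.
    pattern = re.compile(s + r"((?:(?!" + s + r").)*)" + e)
    return pattern.findall(text)


story = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
print(between('time', 'End', story))
print(between('time', 'End', 'a time b End time c time d End time'))
```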
Just remove 'Once upon a time' and check what's left.
my_string = 'Once upon a time, in a time far far away, dogs ruled the world. The End.'
if 'time' in my_string.replace('Once upon a time', ''):
    pass
The typical solution here is to use capturing and non-capturing regular expression groups. Since regex alternations are tried from left to right, place any exceptions to the rule first (without a capture group) and end with the alternation that you want to select for.
import re
text = "Once upon a time, in a time far far away, dogs ruled the world. The End."
query = re.compile(r"""
    Once\ upon\ a\ time|   # literally 'Once upon a time' (spaces are escaped
                           # because re.X ignores bare whitespace);
                           # should not be selected
    time\b                 # from the word 'time'
    (.*)                   # capture everything
    \bend                  # until the word 'end'
""", re.X | re.I)
result = query.findall(text)
# result = ['', ' far far away, dogs ruled the world. The ']
You can strip out the empty group (that got put in when we matched the unwanted string)
result = list(filter(None, result))
# or result = [r for r in result if r]
# [' far far away, dogs ruled the world. The ']
and then strip the results
result = list(map(str.strip, filter(None, result)))
# or result = [r.strip() for r in result if r]
# ['far far away, dogs ruled the world. The']
This solution is particularly useful when you have a number of phrases you're trying to dodge.
phrases = ["Once upon a time", "No time like the present", "Time to die", "All we have left is time"]
querystring = r"time\b(.*)\bend"
query = re.compile("|".join(map(re.escape, phrases)) + "|" + querystring, re.I)
result = [r.strip() for r in query.findall(some_text) if r]
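For reference, here is a runnable version of the snippet above with a concrete input. The function name extract and the shortened phrase list are only for illustration.

```python
import re

phrases = ["Once upon a time", "No time like the present"]
querystring = r"time\b(.*)\bend"

# The phrases to dodge come first, so they are consumed before the
# bare word 'time' gets a chance to match inside them.
query = re.compile("|".join(map(re.escape, phrases)) + "|" + querystring, re.I)


def extract(text):
    """Return the stripped captures, dropping the empty ones from dodged phrases."""
    return [r.strip() for r in query.findall(text) if r]


some_text = "Once upon a time, in a time far far away, dogs ruled the world. The End."
print(extract(some_text))
```

Keep in mind that this only protects phrases that appear before the wanted 'time': once time\b has matched, the greedy (.*) swallows everything up to the last 'end', including any later phrase.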

NLTK: How can I extract information based on sentence maps?

I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because the phrases won't always be nouns.
The way sentences are phrased differs, so it can't be words[x] ... because the positioning of the words changes.
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
...
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
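The conversion can also be done by splitting the template at the placeholders and escaping only the literal pieces, which avoids depending on how re.escape treats '#' in a given Python version. This is a sketch under the assumption that placeholders look like #name with a valid Python identifier; template_to_regex and extract are hypothetical helper names.

```python
import re


def template_to_regex(template):
    """Turn 'First, #action; ...' into a compiled regex with named groups."""
    # With a capturing group, re.split returns literal text at even
    # indexes and placeholder names at odd indexes.
    parts = re.split(r"#([A-Za-z_]\w*)", template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2:
            pieces.append("(?P<%s>.*)" % part)
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def extract(template, sentence):
    """Return a dict of placeholder values, or None if the sentence does not fit."""
    mo = template_to_regex(template).fullmatch(sentence)
    return mo.groupdict() if mo else None


print(extract("First, #action; Second, Foobar",
              "First, Dance and Code; Second, Foobar"))
```

To find which of several templates a sentence follows, call extract with each template and keep the ones that do not return None.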

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
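The following regex would work, assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Explanation: .* greedily matches the title, and the lookahead (?=.\d{4}.*\d{3,}p) makes it stop just before a separator, a four-digit year and a later resolution such as 1080p. The dots inside the matched title still have to be replaced with spaces.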
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by this pattern:
"(?:\.|\d{4,4}.+$)"
with a blank and strip() the result afterwards.
For example:
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
Star Wars Episode 4 A New Hope
Interstellar
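The ideas from the answers above can be combined into one small helper. This is only a sketch: movie_title and YEAR_AND_REST are made-up names, and it assumes the title is followed by a release year starting with 19 or 20 that is separated from the title by a dot or a space.

```python
import re

# A separator, then a year such as 1977 or 2014, then the rest of the name.
YEAR_AND_REST = re.compile(r"[. ](?:19|20)\d{2}\b.*$")


def movie_title(name):
    """Cut the name at the release year and turn the dots into spaces."""
    title = YEAR_AND_REST.sub("", name)
    return title.replace(".", " ").strip()


for name in ("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY",
             "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG",
             "Iron Man 3"):
    print(movie_title(name))
```

Requiring a separator before the year means a title that begins with a year-like number is not cut at its first character, and a name without any year is returned with only the dots replaced.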

Regex Greediness

This might be a hard regular expression question, but I couldn't solve it. Here is my regular expression:
regex = (^|(?<= ))Football( ((\S+ )+?(?=Football)|(\S+ )+)| )fun( ((\S+ )+?(?=Football)|(\S+ )+)| )Football\ is\ important((?= )|$)
With that I'd like to catch these:
text1 = "Football is fun I like Football is important"
text2 = "Fun to watch Football I think Football is important"
text3 = "Fun to watch Football I like Football"
but not this:
text4 = "Football is fun I like Football Football is important"
As far as I understand, the expression shouldn't have matched text4 because there is one more Football in it. The second ( ((\S+ )+?(?=Football)|(\S+ )+)| ) part should have matched I like, because Football follows it and the quantifier is not greedy (I added ? after the second +). The last part should have matched Football is important, which leaves one Football (in the middle) hanging around. How can I modify it so that it does what I need?
More clarification about the question:
The ( ((\S+ )+?(?=Football)|(\S+ )+)| ) part should match non-whitespace words until it sees Football and return what it got. So this regex shouldn't have matched text4, because there are only two Football in the regex. On the other hand, text4 contains three Football. Hope it's clearer now.
Sorry for the silly example; I changed my real text.
The word fun is mandatory after the first occurrence of Football, so the second and third sentences can't match since there's no fun there ;)
text4 is a bit more complicated to explain. It matches because the second occurrence of ( ((\S+ )+?(?=Football)|(\S+ )+)| ) matches I like Football.
Every word is matched with the inner part (\S+ )+?.
You're right that +? is lazy here, but there are two possibilities for the inner part:
match I like (Football)
match I like Football (Football)
Both are valid for (\S+ )+?(?=Football). Which of them ends up being the shortest usable match depends only on what comes next.
Example
Use the pattern (\S+ )+?(?=Football)Football with the text I like Football Football. It will match I like Football (as you expected).
Now, modify the pattern to (\S+ )+?(?=Football)Football$. You'll see that now the complete text is matched. $ cannot match if you stop at the first occurrence of Football. The rest of the text has to match too, and since Football can be matched by \S+, everything is perfectly valid.
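The two cases can be checked directly. The last pattern is not from the question or this answer: it is one possible way, shown here as an assumption, to stop the filler words from swallowing Football by putting a negative lookahead in front of each word.

```python
import re

text = "I like Football Football"

# Lazy quantifier with nothing required afterwards: stops at the first Football.
m1 = re.search(r"(\S+ )+?(?=Football)Football", text)
print(m1.group(0))

# Same pattern anchored with $: backtracking lets \S+ consume the first Football.
m2 = re.search(r"(\S+ )+?(?=Football)Football$", text)
print(m2.group(0))

# Possible fix (an assumption): a filler word may not start with Football.
strict = re.compile(r"(?:(?!Football)\S+ )+?Football")
print(strict.fullmatch(text))
print(strict.fullmatch("I like Football").group(0))
```

The same (?:(?!Football)\S+ ) building block could replace each (\S+ ) in the original expression, with the caveat that it also rejects filler words that merely begin with Football.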
Hope that helps a bit.
