python regex for people names - python

Hello, I have tried to extract all the names from the following string:
import re
def Find(string):
    url = re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
    return url
string = 'Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone used to run a restaurant with J. Edgar Hoover.'
print(Find(string))
But I have a problem with the output: it doesn't include the "J." before "Edgar":
['Arnold Schwarzenegger', 'Sylvester Stalone', 'Edgar Hoover']
Another question for you :)
I have tried to extract the URLs from the second string, but I have a problem.
I need to write a regex that returns them without www, http, or https, as in this example:
import re
def Find(string):
    url = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', string)
    return url
string = 'To learn about pros/cons of data science, go to http://datascience.net. Alternatively, go to datascience.net/2020/'
print(Find(string))
The output is:
['http://datascience.net.']
Thanks

Question 1
Here's a regex that works for that specific case of three names:
((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)
yields
Arnold Schwarzenegger
Sylvester Stalone
J. Edgar Hoover
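The pattern in the question drops the "J." because its first token, [A-Z][a-z]+, requires at least one lowercase letter, so an initial can never start a match. A minimal sketch of the regex above in Python, assuming names look like "First Last" or "I. First Last" (the name find_names is only illustrative, not from the question):

```python
import re

# Optional leading initial ("J. "), then two capitalised words.
# Only non-capturing groups are used, so findall returns the full match.
NAME_RE = re.compile(r'(?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+')

def find_names(text):
    """Return every 'First Last' or 'I. First Last' name found in text."""
    return NAME_RE.findall(text)

text = ('Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone '
        'used to run a restaurant with J. Edgar Hoover.')
print(find_names(text))
# ['Arnold Schwarzenegger', 'Sylvester Stalone', 'J. Edgar Hoover']
```

Note that this would also match any two adjacent capitalised words, such as the first two words of a sentence, so it is only suitable for input shaped like the sample.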
Question 2
(?:http)?s?(?:\:\/\/)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[\./][A-Za-z0-9]+)*\/?)
yields, in capture group 1 (the part that re.findall returns):
datascience.net
datascience.net/2020/
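A hedged sketch of how this could be used from Python. It assumes a slightly tidied variant of the pattern: [A-Za-z] instead of [A-z] (the latter also matches characters such as "_" and "^"), and an escaped dot after www. The name find_urls is only illustrative. Because the pattern has a single capture group, re.findall returns just the part after the scheme and "www.":

```python
import re

# Scheme and "www." are matched but not captured; group 1 is the bare address.
URL_RE = re.compile(
    r'(?:https?://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)'
)

def find_urls(text):
    """Return the addresses in text without http://, https:// or www."""
    return URL_RE.findall(text)

string = ('To learn about pros/cons of data science, go to http://datascience.net. '
          'Alternatively, go to datascience.net/2020/')
print(find_urls(string))
# ['datascience.net', 'datascience.net/2020/']
```

The trailing full stop after "datascience.net" is left out because a dot only continues the match when it is followed by a letter or digit.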

Related

How do I match a string in a pandas column then return what follows it?

I have a pandas dataframe with a column containing Twitter profile descriptions. In some of these descriptions, there are strings like 'insta: profile_name'.
How can I write a line of code that searches for a string (e.g. 'insta:' or 'instagram:') and then returns the rest of the string that follows it?
1252: 'lad who loves to cook 🥘 • insta: xxx',
1254: 'founder and head chef | insta: xxx |',
1992: '🇬🇧 |bakery instagram - xxx',
2291: 'insta: #xxx for enquiries'
2336: 'self taught baker. ig:// xxxx 🍥🧆',
You can use a regex to match each of the keywords, such as "insta".
The code should be something like this (pseudocode; the keyword list and the 'Regex Syntax' part are placeholders):
import re
container = list()
for word in [list of keywords, ex: "insta","face"]:
    _tag = re.findall( word + 'Regex Syntax', the_string_to_find_from)
    container.append([word,_tag])
Then you can unpack the resulting container variable when you want to get the result. I can help you write the regex syntax, but I need more information on how the required information is wrapped in the text.
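As a rough sketch of the idea, here is one way to fill in the placeholder using only the re module. The keyword list, the separator set (":", "/", "-") and the function name text_after_keyword are assumptions based on the sample descriptions in the question, not something the question specifies:

```python
import re

# Hypothetical keyword list; longer alternatives come first for readability.
KEYWORDS = ["instagram", "insta", "ig"]

# keyword, optional spaces, one or more separators, optional spaces, the rest
PATTERN = re.compile(
    r'\b(?:' + '|'.join(KEYWORDS) + r')\b\s*[:/\-]+\s*(.*)',
    re.IGNORECASE,
)

def text_after_keyword(description):
    """Return whatever follows the first keyword, or None if none is found."""
    match = PATTERN.search(description)
    if match is None:
        return None
    # Trim spaces and the stray punctuation seen in the sample rows.
    return match.group(1).strip(" |',")

print(text_after_keyword('founder and head chef | insta: xxx |'))  # xxx
print(text_after_keyword('bakery instagram - xxx'))                # xxx
```

The same pattern could be applied to a dataframe column row by row; that step is left out here to keep the sketch free of third-party imports.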
Answer provided by Nk03 in the comments:
df['name'].str.extract(pat = r'(insta:|ig:)(.*)')[1].str.strip('\',')

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I would like to use regular expressions, but if there is a more efficient and faster way, please point it out.
I also want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
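Applied to the string from the question, the same regex-free idea might look like the sketch below. It assumes every record contains the marker " CiteSeerX" and that the wanted year is the last whitespace-separated token; the function name split_citation is only illustrative:

```python
def split_citation(record, marker=" CiteSeerX"):
    """Return (text before the marker, last token after it), or None."""
    title, found, rest = record.partition(marker)
    if not found:
        return None  # marker missing, nothing to split on
    tokens = rest.split()
    year = tokens[-1] if tokens else None
    return title, year

col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"
print(split_citation(col4))
# ('May god bless our families studied.', '2004')
```

str.partition splits at the first occurrence only, so it does the find-and-slice from the session above in a single step.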
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match; unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The brackets mark the capturing groups for the parts that we are interested in.
Update:
For the case where there is metadata following the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match; return the first (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')

Remove items in string paragraph if they belong to a list of strings?

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character object buffer. It's a beginner question, I know, but how can I get around this?
Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the correctly formatted text. The text you get will not contain any residual HTML elements.
Example:
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then we can do str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')
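The original TypeError comes from passing the whole list to str.replace(), which expects a single string. A self-contained sketch of the loop above, run on a short made-up HTML string rather than the real page (the sample string and the function name strip_fragments are illustrative only):

```python
import re

def strip_fragments(text, fragments):
    """Remove every occurrence of each fragment from text, one at a time."""
    for fragment in fragments:
        text = text.replace(fragment, '')
    return text

def strip_fragments_re(text, fragments):
    """Same result in one pass: an alternation of the escaped fragments."""
    pattern = '|'.join(re.escape(fragment) for fragment in fragments)
    return re.sub(pattern, '', text)

remove_char = ['<br/>', '</p>', '</div>', '<div class="indent" id="transcript">',
               '<h2>', '</h2>', '<p>']
sample = ('<div class="indent" id="transcript"><h2>Transcript</h2>'
          '<p>Thank you.<br/></p></div>')

print(strip_fragments(sample, remove_char))     # TranscriptThank you.
print(strip_fragments_re(sample, remove_char))  # TranscriptThank you.
```

The single-pass variant scans the text only once, which may matter on long documents; for a handful of fragments the plain loop is just as readable.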

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
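If the resolution is not always 1080p, a variation on the same split-by-field idea is to look for the year field instead. This is a sketch under the assumption that the title is followed by a field that is exactly a four-digit year starting with 19 or 20; the function name movie_title is hypothetical:

```python
import re

YEAR_FIELD = re.compile(r'(?:19|20)\d{2}')

def movie_title(filename):
    """Join the dot-separated fields that come before the year field."""
    fields = filename.split('.')
    for i, field in enumerate(fields):
        # Skip index 0 so a title that is itself a year is not cut away.
        if i > 0 and YEAR_FIELD.fullmatch(field):
            return ' '.join(fields[:i])
    return ' '.join(fields)  # no year found: keep every field

print(movie_title('Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY'))
# Star Wars Episode 4 A New Hope
print(movie_title('Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'))
# Interstellar
```

A title containing a year-like field after its first word would still be truncated at that field, so this remains a heuristic.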
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by this pattern:
"(?:\.|\d{4,4}.+$)"
with a blank, and then strip() the result.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar

Python, Regular Expression Postcode search

I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef
Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and comparing the regex from Wikipedia against yours, the code is:
import re
s="123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City\n"\
"County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH"
#custom
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
#regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
The result is:
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result.
Try
import re
re.findall("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)
You don't need the \b.
#!/usr/bin/env python
import re
ADDRESS="""123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH
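The answers above can be combined into a small Python 3 helper. This is a sketch that reuses the pattern from the question unchanged; the function name find_postcodes is illustrative, and the pattern is only as strict as the original (it does not validate against real postcode areas):

```python
import re

# Pattern from the question, compiled once so it can be reused on many addresses.
POSTCODE_RE = re.compile(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b')

def find_postcodes(address):
    """Return every substring of address that looks like a UK postcode."""
    return POSTCODE_RE.findall(address)

print(find_postcodes('123 Some Road Name Town, City County PA23 6NH'))
# ['PA23 6NH']
```

Use POSTCODE_RE.search(address) instead when only the first postcode is needed; it returns a match object, or None when nothing is found.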
