I'm trying to write a regular expression. I looked at this Stack Overflow post and some others, but I haven't been able to solve my problem.
I'm trying to match part of a street address. I want to capture everything after the directional.
Here are some examples
5XX N Clark St
91XX S Dr Martin Luther King Jr Dr
I was able to capture everything left of the directional with this pattern.
\d+[X]+\s\w
It returns 5XX N and 91XX S
I was wondering how I can take the inverse of the regular expression. I want to return
Clark St and Dr Martin Luther King Jr Dr.
I tried doing
(?!\d+[X]+\s\w)
But it returns no matches.
Use the following pattern:
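import re
s1 = '5XX N Clark St'
s2 = '91XX S Dr Martin Luther King Jr Dr'
# Look behind for a directional letter plus the space after it,
# so the match starts at the street name with no leading space
pattern = r"(?<=\s[NSEW]\s).*"
k1 = re.search(pattern, s1)
k2 = re.search(pattern, s2)
print(k1[0])
print(k2[0])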
Output:
Clark St
Dr Martin Luther King Jr Dr
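As for "taking the inverse": a lookahead such as (?!...) is a zero-width assertion and consumes no text, so negating the original pattern cannot return the remainder of the line. One alternative is to keep the original pattern and put the remainder in a capturing group. The sketch below assumes every address has the shape "digits, one or more X, a single-letter directional, street name"; the helper name street_after_directional is made up for illustration.

```python
import re

# Assumed address shape: "<digits><X's> <N|S|E|W> <street name>"
ADDRESS_RE = re.compile(r"^\d+X+\s+[NSEW]\s+(.+)$")

def street_after_directional(address):
    """Return the text after the directional, or None if the address does not fit."""
    m = ADDRESS_RE.match(address)
    return m.group(1) if m else None

streets = [street_after_directional(a)
           for a in ("5XX N Clark St", "91XX S Dr Martin Luther King Jr Dr")]
print(streets)
```

Anchoring the whole address this way means a line without a directional yields None instead of a partial match.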
We do not necessarily need a regex to get the desired text from each line:
text = ["5XX N Clark St", "91XX S Dr Martin Luther King Jr Dr"]
for line in text:
    # Split off the number and the directional; keep the rest intact
    print(line.split(maxsplit=2)[-1])
The result is:
Clark St
Dr Martin Luther King Jr Dr
Related
Language: Python. I want to remove the periods after prefixes only. For instance, if I use remove1 = re.sub('\.(?!$)', '', text), it removes all periods, not just the ones after prefixes. Can anyone help, please? The text below is an example.
Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.
You can capture what you want to keep, and match the dot that you want to replace.
\b(Mrs?)\.
In the replacement use group 1 like \1
import re
pattern = r"\b(Mrs?)\."
s = ("Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.\n")
result = re.sub(pattern, r"\1", s)
print(result)
Output
Mr and Mrs Jackson live up the street from us. However, Mrs Jackson's son lives in the street parallel to us.
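If more prefixes than Mr and Mrs need the same treatment, the capture-and-replace idea can be generalised by building the alternation from a list. This is only a sketch: the TITLES tuple and the function name strip_title_periods are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical list of prefixes; extend it to suit the data
TITLES = ("Mr", "Mrs", "Ms", "Dr")

# \b(Mr|Mrs|Ms|Dr)\.  captures the prefix and matches the period after it
TITLE_DOT_RE = re.compile(r"\b(" + "|".join(map(re.escape, TITLES)) + r")\.")

def strip_title_periods(text):
    """Drop the period after a known prefix; leave other periods alone."""
    return TITLE_DOT_RE.sub(r"\1", text)

cleaned = strip_title_periods("Dr. Smith met Mr. and Mrs. Jackson. They talked.")
print(cleaned)
```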
I have
text = 'he is Dr. alex dams. He puts up in Washington town since 1990. He has been a very good friend of Dr. kane Andeas and his family'
I want to get the following output using re.findall:
['Dr. alex dams', 'Dr. kane Andeas']
I am using the following code but just getting ['Dr.'] in output.
re.findall("Dr.[a-z\s]+",text)
If the doctors' names always follow the same format, you can search for them with \w+ for a word and \s for a space.
(Dr\.\s\w+\s\w+)
Code
import re
text = 'he is Dr. alex dams. He puts up in Washington town since 1990. He has been a very good friend of Dr. kane Andeas and his family'
re.findall(r'(Dr\.\s\w+\s\w+)', text)
# ['Dr. alex dams', 'Dr. kane Andeas']
While PacketLoss's answer works, it will not catch hyphenated names (like Pearl-Hopson or similar).
I would go for:
text = 'he is Dr. alex dams. He puts up in Washington town since 1990. He has been a very good friend of Dr. kane Andeas and his family'
re.findall(r'(Dr\.\s\S+\s\S+\b)', text)
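To see the difference between the two patterns, here is a small comparison on a made-up sentence that contains a hyphenated surname (the sample text is an assumption, not from the question):

```python
import re

sample = "Dr. anna Pearl-Hopson and Dr. kane Andeas met."

# \w+ stops at the hyphen, so the surname is cut short
with_word_chars = re.findall(r"(Dr\.\s\w+\s\w+)", sample)

# \S+ runs to the next whitespace; the trailing \b trims closing punctuation
with_non_space = re.findall(r"(Dr\.\s\S+\s\S+\b)", sample)

print(with_word_chars)
print(with_non_space)
```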
I've been trying to clean some data with the code below, but my regex won't go past the \n. I don't understand why, because I thought .* should capture everything.
table = "POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO \n2002 present Assistant Professor, Department of Development Sociology, Cornell \n University, Ithaca, NY \n \n1999 2001"
output = table.encode('ascii', errors='ignore').strip()
pat = r'POSITIONS.*'.format(endword)
print pat
regex = re.compile(pat)
if regex.search(output):
    print regex.findall(output)
    pieces.append(regex.findall(output))
the above returns:
['POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO ']
. does not match a newline unless you specify the re.DOTALL (or re.S) flag.
>>> import re
>>> re.search('.', '\n')
>>> re.search('.', '\n', flags=re.DOTALL)
<_sre.SRE_Match object at 0x0000000002AB8100>
regex = re.compile(pat, flags=re.DOTALL)
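A minimal sketch of the effect on data like the question's, using a shortened stand-in for the table text (the sample string is an assumption):

```python
import re

table = ("POSITIONS AND APPOINTMENTS 2006 present Fellow\n"
         "2002 present Assistant Professor\n"
         "1999 2001")

# Without the flag, . stops at the first newline
without_flag = re.search(r"POSITIONS.*", table).group(0)

# With re.DOTALL, . also matches "\n", so .* runs to the end of the string
with_flag = re.search(r"POSITIONS.*", table, flags=re.DOTALL).group(0)

print(repr(without_flag))
print(repr(with_flag))
```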
I refined this expression in my Python console:
texts = re.findall(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text", text)
It works very well and runs almost instantly in the console, but when I put it into my code and run it through the interpreter, it seems to hang.
I tested it again in the console, and it again ran in less than a second.
I checked that the blocking statement is the regex execution and that the text is the same in every run.
What is happening?
----------------------------------------code---------------------------------------------
import re

class Wiki:
    # Regex definition
    search_text_regex = re.compile(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text")

    def search_by_title(self, name, text):
        """ Search the slice (the last) of the text that contains the exact name and return the slice index.
        """
        print("Backoff Launched:")
        # Extract the texts from the Wikipedia pages
        print("\tExtracting Texts from pages...")
        texts = self.search_text_regex.findall(text)  # <= The Regex Launch
        # Find the name in the texts
        print("\tFinding names on text...")
        for index, text in enumerate(texts):
            if name in text:
                return index
        return None
-----------------Source----------------------------------
<page><title>Andrew Johnson</title><id>1624</id><revision><id>244612901</id><timestamp>2008-10-11T18:30:44Z</timestamp><contributor><username>Excirial</username><id>5499713</id></contributor><minor/><comment>Reverted edits by [[Special:Contributions/71.113.103.209|71.113.103.209]] to last version by Soliloquial ([[WP:HG|HG]])</comment><text xml:space="preserve">{{otherpeople2|Andrew Johnson (disambiguation)}}
{{Infobox President
|name=Andrew Johnson
|nationality=American
|image=Andrew Johnson - 3a53290u.png
|caption=President Andrew Johnson, taken in 1865 by [[Mathew Brady|Matthew Brady]].
|order=17th [[President of the United States]]
|vicepresident=none
|term_start=April 15, 1865
|term_end=March 4, 1869
|predecessor=[[Abraham Lincoln]]
|successor=[[Ulysses S. Grant]]
|birth_date={{birth date|mf=yes|1808|12|29}}
|birth_place=[[Raleigh, North Carolina]]
|death_date={{death date and age|mf=yes|1875|7|31|1808|12|29}}
|death_place=[[Elizabethton, Tennessee]]
|spouse=[[Eliza McCardle Johnson]]
|occupation=[[Tailor]]
|party=[[History of the Democratic Party (United States)|Democratic]] until 1864 and after 1869; elected Vice President in 1864 on a [[National Union Party (United States)|National Union]] ticket; no party affiliation 1865–1869
|signature=Andrew Johnson Signature.png
|order2=16th [[Vice President of the United States]]
|term_start2=March 4, 1865
|term_end2=April 15, 1865
|president2=[[Abraham Lincoln]]
|predecessor2=[[Hannibal Hamlin]]
|successor2=[[Schuyler Colfax]]
|jr/sr3=United States Senator
|state3=[[Tennessee]]
|term_start3=October 8, 1857
|term_end3=March 4, 1862
|preceded3=[[James C. Jones]]
|succeeded3=[[David T. Patterson]]
|term_start4=March 4, 1875
|term_end4=July 31, 1875
|preceded4=[[William Gannaway Brownlow|William G. Brownlow]]
|succeeded4=[[David M. Key]]
|order5=17th
|title5=[[Governor of Tennessee]]
|term_start5=October 17, 1853
|term_end5=November 3, 1857
|predecessor5=[[William B. Campbell]]
|successor5=[[Isham G. Harris]]
|religion=[[Christian]] (no denomination; attended Catholic and Methodist services)<ref>[http://www.adherents.com/people/pj/Andrew_Johnson.html Adherents.com: The Religious Affiliation of Andrew Johnson]</ref>
}}
Johnson was nominated for the [[Vice President of the United States|Vice President]] slot in 1864 on the [[National Union Party (United States)|National Union Party]] ticket. He and Lincoln were [[United States presidential election, 1864|elected in November 1864]]. Johnson succeeded to the Presidency upon Lincoln's assassination on April 15, 1865.
==Bibliography==
{{portal|Tennessee}}
{{portal|United States Army|United States Department of the Army Seal.svg}}
{{portal|American Civil War}}
* Howard K. Beale, ''The Critical Year. A Study of Andrew Johnson and Reconstruction'' (1930). ISBN 0-8044-1085-2
* Winston; Robert W. ''Andrew Johnson: Plebeian and Patriot'' (1928) [http://www.questia.com/PM.qst?a=o&d=3971949 online edition]
===Primary sources===
* Ralph W. Haskins, LeRoy P. Graf, and Paul H. Bergeron et al, eds. ''The Papers of Andrew Johnson'' 16 volumes; University of Tennessee Press, (1967–2000). ISBN 1572330910.) Includes all letters and speeches by Johnson, and many letters written to him. Complete to 1875.
* [http://www.impeach-andrewjohnson.com/ Newspaper clippings, 1865–1869]
* [http://www.andrewjohnson.com/09ImpeachmentAndAcquittal/ImpeachmentAndAcquittal.htm Series of [[Harper's Weekly]] articles covering the impeachment controversy and trial]
*[http://starship.python.net/crew/manus/Presidents/aj2/aj2obit.html Johnson's obituary, from the ''New York Times'']
==Notes==
{{reflist|2}}
==External links==
{{sisterlinks|s=Author:Andrew Johnson}}
*{{gutenberg author|id=Andrew+Johnson | name=Andrew Johnson}}
{{s-start}}
{{s-par|us-hs}}
{{s-aft|after=[[Ulysses S. Grant]]}}
{{s-par|us-sen}}
{{s-bef|before=[[James C. Jones]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]|years=October 8, 1857{{ndash}} March 4, 1862|alongside=[[John Bell (Tennessee politician)|John Bell]], [[Alfred O. P. Nicholson]]}}
{{s-vac|next=[[David T. Patterson]]|reason=[[American Civil War|Secession of Tennessee from the Union]]}}
{{s-bef|before=[[William Gannaway Brownlow|William G. Brownlow]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]| years=March 4, 1875{{ndash}} July 31, 1875|alongside=[[Henry Cooper (U.S. Senator)|Henry Cooper]]}}
{{s-aft|after=[[David M. Key]]}}
{{s-ppo}}
{{s-bef|before=[[Hannibal Hamlin]]}}
{{s-ttl|title=[[List of United States Republican Party presidential tickets|Republican Party¹ vice presidential candidate]]|years=[[U.S. presidential election, 1864|1864]]}}
{{Persondata
|NAME= Johnson, Andrew
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION= seventeenth [[President of the United States]]<br/> [[Union (American Civil War)|Union]] [[Union Army|Army]] [[General officer|General]]
|DATE OF BIRTH={{birth date|mf=yes|1808|12|29|mf=y}}
|PLACE OF BIRTH= [[Raleigh, North Carolina]]
|DATE OF DEATH={{death date|mf=yes|1875|7|31|mf=y}}
|PLACE OF DEATH= [[Greeneville, Tennessee]]
}}
{{Lifetime|1808|1875|Johnson, Andrew}}
[[Category:Presidents of the United States]]
[[vi:Andrew Johnson]]
[[tr:Andrew Johnson]]
[[uk:Ендрю Джонсон]]
[[ur:انڈریو جانسن]]
[[yi:ענדרו זשאנסאן]]
[[zh:安德鲁·约翰逊]]</text></revision></page>
I solved it.
The code has a pipeline for cleaning the text that removed some markup necessary for a correct match.
Because of the length of the text, searching for an impossible match takes too much time.
I would use this:
result = re.findall(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>", subject)
(?:(?!</?text>).)* consumes one character at a time, but only after the lookahead verifies that it's not the first character of a <text> or </text> tag.
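Here is a small sketch of that pattern applied to a cut-down stand-in for the Wikipedia dump (the sample string is an assumption; the regex is the one above, compiled once):

```python
import re

TEXT_RE = re.compile(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>")

sample = ('<page><text xml:space="preserve">line one\nline two</text></page>'
          '<page><text>second</text></page>')

# finditer plus the named group keeps only the body of each <text> element
bodies = [m.group("text") for m in TEXT_RE.finditer(sample)]
print(bodies)

# If the closing tag is missing, the pattern simply finds nothing
no_match = TEXT_RE.search("<text>markup was stripped, no closing tag")
print(no_match)
```

Each character is tried by exactly one alternative, so a failed match has far fewer ways to backtrack than the nested alternation in the question.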
How do I parse sentence-case phrases from a passage?
For example, from this passage:
Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882.
We need to generate items like Conan Doyle, Holmes, Dr Joseph Bell, Wendell Scherer, etc.
I would prefer a Pythonic solution if possible.
This kind of processing can be very tricky. This simple code does almost the right thing:
for s in re.finditer(r"([A-Z][a-z]+[. ]+)+([A-Z][a-z]+)?", text):
    print(s.group(0))
produces:
Conan Doyle
Holmes
Dr. Joseph Bell
Doyle
Edinburgh Royal Infirmary. Like Holmes
Bell
Michael Harrison
Ellery Queen
Mystery Magazine
Wendell Scherer
England
To include "Dr. Joseph Bell", you need to allow the period in the string, which also lets in "Edinburgh Royal Infirmary. Like Holmes".
I had a similar problem: Separating Sentences.
The "re" approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, please point it at Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.
Update Following is an "re"-based approach that finds a lot more valid cases. I still don't think that this is a good approach, though. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode, and normalise whitespace at some stage (either on input or on output).
import re
text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""
text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""
pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"
joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()
pattern2 = r"""(?x)
(?:
(?:[ .]|\b%s\b)*
(?:\b[a-z]*[A-Z][a-z]*\b)?
)+
""" % r'\b|\b'.join(joiners)
def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        # Trim the separators that the pattern may pick up at either end
        s = m.group(0).strip(" .'-")
        if s:
            yield s
for t in (text1, text2):
    print("*** text: %s ..." % t[:20])
    print("=== Ned B")
    for s in re.finditer(pattern1, t):
        print(repr(s.group(0)))
    print("=== John M ==")
    for name in get_names(pattern2, t):
        print(repr(name))
Output:
C:\junk\so>\python26\python extract_names.py
*** text: Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text: Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'