I refined this expression in my Python console:
texts = re.findall(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text", text)
It works very well and runs almost instantly when I execute it in the console, but when I put it into my code and run it via the interpreter, it seems to hang.
I tested it again in the console and it ran in less than a second again.
I verified that the blocking statement is the regex execution and that the text is the same in every run.
What is happening?
----------------------------------------code---------------------------------------------
class Wiki:
    # Regex definition
    search_text_regex = re.compile(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text")

    def search_by_title(self, name, text):
        """ Search the slice (the last) of the text that contains the exact name and return the slice index.
        """
        print "Backoff Launched:"
        # extract the text from Wikipedia pages
        print "\tExtracting Texts from pages..."
        texts = self.search_text_regex.findall(text) # <= The Regex Launch
        # find the name in the text
        print "\tFinding names on text..."
        for index, text in enumerate(texts):
            if name in text:
                return index
        return None
-----------------Source----------------------------------
<page><title>Andrew Johnson</title><id>1624</id><revision><id>244612901</id><timestamp>2008-10-11T18:30:44Z</timestamp><contributor><username>Excirial</username><id>5499713</id></contributor><minor/><comment>Reverted edits by [[Special:Contributions/71.113.103.209|71.113.103.209]] to last version by Soliloquial ([[WP:HG|HG]])</comment><text xml:space="preserve">{{otherpeople2|Andrew Johnson (disambiguation)}}
{{Infobox President
|name=Andrew Johnson
|nationality=American
|image=Andrew Johnson - 3a53290u.png
|caption=President Andrew Johnson, taken in 1865 by [[Mathew Brady|Matthew Brady]].
|order=17th [[President of the United States]]
|vicepresident=none
|term_start=April 15, 1865
|term_end=March 4, 1869
|predecessor=[[Abraham Lincoln]]
|successor=[[Ulysses S. Grant]]
|birth_date={{birth date|mf=yes|1808|12|29}}
|birth_place=[[Raleigh, North Carolina]]
|death_date={{death date and age|mf=yes|1875|7|31|1808|12|29}}
|death_place=[[Elizabethton, Tennessee]]
|spouse=[[Eliza McCardle Johnson]]
|occupation=[[Tailor]]
|party=[[History of the Democratic Party (United States)|Democratic]] until 1864 and after 1869; elected Vice President in 1864 on a [[National Union Party (United States)|National Union]] ticket; no party affiliation 1865–1869
|signature=Andrew Johnson Signature.png
|order2=16th [[Vice President of the United States]]
|term_start2=March 4, 1865
|term_end2=April 15, 1865
|president2=[[Abraham Lincoln]]
|predecessor2=[[Hannibal Hamlin]]
|successor2=[[Schuyler Colfax]]
|jr/sr3=United States Senator
|state3=[[Tennessee]]
|term_start3=October 8, 1857
|term_end3=March 4, 1862
|preceded3=[[James C. Jones]]
|succeeded3=[[David T. Patterson]]
|term_start4=March 4, 1875
|term_end4=July 31, 1875
|preceded4=[[William Gannaway Brownlow|William G. Brownlow]]
|succeeded4=[[David M. Key]]
|order5=17th
|title5=[[Governor of Tennessee]]
|term_start5=October 17, 1853
|term_end5=November 3, 1857
|predecessor5=[[William B. Campbell]]
|successor5=[[Isham G. Harris]]
|religion=[[Christian]] (no denomination; attended Catholic and Methodist services)<ref>[http://www.adherents.com/people/pj/Andrew_Johnson.html Adherents.com: The Religious Affiliation of Andrew Johnson]</ref>
}}
Johnson was nominated for the [[Vice President of the United States|Vice President]] slot in 1864 on the [[National Union Party (United States)|National Union Party]] ticket. He and Lincoln were [[United States presidential election, 1864|elected in November 1864]]. Johnson succeeded to the Presidency upon Lincoln's assassination on April 15, 1865.
==Bibliography==
{{portal|Tennessee}}
{{portal|United States Army|United States Department of the Army Seal.svg}}
{{portal|American Civil War}}
* Howard K. Beale, ''The Critical Year. A Study of Andrew Johnson and Reconstruction'' (1930). ISBN 0-8044-1085-2
* Winston; Robert W. ''Andrew Johnson: Plebeian and Patriot'' (1928) [http://www.questia.com/PM.qst?a=o&d=3971949 online edition]
===Primary sources===
* Ralph W. Haskins, LeRoy P. Graf, and Paul H. Bergeron et al, eds. ''The Papers of Andrew Johnson'' 16 volumes; University of Tennessee Press, (1967–2000). ISBN 1572330910.) Includes all letters and speeches by Johnson, and many letters written to him. Complete to 1875.
* [http://www.impeach-andrewjohnson.com/ Newspaper clippings, 1865–1869]
* [http://www.andrewjohnson.com/09ImpeachmentAndAcquittal/ImpeachmentAndAcquittal.htm Series of [[Harper's Weekly]] articles covering the impeachment controversy and trial]
*[http://starship.python.net/crew/manus/Presidents/aj2/aj2obit.html Johnson's obituary, from the ''New York Times'']
==Notes==
{{reflist|2}}
==External links==
{{sisterlinks|s=Author:Andrew Johnson}}
*{{gutenberg author|id=Andrew+Johnson | name=Andrew Johnson}}
{{s-start}}
{{s-par|us-hs}}
{{s-aft|after=[[Ulysses S. Grant]]}}
{{s-par|us-sen}}
{{s-bef|before=[[James C. Jones]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]|years=October 8, 1857{{ndash}} March 4, 1862|alongside=[[John Bell (Tennessee politician)|John Bell]], [[Alfred O. P. Nicholson]]}}
{{s-vac|next=[[David T. Patterson]]|reason=[[American Civil War|Secession of Tennessee from the Union]]}}
{{s-bef|before=[[William Gannaway Brownlow|William G. Brownlow]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]| years=March 4, 1875{{ndash}} July 31, 1875|alongside=[[Henry Cooper (U.S. Senator)|Henry Cooper]]}}
{{s-aft|after=[[David M. Key]]}}
{{s-ppo}}
{{s-bef|before=[[Hannibal Hamlin]]}}
{{s-ttl|title=[[List of United States Republican Party presidential tickets|Republican Party¹ vice presidential candidate]]|years=[[U.S. presidential election, 1864|1864]]}}
{{Persondata
|NAME= Johnson, Andrew
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION= seventeenth [[President of the United States]]<br/> [[Union (American Civil War)|Union]] [[Union Army|Army]] [[General officer|General]]
|DATE OF BIRTH={{birth date|mf=yes|1808|12|29|mf=y}}
|PLACE OF BIRTH= [[Raleigh, North Carolina]]
|DATE OF DEATH={{death date|mf=yes|1875|7|31|mf=y}}
|PLACE OF DEATH= [[Greeneville, Tennessee]]
}}
{{Lifetime|1808|1875|Johnson, Andrew}}
[[Category:Presidents of the United States]]
[[vi:Andrew Johnson]]
[[tr:Andrew Johnson]]
[[uk:Ендрю Джонсон]]
[[ur:انڈریو جانسن]]
[[yi:ענדרו זשאנסאן]]
[[zh:安德鲁·约翰逊]]</text></revision></page>
I solved it.
The code has a cleaning pipeline that removes some of the markup needed for a correct match.
Because the text is so long, searching for an impossible match takes too much time.
I would use this:
result = re.findall(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>", subject)
(?:(?!</?text>).)* consumes one character at a time, but only after the lookahead verifies that it's not the first character of a <text> or </text> tag.
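As a rough illustration of the pattern above, here is a small sketch run on a made-up two-page input (the sample string and the names TEXT_RE and bodies are assumptions for the example, not taken from the question). It only shows that the pattern captures the body of each <text> element, including bodies that span several lines.

```python
import re

# Same pattern as suggested above: (?s) lets "." match newlines, and the
# lookahead stops the scan at the next <text> or </text> tag.
TEXT_RE = re.compile(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>")

# Hypothetical miniature of the dump: two pages, one body spans two lines.
sample = (
    '<page><text xml:space="preserve">first\nbody</text></page>'
    "<page><text>second</text></page>"
)

# Collect the named group from every match.
bodies = [m.group("text") for m in TEXT_RE.finditer(sample)]
print(bodies)
```

Because the pattern requires the literal <text ...> and </text> tags, it finds nothing if an earlier cleaning step has already stripped them, which is worth checking before running it on a long document.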
I'm using the code below to highlight a single matching sequence. (Just copy-paste it into a new Colab notebook and it'll work.)
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
Now when I try highlighting multiple sequences, the code breaks (doesn't show the full text, and doesn't highlight the second or more sequence).
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. North American mainland at least 12,000 years ago, and advanced cultures began to appear later on.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
I need to highlight at least 2 sequences. I'm trying to make some sort of if else statement right now, maybe it'll work. Or is there a better library?
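One possible direction, sketched under simplifying assumptions: instead of using only matching_blocks[0], loop over every block that get_matching_blocks() returns and wrap each one in the highlight codes. To keep the sketch self-contained it splits on whitespace rather than using word_tokenize, and the function name highlight_matches and the min_tokens cutoff are made up for illustration.

```python
from difflib import SequenceMatcher

GREEN = "\x1b[0;30;42m"  # same ANSI highlight code as in the question
RESET = "\x1b[0m"


def highlight_matches(needle, haystack, min_tokens=3):
    """Highlight every block of haystack that SequenceMatcher pairs with needle.

    Assumption: plain whitespace tokenization; min_tokens is an arbitrary
    cutoff that skips very short matches and the zero-length sentinel block.
    """
    a = needle.split()
    b = haystack.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    pos = 0  # index of the next haystack token not yet emitted
    for block in matcher.get_matching_blocks():
        if block.size < min_tokens:
            continue
        out.extend(b[pos:block.b])  # unmatched text before this block
        matched = " ".join(b[block.b:block.b + block.size])
        out.append(GREEN + matched + RESET)
        pos = block.b + block.size
    out.extend(b[pos:])  # unmatched tail after the last block
    return " ".join(out)


result = highlight_matches(
    "alpha beta gamma delta epsilon zeta",
    "x alpha beta gamma y z delta epsilon zeta w",
)
print(result)
```

The matching blocks come back in increasing order of position, so a single running index is enough to emit the unhighlighted text between them and after the last one.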
I am attempting to extract all the events from a wiki article on a date, such as May 9 (for example), and have all those events in a one-column dataframe while also ignoring the <h3> tag sub-headings Pre-1600, 1601–1900, 1901–present. All those events in those subsections should just be concatenated together into one column seamlessly.
I also want to ignore the other sections such as births, deaths, etc which are denoted in <h2> tag as well. So, only the events section is being extracted. The <h2> tag/section of interest is the second in the list as seen here.
import requests, itertools, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/May_9').text, 'html.parser')
h2 = d.find_all("h2")
h2
[<h2 id="mw-toc-heading">Contents</h2>,
<h2><span class="mw-headline" id="Events">Events</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Births">Births</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Deaths">Deaths</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Holidays_and_observances">Holidays and observances</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="References">References</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="External_links">External links</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2>Navigation menu</h2>]
I'm struggling with constructing a function that selects the Events section and then the subsequent <li> tags but ignores the subheadings and the other sections.
I've attempted to separate out the <h2> sections with
data = [[i.name, i] for i in d.find_all(re.compile('h2|ul'))]
new_data = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[0] == 'h2')]
But I'm stuck at this point. If there is a better approach, I'm happy to use it.
You can use .find_previous to check whether the previous <h2> is the Events heading:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/May_9"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for li in soup.select("h3 + ul > li"):
    if (h2 := li.find_previous("h2")) and (h2.find(id="Events")):
        date, event = li.text.replace("–", "-").split(" - ", maxsplit=1)
        print("{:<10} {}".format(date, event))
Prints:
0328 Athanasius is elected Patriarch of Alexandria.[1]
1009 Lombard Revolt: Lombard forces led by Melus revolt in Bari against the Byzantine Catepanate of Italy.
1386 England and Portugal formally ratify their alliance with the signing of the Treaty of Windsor, making it the oldest diplomatic alliance in the world which is still in force.
1450 'Abd al-Latif (Timurid monarch) is assassinated.
1540 Hernando de Alarcón sets sail on an expedition to the Gulf of California.
1662 The figure who later became Mr. Punch makes his first recorded appearance in England.[2]
1671 Thomas Blood, disguised as a clergyman, attempts to steal England's Crown Jewels from the Tower of London.
1726 Five men arrested during a raid on Mother Clap's molly house in London are executed at Tyburn.
1864 Second Schleswig War: The Danish navy defeats the Austrian and Prussian fleets in the Battle of Heligoland.
1865 American Civil War: Nathan Bedford Forrest surrenders his forces at Gainesville, Alabama.
1865 American Civil War: President Andrew Johnson issues a proclamation ending belligerent rights of the rebels and enjoining foreign nations to intern or expel Confederate ships.
1873 Der Krach: Vienna stock market crash heralds the Long Depression.
1877 Mihail Kogălniceanu reads, in the Chamber of Deputies, the Declaration of Independence of Romania. This day became the Independence Day of Romania.
1901 Australia opens its first national parliament in Melbourne.
1911 The works of Gabriele D'Annunzio are placed in the Index of Forbidden Books by the Vatican.
1915 World War I: Second Battle of Artois between German and French forces.
1918 World War I: Germany repels Britain's second attempt to blockade the port of Ostend, Belgium.
1920 Polish-Soviet War: The Polish army under General Edward Rydz-Śmigły celebrates its capture of Kiev with a victory parade on Khreshchatyk.
1926 Admiral Richard E. Byrd and Floyd Bennett claim to have flown over the North Pole (later discovery of Byrd's diary appears to cast some doubt on the claim.)
1927 Old Parliament House, Canberra officially opens.[3]
1936 Italy formally annexes Ethiopia after taking the capital Addis Ababa on May 5.
1941 World War II: The German submarine U-110 is captured by the Royal Navy. On board is the latest Enigma machine which Allied cryptographers later use to break coded German messages.
1942 The Holocaust in Ukraine: The SS executes 588 Jewish residents of the Podolian town of Zinkiv (Khmelnytska oblast. The Zoludek Ghetto (in Belarus) is destroyed and all its inhabitants executed or deported.
1945 World War II: The final German Instrument of Surrender is signed at the Soviet headquarters in Berlin-Karlshorst.
1946 King Victor Emmanuel III of Italy abdicates and is succeeded by Umberto II.
1948 Czechoslovakia's Ninth-of-May Constitution comes into effect.
1950 Robert Schuman presents the "Schuman Declaration", is considered by some people to be the beginning of the creation of what is now the European Union.
1955 Cold War: West Germany joins NATO.
1960 The Food and Drug Administration announces it will approve birth control as an additional indication for Searle's Enovid, making Enovid the world's first approved oral contraceptive pill.
1969 Carlos Lamarca leads the first urban guerrilla action against the military dictatorship of Brazil in São Paulo, by robbing two banks.
1974 Watergate scandal: The United States House Committee on the Judiciary opens formal and public impeachment hearings against President Richard Nixon.
1979 Iranian Jewish businessman Habib Elghanian is executed by firing squad in Tehran, prompting the mass exodus of the once 100,000-strong Jewish community of Iran.
1980 In Florida, United States, Liberian freighter MV Summit Venture collides with the Sunshine Skyway Bridge over Tampa Bay, making a 1,400-ft. section of the southbound span collapse. Thirty-five people in six cars and a Greyhound bus fall 150 ft. into the water and die.
1980 In Norco, California, United States, five masked gunmen hold up a Security Pacific bank, leading to a violent shoot-out and one of the largest pursuits in California history. Two of the gunmen and one police officer are killed and thirty-three police and civilian vehicles are destroyed in the chase.
1987 LOT Flight 5055 Tadeusz Kościuszko crashes after takeoff in Warsaw, Poland, killing all 183 people on board.
1988 New Parliament House, Canberra officially opens.[3]
1992 Armenian forces capture Shusha, marking a major turning point in the First Nagorno-Karabakh War.
1992 Westray Mine disaster kills 26 workers in Nova Scotia, Canada.
2001 In Ghana, 129 football fans die in what became known as the Accra Sports Stadium disaster. The deaths are caused by a stampede (caused by the firing of tear gas by police personnel at the stadium) that followed a controversial decision by the referee.
2002 The 38-day stand-off in the Church of the Nativity in Bethlehem comes to an end when the Palestinians inside agree to have 13 suspected terrorists among them deported to several different countries.[4]
2017 US President Donald Trump fires FBI Director James Comey.[5]
2018 The historic defeat for Barisan Nasional, the governing coalition of Malaysia since the country's independence in 1957 in 2018 Malaysian general election.
2020 The COVID-19 recession causes the U.S. unemployment rate to hit 14.9 percent, its worst rate since the Great Depression.[6]
I am working on splitting a paragraph into sentences.
I googled and found that nltk mostly works well for splitting sentences, but I ran into one problem.
import nltk
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
summary = 'George Stanley McGovern (July 19, 1922 – October 21, 2012) was an American historian, author, U.S. Representative, U.S. Senator, and the Democratic Party presidential nominee in the 1972 presidential election.'
summary = (sent_detector.tokenize(summary))
The result should be just one sentence. However, it returns two sentences.
['George Stanley McGovern (July 19, 1922 \x96 October 21, 2012) was an American historian, author, U.S. Representative, U.S.', 'Senator, and the Democratic Party presidential nominee in the 1972 presidential election.']
I am trying to get a cleaned address from a bunch of addresses.
These are different addresses I have for Harvard University. What I want is to convert all these addresses to "Harvard University".
1) Division of Renal Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard Institutes of Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts 02115, USA
2) FAS Center for Systems Biology, Harvard University, Cambridge, Massachusetts 02138
3) Department of Neurobiology, Howard Hughes Medical Institute, Harvard Medical School, Boston, Massachusetts, United States of America.
Just simple text matching doesn't work. So, I tried difflib.
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
a = "Division of Renal Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard Institutes of Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts 02115, USA"
b = "Harvard University"
print(similar(a, b)) # gives 0.11981566820276497
print(similar(a, "Toronto University")) # gives 0.04608294930875576
But I think this approach won't give correct results for my data set. How can I set a threshold for the similarity? Can anyone recommend a better approach?
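One thing that may make a threshold easier to pick, offered only as a sketch: ratio() divides by the combined length of both strings, so a long address always scores low against a short name. Dividing the matched characters by the length of the short name alone gives a value between 0 and 1 that does not shrink as the address grows. The function name coverage and the 0.5 cutoff below are illustrative assumptions, not part of difflib.

```python
from difflib import SequenceMatcher


def coverage(address, canonical):
    """Share of the canonical name's characters that fall in matching blocks."""
    matcher = SequenceMatcher(None, address, canonical, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(canonical)


address = ("Division of Renal Medicine, Department of Medicine, Brigham and "
           "Women's Hospital, Harvard Medical School, Harvard Institutes of "
           "Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts "
           "02115, USA")

harvard_score = coverage(address, "Harvard University")
toronto_score = coverage(address, "Toronto University")
print(harvard_score, toronto_score)

# 0.5 is an arbitrary cutoff for illustration; it would need tuning on real data.
is_harvard = harvard_score >= 0.5
```

Character-level matching like this can still be fooled by unrelated strings that share many letters, so it is probably best treated as a first filter rather than a final answer.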
I've been trying to clean some data with the code below, but my regex won't go past the \n. I don't understand why, because I thought .* should capture everything.
table = 'POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO \n2002 present Assistant Professor, Department of Development Sociology, Cornell \n University, Ithaca, NY \n \n1999 2001'
output = table.encode('ascii', errors='ignore').strip()
pat = r'POSITIONS.*'.format(endword)
print pat
regex = re.compile(pat)
if regex.search(output):
    print regex.findall(output)
    pieces.append(regex.findall(output))
The above returns:
['POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO ']
The dot (.) does not match a newline unless you specify the re.DOTALL (or re.S) flag.
>>> import re
>>> re.search('.', '\n')
>>> re.search('.', '\n', flags=re.DOTALL)
<_sre.SRE_Match object at 0x0000000002AB8100>
regex = re.compile(pat, flags=re.DOTALL)
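To see the difference on data shaped like the question's, here is a short sketch using a shortened, made-up version of the table string (the sample text and variable names are assumptions for the example). Without the flag the match stops at the first newline; with re.DOTALL it runs to the end of the string.

```python
import re

# Shortened stand-in for the question's string, with embedded newlines.
table = ("POSITIONS AND APPOINTMENTS 2006 present Fellow, Denver, CO \n"
         "2002 present Assistant Professor, Cornell \n University, Ithaca, NY")

# Default behaviour: "." stops at the first "\n".
without_flag = re.findall(r"POSITIONS.*", table)

# With DOTALL, "." also matches "\n", so ".*" consumes the rest of the string.
with_flag = re.findall(r"POSITIONS.*", table, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```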