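Lang: Python. I want to remove the period only after title prefixes such as Mr. and Mrs. With regex, for instance remove1 = re.sub(r'\.(?!$)', '', text), every period except the one at the very end of the text is removed. I have only managed to remove all periods, not just those after prefixes. Can anyone help, please? Take the text below as an example.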
Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.
You can capture what you want to keep, and match the dot that you want to replace.
\b(Mrs?)\.
In the replacement use group 1 like \1
import re
pattern = r"\b(Mrs?)\."
s = ("Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.\n")
result = re.sub(pattern, r"\1", s)
print(result)
Output
Mr and Mrs Jackson live up the street from us. However, Mrs Jackson's son lives in the street parallel to us.
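If more prefixes need the same treatment, the alternation can be built from a list instead of being written by hand. This is only a sketch: the TITLES list and the function name strip_title_dots are assumptions for illustration and are not part of the question.

```python
import re

# Hypothetical list of title prefixes; extend as needed.
TITLES = ["Mr", "Mrs", "Ms", "Dr"]

# Builds \b(Mr|Mrs|Ms|Dr)\. ; re.escape guards against metacharacters in a title.
TITLE_DOT = re.compile(r"\b(" + "|".join(map(re.escape, TITLES)) + r")\.")

def strip_title_dots(text):
    """Remove the period that directly follows a known title prefix."""
    return TITLE_DOT.sub(r"\1", text)

sample = "Mr. and Mrs. Jackson met Dr. Lee. They left."
print(strip_title_dots(sample))
```

Periods that end a sentence are left alone because they are not preceded by one of the listed prefixes.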
I'm trying to create a regex expression. I looked at this stackoverflow post and some others but I haven't been able to solve my problem.
I'm trying to match part of a street address. I want to capture everything after the directional.
Here are some examples
5XX N Clark St
91XX S Dr Martin Luther King Jr Dr
I was able to capture everything left of the directional with this pattern.
\d+[X]+\s\w
It returns 5XX N and 91XX S
I was wondering how I can take the inverse of the regex expression. I want to return
Clark St and Dr Martin Luther King Jr Dr.
I tried doing
(?!\d+[X]+\s\w)
But it returns no matches.
Use the following pattern:
import re
s1='5XX N Clark St'
s2='91XX S Dr Martin Luther King Jr Dr'
pattern=r"(?<=\b[NSEW]\s).*"
k1=re.search(pattern,s1)
k2=re.search(pattern,s2)
print(k1[0])
print(k2[0])
Output:
Clark St
Dr Martin Luther King Jr Dr
We do not necessarily need regex to get the desired text from each line:
text = ["5XX N Clark St", "91XX S Dr Martin Luther King Jr Dr"]
for line in text:
    print(line.split(maxsplit=2)[-1])
The result is:
Clark St
Dr Martin Luther King Jr Dr
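Another option, closer to the original attempt, is to match the house number and the directional and capture the remainder in a group, rather than trying to invert the pattern. A minimal sketch, assuming every address has the form number, single directional letter, street name; the function name street_name is made up for illustration.

```python
import re

# House number (digits followed by X's), one directional letter, then the rest.
ADDRESS = re.compile(r"^\d+X+\s+[NSEW]\s+(.+)$")

def street_name(address):
    """Return the part after the directional, or None if the format differs."""
    m = ADDRESS.match(address)
    return m.group(1) if m else None

for line in ["5XX N Clark St", "91XX S Dr Martin Luther King Jr Dr"]:
    print(street_name(line))
```

Returning None for lines that do not fit the assumed format makes unexpected input easy to spot.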
I am trying to remove \n and \t that show up in data scraped from a webpage.
I have used the strip() function, however it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code :
import urllib.request
from bs4 import BeautifulSoup
import sys
all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')  # actual comment content
        data = ''.join([i for i in data if ord(i) < 128])
        all_comments.append(data)
    # getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')
print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
strip() only removes characters (whitespace by default) from the ends of a string. To remove characters inside a string you need to use either replace or re.sub.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
To remove the \t and \n characters.
Use replace instead of strip:
div = "\n blablabla \t blablabla"
div = div.replace('\n', '')
div = div.replace('\t', '')
print(repr(div))
Result:
' blablabla  blablabla'
Some explanation:
strip doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it cannot remove them from the middle.
Some examples:
div = "\nblablabla blablabla"
div = div.strip('\n')
print(repr(div))
Result:
'blablabla blablabla'
A character anywhere between the start and the end will not be removed:
div = "blablabla \n blablabla"
div = div.strip('\n')
print(repr(div))
Result:
'blablabla \n blablabla'
You can split your text (which removes all whitespace, including tabs) and join the fragments back together, using a single space as the "glue":
data = " ".join(data.split())
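To see how the split/join idiom compares with the re.sub approach from the other answers, here is a small check on a made-up string (the sample text is an assumption, not scraped data). The two agree here; they can differ on whitespace that the character class [\t\n ] does not list, such as \r.

```python
import re

raw = "Hello\n  Sir....\tthis is\n\n a comment \n"

# str.split() with no arguments splits on any run of whitespace and drops empty pieces.
joined = " ".join(raw.split())

# Regex equivalent for tabs, newlines and spaces; strip() trims the ends.
subbed = re.sub(r"[\t\n ]+", " ", raw).strip()

print(repr(joined))
print(joined == subbed)
```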
As others have mentioned, strip only removes whitespace from the start and end of a string. You need to remove specific characters, i.e. \t and \n in your case, wherever they occur.
With regex (re) this is easily possible. Specify the pattern (to select the characters you need to replace). The method you need is sub (substitute):
import re
data = re.sub(r'[\t\n ]+', ' ', data)
re.sub(<pattern to replace>, <replacement>, <string>): above, the pattern is [\t\n ]+, where + means one or more and [ ] specifies the character class.
To handle sub and strip in single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']
// using javascript
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t]+/gm, '');
}
const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content  some  more  content' (the spaces around each removed character remain)
Using regular expressions in Python 3.4, how would I extract the city names from the text below?
replacement windows in seattle wa
basement remodeling houston texas
siding contractor new york ny
windows in elk grove village
Sometimes the city name before it has \sin\s, sometimes it doesn't. Sometimes it has a general word like 'windows', 'remodeling', ... anything. Sometimes there is no state full name or state abbreviation at the end.
Is there a single regular expression that can capture these above conditions?
Here's what I've tried so far but it only captures 'seattle'.
import re
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor new york ny',
     'windows in elk grove village'
     ]
for i in l:
    m = re.search(r'(?<=\sin\s)(.+)(?=\s(wa|texas|ny))', i)
    m.group(1)
What you are after is not possible with regular expressions. Regular expressions need string patterns to work. In your case, it would seem that the pattern either does not exist or can take a myriad of forms.
What you could do would be to use a search efficient data structure and split your string in words. You would then go through each word and see if it is in your search efficient data structure.
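The lookup idea can be sketched with a set of known place names. Everything here is an assumption for illustration: KNOWN_CITIES is a tiny hypothetical table (a real one would be loaded from a list of place names), and find_city is a made-up helper. Because city names can span several words, the sketch checks runs of consecutive words, longest first.

```python
# Hypothetical lookup table; a real one would be loaded from a list of place names.
KNOWN_CITIES = {"seattle", "houston", "new york", "elk grove village"}
MAX_WORDS = max(len(city.split()) for city in KNOWN_CITIES)

def find_city(phrase):
    """Return the longest run of consecutive words that is a known city, or None."""
    words = phrase.lower().split()
    # Try longer candidates first so a multi-word name beats a shorter one.
    for size in range(min(MAX_WORDS, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in KNOWN_CITIES:
                return candidate
    return None

queries = ['replacement windows in seattle wa',
           'basement remodeling houston texas',
           'siding contractor new york ny',
           'windows in elk grove village']
for q in queries:
    print(find_city(q))
```

Set membership is a constant-time check on average, so the cost is driven by the number of word runs per phrase, not by the size of the city table.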
import re
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor newyork ny',
     'windows in elk grove village']
# re.VERBOSE ignores the spaces inside the pattern
p = re.compile(r"(\w+)\s(?:(wa | texas | ny | village))", re.VERBOSE)
for words in l:
    print(p.search(words).expand(r"\g<1> <-- the code is --> \g<2>"))
I refined this expression in my Python console:
texts = re.findall(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text", text)
It works very well and runs almost instantly when I execute it in the console, but when I put it into my code and run it through the interpreter it seems to hang.
I tested it again in the console and it again ran in less than a second.
I checked that the blocking statement is the regex execution and that the text is the same in every run.
What is happening?
----------------------------------------code---------------------------------------------
import re
class Wiki:
    # Regex definition
    search_text_regex = re.compile(r"text[^>]*\>(?P<text>(:?[^<]|</\s?[^tT])*)\</text")
    def search_by_title(self, name, text):
        """ Search the slice (the last) of the text that contains the exact name and return the slice index.
        """
        print("Backoff Launched:")
        # extract the text from Wikipedia pages
        print("\tExtracting Texts from pages...")
        texts = self.search_text_regex.findall(text)  # <= The Regex Launch
        # find the name in the text
        print("\tFinding names on text...")
        for index, text in enumerate(texts):
            if name in text:
                return index
        return None
-----------------Source----------------------------------
<page><title>Andrew Johnson</title><id>1624</id><revision><id>244612901</id><timestamp>2008-10-11T18:30:44Z</timestamp><contributor><username>Excirial</username><id>5499713</id></contributor><minor/><comment>Reverted edits by [[Special:Contributions/71.113.103.209|71.113.103.209]] to last version by Soliloquial ([[WP:HG|HG]])</comment><text xml:space="preserve">{{otherpeople2|Andrew Johnson (disambiguation)}}
{{Infobox President
|name=Andrew Johnson
|nationality=American
|image=Andrew Johnson - 3a53290u.png
|caption=President Andrew Johnson, taken in 1865 by [[Mathew Brady|Matthew Brady]].
|order=17th [[President of the United States]]
|vicepresident=none
|term_start=April 15, 1865
|term_end=March 4, 1869
|predecessor=[[Abraham Lincoln]]
|successor=[[Ulysses S. Grant]]
|birth_date={{birth date|mf=yes|1808|12|29}}
|birth_place=[[Raleigh, North Carolina]]
|death_date={{death date and age|mf=yes|1875|7|31|1808|12|29}}
|death_place=[[Elizabethton, Tennessee]]
|spouse=[[Eliza McCardle Johnson]]
|occupation=[[Tailor]]
|party=[[History of the Democratic Party (United States)|Democratic]] until 1864 and after 1869; elected Vice President in 1864 on a [[National Union Party (United States)|National Union]] ticket; no party affiliation 1865–1869
|signature=Andrew Johnson Signature.png
|order2=16th [[Vice President of the United States]]
|term_start2=March 4, 1865
|term_end2=April 15, 1865
|president2=[[Abraham Lincoln]]
|predecessor2=[[Hannibal Hamlin]]
|successor2=[[Schuyler Colfax]]
|jr/sr3=United States Senator
|state3=[[Tennessee]]
|term_start3=October 8, 1857
|term_end3=March 4, 1862
|preceded3=[[James C. Jones]]
|succeeded3=[[David T. Patterson]]
|term_start4=March 4, 1875
|term_end4=July 31, 1875
|preceded4=[[William Gannaway Brownlow|William G. Brownlow]]
|succeeded4=[[David M. Key]]
|order5=17th
|title5=[[Governor of Tennessee]]
|term_start5=October 17, 1853
|term_end5=November 3, 1857
|predecessor5=[[William B. Campbell]]
|successor5=[[Isham G. Harris]]
|religion=[[Christian]] (no denomination; attended Catholic and Methodist services)<ref>[http://www.adherents.com/people/pj/Andrew_Johnson.html Adherents.com: The Religious Affiliation of Andrew Johnson]</ref>
}}
Johnson was nominated for the [[Vice President of the United States|Vice President]] slot in 1864 on the [[National Union Party (United States)|National Union Party]] ticket. He and Lincoln were [[United States presidential election, 1864|elected in November 1864]]. Johnson succeeded to the Presidency upon Lincoln's assassination on April 15, 1865.
==Bibliography==
{{portal|Tennessee}}
{{portal|United States Army|United States Department of the Army Seal.svg}}
{{portal|American Civil War}}
* Howard K. Beale, ''The Critical Year. A Study of Andrew Johnson and Reconstruction'' (1930). ISBN 0-8044-1085-2
* Winston; Robert W. ''Andrew Johnson: Plebeian and Patriot'' (1928) [http://www.questia.com/PM.qst?a=o&d=3971949 online edition]
===Primary sources===
* Ralph W. Haskins, LeRoy P. Graf, and Paul H. Bergeron et al, eds. ''The Papers of Andrew Johnson'' 16 volumes; University of Tennessee Press, (1967–2000). ISBN 1572330910.) Includes all letters and speeches by Johnson, and many letters written to him. Complete to 1875.
* [http://www.impeach-andrewjohnson.com/ Newspaper clippings, 1865–1869]
* [http://www.andrewjohnson.com/09ImpeachmentAndAcquittal/ImpeachmentAndAcquittal.htm Series of [[Harper's Weekly]] articles covering the impeachment controversy and trial]
*[http://starship.python.net/crew/manus/Presidents/aj2/aj2obit.html Johnson's obituary, from the ''New York Times'']
==Notes==
{{reflist|2}}
==External links==
{{sisterlinks|s=Author:Andrew Johnson}}
*{{gutenberg author|id=Andrew+Johnson | name=Andrew Johnson}}
{{s-start}}
{{s-par|us-hs}}
{{s-aft|after=[[Ulysses S. Grant]]}}
{{s-par|us-sen}}
{{s-bef|before=[[James C. Jones]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]|years=October 8, 1857{{ndash}} March 4, 1862|alongside=[[John Bell (Tennessee politician)|John Bell]], [[Alfred O. P. Nicholson]]}}
{{s-vac|next=[[David T. Patterson]]|reason=[[American Civil War|Secession of Tennessee from the Union]]}}
{{s-bef|before=[[William Gannaway Brownlow|William G. Brownlow]]}}
{{s-ttl|title=[[List of United States Senators from Tennessee|Senator from Tennessee (Class 1)]]| years=March 4, 1875{{ndash}} July 31, 1875|alongside=[[Henry Cooper (U.S. Senator)|Henry Cooper]]}}
{{s-aft|after=[[David M. Key]]}}
{{s-ppo}}
{{s-bef|before=[[Hannibal Hamlin]]}}
{{s-ttl|title=[[List of United States Republican Party presidential tickets|Republican Party¹ vice presidential candidate]]|years=[[U.S. presidential election, 1864|1864]]}}
{{Persondata
|NAME= Johnson, Andrew
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION= seventeenth [[President of the United States]]<br/> [[Union (American Civil War)|Union]] [[Union Army|Army]] [[General officer|General]]
|DATE OF BIRTH={{birth date|mf=yes|1808|12|29|mf=y}}
|PLACE OF BIRTH= [[Raleigh, North Carolina]]
|DATE OF DEATH={{death date|mf=yes|1875|7|31|mf=y}}
|PLACE OF DEATH= [[Greeneville, Tennessee]]
}}
{{Lifetime|1808|1875|Johnson, Andrew}}
[[Category:Presidents of the United States]]
[[vi:Andrew Johnson]]
[[tr:Andrew Johnson]]
[[uk:Ендрю Джонсон]]
[[ur:انڈریو جانسن]]
[[yi:ענדרו זשאנסאן]]
[[zh:安德鲁·约翰逊]]</text></revision></page>
I solved it.
The code has a pipeline for cleaning the text that removed some markup needed for a correct match.
Because of the length of the text, searching for an impossible match takes too much time.
I would use this:
result = re.findall(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>", subject)
(?:(?!</?text>).)* consumes one character at a time, but only after the lookahead verifies that it's not the first character of a <text> or </text> tag.
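A quick way to try the pattern above is on a small made-up string shaped like the dump in the question (the sample below is an assumption, not the real data). Since the pattern has a single capturing group, each match yields just the body between the tags.

```python
import re

# (?s) lets . match newlines; the lookahead stops the scan at any <text> or </text> tag.
TEXT_BODY = re.compile(r"(?s)<text[^>]*>(?P<text>(?:(?!</?text>).)*)</text>")

# Made-up sample with two pages, one of them spanning two lines.
subject = (
    '<page><title>A</title><text xml:space="preserve">first\nbody</text></page>'
    "<page><title>B</title><text>second body</text></page>"
)

bodies = [m.group("text") for m in TEXT_BODY.finditer(subject)]
print(bodies)
```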