Regex - distinguishing between very similar substrings using the substring itself

Regex - distinguishing between very similar substrings using the substring itself - python

I know the title is rather confusing, but this is a hard problem to formulate simply. Hopefully this does have a solution and is the result of me being quite new to the world of regex.
I am trying to parse some text from a chemistry book and transform it into a JSON, but i'm having trouble dividing the text by its main identifier. All of this is being done on a Python 3.10 environment.
Consider the following string:
0047 Heptasilver nitrate octaoxide
[12258-22-9] Ag NO
7 11
(Ag O ) .AgNO
3 4 2 3
Alone, or Sulfides, or Nonmetals
The crystalline product produced by electrolytic oxidation
of silver nitrate (and possibly as formulated) detonates
feebly at 110°C. Mixtures with phosphorus and sulfur
explode on impact, hydrogen sulfide ignites on contact,
and antimony trisulfide ignites when ground with the salt.
Mellor, 1941, Vol. 3, 483–485
See other SILVER COMPOUNDS
See related METAL NITRATES
0048 Aluminium
[7429-90-5] Al
Al
HCS 1980, 135 (powder)
Finely divided aluminium powder or dust forms highly
explosive dispersions in air [1], and all aspects of pre-
vention of aluminium dust explosions are covered in 2
US National Fire Codes [2]. The effects on the ignition
properties of impurities introduced by recycled metal used
to prepare dust were studied [3]. Pyrophoricity is elimi-
nated by surface coating aluminium powder with poly-
styrene [4]. Explosion hazards involved in arc and flame
spraying of the powder were analyzed and discussed [5],
and the effect of surface oxide layers on flammability
was studied [6]. The causes of a severe explosion in
1983 in a plant producing fine aluminium powder were
analyzed, and improvements in safety practices discussed
[7]. A number of fires and explosions involving aluminiumdust arising from grinding, polishing, and buffing opera-
tions were discussed, and precautions detailed [8] [12]
[13]. Atomized and flake aluminium powders attain
See other METALS
See other REDUCANTS
0049 Aluminium-cobalt alloy (Raney cobalt alloy)
[37271-59-3] 50:50; [12043-56-0] Al Co; Al—Co
5
[73730-53-7] Al Co
2
Al Co
The finely powdered Raney cobalt alloy is a significant
dust explosion hazard.
See DUST EXPLOSION INCIDENTS (reference 22)
0050 Aluminium–copper–zinc alloy
(Devarda’s alloy)
[8049-11-4] Al—Cu—Zn
Al Cu Zn
Silver nitrate: Ammonia, etc.
See DEVARDA’S ALLOY
See other ALLOYS0051 Aluminium amalgam (Aluminium–
mercury alloy)
[12003-69-9] (1:1) Al—Hg
Al Hg
The amalgamated aluminium wool remaining from prepa-
ration of triphenylaluminium will rapidly oxidize and
become hot upon exposure to air. Careful disposal is nec-
essary [1]. Amalgamated aluminium foil may be pyro-
phoric and should be kept moist and used immediately [2].
1. Neely, T. A. et al., Org. Synth., 1965, 45, 109
2. Calder, A. et al., Org. Synth., 1975, 52, 78
See other ALLOYS
This string contains information on 5 distinct compounds, which are identified by a 4 digit number at the beginning, followed by the name and then in another line the CAS unique identifier in square brackets.
The way i'm trying to divide this into separate substrings for each object is by identifying the 4 digit number which is always followed by the other identifiers and divide the text at that point.
I'm currently using this regex expression which correctly identifies the 4 digit identifiers:
\n(\d{4})\s(?:[\s\S]*?)(?:\[\d*?-\d*?-\d*?\]|\[ *?\] [a-zA-Z]*?)
However, this also includes a few instances of other 4 digit numbers that are not identifiers, such as dates in the body text, such as the date "1983" in the text of the Aluminum (0048) compound entry.
I have tried to use negative lookaheads with the same expression i'm using for isolating the 4 digit identifier, but none of of the ways i've tried worked. And now i'm unsure if this is even possible, or perhaps i'm overcomplicating it.
Another way to do it would be by using the CAS (in the square brackets) but that would be worse, as there are entries with multiple or even empty CAS.
Any advice would be greatly appreciated!

A few notes about your pattern:
You can omit [a-zA-Z]*? at the end of the pattern as it is the last part and non greedy so it will not match any characters
parts like \d*?-\d*?-\d*? and \[ *?\] don't have to be non greedy as the specified character to be repeated can not cross the following character
If the match should always start with a newline:
\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]
Explanation
\n Match a newline
(\d{4}) Capture 4 digits in group 1
.* Match the rest of the line
(?:\n\(.*)* Optionally repeat matching a newline and ( followed by the rest of the line
\n Match a newline
\[(?: *|\d+-\d+-\d+)] Match [...] where there can be either only spaces or digits with a hyphen in between
See a regex demo.
If the square brackets should be directly on the next line:
^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]
See another regex demo.

Related

Tokenize paragraphs by special characters; then rejoin so tokenized segments to reach certain length

I have this long paragraph:
paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."
My goal is to split this long paragraph into multiple sentences keeping the sentences around 18 - 30 words each.
There is only one full-stop at the end; so nltk tokenizer is of no use. I can use regex to tokenize; I have this pattern that works in splitting:
regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
The question is how to join this paragraph into a list of multiple sentences that would be around 18 to 30; where possible; because sometimes it's not possible with this regex.
The end result will look like the following list below:
tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']
if we check the lengths of the end result; we get this many words into each tokenized segment:
[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]
Only the last segment exceeded 30 words (76 words), and that's okay!
Edit
The regex could include a colon : So the last segment would be less than 76

I would suggest using findall instead of split.
Then the regex could be:
(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)
Break-down:
\S+\s+ a word and the space(s) that follow it
(?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words followed by a space (so initially it wont match any) and then greedily match as many words as possible up to 29, but at least 17, and all that ending with white space. The first lazy match is needed for when no match completes with just the greedy part.
\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by a terminator character, or the end of the string.
So:
regex = r'(?:\S+\s+)*?(?:\S+\s+){18,30}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)
for line in new_text:
print(len(line.split()), line)
The number of words per paragraph are:
[21, 31, 25, 30, 25, 29, 27, 26, 76]

Extract information part f URL in python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new with regex in python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?

urls = [...]
for url in urls:
bits = url.split('/') # Split each url at the '/'
bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit] # [1]
print (bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: eg "13721842.php", "revenues.asp", "210002719.html"
Need to substitute in a space for characters other than '/', (see fourth, "General+News")

Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re
regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
parts = url.split('/')
trimmed = [regex.sub('', x) for x in parts if '-' in x]
longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.

Since the url's are not in a consistent pattern, Stating the fact that the first and the third url are of different pattern than those of the rest.
Using r.split():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
url = url.replace("-", " ")
if url.rsplit('/', 1)[1] == '': # For case 1 and 3rd url
if url.rsplit('/', 2)[1].isdigit(): # For 3rd case url
print(url.rsplit('/', 3)[1])
else:
print(url.rsplit('/', 2)[1])
else:
print(url.rsplit('/', 1)[1]) # except 1st and 3rd case urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision

Python: Find and remove a string starting and ending with a specific substring in python

I have a string and it has a number of substrings that I'd like to delete.
Each of the substrings start with ApPle and end with THE BEST PIE — STRAWBERRY.
I tried the suggestions on this post, but they didn't work.
Input
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian ApPle Sep 12 THE BEST PIE —
STRAWBERRY pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine.
Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually ApPle Aug 4 THE BEST PIE — STRAWBERRY containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]
import re
array = []
#open the file and delete new lines
with open('canoli.txt', 'r') as myfile:
file = myfile.readlines()
array = [s.rstrip('\n') for s in file]
text = ' '.join(array)
attempt1 = re.sub(r'/ApPle+THE.BEST.PIE.-.STRAWBERRY/','',text)
attempt2 = re.sub(r'/ApPle:.*?:THE.BEST.PIE.-.STRAWBERRY/','',text)
print(attempt1)
print(attempt2)
Desired Output
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine. Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]

I think your regex should be: ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY
and you need to add the regex option DOTALL to handle newlines properly, try this:
re.sub(r'ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY','',text, flags=re.DOTALL)

Pandas unable to parse csv with multiple lines within a cell

I have a csv file Decoded.csv
Query,Doc,article_id,data_source
5000,how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.,459,random
5001,what is hypospadia,A birth defect of the male urethra.,409,dummy
5002,difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.,461,nytimes
And there are 3 Query 5000,5001 & 5002.
Query 5000 has a Doc value which has multiple lines and that is confusing for pandas.
(1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible)
My python code is as under
def main():
import pandas as pd
dataframe = pd.read_csv("Decoded.csv")
queries, docs = dataframe['Query'], dataframe['Doc']
for idx in range(len(queries)):
print("idx: ", idx, " ", queries[idx], " <-> ", docs[idx])
query_doc_appended = (queries[idx] + " " + docs[idx])
print(query_doc_appended)
if __name__ == '__main__':
main()
And it fails. Please point me how to get rid of new line characters so that Query 5000 has the complete set of statements for Doc.

your Query 5001 row has too many fields in it, making it have 5 columns instead of the 4 that the other rows have.
5001,what is hypospadia,A birth defect of the male urethra.,409,dummy
you can double-quote your Doc content within Decoded.csv to get around this.

2 problems:
to allow multiline fields, the field data has to be enclosed in double quotes.
there are also comma's in your field data.
So, the csv should look like this:
Query,Doc,article_id,data_source
5000,"how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.",459,random
5001,"what is hypospadia,A birth defect of the male urethra.",409,dummy
5002,"difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.",461,nytimes
In case there are double quotes inside those fields, they have to be escaped with another double quote.

Splitting a string with no line breaks into a list of lines with a maximum column count

I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line_)
OR, there is a newline in the original string (that will force a new "line" to be created.
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."

EDIT
What you are looking for is textwrap, but that's only part of the solution not the complete one. To take newline into account you need to do this:
from textwrap import wrap
'\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
>>> print '\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond

You probably want to use the textwrap function in the standard library:
http://docs.python.org/library/textwrap.html

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.