This is my first time posting so I apologize if I omit necessary information!
I am trying to extract a paragraph of text in Python that always follows a line starting with "Item 5.02". There is a blank line between the "Item 5.02" line and the paragraph I am trying to extract. I need the text between the "Item 5.02" line and the next section (in this case the next section starts at "Item 9.01"). Please let me know if I need to clarify anything. I have been tinkering with regular expressions but haven't had much luck; I'm pretty new to them. Thanks for the help!
I would like to extract the following:
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company. Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP. As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001). A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
From the below text:
Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangement of Certain Officers.
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company.
Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP.
As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001).
A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
Item 9.01 Financial Statements and Exhibits.
You could split it by double newlines, find the piece which contains Item 5.02, then take the next one:
def extractPassage(text):
    lines = text.split("\n\n")
    for i, line in enumerate(lines):
        if line.startswith("Item 5.02"):
            return lines[i + 1]
    raise Exception("No line found starting with Item 5.02")
I can't tell from the post formatting if there are any tabs or spaces before Item 5.02 on that line. If so, include them in the startswith call.
To get all text between 5.02 and 9.01, we can append chunks to a string, starting after the one that begins with "Item 5.02" and stopping when we see "Item 9.01":
def extractPassage(text):
    lines = text.split("\n\n")
    output = ""
    for i, line in enumerate(lines):
        if line.startswith("Item 5.02"):
            j = i + 1
            take_line = lines[j]
            while not take_line.startswith("Item 9.01"):
                output += take_line
                j += 1
                take_line = lines[j]
            return output
    raise Exception("No line found starting with Item 5.02")
The following regex will match the word Item followed by a space, one digit, a period, and then two more digits.
import re
re.split('Item \d\.\d\d', text)
To explain the regex: \d matches any digit, and to match a literal period we have to escape it using \..
If you would rather accept either 1 or 2 digits after the period, you would use the regex 'Item \d\.\d{1,2}'
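To show how that split can be used end to end, here is a minimal sketch (the function name extract_section is mine, not from the original answer): putting the pattern in a capturing group makes re.split keep the "Item X.YZ" headings, so we can return the chunk that follows "Item 5.02".

import re

def extract_section(text, item="Item 5.02"):
    # The capturing group keeps the "Item X.YZ" headings in the result list,
    # so we know which chunk follows which heading.
    chunks = re.split(r'(Item \d\.\d{1,2})', text)
    for i, chunk in enumerate(chunks):
        if chunk == item and i + 1 < len(chunks):
            body = chunks[i + 1]
            # Drop the rest of the heading line; keep the paragraphs after the
            # first blank line (assumes blank-line separation, as in the sample).
            return body.split("\n\n", 1)[-1].strip()
    raise ValueError(f"No section found for {item}")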
Related
I'm trying to create a program that counts the total words and the total unique words in a file. When I run the two parts of the code together, only the unique-word counter works; when I delete the unique-word counter part, the normal word counter works fine.
Here is the full code:
f = open('icuhistory.txt', 'r')
wordCount = 0
text = f.read()
for line in f:
    lin = line.rstrip()
    wds = line.split()
    wordCount += len(wds)  # this section alone works fine

text = text.lower()  # when I start writing this one the first one will stop working
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
unique.sort()
print("number of words: ", wordCount)
print("number of unique words: ", len(unique))
Here is the inside of the file
in the fall of 1945 just weeks after the end
of world war ii a group of japanese christian educators
initiated a move to establish a university based on christian
principles the foreign missions conference of north america and the
us education mission both visiting japan at the time
gave their wholehearted support conveying this plan to people in
the us amidst the post-war yearning for reconciliation and
world peace americans supported this project with great enthusiasm in
1948 the japan international christian university foundation jicuf was
established in new york to coordinate fund-raising efforts in the
us people in japan also found hope in a
cause dedicated toworld peace organizations firms and individuals made donations
to this ambitious undertaking regardless of their religious orientation anddespite
the often destitute circumstances in the immediate post-war years bank
of japan governor hisato ichimada headed the supporting organization to
lead the national fund raising drive icu has been unique from
its inception with its endowment procured through good will transcending
national borders
on june 15 1949 japanese and north american christian leaders
convened at the gotemba ymca camp to establish international christian
university with the inauguration of the board of trustees and
the board of councillors the founding principles and a fundamental
educational plan were laid down establishing an interdenominational christian university
had been a dream of japanese and american christians for
half a century the gotemba conference had finally realized their
aspirations
in 1950 icu purchased a spacious site in mitaka city
on the outskirts of tokyo with the donations it received
within japan the campus was dedicated on april 29 1952
with the language institute set up in the first year
in march 1953 the japanese ministry of education authorized icu
as an incorporated educational institution the college of liberal arts
opening on april 1 as the first four-year liberal arts
college in japan
the university celebrated its 50th anniversary in 1999 with diverse
events and projects during the commemorative five year period leading to
march 2004 in 2003 the ministry of education culture sports
science and technology selected icu s research and education
for peace security and conviviality for the 21st century center
of excellence program and its liberal arts to nurture
responsible global citizens for the distinctive university education support program
good practice
in 2008 an academic reform was enforced in the college
of liberal arts which replaced the system of six divisions
with a new organization of the division of arts
and sciences and a system of academic majors as of
april 2008 all new students simply start as college of
liberal arts students making their choice of major from 31
areas by the end of their sophomore year students now
have more time to make a decision while they study
diverse subjects through general education and foundation courses mext chose
icu for its fiscal year 2007 distinctive university education support
program educational support for liberal arts to nurture international
learning from academic advising to academic planning in acknowledgement of
the university s efforts for educational improvement in 2010 the
graduate school also conducted a reform and integrated the four
divisions into a new school of arts and sciences
icu is continually working to reconfirm its responsibilities and fulfill
its mission for the changing times
The entire file content appears to be lowercase so it's as easy as this:
result = {}
with open('icuhistory.txt') as icu:
    for word in icu.read().split():
        word = word.strip('.,!;()[]').replace("'s", "")
        result[word] = result.get(word, 0) + 1

print(f'Number of words = {sum(result.values())}')
print(f'Number of unique words = {len(result)}')
Output:
Number of words = 547
Number of unique words = 273
Take a look at the text = f.read() line. Is it at the right place?
Also, the Python script you pasted does not have consistent indenting. Are you able to clean it up so that it looks just like the original?
Also curious if you have explored the set type in Python? It is a little like a list, but you may find it applicable in your scenario.
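As a quick illustration of that hint (the sample words here are made up, not from the file), a set keeps only one copy of each element:

words = ['peace', 'world', 'peace', 'icu', 'world']
unique_words = set(words)
print(len(words), len(unique_words))  # prints: 5 3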
Explanation:
Behind files and open stands the concept of streaming; if you are more familiar with iterators, think of f = open('icuhistory.txt', 'r') as an iterator.
You can go through it only once (unless you tell it to reset).
text = f.read()
Will go through it once, then f is at the end of the file.
for line in f:
Now tries to continue from where f currently is... at the end of the file.
Since there are no lines left, there is nothing to iterate over and the body of the for loop never runs.
Solutions:
You could reset it with f.seek(0); this tells the file object to go back to the start of the file.
But it would be more efficient to either combine both of your counts in a single loop over the file (more memory-friendly) or work only with the text you already have from text = f.read().
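For example, a minimal sketch of the first option, doing both counts in one pass over the file (variable names are illustrative, not from the original post):

wordCount = 0
unique = set()

with open('icuhistory.txt', 'r') as f:
    for line in f:
        for word in line.lower().split():
            word = word.strip('.,!;()[]').replace("'s", '')
            wordCount += 1
            unique.add(word)

print("number of words: ", wordCount)
print("number of unique words: ", len(unique))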
There's no need to read line by line since you are counting words. Also avoid sorting unless it's needed, as it can be expensive. Converting a list to a set removes duplicates, and you can chain string methods.
with open('icuhistory.txt', 'r') as f:
    text = f.read().lower()

words = [word.strip('.,!;()[]').replace("'s", '') for word in text.split()]
unique_words = set(words)

print("number of words: ", len(words))
print("number of unique words: ", len(unique_words))
I have this long paragraph:
paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."
My goal is to split this long paragraph into multiple sentences, keeping each around 18-30 words.
There is only one full stop, at the end, so the nltk sentence tokenizer is of no use. I can use regex to tokenize; I have this pattern that works for splitting:
regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
The question is how to join the split pieces back into a list of sentences of around 18 to 30 words each, where possible (sometimes it isn't possible with this regex).
The end result will look like the following list below:
tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']
If we check the lengths of the end result, we get this many words in each tokenized segment:
[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]
Only the last segment exceeded 30 words (76 words), and that's okay!
Edit: The regex could also include a colon (:), so the last segment would come out shorter than 76 words.
I would suggest using findall instead of split.
Then the regex could be:
(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)
Break-down:
\S+\s+: a word and the space(s) that follow it
(?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words followed by white space (so initially it won't match any), then greedily match as many words as possible, up to 29 but at least 17, each ending with white space. The lazy part is needed for cases where no match can be completed with the greedy part alone.
\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by a terminator character, or the end of the string.
So:
import re

regex = r'(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)

for line in new_text:
    print(len(line.split()), line)
The number of words per segment is:
[21, 31, 25, 30, 25, 29, 27, 26, 76]
I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new with regex in python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]

for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I keep every bit that contains a hyphen.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: e.g. "13721842.php", "revenues.asp", "210002719.html"
a space also needs to be substituted for separator characters other than '/' and '-' (see the fourth output, "General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
Since the URLs do not follow a consistent pattern (the first and the third URLs are structured differently from the rest), using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':           # for the 1st and 3rd urls (they end with '/')
        if url.rsplit('/', 2)[1].isdigit():   # for the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])          # all other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
I've documents in the tuple format ("topic", "doc"):
('grain',
'Thailand exported 84,960 tonnes of rice in the week ended February 24, '
'689,038 tonnes of rice between the beginning of January and February 24, '
'up from 556,874 tonnes during the same period last year. It has '
'commitments to export another 658,999 tonnes this year. REUTER '),
('soybean',
'The Tokyo Grain Exchange said it will raise the margin requirement on '
'the spot and nearby month for U.S. And Chinese soybeans and red beans, '
'effective March 2. Spot April U.S. Soybean contracts will increase to '
'90,000 yen per 15 tonne lot from 70,000 now. Other months will stay '
'will be set at 70,000 from March 2. The new margin for red bean spot '),.....
I've taken only 10 topics for the classification task.
Now my problem is: how do I classify anything outside these 10 topics as "NA" (not from the 10 topics)? I'm using Naive Bayes right now. Is there any other classifier better suited to "NA" topics? If yes, how do we set a threshold for "NA"?
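One common way to get an "NA" label is to threshold the classifier's predicted probabilities. Below is a minimal sketch of that idea, assuming scikit-learn; the vectorizer choice, the documents variable, and the 0.5 threshold are all illustrative and would need tuning on held-out data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# documents is assumed to be your list of ("topic", "doc") tuples for the 10 topics
texts = [doc for topic, doc in documents]
labels = [topic for topic, doc in documents]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

def predict_with_na(text, threshold=0.5):
    # Label a document "NA" when the best class probability is below the threshold.
    probs = clf.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= threshold else "NA"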
I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line),
OR there is a newline in the original string (which forces a new "line" to be created).
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
What you are looking for is textwrap, but that's only part of the solution, not the complete one. To take newlines into account you need to do this:
from textwrap import wrap

wrapped = '\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
print(wrapped)
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
You probably want to use the textwrap module in the standard library:
http://docs.python.org/library/textwrap.html
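For reference, a minimal sketch of using it to build a list of lines, as the question asks, while still honoring the newlines already in the text (the width of 50 and the function name wrap_to_lines are my own choices, not from the answer):

from textwrap import wrap

def wrap_to_lines(text, width=50):
    lines = []
    for block in text.splitlines():
        # wrap() returns [] for an empty block; keep it as an empty output line.
        lines.extend(wrap(block, width=width) or [''])
    return lines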