Extract information from part of a URL in Python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but I am relatively new to regex in Python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in Python (I am a Python n00b)?

urls = [...]
for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that contain a hyphen.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both contain only dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: e.g. "13721842.php", "revenues.asp", "210002719.html"
characters other than '-' (e.g. '+') also need substituting with a space (see fourth, "General+News")
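Putting those fixes together, a hedged sketch of the extra cleanup (the trimming regexes below are my own guesses at the trailing-id and extension patterns, not a definitive rule):

```python
import re

def extract_headline(url):
    """Pick the hyphen-rich path segment and clean it up."""
    candidates = [bit for bit in url.split('/') if '-' in bit]
    # The segment with the most hyphens is most likely the headline
    best = max(candidates, key=lambda bit: bit.count('-'))
    best = re.sub(r'\.[a-z]+$', '', best, flags=re.I)  # drop .php/.html/.asp
    best = re.sub(r'(-\d+)+$', '', best)               # drop trailing numeric ids
    return re.sub(r'[-+]', ' ', best)                  # '-' and '+' to spaces

print(extract_headline('https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php'))
# Trump orders Treasury HUD to develop new plan
```

This still cannot tell a "dictionary-word" headline apart from a non-headline segment; it only picks the hyphen-richest one.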

Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
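For reference, that earlier variant might have looked roughly like this (my reconstruction, not the exact original code), picking the dash-richest part first and trimming afterwards:

```python
import re

def headline(url):
    # Pick the part with the most dashes first, then trim trailing
    # hex strings and a file name extension
    parts = [p for p in url.split('/') if '-' in p]
    longest = max(parts, key=lambda p: p.count('-'))
    longest = re.sub(r'(-[0-9a-f]+)*(\.[a-z]+)?$', '', longest, flags=re.I)
    return longest.replace('-', ' ')

print(headline('https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html'))
# why mirza international limited nse
```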

Since the URLs are not in a consistent pattern (note that the first and the third URL follow a different pattern than the rest), those cases need handling separately.
Using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # for the 1st and 3rd urls
        if url.rsplit('/', 2)[1].isdigit():  # for the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # all other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision


Why is part of my code ruining the other part?

Hello guys, I'm trying to create a program to count the total words and the total unique words in a file. When I run the two parts of the code together, only the unique-words counter works; when I delete the unique-words counter part, the normal word counter works normally.
Here is the full code:
f = open('icuhistory.txt', 'r')
wordCount = 0
text = f.read()
for line in f:
    lin = line.rstrip()
    wds = line.split()
    wordCount += len(wds)  # this section alone works fine
text = text.lower()  # when I start writing this one the first one will stop working
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
unique.sort()
print("number of words: ", wordCount)
print("number of unique words: ", len(unique))
Here is the inside of the file
in the fall of 1945 just weeks after the end
of world war ii a group of japanese christian educators
initiated a move to establish a university based on christian
principles the foreign missions conference of north america and the
us education mission both visiting japan at the time
gave their wholehearted support conveying this plan to people in
the us amidst the post-war yearning for reconciliation and
world peace americans supported this project with great enthusiasm in
1948 the japan international christian university foundation jicuf was
established in new york to coordinate fund-raising efforts in the
us people in japan also found hope in a
cause dedicated toworld peace organizations firms and individuals made donations
to this ambitious undertaking regardless of their religious orientation anddespite
the often destitute circumstances in the immediate post-war years bank
of japan governor hisato ichimada headed the supporting organization to
lead the national fund raising drive icu has been unique from
its inception with its endowment procured through good will transcending
national borders
on june 15 1949 japanese and north american christian leaders
convened at the gotemba ymca camp to establish international christian
university with the inauguration of the board of trustees and
the board of councillors the founding principles and a fundamental
educational plan were laid down establishing an interdenominational christian university
had been a dream of japanese and american christians for
half a century the gotemba conference had finally realized their
aspirations
in 1950 icu purchased a spacious site in mitaka city
on the outskirts of tokyo with the donations it received
within japan the campus was dedicated on april 29 1952
with the language institute set up in the first year
in march 1953 the japanese ministry of education authorized icu
as an incorporated educational institution the college of liberal arts
opening on april 1 as the first four-year liberal arts
college in japan
the university celebrated its 50th anniversary in 1999 with diverse
events and projects during the commemorative five year period leading to
march 2004 in 2003 the ministry of education culture sports
science and technology selected icu s research and education
for peace security and conviviality for the 21st century center
of excellence program and its liberal arts to nurture
responsible global citizens for the distinctive university education support program
good practice
in 2008 an academic reform was enforced in the college
of liberal arts which replaced the system of six divisions
with a new organization of the division of arts
and sciences and a system of academic majors as of
april 2008 all new students simply start as college of
liberal arts students making their choice of major from 31
areas by the end of their sophomore year students now
have more time to make a decision while they study
diverse subjects through general education and foundation courses mext chose
icu for its fiscal year 2007 distinctive university education support
program educational support for liberal arts to nurture international
learning from academic advising to academic planning in acknowledgement of
the university s efforts for educational improvement in 2010 the
graduate school also conducted a reform and integrated the four
divisions into a new school of arts and sciences
icu is continually working to reconfirm its responsibilities and fulfill
its mission for the changing times
The entire file content appears to be lowercase so it's as easy as this:
result = {}
with open('icuhistory.txt') as icu:
    for word in icu.read().split():
        word = word.strip('.,!;()[]').replace("'s", "")
        result[word] = result.get(word, 0) + 1
print(f'Number of words = {sum(result.values())}')
print(f'Number of unique words = {len(result)}')
Output:
Number of words = 547
Number of unique words = 273
Take a look at the text = f.read() line. Is it at the right place?
Also, the Python script you pasted does not have consistent indenting. Are you able to clean it up so that it looks just like the original?
Also curious if you have explored the set type in Python? It is a little like a list, but you may find it applicable in your scenario.
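For example, the set type mentioned above removes duplicates automatically (a toy example using the first line of the file):

```python
words = "in the fall of 1945 just weeks after the end".split()
unique = set(words)  # a set stores each distinct word only once

print("number of words:", len(words))          # 10
print("number of unique words:", len(unique))  # 9, since 'the' appears twice
```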
Explanation:
Behind files and open stands the concept of streaming; if you are familiar with iterators, think of f = open('icuhistory.txt','r') as an iterator.
You can go through it only once (unless you tell it to reset).
text = f.read()
will go through it once; then f is at the end of the file.
for line in f:
now tries to continue where f currently is... at the end of the file.
So this loop tries to iterate over the zero lines left at the end.
As there is nothing left to iterate over, it never enters the for loop.
Solutions:
You could reset it with f.seek(0); this tells the file object to go back to the start of the file.
But it would be more efficient to either combine both of your actions in the loop (more memory friendly) or work only with the text from text = f.read().
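A minimal sketch of the f.seek(0) fix; io.StringIO stands in for the real file here so the example is self-contained:

```python
import io

# Stand-in for open('icuhistory.txt'); a real file object behaves the same
f = io.StringIO("in the fall of 1945\njust weeks after the end\n")

text = f.read()  # consumes the whole stream; f is now positioned at the end
f.seek(0)        # rewind to the start so the loop below sees the lines again

word_count = sum(len(line.split()) for line in f)
print("number of words:", word_count)  # 10
```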
There's no need to read line by line, since you are counting words. Also avoid sorting unless it's needed, as it can be expensive. Converting a list to a set removes duplicates, and you can chain string methods.
with open('icuhistory.txt', 'r') as f:
    text = f.read().lower()
words = [word.strip('.,!;()[]').replace("'s", '') for word in text.split()]
unique_words = set(words)
print("number of words: ", len(words))
print("number of unique words: ", len(unique_words))

Tokenize paragraphs by special characters, then rejoin the tokenized segments to reach a certain length

I have this long paragraph:
paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."
My goal is to split this long paragraph into multiple sentences keeping the sentences around 18 - 30 words each.
There is only one full stop, at the end, so the nltk sentence tokenizer is of no use. I can use regex to tokenize; I have this pattern that works for splitting:
import re

regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
The question is how to rejoin these pieces into a list of segments of around 18 to 30 words each, where possible; sometimes it's not possible with this regex.
The end result will look like the following list below:
tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']
If we check the lengths of the end result, we get this many words in each tokenized segment:
[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]
Only the last segment exceeded 30 words (76 words), and that's okay!
Edit: the regex could include a colon ':' so that the last segment would have fewer than 76 words.
I would suggest using findall instead of split.
Then the regex could be:
(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)
Break-down:
\S+\s+: a word and the space(s) that follow it
(?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words followed by space(s) (so initially it won't match any), then greedily match as many words as possible, up to 29 but at least 17, all ending with whitespace. The initial lazy part is needed for when no match completes with just the greedy part.
\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by a terminator character or the end of the string.
So:
import re

regex = r'(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)
for line in new_text:
    print(len(line.split()), line)
The resulting numbers of words per segment are:
[21, 31, 25, 30, 25, 29, 27, 26, 76]

Regex - distinguishing between very similar substrings using the substring itself

I know the title is rather confusing, but this is a hard problem to formulate simply. Hopefully it has a solution and the difficulty is just a result of me being quite new to the world of regex.
I am trying to parse some text from a chemistry book and transform it into JSON, but I'm having trouble dividing the text by its main identifier. All of this is being done in a Python 3.10 environment.
Consider the following string:
0047 Heptasilver nitrate octaoxide
[12258-22-9] Ag NO
7 11
(Ag O ) .AgNO
3 4 2 3
Alone, or Sulfides, or Nonmetals
The crystalline product produced by electrolytic oxidation
of silver nitrate (and possibly as formulated) detonates
feebly at 110°C. Mixtures with phosphorus and sulfur
explode on impact, hydrogen sulfide ignites on contact,
and antimony trisulfide ignites when ground with the salt.
Mellor, 1941, Vol. 3, 483–485
See other SILVER COMPOUNDS
See related METAL NITRATES
0048 Aluminium
[7429-90-5] Al
Al
HCS 1980, 135 (powder)
Finely divided aluminium powder or dust forms highly
explosive dispersions in air [1], and all aspects of pre-
vention of aluminium dust explosions are covered in 2
US National Fire Codes [2]. The effects on the ignition
properties of impurities introduced by recycled metal used
to prepare dust were studied [3]. Pyrophoricity is elimi-
nated by surface coating aluminium powder with poly-
styrene [4]. Explosion hazards involved in arc and flame
spraying of the powder were analyzed and discussed [5],
and the effect of surface oxide layers on flammability
was studied [6]. The causes of a severe explosion in
1983 in a plant producing fine aluminium powder were
analyzed, and improvements in safety practices discussed
[7]. A number of fires and explosions involving aluminiumdust arising from grinding, polishing, and buffing opera-
tions were discussed, and precautions detailed [8] [12]
[13]. Atomized and flake aluminium powders attain
See other METALS
See other REDUCANTS
0049 Aluminium-cobalt alloy (Raney cobalt alloy)
[37271-59-3] 50:50; [12043-56-0] Al Co; Al—Co
5
[73730-53-7] Al Co
2
Al Co
The finely powdered Raney cobalt alloy is a significant
dust explosion hazard.
See DUST EXPLOSION INCIDENTS (reference 22)
0050 Aluminium–copper–zinc alloy
(Devarda’s alloy)
[8049-11-4] Al—Cu—Zn
Al Cu Zn
Silver nitrate: Ammonia, etc.
See DEVARDA’S ALLOY
See other ALLOYS0051 Aluminium amalgam (Aluminium–
mercury alloy)
[12003-69-9] (1:1) Al—Hg
Al Hg
The amalgamated aluminium wool remaining from prepa-
ration of triphenylaluminium will rapidly oxidize and
become hot upon exposure to air. Careful disposal is nec-
essary [1]. Amalgamated aluminium foil may be pyro-
phoric and should be kept moist and used immediately [2].
1. Neely, T. A. et al., Org. Synth., 1965, 45, 109
2. Calder, A. et al., Org. Synth., 1975, 52, 78
See other ALLOYS
This string contains information on 5 distinct compounds, each identified by a 4-digit number at the beginning, followed by the name, and then on another line the unique CAS identifier in square brackets.
The way I'm trying to divide this into separate substrings, one per compound, is by identifying the 4-digit number, which is always followed by the other identifiers, and dividing the text at that point.
I'm currently using this regex, which correctly identifies the 4-digit identifiers:
\n(\d{4})\s(?:[\s\S]*?)(?:\[\d*?-\d*?-\d*?\]|\[ *?\] [a-zA-Z]*?)
However, this also matches a few other 4-digit numbers that are not identifiers, such as dates in the body text, e.g. "1983" in the Aluminium (0048) compound entry.
I have tried to use negative lookaheads with the same expression I'm using for isolating the 4-digit identifier, but none of the ways I've tried worked. Now I'm unsure if this is even possible, or perhaps I'm overcomplicating it.
Another way to do it would be by using the CAS number (in the square brackets), but that would be worse, as there are entries with multiple or even empty CAS numbers.
Any advice would be greatly appreciated!
A few notes about your pattern:
You can omit [a-zA-Z]*? at the end of the pattern: as it is the last part and non-greedy, it will never match any characters
Parts like \d*?-\d*?-\d*? and \[ *?\] don't have to be non-greedy, as the repeated character cannot cross the character that follows it
If the match should always start with a newline:
\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]
Explanation
\n Match a newline
(\d{4}) Capture 4 digits in group 1
.* Match the rest of the line
(?:\n\(.*)* Optionally repeat matching a newline and ( followed by the rest of the line
\n Match a newline
\[(?: *|\d+-\d+-\d+)] Match [...] where there can be either only spaces or digits with a hyphen in between
See a regex demo.
If the square brackets should be directly on the next line:
^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]
See another regex demo.
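To actually cut the text into one substring per compound, one option is to locate the entry headers with the second pattern and slice between their start offsets. The text below is my own abridged stand-in for the book excerpt, not the full string from the question:

```python
import re

# Abridged stand-in for the book text; '1983' appears mid-line on purpose
text = """0047 Heptasilver nitrate octaoxide
[12258-22-9] Ag NO
The crystalline product detonates feebly at 110 C.
Mixtures exploded in 1983 in a plant.
0048 Aluminium
[7429-90-5] Al
Finely divided aluminium powder forms explosive dispersions.
"""

# Pattern from the answer: CAS brackets must be directly on the next line
pattern = re.compile(r'^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]', re.MULTILINE)

starts = [m.start() for m in pattern.finditer(text)]
entries = [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]
print(len(entries))  # 2: '1983' in the body text is not at a line start
```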

Why are WDT words being marked as a sentence subject by dependency parsing?

I want to report the subject of each sentence, and also extract all its modifiers. (E.g. "Donald Trump", not just "Trump"; "(The) average remaining lease term", not just "term".)
Here is my test code:
import spacy

nlp = spacy.load('en_core_web_sm')

def handle(doc):
    for sent in doc.sents:
        shownSentence = False
        for token in sent:
            if token.dep_ == "nsubj":
                if not shownSentence:
                    print("----------")
                    print(sent)
                    shownSentence = True
                print("{0}/{1}".format(token.text, token.tag_))
                print([[t, t.tag_] for t in token.children])

handle(nlp('Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink." The average remaining lease term is six years, and Laura Palmer was killed by Bob. Trump added he will sell up soon.'))
The output is below. I'm wondering why I get "which/WDT" as a subject? Is it just model noise, or is it considered correct behaviour? (Incidentally, in my real sentence, which had the same structure, I also got "that/WDT" being marked as a subject.) (UPDATE: If I switch to 'en_core_web_md' then I do get "that/WDT" for my Trump example; that is the only difference switching from the small to the medium model makes.)
I can easily filter them out by looking at tag_; I'm more interested in the underlying reason.
(UPDATE: Incidentally, "Laura Palmer" doesn't get pulled out as a subject by this code, as the dep_ value is "nsubjpass", not "nsubj".)
----------
Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink."
Trump/NNP
[[Donald, 'NNP'], [,, ','], [legend, 'NN'], [,, ',']]
transaction/NN
[[This, 'DT']]
which/WDT
[]
----------
The average remaining lease term is six years, and Laura Palmer was killed by Bob.
term/NN
[[The, 'DT'], [average, 'JJ'], [remaining, 'JJ'], [lease, 'NN']]
----------
Trump added he will sell up soon.
Trump/NNP
[]
he/PRP
[]
(By the way, the bigger picture: pronoun resolution. I want to turn PRPs into the text they refer to.)

Splitting a string with no line breaks into a list of lines with a maximum column count

I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line),
OR there is a newline in the original string (which will force a new "line" to be created).
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
EDIT
What you are looking for is textwrap, but that's only part of the solution, not the complete one. To take newlines into account you need to do this:
from textwrap import wrap

wrapped = '\n'.join('\n'.join(wrap(block, width=50)) for block in text.splitlines())
print(wrapped)
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
You probably want to use the textwrap module in the standard library:
http://docs.python.org/library/textwrap.html
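A minimal standalone example (the width of 50 and the sample sentence are arbitrary choices for illustration):

```python
import textwrap

text = ("Within eight hours of Wilson's outburst, his Democratic opponent "
        "had received nearly 3,000 individual contributions.")

lines = textwrap.wrap(text, width=50)
for line in lines:
    print(line)
# Each line is at most 50 characters, and breaks fall on word boundaries
```

textwrap.fill() does the same but returns the joined string directly instead of a list of lines.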
