Cleaning text string after getting body text using BeautifulSoup - Python

I'm trying to get the text of articles from various webpages and write them out as clean text documents. I don't want all visible text, because that often includes irrelevant links in the sidebars. I'm using BeautifulSoup to extract the information from the pages, but extra links (not just in the sidebar, but sometimes in the middle of the body text and at the bottom of the articles) make it into the final product.
Does anyone know how to deal with extra links that get converted into text but are not actually part of the real article's text?
#Some of the imports are for other portions of the code not shown here.
#I'm new to Python and am bad at remembering which library has which functions.
import os
import sys
import urllib2
import webbrowser
from bs4 import BeautifulSoup
from os import path
from cookielib import CookieJar
#I made an opener to deal with proxies and put *** instead of my information
#cookielib helps me get articles from nytimes
proxy = urllib2.ProxyHandler({'http': '***' % '***'})
auth = urllib2.HTTPBasicAuthHandler()
cj = CookieJar()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler, urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
#Uses the url input as a string to open a webpage and pulls out all the information.
def baumeister(url):
    req = urllib2.Request(url)
    opened = urllib2.urlopen(req)
    html_doc = opened.read()
    soup = BeautifulSoup(html_doc)
    return soup
#Gets the body text from that html information.
def substanz(url):
    soup = baumeister(url)
    body = soup.find_all("p")  #This is where I have tried to fix the problem and failed
    result = ""
    for e in body:
        i = e.getText().replace("\t", "").replace("\xa0", " ").strip().encode(errors="ignore")
        result += i + "\r\n\r\n"
    return result
One article that I have used to test substanz, and that gets cleaned exactly the way I want, is:
http://blogs.hbr.org/2014/06/do-you-really-want-to-be-yourself-at-work/
I'm trying to test with more articles from different sites, so I'm trying to clean the result of substanz (the result is one big string). The problem I have is with this article:
http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#.
I've just used print(substanz(url)) to see what the result looks like. With the CNBC article I get extra links turned into text that are not really part of the article, whereas in the Harvard Business Review article everything works out just fine because the included links are part of the actual text.
I'm not going to attach the full result for each article here because each is a full page of text.
If you try exactly the code I have posted above, the opener is not going to work, so use whatever opener you like to access websites. I have to go through a certain proxy at work, so that's the format that works for me.
Final note: I'm using Python 3.4 and am writing the code in an IPython notebook.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#")
soup = BeautifulSoup(r.content)
text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
print(text)
['>> View All Results for ""', 'Enter multiple symbols separated by commas', 'London quotes now available', 'Interest rates on loans to jump', "Because federal student loans are tied to the 10-year Treasury note, CNBC's Sharon Epperson reports borrowers will see the impact of the rise in Treasury yields over the past year.", ' Congratulations, graduates, on your diploma. Now what about that $29,000 student loan debt? ', ' More than 70 percent of graduates will carry student debt into the real world, according to the Institute for College Access and Success. And the average debt is just shy of $30,000. ', ' But the news will get worse next week when interest rates on student loans are set to rise again. ', ' Though federal student loan rates are fixed for the life of the loan, these rates reset for new borrowers every July 1, thanks to legislation that ties the rates to the performance of the financial markets. ', ' The interest rate on federal Stafford loans will go from its current fixed rate of just under 4 percent to 4.66 percent for loans that are distributed between July 1 and June 30, 2015. ', ' Read MoreStudent loan problem an easy fix: Sen. Warren ', ' For graduate students, the rate on Stafford loans will rise from just over 5 percent to 6.21 percent. ', ' Direct PLUS Loans for graduates and parents are still the most expensive, with rates rising to 7.21 percent.', 'Which college major pays off most?', "CNBC's Sharon Epperson reports majoring in engineering is the most lucrative. ", " The increase in monthly federal student loan payments can add up quickly, but shouldn't be too burdensome for most students. For every $10,000 in loans, new borrowers will pay about $4 more a month based on a 10-year repayment period. ", " Read MoreWhy millennial women don't save for retirement ", ' Still, experts warn that this is only just the beginning. 
', ' "Federal student loan rates will continue to increase in the next few years and will likely hit the maximum rate caps which are as high as 10.5 percent for some loans," said Mark Kantrowitz, senior vice president and publisher of Edvisors.com. ', ' For sophomore student Samantha Cook, the decision to go to George Washington University was a big one financially. She says she had doubts about it. ', ' "My parents wanted to assure me that no matter what I picked, we\'d find a way to make it work," Cook said. Like most families, Cook and her parents are making it work by combining their household savings, scholarships and grants—and student loans. ', ' Read MoreCramer: Offset high cost of higher education ', ' Despite rising tuition and borrowing costs, the Cook family decided against Samantha transferring to an in-state university. ', ' Despite the debt load she is taking on, she said, "the value of a GW degree for me at least would be more valuable when looking for jobs later on." ', " —By CNBC's Sharon Epperson ", 'Hosting a yard sale may not be the most profitable way to get rid of your old junk.', 'Many Americans with debit cards tied to their checking accounts are still confused about how these programs work. ', "Here's how to avoid these deadly sins if you're contemplating or already in a divorce.", "The IRS offers a lot of help for students. Problem is, the educational tax breaks and how they work together -- or don't -- are confusing.", 'Get the best of CNBC in your inbox', 'Tips for home buyers that will help you find the right home for your bank account.', 'Complaints about movers are down. How to find the right one—and save.', "Forget bathing suit season. Why it's really time to join the gym. 
", 'Drivers might see lower gas prices this year, but smart shopping tactics could help them save even more.', 'Data is a real-time snapshot *Data is delayed at least 15 minutesGlobal Business and Financial News, Stock Quotes, and Market Data and Analysis', '© 2014 CNBC LLC. All Rights Reserved.', 'A Division of NBCUniversal']
To get the text of just the main article from the website in your link:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#")
soup = BeautifulSoup(r.content)
text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
print(text)
['\n Congratulations, graduates, on your diploma. Now what about that $29,000 student loan debt? \n More than 70 percent of graduates will carry student debt into the real world, according to the Institute for College Access and Success. And the average debt is just shy of $30,000. \n But the news will get worse next week when interest rates on student loans are set to rise again. \n Though federal student loan rates are fixed for the life of the loan, these rates reset for new borrowers every July 1, thanks to legislation that ties the rates to the performance of the financial markets. \n The interest rate on federal Stafford loans will go from its current fixed rate of just under 4 percent to 4.66 percent for loans that are distributed between July 1 and June 30, 2015. \n Read MoreStudent loan problem an easy fix: Sen. Warren \n For graduate students, the rate on Stafford loans will rise from just over 5 percent to 6.21 percent. \n Direct PLUS Loans for graduates and parents are still the most expensive, with rates rising to 7.21 percent.\n', '\n The increase in monthly federal student loan payments can add up quickly, but shouldn\'t be too burdensome for most students. For every $10,000 in loans, new borrowers will pay about $4 more a month based on a 10-year repayment period. \n Read MoreWhy millennial women don\'t save for retirement \n Still, experts warn that this is only just the beginning. \n "Federal student loan rates will continue to increase in the next few years and will likely hit the maximum rate caps which are as high as 10.5 percent for some loans," said Mark Kantrowitz, senior vice president and publisher of Edvisors.com. \n For sophomore student Samantha Cook, the decision to go to George Washington University was a big one financially. She says she had doubts about it. \n "My parents wanted to assure me that no matter what I picked, we\'d find a way to make it work," Cook said. 
Like most families, Cook and her parents are making it work by combining their household savings, scholarships and grants—and student loans. \n Read MoreCramer: Offset high cost of higher education \n Despite rising tuition and borrowing costs, the Cook family decided against Samantha transferring to an in-state university. \n Despite the debt load she is taking on, she said, "the value of a GW degree for me at least would be more valuable when looking for jobs later on." \n —By CNBC\'s Sharon Epperson \n']
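The key idea in the answer above is to restrict extraction to the article's container element instead of taking every paragraph on the page. As a stdlib-only sketch of the same approach (the "group" class name is taken from the CNBC markup above; other sites use different container classes, so treat it as an assumption):

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect text only from <p> tags nested inside a <div> of a given class."""
    def __init__(self, container_class='group'):
        super().__init__()
        self.container_class = container_class
        self.div_stack = []   # one bool per open <div>: is it the article container?
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            cls = dict(attrs).get('class') or ''
            self.div_stack.append(self.container_class in cls.split())
        elif tag == 'p' and any(self.div_stack):
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_stack:
            self.div_stack.pop()
        elif tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

html = ('<div class="group"><p>Article text.</p></div>'
        '<div class="sidebar"><p>Unrelated link.</p></div>')
parser = ArticleText()
parser.feed(html)
print(' '.join(parser.chunks))  # Article text.
```

The sidebar paragraph is skipped because it never sits inside a container div, which is exactly why the div-based answer above produces cleaner output than grabbing every p tag.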


pandas/dask csv multiple line read

I have a CSV like this:
name,sku,description
Bryce Jones,lay-raise-best-end,"Art community floor adult your single type. Per back community former stock thing."
John Robinson,cup-return-guess,Produce successful hot tree past action young song. Himself then tax eye little last state vote. Country down list that speech economy leave.
Theresa Taylor,step-onto,"Choice should lead budget task. Author best mention.
Often stuff professional today allow after door instead. Model seat fear evidence. Now sing opportunity feeling no season show."
That whole multi-line block is the value of the description column of the 3rd row.
But when I read it with
df = ddf.read_csv(
    file_path, blocksize=2000, engine="python", encoding='utf-8-sig',
    quotechar='"', delimiter='[,]', quoting=csv.QUOTE_MINIMAL
)
it parses this way:
['Bryce Jones', 'lay-raise-best-end', '"Art community floor adult your single type. Per back community former stock thing."']
['John Robinson', 'cup-return-guess', 'Produce successful hot tree past action young song. Himself then tax eye little last state vote. Country down list that speech economy leave.']
['Theresa Taylor', 'step-onto', '"Choice should lead budget task. Author best mention.']
['Often stuff professional today allow after door instead. Model seat fear evidence. Now sing opportunity feeling no season show."', None, None]
How can I read this correctly?
You can use a double linebreak between rows and a single linebreak inside text fields, and pandas will understand. So the csv will be:
name,sku,description
Bryce Jones,lay-raise-best-end,"Art community floor adult your single type. Per back community former stock thing."
John Robinson,cup-return-guess,Produce successful hot tree past action young song. Himself then tax eye little last state vote. Country down list that speech economy leave.
Theresa Taylor,step-onto,"Choice should lead budget task. Author best mention.
Often stuff professional today allow after door instead. Model seat fear evidence. Now sing opportunity feeling no season show."
And here is how you read it.
df = pd.read_csv(filepath) # you can keep other parameters if you want
The output is:
name sku \
0 Bryce Jones lay-raise-best-end
1 John Robinson cup-return-guess
2 Theresa Taylor step-onto
description
0 Art community floor adult your single type. Pe...
1 Produce successful hot tree past action young ...
2 Choice should lead budget task. Author best me...
Use \n where you need linebreaks.
name,sku,description
Bryce Jones,lay-raise-best-end,"Art community floor adult your single type. Per back community former stock thing."
John Robinson,cup-return-guess,Produce successful hot tree past action young song. Himself then tax eye little last state vote. Country down list that speech economy leave.
Theresa Taylor,step-onto,"Choice should lead budget task. Author best mention.\nOften stuff professional today allow after door instead. Model seat fear evidence. Now sing opportunity feeling no season show."
When reading, use Python's codecs library.
import codecs
df = pd.read_csv('../../data/stack.csv')
print(codecs.decode(df.iloc[2,2], 'unicode_escape'))
Output:
Choice should lead budget task. Author best mention.
Often stuff professional today allow after door instead. Model seat fear evidence. Now sing opportunity feeling no season show.
We had to use codecs.decode() because pandas escapes the \ character as \\, and decoding undoes that. Without print(), you will not see the linebreak, though.
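Note that the standard library's csv module, whose quoting rules pandas follows, already treats a newline inside a quoted field as part of the value rather than as a row separator. A minimal sketch:

```python
import csv
import io

# A quoted field may span several physical lines; csv.reader reassembles
# them into a single logical row.
data = 'name,sku,description\nTheresa Taylor,step-onto,"Line one.\nLine two."\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['Theresa Taylor', 'step-onto', 'Line one.\nLine two.']
```

So as long as multi-line values are quoted, no escaping scheme is strictly required for plain csv readers.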

NLTK - Python extract names from csv

I've got a CSV which contains articles' text in different rows.
Say we have column 1:
Hello i am John
Tom has got a Dog
... more text.
I'm trying to extract the first names and surnames from those texts, and I was able to do that if I copy and paste a single text into the code.
But I don't know how to read the csv in the code so that it processes the different texts in the rows, extracting name and surname.
Here is my code working with the text in it:
import operator, collections, heapq
import csv
import pandas
import json
import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1:  #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return person_list
text = """
M.F. Husain, Untitled, 1973, oil on canvas, 182 x 122 cm. Courtesy the Pundole Family Collection
In her essay ‘Worlding Asia: A Conceptual Framework for the First Delhi Biennale’, Arshiya Lokhandwala explores Gayatri Spivak’s provocation of ‘worlding’, which has been defined as imperialism’s epistemic violence of inscribing meaning upon a colonized space to bring it into the world through a Eurocentric framework. Lokhandwala extends this concept of worlding to two anti-cartographical terms: ‘de-worlding’, rejecting or debunking categories that are no longer useful such as the binaries of East-West, North-South, Orient-Occidental, and ‘re-worlding’, re-inscribing new meanings into the spaces that have been de-worlded to create one’s own worlds. She offers de-worlding and re-worlding as strategies for active resistance against epistemic violence of all forms, including those that stem from ‘colonialist strategies of imperialism’ or from ‘globalization disguised within neo-imperialist practices’.
Lokhandwala writes: Fourth World. The presence of Arshiya is really the main thing here.
Re-worlding allows us to reach a space of unease performing the uncanny, thereby locating both the object of art and the postcolonial subject in the liminal space, which prevents these categorizations as such… It allows an introspected view of ourselves and makes us seek our own connections, and look at ourselves through our own eyes.
In a recent exhibition on the occasion of the seventieth anniversary of India’s Independence, Lokhandwala employed the term to seemingly interrogate this proposition: what does it mean to re-world a country through the agonistic intervention of art and activism? What does it mean for a country and its historiography to re-world? What does this re-worlded India, in active resistance and a state of introspection, look like to itself?
The exhibition ‘India Re-Worlded: Seventy Years of Investigating a Nation’ at Gallery Odyssey in Mumbai (11 September 2017–21 February 2018) invited artists to select a year from the seventy years since the country’s independence that had personal import or resonated with them because of the significance of the events that occurred at the time. The show featured works that responded to or engaged with these chosen years. It captured a unique history of post-independent India told through the perspective of seventy artists. The works came together to collectively reflect on the history and persistence of violence from pre-independence to the present day and made reference to the continued struggle for political agency through acts of resistance, artistic and otherwise. Through the inclusion of subaltern voices, imagined geographies, particular experiences, solidarities and critical dissent, the exhibition offered counter-narratives and multiple histories.
Anita Dube, Missing Since 1992, 2017, wood, electrical wire, holders, bulbs, voltage stabilizers, 223 x 223 cm. Courtesy the artist and Gallery Odyssey
Lokhandwala says she had been thinking hard about an appropriate response to the seventy years of independence. ‘I wanted to present a new curatorial paradigm, a postcolonial critique of the colonisation and an affirmation of India coming into her own’, she says. ‘I think the fact that I tried to include seventy artists to [each take up] one year in the lifetime of the nation was also a challenging task to take on curatorially.’
Her previous undertaking ‘After Midnight: Indian Modernism to Contemporary India: 1947/1997’ at the Queens Museum in New York in 2015 juxtaposed two historical periods in Indian art: Indian modern art that emerged in the post-independence period from 1947 through the 1970s, and contemporary art from 1997 onwards when the country experienced the effects of economic liberalization and globalization. The 'India Re-Worlded' exhibition similarly presented art practices that emerged from the framework of postcolonial Indian modernity. It attempted to explore the self-reflexivity of the Indian artist as a postcolonial subject and, as Lokhandwala described in the curatorial note, the artists’ resulting ‘sense of agency and renewed connection with the world at large’. The exhibition included works by Progressive Artists' Group core members F.N. Souza, S.H. Raza, M.F. Husain and their peers Krishen Khanna, Tyeb Mehta and V.S. Gaitonde, presented under the year in which they were produced. Other important and pioneering pieces included work from Somnath Hore’s paper pulp print series Wounds (1970); a blowtorch on plywood work by abstractionist Jeram Patel, who was one of the founding members of Group 1890 ; and a video documenting one of Rummana Husain’s last performances.
The methodology of their display removed the didactic, art historical preoccupation with chronology and classification, instead opting to intersperse them amongst contemporary works. This fits in with Lokhandwala’s curatorial impulses and vision: to disrupt and resist single narratives, to stage dialogues and interactions between the works, to offer overlaps, intersections and nuances in the stories, but also in the artistic impetuses.
Jeram Patel, Untitled, 1970, blowtorch Fourht World on plywood, 61 x 61 cm. Courtesy the artist and Gallery Odyssey
The show opened with Jitish Kallat’s Death of Distance (2006), then we have Arshiya, which through lenticular prints presented two overlaid found texts from 2005 and 2006. One was a harrowing news story of a twelve-year-old Indian girl committing suicide after her mother tells her she cannot afford one rupee – two US cents – for a school meal. The other one was a news clipping in which the head of the state-run telecommunications company announces a new one-rupee-per-minute tariff plan for interstate phone calls and declares the scheme as ‘the death of distance’. The images offer two realities that are distant from and at odds with each other. They highlight an economic disparity heightened by globalization. A rupee coin, enlarged to a human scale and covered in black lead, stood poised on the gallery floor in front of the prints.
Bose Krishnamachari chose 1962, the year of his birth, to discuss the relationship between memory and age. As a visual representation of the country’s past through a timeline, within which he situated his own identity-questioning experiences as an artist, his work epitomized the themes and intentions of the exhibition. In Shilpa Gupta’s single channel video projection 100 Hand drawn Maps of India (2007–8) ordinary Indian people sketch outlines of the country from memory. The subjective maps based on the author’s impression and perception of space show how each person sees the country and articulates its borders. The work seems to ask, what do these incongruent representations reveal about our collective identities and our ideas about nationhood?
The repetition of some of the years selected, or even the absence of certain years, suggested that the parameters set by the curatorial concept sought to guide rather than clamp down on. This allowed greater freedom for the artists and curator, and therefore more considered and wide responses.
Surekha’s photographic series To Embrace (2017) celebrated the Chipko tree-hugging movement that originated on 25 March 1974, when 27 women from Reni village in Uttar Pradesh in northern India staged a self-organised, non-violent resistance to the felling of trees by clinging to them and linking arms around them. The photographs showed women embracing the branches of the giant, 400-year-old Dodda Alada Mara (Big Banyan Tree) in rural Bengaluru – paying a homage to both the pioneering eco-feminist environmental movement and the grand old tree.
Anita Dube’s Missing Since 1992 (2017) hung from the ceiling like a ghost of a terrible, dark past. Its electrical wires and bulbs outlined a sombre dome to represent the demolition of the Babri Masjid on 6 December 1992, which Dube calls ‘the darkest day I have experienced as a citizen’. This piece was one of several works in the exhibition that dealt with this event and the many episodes of communal riots that followed. These works document a decade when the country witnessed economic reform and growth but also the rise of a religious right-wing.
Riyas Komu, Fourth World, 2017, rubber and metal, 244 x 45 cm each. Courtesy the artist and Gallery Odyssey
Near the end of the exhibition, Riyas Komu’s sculptural installation Fourth World (2017) alerted us to the divisive forces that are threatening to dismantle the ethical foundations of the Republic symbolized by its official emblem, the Lion Capital – a symbol seen also on the blackened rupee coin featured in Kallat’s work – and in a way rounded off the viewing experience.
The seventy works that attempted to represent seventy years of the country’s history built a dense and complicated network of voices and stories, and also formed a cross section of the art emerging during this period. Although the show’s juxtaposition of modern and contemporary art made it seem like an extension of the themes presented in the curator’s previous exhibition at the Queens Museum, here the curatorial concept made the process of staging the exhibition more democratic blurring the sequence of modern and contemporary Indian art. Furthermore, the multi-pronged curatorial intentions brought renewed criticality to the events of past and present, always underscoring the spirit of resistance and renegotiation as the viewer could actively de-world and re-world.
"""
names = get_human_names(text)
print("LAST, FIRST")
namex = []
for name in names:
    last_first = HumanName(name).last + ' ' + HumanName(name).first
    print(last_first)
    namex.append(last_first)
print(namex)
print('Saving the data to the json file named Names')
try:
    with open('Names.json', 'w') as outfile:
        json.dump(namex, outfile)
except Exception as e:
    print(e)
So I would like to remove the hard-coded text from the code and have the code process the texts from my csv instead.
Thanks a lot :)
CSV stands for Comma Separated Values and is a text format used to represent tabular data in plain text. Commas are used as column separators and line breaks as row separators. Your string does not look like a real csv file. Never mind the extension; you can still read your text file like this:
with open('your_file.csv', 'r') as f:
    my_text = f.read()
Your text file is now available as my_text in the rest of your code.
Pandas has a read_csv command:
yourText = pandas.read_csv("csvFile.csv")
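To wire this into the asker's code, one sketch (the helper name process_csv, the column index, and the file layout are assumptions; extract stands in for get_human_names) is to loop over the rows and apply the extraction function to each text cell:

```python
import csv

def process_csv(path, extract, column=0):
    """Apply an extraction function (e.g. get_human_names) to one column
    of every row in a CSV file and collect the results."""
    results = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            if row:                       # skip blank lines
                results.extend(extract(row[column]))
    return results
```

Then names = process_csv('articles.csv', get_human_names) would replace the hard-coded text variable.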

BeautifulSoup page number

I'm trying to extract text from the online version of The Wealth of Nations and create a data frame where each observation is a page of the book. I do it in a roundabout way, trying to imitate something similar I did in R, but I was wondering if there was a way to do this directly in BeautifulSoup.
What I do is first get the entire text from the page:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm')
soup = BeautifulSoup(r.text,'html.parser')
But from here on, I'm just working with regular expressions and the text. I find the beginning and end of the book text:
beginning = [a.start() for a in re.finditer(r"BOOK I\.",soup.text)]
beginning
end = [a.start() for a in re.finditer(r"FOOTNOTES",soup.text)]
book = soup.text[beginning[1]:end[0]]
Then I remove the carriage returns and new lines and split on strings of the form "[Pg digits]" and put everything into a pandas data frame.
book = book.replace('\r', ' ').replace('\n', ' ')
l = re.compile(r'\[[Pp]g\s?\d{1,3}\]').split(book)
df = pd.DataFrame(l, columns=['col1'])
df['page'] = range(2, df.shape[0] + 2)
There are indicators in the HTML code for page numbers of the form <span class='pagenum'><a name="Page_vii" id="Page_vii">[Pg vii]</a></span>. Is there a way I can do the text extraction in BeautifulSoup by searching for text between these "spans"? I know how to search for the page markers using findall, but I was wondering how I can extract text between those markers.
To get the page markers and the text associated with it, you can use bs4 with re. In order to match text between two markers, itertools.groupby can be used:
from bs4 import BeautifulSoup as soup
import requests
import re
import itertools
page_data = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm').text
final_data = [(i.find('a', {'name': re.compile(r'Page_\w+')}), i.text) for i in soup(page_data, 'html.parser').find_all('p')]
new_data = [list(b) for a, b in itertools.groupby(final_data, key=lambda x: bool(x[0]))][1:]
final_data = {new_data[i][0][0].text: '\n'.join(c for _, c in new_data[i+1]) for i in range(0, len(new_data), 2)}
Output (sample; the actual result is too long for the SO answer format):
{'[Pg vi]': "'In recompense for so many mortifying things, which nothing but truth\r\ncould have extorted from me, and which I could easily have multiplied to a\r\ngreater number, I doubt not but you are so good a christian as to return good\r\nfor evil, and to flatter my vanity, by telling me, that all the godly in Scotland\r\nabuse me for my account of John Knox and the reformation.'\nMr. Smith having completed, and given to the world his system of\r\nethics, that subject afterwards occupied but a small part of his lectures.\r\nHis attention was now chiefly directed to the illustration of\r\nthose other branches of science which he taught; and, accordingly, he\r\nseems to have taken up the resolution, even at that early period, of\r\npublishing an investigation into the principles of what he considered\r\nto be the only other branch of Moral Philosophy,—Jurisprudence, the\r\nsubject of which formed the third division of his lectures. At the\r\nconclusion of the Theory of Moral Sentiments, after treating of the\r\nimportance of a system of Natural Jurisprudence, and remarking that\r\nGrotius was the first, and perhaps the only writer, who had given any\r\nthing like a system of those principles which ought to run through,\r\nand be the foundation of the law of nations, Mr. Smith promised, in\r\nanother discourse, to give an account of the general principles of law\r\nand government, and of the different revolutions they have undergone\r\nin the different ages and periods of society, not only in what concerns\r\njustice, but in what concerns police, revenue, and arms, and whatever\r\nelse is the object of law.\nFour years after the publication of this work, and after a residence\r\nof thirteen years in Glasgow, Mr. Smith, in 1763, was induced to relinquish\r\nhis professorship, by an invitation from the Hon. Mr. Townsend,\r\nwho had married the Duchess of Buccleugh, to accompany the\r\nyoung Duke, her son, in his travels. 
Being indebted for this invitation\r\nto his own talents alone, it must have appeared peculiarly flattering\r\nto him. Such an appointment was, besides, the more acceptable,\r\nas it afforded him a better opportunity of becoming acquainted with\r\nthe internal policy of other states, and of completing that system of\r\npolitical economy, the principles of which he had previously delivered\r\nin his lectures, and which it was then the leading object of his studies\r\nto perfect.\nMr. Smith did not, however, resign his professorship till the day\r\nafter his arrival in Paris, in February 1764. He then addressed the\r\nfollowing letter to the Right Honourable Thomas Millar, lord advocate\r\nof Scotland, and then rector of the college of Glasgow:—", '[Pg vii]': "His lordship having transmitted the above to the professors, a meeting\r\nwas held; on which occasion the following honourable testimony\r\nof the sense they entertained of the worth of their former colleague\r\nwas entered in their minutes:—\n'The meeting accept of Dr. Smith's resignation in terms of the above letter;\r\nand the office of professor of moral philosophy in this university is therefore\r\nhereby declared to be vacant. The university at the same time, cannot\r\nhelp expressing their sincere regret at the removal of Dr. Smith, whose distinguished\r\nprobity and amiable qualities procured him the esteem and affection\r\nof his colleagues; whose uncommon genius, great abilities, and extensive\r\nlearning, did so much honour to this society. His elegant and ingenious\r\nTheory of Moral Sentiments having recommended him to the esteem of men\r\nof taste and literature throughout Europe, his happy talents in illustrating\r\nabstracted subjects, and faithful assiduity in communicating useful knowledge,\r\ndistinguished him as a professor, and at once afforded the greatest pleasure,\r\nand the most important instruction, to the youth under his care.'\nIn the first visit that Mr. 
Smith and his noble pupil made to Paris,\r\nthey only remained ten or twelve days; after which, they proceeded\r\nto Thoulouse, where, during a residence of eighteen months, Mr. Smith\r\nhad an opportunity of extending his information concerning the internal\r\npolicy of France, by the intimacy in which he lived with some of\r\nthe members of the parliament. After visiting several other places in\r\nthe south of France, and residing two months at Geneva, they returned\r\nabout Christmas to Paris. Here Mr. Smith ranked among his\r\nfriends many of the highest literary characters, among whom were\r\nseveral of the most distinguished of those political philosophers who\r\nwere denominated Economists."}
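The itertools.groupby step is the heart of this answer and easy to miss in the one-liners; here is the same trick on plain strings instead of tags (a toy illustration, not the site's actual markup):

```python
import itertools

# Markers and content interleave in document order; groupby splits them
# into alternating runs of markers and runs of paragraphs.
tokens = ['[Pg vi]', 'para 1', 'para 2', '[Pg vii]', 'para 3']
runs = [list(g) for is_marker, g in
        itertools.groupby(tokens, key=lambda t: t.startswith('[Pg'))]
# runs == [['[Pg vi]'], ['para 1', 'para 2'], ['[Pg vii]'], ['para 3']]
pages = {runs[i][0]: '\n'.join(runs[i + 1]) for i in range(0, len(runs), 2)}
print(pages)  # {'[Pg vi]': 'para 1\npara 2', '[Pg vii]': 'para 3'}
```

Each page marker then keys the text that follows it up to the next marker, which is exactly the page-to-text mapping the question asks for.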

How do I remove 2 consecutive newlines from csv file in Python?

I tried this code:
import re
re.sub('\r\n\r\n','','Summary_csv.csv')
It did not do anything. As in, it did not even touch the file (there is no modification to the date and time of the file after running this code). Could anyone please explain why?
Then I tried this:
import re
output = open("Summary.csv","w", encoding="utf8")
input = open("Summary_csv.csv", encoding="utf8")
for line in input:
    output.write(re.sub('\r\n\r\n','', line))
input.close()
output.close()
This one does something to the file, as in the modified data and time in the file changes after I run this code, but it does not remove the consecutive newlines, and the output is the same as the original file.
EDIT: This is a small sample from the original csv file:
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary.
Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)
"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.
The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.
Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)
"
I want the output to be the following:
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary. Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)"
The answer to your question is that re.sub is being applied to the string 'Summary_csv.csv', not to the contents of a file. Its third argument is the string to search; it performs the substitution on that string and returns a new string, which your code simply discards. No file is ever opened.
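A quick way to see this in isolation (the string below is just the filename from the question):

```python
import re

# re.sub searches the string you pass it and returns a NEW string;
# it never opens or modifies a file on disk.
result = re.sub('\r\n\r\n', '', 'Summary_csv.csv')
print(result)  # the filename contains no '\r\n\r\n', so it comes back unchanged
```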
In the second piece of code, you open the file and read it one line at a time. This means that no line will ever contain two newlines. Two newlines will result in two consecutive lines being returned from the input file with the second line being empty.
To get rid of the extra new lines, just test for a blank line and don't write it to the output. Calling line.strip() on an empty line (one containing only whitespace characters) will return an empty string which will evaluate to False in an if statement. If line.strip() isn't empty, then write it to your output file.
output = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
for line in infile:
    if line.strip():
        output.write(line)
infile.close()
output.close()
Note: Python treats text files in a platform-independent way and converts line endings to '\n' by default, so testing for '\r\n' wouldn't work even without the other problems. If you really want the endings to be '\r\n', you must specify newline='\r\n' when you call open() for the input file. See the documentation on https://docs.python.org/3/library/functions.html#open for a full explanation.
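A minimal sketch of that translation behaviour, using a throwaway file name (demo.txt is purely illustrative; newline='' is another way to disable translation and is used here for brevity):

```python
# Write raw '\r\n' endings in binary mode, then read the file back two ways.
with open("demo.txt", "wb") as f:
    f.write(b"a\r\n\r\nb\r\n")

# Default text mode: universal newlines translate '\r\n' to '\n',
# so a test for '\r\n' inside a line can never succeed.
with open("demo.txt") as f:
    lines = f.readlines()
print(lines)  # ['a\n', '\n', 'b\n']

# newline='' disables translation and preserves the '\r\n' endings.
with open("demo.txt", newline='') as f:
    raw = f.read()
print(repr(raw))  # 'a\r\n\r\nb\r\n'
```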
Part II
With the example input and output files posted by the OP, it appears that the problem was more complex than stripping extra newlines. The following code reads the input file, finds text between pairs of " characters and combines all of the lines onto a single line in the output file. Extra newlines not inside " are sent to the output file unaltered.
import re
outfile = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
text = infile.read()
text = re.sub('\n\n', '\n', text) #remove double newlines
for p in re.split('(\".+?\")', text, flags=re.DOTALL):
    if p: #skip empty matches
        if p.strip(): #this is a paragraph of text and should be a line
            p = p[1:-1] #get everything between the quotes
            p = p.strip() #remove leading and trailing whitespace
            p = re.sub('\n+', ' ', p) #replace any remaining \n with a space
            p = '"' + p + '"\n' #restore the " around the paragraph and add a newline
        outfile.write(p)
infile.close()
outfile.close()
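The same re.split idea can be sketched on an in-memory string instead of a file, which makes the transformation easier to see (the sample text below is made up):

```python
import re

# two quoted multi-line "paragraphs" separated by a bare newline
text = '"first line\nsecond line\n"\n"another\nparagraph\n"\n'
out = []
for p in re.split('(\".+?\")', text, flags=re.DOTALL):
    if p.strip():  # only the quoted chunks survive this test
        body = re.sub('\n+', ' ', p.strip('"').strip())
        out.append('"' + body + '"')
print(out)  # ['"first line second line"', '"another paragraph"']
```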

Why does my webcrawler not follow into the next link containing keywords

I have written a simple webcrawler that will eventually follow only news links to scrape the article text into a database. I am having problems actually following the links from the source url. This is the code so far:
import urlparse
import mechanize

url = "https://news.google.co.uk"

def spider(root, steps):
    urls = [root]
    visited = [root]
    counter = 0
    while counter < steps:
        step_url = scrape(urls)
        urls = []
        for u in step_url:
            if u not in visited:
                urls.append(u)
                visited.append(u)
        counter += 1
    return visited

def scrape(root):
    result_urls = []
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Chrome')]
    for url in root:
        try:
            br.open(url)
            keyWords = ['news','article','business', 'world']
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                result_urls.append(newurl)
                [newslinks for newslinks in result_urls if newslinks in keyWords]
                print newslinks
        except:
            print "scrape error"
    return result_urls

print spider(url, 2)
Edit: NLTK
for text in (parse_links_text(get_links(url), d)):
    tokenized = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokenized)
    namedEnt = nltk.ne_chunk(tagged, binary=True)
    entities = re.findall(r'NE\s(.*?)/', str(namedEnt))
    descriptives = re.findall(r'\(\'(\w*)\',\s\'JJ\w?\'', str(tagged))
then add to database after this.
Mechanize is not the best tool for what you want. The code below gets all the links and pulls the main text from the linked pages using BeautifulSoup. We can use a dict to map each news site to the correct css selector, with a regex to pull the site key out of the link so we can look up the right selector for that page:
url = "https://news.google.co.uk"

import requests
import re
from bs4 import BeautifulSoup

def get_links(start):
    cont = requests.get(start).content
    soup = BeautifulSoup(cont, "lxml")
    keys = ['news','article','business', 'world']
    # links are all in the a tag inside the esc-layout-table table
    # where the a tag class is article
    return [a["url"] for a in soup.select(".esc-layout-table a.article")
            if any(k in a["url"] for k in keys)]
def parse_links_text(links, css_d):
    # use a regex to find out what site the link points to
    # so we can pull the appropriate selector from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.")
    for link in links:
        print(link)
        cont = requests.get(link).content
        soup = BeautifulSoup(cont)
        css = r.search(link).group()
        p = [p.text for p in soup.select(css_d[css])]
        yield p

# map each site to its correct css selector to pull the main text
d = {"dailymail.": "p.mol-para-with-font", "telegraph.": "#mainBodyArea",
     "bbc.": "div.story-body p", "independent.": "div.text-wrapper p"}

for text in parse_links_text(get_links(url), d):
    print(text)
That pulls the main body text from the articles on the Telegraph, Daily Mail, BBC and Independent links. There is no magic bullet where one tag will get all the data you want; you will have to add more selectors for other sites, or tweak the existing ones if the html changes.
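The regex-keyed lookup in parse_links_text can be sketched offline, without any network requests (the link below is one from the sample output):

```python
import re

# the same site-to-selector mapping used above
d = {"dailymail.": "p.mol-para-with-font", "telegraph.": "#mainBodyArea",
     "bbc.": "div.story-body p", "independent.": "div.text-wrapper p"}
r = re.compile(r"telegraph\.|bbc\.|dailymail\.|independent\.")

link = "http://www.bbc.co.uk/news/uk-politics-35855616"
key = r.search(link).group()   # the regex finds 'bbc.' inside the link
print(d[key])                  # 'div.story-body p'
```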
A snippet of the output:
http://www.telegraph.co.uk/news/politics/12199759/The-IDS-explosion-could-do-untold-damage-to-David-Camerons-reputation.html
[u' In a sense, David Cameron owes his job to Iain Duncan Smith. Without the abject failure of Mr Duncan Smith\u2019s leadership between 2001 and 2003, the Conservatives might not have reached the collective conclusion that a traditional Tory focus on issues such as Europe would not win an election and realise that, to use a Cameron phrase, they had to change to win. ', u' Mr Cameron\u2019s leadership is easily understood as a political reaction to Mr Duncan Smith\u2019s, but the two have more in common than is easily visible. Intellectually, there is a continuity between the two leaderships that is not often realised. Even as he was failing dismally as leader, Mr Duncan Smith was saying things about the party that Mr Cameron would endorse today. ', u'\n', u"Huge respect for IDS. Welfare reform must b done the right way. The electorate will not trust us again if we don't look after the vulnerable", u' So in his awful 2002 \u201cquiet man\u201d speech to the Conservative conference in Bournemouth, we find IDS outlining a vision of \u201ccompassionate conservatism\u201d, declaring: \u201cWe believe that the privileges of the few must be turned into the opportunities of the many.\u201d We also hear him telling the Tory faithful (and you had to be devoted to be at that miserable gathering) to acknowledge that many voters felt bitterly angry about the party\u2019s last spell in government: \u201cAll of us here want to remember the good things we did and there were many. But beyond this hall, people too often remember the hurt we caused and the anger they felt,\u201d he said. ', u' That is a decent exposition of what the Cameron team would, four years later, describe as the Tory \u201cbrand problem\u201d: the perception among some voters that the party governed for the privileged few at the direct expense of the less fortunate many. 
Changing that perception has been the most consistent objective in Mr Cameron\u2019s politics, a near-constant in a career whose successes owe much to his willingness to shift strategy and tactics according to circumstance. ', u' But \u201cdetoxifying the Tory brand\u201d is not, whatever his critics may say, simply a marketing exercise for a PM who used to work in PR. In another similarity with Mr Duncan Smith, Mr Cameron is a believer. People close to David Cameron know that what really drives and excites him is not reforming the EU (whatever he says in public, the topic bores him) or balancing the budget. Those things may dominate his Government\u2019s agenda, but friends say what raises his political passion is social reform \u2013 ensuring that people born without his privileges can share a little of the riches he has known all his life. ', u'\n', u' The origins of this feeling are hard to pinpoint with certainty, but those who have known him longest credit both his wife Samantha and their tragically short-lived first child, Ivan, with opening the eyes of a previously conventionally upper-class Conservative to the reality of life for those who suffer misfortune. ', u' So when he was, to everyone\u2019s surprise including his own, re-elected with a majority last year, the first thing Mr Cameron said was that he wished to pursue a One Nation agenda, to govern for rich and poor alike, and to make it easier for the latter to become the former. That agenda might have been recently eclipsed by Europe, and often reduced to an empty slogan, but that is where the Prime Minister\u2019s heart truly lies. For evidence, consider the series of speeches Mr Cameron gave in the early weeks of this year, focusing on social mobility, racism, and equal opportunities. ', u' I was among those who thought the speeches mostly good and impressive, though many others, including a fair few Conservatives, disagreed and took a more cynical view. 
But both admirers and critics alike would, I think, concede that Mr Cameron was genuine in his talk of social reform. And this is the agenda that Mr Duncan Smith is threatening with his softly spoken, hard-hitting words on The Andrew Marr Show \u2013 which were, arguably, more inflammatory than his incendiary resignation letter. ', u'\n', u'Goodbye, Iain Duncan Smith. Hello, Stephen Crabb. pic.twitter.com/fs5gscKCh3', u' Mr Duncan Smith says that Mr Cameron is not, in fact, seeking to make Britain one nation. He says the policies overseen by the Prime Minister \u2013 and let\u2019s remember that the Prime Minister, no matter how mighty he lets his Chancellor of the Exchequer become, is ultimately responsible for policy \u2013 are in the interests of the better-off and harmful to those without means or opportunity. More grave yet, he suggests his leader is indifferent to causing suffering among the poor and weak: \u201cIt just looks like we see this as a pot of money, that it doesn\u2019t matter because they don\u2019t vote for us.\u201d ', u' Coming from the man who spent six years running welfare policy, that is a potentially devastating assessment in political terms. Mr Duncan Smith makes a case for the prosecution of Mr Cameron\u2019s administration that Jeremy Corbyn could not fault. ', u'\n', u' But it is also intensely personal. Mr Duncan Smith is challenging the Prime Minister on the turf that Mr Cameron is most committed to claiming for his own. Can you really hope to go down in history as a great social-reforming premier when, in the assessment of your own welfare secretary, you have chosen to help the rich and fortunate by harming the poor and vulnerable? In this context, it is no surprise that Mr Cameron has reacted to Mr Duncan Smith\u2019s departure with true rage. (A hot temper and tendency to profanity are also things he shares with IDS, as I and several others can attest.) 
', u' Amid recent events, much attention is rightly being paid to the severe damage the IDS explosion has done George Osborne\u2019s already damaged hopes of the leadership. But for Mr Cameron, this is about something else, something even more important than ambition. It is about purpose. ', u' There are already many reasons for the Prime Minister to want to win his EU referendum and run his government for a few more years. But he now has another. If Mr Cameron cannot make good on his fine words about One Nation and social mobility and equality of opportunity, and thus disprove the charges Mr Duncan Smith levels against him, then his life in politics has all been for nothing. ', u'\n\nIDS career\n']
http://www.bbc.co.uk/news/uk-politics-35855616
[u'Iain Duncan Smith has warned that the government risks dividing society, in his first interview since resigning as work and pensions secretary.', u'He attacked the "desperate search for savings" focused on benefit payments to people who "don\'t vote for us".', u'And he told the BBC\'s Andrew Marr his "painful" decision was "not personal" against Chancellor George Osborne.', u'Downing Street said it was sorry to see Iain Duncan Smith go but was determined to help "everyone in our society".', u'BBC political correspondent Alan Soady said Mr Duncan Smith\'s interview - which followed his resignation over cuts to disability benefits on Friday - was an "absolutely blistering attack".', u'He added: "This was not just about his objections to one change in disability benefit, he was questioning the fundamental principles underpinning the government."', u'Mr Duncan Smith told the BBC he had supported a consultation on the changes to Personal Independence Payments but had come under "massive pressure" to deliver the savings ahead of last week\'s Budget.', u'The way the cuts were presented in the Budget had been "deeply unfair", he said, because they were "juxtaposed" with tax cuts for the wealthy.', u'He criticised the "arbitrary" decision to lower the welfare cap after the general election and suggested the government was in danger of losing "the balance of the generations", expressing his "deep concern" at a "very narrow attack on working-age benefits" while also protecting pensioner benefits.', u'If the focus on the working-age benefit budget continued, he said, "it just looks like we see this as a pot of money, that it doesn\'t matter because they don\'t vote for us".', u'Mr Duncan Smith, who said he felt he had become "semi-detached" from government, said the Conservatives had to return to being a party "that cares about even those who do not vote for us".', u'He said he cared "passionately" about "people who don\'t get the choices my children get" and "bringing 
people back in to an arena where we play daily but they do not".', u'He suggested the government was in "danger of drifting in a direction that divides society rather than unites it, and that, I think, is unfair".', u'In his interview, Mr Duncan Smith gave his version of a deteriorating relationship with the government, saying he had considered resigning last year and had "long-running" concerns about cuts imposed since May\'s general election.', u'He said the disability benefit cuts should have been part of a "much wider programme" - but after Christmas "pressure began to grow" to rush a consultation so they could feature in Wednesday\'s Budget.', u'Asked why he had not spoken out when the measures were presented to cabinet, he said he "sat silently" as he "realised the full state of what was happening" with tax cuts featuring elsewhere in the Budget.', u'After thinking "long and hard", he said he agreed to write to MPs to reassure them over the disability cuts, saying "it\'s not what it sounds like in the Budget".', u'But he said he realised in the following two days "there was no way I would able to stop this process" and resigned on Friday evening.', u'Alan Soady, BBC political correspondent', u'What pushes a cabinet minister to resign so sensationally?', u"Its origins lie partly in the rapid shift of the economic gloom-o-meter. Forecasts in December's Autumn Statement were upbeat, predicting more money rolling into the Treasury.", u'By Wednesday\'s Budget, the sunshine had turned into "storm clouds". 
They blew over Iain Duncan-Smith\'s department because welfare changes of recent years have so far brought in nothing like the savings originally projected.', u'IDS signed off on tightening the rules around Personal Independence Payments five days before the Budget, but now says he would rather have been allowed to wait so he could see who were the winners and losers.', u"As the row gathered momentum after the Budget, Education Secretary Nicky Morgan suggested the plans weren't set in stone.", u"Mr Duncan Smith's people disagreed, firmly believing the proposals were final. The following day, Downing Street suggested a U-turn was on the cards.", u"For IDS, it was the final straw, believing he was going to carry the can for a policy he claims he'd been bounced into prematurely. Others question his account - asking why he signed off the proposal in the first place if he was so against it.", u'Mr Duncan Smith spoke of his "love" for the Conservative Party and described claims he was trying to undermine David Cameron as "nonsense", saying he had had a "robust" conversation with the PM after telling him of his resignation.', u'Asked whether Mr Osborne would make a good prime minister, he added: "If he was to stand and if he was elected by the electorate, which is not just me it is everybody else, I would hope that he would."', u'A Number 10 spokesman said: "We are sorry to see Iain Duncan Smith go, but we are a \'one nation\' government determined to continue helping everyone in our society have more security and opportunity, including the most disadvantaged.', u'"That means we will deliver our manifesto commitments to make the welfare system fairer, cut taxes and ensure we have a stable economy by controlling welfare spending and living within our means."', u'He said more people were in work under this government with fewer "trapped" on unemployment benefits.', u'Former Lib Dem minister David Laws told Andrew Marr divisions between Mr Osborne and Mr Duncan Smith over 
welfare had been a "running sore throughout the last parliament".', u'He said: "George Osborne, I think it\'s fair to say, did regard the welfare budget as something of a cash cow to be squeezed in order to help to deliver deficit reduction. Iain Duncan Smith had a different view."', u"Mr Duncan Smith's resignation has divided his former ministerial team at the DWP.", u'Pensions minister Baroness Ros Altmann attacked his tenure, describing him as "exceptionally difficult" to work for, and accused him of using his resignation "to do maximum damage to the party leadership" in order to support the campaign to leave the EU.', u'But her fellow DWP minister Shailesh Vara said he was "surprised" at Baroness Altmann\'s comments, saying: "Ros\'s recollection does not accord with mine and I\'m sorry that this has all happened."', u'Disabilities minister Justin Tomlinson said the former secretary of state had "always conducted himself in a professional, dedicated and determined manner", while employment minister Priti Patel told BBC Radio 5 live it had been a "privilege" to work for him.', u'Owen Smith, Labour\'s welfare spokesman, said Mr Duncan Smith had been "very honest in explaining how George Osborne could have taken different choices" and had revealed "the fundamental unfairness at the heart of government policy".']
You could of course just use p = [p.text for p in soup.select("p")] to select all the paragraph text, but that will pull in a lot of data you don't want. If you are only interested in certain sites, you can also filter on whether the regex actually finds a key from the css_d dict, using something like the following:
for link in links:
    cont = requests.get(link).content
    soup = BeautifulSoup(cont)
    css = r.search(link)
    if not css:
        continue
    css = css.group()
    yield [p.text for p in soup.select(css)]
As discussed in the comments, lxml is a great tool for flexibility. To get the sections we can use the following code:
from urlparse import urljoin
from lxml.etree import fromstring, HTMLParser
import requests

url = "https://news.google.co.uk"

def get_sections(start, sections):
    '''Pulls the links for each section we pass in, i.e. World, Business etc...'''
    cont = requests.get(start).content
    xml = fromstring(cont, HTMLParser())
    # section names live in span tags with the class section-name
    secs = xml.xpath("//span[@class='section-name']")
    for sec in secs:
        _sec = sec.text.rsplit(None, 1)[0].lower().rstrip(".")
        if _sec in sections:
            yield _sec, urljoin(url, sec.xpath(".//parent::a/@href")[0])

def get_section_links(sec_url):
    '''Get all links from an individual section.'''
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    seen = set()
    for url in xml.xpath("//div[@class='section-stream-content']//a/@url"):
        if url not in seen:
            yield url
            seen.add(url)

# set of sections we want
s = {'business', 'world', "sports", "u.k"}
for sec, link in get_sections(url, s):
    for sec_link in get_section_links(link):
        print(sec, sec_link)
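The order-preserving de-duplication inside get_section_links is a standard seen-set generator; an offline sketch with made-up stand-in URLs:

```python
def dedup(urls):
    # yield each item once, in first-seen order
    seen = set()
    for u in urls:
        if u not in seen:
            yield u
            seen.add(u)

print(list(dedup(["a", "b", "a", "c", "b"])))  # ['a', 'b', 'c']
```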
If we run the code above we get all the links from each section; a very small snippet for each section is below, and there is actually a considerable number of links returned:
(u'world', 'http://www.theguardian.com/commentisfree/2016/mar/21/new-york-millionaires-who-want-taxes-raised')
(u'world', 'http://www.abc.net.au/news/2016-03-22/berg-turnbull%27s-only-real-option-was-bluff-and-bravado/7264350')
(u'world', 'http://www.swissinfo.ch/eng/reuters/australian-pm-takes-bold-gamble--sets-in-motion-july-2-poll/42037074')
(u'world', 'https://www.washingtonpost.com/news/checkpoint/wp/2016/03/21/these-are-the-new-u-s-military-bases-near-the-south-china-sea-china-isnt-impressed/')
(u'world', 'http://www.reuters.com/article/southchinasea-china-usa-idUSL3N16T3BH')
(u'world', 'http://atimes.com/2016/03/philippine-election-question-marks-sow-panic-in-south-china-sea/')
(u'world', 'http://www.manilatimes.net/what-if-china-attacks-bases-used-by-america/251946/')
(u'world', 'http://www.arabnews.com/world/news/898816')
(u'world', 'http://macaudailytimes.com.mo/koreas-seoul-north-korea-fires-five-short-range-projectiles.html')
(u'world', 'http://gulftoday.ae/portal/cb0e2530-0769-411d-9622-2e991191656b.aspx')
(u'world', 'http://38north.org/2016/03/aabrahamian032116/')
(u'u.k', 'http://www.irishnews.com/news/2016/03/22/news/judge-tells-madonna-and-richie-to-settle-rocco-dispute-458929/')
(u'u.k', 'http://www.marilynstowe.co.uk/2016/03/21/judge-urges-amicable-resolution-in-madonna-dispute-over-son/')
(u'u.k', 'http://www.mercurynews.com/celebrities/ci_29666212/judge-tells-madonna-and-guy-ritchie-get-it')
(u'u.k', 'http://www.telegraph.co.uk/news/celebritynews/madonna/12199922/Madonnas-UK-court-fight-with-Guy-Ritchie-over-son-Rocco-can-end-judge-rules.html')
(u'u.k', 'http://www.pbo.co.uk/news/boaty-mcboatface-leading-public-vote-to-name-200m-polar-research-ship-28429')
(u'u.k', 'http://www.theguardian.com/environment/shortcuts/2016/mar/21/from-bell-end-boaty-mcboatface-trouble-letting-public-name-things')
(u'u.k', 'http://www.independent.co.uk/news/uk/boaty-mcboatface-debacle-shows-the-perils-of-crowdsourcing-opinion-from-hooty-mcowlface-to-mr-a6944801.html')
(u'u.k', 'http://www.sacbee.com/news/nation-world/world/article67322252.html')
(u'u.k', 'http://www.westerndailypress.co.uk/Jury-discharged-manslaughter-case-Thomas-Orchard/story-28964162-detail/story.html')
(u'u.k', 'http://www.exeterexpressandecho.co.uk/Breaking-Thomas-Orchard-manslaughter-trial-jury/story-28963859-detail/story.html')
(u'u.k', 'http://www.theguardian.com/uk-news/2016/mar/21/thomas-orchard-trial-jury-discharged-judge-halts-proceedings')
(u'u.k', 'http://www.ft.com/cms/s/0/0bf3e966-ef57-11e5-9f20-c3a047354386.html')
(u'u.k', 'http://www.theweek.co.uk/london-mayor-election-2016/62681/london-mayor-election-2016-whos-in-the-running-as-starting-gun')
(u'business', 'https://uk.finance.yahoo.com/news/companies-may-soon-stop-reporting-162707837.html')
(u'business', 'http://www.theweek.co.uk/70785/why-youre-about-to-stop-getting-quarterly-reports-on-your-investments')
(u'business', 'http://uk.reuters.com/article/uk-starwood-hotels-m-a-marriott-idUKKCN0WN142')
(u'business', 'http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I')
(u'business', 'http://www.digitallook.com/news/commodities/commodities-oil-futures-recoup-previous-sessions-losses--1087119.html')
(u'business', 'http://news.sky.com/story/1664056/new-top-dog-at-pets-at-home-as-ceo-retires')
(u'business', 'http://money.aol.co.uk/2016/03/21/sky-tv-price-hike-shock/')
(u'business', 'http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=11609694')
(u'business', 'http://www.dailymail.co.uk/sciencetech/article-3502838/The-Flying-Bum-ready-lift-World-s-largest-aircraft-Airlander-10-fitted-fins-engines-ahead-flight.html')
(u'business', 'http://www.business-standard.com/article/pti-stories/world-s-longest-aircraft-revealed-in-new-pictures-116032000569_1.html')
(u'sports', 'http://www.telegraph.co.uk/football/2016/03/21/gary-neville-consulted-roy-hodgson-on-england-delay/')
(u'sports', 'http://www.dailymail.co.uk/sport/football/article-3502767/Gary-Neville-leaving-Valencia-join-England-gritted-teeth-feels-like-La-Liga-club-giving-fans-chant-manager-now.html')
(u'sports', 'http://www.irishexaminer.com/sport/soccer/gary-neville-in-firing-line-as-valencia-lose-again-388634.html')
(u'sports', 'http://timesofindia.indiatimes.com/sports/tennis/top-stories/Male-tennis-players-should-earn-more-than-females-Djokovic/articleshow/51499959.cms')
(u'sports', 'http://www.sport24.co.za/soccer/livescoring?mid=23948674&st=football')
(u'sports', 'http://www.dispatch.com/content/stories/sports/2016/03/21/0321-serena-williams-rips-indian-wells-ceo.html')
(u'sports', 'http://www.bbc.co.uk/sport/football/35864765')
(u'sports', 'http://indianexpress.com/article/sports/football/joachim-loew-throws-max-kruse-out-of-germany-squad/')
(u'sports', 'http://www.si.com/planet-futbol/2016/03/21/max-kruse-germany-kicked-jogi-low')
(u'sports', 'http://www.dw.com/en/coach-joachim-l%C3%B6w-drops-max-kruse-from-german-national-team/a-19132035')
(u'sports', 'http://www.bbc.co.uk/sport/football/35865092')
(u'sports', 'http://news.sky.com/story/1664218')
(u'sports', 'http://www.theguardian.com/business/2016/mar/21/sports-direct-founder-mike-ashley-snubs-call-mps-parliamentary-select-committee')
(u'sports', 'http://www.mirror.co.uk/news/business/sports-direct-boss-mike-ashley-7604067')
(u'sports', 'http://www.independent.ie/sport/soccer/mike-ashley-says-he-is-wedded-to-newcastle-even-if-they-go-down-34558617.html')
(u'sports', 'http://www.heraldscotland.com/sport/14373924.Michael_Carrick_praises_performance_after_United_win_Manchester_derby/')
(u'sports', 'http://www.dorsetecho.co.uk/sport/national/14373773.Michael_Carrick_hails_vital_Manchester_derby_victory/')
If we instead have get_section_links return a set, we can pass that to the functions that parse the text:
def get_section_links(sec_url):
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    return set(xml.xpath("//div[@class='section-stream-content']//a/@url"))
So, using lxml to parse with xpaths, for the few sites we parsed already we can add a bit more logic to catch the variations:
# map each site to its correct xpath to pull the main text
d = {"dailymail.": "//div[@itemprop='articleBody']//p",
     "telegraph.": "//div[@id='mainBodyArea']//p",
     "bbc.": "//div[@class='story-body']//p",
     "independent.": "//div[@class='text-wrapper']//p",
     "www.mirror.": "//*[@class='live-now-entry' or @class='lead-entry' or @itemprop='articleBody']//p"}

import logging

logger = logging.getLogger(__file__)
logging.basicConfig()
logger.setLevel(logging.DEBUG)

def parse_links_text(links, xpath_d):
    # use a regex to find out what site the link points to
    # so we can pull the appropriate xpath from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.|www.mirror.")
    for link in links:
        try:
            cont = requests.get(link).content
        except requests.exceptions.RequestException as e:
            logger.error(e.message)
            continue
        xml = fromstring(cont, HTMLParser())
        xpath = r.search(link)
        if xpath:
            p = "".join(filter(None, ("".join(p.xpath("normalize-space(.//text())"))
                                      for p in xml.xpath(xpath_d[xpath.group()]))))
            if p:
                yield p
        else:
            logger.debug("No match for {}".format(link))
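The "".join(filter(None, ...)) step drops the empty strings produced by paragraphs with no text before joining them; a tiny offline sketch with made-up strings:

```python
# paragraphs with no text come back as empty strings; filter(None, ...)
# removes them before the join
parts = ["First paragraph. ", "", "Second paragraph.", ""]
joined = "".join(filter(None, parts))
print(joined)  # 'First paragraph. Second paragraph.'
```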
Again, you will have to decide which sites you might hit and find the correct xpaths to pull the main article text, but this should get you well along the way. I will add some logic to run the requests asynchronously when I have more time.
