I am having a weird problem with my code.
from bs4 import BeautifulSoup
import requests

def get_text(url):
    # Fetch the raw HTML of the article
    p = requests.get(url).content
    soup = BeautifulSoup(p, "html.parser")
    # The article body lives in these paragraph tags
    paragraphs = soup.select("p.story-body-text.story-content")
    text = ""
    for paragraph in paragraphs:
        text += paragraph.text
    # Drop non-ASCII characters before returning
    text = text.encode('ascii', 'ignore')
    return str(text)
Basically, my code should fetch the HTML using requests and then use BS4 to find all the "p.story-body-text.story-content" elements, which contain the actual article content.
It works great on some articles such as:
http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0
and
http://www.nytimes.com/2014/04/13/world/asia/coalition-building-season-in-india.html?
However, it will not work on these links:
http://www.nytimes.com/2014/04/06/world/middleeast/break-in-syrian-war-brings-brittle-calm.html?_r=0#
and
http://www.nytimes.com/2014/02/23/magazine/instagram-travel-diary.html?nav
I think it is a problem with the requests library, because it does not fetch the correct HTML.
Any ideas?
Edit: pastebin link http://pastebin.com/n3svnKTQ
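One way to check whether requests is really at fault is to compare what it fetches by default against what it fetches when pretending to be a browser. This is only a minimal sketch, on the assumption that the server varies its response by User-Agent; the header string below is just an example:
import requests

url = 'http://www.nytimes.com/2014/04/06/world/middleeast/break-in-syrian-war-brings-brittle-calm.html?_r=0'

# Fetch once with requests' default headers, once with a browser-like User-Agent
plain = requests.get(url).content
browser = requests.get(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0'
}).content

# A large difference in length would mean the server returns different HTML
# to non-browser clients, which could explain the missing paragraphs
print(len(plain), len(browser))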
Related
I am trying to write some code to extract tweets from a public Twitter page (the Nike store) using the Python BS4 module. When I print the page HTML to the console, only some of the HTML is printed: when I search (Ctrl+F) the console output for the specific class values of a tag, it returns zero results. Why is this happening?
Here is a code snippet:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

if __name__ == '__main__':
    # Read the webpage into page_html and close the connection
    first_page = 'https://twitter.com/nikestore'
    url_client = urlopen(first_page)
    page_html = url_client.read()
    url_client.close()
    print(page_html)
I came across the accepted answer in the following link, which also suggests using selenium to circumvent the problem.
Problem while scraping twitter using beautiful soup
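A minimal selenium sketch along those lines (assuming selenium and chromedriver are installed, with chromedriver on your PATH) might look like this:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://twitter.com/nikestore')
time.sleep(5)  # give the Javascript time to render the tweets
page_html = driver.page_source
driver.quit()

soup = BeautifulSoup(page_html, 'html.parser')
print(soup.prettify()[:1000])
The page source taken from the driver should then contain the tags that were missing from the urlopen output.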
I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html

connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4

res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted using Javascript after the page is loaded into the browser.
When I attempt to parse a locally stored copy of a webpage, BeautifulSoup returns gibberish. I don't understand why, as I've never faced this problem when using the requests and bs4 modules together for scraping tasks.
Here's my code:
import requests
from bs4 import BeautifulSoup as BS
import os
url_2 = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(url_2)
f = open('re_2.html')
soup = BS(url_2, "lxml")
f.close()
print soup
this code returns the following :
<html><body><p>/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/</p></body></html>
I wasn't able to find a similar problem online, so I've posted it here. Any help would be much appreciated.
You are passing the path (which you named url_2) to BeautifulSoup, so it treats that string as the web page text and returns it, neatly wrapped in some minimal HTML. That part is working as designed.
Try constructing the BS object from the file's contents instead. See how it works here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
soup = BS(f)
should do...
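Putting it together, a minimal corrected version (assuming re_2.html is a valid HTML file in that directory) would be:
from bs4 import BeautifulSoup as BS
import os

os.chdir('/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/')

# Pass the file object, not the path string, to BeautifulSoup
with open('re_2.html') as f:
    soup = BS(f, "lxml")

print(soup.prettify())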
import requests
from lxml import html

page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)
for i in html_content.iterchildren():
    print i
news_stories = html_content.xpath('//h2[@data-analytics]/a/span/text()')
news_links = html_content.xpath('//h2[@data-analytics]/a/@href')
I am trying to run this code to understand how web scraping in Python works.
I want to scrape the top news stories and their links from CNN.
When I run this in the Python shell, the output I get for news_stories and news_links is:
[]
My question is: where am I going wrong with this, and is there a better way to achieve what I am trying to do?
In your code, html_content is returning only the page address and not the actual content of the page.
html_content = html.fromstring(page.content)
You can try printing the following to see the complete HTML code for that page:
import requests
from lxml import html
page = requests.get('http://www.cnn.com')
print page.text
Even if you do manage to get the content, you will get a gzipped response from the server. (Get html using Python requests?)
I would highly recommend you use the httplib2 library and BeautifulSoup to scrape news stories from CNN. They are really handy to use and get you what you want. You can see another Stack Overflow post here (retrieve links from web page using python and BeautifulSoup).
I hope that helps you.
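A rough sketch of that approach, assuming the h2[data-analytics] markup from the question is still what CNN serves:
import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http()
# h.request returns a (response headers, body bytes) pair
response, content = h.request('http://www.cnn.com')

soup = BeautifulSoup(content, 'html.parser')
# Every h2 carrying a data-analytics attribute should hold one story
for h2 in soup.find_all('h2', attrs={'data-analytics': True}):
    link = h2.find('a')
    if link and link.find('span'):
        print('{} -> {}'.format(link.find('span').get_text(), link.get('href')))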
I'm trying to create a basic scraper that will scrape the username and song title from a search on SoundCloud. By inspecting the element I needed (using Chrome), I found I needed to get the string associated with every 'span' tag with title="soundTitle__usernameText". Using BeautifulSoup, urllib2, and lxml, I have the following code for a search for 'robert delong':
from bs4 import BeautifulSoup
from urllib2 import urlopen
import requests

def search_results(url):
    html = urlopen(url).read()
    # html = requests.get(url).content  # I've tried this also
    soup = BeautifulSoup(html, "lxml")
    usernames = [span.string for span in soup.find_all("span", "soundTitle__usernameText")]
    return usernames

print search_results('http://soundcloud.com/search?q=robert%20delong')
This returns an empty list. However, when I save the complete webpage in Chrome (File > Save > Format: Webpage, Complete) and use that saved HTML file instead of the one obtained with urlopen, the code then prints
[u'Two Door Cinema Club', u'whatever-28', u'AWOLNATION', u'Two Door Cinema Club', u'Sean Glass', u'Capital Cities', u'Robert DeLong', u'RAC', u'JR JR']
which is the ideal outcome. To me, it appears that urlopen retrieves somewhat truncated HTML, which is why it returns an empty list.
Any thoughts on how I may be able to access the same HTML obtained by manually saving the webpage, but using Python/Terminal? Thank you.
You guessed right. The downloaded HTML does not contain all the data. Javascript is used to request information in JSON format, which is then inserted into the document.
By looking at the requests Chrome made (Ctrl+Shift+I, "Network" tab), I can see that it requested https://api-v2.soundcloud.com/search?q=robert%20delong. I believe the response to that has the information you need.
Actually, this is good for you. Reading JSON should be much more straightforward than parsing HTML ;)
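A rough sketch of that approach; the client_id parameter and the 'collection' key are assumptions based on what the browser sends, so you would need to copy a real client_id out of the Network tab:
import requests

# Endpoint observed in Chrome's Network tab for the same search
url = 'https://api-v2.soundcloud.com/search'
params = {
    'q': 'robert delong',
    'client_id': 'COPIED_FROM_NETWORK_TAB',  # assumed to be required by api-v2
}
resp = requests.get(url, params=params)
data = resp.json()

# Assumption: results live under a 'collection' key, one dict per hit
for item in data.get('collection', []):
    user = item.get('user') or {}
    print('{} - {}'.format(user.get('username'), item.get('title')))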
You can use wget in the terminal to download the HTML of the webpage together with its related links and images:
wget -p --convert-links http://www.website.com/directory/webpage.html
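Here -p tells wget to download all page requisites (images, stylesheets, scripts) needed to display the page, and --convert-links rewrites the links in the downloaded files so they work for local viewing.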