Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to get past the JavaScript redirects. I can use requests & BeautifulSoup to scrape the page it redirects to for 'href' values, but without circumventing the JS I can't pull anything from the news article that actually has the links.
import requests
from bs4 import BeautifulSoup

url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Collect the href of every anchor on the (redirected) page
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))
    print(link.get('href'))
I've added allow_redirects=False to every request to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or the type of file (.jpg)? I'm open to suggestions here, I really have no idea what direction to come at this thing from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?
Related
When I try to parse https://www.forbes.com/ for learning purposes and run the code, it only parses one page, I mean, the home page.
How can I parse an entire website, I mean, all the pages of a site?
My attempted code is given below:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link on the page
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# Export to a csv file
import pandas as pd
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you please tell me how I can parse entire websites, not just one page?
You have a couple of alternatives, it depends what you want to achieve.
Write your own crawler
Similar to what you are trying to do in your code snippet: fetch a page from the website, identify all the interesting links on that page (using xpath, regular expressions, ...), and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, like not visiting the same link twice and limiting the crawl to your domain(s) so you don't wander off to other websites, etc.
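A minimal sketch of that idea, using requests and BeautifulSoup (the crawl helper and the max_pages cap are just for illustration):

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    # Breadth-first crawl that stays on one domain and skips visited pages
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # Only queue links that stay on the same domain
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)

    return visited

print(crawl("http://www.bdjobs.com/"))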
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or some large scale scraping, consider using a framework such as scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.
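For a sense of what that looks like, here is a minimal sketch of a scrapy spider that records and follows every link; the spider name, allowed_domains and output field below are just placeholders:

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    allowed_domains = ["bdjobs.com"]         # keeps the crawl on one domain
    start_urls = ["http://www.bdjobs.com/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}             # record the link
            yield response.follow(href, callback=self.parse)  # and crawl it

Save it as, say, site_spider.py and run scrapy runspider site_spider.py -o links.csv; scrapy de-duplicates requests and keeps the crawl inside allowed_domains for you.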
I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with GETting web pages - you fetch the page yourself, then feed it to bs4 for processing.
The problem with the page you linked is that it's rendered by javascript - it only comes out right in a browser (or any other javascript VM).
@Fabricator is on the right track - you'll need to watch the developer console and see what ajax requests the js sends to the server. In this case, also take a look at the query string params, which include one called CurrentPage - that's probably the one you want to focus on.
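A rough sketch of that idea with requests: the CurrentPage name comes from the observation above, but the exact ajax URL and any other required parameters have to be copied from the Network tab, so treat everything else here as an assumption:

import requests
from bs4 import BeautifulSoup

base_url = "https://jobs.utcaerospacesystems.com/search-jobs"

for page in range(1, 6):
    # Assumption: the server pages its results via the CurrentPage query parameter
    resp = requests.get(base_url, params={"CurrentPage": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.find_all("a", href=True):
        print(link["href"])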
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with class 'product-list-item'.
Try using selenium to run the javascript:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.python.org")
# page_source holds the DOM after the javascript has run
html = driver.page_source
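From there you can hand the rendered source to BeautifulSoup exactly as before. For example, assuming driver.get was pointed at the H&M page instead, a sketch using the class name from the question:

from bs4 import BeautifulSoup

# driver.get("http://www.hm.com/sg/products/ladies") has already run
soup = BeautifulSoup(driver.page_source, 'html.parser')
divs = soup.find_all('div', class_='product-list-item')
print(len(divs))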
Check the ajax link the page uses - you can find it in Chrome dev tools > Network, and you can get all the info just by changing that url.
The reason you got nothing from that specific url is simply that the info you need is not there.
So first let me explain a little bit about how that page is loaded in a browser: when you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). The browser then starts to read/parse that content, which basically tells it where to find all the other information it needs to render the whole page (e.g. CSS to control layout, additional javascript/urls/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, so the info you want is not at the original url - you need to find out which url is used to populate those areas and go after that specific url instead.
So now we need to find out what happens behind the scenes, and we need a tool to capture all the traffic when that page loads (I would recommend fiddler).
As you can see, lots of things happen when you open that page in a browser (and that's only part of the whole page-loading process)! So by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup - the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job, you can get it here.
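As a purely illustrative sketch of that last step (the api.hm.com URL below is a placeholder - the real one, with its query parameters, has to be copied from the captured traffic):

import requests

# Placeholder URL: paste in the actual api.hm.com request captured with fiddler
api_url = "https://api.hm.com/..."
resp = requests.get(api_url)
data = resp.json()   # the response is already JSON, so no BeautifulSoup is needed
print(data)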
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')

# Write each matching div out to a text file
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying to do just IBM, but I hope to eventually be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, os, bs4, string

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'

page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")

summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing at the moment is that, unlike with a simpler webpage I have scraped before, the program can't seem to locate any tables even though I can see them in the HTML for the page.
In researching this problem the closest stackoverflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with to somehow generate a different weblink which managed to scrape the data, but I don't understand where it came from.
It's a real problem scraping some modern web pages, particularly on pages generated by single-page applications (where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
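A minimal sketch of that approach for the page in question, assuming Firefox and geckodriver are installed (the r_bodywrap selector is taken from the question's own code):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4

driver = webdriver.Firefox()
driver.get("http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US")

# Wait until the dynamically loaded table actually exists in the DOM
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.r_bodywrap table"))
)

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all("table") if summary else []
print(len(tables))

driver.quit()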
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should the search turn up nothing, a usually fruitful approach is to investigate what ajax calls the page makes to retrieve its data and then issue them directly. You can do this with the browser debugger's "Network" tab (or similar), where each request can be inspected in detail in a very friendly UI.
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.
I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to come up with a few web scrapers before, but it's always through a lot of resistance and is quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab that link after data-vr-contentbox and append it to the end of www.cnn.com and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various forms to grab them. My current iteration is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")

for link in soup.findAll("article"):
    test = link.get("data-vr-contentbox")
    print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. Whether I could simply grab the links and use those to get the headline from the respective page, or simply grab the headlines from here... I seem unable to do either. It would be nice to be able to view everything in a single topic at once, but finding out how to narrow this down isn't proving very simple.
An example article is "IOS 9's Wi-Fi Assist feature costly", which can be found within <div> tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate those tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within <div> tags, when I try this:
for link in soup.findAll("div"):
    print("")
    print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to do some scrapers before.
My guess is that the cnn.com website has a bunch of javascript which renders a lot of the content after BeautifulSoup reads it. I opened cnn.com and looked at the source in Safari, and there were 197 instances of data-vr-contentbox. However, when I ran it through BeautifulSoup and dumped it out, there were only 13 instances of data-vr-contentbox.
There are a bunch of posts out there about handling it. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
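Alternatively, a quick selenium-based sketch (assuming a webdriver such as geckodriver is installed) that lets a real browser render the page first and then counts the attribute - it should turn up far more articles than the raw HTML does:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://www.cnn.com/")

# Parse the rendered DOM rather than the raw HTML response
soup = BeautifulSoup(driver.page_source, "html.parser")
articles = soup.find_all("article", attrs={"data-vr-contentbox": True})
print(len(articles))
for a in articles:
    print(a["data-vr-contentbox"])

driver.quit()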