I am trying to scrape company info from http://fortune.com/fortune500 for my thesis. When I downloaded the page text from the link, it contained no links to parse. However, opening the link in Chrome automatically redirects to the page of the #1 company.
Could someone kindly explain what is happening and how I can trace the links to the company pages from the original URL?
First you need to get the postid, then make a request to /data/franchise-list, then get the url from the first article:
import json
import re
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup
data = urlopen('http://fortune.com/fortune500/')
soup = BeautifulSoup(data)
postid = next(attr for attr in soup.body['class'] if attr.startswith('postid'))
postid = re.match(r'postid-(\d+)', postid).group(1)
url = "http://fortune.com/data/franchise-list/{postid}/1/".format(postid=postid)
data = json.load(urlopen(url))
resulting_url = urljoin(url, data['articles'][0]['url'])
print resulting_url
Prints:
http://fortune.com/fortune500/wal-mart-stores-inc-1/
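For Python 3, where urllib2 and urlparse no longer exist, the same three steps can be sketched with the standard library alone. This is only a sketch under assumptions: the helper names are hypothetical, the postid is pulled from the raw HTML with a regular expression instead of through BeautifulSoup, and it presumes the site still exposes the postid class and the /data/franchise-list endpoint described above, which may no longer hold.

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://fortune.com/fortune500/"


def extract_postid(html):
    """Return the digits of the first 'postid-<n>' token found in the HTML."""
    match = re.search(r"postid-(\d+)", html)
    if match is None:
        raise ValueError("no postid class found in the page")
    return match.group(1)


def franchise_list_url(postid, page=1):
    """Build the JSON endpoint URL used in the answer above."""
    return urljoin(BASE_URL, "/data/franchise-list/{}/{}/".format(postid, page))


def first_company_url(list_url, payload):
    """Resolve the URL of the first article in the decoded JSON payload."""
    return urljoin(list_url, payload["articles"][0]["url"])


def find_first_company(fetch=lambda u: urlopen(u).read().decode("utf-8")):
    """Run the three steps; `fetch` is injectable so the logic can be tried offline."""
    postid = extract_postid(fetch(BASE_URL))
    list_url = franchise_list_url(postid)
    return first_company_url(list_url, json.loads(fetch(list_url)))
```

Calling find_first_company() with no arguments performs the two real HTTP requests; passing a different fetch callable lets you replay saved responses instead.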
I want to get information from booking.com (such as hotel names and prices), but I cannot find this information when I access the website through Python using BeautifulSoup.
This is what I did:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
url="https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaGKIAQGYAQm4AQfIAQzYAQPoAQGIAgGoAgO4AtDrhJMGwAIB0gIkZjQzNmY0MTQtMjY3OS00NGE0LTkwOWEtNGQ3YzQ0OTY1Mjc42AIE4AIB&lang=en-gb&sid=b9d75b447deb2624c8cfaadad9969120&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.en-gb.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaGKIAQGYAQm4AQfIAQzYAQPoAQGIAgGoAgO4AtDrhJMGwAIB0gIkZjQzNmY0MTQtMjY3OS00NGE0LTkwOWEtNGQ3YzQ0OTY1Mjc42AIE4AIB%3Bsid%3Db9d75b447deb2624c8cfaadad9969120%3Bsb_price_type%3Dtotal%26%3B&ss=Hong+Kong&is_ski_area=0&ssne=Hong+Kong&ssne_untouched=Hong+Kong&dest_id=-1353149&dest_type=city&checkin_year=2022&checkin_month=4&checkin_monthday=25&checkout_year=2022&checkout_month=4&checkout_monthday=30&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content,'html.parser')
print(soup)
After I print soup, I can only see information such as scores, but I cannot find anything about the hotel names when I use find(). Can you tell me what I did wrong and how to do it right? Thank you so much!
You simply need to inspect the HTML that is returned in the soup. For example, if you inspect a hotel heading in the browser, you will notice that the top 10 hotel results are shown in a tag with a card class.
You can then use findAll to fetch all the info; see the following modified version of your code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
url="https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaGKIAQGYAQm4AQfIAQzYAQPoAQGIAgGoAgO4AtDrhJMGwAIB0gIkZjQzNmY0MTQtMjY3OS00NGE0LTkwOWEtNGQ3YzQ0OTY1Mjc42AIE4AIB&lang=en-gb&sid=b9d75b447deb2624c8cfaadad9969120&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.en-gb.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaGKIAQGYAQm4AQfIAQzYAQPoAQGIAgGoAgO4AtDrhJMGwAIB0gIkZjQzNmY0MTQtMjY3OS00NGE0LTkwOWEtNGQ3YzQ0OTY1Mjc42AIE4AIB%3Bsid%3Db9d75b447deb2624c8cfaadad9969120%3Bsb_price_type%3Dtotal%26%3B&ss=Hong+Kong&is_ski_area=0&ssne=Hong+Kong&ssne_untouched=Hong+Kong&dest_id=-1353149&dest_type=city&checkin_year=2022&checkin_month=4&checkin_monthday=25&checkout_year=2022&checkout_month=4&checkout_monthday=30&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content, 'html.parser')
# Filter all span elements with class bui-card__title and itemprop="name"
hotels = soup.findAll("span", {"class": "bui-card__title", "itemprop": "name"})
for hotel in hotels:
    print(hotel.decode_contents().strip())
Output is following
I am trying to scrape a stock company's website to get the names and quantities of the stocks that a user holds.
Below is my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
url = "https://www.mynamuh.com/tx/login/login.action"
data = {'userid':'ryulstor', "passwd_e2e_1_pwd__":"liams09", "ca_gb" :"Y"}
with requests.Session() as s:
    login_page = s.get(url)
    html = login_page.text
    soup = BeautifulSoup(html, 'lxml')
    login_req = s.post(url, data=data)
    print(login_req.status_code)
Whether the user ID and password are correct or not, status_code returns 200,
so I cannot tell whether the login succeeded. Could you help me?
bs4 has a find function. If you know what HTML appears when you enter your ID correctly, you can check whether bs4 can find that HTML; if it cannot, you know the login failed. I recommend looking at Beautiful Soup's documentation or watching sentdex's videos to learn more about this: https://www.youtube.com/watch?v=aIPqt-OdmS0
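As a rough sketch of that idea: a 200 status only says the server returned a page, so the body has to be inspected for an element that appears only for logged-in users. The helper name and the selector a#logout below are hypothetical placeholders; the real marker has to be read from the site's own post-login HTML.

```python
from bs4 import BeautifulSoup


def login_succeeded(html, marker_selector="a#logout"):
    """Return True if the page contains an element matching the marker selector.

    `marker_selector` is a placeholder: pick an element that exists only
    after a successful login (a logout link, an account menu, ...).
    """
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(marker_selector) is not None
```

With the session code from the question, this would be called as login_succeeded(login_req.text) right after the POST.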
I would like to download all financial reports for a given company from the Danish company register (CVR register). An example is Chr. Hansen Holding at the link below:
https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da
Specifically, I would like to download all the PDFs under the tab "Regnskaber" (financial reports). I have no previous experience with web scraping in Python. I tried using BeautifulSoup, but given my lack of experience, I cannot find the correct way to search the response.
Below is what I tried, but nothing is printed (i.e. it did not find any PDFs).
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')
for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])
All help and guidance will be much appreciated.
You should use select instead of findAll:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')
# Restrict the search to PDF links inside the financial-reports accordion
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
for link in pdfs:
    print(link['href'].split('/')[-1])
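One thing worth noting about the original attempt is that findAll('accordion-toggle') searches for a tag named accordion-toggle, not for a class or an id, so it matches nothing. The difference can be seen offline on a small piece of made-up HTML (the markup below is an assumption for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page differs.
SAMPLE_HTML = """
<div id="accordion-Regnskaber-og-nogletal" class="accordion-toggle">
  <a data-type="PDF" href="/docs/report-2019">2019</a>
  <a data-type="XML" href="/docs/report-2019.xml">2019 (XML)</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Looks for <accordion-toggle> elements, so nothing matches.
by_tag_name = soup.find_all("accordion-toggle")

# A CSS selector can combine an id, a descendant step and an attribute filter.
pdf_links = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
pdf_hrefs = [a["href"] for a in pdf_links]
```

In this sample the PDF link's href does not end in .pdf, which is also why a selector such as a[href$='.pdf'] can come back empty while a[data-type="PDF"] does not.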
I am trying to scrape reviews of IMDb movies using Python 3.6. However, when I print my 'review', only one review appears, and I am not sure why the rest do not. This does not happen for my 'review_title'. Any advice or help is greatly appreciated, as I have been searching forums and googling to no avail.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url,"html.parser")
print(soup.prettify())
review_title = soup.find("div",attrs={"class":"lister"}).findAll("div",{"class":"title"})
review = soup.find("div",attrs={"class":"text"})
review = soup.find("div",attrs={"class":"text"}).findAll("div",{"class":"text"})
rating = soup.find("span",attrs={"class":"rating-other-user-rating"}).findAll("span")
How can you reach all the content of that page without a loop? Your script does exactly what it was written to do, which is to parse the content of a single review. Try the approach below instead. It fetches all the visible data.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url,"html.parser")
for item in soup.find_all(class_="review-container"):
    review_title = item.find(class_="title").text
    review = item.find(class_="text").text
    try:
        rating = item.find(class_="point-scale").previous_sibling.text
    except AttributeError:
        # Some reviews carry no rating, so find() returns None
        rating = ""
    print("Title: {}\nReview: {}\nRating: {}\n".format(review_title, review, rating))
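The underlying point is that find() returns only the first match, while find_all() returns every match. A minimal offline sketch on made-up HTML (the sample markup is an assumption, loosely modelled on the class names used above):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = """
<div class="review-container"><a class="title">First</a><div class="text">Good</div></div>
<div class="review-container"><a class="title">Second</a><div class="text">Better</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match and returns a single Tag (or None).
first_only = soup.find("div", class_="text").get_text()

# find_all() returns every match as a list-like ResultSet.
all_texts = [div.get_text() for div in soup.find_all("div", class_="text")]
```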
I am creating a web crawler and I want to scrape the user reviews from IMDb. It is easy to get the first 10 reviews and ratings directly from the original page, for example http://www.imdb.com/title/tt1392170/reviews. The problem is that to get all the reviews, I need to press the "load more" button so that more reviews are shown, while the URL does not change. So I don't know how to get all the reviews in Python 3. What I use now are requests and bs4.
My code now:
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup
url_link='http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
html=urlopen(url_link)
content_bs=BeautifulSoup(html)
for b in content_bs.find_all('div', class_='text'):
    print(b)
for rate_score in content_bs.find_all('span', class_='rating-other-user-rating'):
    print(rate_score)
You can't press the "load more" button without triggering a click event, and BeautifulSoup cannot do that. What you can do to get the full content is something like what I've demonstrated below. It fetches all the review titles along with the reviews:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
# Extract the link that the "load more" button requests in the background
main_content = urljoin(url, soup.select(".load-more-data")[0]['data-ajaxurl'])
response = requests.get(main_content)
broth = BeautifulSoup(response.text, "lxml")
for item in broth.select(".review-container"):
    title = item.select(".title")[0].text
    review = item.select(".text")[0].text
    print("Title: {}\n\nReview: {}\n\n".format(title, review))
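If a single AJAX response does not contain every review, the same idea can be repeated page by page. The sketch below is built on assumptions that are not confirmed by the answer above: that each response carries a .load-more-data element with a data-key attribute, and that the next batch is served at the data-ajaxurl address with a paginationKey query parameter. The function name and the injectable fetch argument are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_reviews(start_url, fetch, max_pages=50):
    """Yield (title, text) pairs, following the load-more link page by page.

    `fetch` is any callable mapping a URL to HTML text, for example
    `lambda u: requests.get(u).text`.
    """
    url = start_url
    ajax_url = None
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch(url), "html.parser")
        for item in soup.select(".review-container"):
            title = item.select_one(".title")
            text = item.select_one(".text")
            if title is not None and text is not None:
                yield title.get_text(strip=True), text.get_text(strip=True)
        more = soup.select_one(".load-more-data")
        if more is None or not more.get("data-key"):
            break  # no further pages advertised
        # Assumption: the AJAX endpoint may be given only on the first page,
        # while later fragments carry just a fresh data-key.
        ajax_url = more.get("data-ajaxurl") or ajax_url
        if ajax_url is None:
            break
        url = urljoin(start_url, ajax_url) + "?paginationKey=" + more["data-key"]
```

The max_pages cap is there so that a markup change cannot turn the loop into an endless crawl; a short pause between requests would also be polite to the server.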