I'm trying to extract data from a site and then create a DataFrame out of it, but the program doesn't work properly. I'm new to web scraping. I hope someone can help me out and find the problem.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.imdb.com/chart/top/?sort=rk,asc&mode=simple&page=1'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
#print(soup)
film_in= soup.find('tbody').findAll('tr')
#print(film_in)
film = film_in[0]
#print(film)
titre = film.find("a",{'title':'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'})
print(titre.text)
rang = film.find("td",{'class':'ratingColumn imdbRating'}).find('strong').text
#print(rang)
def remove_parentheses(string):
    return string.replace("(", "").replace(")", "")
année = film.find("span",{'class':'secondaryInfo'}).text
#print(année)
imdb =[]
for films in film_in:
    titre = film.find("a", {'title': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'})
    rang = film.find("td", {'class': 'ratingColumn imdbRating'}).find('strong').text
    année = remove_parentheses(film.find("span", {'class': 'secondaryInfo'}).text)
    dictionnaire = {'film': film,
                    'rang': rang,
                    'année': année
                    }
    imdb.append(dictionnaire)
df_imdb = pd.DataFrame(imdb)
print(df_imdb)
I need to solve it using urllib; is there a way? Thanks in advance.
You can try the following example:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://www.imdb.com/chart/top/?sort=rk,asc&mode=simple&page=1'
# Alternatively, with requests: soup = BeautifulSoup(requests.get(url).text, 'html.parser')
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
imdb = []
film_in = soup.select('table[class="chart full-width"] tr')
for film in film_in[1:]:
    titre = film.select_one('.titleColumn a').get_text(strip=True)
    rang = film.select_one('[class="ratingColumn imdbRating"] > strong').text
    année = film.find("span", {'class': 'secondaryInfo'}).get_text(strip=True)
    dictionnaire = {'titre': titre,
                    'rang': rang,
                    'année': année
                    }
    imdb.append(dictionnaire)
df_imdb = pd.DataFrame(imdb)
print(df_imdb)
Output:
titre rang année
0 The Shawshank Redemption 9.2 (1994)
1 The Godfather 9.2 (1972)
2 The Dark Knight 9.0 (2008)
3 The Godfather Part II 9.0 (1974)
4 12 Angry Men 9.0 (1957)
.. ... ... ...
245 Dersu Uzala 8.0 (1975)
246 Aladdin 8.0 (1992)
247 The Help 8.0 (2011)
248 The Iron Giant 8.0 (1999)
249 Gandhi 8.0 (1982)
[250 rows x 3 columns]
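If you prefer the year without the parentheses (as with the remove_parentheses helper in your question), a small tweak inside the loop does it; a minimal sketch:
    année = film.find("span", {'class': 'secondaryInfo'}).get_text(strip=True).strip("()")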
I am new to Python and I am trying to scrape this website. What I am trying to do is get just the dates and article titles from it. I followed a procedure I found on SO, which is as follows:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)
movies = soup.select(".title a , .date")
print(movies)
movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)
I got .title a , .date using SelectorGadget on the URL I shared. However, print(movies) returns an empty list. What am I doing wrong?
Can anyone help me?
Thanks!
The content is not part of index.en.html; it is loaded in by JavaScript from
https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
As far as I know you can't select the pairs in one selector, so you need to select the titles and dates separately:
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
Then you can print them out like this:
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
Result:
['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
Full code:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
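If you also want the dates themselves (the question asked for both dates and titles), the second element of each pair holds them; a small addition in the same style:
titles_dates = [pair[1].text for pair in pairs]
print(titles_dates)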
I would recommend using Selenium with Python. Try something like this:
from selenium.webdriver import Chrome
from selenium.common.exceptions import NoSuchElementException

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
browser = Chrome()
browser.get(url)
interviews = browser.find_elements_by_class_name('title')
links = []
for interview in interviews:
    try:
        anchor = interview.find_element_by_tag_name('a')
        link = anchor.get_attribute('href')
        links.append(link)
    except NoSuchElementException:
        pass
links will then contain the links to all the interviews. You can do something similar for the dates, as sketched below.
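A minimal sketch for the dates, assuming they carry the date class that the SelectorGadget selector in the question pointed to, and keeping the same (older) Selenium API as above:
dates = [element.text for element in browser.find_elements_by_class_name('date')]
print(dates)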
I am practicing web scraping using the requests and BeautifulSoup modules on the following website:
https://www.imdb.com/title/tt0080684/
My code thus far properly outputs the JSON in question. I'd like help extracting only the name and description from the JSON into a response dictionary.
Code
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup

class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')

WebScraper.send_http_request()
Desired Output
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}
You can parse the dictionary and then print a new JSON object using the json.dumps method:
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup

class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')

WebScraper.send_http_request()
Output:
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}
You just need to create a new dictionary from p using the two keys, name and description.
# Check response object's status code
if r:
    p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
    desired_output = {"title": p["name"], "description": p["description"]}
    print(desired_output)
else:
    print('\nInvalid movie page!')
Output:
{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}
I have a piece of Python code which logs me into a website. I am trying to extract the data of a particular table, but I'm getting errors and I'm not sure how to resolve them even after searching online.
Here is my code written in my f.py file:
import mechanize
from bs4 import BeautifulSoup
import cookielib
import requests
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("http://kingmedia.tv/home")
br.select_form(nr=0)
br.form['vb_login_username'] = 'abcde'
br.form['vb_login_password'] = '12345'
br.submit()
a = br.response().read()
url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
print (url)
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for table in soup.select('table#tborder tr')[1:]:
    cell = table.select_one('td').get_text(strip=True)
    print(cell)
print(url) gives me the HTML data of the page, part of which I have shown below, and from which I want to extract the table data. The table I am interested in is the one with class="tborder".
Update: 07/05/2021
Using soup = BeautifulSoup(content, 'lxml') as suggested by @Code-Apprentice, I am able to get the desired data. However, I am struggling to obtain it fully.
I need this table, and its source code from the link is the following:
<td class="alt1" width="100%"><div><font size="2"><div>
Live: EPL - Leicester v Newcastle (CH3): 05/07/21 to 05/07/21
</div><div>
Live: EPL - Liverpool v Southampton (CH3): 05/08/21 to 05/08/21
</div><div>
Live: UFC PreLims (CH2): 05/08/21 to 05/08/21
</div><div>
Live: UFC - Sandhagen v Dillashaw (CH2): 05/08/21 to 05/09/21
</div><div>
Live: EPL - Man City v Chelsea (CH3): 05/08/21 to 05/08/21
</div><div>
Live: La Liga - Barcelona v Atletico Madrid (CH6)(beIn): 05/08/21 to 05/08/21
</div><div>
Live: EPL - Leeds v Tottenham (CH3): 05/08/21 to 05/08/21
</div><div>
Live: F1 Qualifying (CH2): 05/08/21 to 05/08/21
</div><div>
Live: EPL - Sheff Utd v Crystal Palace (CH3): 05/08/21 to 05/08/21
</div></font><br>View More Detailed Calendar HERE</div></td>
url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
print (url)
soup = BeautifulSoup(requests.get(url).text, 'lxml')
This looks very suspicious. You are reading the content of one HTTP response and then using it as the URL for another request. Instead, just parse the content of the first response with Beautiful Soup:
content = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
soup = BeautifulSoup(content, 'lxml')
First, I renamed url to content to reflect what the variable actually represents. Second, I use content directly in the creation of the BeautifulSoup object.
Disclaimer: this still might not be exactly correct, but it should get you headed in the right direction.
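From there, a minimal sketch for pulling the rows out of the table you want; note that tborder is a class, not an id, so the CSS selector needs a dot rather than a hash (this assumes the table carries the tborder class as described in the question):
for row in soup.select('table.tborder tr')[1:]:
    cell = row.select_one('td')
    if cell:
        print(cell.get_text(strip=True))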
I'm new to BeautifulSoup in Python and I'm trying to extract certain information from a website; specifically, the URL and the title.
I used BeautifulSoup to extract the JSON, which I did successfully, but I'm unsure about the next steps and how to get the URL and title from it.
I have not managed to extract the desired information yet. I hope you guys can help me out.
This is my logic so far:
import json
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import urllib.request
session = requests.Session()
session.cookies.get_dict()
url = 'http://www.citydis.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]
jsonUrl = "https://www.citydis.com/s/results.json?&q=London& customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
print(response.content)
And that is the output:
b'{"searchResults":{"customer":null,"signupUrl":"\\/signup\\/?pos=activityCard","isMobile":false,"tours":[{"tourId":5459,"title":"Ticket f\\u00fcr Coca-Cola London Eye 4D-Erlebnis","url":"https:\\/\\/www.getyourguide.de\\/london-l57\\/ohne-anstehen-edf-london-eye-4d-erlebnis-t5459\\/","price":{"original":"27,10\\u00a0\\u20ac","min":"27,10\\u00a0\\u20ac","type":"individual"},"horizontalImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-70.jpg","horizontalAlternativeImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-85.jpg","verticalImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-92.jpg","mobileImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-53.jpg","horizontalSlimImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-67.jpg","highlightedDetailedImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-91.jpg","smallDescription":"Sehen Sie London aus einer anderen Perspektive vom London Eye aus und genie\\u00dfen Sie beim neuen 4D-Erlebnis einen bahnbrechenden 3D-Film mit\\u2026","description":"Sehen Sie London aus einer anderen Perspektive vom London Eye aus und genie\\u00dfen Sie beim neuen 4D-Erlebnis einen bahnbrechenden 3D-Film mit spektakul\\u00e4ren Spezialeffekten, einschlie\\u00dflich Wind und Nebel. Genie\\u00dfen Sie au\\u00dferdem bevorzugten Einlass am Eingang.","isBestseller":false,"isFeatured":false,"languageIds":[],"hasDeal":false,"dealMaxPercentage":0,"isBoostedNewTour":false,"hasBanner":false,"hasRibbon":false,"priceTag":true,"detailsLink":false,"isCertifiedPartner":true,"hasFencedDiscountDeal":false,"hasFreeCancellation":false,"hasRating":true,"averageRating":"4,5","totalRating":1633,"totalRatingTitle":"1633 Bewertungen","averageRatingClass":"45","ratingLink":"","ratingStyleModifier":"","ratingStarsClasses":"","ratingTitle":"Bewertung: 4,5 von 5","hasDuration":true,"duration":"40 Minuten","displayAbstract":true,"displayDuration":true,"displayDate":false,"displayWishlist":false,"displayRemoveButton":false,"hasDiscountedRecommendation":false,"hideImage":false,"isSkipTheLine":false,"likelyToSellOutBadge":true,"isPromoted":false,"isSpecialOffer":false,"experiments":{"hasRatingsExperiment":false,"numericRatingLabel":"Basierend auf 1633 Bewertungen","verticalImageForPriceSegmentation":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-412120-150.jpg"},"id":"searchResults","activityCardVersion":"horizontal","limit":false,"likelyToSellOutExperiment":{"deviceDetector":{}},"hasNumericReviews":true,"resultSetPosition":0,"activityCardStyle":"plain","highlightedOrientation":"horizontal"},{"tourId":51268,"title":"Bustransfer: Flughafen Stansted - Stadtzentrum London","url":"https:\\/\\/www.getyourguide.de\\/london-l57\\/bustransfer-flughafen-stansted-stadtzentrum-london-t51268\\/","price":{"original":"9,43\\u00a0\\u20ac","min":"9,43\\u00a0\\u20ac","type":"individual"},"horizontalImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-70.jpg","horizontalAlternativeImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-85.jpg","verticalImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-92.jpg","mobileImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-53.jpg","horizontalSlimImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-67.jpg","highlightedDetailedImageUrl":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-91.jpg","smallDescription":"Beginnen oder beenden Sie Ihren Aufenthalt in London mit dem praktischen 
Bustransfer zwischen dem Flughafen Stansted und dem Stadtzentrum London.\\u2026","description":"Beginnen oder beenden Sie Ihren Aufenthalt in London mit dem praktischen Bustransfer zwischen dem Flughafen Stansted und dem Stadtzentrum London. Sparen Sie sich die Fahrt mit \\u00f6ffentlichen Verkehrsmitteln und erreichen Sie London schnell und bequem.","isBestseller":false,"isFeatured":false,"languageIds":[],"hasDeal":false,"dealMaxPercentage":0,"isBoostedNewTour":false,"hasBanner":false,"hasRibbon":false,"priceTag":true,"detailsLink":false,"isCertifiedPartner":false,"hasFencedDiscountDeal":false,"hasFreeCancellation":true,"hasRating":true,"averageRating":"4,4","totalRating":541,"totalRatingTitle":"541 Bewertungen","averageRatingClass":"45","ratingLink":"","ratingStyleModifier":"","ratingStarsClasses":"","ratingTitle":"Bewertung: 4,4 von 5","hasDuration":true,"duration":"60 Minuten \\u2013 90 Minuten","displayAbstract":true,"displayDuration":true,"displayDate":false,"displayWishlist":false,"displayRemoveButton":false,"hasDiscountedRecommendation":false,"hideImage":false,"isSkipTheLine":false,"likelyToSellOutBadge":true,"isPromoted":false,"isSpecialOffer":false,"experiments":{"hasRatingsExperiment":false,"numericRatingLabel":"Basierend auf 541 Bewertungen","verticalImageForPriceSegmentation":"https:\\/\\/cdn.getyourguide.com\\/img\\/tour_img-451822-150.jpg"}
What I would like to get out is the title and url only. For example:
title":"Ticket f\\u00fcr Coca-Cola London Eye 4D-Erlebnis","url":"https:\\/\\/www.getyourguide.de\\/london-l57\\/ohne-anstehen-edf-london-eye-4d-erlebnis-t5459
Any feedback much appreciated
UPDATE
Thanks to the feedback I was able to solve the problem.
I'm now able to get the desired result, but now I have the issue that I'm only getting one result back instead of all of them:
js_dict = (json.loads(response.content.decode('utf-8')))
url = (js_dict['searchResults']["tours"][0]["url"])
print(url)
title = (js_dict['searchResults']["tours"][0]["title"])
print(title)
price = (js_dict['searchResults']["tours"][0]["price"]["original"])
print(price)
The output is the following:
https://www.citydis.de/london-l57/ohne-anstehen-edf-london-eye-4d-erlebnis-t5459/
Ticket für Coca-Cola London Eye 4D-Erlebnis
27,10 €
I would like to get back all the titles, prices and URLs of the sights that are in the JSON. I tried a for loop but somehow it does not work.
Any feedback appreciated
UPDATE 2
Found a solution:
jsonUrl = "https://www.citydis.com/s/results.json?&q=London& customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = (json.loads(response.content.decode('utf-8')))
for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]
    urls = js_dict['searchResults']["tours"]
    for title, price, url in zip(headers, prices, urls):
        title_final = title.get("title")
        url_final = url.get("url")
        price_final = price.get("price")["original"]
        print("Header: " + title_final + " | " + "Deeplink: " + url_final + " | " + "Price: " + price_final)
The body in response.content is indeed the JSON output. You could import the json module and parse it with a statement like
js_dict = json.loads(response.content)
This will parse the JSON and produce a Python dictionary in js_dict. You can then use standard dictionary subscripting techniques to access and display the fields of interest.
Because this is such a common requirement, the response object has a json method that will do this decoding for you. You could, therefore, simply write
js_dict = response.json()
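Putting the two steps together, a minimal sketch, assuming the searchResults structure shown in your output:
js_dict = response.json()
for tour in js_dict['searchResults']['tours']:
    print(tour['title'], tour['url'], tour['price']['original'])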