I am trying to scrape this website:
https://media.info/newspapers/titles
It lists newspapers from A to Z. I first have to scrape all the newspaper URLs and then scrape some more information from each newspaper's page.
Below is my code to scrape the URLs of all the newspapers from A to Z:
driver.get('https://media.info/newspapers/titles')
time.sleep(2)

page_title = []
pages = driver.find_elements(By.XPATH,"//div[@class='pages']//a")
for i in pages:
    page_title.append(i.get_attribute("href"))

names = []
for i in page_title:
    driver.get(i)
    time.sleep(1)
    name = driver.find_elements(By.XPATH,"//div[@class='info thumbBlock']//a")
    for i in name:
        names.append(i.get_attribute("href"))
len(names) :-> 1688
names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
'https://media.info/newspapers/titles/abergavenny-free-press',
'https://media.info/newspapers/titles/abergavenny-gazette-diary',
'https://media.info/newspapers/titles/the-abingdon-herald',
'https://media.info/newspapers/titles/academies-week']
Moving further, I need to scrape some information such as owner, postal address, email, etc., so I wrote the code below.
test = []
c = 0
for i in names:
    driver.get(i)
    time.sleep(2)
    r = requests.get(i)
    soup = BeautifulSoup(r.content,'lxml')
    try:
        name = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/div[3]/h1").text
        try:
            twitter = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a").text
        except:
            twitter = None
        try:
            twitter_followers = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small").text.replace(' followers','').lstrip('(').rstrip(')')
        except:
            twitter_followers = None
        people = []
        try:
            persons = driver.find_elements(By.XPATH,"//div[@class='columns']")
            for i in persons:
                people.append(i.text)
        except:
            people.append(None)
        try:
            owner = soup.select_one('th:contains("Owner") + td').text
        except:
            owner = None
        try:
            postal_address = soup.select_one('th:contains("Postal address") + td').text
        except:
            postal_address = None
        try:
            Telephone = soup.select_one('th:contains("Telephone") + td').text
        except:
            Telephone = None
        try:
            company_website = soup.select_one('th:contains("Official website") + td > a').get('href')
        except:
            company_website = None
        try:
            main_email = soup.select_one('th:contains("Main email") + td').text
        except:
            main_email = None
        try:
            personal_email = soup.select_one('th:contains("Personal email") + td').text
        except:
            personal_email = None
        r2 = requests.get(company_website)
        soup2 = BeautifulSoup(r2.content,'lxml')
        try:
            is_wordpress = soup2.find("meta",{"name":"generator"}).get('content')
        except:
            is_wordpress = None
        news_Data = {
            "Name": name,
            "Owner": owner,
            "Postal Address": postal_address,
            "main Email":main_email,
            "Telephone": Telephone,
            "Personal Email": personal_email,
            "Company Wesbite": company_website,
            "Twitter_Handle": twitter,
            "Twitter_Followers": twitter_followers,
            "People":people,
            "Is Wordpress?":is_wordpress
        }
        test.append(news_Data)
        c=c+1
        print("completed",c)
    except Exception as Argument:
        print(f"There is an exception with {i}")
        pass
I am using both Selenium and BeautifulSoup (with requests) to scrape the data. The code fulfils the requirements.
Firstly, is it good practice to use Selenium and BeautifulSoup together in the same code like this?
Secondly, the code takes too long to run. Is there an alternative way to reduce the runtime?
BeautifulSoup is not slow: making requests and waiting for responses is slow.
You do not necessarily need a Selenium/chromedriver setup for this task; it is doable with requests (or another Python HTTP library).
Yes, there are ways to speed it up. However, keep in mind that you are making requests to a server, which might become overwhelmed if you send too many requests at once, or might block you.
Here is an example without Selenium that accomplishes what you're after:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

s = requests.Session()
s.headers.update(headers)

r = s.get('https://media.info/newspapers/titles')
soup = bs(r.text)

letter_links = [x.get('href') for x in soup.select_one('div.pages').select('a')]
newspaper_links = []
for x in tqdm(letter_links):
    soup = bs(s.get(x).text)
    ns_links = soup.select_one('div.columns').select('a')
    for n in ns_links:
        newspaper_links.append((n.get_text(strip=True), 'https://media.info/' + n.get('href')))

detailed_infos = []
for x in tqdm(newspaper_links[:50]):
    soup = bs(s.get(x[1]).text)
    owner = soup.select_one('th:contains("Owner")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Owner")') else None
    website = soup.select_one('th:contains("Official website")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Official website")') else None
    detailed_infos.append((x[0], x[1], owner, website))

df = pd.DataFrame(detailed_infos, columns = ['Newspaper', 'Info Url', 'Owner', 'Official website'])
print(df)
Result in terminal:
Newspaper Info Url Owner Official website
0 Abergavenny Chronicle https://media.info//newspapers/titles/abergavenny-chronicle Tindle Newspapers abergavenny-chronicle-today.co.uk
1 Abergavenny Free Press https://media.info//newspapers/titles/abergavenny-free-press Newsquest Media Group freepressseries.co.uk
2 Abergavenny Gazette & Diary https://media.info//newspapers/titles/abergavenny-gazette-diary Tindle Newspapers abergavenny-chronicle-today.co.uk/tn/index.cfm
3 The Abingdon Herald https://media.info//newspapers/titles/the-abingdon-herald Newsquest Media Group abingdonherald.co.uk
4 Academies Week https://media.info//newspapers/titles/academies-week None academiesweek.co.uk
5 Accrington Observer https://media.info//newspapers/titles/accrington-observer Reach plc accringtonobserver.co.uk
6 Addlestone and Byfleet Review https://media.info//newspapers/titles/addlestone-and-byfleet-review Reach plc woking.co.uk
7 Admart & North Devon Diary https://media.info//newspapers/titles/admart-north-devon-diary Tindle Newspapers admart.me.uk
8 AdNews Willenhall, Wednesbury and Darlaston https://media.info//newspapers/titles/adnews-willenhall-wednesbury-and-darlaston Reach plc reachplc.com
9 The Advertiser https://media.info//newspapers/titles/the-advertiser DMGT dmgt.co.uk
10 Aintree and Maghull Champion https://media.info//newspapers/titles/aintree-and-maghull-champion Champion Media group champnews.com
11 Airdrie & Coatbridge World https://media.info//newspapers/titles/airdrie-coatbridge-world Reach plc icLanarkshire.co.uk
12 Airdrie and Coatbridge Advertiser https://media.info//newspapers/titles/airdrie-and-coatbridge-advertiser Reach plc acadvertiser.co.uk
13 Aire Valley Target https://media.info//newspapers/titles/aire-valley-target Newsquest Media Group thisisbradford.co.uk
14 Alcester Chronicle https://media.info//newspapers/titles/alcester-chronicle Newsquest Media Group redditchadvertiser.co.uk/news/alcester
15 Alcester Standard https://media.info//newspapers/titles/alcester-standard Bullivant Media redditchstandard.co.uk
16 Aldershot Courier https://media.info//newspapers/titles/aldershot-courier Guardian Media Group aldershot.co.uk
17 Aldershot Mail https://media.info//newspapers/titles/aldershot-mail Guardian Media Group aldershot.co.uk
18 Aldershot News & Mail https://media.info//newspapers/titles/aldershot-news-mail Reach plc gethampshire.co.uk/aldershot
19 Alford Standard https://media.info//newspapers/titles/alford-standard JPI Media skegnessstandard.co.uk
20 Alford Target https://media.info//newspapers/titles/alford-target DMGT dmgt.co.uk
21 Alfreton and Ripley Echo https://media.info//newspapers/titles/alfreton-and-ripley-echo JPI Media jpimedia.co.uk
22 Alfreton Chad https://media.info//newspapers/titles/alfreton-chad JPI Media chad.co.uk
23 All at Sea https://media.info//newspapers/titles/all-at-sea None allatsea.co.uk
24 Allanwater News https://media.info//newspapers/titles/allanwater-news HUB Media allanwaternews.co.uk
25 Alloa & Hillfoots Shopper https://media.info//newspapers/titles/alloa-hillfoots-shopper Reach plc reachplc.com
26 Alloa & Hillfoots Advertiser https://media.info//newspapers/titles/alloa-hillfoots-advertiser Dunfermline Press Group alloaadvertiser.com
27 Alloa and Hillfoots Wee County News https://media.info//newspapers/titles/alloa-and-hillfoots-wee-county-news HUB Media wee-county-news.co.uk
28 Alton Diary https://media.info//newspapers/titles/alton-diary Tindle Newspapers tindlenews.co.uk
29 Andersonstown News https://media.info//newspapers/titles/andersonstown-news Belfast Media Group irelandclick.com
30 Andover Advertiser https://media.info//newspapers/titles/andover-advertiser Newsquest Media Group andoveradvertiser.co.uk
31 Anfield and Walton Star https://media.info//newspapers/titles/anfield-and-walton-star Reach plc icliverpool.co.uk
32 The Anglo-Celt https://media.info//newspapers/titles/the-anglo-celt None anglocelt.ie
33 Annandale Herald https://media.info//newspapers/titles/annandale-herald Dumfriesshire Newspaper Group dng24.co.uk
34 Annandale Observer https://media.info//newspapers/titles/annandale-observer Dumfriesshire Newspaper Group dng24.co.uk
35 Antrim Times https://media.info//newspapers/titles/antrim-times JPI Media antrimtoday.co.uk
36 Arbroath Herald https://media.info//newspapers/titles/arbroath-herald JPI Media arbroathherald.com
37 The Arden Observer https://media.info//newspapers/titles/the-arden-observer Bullivant Media ardenobserver.co.uk
38 Ardrossan & Saltcoats Herald https://media.info//newspapers/titles/ardrossan-saltcoats-herald Newsquest Media Group ardrossanherald.com
39 The Argus https://media.info//newspapers/titles/the-argus Newsquest Media Group theargus.co.uk
40 Argyllshire Advertiser https://media.info//newspapers/titles/argyllshire-advertiser Oban Times Group argyllshireadvertiser.co.uk
41 Armthorpe Community Newsletter https://media.info//newspapers/titles/armthorpe-community-newsletter JPI Media jpimedia.co.uk
42 The Arran Banner https://media.info//newspapers/titles/the-arran-banner Oban Times Group arranbanner.co.uk
43 The Arran Voice https://media.info//newspapers/titles/the-arran-voice Independent News Ltd voiceforarran.com
44 The Art Newspaper https://media.info//newspapers/titles/the-art-newspaper None theartnewspaper.com
45 Ashbourne News Telegraph https://media.info//newspapers/titles/ashbourne-news-telegraph Reach plc ashbournenewstelegraph.co.uk
46 Ashby Echo https://media.info//newspapers/titles/ashby-echo Reach plc reachplc.com
47 Ashby Mail https://media.info//newspapers/titles/ashby-mail DMGT thisisleicestershire.co.uk
48 Ashfield Chad https://media.info//newspapers/titles/ashfield-chad JPI Media chad.co.uk
49 Ashford Adscene https://media.info//newspapers/titles/ashford-adscene DMGT thisiskent.co.uk
You can extract more information for each newspaper, as you wish - the above is just an example, going through the first 50 newspapers. Now if you want a multithreaded/async solution, I recommend you read the following, and apply it to your own scenario:
BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing
Lastly, Requests docs can be found here: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
For TQDM: https://pypi.org/project/tqdm/
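As a rough sketch of the multithreaded approach mentioned above, the standard library's concurrent.futures.ThreadPoolExecutor can run a fetch function over many URLs with a bounded number of worker threads. The helper name fetch_all, its fetch parameter, and the default of 8 workers are assumptions for illustration, not part of the code above; keep the worker count small so the server is not overwhelmed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL using a small thread pool.

    `fetch` is any callable taking a URL and returning a result
    (hypothetical parameter; e.g. a function wrapping session.get).
    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions
        return list(pool.map(fetch, urls))
```

With the session from the example above, something like fetch_all([u for _, u in newspaper_links], lambda u: s.get(u).text, max_workers=4) would return the HTML of each newspaper page, ready to be parsed with BeautifulSoup.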
import string
import requests
from bs4 import BeautifulSoup

names = []
for letter in string.ascii_lowercase:
    page = requests.get("https://media.info/newspapers/titles/starting-with/{}".format(letter))
    soup = BeautifulSoup(page.content, "html.parser")
    # href=True skips anchors that have no href attribute
    for i in soup.find_all("a", href=True):
        if i['href'].startswith("/newspapers/titles/"):
            names.append(i['href'])
Related
I am webscraping indeed.nl for "Junior UX Designer" in "Nederland". The website for that search term contains 6 webpages with vacancies - meaning, if one webpage contains 15 vacancies, I should get in total around 90 vacancies.
However, when I put it into a json file, I can see that I receive 90 rows - however, multiple duplicates are in there, and many job vacancies are not even displayed in the file.
This is the code I'm using:
import requests
from bs4 import BeautifulSoup
import json

jobs_NL = []

for i in range(1,7):
    url = "https://nl.indeed.com/vacatures?q=junior+ux+designer&l=Nederland&start="+str(i)
    print("Getting page",i)
    page = requests.get(url)
    html = BeautifulSoup(page.content, "html.parser")
    job_title = html.find_all("table", class_="jobCard_mainContent")
    for item in job_title:
        title = item.find("h2").get_text()
        company = item.find("span", class_="companyName").get_text()
        location = item.find("div", class_="companyLocation").get_text()
        if item.find("div", class_="salary-snippet") != None:
            salary = item.find("div", class_="heading6 tapItem-gutter metadataContainer").get_text()
        else:
            salary = "No salary found"
        vacancy = {
            "title": title,
            "company": company,
            "location": location,
            "salary": salary
        }
        jobs_NL.append(vacancy)
The start parameter is a result offset rather than a page number, so you need to multiply the page index by 10 to get the correct page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

jobs_NL = []

for i in range(7):
    url = "https://nl.indeed.com/vacatures?q=junior+ux+designer&l=Nederland&start={}".format(
        10 * i
    )
    print("Getting page", i)
    page = requests.get(url)
    html = BeautifulSoup(page.content, "html.parser")
    job_title = html.find_all("table", class_="jobCard_mainContent")
    for item in job_title:
        title = item.find("h2").get_text()
        company = item.find("span", class_="companyName").get_text()
        location = item.find("div", class_="companyLocation").get_text()
        if item.find("div", class_="salary-snippet") != None:
            salary = item.find(
                "div", class_="heading6 tapItem-gutter metadataContainer"
            ).get_text()
        else:
            salary = "No salary found"
        vacancy = {
            "title": title,
            "company": company,
            "location": location,
            "salary": salary,
        }
        jobs_NL.append(vacancy)

df = pd.DataFrame(jobs_NL)
print(df)
Prints:
...
90 UX Designer | SaaS Platform StarApple Amersfoort €3.000 - €4.500 per maand
91 Frontend Developer JustBetter Alkmaar No salary found
92 Software Engineer Infinitas Learning Thuiswerken No salary found
93 UX Researcher Cognizant Technology Solutions Amsterdam No salary found
94 Junior Front End developer StarApple Zeist+1 plaats €2.500 - €3.000 per maand
95 nieuwSenior User Experience Designer Trimble Bodegraven No salary found
96 Senior UX Designer - Research Agency Found Professionals B.V. Amsterdam+1 plaats No salary found
97 HubSpot marketing lead Comaxx Waalre No salary found
98 nieuwJunior Technisch CRO Specialist Finest People Amsterdam West €50.000 per jaar
99 iOS developer Infoplaza Houten No salary found
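Because overlapping offsets can return the same vacancy more than once, it may also help to drop duplicates before saving. A minimal sketch, assuming a vacancy counts as a duplicate when its title, company and location all match (that key choice and the name dedupe_vacancies are assumptions, not something the site defines):

```python
def dedupe_vacancies(vacancies):
    """Return vacancies with repeated (title, company, location) removed.

    The first occurrence of each key is kept and input order is preserved.
    """
    seen = set()
    unique = []
    for vacancy in vacancies:
        key = (vacancy["title"], vacancy["company"], vacancy["location"])
        if key not in seen:
            seen.add(key)
            unique.append(vacancy)
    return unique
```

Calling dedupe_vacancies(jobs_NL) before building the DataFrame or writing the JSON file would leave one row per distinct vacancy.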
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
This code gets products from the page, but not all of them. I suspect the content is loaded dynamically, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or another automation library, and I'd rather not change my code much because it's almost done.
Do not use a bare requests.get for repeated requests to the same host; use a requests.Session so that the underlying connection is reused.
Reason: read-that
import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)


main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data, you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
where <PAGE> is a page number from 1 to 5.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Let's say we have a webpage with 42 ads, the code works just fine for only 19 ads then it returns:
Traceback (most recent call last):
File "main.py", line 53, in <module>
title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
If you started the code to process ads starting from a different ad number, let's say 5, it will also process the first 19 ads then raises the same error!
Here is a minimum code to show the issue I am having. Please note that this code will print the HTML for a functioning ad and also for the one with the error. What is printed is so different.
Run the code then change the value of i to see the results.
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})

# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True

# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)

    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()

    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})

    # Print one functioning ad html
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)

    print('real state title type', type(real_state_title))
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        print(traceback.format_exc())
        print(page_soup2)
        break

    print('____________________________________________________________')
Edit 1:
In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am not only getting the title but also every other piece of information about the ad, so I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason it only works for 19 ads, regardless of which one I start with. This is driving me nuts!
To get the ads from all pages, you can follow the "Next" link, as in this example:
import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))
    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break
    print()
    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]
Prints:
Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ
I think I figured out the problem. It seems the site does not allow many requests in a short period of time, so I added a try/except block that sleeps for 80 seconds when this error occurs. This fixed my problem!
You may want to change the sleep period depending on the website you are scraping.
Here is the modified code:
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback
import time

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})

# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True

# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)

    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()

    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except AttributeError:
        print(traceback.format_exc())
        i = i - 1
        t = 80
        print(f'----------------------------Sleep for {t} seconds!')
        time.sleep(t)
        continue

    print('____________________________________________________________')
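The same idea can be wrapped in a small helper that retries the same request after a pause. This is only a sketch: the name with_retries, the retry count, and the injectable sleep parameter are assumptions for illustration, and the 80-second default simply mirrors the value used above.

```python
import time


def with_retries(action, retries=3, delay=80, sleep=time.sleep):
    """Call `action()` until it succeeds or `retries` attempts are used.

    `action` is any zero-argument callable (hypothetical; e.g. a function
    that downloads one ad and returns its title). An AttributeError is
    treated as "the expected element was missing", matching the error above.
    """
    for attempt in range(1, retries + 1):
        try:
            return action()
        except AttributeError:
            if attempt == retries:
                raise  # give up after the last attempt
            print(f'Attempt {attempt} failed, sleeping for {delay} seconds')
            sleep(delay)
```

Passing sleep as a parameter keeps the helper easy to try out without actually waiting; in real use the default time.sleep applies.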
I am trying to web scrape an upcoming event date on reuters.com using Python and Beautifulsoup package.
Unfortunately it seems harder than expected to get out the upcoming earnings event date and time from HTML.
I do not understand why I cannot get a visible output via the below script although I can see the value while web inspecting the target URL. Does anybody know why? Is there any viable work-around?
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:70.0) Gecko/20100101 Firefox/70.0', }
URL = f'https://www.reuters.com/companies/SAPG.DE/events'
page = requests.get(URL, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='__next')
job_elems = results.find_all('section', class_='Events-section-2YwsJ')
for job_elem in job_elems:
    event_type = job_elem.find('h3').text
    if event_type.find('Events') != -1:
        print(job_elem.find('h3').text)
        items = job_elem.find_all('div', class_='EventList-event-Veu-f')
        for item in items:
            title = item.find('span').text
            earnings_time = item.find('time').get_text()
            if title.find('Earnings Release') != -1:
                print(earnings_time)
The class attribute of the element in question is EventList-date-cLNT9, which I have never seen before.
This happens because the content of the time tag is loaded by JavaScript, while BeautifulSoup only sees the static HTML returned by the server. You have two options:
use Selenium, or use their API.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
URL = f'https://www.reuters.com/companies/SAPG.DE/events'
page = driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find(id='__next')
job_elems = results.find_all('section', class_='Events-section-2YwsJ')
for job_elem in job_elems:
    event_type = job_elem.find('h3').text
    if event_type.find('Events') != -1:
        print(job_elem.find('h3').text)
        items = job_elem.find_all('div', class_='EventList-event-Veu-f')
        for item in items:
            title = item.find('span').text
            time = item.find('time').text
            print(f"Title: {title}, Time: {time}")
driver.quit()
output :
Upcoming Events
Title: SAP SE at Morgan Stanley Technology, Media and Telecom Conference (Virtual), Time: 1 Mar 2021 / 6PM EET
Title: Q1 2021 SAP SE Earnings Release, Time: 22 Apr 2021 / 8AM EET
The reason for that is those events are added dynamically by JavaScript, which means that they are not visible in the HTML you get back.
However, there's an API you can query to get the events
Here's how:
import requests

api_url = "https://www.reuters.com/companies/api/getFetchCompanyEvents/SAPG.DE"
response = requests.get(api_url).json()
for event in response["market_data"]["upcoming_event"]:
    print(f"{event['name']} - {event['time']}")
Output:
SAP SE at Morgan Stanley Technology, Media and Telecom Conference (Virtual) - 2021-03-01T16:45:00Z
Q1 2021 SAP SE Earnings Release - 2021-04-22T06:30:00Z
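The API returns times as ISO 8601 strings in UTC (the trailing Z). If a local time is wanted, such as the EET times shown on the page, a sketch along these lines could convert them. The helper name to_local and the fixed UTC+2 offset standing in for EET are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+2 offset used as a stand-in for EET (assumption; ignores daylight saving)
EET = timezone(timedelta(hours=2), "EET")


def to_local(iso_utc, tz=EET):
    """Parse a timestamp like '2021-04-22T06:30:00Z' and convert it to `tz`."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat,
    # so it is rewritten as an explicit +00:00 offset first
    parsed = datetime.fromisoformat(iso_utc.replace("Z", "+00:00"))
    return parsed.astimezone(tz)
```

For instance, to_local(event['time']) on the earnings release above gives 08:30 at UTC+2 on 22 April 2021.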
I'm new to programming and Python. I'm adapting code (https://github.com/rileypredum/East-Bay-Housing-Web-Scrape/blob/master/EB_Room_Prices.ipynb) to scrape Craigslist. My goal is to retrieve and store all the automotive posts in Chicago. I am able to store the Post Title, Post Time, Price, and Neighborhood. My next goal is to create a new column containing only the make of the vehicle (Toyota, Nissan, Honda, etc.) by searching the Post Title. How do I do this?
I believe the place to add this logic is cell In [13] of the notebook, using a variable "post_make" that searches "post_title".
#build out the loop
from time import sleep
from random import randint
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np

#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text)

pages = np.arange(0, results_total, 120)

iterations = 0

post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []

for page in pages:

    #get request
    response = get("https://sfbay.craigslist.org/search/eby/roo?"
                   + "s="
                   + str(page)
                   + "&hasPic=1"
                   + "&availabilityMode=0")

    sleep(randint(1,5))

    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')

    #define the posts
    posts = html_soup.find_all('li', class_= 'result-row')

    #extract data item-wise
    for post in posts:

        if post.find('span', class_ = 'result-hood') is not None:

            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)

            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)

            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)

            #post link
            post_link = post_title['href']
            post_links.append(post_link)

            post_price = post.a.text
            post_prices.append(post_price)

    iterations += 1
    print("Finished iteration: " + str(iterations))
Trying to figure out how to show the output.
Current output in excel is:
posted, neighborhood, post title, url, price
My goal is to add "post make" after the price.
I'm also looking for advice on how to show output from Jupyter notebooks here.
It's rather tricky to pull that out. I gave it a shot using another package, spaCy, to try to pull out the entities that are labelled as organisations (car companies). It's not perfect, but it's a start:
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spacy

nlp = spacy.load("en_core_web_sm")

req_url = 'https://chicago.craigslist.org/search/cta'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'}
payload = {
    's': '0',
    'query': 'automotive',
    'sort': 'rel'}

response = requests.get(req_url, headers=headers, params=payload)
soup = BeautifulSoup(response.text, 'html.parser')

total_posts = int(soup.find('span',{'class':'totalcount'}).text)
pages = list(range(0, total_posts, 120))

iterations = 0

post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
post_makes = []
post_models = []

for page in pages:
    payload = {
        's': page,
        'query': 'automotive',
        'sort': 'rel'}

    response = requests.get(req_url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')

    posts = soup.find_all('li', class_= 'result-row')

    #extract data item-wise
    for post in posts:
        if post.find('span', class_ = 'result-hood') is not None:

            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)

            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)

            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)

            #post link
            post_link = post_title['href']
            post_links.append(post_link)

            post_price = post.a.text.strip()
            post_prices.append(post_price)

            try:
                # Used Spacy and Named Entity Recognition (NER) to pull out makes/models within the title text
                post_title_text = post_title_text.replace('*', ' ')
                post_title_text = [ each.strip() for each in post_title_text.split(' ') if each.strip() != '' ]
                post_title_text = ' '.join( post_title_text)

                doc = nlp(post_title_text)
                model = [ent.text for ent in doc.ents if ent.label_ == 'PRODUCT']

                make_model_list = [ent.text for ent in doc if ent.tag_ == 'NNP']
                doc = nlp(' '.join(make_model_list))
                make = [ent.text for ent in doc.ents if ent.label_ == 'ORG']

                post_make = make[0]
                post_makes.append(post_make)

                post_model = model[0]
                post_models.append(post_model)
            except:
                post_makes.append('')
                post_models.append('')

    iterations += 1
    print("Finished iteration: " + str(iterations))

data = list(zip(post_timing,post_hoods,post_title_texts,post_links,post_prices,post_makes,post_models))

df = pd.DataFrame(list(zip(post_timing,post_hoods,post_title_texts,post_links,post_prices,post_makes,post_models)),
                  columns = ['time','hood','title','link','price','make','model'])
Output:
print(df.head(20).to_string())
time hood title link price make model
0 2019-10-03 07:12 (TEXT 855-976-4304 FOR CUSTOM PAYMENT) 2015 Ford Focus SE Sedan 4D sedan Dk. Gray - F... https://chicago.craigslist.org/chc/ctd/d/chica... $11500 Ford Focus SE
1 2019-10-03 06:03 (EVERYBODY DRIVES IN SOUTH ELGIN) $174/mo [][][] 2013 Hyundai Sonata BAD CREDIT OK https://chicago.craigslist.org/nwc/ctd/d/south... $174 Sonata BAD
2 2019-10-03 00:04 (EVERYBODY DRIVES IN SOUTH ELGIN) $658/mo [][][] 2016 Jeep Grand Cherokee BAD CR... https://chicago.craigslist.org/nwc/ctd/d/south... $658 Hyundai
3 2019-10-02 21:04 (EVERYBODY DRIVES IN SOUTH ELGIN) $203/mo [][][] 2010 Chevrolet Traverse BAD CRE... https://chicago.craigslist.org/nwc/ctd/d/south... $203 Jeep Grand Cherokee BAD Traverse BAD
4 2019-10-02 20:24 (DENVER) 2017 Jeep Cherokee Latitude 4x4 4dr SUV SKU:60... https://chicago.craigslist.org/chc/ctd/d/denve... $8995 Cherokee
5 2019-10-02 20:03 ( Buy Here Pay Here!) Good Credit, Bad Credit, NO Credit = NO Problem https://chicago.craigslist.org/nwc/ctd/d/chica... $0 Chevrolet
6 2019-10-02 20:03 ( Buy Here Pay Here!) Aceptamos Matricula!!! Te pagan en efectivo?? ... https://chicago.craigslist.org/wcl/ctd/d/chica... $0 Jeep
7 2019-10-02 20:02 ( Buy Here Pay Here!) Good Credit, Bad Credit, No Credit = No Problem https://chicago.craigslist.org/chc/ctd/d/vista... $0 Credit Bad Credit
8 2019-10-02 20:00 ( Buy Here Pay Here!) Good Credit, Bad Credit, No Credit= No Problem https://chicago.craigslist.org/sox/ctd/d/chica... $0
9 2019-10-02 19:15 (* CHRYSLER * TOWN AND COUNTRY * WWW.YOURCHOI... 2013*CHRYSLER*TOWN & COUNTRY*TOURING LEATHER K... https://chicago.craigslist.org/nwc/ctd/d/2013c... $9499
10 2019-10-02 19:09 (*CADILLAC* *DTS* WWW.YOURCHOICEAUTOS.COM) 2008*CADILLAC*DTS*1OWNER LEATHER SUNROOF NAVI ... https://chicago.craigslist.org/sox/ctd/d/2008c... $5999 Credit Bad Credit
11 2019-10-02 18:59 (WAUKEGANAUTOAUCTION.COM OPEN TO PUBLIC OVER ... 2001 *GMC**YUKON* XL DENALI AWD 6.0L V8 1OWNER... https://chicago.craigslist.org/nch/ctd/d/2001-... $1200
12 2019-10-02 18:47 (*GMC *SAVANA *CARGO* WWW.YOURCHOICEAUTOS.COM) 1999 *GMC *SAVANA *CARGO*G2500 SHELVES CABINET... https://chicago.craigslist.org/sox/ctd/d/1999-... $2999 Credit Bad Credit
13 2019-10-02 18:04 ( Buy Here Pay Here!) GoodCredit, Bad Credit, No credit = No Problem https://chicago.craigslist.org/nwc/ctd/d/chica... $0
14 2019-10-02 18:05 ( Buy Here Pay Here!) Rebuild your credit today!!! https://chicago.craigslist.org/sox/ctd/d/chica... $0 CHRYSLER
15 2019-10-02 18:03 ( Buy Here Pay Here!) Rebuild your credit today!!! Repo? No Problem!... https://chicago.craigslist.org/chc/ctd/d/vista... $0
16 2019-10-02 17:59 (* ACURA * TL * WWW.YOURCHOICEAUTOS.COM) 2006 *ACURA**TL* LEATHER SUNROOF CD KEYLES ALL... https://chicago.craigslist.org/sox/ctd/d/2006-... $4499
17 2019-10-02 18:00 ( Buy Here Pay Here!) Buy Here Pay Here!!! We Make it Happen!! Bad C... https://chicago.craigslist.org/wcl/ctd/d/chica... $0
18 2019-10-02 17:35 (ST JOHN) 2009 NISSAN VERSA https://chicago.craigslist.org/nwi/ctd/d/saint... $4995
19 2019-10-02 17:33 (DENVER) 2013 Scion tC Base 2dr Coupe 6M SKU:065744 Sci... https://chicago.craigslist.org/chc/ctd/d/denve... $5995 GoodCredit Bad Credit
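In the output above, a few make/model values appear to belong to a neighbouring listing (row 2 is titled as a Jeep Grand Cherokee but shows Hyundai). Columns built from parallel lists and joined with zip can drift like this whenever one list receives an extra or missing append for a post. One way to rule that out, as a minimal sketch, is to build a single record per post. The names KNOWN_MAKES, normalize_title, extract_make_model and build_rows are hypothetical, the sample posts are shortened titles for illustration, and the naive lookup is only a stand-in for the spaCy step, not the method used above.

```python
# Hypothetical, tiny list of makes; a real list would be much longer.
KNOWN_MAKES = ("Ford", "Hyundai", "Jeep", "Chevrolet", "Cadillac")
_MAKES_BY_LOWER = {make.lower(): make for make in KNOWN_MAKES}


def normalize_title(title):
    """Replace '*' separators with spaces and collapse repeated whitespace."""
    return " ".join(title.replace("*", " ").split())


def extract_make_model(title):
    """Return (make, model); both are '' when no known make is in the title.

    Stand-in for the NER step: the model is naively taken to be the word
    that follows the make.
    """
    words = normalize_title(title).split()
    for i, word in enumerate(words):
        make = _MAKES_BY_LOWER.get(word.lower())
        if make is not None:
            model = words[i + 1] if i + 1 < len(words) else ""
            return make, model
    return "", ""


def build_rows(posts):
    """Build one dict per post so every column stays on its own row."""
    rows = []
    for post in posts:
        make, model = extract_make_model(post["title"])
        rows.append({**post, "make": make, "model": model})
    return rows


# Illustrative sample data (assumed, shortened titles)
posts = [
    {"title": "2015 Ford Focus SE Sedan 4D sedan", "price": "$11500"},
    {"title": "Good Credit, Bad Credit, NO Credit = NO Problem", "price": "$0"},
    {"title": "2008*CADILLAC*DTS*1OWNER LEATHER", "price": "$5999"},
]
rows = build_rows(posts)
for row in rows:
    print(row)
```

Because each post contributes exactly one dict, a missing make or model can only leave a blank in that post's own row. A list of dicts like rows can be passed straight to pd.DataFrame(rows) to get the same kind of table.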