How to open scraped links one by one automatically using Python?

Here is my situation: say you search eBay for "Motorola DynaTAC 8000x". The bot I built scrapes the links of all the listings. My goal now is to make it open those scraped links one by one.
I think this should be possible with a loop, but I am not sure how to do it. Thanks in advance!
Here is the code of the bot:
import requests
from bs4 import BeautifulSoup
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")
for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        print(link)

To open each link and get information from it, request the link inside the loop:
import requests
from bs4 import BeautifulSoup
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")
for a in listings:
    # .get() avoids a KeyError on anchors that have no href attribute
    link = a.get("href", "")
    if link.startswith("https://www.ebay.com/itm/"):
        s = BeautifulSoup(requests.get(link).content, "lxml")
        price = s.select_one('[itemprop="price"]')
        print(s.h1.text)
        print(price.text if price else "-")
        print(link)
        print("-" * 80)
Prints:
...
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  Vintage Pulsar Extra Thick Brick Cell Phone Has Dynatac 8000X Display
US $3,000.00
https://www.ebay.com/itm/163814682288?hash=item26241daeb0:g:sTcAAOSw6QJdUQOX
--------------------------------------------------------------------------------
...
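The sample output above lists the same item twice, most likely because a search result contains more than one anchor pointing at the same listing. Below is a minimal sketch of one way to deduplicate the scraped links before opening them. It assumes that the part of the URL before `?` is enough to identify a listing, which is an assumption rather than documented eBay behaviour, and `unique_item_links` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

ITEM_PREFIX = "https://www.ebay.com/itm/"


def unique_item_links(hrefs):
    """Return item links without query strings, first occurrence first."""
    seen = {}
    for href in hrefs:
        if not href.startswith(ITEM_PREFIX):
            continue
        parts = urlsplit(href)
        # Assumption: scheme + host + path is enough to identify a listing.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        seen.setdefault(clean, None)  # dicts keep insertion order
    return list(seen)


hrefs = [
    "https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt",
    "https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt",
    "https://www.ebay.com/itm/163814682288?hash=item26241daeb0:g:sTcAAOSw6QJdUQOX",
    "https://www.ebay.com/sch/i.html?_nkw=Motorola+DynaTAC+8000x",
]
links = unique_item_links(hrefs)
for link in links:
    print(link)
```

The resulting list can then be passed to the loop that requests each listing, so every item page is fetched only once.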


bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code
data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',','')
    price = int(price)
    prices.append(price)
Any idea as to why I can't collect the prices from all the pages using this script?
Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a string of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!
The script only requests the first page of results. You can append a &pn=<page number> parameter to the URL to get the next pages:
import re
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="
prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")
    for line in data.findAll(
        "div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}
    ):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)
    print("-" * 80)
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50
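The two steps used above, building the URL for a given page and turning the price text into an integer, can be isolated and checked without any network access. This is a minimal sketch: the query parameters are copied from the URL in the question, while `page_url` and `parse_price` are hypothetical helper names.

```python
import re
from urllib.parse import urlencode

BASE = "https://www.zoopla.co.uk/to-rent/property/central-london/"


def page_url(page):
    """Build the search URL for one results page."""
    params = {
        "beds_max": 5,
        "price_frequency": "per_month",
        "q": "Central London",
        "results_sort": "newest_listings",
        "search_source": "home",
        "pn": page,
    }
    return BASE + "?" + urlencode(params)


def parse_price(text):
    """Turn text such as '£3,012 pcm' into 3012, or None without digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


print(page_url(2))
print(parse_price("£3,012 pcm"))
```

Note that `parse_price` concatenates every digit it finds, so it is only suitable for text that contains a single amount.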

Is there a way to parse data from multiple pages from a parent webpage?

I have been using https://ndclist.com/?s=Solifenacin to look up NDC codes. I need the 10-digit NDC codes, but the search results page only shows 8-digit codes.
So I click on an underlined NDC code, which opens a second page.
I copy and paste the 2 NDC codes from that page into an Excel sheet, then repeat the process for the rest of the codes on the first page. This takes a good bit of time, so I was wondering whether there is a Python library that could collect the 10-digit NDC codes for me, or store them in a list that I could print once I have gone through all the 8-digit codes on the first page. Would BeautifulSoup work, or is there a better library for this?
EDIT:
I actually need to go another level deep. I've been trying to figure it out without success. The last level is an HTML table, and I only need one element of the table; it is the page you reach after clicking on the second-level codes.
Here is the code that I have, but it returns a tr and a None object when I run it.
url ='https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)
print('Trospium')
print(all_data)
Yes, BeautifulSoup is ideal in this case. This script prints all the 10-digit codes reachable from the search page:
import requests
from bs4 import BeautifulSoup
url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)
# all_data now holds all the codes; uncomment to print them:
# print(all_data)
Prints:
Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processin link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
EDIT: (Version where I get the description too):
import requests
from bs4 import BeautifulSoup
url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))
# all_data now holds all (code, description) pairs:
# print(all_data)
Prints:
Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
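The EDIT in the question asks for a single element from a table on the third-level page. In the question's code, the loop runs over one `tr` element, so it walks that row's child nodes instead of looking up a value, which is probably why it prints tag names and `None`. A minimal sketch of a label-based lookup follows. The table markup and the label text are hypothetical, since the real page structure is not shown here, and `cell_for_label` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real table on the package page may differ.
HTML = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>NDC Package Code</td><td>0093-5263-56</td></tr>
  <tr><td>Package Description</td><td>30 TABLET, FILM COATED in 1 BOTTLE</td></tr>
</table>
"""


def cell_for_label(soup, label):
    """Return the text of the cell that follows the cell named `label`."""
    for row in soup.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        if len(cells) >= 2 and cells[0] == label:
            return cells[1]
    return None


soup = BeautifulSoup(HTML, "html.parser")
print(cell_for_label(soup, "NDC Package Code"))
```

Looking a row up by its label tends to be less fragile than indexing rows by position, because it keeps working if rows are added or reordered.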

How do I get the next tag

I am trying to get the headlines that follow elements of a given class. The headlines are wrapped in h2 tags, and each headline comes after the span tag.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
mytags = mydivs.findNext('h2')
for tag in mytags:
    print(tag.text.strip())
You must iterate through mydivs to use findNext().
mydivs is a list of elements, and findNext only applies to a single element. You must iterate through the elements and run findNext on each of them.
Just add this line
for div in mydivs:
and put it before
mytags = div.findNext('h2')
Here is the full code for your working program:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
Try replacing the last 3 lines with:
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
soup.findAll() returns a list-like ResultSet (empty if nothing matches), so you cannot call findNext() on it. However, you can iterate over the tags and call find_next() on each tag separately:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
    print(tag.find_next('h2').get_text(strip=True))
Prints:
BREAKING: Another federal lawmaker dies in Dubai hospital
Cross-Over Night: Enugu Govt bans burning of tyres on roads
Dadiyata: DSS breaks silence as Nigerian govt critic remains missing
CAC: Nigerian govt appoints new Acting Registrar-General
What Buhari told me – Dabiri-Erewa
What soldiers should expect in 2020 – Buratai
Only earthquake can erase Amosun’s legacies in Ogun – Akinlade
Civil War: Militia leader sentenced to 20yrs in prison
2020: Prophet Omale releases prophecies on Buhari, Aisha, Kyari, govs, coup plot
BREAKING: EFCC arrests Shehu Sani
Armed Forces Day: Yobe Governor Buni, donates N40 million for emblem appeal fund
Zamfara govt bans illegal gathering in the state
Agbenu Kacholalo: Colours of culture at Idoma International Carnival 2019 [PHOTOS]
Men of God are too fearful, weak to challenge government activities
2020: Peter Obi sends message to Nigerians
TETFUND: EFCC, ICPC asked to probe agency over alleged corruption
Two inmates regain freedom from Uyo prison
Buhari meets President of AfDB, Adeshina at Aso Rock
New Kogi CP resumes office, promises crime free state
Nothing stops you from paying N30,000 minimum wage to workers – APC challenges Makinde
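One detail worth guarding against: find_next() returns None when nothing follows the tag, so chaining .get_text() directly can raise an AttributeError. The following minimal sketch runs on made-up markup that only mimics the structure described in the question (a date span followed by an h2 headline).

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics a date span followed by a headline.
HTML = """
<div><span class="mvp-cd-date left relative">1 hour ago</span><h2>First headline</h2></div>
<div><span class="mvp-cd-date left relative">2 hours ago</span><h2> Second headline </h2></div>
<div><span class="mvp-cd-date left relative">3 hours ago</span></div>
"""

soup = BeautifulSoup(HTML, "html.parser")
headlines = []
for span in soup.find_all("span", class_="mvp-cd-date"):
    h2 = span.find_next("h2")  # first <h2> after this span in document order
    if h2 is not None:         # the last span has no following <h2>
        headlines.append(h2.get_text(strip=True))

print(headlines)
```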
EDIT: This script will scrape headlines from several pages:
import requests
from bs4 import BeautifulSoup
url = 'https://dailypost.ng/hot-news/page/{}/'
for page in range(1, 5):  # <-- change how many pages you want
    print('Page no.{}'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
    for tag in mydivs:
        print(tag.find_next('h2').get_text(strip=True))
    print('-' * 80)
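Instead of hard-coding the number of pages, the loop can stop at the first page that yields no items. This is a minimal sketch of that pattern: `collect_pages` is a hypothetical helper, and `fake_fetch` stands in for a function that would download one page and return its headlines.

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page yields nothing."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page is taken to mean the end of the listing
            break
        items.extend(batch)
    return items


# Stand-in for a function that downloads and parses one page of headlines.
FAKE_SITE = {1: ["headline a", "headline b"], 2: ["headline c"]}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


print(collect_pages(fake_fetch))
```

The `max_pages` cap is a safety net in case a site keeps returning content for page numbers past the end.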

Python bs4 BeautifulSoup: findall gives empty bracket

When I run this code it gives me an empty list. I'm new to web scraping, so I don't know what I'm doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
#btw the space is also there in the html code
print(container)
results:
[]
What I tried is to grab the HTML from the site and go through the li tags where all the information is stored, so that I can print out all the information in a for loop.
Also if someone wants to explain how to use BeautifulSoup we can always talk.
Thank you guys.
Working code that grabs the product and price could look something like this. Note that it sends a User-Agent header with the request:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
    h2 = cont.h2.text.strip()
    # Amazon lists prices in two ways. If one fails, use the other
    try:
        currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
        price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
    except AttributeError:
        # find() returned None, so fall back to the second price layout
        fallback = cont.find('span', {'class': 'a-size-base a-color-base'})
        price = fallback.text.strip() if fallback else 'n/a'
    print('Product: {}, Price: {}'.format(h2, price))
Let me know if that helps you further...
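An empty list from findAll can mean that the server returned a different page than the browser shows, or that the class filter did not match. For the second case, a CSS selector that names each class separately may be less brittle than one exact class string. The sketch below uses simplified, made-up markup, not Amazon's real HTML.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup in the shape the question describes.
HTML = """
<ul>
  <li class="s-result-item celwidget "><h2>Laptop A</h2></li>
  <li class="celwidget s-result-item extra"><h2>Laptop B</h2></li>
  <li class="other"><h2>Not a result</h2></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
# Match any <li> that carries both classes, in any order.
results = soup.select("li.s-result-item.celwidget")
titles = [li.h2.get_text(strip=True) for li in results]
print(titles)
```

Printing the response status code and the first few hundred characters of the response text is a quick way to tell the two cases apart.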

Python not finding search terms in strings parsed by BeautifulSoup

In Python 3, when I want to return only strings with the term I am interested in, I can do this:
phrases = ["1. The cat was sleeping",
           "2. The dog jumped over the cat",
           "3. The cat was startled"]
for phrase in phrases:
    if "dog" in phrase:
        print(phrase)
Which of course prints "2. The dog jumped over the cat".
Now what I'm trying to do is make the same concept work with parsed strings in BeautifulSoup. Craigslist, for example, has lots of a tags, but only the a tags that also have "hdrlnk" in them are of interest to us. So I tried:
import requests
from bs4 import BeautifulSoup
url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    if "hdrlnk" in link:
        print(link)
The problem is that, instead of printing all the a tags with "hdrlnk" inside, Python prints nothing, and I'm not sure what's going wrong.
"hdrlnk" is a class attribute on the links. Since you are only interested in these links, just find the links based on their class, like this:
import requests
from bs4 import BeautifulSoup
url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a", {"class": "hdrlnk"})
for link in links:
    print(link)
Outputs:
<a class="result-title hdrlnk" data-id="6293679332" href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">High-Rise 2 Bedroom Heated Pool Indoor Parking Fire Pit Pet Friendly!</a>
<a class="result-title hdrlnk" data-id="6285069993" href="/chc/apa/d/new-beautiful-studio-in/6285069993.html">NEW-Beautiful Studio in Uptown/free heat</a>
<a class="result-title hdrlnk" data-id="6293694090" href="/chc/apa/d/albany-park-2-bed-1-bath/6293694090.html">Albany Park 2 Bed 1 Bath Dishwasher W/D & Heat + Parking Incl Pets ok</a>
<a class="result-title hdrlnk" data-id="6282289498" href="/chc/apa/d/north-center-2-bed-1-bath/6282289498.html">NORTH CENTER: 2 BED 1 BATH HDWD AC UNITS PROVIDE W/D ON SITE PRK INCLU</a>
<a class="result-title hdrlnk" data-id="6266583119" href="/chc/apa/d/beautiful-2bed-1bath-in-the/6266583119.html">Beautiful 2bed/1bath in the heart of Wrigleyville</a>
<a class="result-title hdrlnk" data-id="6286352598" href="/chc/apa/d/newly-rehabbed-2-bedroom-unit/6286352598.html">Newly Rehabbed 2 Bedroom Unit! Section 8 OK! Pets OK! (NHQ)</a>
To get the link href or text use:
print(link["href"])
print(link.text)
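The reason the original test prints nothing is that `in` applied to a BeautifulSoup Tag checks the tag's child nodes, not its HTML source. A small sketch on a single anchor taken from the output above (with its text shortened) shows the difference:

```python
from bs4 import BeautifulSoup

HTML = (
    '<a class="result-title hdrlnk" data-id="6293679332" '
    'href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">'
    "High-Rise 2 Bedroom</a>"
)
link = BeautifulSoup(HTML, "html.parser").a

in_tag = "hdrlnk" in link                       # compares against child nodes
in_classes = "hdrlnk" in link.get("class", [])  # compares against class names
in_markup = "hdrlnk" in str(link)               # plain substring test on the HTML

print(in_tag, in_classes, in_markup)
```

Checking the class list is the most precise of the three, since the substring test on the markup would also match "hdrlnk" appearing in the link text or URL.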
Try checking the class list of each link:
for link in links:
    if "hdrlnk" in link.get("class", []):
        print(link)
Just search for the term in the link content; otherwise your code seems fine:
import requests
from bs4 import BeautifulSoup
url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    # get_text() also works for anchors that have no child nodes
    if "hdrlnk" in link.get_text():
        print(link)
Or, if you want to search inside href or title, use link['href'] and link['title']
To get the required links, you can use selectors within your script to make the scraper robust and concise.
import requests
from bs4 import BeautifulSoup
base_link = "https://chicago.craigslist.org"
res = requests.get("https://chicago.craigslist.org/search/apa").text
soup = BeautifulSoup(res, "lxml")
for link in soup.select(".hdrlnk"):
    print(base_link + link.get("href"))
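Concatenating the base link works as long as every href is relative. A minimal sketch using `urljoin` from the standard library handles both relative and absolute hrefs. The first href comes from the output above, while the second, absolute one is a hypothetical variant added for illustration.

```python
from urllib.parse import urljoin

page_url = "https://chicago.craigslist.org/search/apa"
hrefs = [
    "/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html",
    # Hypothetical absolute href, to show that it is left unchanged.
    "https://chicago.craigslist.org/chc/apa/d/new-beautiful-studio-in/6285069993.html",
]
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```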
