I am trying to scrape an id attribute from an HTML page. It exists twice, and every time I print it, I get it twice. This is how I scrape it:
for review in soup.find_all("div", {"class": "reviewContainer"}):
    for review2 in review.findAll(True, {'id': True}):
        if len(review2) > 0:
            userid = review2['id']
            print(userid)
        else:
            userid = "N/A"
            print(userid)
Output:
ID_123
ID_123
ID_456
ID_456
I tried adding "review2['id'].next_element" to get just the first element that appears, but I get an error. Is there a solution for getting only the first found element, instead of getting it twice?
Try adding a conditional check to see if you've already found that userid before:
for review in soup.find_all("div", {"class": "reviewContainer"}):
    userid_found = []
    for review2 in review.findAll(True, {'id': True}):
        if len(review2) > 0:
            userid = review2['id']
            if userid not in userid_found:
                userid_found.append(userid)
                print(userid)
        else:
            userid = "N/A"
            print(userid)
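Alternatively, since find() returns only the first matching tag (unlike findAll(), which returns all of them), you can skip the bookkeeping list entirely. A minimal sketch using the question's markup:

for review in soup.find_all("div", {"class": "reviewContainer"}):
    # find() stops at the first descendant that has an id attribute
    first = review.find(True, {'id': True})
    userid = first['id'] if first is not None else "N/A"
    print(userid)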
Related
I'm trying to scrape Yellow Pages, but my code only takes the first business of each page and skips every other business on the page, e.g. the 1st company of page 1, the 1st company of page 2, etc.
I have no clue why it isn't first iterating through the web_page variable, then checking for additional pages, and thirdly looking for the closing statement and executing break.
If anyone can provide me with clues or help it would be highly appreciated!
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the original headers dict is not shown in the question

web_page_results = []

def yellow_pages_scraper(search_term, location):
    page = 1
    while True:
        url = f'https://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page}'
        r = requests.get(url, headers=headers)
        soup = bs(r.content, 'html.parser')
        web_page = soup.find_all('div', {'class':'search-results organic'})
        for business in web_page:
            business_dict = {}
            try:
                business_dict['name'] = business.find('a', {'class':'business-name'}).text
                print(f'{business_dict["name"]}')
            except AttributeError:
                business_dict['name'] = ''
            try:
                business_dict['street_address'] = business.find('div', {'class':'street-address'}).text
            except AttributeError:
                business_dict['street_address'] = ''
            try:
                business_dict['locality'] = business.find('div', {'class':'locality'}).text
            except AttributeError:
                business_dict['locality'] = ''
            try:
                business_dict['phone'] = business.find('div', {'class':'phones phone primary'}).text
            except AttributeError:
                business_dict['phone'] = ''
            try:
                business_dict['website'] = business.find('a', {'class':'track-visit-website'})['href']
            except AttributeError:
                business_dict['website'] = ''
            try:
                web_page_results.append(business_dict)
                print(web_page_results)
            except:
                print('saving not working')
        # If the last iterated page doesn't find the "next page" button, break the loop and return the list
        if not soup.find('a', {'class': 'next ajax-page'}):
            break
        page += 1
    return web_page_results
It's worth looking at this line:
web_page = soup.find_all('div', {'class':'search-results organic'})
When I go to the request URL, I can only find one instance of search-results organic on the page. You then iterate over that list (web_page), but there will only be one value in it. So when you run the for loop:
for business in web_page:
you will only ever execute it once, due to the single item in the list, and therefore only get the first result on the page.
You need to loop through the list of businesses on the page, not the container holding the business listings. I recommend creating a list from class='srp-listing':
web_page = soup.find_all('div', {'class':'srp-listing'})
This should give you a list of all the businesses on the page. When you iterate over this new list of businesses, you will go through more than just the one listing.
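A sketch of how the question's loop would sit on top of that change (class names taken from the page as described above; the pagination logic is unchanged):

web_page = soup.find_all('div', {'class': 'srp-listing'})

for business in web_page:
    business_dict = {}
    try:
        # each srp-listing div wraps a single business, so find() now targets one listing at a time
        business_dict['name'] = business.find('a', {'class': 'business-name'}).text
    except AttributeError:
        business_dict['name'] = ''
    web_page_results.append(business_dict)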
I'm new to programming and cannot figure out why this won't loop. It prints and converts the first item exactly how I want, but stops after the first iteration.
from bs4 import BeautifulSoup
import requests
import re
import json

url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')
section = html.find_all('ol', class_='row')

for books in section:
    # Title Element
    header_element = books.find("article", class_='product_pod')
    title_element = header_element.img
    title = title_element['alt']
    # Price Element
    price_element = books.find(class_='price_color')
    price_str = str(price_element.text)
    price = price_str[1:]
    # Create JSON
    final_results_json = {"Title": title, "Price": price}
    final_result = json.dumps(final_results_json, sort_keys=True, indent=1)
    print(title)
    print(price)
    print()
    print(final_result)
First, clarify what you are looking for: presumably you want to print the title, price, and final_result for every book scraped from books.toscrape.com. The code is working as written, even though the expectation is different. Notice that you are finding all the <ol> tags with class name "row", and there is just one such element on the page, so section has only one element and the for loop iterates just once.
How to debug it?
Check the type of section: type(section)
Print section to see what it contains
Write some print statements inside the for loop to understand what happens and when
It isn't hard to debug this one.
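For example, a quick check along these lines (a sketch using the question's variable names) makes the problem visible:

print(type(section))  # <class 'bs4.element.ResultSet'>
print(len(section))   # 1 -- there is only one <ol class="row"> on the page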
You need to change:
section = html.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
There is only one <ol> in that document. I think you want:
for book in section[0].find_all('li'):
<ol> means ordered list, of which there is one in this case; there are many <li> (list item) elements inside that <ol>.
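Plugged into the question's code, the fix might look like this sketch (the extraction logic is unchanged; only the thing being iterated changes):

section = html.find_all('ol', class_='row')

# iterate over the <li> items inside the single <ol>, one per book
for book in section[0].find_all('li'):
    header_element = book.find("article", class_='product_pod')
    title = header_element.img['alt']
    price = book.find(class_='price_color').text[1:]
    final_result = json.dumps({"Title": title, "Price": price}, sort_keys=True, indent=1)
    print(final_result)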
I am new to Python and Beautiful Soup. The project I am working on is a script which scrapes the pages behind the hyperlinks on this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter which only scrapes the pages whose "Last Out" date is past a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page behind each hyperlink (example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S)
Convert the profit into a float
Compare the profit to a variable called "goal" which has a float assigned to it.
If the profit is greater than or equal to goal, scrape the contents of the page. If the profit is NOT greater than or equal to the goal, do not scrape the webpage and continue the script.
Here is the snippet of code I am using to try and do this:
# Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)

# Compare profit to goal
goal = float(50000)
if profit >= goal
Basically, what I am trying to do is run an if statement on a value from the webpage; if the statement is true, scrape the webpage, and if it is false, do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (the second argument defines the layout of the date as year-month-day hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        # Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')

        # Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal
        goal = float(50000)
        if profit >= goal

        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
I am familiar with if statements; however, I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.
From my understanding, it seems like all you are asking is how to have the script continue if a page fails that criterion, in which case you just need to do:
if profit < goal:
    continue
Note, though, that the for loop in your snippet only uses the final value of profit; if there are other profit values you need to look at, they are not being evaluated.
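Placed inside the for url in address_links: loop of your script, the filter might look like this sketch (everything after the continue stays exactly as in the original):

# inside: for url in address_links:
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    profit = float(oldprofit.replace('USD', '').replace(' ', '').replace(',', ''))

# Compare profit to goal; skip this address entirely if it falls short
goal = 50000.0
if profit < goal:
    continue  # move on to the next url without scraping this page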
I've written a simple Python script for web scraping:
import requests
from bs4 import BeautifulSoup

for i in range(1, 3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg=" + str(i)
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    list = soup.find_all("li", {"class": "column"})
    for li in list:
        name = li.div.a.h3.text.strip()
        print(name)
        link = li.div.a.get("href")
        oldprice = li.find("div", {"class": "proDetail"}).find_all("a")[0].text.strip().strip('TL')
        newprice = li.find("div", {"class": "proDetail"}).find_all("a")[1].text.strip().strip('TL')
        print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")
It gives me a "list index out of range" error on the line newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL'). Why am I getting this error, and how can I fix it?
As mentioned above, your code is not finding as many elements as you expect. On the line newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL'), find_all("a") is sometimes returning a list with only one <a> tag, so the index [1] does not exist.
Additionally, you should check which web page this is happening on. By that I mean:
for i in range(1, 3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg=" + str(i)
It could be the case that the code fails when i=1, when i=2, or both, so you should examine each web page as well.
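A defensive way to handle listings that carry fewer than two price links (a sketch reusing the question's variable names):

prices = li.find("div", {"class": "proDetail"}).find_all("a")
# index only what actually exists; fall back to None otherwise
oldprice = prices[0].text.strip().strip('TL') if len(prices) > 0 else None
newprice = prices[1].text.strip().strip('TL') if len(prices) > 1 else None
print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")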
I am trying to extract data from this site: http://www.afl.com.au/fixture
in such a way that I have a dictionary with the date as the key and the "Preview" links as values in a list, like:
dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}
Please help me get this. I have used the code below:
def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    for index in ldateList:
        # print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class": "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList
My main aim is to get, in a data structure, all the player names playing for the home and away teams in each match, organised by date.
I prefer using CSS Selectors. Select the first table, then all rows in the tbody for ease of processing; the rows are 'grouped' by tr th rows. From there you can select all next siblings that don't contain th headers and scan these for preview links:
previews = {}
table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found the next group, end the scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))
This builds a dictionary of lists; for the current version of the page this produces:
{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
 u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                      'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                      'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
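As a side note, collections.defaultdict(list) expresses the same grouping without the setdefault call; an equivalent sketch:

from collections import defaultdict

previews = defaultdict(list)
table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            break
        for preview in next_sibling.select('a.preview'):
            # a missing key is created as an empty list automatically
            previews[date].append("http://www.afl.com.au" + preview.get('href'))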