Why Does My Code Scrape The First Record Only? - python

My code goes to a website and clicks on records, which causes drop-downs to expand.
My current code only prints the first drop-down record, and not the others.
For example, when the first record on the page is clicked, it drops down one record, and that is the first and only drop-down record that gets printed as my output.
How do I get it to pull all drop-down titles?
from selenium import webdriver
import time

driver = webdriver.Chrome()

for x in range(1, 2):
    driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
    time.sleep(4)
    productlist_length = len(driver.find_elements_by_xpath("//div[@class='accordin_title']"))
    for i in range(1, productlist_length + 1):
        product = driver.find_element_by_xpath("(//div[@class='accordin_title'])[" + str(i) + "]")
        title = product.find_element_by_xpath('.//h4').text.strip()
        print(title)
        buttonToClick = product.find_element_by_xpath('.//div[@class="sign"]')
        buttonToClick.click()
        time.sleep(5)
        subProduct = driver.find_element_by_xpath(".//li[@class='sub_accordin_presentation']")
        otherTitle = subProduct.find_element_by_xpath('.//h4').text.strip()
        print(otherTitle)

You don't need Selenium at all. I'm not sure exactly which info you're after, but the following shows that the content inside those expand blocks is already present in the response from a simple requests.get():
import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=1')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')

for session in sessions:
    print(session.select_one('h4').text)
    sub_session = session.select('.sub_accordin_presentation')
    if sub_session:
        print([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
    print()
    print()

Try:
products = driver.find_elements_by_xpath('//*[@class="jscroll-inner"]/ul/li')
for product in products:
    title = product.find_element_by_xpath('(.//*[@class="accordin_title"]/div)[3]/h4').text
    print(title)
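If you do stay with Selenium, the key change is to call find_elements (plural) so every expanded sub-item is collected, not just the first match. A minimal sketch of that idea, assuming the page keeps the accordin_title / sign / sub_accordin_presentation markup used above:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=1')
time.sleep(4)

for session in driver.find_elements_by_xpath("//div[@id='accordin']/ul/li"):
    # session title
    print(session.find_element_by_xpath(".//div[@class='accordin_title']//h4").text.strip())
    # expand the drop-down so the sub-presentations become visible
    session.find_element_by_xpath(".//div[@class='sign']").click()
    time.sleep(2)
    # find_elements (plural) returns every sub-item, not only the first
    for sub in session.find_elements_by_xpath(".//li[@class='sub_accordin_presentation']//h4"):
        print(sub.text.strip())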

Related

Beautifulsoup - Python For loop only runs 8 times then exits with code 0 in visual studio code

I've got a Python script that scrapes the first page on an auction site. The page it's scraping is trademe.co.nz - similar to eBay/Amazon etc. Its purpose is to scrape all listings on the first page, but only those not already in my database. It works as expected with one caveat: it only scrapes the first 8 listings (regardless of the Trade Me URL) and then exits with code 0 in Visual Studio Code. If I try to run it again it exits immediately, as it thinks there are no new auction IDs. If a new listing gets added and I run the script again, it will add the new one.
from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3

# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

def mechanicalKeyboards():
    url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
    category = "Mechanical Keyboards"
    dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
    # getCode = requests.get(url).status_code
    # print(getCode)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
    conn = sqlite3.connect('trademe.db')
    c = conn.cursor()
    c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
    allResult = str(c.fetchall())
    for listing in listingContainer:
        title = listing.select("#-title")
        location = listing.select("#-region")
        auctionID = listing['data-aria-id'].split("-").pop()
        fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
        image = listing.select("picture img")
        try:
            buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
        except:
            buyNow = "None"
        try:
            price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
        except:
            price = "None"
        for t, l, i in zip(title, location, image):
            if auctionID not in allResult:
                print("Adding new data - " + t.text)
                c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
                conn.commit()
    sleep(5)
I thought perhaps I was getting rate-limited, but I get a 200 status code, and changing URLs works for the first 8 listings again. I had a look at the elements and can't see any changes after the 8th listing. I'm hoping someone could assist, thanks so much.
When using requests.get(url) to scrape a website with lazy-loaded content, the response only contains the HTML with images for the first 8 listings. This causes zip(title, location, image) to yield only 8 items, since image is an empty list for every listing after the 8th in listingContainer.
To properly scrape this type of website, I would recommend using a tool such as Playwright or Selenium.
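For example, a minimal Selenium sketch of that approach (assuming the same .tm-marketplace-search-card__wrapper cards as in the question; scrolling to the bottom forces the lazy-loaded images to render before the HTML is parsed):

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc")
sleep(3)

# Scroll to the bottom so the lazy-loaded listing images are rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(3)

soup = BeautifulSoup(driver.page_source, "html.parser")
listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
print(len(listingContainer))  # every card should now have its picture img populated
driver.quit()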

How can I scrape prices from the next pages?

I'm new to python and web scraping.
I wrote some code using requests and BeautifulSoup. One script scrapes prices, names, and links, and works fine; it is as follows:
from bs4 import BeautifulSoup
import requests

urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1"
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')

for figcaption in soup.find_all('figcaption'):
    price = figcaption.div.text
    name = figcaption.find('a', class_='title').text
    link = figcaption.find('a', class_='title')['href']
    print(price)
    print(name)
    print(link)
and another one for building the other URLs that I need that information scraped from, which also gives the correct URLs when I use print():
x = 0
counter = 1

for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    counter += 1
    x += 1
    print(urls)
But when I try to combine these two, so that it scrapes a page, changes the URL to the next one, and scrapes that too, it just gives the information scraped from the first page 70 times. Please guide me through this. The whole code is as follows:
from bs4 import BeautifulSoup
import requests

x = 0
counter = 1

for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    source = requests.get(urls).text
    soup = BeautifulSoup(source, 'lxml')
    counter += 1
    x += 1
    print(urls)
    for figcaption in soup.find_all('figcaption'):
        price = figcaption.div.text
        name = figcaption.find('a', class_='title').text
        link = figcaption.find('a', class_='title')['href']
        print(price)
        print()
        print(name)
        print()
        print(link)
Your x = 0 and then incrementing it by 1 is redundant and not needed, as range(0, 70) already iterates through that range. I'm also not sure why you have a counter, as you don't need that either; both are dropped in the code below.
HOWEVER, I believe the issue is not with the iteration or looping, but with the URL itself. If you manually go to the two pages listed below, the content doesn't change:
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1
and then
https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-2
Since the site is dynamic, you'll need to find a different way to iterate page to page, or figure out what the exact url is. So try:
from bs4 import BeautifulSoup
import requests

for x in range(0, 70):
    try:
        urls = 'https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html&pagesize[]=24&order[]=new&stock[]=1&page[]=' + str(x + 1) + '&ajax=ok?_=1561559181560'
        source = requests.get(urls).text
        soup = BeautifulSoup(source, 'lxml')
        print('Page: %s' % (x + 1))
        for figcaption in soup.find_all('figcaption'):
            price = figcaption.find('span', {'class': 'new_price'}).text.strip()
            name = figcaption.find('a', class_='title').text
            link = figcaption.find('a', class_='title')['href']
            print('%s\n%s\n%s' % (price, name, link))
    except:
        break
You can find that link by going to the website and opening the dev tools (Ctrl+Shift+I, or right-click and choose 'Inspect') -> Network -> XHR.
When I did that and then physically clicked to the next page, I could see how the data was rendered, and found the request URL.

Google news crawler flip pages

Continuing previous work to crawl all news results for a query and return the title and URL, I am refining the crawler to get all results from all pages in Google News. The current code seems to return only the first page of Google News search results. I would be grateful to know how to get the results from all pages. Many thanks!
my codes below:
import requests
from bs4 import BeautifulSoup
import time
import datetime
from random import randint
import numpy as np
import pandas as pd

query2Google = input("What do you want from Google News?\n")

def QGN(query2Google):
    s = '"' + query2Google + '"'  # Keywords for query
    s = s.replace(" ", "+")
    date = str(datetime.datetime.now().date())  # timestamp
    filename = query2Google + "_" + date + "_" + 'SearchNews.csv'  # csv filename
    f = open(filename, "wb")
    url = "http://www.google.com.sg/search?q=" + s + "&tbm=nws&tbs=qdr:y"  # URL for query of news results within one year and sort by date
    #htmlpage = urllib2.urlopen(url).read()
    time.sleep(randint(0, 2))  # waiting
    htmlpage = requests.get(url)
    print("Status code: " + str(htmlpage.status_code))
    soup = BeautifulSoup(htmlpage.text, 'lxml')
    df = []
    for result_table in soup.findAll("div", {"class": "g"}):
        a_click = result_table.find("a")
        #print("-----Title----\n" + str(a_click.renderContents()))  # Title
        #print("----URL----\n" + str(a_click.get("href")))  # URL
        #print("----Brief----\n" + str(result_table.find("div", {"class": "st"}).renderContents()))  # Brief
        #print("Done")
        df = np.append(df, [str(a_click.renderContents()).strip("b'"), str(a_click.get("href")).strip('/url?q='), str(result_table.find("div", {"class": "st"}).renderContents()).strip("b'")])
    df = np.reshape(df, (-1, 3))
    df1 = pd.DataFrame(df, columns=['Title', 'URL', 'Brief'])
    print("Search Crawl Done!")
    df1.to_csv(filename, index=False, encoding='utf-8')
    f.close()
    return

QGN(query2Google)
There used to be an AJAX API, but it's no longer available.
Still, you can modify your script with a for loop if you want to get a set number of pages, or a while loop if you want to get all pages.
Example:
url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y&start="
pages = 10 # the number of pages you want to crawl #
for next in range(0, pages*10, 10) :
page = url + str(next)
time.sleep(randint(1, 5)) # you may need longer than that #
htmlpage = requests.get(page) # you should add User-Agent and Referer #
print("Status code: " + str(htmlpage.status_code))
if htmlpage.status_code != 200 :
break # something went wrong #
soup = BeautifulSoup(htmlpage.text, 'lxml')
... process response here ...
next_page = soup.find('td', { 'class':'b', 'style':'text-align:left' })
if next_page is None or next_page.a is None :
break # there are no more pages #
Keep in mind that Google doesn't like bots; you might get banned.
You could add 'User-Agent' and 'Referer' headers to simulate a web browser, and use time.sleep(random.uniform(2, 6)) to simulate a human... or use Selenium.
You can also add &num=25 to the end of your query to get back a page with that many results; in this example you would get 25 Google results.
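A minimal sketch combining those suggestions, with placeholder header values (swap in whatever your own browser actually sends):

import random
import time
import requests

# Placeholder header values to look like a browser; use your own browser's strings
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.google.com.sg/",
}

url = "http://www.google.com.sg/search?q=test&tbm=nws&tbs=qdr:y&num=25"
time.sleep(random.uniform(2, 6))  # pause like a human before the request
htmlpage = requests.get(url, headers=headers)
print("Status code: " + str(htmlpage.status_code))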

Python Beautiful Soup

I am learning Beautiful Soup for Python and trying to parse the website "https://www.twitteraudit.com/". When I enter a Twitter ID in the search bar, it returns the results for some IDs in a fraction of a second, but other IDs take about a minute to process. In this case, how can I parse the HTML after it has loaded and the result is ready? I tried to loop it, but it doesn't work that way. What I did figure out is that if I open a browser, load the link, and let it finish, the result is cached, and the next time I run the script for the same ID it works perfectly.
Can anyone help me out with this? I appreciate the help. I attach the code below:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
from re import sub

def HTML(myURL):
    uClient = uReq(myURL)
    pageHTML = uClient.read()
    uClient.close()
    pageSoup = soup(pageHTML, "html.parser")
    return pageSoup

def fakecheck(usr):
    myURLfc = "https://www.twitteraudit.com/" + usr
    pgSoup = HTML(myURLfc)
    foll = pgSoup.findAll("div", {"class": "audit"})
    link = foll[0].div.a["href"]
    real = foll[0].findAll("span", {"class": "real number"})[0]["data-value"]
    fake = foll[0].findAll("span", {"class": "fake number"})[0]["data-value"]
    scr = foll[0].findAll("div", {"class": "score"})[0].div
    scoresent = scr["class"][1]
    score = re.findall(r'\d{1,3}', str(scr))[0]
    return [link, real, fake, scoresent, score]

lis = ["BarackObama", "POTUS44", "ObamaWhiteHouse", "MichelleObama", "ObamaFoundation", "NSC44", "ObamaNews", "WhiteHouseCEQ44", "IsThatBarrak", "obama_barrak", "theprezident", "barrakubama", "BarrakObama", "banackkobama", "YusssufferObama", "barrakisdabomb_", "BarrakObmma", "fuzzyjellymasta", "BarrakObama6", "bannalover101", "therealbarrak", "ObamaBarrak666", "barrak_obama"]

for u in lis:
    link, real, fake, scoresent, score = fakecheck(u)
    print("link : " + link)
    print("Real : " + real)
    print("Fake : " + fake)
    print("Result : " + scoresent)
    print("Score : " + score)
    print("=================")
I think the problem is that some of the Twitter IDs have not yet been audited, and so I was getting an IndexError. However, putting the call to fakecheck(u) in a while True: loop that catches that error will keep checking the website until an audit has been performed on that ID.
I put this code after the lis definition:
def get_fake_check(n):
    return fakecheck(n)

for u in lis:
    while True:
        try:
            link, real, fake, scoresent, score = get_fake_check(u)
            break
        except:
            pass
I'm not sure if there is a way to automate the audit request on the website, but when a query was waiting, I manually clicked the "Audit" button on the website for that ID, and once the audit was completed, the script continued as usual until all ID audits were processed.
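If you would rather not loop forever on a bare except, a slightly tighter variant (just a sketch, assuming the IndexError diagnosis above and an arbitrary retry cap) could look like this:

import time

for u in lis:
    for attempt in range(20):  # arbitrary cap so one ID can't block the run forever
        try:
            link, real, fake, scoresent, score = fakecheck(u)
            break
        except IndexError:
            # audit not ready yet; wait a bit before asking again
            time.sleep(5)
    else:
        print("Giving up on " + u)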

Simulate clicking a link inside a link - Selenium Python

Python Knowledge: beginner
I managed to create a script to scrape contact information. Since I am a beginner, the flow I followed was to extract all the first-level links, copy them to a text file, and use them in link = browser.find_element_by_link_text(str(link_text)). Scraping of the contact details has been confirmed to work (based on a separate run). The problem is that after clicking the first links, it won't go on to click the links inside them, and hence it cannot scrape the contact info.
What is wrong with my script? Please bear in mind I am a beginner, so my script is a little manual and lengthy.
Thanks very much!!!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml

######################### open file list ####################################
testfile = open("category.txt")  # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end ###################################

################### open browser ###############################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ###################################

link_texts = readfilesplit
for link_text in link_texts:
    link = browser.find_element_by_link_text(str(link_text))
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
    link.click()  # click link
    time.sleep(5)
    print "-------------------------------------------------------------------------------------------------"
    print("Getting listings for '%s'" % link_text)

    ################# get list name #######################
    urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
    r = requests.get(browser.current_url)
    if (urlNoList != browser.current_url):
        soup = BeautifulSoup(r.content, 'html.parser')
        g_data = soup.find_all("div", {"class": "listing-summary"})
        pageRange = soup.find_all("span", {"class": "xlistings"})
        pageR = [pageRange[0].text]
        pageMax = str(pageR)[-4:-2]  # get max item for lists
        X = str(pageMax).replace('nd', '0')
        # print "Number of listings: ", X
        Y = int(X)  # convert string to int
        print "Number of listings: ", Y
        for item in g_data:
            try:
                listingNames = item.contents[1].text
                lstList = []
                lstList[len(lstList):] = [listingNames]
                replStr = re.sub(r"u'", "'", str(lstList))  # strip u' char
                replStr1 = re.sub(r"\s+'", "'", str(replStr))  # strip space and '
                replStr2 = re.sub(r"\sFeatured", "", str(replStr1))  # strip Featured string
                print "Cleaned string: ", replStr2

                ################ SCRAPE INFO ################
                ################### This is where the code is not executing #######################
                count = 0
                while (count < Y):
                    for info in replStr2:
                        link2 = browser.find_element_by_link_text(str(info))
                        time.sleep(10)
                        link2.click()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                        print "count", count
                        count += 1
                        print("Contact info for: '%s'" % link_text)
                        r2 = requests.get(browser.current_url)
                        soup2 = BeautifulSoup(r2.content, 'html.parser')
                        g_data2 = soup.find_all("div", {"class": "fields"})
                        for item2 in g_data2:
                            # print item.contents[0]
                            print item2.contents[0].text
                            print item2.contents[1].text
                            print item2.contents[2].text
                            print item2.contents[3].text
                            print item2.contents[4].text
                            print item2.contents[5].text
                            print item2.contents[6].text
                            print item2.contents[7].text
                            print item2.contents[8].text
                        browser.back()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
                ################### END ---- This is where the code is not executing END ---#######################
                ############ END SCRAPE INFO ####################
            except NoSuchElementException:
                browser.back()
                WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
    else:
        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
        print "Number of listings: 0"

    browser.back()
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
By the way this is some of the result:
-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings: 5
Cleaned string: ['Hydroflame Plumbing & Gas Ltd']
Cleaned string: ['Osborne Plumbing Ltd']
Cleaned string: ['Plumbers Auckland Central']
Cleaned string: ['Griffiths Plumbing']
Cleaned string: ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings: 2
Cleaned string: ['North Shore Chiropractor']
Cleaned string: ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings: 7
Cleaned string: ['Auckland Tree Services']
Cleaned string: ['Bob the Tree Man']
Cleaned string: ['Flawless House Washing & Drain Unblocking']
Cleaned string: ['Yardiez']
Cleaned string: ['Build Corp Apartments Albany']
Cleaned string: ['Auckland Trellis']
Cleaned string: ['Landscape Design']
What I would do is change the logic some. Here's the logic flow I would suggest you use. This will eliminate writing the links out to a file and speed up the script.
1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using CSS selector "#index a" and store the "href" attribute of each in an array of strings (links to each category page)
3. Loop through the href array
   3.1. Navigate to the href
      3.1.1. Grab all elements using CSS selector "div.listing-summary a" and store the .text of each (company names)
      3.1.2. If an element .by_link_text("Next") exists, click it and return to 3.1.1
If you want business contact info off of the company pages, you would want to store the href in 3.1.1 and then loop through that list and grab what you want off each page (a rough sketch of the whole flow follows below).
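A rough sketch of that flow in Selenium, assuming the CSS selectors named above ("#index a", "div.listing-summary a") still match the site's markup:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')

# Step 2: collect the category page links up front
category_links = [a.get_attribute('href')
                  for a in browser.find_elements_by_css_selector('#index a')]

for category_url in category_links:   # Step 3
    browser.get(category_url)         # Step 3.1
    while True:
        # Step 3.1.1: company names (or store hrefs here to visit each company page later)
        for a in browser.find_elements_by_css_selector('div.listing-summary a'):
            print(a.text)
        # Step 3.1.2: follow the "Next" pagination link if there is one
        next_links = browser.find_elements_by_link_text('Next')
        if not next_links:
            break
        next_links[0].click()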
Okay, I found a solution after thinking about @jeffC's suggestion:
Extract the href values and append them to the base URL, which is http://aucklandtradesmen.co.nz. For example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, tell the browser to navigate to that URL, and then I can do whatever I want on the current page.
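In code, that idea is roughly the following sketch (using the div.listing-summary a selector suggested above; the contact scraping itself is left as in the original script):

base_url = 'http://aucklandtradesmen.co.nz'

# Collect the hrefs first, then visit each one as an absolute URL
hrefs = [a.get_attribute('href')
         for a in browser.find_elements_by_css_selector('div.listing-summary a')]

for href in hrefs:
    full_url = href if href.startswith('http') else base_url + href
    browser.get(full_url)
    # ...scrape the contact details from the company page here...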
