When I run this code to get the titles and links, I get about 10x as many results as there are on the page. Any idea what I am doing wrong? Is there a way to stop the scraping when we reach the last result on the page?
Thanks!
while True:
    web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
    driver.get(web)
    time.sleep(3)
    titleContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    linkContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    if (len(titleContainers) != 0):
        for i in range(len(titleContainers)):
            counter = counter + 1
            print("Counter: " + str(counter))
            titles.append(titleContainers[i].text)
            links.append(linkContainers[i].get_attribute("href"))
    else:
        break
You put yourself in an infinite loop with that `while True` statement. The `if (len(titleContainers) != 0):` condition will always evaluate to True once the elements are found on the page (there are 100 of them), so the same page gets scraped over and over. You're not posting your full code either; I imagine counter, titles and links are defined somewhere else in it. You may want to test that counter is less than or equal to the length of titleContainers, or simply drop the outer loop.
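For instance, a minimal sketch with the outer loop removed, so the page is scraped exactly once (this keeps the question's XPath; the class names come from Google News markup and may change, and the driver setup here is an assumption):

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a local Chrome driver is available
driver.get('https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen')
time.sleep(3)  # crude wait; Selenium's WebDriverWait would be more robust

titles, links = [], []
# one pass over the matched elements: each holds both the title text and the href
for container in driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]'):
    titles.append(container.text)
    links.append(container.get_attribute('href'))

print(len(titles), 'results scraped')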
Related
I am trying to scrape sales data for recently sold items from eBay with BeautifulSoup in Python, and it works very well with the following code, which finds all prices and all dates of sold items.
price = []
try:
    p = soup.find_all('span', class_='POSITIVE')
except:
    p = 'nan'
for x in p:
    x = str(x)
    x = x.replace(' ', '"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
Now I am running into a problem, though. As seen on the results page for this URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=babe+ruth+1933+goudey+149+psa+%281.5%29&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=babe+ruth+1933+goudey+149+psa+1.5&LH_Complete=1&rt=nc&LH_Sold=1), eBay sometimes suggests other search results if there are not enough for a specific search query.
Because of that, my code finds not only the correct prices but also those of the suggested results below the warning. I tried to find out where the warning message is located so I could drop every listing found after it, but I cannot figure it out. I also thought I could search for the prices one by one, but even then I cannot figure out how to detect when the warning appears.
Is there any other way you can think of to solve this?
I am aware that this is really specific.
You can scrape the number of results (shown at the top of the results page) and loop over a range of that count.
The code will be something like:
results = soup.find...  # find the element holding the number of results
# You have to make the variable an int, so strip out everything extra first
results = int(results)
for i in range(1, results):
    price[i] = str(price[i])
    price[i] = price[i].replace(' ', '"')
    price[i] = price[i].split('"')
    if '>Sold' in price[i]:
        continue
    else:
        pass  # keep this entry: it is a real price, not a "Sold" label
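A runnable variant of that idea might look like the sketch below. The 'srp-controls__count-heading' selector for the result count is an assumption to verify against the live page, and it assumes one POSITIVE span per real listing:

import re
import requests
from bs4 import BeautifulSoup

url = '...'  # the eBay search URL from the question
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# parse the result count, e.g. "3 results for ..." -> 3
# (the class name below is an assumption; check the live page source)
heading = soup.find('h1', class_='srp-controls__count-heading')
results = int(re.search(r'\d[\d,]*', heading.get_text()).group().replace(',', ''))

# keep only the first `results` prices; anything after that belongs to the
# suggested listings shown below eBay's "fewer results" warning
prices = [s.get_text() for s in soup.find_all('span', class_='POSITIVE')
          if not s.get_text().startswith('Sold')]
prices = prices[:results]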
I am running a scraping script made with Beautiful Soup. I scrape results from Google News, and I want to get only the first n results added to a variable as tuples.
Each tuple is made of the news title and the news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part.
Here is the code.
import re
import bs4, requests

articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
Written like this, it appends every news title and link that fulfills the if statement as a tuple, which may result in a long list. I'd like to take only the first n news items, let's suppose five, and then have the script stop.
I tried:
for _ in range(5):
but I don't understand where exactly to add it, because either the code does not run or it appends the same news item 5 times.
I also tried:
while len(articles_list)<5:
but as that statement is part of a for loop and the variable articles_list is global, it stops the appending for the next objects of the scrape as well.
And finally I tried:
for tuples in articles_list[0:5]:  # iterate over the tuples
    for element in tuples:  # print title, link and a divider
        print(element)
        print('-' * 80)
I am OK with doing this last one if there are no alternatives, but I'd rather avoid it, since articles_list would still contain more elements than I need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
maxcnt = 5  # max number of articles
for i in webcontent.findAll('div', {'jslog': '93789'}):
    if len(articles_list) == maxcnt: break  # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
            if len(articles_list) == maxcnt: break  # exit inner loop
print(str(len(articles_list)), 'articles')
print('\n'.join(['> ' + a[0] for a in articles_list]))  # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
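An alternative that avoids the double break (a sketch reusing the same webcontent, keyword_list and re from the code above): put the matching articles in a generator and take the first five with itertools.islice.

from itertools import islice

def matching_articles():
    # yields (title, link) for every article whose headline matches a keyword
    for div in webcontent.findAll('div', {'jslog': '93789'}):
        for link in div.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = div.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield (title, "https://news.google.com" + str(link.get('href')))

articles_list = list(islice(matching_articles(), 5))  # stops after 5 matches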
for element in driver.find_elements_by_xpath('.//span[@data-bind = "text: $salableQuantityData.qty"]'):
    elem = element.text
    stock = int(elem)
    if stock < 0:
        print(stock)
After this loop, I have to click driver.find_element_by_xpath('.//button[@class="action-next"]').click() and then continue the same loop.
Note: the web table has 5 pages of results, and each page has a few negative values; I'm trying to get the negative values from all the pages.
If I understand correctly, you will need a function. It's neat when you need to do the same thing several times.
Just wrap the work in a simple function and call it every time you need it. If I understood correctly, you need to click some sort of 'next page' button and then continue, right?
def some_work():
    for element in driver.find_elements_by_xpath('.//span[@data-bind = "text: $salableQuantityData.qty"]'):
        elem = element.text
        stock = int(elem)
        if stock < 0:
            print(stock)
    driver.find_element_by_xpath('.//button[@class="action-next"]').click()

some_work()
or just nest in for/while loops. Why not?
Try this to keep going through the pages until either the quantity element or the 'action-next' button is no longer found. This is my first time seeing Selenium, but its documentation suggests using NoSuchElementException.
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        some_work()
    except NoSuchElementException:
        break
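Combined into one self-contained sketch (assuming an already-initialized driver; note that find_elements, plural, returns an empty list rather than raising, so in practice it is the click on the missing next-page button that raises NoSuchElementException and ends the loop):

from selenium.common.exceptions import NoSuchElementException

def some_work():
    # print every negative quantity on the current page
    for element in driver.find_elements_by_xpath('.//span[@data-bind = "text: $salableQuantityData.qty"]'):
        stock = int(element.text)
        if stock < 0:
            print(stock)
    # raises NoSuchElementException once there is no next-page button
    driver.find_element_by_xpath('.//button[@class="action-next"]').click()

while True:
    try:
        some_work()
    except NoSuchElementException:
        break  # reached the last page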
I am scraping a website and I used its URL in order to paginate and get the next values. I created an infinite loop so that it won't stop, since there is too much data. But I don't know why my variable won't increment inside the loop. Also, I want to break the infinite loop if there is no data rendered in the JSON results.
Here is my code
pageNumber = 0
while True:
    driver.implicitly_wait(3)
    driver.get('https://reversewhois.domaintools.com/?ajax=mReverseWhois&call=ajaxUpdateRefinePreview&q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D&to=' + str(pageNumber))
    time.sleep(3)
    pre = driver.find_element_by_tag_name("pre").text
    data = json.loads(pre)
    table = data['results']
    tables = pd.read_html(table, skiprows=1)
    df = tables[-1]
    # print(df.to_string(index=False))
    pageNumber = pageNumber + 1  # it always prints 1
    print(pageNumber)
    continue  # I want to loop again with the incremented pageNumber value
I think I did it wrong. Can you tell me what I should do? Thanks.
I don't see any problem with how you increment pageNumber.
In order to break the loop, try:
table = data['results']
if not table:
    break
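A sketch of the whole loop with that break in place (assuming the same driver and the imports from the question; the long q= query string is elided here):

import json
import time
import pandas as pd

base = ('https://reversewhois.domaintools.com/?ajax=mReverseWhois'
        '&call=ajaxUpdateRefinePreview&q=...&to=')  # q=... as in the question

pageNumber = 0
frames = []
while True:
    driver.get(base + str(pageNumber))
    time.sleep(3)
    data = json.loads(driver.find_element_by_tag_name('pre').text)
    table = data['results']
    if not table:
        break  # no data rendered: we are past the last page
    frames.append(pd.read_html(table, skiprows=1)[-1])
    pageNumber += 1
    print(pageNumber)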
Hi everyone, first time poster, long time reader.
My problem is that I want an else statement to loop inside a for loop.
I want the else statement to loop until the if statement above it is met.
Can anyone tell me where I am going wrong? I have tried so many different ways, including while loops inside if statements, and I can't get my head around this.
Edit: I changed the code to a while loop (not on Prune's suggestion) but I can't escape the while loop.
for url in results:
    webdriver.get(url)
    try:
        liked = webdriver.find_elements_by_xpath("//span[@class=\"glyphsSpriteHeart__filled__24__red_5 u-__7\" and @aria-label=\"Unlike\"]")
        maxcount = 0
        while not liked:
            sleep(1)
            webdriver.find_element_by_xpath('//span/button/span').click()
            numberoflikesgiven += 1
            maxcount += 1
            print('number of likes given: ', numberoflikesgiven)
            sleep(2)
            webdriver.find_element_by_link_text('Next').click()
            if maxcount >= 10:
                print('max count reached .... moving on.')
                continue
        else:
            print('picture has already been liked...')
            continue
Yes, you very clearly wrote an infinite loop:
while not liked:
    ...  # no assignments to "liked"
You do not change the value of liked anywhere in the loop. You do not break out of the loop. Thus, you have an infinite loop. You need to re-evaluate liked on each iteration. A typical way to do this is to repeat the evaluation at the bottom of the loop:
sleep(2)
webdriver.find_element_by_link_text('Next').click()
liked = webdriver.find_elements_by_xpath(
    "//span[@class=\"glyphsSpriteHeart__filled__24__red_5 u-__7\" \
     and @aria-label=\"Unlike\"]")
Also note that your `continue` does nothing, as it's at the bottom of the loop. Just let your code reach the bottom naturally, and the `while` will continue on its own. If you intended to go to the next iteration of `for url in results`, then you need to `break` the `while` loop instead.
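Putting that together, a corrected sketch of the loop (same locators and counters as the question; re-evaluating liked each pass, pulling the max-count check into the while condition, and initializing numberoflikesgiven are my assumptions about the intended flow):

from time import sleep  # assumed import, as in the question's script

XPATH_LIKED = ('//span[@class="glyphsSpriteHeart__filled__24__red_5 u-__7"'
               ' and @aria-label="Unlike"]')

numberoflikesgiven = 0  # assumed to be defined in the full script
for url in results:
    webdriver.get(url)
    liked = webdriver.find_elements_by_xpath(XPATH_LIKED)
    if liked:
        print('picture has already been liked...')
        continue  # nothing to do here, go to the next url
    maxcount = 0
    while not liked and maxcount < 10:
        sleep(1)
        webdriver.find_element_by_xpath('//span/button/span').click()
        numberoflikesgiven += 1
        maxcount += 1
        print('number of likes given: ', numberoflikesgiven)
        sleep(2)
        webdriver.find_element_by_link_text('Next').click()
        # re-evaluate liked so the while loop can actually end
        liked = webdriver.find_elements_by_xpath(XPATH_LIKED)
    if maxcount >= 10:
        print('max count reached .... moving on.')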