Noticing a warning to limit scraped results with BeautifulSoup in Python

I am trying to scrape sales data for recently sold items from eBay with BeautifulSoup in Python, and it works very well with the following code, which finds all prices and all dates of sold items.
price = []
try:
    p = soup.find_all('span', class_='POSITIVE')
except:
    p = 'nan'
for x in p:
    x = str(x)
    x = x.replace(' ','"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
Now I am running into a problem, though. For this URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=babe+ruth+1933+goudey+149+psa+%281.5%29&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=babe+ruth+1933+goudey+149+psa+1.5&LH_Complete=1&rt=nc&LH_Sold=1), eBay sometimes suggests other search results below a warning message when there are not enough matches for the specific query.
Because of that, my code finds not only the correct prices but also the prices of the suggested results below the warning. I tried to find out where the warning message is located so I could drop every listing found after it, but I cannot figure it out. I also thought about searching for the prices one by one, but even then I cannot figure out how to detect when the warning appears.
Is there any other way you guys can think of to solve this?
I am aware that this is really specific.

You can scrape the number of results (shown in the picture) and loop over a range based on that count.
The code will be something like:
results = soup.find...  # find the element that shows the number of results
# You have to make the variable an int, so strip out everything extra first
results = int(results)
for i in range(results):
    x = str(p[i])
    x = x.replace(' ','"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
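For completeness, here is a minimal sketch of that idea combined with the loop from the question. The heading class name 'srp-controls__count-heading' is an assumption about eBay's current markup (it does not come from the question), so verify it in the page source before relying on it:

import re

# Assumed markup: eBay shows the result count in an element with class
# 'srp-controls__count-heading'; inspect the page to confirm.
heading = soup.find('h1', class_='srp-controls__count-heading')
count_match = re.search(r'\d+', heading.get_text()) if heading else None
results = int(count_match.group()) if count_match else 0

price = []
for x in p:
    if results and len(price) == results:
        break  # anything past this point belongs to the suggested results below the warning
    x = str(x).replace(' ', '"').split('"')
    if '>Sold' not in x:
        price.append(x)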

Related

10X Repeat - Scraping Google News Search Results Python Selenium

When I run this code to get the titles and links, I get 10X results. Any idea what I am doing wrong? Is there a way to stop the scraping when we reach the last result on the page?
Thanks!
while True:
    web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
    driver.get(web)
    time.sleep(3)
    titleContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    linkContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    if (len(titleContainers) != 0):
        for i in range(len(titleContainers)):
            counter = counter + 1
            print("Counter: " + str(counter))
            titles.append(titleContainers[i].text)
            links.append(linkContainers[i].get_attribute("href"))
    else:
        break
You put yourself in an infinite loop with that 'while True' statement. The if (len(titleContainers) != 0): condition will always evaluate to True once the elements are found on the page (there are 100 of them). You are also not posting your full code; I imagine that counter, titles and links are defined somewhere earlier. You may want to test whether counter is less than or equal to the length of titleContainers instead.
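As a rough illustration of that fix, the sketch below fetches the page once and collects every result without the while True loop. It assumes a locally configured Chrome driver and reuses the XPath from the question, which may change on Google's side:

from selenium import webdriver
import time

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
driver.get(web)
time.sleep(3)

titles, links = [], []
for element in driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]'):
    titles.append(element.text)
    links.append(element.get_attribute("href"))

print(str(len(titles)) + " articles collected")
driver.quit()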

Python Web Scraping - Find only n items

I am running a scraping script made with Beautiful Soup. I scrape results from Google News and I want to get only the first n results, added to a variable as tuples.
Each tuple is made of a news title and a news link. In the full script I have a list of keywords like ['crisis','finance'] and so on; you can disregard that part.
That's the code.
import bs4,requests
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content,'lxml')
for i in webcontent.findAll('div',{'jslog':'93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")},limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(),"https://news.google.com"+str(link.get('href'))))
Written like this, it adds as a tuple every news item and link that fulfils the if statement, which may result in a long list. I'd like to take only the first n news items, let's suppose five, and then I would like the script to stop.
I tried:
for _ in range(5):
but I don't understand where to add it exactly, because either the code does not run or it appends the same news item five times.
I also tried:
while len(articles_list) < 5:
but as the statement is part of a for loop and the variable articles_list is global, it also stops appending for the next objects of the scraping.
and finally i tried:
for tuples in (articles_list[0:5]): #Iterate in the tuple,
for element in tuples: #Print title, link and a divisor
print(element)
print('-'*80)
I am ok to do this last one if there are no alternatives, but i'd avoid as the variable articles_list would anyway contain more elements than i need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4,requests
keyword_list = ['health','Coronavirus','travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content,'lxml')
maxcnt = 5 # max number of articles
for ictr,i in enumerate(webcontent.findAll('div',{'jslog':'93789'})):
    if len(articles_list) == maxcnt: break # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")},limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(),"https://news.google.com"+str(link.get('href'))))
            if len(articles_list) == maxcnt: break # exit inner loop
print(str(len(articles_list)), 'articles')
print('\n'.join(['> '+a[0] for a in articles_list])) # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
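If you prefer to avoid the double break, an alternative is to wrap the scraping in a generator and slice it. This is only a sketch of that idea against the same page structure and keyword list as above:

import re
from itertools import islice
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
webcontent = bs4.BeautifulSoup(requests.get(base_url).content, 'lxml')

def matching_articles(soup):
    # yield (title, link) pairs whose title contains one of the keywords
    for i in soup.findAll('div', {'jslog': '93789'}):
        for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = i.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield (title, "https://news.google.com" + str(link.get('href')))

articles_list = list(islice(matching_articles(webcontent), 5))  # stops after the first 5 matches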

Python when checking if scraped element exists in list

I keep getting an error when I use an if/else statement in Python. I want my script to check whether an index exists: if it does, run one piece of code, and if not, run another. I get the error ValueError: 'Named Administrator' is not in list.
import requests
from bs4 import BeautifulSoup
url_3 = 'https://www.brightscope.com/form-5500/basic-info/107299/Orthopedic-Institute-Of-Pennsylvania/15801790/Orthopedic-Institute-Of-Pennsylvania-401k-Profit-Sharing-Plan/'
page = requests.get(url_3)
soup = BeautifulSoup(page.text, 'html.parser')
divs = [e.get_text() for e in soup.findAll('span')]
if divs.index('Named Administrator'):
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'
Rather than doing index, do a __contains__ test:
if 'Named Administrator' in divs:
and move forward only if Named Administrator actually exists in the divs list, so you won't get the ValueError.
Another consideration is that a membership test on a list has O(N) time complexity, so if you are doing this on a large list, it is probably better to use a set instead:
{e.get_text() for e in soup.findAll('span')}
but as sets are unordered you won't be able to use indexing.
So either think of an approach that works on sets as well, i.e. one that does not need to get the next value by indexing.
Or you can use a set for the membership test and the list for getting the next value. The cost here might be higher or lower depending on your actual context, and you can only find that out by profiling:
divs_list = [e.get_text() for e in soup.findAll('span')]
divs_set = set(divs_list)
if 'Named Administrator' in divs_set:
    index = divs_list.index('Named Administrator')
    contact = divs_list[index + 1]
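For the first option, one way to avoid indexing altogether is to let BeautifulSoup find the label span and the span that follows it. A minimal sketch, assuming the label span's text is exactly 'Named Administrator':

label = soup.find('span', string='Named Administrator')
if label:
    next_span = label.find_next('span')  # the span immediately after the label
    contact = next_span.get_text() if next_span else '-'
else:
    contact = '-'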

Scraping multiple pages into list with beautifulsoup

I wrote a scraper program using beautifulsoup4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out with how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace('\s+', '-')
#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')
    date=[]
    open_p=[]
    high_p=[]
    low_p=[]
    close_p=[]
    table = []
    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td') #other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]
    df=pd.DataFrame(date,columns=['Date'])
    df['Open']=list(map(float,open_p))
    df['High']=list(map(float,high_p))
    df['Low']=list(map(float,low_p))
    df['Close']=list(map(float,close_p))
    print(df)
Simply put, it looks like you are accessing all 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you are continuously overwriting your variables inside your loop, which is why you only end up with the last element (in other words, only the last iteration of the loop sets the value of the variable; all previous ones are overwritten). Apologies, I cannot test this right now due to firewalls on my machine. Try the following:
no_space = name_15.str.replace('\s+', '-')
#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')
    table = [p.text.strip() for p in main_table.find_all('td')]
    #You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]
    df=pd.DataFrame(date,columns=['Date'])
    df['Open']=list(map(float,open_p))
    df['High']=list(map(float,high_p))
    df['Low']=list(map(float,low_p))
    df['Close']=list(map(float,close_p))
    print(df)
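Building on that, a minimal sketch of how to keep the data for every currency instead of only the last one is to collect one DataFrame per page and concatenate them at the end. It assumes name_15 is a pandas Series of currency names and that lib is urllib3, as the question implies; the slice indices are copied from the question and may still need re-thinking, as noted above:

import pandas as pd
import urllib3 as lib
from bs4 import BeautifulSoup

no_space = name_15.str.replace(r'\s+', '-')
frames = []  # one DataFrame per currency
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    response = lib.PoolManager().request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    table = [p.text.strip() for p in soup.find('tbody').find_all('td')]
    df = pd.DataFrame(table[208:1:-7], columns=['Date'])
    df['Open'] = list(map(float, table[207:1:-7]))
    df['High'] = list(map(float, table[206:1:-7]))
    df['Low'] = list(map(float, table[205:1:-7]))
    df['Close'] = list(map(float, table[204:0:-7]))
    frames.append(df)

all_currencies = pd.concat(frames, ignore_index=True)  # every currency, not just the last
print(all_currencies)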

Looping through xpath variables

How can I increment the XPath variable value in a loop in Python for a Selenium WebDriver script?
search_result1 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])").text
search_result2 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])").text
search_result3 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])").text
Why don't you create a list for storing the search results, something like:
search_results=[]
for i in range(1,11):  # I am assuming 10 results in a page, so set your own range
    result=sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])"%(i,i)).text
    search_results.append(result)
This sample code will create a list of 10 result values. You can use it as a starting point for your own code; it is just a matter of automating the task. So:
search_results[0] will give you the first search result
search_results[1] will give you the second search result
...
search_results[9] will give you the 10th search result
@Alok Singh Mahor, I don't like hardcoding ranges. I guess a better approach is to iterate through the list of webelements:
search_results=[]
result_elements = sel.find_elements_by_xpath("//not/indexed/xpath/for/any/search/result")
for element in result_elements:
    search_result = element.text
    search_results.append(search_result)
