I keep getting an error when I use an if/else statement in Python. I want my script to check whether an index exists: if it does, run one piece of code; if not, run another. I get the error ValueError: 'Named Administrator' is not in list.
import requests
from bs4 import BeautifulSoup
url_3 = 'https://www.brightscope.com/form-5500/basic-info/107299/Orthopedic-Institute-Of-Pennsylvania/15801790/Orthopedic-Institute-Of-Pennsylvania-401k-Profit-Sharing-Plan/'
page = requests.get(url_3)
soup = BeautifulSoup(page.text, 'html.parser')
divs = [e.get_text() for e in soup.findAll('span')]
if divs.index('Named Administrator'):
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'
Rather than calling index directly, do a membership (__contains__) test:
if 'Named Administrator' in divs:
and move forward only if 'Named Administrator' actually exists in the divs list, so you won't get the ValueError.
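Applied to your snippet, the fix looks like this:

if 'Named Administrator' in divs:
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'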
Another consideration is that a membership test on a list has O(N) time complexity, so if you are doing this for a large list, you should probably use a set instead:
{e.get_text() for e in soup.findAll('span')}
but as sets are unordered you won't be able to use indexing.
So either think of an approach that would work on sets as well, i.e. one with no need to get the next value by indexing (see the sketch below).
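For example, a minimal sketch that pairs each span's text with the text that follows it, assuming the label always immediately precedes its value:

spans = [e.get_text() for e in soup.findAll('span')]
# Map each text to the text that follows it; lookups are then O(1) on average
following = dict(zip(spans, spans[1:]))
contact = following.get('Named Administrator', '-')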
Or you can use a set for the membership test and the list for getting the next value. The cost here might be higher or lower depending on your actual context, and you can only find that out by profiling:
divs_list = [e.get_text() for e in soup.findAll('span')]
divs_set = set(divs_list)
if 'Named Administrator' in divs_set:
    index = divs_list.index('Named Administrator')
    contact = divs_list[index + 1]
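A rough profiling sketch with timeit, using a synthetic list to stand in for the scraped spans (the numbers are only indicative):

import timeit

setup = "items = ['span %d' % i for i in range(10000)] + ['Named Administrator', 'Jane Doe']"
# Plain list membership vs. building a set first and testing on that
print(timeit.timeit("'Named Administrator' in items", setup=setup, number=1000))
print(timeit.timeit("s = set(items); 'Named Administrator' in s", setup=setup, number=1000))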
I am trying to scrape sales data for recently sold items from eBay with BeautifulSoup in Python, and it works very well with the following code, which finds all prices and all dates of sold items.
price = []
try:
    p = soup.find_all('span', class_='POSITIVE')
except:
    p = 'nan'
for x in p:
    x = str(x)
    x = x.replace(' ', '"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
Now I am running into a problem, though. For this URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=babe+ruth+1933+goudey+149+psa+%281.5%29&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=babe+ruth+1933+goudey+149+psa+1.5&LH_Complete=1&rt=nc&LH_Sold=1), eBay sometimes suggests other search results when there are not enough for the specific query, separated from the real matches by a warning message.
Because of that, my code finds not only the correct prices but also those of the suggested results below the warning. I tried to find out where the warning message is located and to delete every listing found after it, but I cannot figure it out. I also thought I could search for the prices one by one, but even then I cannot tell when the warning appears.
Is there any other way you can think of to solve this?
I am aware that this is really specific.
You can scrape the number of results shown at the top of the page and make a loop with the range of that count.
The code will be something like this (the selector for the result count is a placeholder; inspect the page for the real element):
# Placeholder selector: the element that shows "N results" at the top of the page
results = soup.find('h1', class_='srp-controls__count-heading').get_text()
# You have to make the variable an int, so strip everything except the leading number
results = int(results.split()[0].replace(',', ''))
for x in p[:results]:  # only the first `results` spans belong to the real matches
    x = str(x)
    x = x.replace(' ', '"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
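Alternatively, since the suggested listings only appear after the warning banner, you could walk the page in document order and stop as soon as you hit it. A sketch; the class names below are guesses and need to be confirmed in the page source:

prices = []
# Visit listings and the banner in document order; stop at the banner
for el in soup.select('li.s-item, div.srp-river-answer'):
    if 'srp-river-answer' in (el.get('class') or []):
        break  # everything below this banner is a suggested result
    price_el = el.select_one('span.s-item__price')
    if price_el:
        prices.append(price_el.get_text())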
I am running a scraping script made with Beautiful Soup. I scrape results from Google News and I want to get only the first n results added to a variable as tuples.
Each tuple is made of the news title and the news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part.
That's the code:
import re
import bs4, requests

keyword_list = ['crisis', 'finance']  # defined elsewhere in the full script
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
Written like this, it appends as a tuple every news title and link that fulfills the if statement, which may result in a long list. I'd like to take only the first n news items, let's suppose five, and then have the script stop.
I tried:

for _ in range(5):

but I don't understand where exactly to add it, because either the code does not run or it appends the same news item 5 times.
I also tried:

while len(articles_list) < 5:

but as the statement is part of a for loop, and the variable articles_list is global, it also stops appending for the next objects of the scraping.
And finally I tried:

for tuples in articles_list[0:5]:  # iterate over the tuples
    for element in tuples:  # print title, link and a divider
        print(element)
        print('-' * 80)

I am OK with this last one if there are no alternatives, but I'd rather avoid it, since articles_list would still contain more elements than I need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
maxcnt = 5  # max number of articles
for i in webcontent.findAll('div', {'jslog': '93789'}):
    if len(articles_list) == maxcnt: break  # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
            if len(articles_list) == maxcnt: break  # exit inner loop
print(str(len(articles_list)), 'articles')
print('\n'.join(['> ' + a[0] for a in articles_list]))  # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
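If you'd rather avoid the double break, a generator makes the early stop explicit; a sketch reusing the same selectors and variables as above:

from itertools import islice

def iter_articles(webcontent, keyword_list):
    # Yield (title, link) pairs lazily, one matching article at a time
    for i in webcontent.findAll('div', {'jslog': '93789'}):
        for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = i.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield title, "https://news.google.com" + str(link.get('href'))

# Take only the first five matches; iteration stops as soon as they are found
articles_list = list(islice(iter_articles(webcontent, keyword_list), 5))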
I'm having a bit of trouble with a function I'm trying to write. What it is supposed to do is 1) go to a particular URL and get a list of financial sectors stored in a particular div; 2) visit each sector's respective page and get 3 particular pieces of information from there; 3) put the gathered collection into a dictionary; and 4) append that dictionary to another dictionary.
The desired output is a dictionary containing a list of dictionaries for all the sectors.
Here is my function:
def fsr():
    fidelity_sector_report = dict()
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    import requests
    from bs4 import BeautifulSoup
    # scrape the url page and locate links for each sector
    try:
        response = requests.get(url)
        if not response.status_code == 200:
            return 'Main page error'
        page = BeautifulSoup(response.content, "lxml")
        sectors = page.find_all('a', class_="heading1")
        for sector in sectors:
            link = 'https://eresearch.fidelity.com/' + sector['href']
            name = sector.text
            sect = dict()
            lst = []
            # scrape target pages for required information
            try:
                details = requests.get(link)
                if not details.status_code == 200:
                    return 'Details page error'
                details_soup = BeautifulSoup(details.content, 'lxml')
                fundamentals = details_soup.find('div', class_='sec-fundamentals')
                values = dict()
                # locate required values by addressing <th> text and put them in a dictionary
                values['Enterprise Value'] = fundamentals.select_one('th:contains("Enterprise Value") + td').text.strip()
                values['Return on Equity (TTM)'] = fundamentals.select_one('th:contains("Return on Equity (TTM)") + td').text.strip()
                values['Dividend Yield'] = fundamentals.select_one('th:contains("Dividend Yield") + td').text.strip()
                # add values to the sector dictionary
                sect[name] = values
                # add the dictionary to the list
                lst.append(dict(sect))
                # form a dictionary using the list
                fidelity_sector_report['results'] = lst
            except:
                return 'Something is wrong with details request'
        return fidelity_sector_report
    except:
        return "Something is horribly wrong"
As far as I can tell, it performs the main tasks wonderfully, and the problem appears at the stage of appending the formed dictionary to the list: instead of a new piece being added, the list gets overwritten completely. I figured that out by putting print(lst) right after the fidelity_sector_report['results'] = lst line.
What should I change so that the list (and, correspondingly, the dictionary) gets formed as planned?
You should move lst = [] outside of the sectors loop. Your problem appears because for each sector you reset lst and append that sector's data to an empty list. The following line then replaces the value of fidelity_sector_report['results'] with this one-element lst:

fidelity_sector_report['results'] = lst
I presume you would want to access the respective values using a key, so you can add the following line below fidelity_sector_report = dict() to initialize a nested dictionary:

fidelity_sector_report['results'] = {}

Then create a key for each sector using the sector name, and set its value to your values dictionary by replacing fidelity_sector_report['results'] = lst with:

fidelity_sector_report['results'][name] = dict(values)
You can access the data using the relevant keys, e.g. fidelity_sector_report['results']['Financials']['Dividend Yield'] for the dividend yield of the Financials sector.
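Putting the pieces together, the core of fsr() would then look roughly like this (a sketch reusing the names from the original function, with the try/except blocks trimmed for brevity):

fidelity_sector_report = dict()
fidelity_sector_report['results'] = {}  # one entry per sector name
sectors = page.find_all('a', class_="heading1")
for sector in sectors:
    link = 'https://eresearch.fidelity.com/' + sector['href']
    name = sector.text
    details_soup = BeautifulSoup(requests.get(link).content, 'lxml')
    fundamentals = details_soup.find('div', class_='sec-fundamentals')
    values = dict()
    values['Enterprise Value'] = fundamentals.select_one('th:contains("Enterprise Value") + td').text.strip()
    values['Return on Equity (TTM)'] = fundamentals.select_one('th:contains("Return on Equity (TTM)") + td').text.strip()
    values['Dividend Yield'] = fundamentals.select_one('th:contains("Dividend Yield") + td').text.strip()
    fidelity_sector_report['results'][name] = dict(values)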
I wrote a scraper program using beautifulsoup4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out with how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace(r'\s+', '-')
# lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()  # lib is presumably urllib3, imported elsewhere
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')
    date = []
    open_p = []
    high_p = []
    low_p = []
    close_p = []
    table = []
    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td')  # other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]
    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
Simply put, it looks like you are accessing all 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you are continuously overwriting your variables inside the loop, which is why only the last element ends up in them (in other words, only the last iteration of your loop sets their values; all previous ones are overwritten). Apologies, I cannot test this out currently due to firewalls on my machine. Try the following:
no_space = name_15.str.replace(r'\s+', '-')
# lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')
    table = [p.text.strip() for p in main_table.find_all('td')]
    # You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]
    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
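To actually keep every currency's data past the loop, one option is to collect each per-page DataFrame in a list and concatenate them at the end. A sketch, where build_history_frame is a hypothetical helper wrapping the per-page scrape shown above:

import pandas as pd

frames = []  # one DataFrame per currency
for n in no_space:
    df = build_history_frame(n)  # hypothetical helper: the per-page scrape above
    df['Currency'] = n           # record which coin the rows belong to
    frames.append(df)

all_data = pd.concat(frames, ignore_index=True)  # survives past the loop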
How can I increment the XPath variable value in a loop in Python for a Selenium WebDriver script?
search_result1 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])").text
search_result2 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])").text
search_result3 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])").text
Why don't you create a list for storing the search results, similar to this:
search_results = []
for i in range(1, 11):  # I am assuming 10 results on a page, so set your own range
    result = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])" % (i, i)).text
    search_results.append(result)
This sample code will create a list of 10 result values. You can use it as a starting point for your own code; it's just a matter of automating the task.
so:
search_results[0] will give you the first search result
search_results[1] will give you the second search result
...
search_results[9] will give you the 10th search result
@Alok Singh Mahor, I don't like hardcoding ranges. I guess a better approach is to iterate through the list of WebElements:
search_results = []
result_elements = sel.find_elements_by_xpath("//not/indexed/xpath/for/any/search/result")
for element in result_elements:
    search_result = element.text
    search_results.append(search_result)
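The loop can also be collapsed into a list comprehension:

search_results = [element.text for element in result_elements]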