Web scraping loop issue in Python

I am a Python newbie, and I'm wondering if someone could highlight where I am going wrong with the following web scraping script.
I am trying to loop through the list of matches and pull a cumulative value (metric) for each match.
My problem is that it returns the exact same value each time.
I've added comments to explain each step; any help is appreciated.
#use Selenium & Beautiful Soup
from selenium import webdriver
import time
from bs4 import BeautifulSoup

#define URL/driver
my_url = "https://www.bet365.com/#/IP/"
driver = webdriver.Edge()
driver.get(my_url)

#allow a sleep of 10 seconds
time.sleep(10)

#parse the page
pSource = driver.page_source
soup = BeautifulSoup(pSource, "html.parser")

#containers tag - per match
containers = soup.findAll("div", {"class": "ipn-TeamStack "})

for container in containers:
    #Total Match Shots
    cumul_match_shots = 0
    match = container.find_all('div')
    for data in soup.findAll('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            a = result.text
            if len(a) > 0:
                cumul_match_shots += int(a)

#print out values
print(match)
print(cumul_match_shots)

#close the webpage
driver.close()

I think you need to change the indentation of print(cumul_match_shots) (and move it a little higher); in its current state it will always print the value from the last iteration of the for loop.
I am also not sure you are resetting the value to 0 in the right place. As written, it looks like it will be a cumulative value across ALL matches.
As for match, it should be fine, since you do not modify it in the for loops.
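A minimal sketch of that rearrangement, with the counter reset and the prints both scoped to the per-match loop. Note that it also searches for the stats bar inside the current container rather than the whole soup; that is an assumption about the page structure, but searching the whole page is what makes every match accumulate the same page-wide total:
for container in containers:
    cumul_match_shots = 0  # reset once per match
    match = container.find_all('div')
    # assumption: the stats bar sits inside each match container
    for data in container.find_all('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            a = result.text.strip()
            if a.isdigit():
                cumul_match_shots += int(a)
    # print the per-match values while still inside the loop
    print(match)
    print(cumul_match_shots)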

Extracting Hyperlinks from a Basketball Reference Column (On Pages With Multiple Tables) to a New Column

I have recently been working on a scraper for NBA MVPs from Basketball Reference and want to put the embedded player-page hyperlinks into a new column, in the same row as the player. I have done the same when scraping other pages, but unfortunately, because this page has many tables in an indeterminate order, my prior method returns random links from around the page. My code is below; the player links column is the one in question. The links it produces are acceptably formatted, they are simply not the correct ones, as stated above. The table itself is fine; only this column is the problem.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/awards/awards_2022.html'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, "html.parser")
tabs = soup.select('table[id*="mvp"]')

for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        players.append(player)

links = []
for link in soup.findAll('table')[1].findAll('a'):
    url = link.get('href')
    links.append(url)

my_list = links
substring = '/friv/'
new_list = [item for item in my_list if not item.startswith(substring)]
for item in my_list.copy():
    if substring in item:
        my_list.remove(item)

max_length = max(len(player) for player in players)  # pad shorter rows to the longest row
players_plus = [player + [""] * (max_length - len(player)) for player in players]
df = pd.DataFrame(players_plus, columns=cols)
df['Playerlinks'] = my_list
print(df.to_markdown())
So, essentially, I am asking whether anyone knows a way to scrape just these hyperlinks (12 in the given example) and put them into an ordered list to be used as a column, or any better method you may be aware of. My expected output for this particular link would be a first-row value of "/players/j/jokicni01.html", a second of "/players/e/embiijo01.html", and so on, corresponding to the respective players. I have tried many approaches using ids, find_all, and others, but unfortunately my HTML knowledge is very limited and I am starting to go in circles. Thank you in advance for any help you can provide.
Does something like this work for you? You can extract the profile links while you are iterating through the player rows.
for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        # get player link
        player.append(j.find('a')['href'])
        players.append(player)
    df = pd.DataFrame(players, columns=cols + ["Playerlinks"])
    print(df)

Python Web Scraping - Find only n items

I am running a scraping script made with Beautiful Soup. I scrape results from Google News and I want to get only the first n results, added to a variable as tuples.
Each tuple is made of the news title and the news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part.
Here is the code:
import re
import bs4, requests

articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')
for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
Written like this, it adds as a tuple every news item and link that fulfills the if statement, which may result in a long list. I'd like to take only the first n news items, let's say five, and then have the script stop.
I tried:
for _ in range(5):
but I don't understand exactly where to add it, because either the code doesn't run or it appends the same news item 5 times.
I also tried:
while len(articles_list) < 5:
but since that statement is part of a for loop and the variable articles_list is global, it also stops appending for the next objects of the scrape.
And finally I tried:
for tuples in articles_list[0:5]:  # iterate over the tuples
    for element in tuples:  # print title, link and a divider
        print(element)
        print('-' * 80)
I am fine with this last approach if there are no alternatives, but I'd rather avoid it, since articles_list would still contain more elements than I need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')

maxcnt = 5  # max number of articles

for ictr, i in enumerate(webcontent.findAll('div', {'jslog': '93789'})):
    if len(articles_list) == maxcnt: break  # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
            if len(articles_list) == maxcnt: break  # exit inner loop

print(str(len(articles_list)), 'articles')
print('\n'.join(['> ' + a[0] for a in articles_list]))  # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives

Scrolling with Selenium to scrape more data: not getting more than 50 or 1000 elements

I want to create a list of all of the diamond URLs in the table on Blue Nile, which should be ~142K entries. I noticed that I had to scroll to load more entries, so the first solution I implemented was to scroll to the end of the page before scraping. However, the maximum number of elements scraped that way was only 1000. I learned that this is due to the issues outlined in this question: Selenium find_elements_by_id() doesn't return all elements,
but the solutions there aren't clear and straightforward to me.
I then tried to scroll the page by a fixed amount and scrape until the page reaches the end. However, I can only seem to get the initial 50 unique elements.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.bluenile.com/diamond-search?pt=setform")
source_site = 'www.bluenile.com'

SCROLL_PAUSE_TIME = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
print(last_height)

new_height = 500
diamond_urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
count = 0

while new_height < last_height:
    for url in soup.find_all('a', class_='grid-row row TL511DiaStrikePrice', href=True):
        full_url = source_site + url['href'][1:]
        diamond_urls.append(full_url)
        count += 1
        if count == 50:
            driver.execute_script("window.scrollBy(0, 500);")
            time.sleep(SCROLL_PAUSE_TIME)
            new_height += 500
            print(new_height)
            count = 0
Please help me find the issue with my code above or suggest a better solution. Thanks!
As a simpler solution, I would just query their API directly (sample below):
https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=50&_=1591689344542&unlimitedPaging=false&sortDirection=asc&sortColumn=default&shape=RD&maxDateType=MANUFACTURING_REQUIRED&isQuickShip=false&hasVisualization=false&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN
One of the response parameters of this endpoint is countRaw, which is 100876. Therefore it should be simple enough to iterate over it in blocks of 50 (or more; you just don't want to abuse the endpoint) until you have all the data you need.
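For illustration, a rough sketch of that pagination loop, assuming the endpoint keeps accepting the startIndex/pageSize query parameters from the sample URL and that the JSON response exposes the grid rows under a results key (that field name is an assumption; countRaw is the total mentioned above):
import time
import requests

base = "https://www.bluenile.com/api/public/diamond-search-grid/v2"
params = {
    "startIndex": 0,
    "pageSize": 50,        # block size, as in the sample URL
    "shape": "RD",
    "country": "USA",
    "language": "en-us",
    "currency": "USD",
    "productSet": "BN",
}

all_rows = []
while True:
    data = requests.get(base, params=params).json()
    rows = data.get("results", [])      # assumed field holding the grid rows
    if not rows:
        break
    all_rows.extend(rows)
    params["startIndex"] += params["pageSize"]
    if params["startIndex"] >= data.get("countRaw", 0):
        break                           # fetched everything reported by countRaw
    time.sleep(1)                       # don't hammer the endpoint

print(len(all_rows), "rows collected")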
Hope this helps.

Scraping multiple pages into list with beautifulsoup

I wrote a scraper using beautifulsoup4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out with how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    date = []
    open_p = []
    high_p = []
    low_p = []
    close_p = []
    table = []

    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td')  #other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]

    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
Simply put, it looks like you are accessing all 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you are continuously overwriting your variables inside your loop, which is why you only end up with the last element (in other words, only the last iteration of the loop sets the value of each variable; all previous ones are overwritten). Apologies, I cannot test this currently due to firewalls on my machine. Try the following:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    table = [p.text.strip() for p in main_table.find_all('td')]

    #You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
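If the goal is to keep every currency rather than just printing each frame, one common pattern (a sketch under the assumption that the per-currency DataFrame is built exactly as above; the dummy frame below just stands in for the scraped data) is to collect the frames in a list and concatenate them after the loop:
import pandas as pd

all_frames = []  # collect one DataFrame per currency instead of overwriting df

for n in ['bitcoin', 'ethereum']:  # stand-in for no_space
    # stand-in for the per-currency frame built by the scraping loop above
    df = pd.DataFrame({'Date': ['Aug 01, 2018'], 'Open': [0.0], 'High': [0.0], 'Low': [0.0], 'Close': [0.0]})
    df['Currency'] = n             # label rows so each currency stays identifiable
    all_frames.append(df)          # append rather than overwrite

combined = pd.concat(all_frames, ignore_index=True)
print(combined)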

Finding both child and parent by class name in one go with WebDriver?

On a typical eBay search query that returns more than 50 listings, such as this one, eBay displays them in a grid format (whether you have it set to display as a grid or as a list).
I'm using the class name to pull out the prices with WebDriver:
prices = driver.find_elements_by_class_name("bidsold")
The challenge: although all prices on the page look identical in structure, the ones that are crossed out (where Buy It Now is not available and it's Best Offer accepted) are actually contained within a child span of the above span.
I could pull these out separately by repeating find_elements_by_class_name with the class sboffer, but (i) I would lose track of the order, and more importantly (ii) it would roughly double the time it takes to extract the prices.
The CSS selectors for the two types of prices also differ, as do the XPaths.
How do we catch all prices in one go?
Try this:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Columbia+Hiking+Pants&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000&_pgn=2')

prices_list = driver.find_elements_by_css_selector('span.amt')
prices_on_page = []

for span in prices_list:
    unsold_item = span.find_elements_by_css_selector('span.bidsold.bold')
    sold_item = span.find_elements_by_css_selector('span.sboffer')
    if len(sold_item):
        prices_on_page.append(sold_item[0].text)
    elif len(unsold_item):
        prices_on_page.append(unsold_item[0].text)
    elif span.text:
        prices_on_page.append(span.text)

print prices_on_page
driver.quit()
In this case, you will keep track of the order, and you will only query the specific span elements instead of the entire page, which should improve performance.
I would go for XPath; the code below worked for me. It grabbed 50 prices!
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Columbia+Hiking+Pants&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000&_pgn=2')

my_prices = []
itms = driver.find_elements_by_xpath("//div[@class='bin']")
for i in itms:
    prices = i.find_elements_by_xpath(".//span[contains(text(),'$')]")
    val = ','.join(i.text for i in prices)
    my_prices.append([val])

print my_prices
driver.quit()
The result is:
[[u'$64.95'], [u'$59.99'], [u'$49.95'], [u'$46.89,$69.99'], [u'$44.98'], [u'$42.95'], [u'$39.99'], [u'$39.99'], [u'$37.95'], [u'$36.68'], [u'$35.96,$44.95'], [u'$34.99'], [u'$34.99'], [u'$34.95'], [u'$30.98'], [u'$29.99'], [u'$29.99'], [u'$29.65,$32.95'], [u'$29.00'], [u'$27.96,$34.95'], [u'$27.50'], [u'$27.50'], [u'$26.99,$29.99'], [u'$26.95'], [u'$26.55,$29.50'], [u'$24.99'], [u'$24.99'], [u'$24.99'], [u'$24.99'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$24.98'], [u'$22.00'], [u'$22.00'], [u'$22.00'], [u'$22.00'], [u'$18.00'], [u'$18.00'], [u'$17.95'], [u'$11.99'], [u'$9.99'], [u'$6.00']]
