I am running a scraping script built with Beautiful Soup. I scrape results from Google News and I want only the first n results added to a variable as tuples.
Each tuple is made of the news title and the news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part.
Here is the code:
import re
import bs4, requests

# keyword_list is defined elsewhere in the full script, e.g. ['crisis', 'finance']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')

for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
Written like this, it appends a tuple for every news item and link that fulfills the if statement, which may result in a long list. I'd like to take only the first n news items, let's say five, and then have the script stop.
I tried:
for _ in range(5):
but I don't understand where to add it exactly, because either the code does not run or it appends the same news item five times.
I also tried:
while len(articles_list) < 5:
but since that condition sits inside a for loop and articles_list is global, once it is reached the script also stops appending for every following object of the scrape.
And finally I tried:
for tuples in articles_list[0:5]:  # iterate over the tuples
    for element in tuples:  # print title, link and a divider
        print(element)
    print('-' * 80)
I am okay with this last approach if there are no alternatives, but I'd rather avoid it, since articles_list would still contain more elements than I need.
Can you please help me understand what I am missing?
Thanks!
You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.
Try this code:
import re
import bs4, requests

keyword_list = ['health', 'Coronavirus', 'travel']
articles_list = []
base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')

maxcnt = 5  # max number of articles
for i in webcontent.findAll('div', {'jslog': '93789'}):
    if len(articles_list) == maxcnt: break  # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
        if len(articles_list) == maxcnt: break  # exit inner loop

print(len(articles_list), 'articles')
print('\n'.join(['> ' + a[0] for a in articles_list]))  # article titles
Output
5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
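As a side note, the double break can be avoided entirely by wrapping the search in a generator and slicing it with itertools.islice, which stops consuming results as soon as it has enough. A minimal sketch built on the same code as above:
import itertools

def matching_articles(webcontent, keyword_list):
    # yield (title, link) tuples in the order they are found
    for i in webcontent.findAll('div', {'jslog': '93789'}):
        for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = i.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield (title, "https://news.google.com" + str(link.get('href')))

# islice stops pulling from the generator after five matches
articles_list = list(itertools.islice(matching_articles(webcontent, keyword_list), 5))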
Related
I have recently been working on a scraper for NBA MVPs from Basketball Reference and hope to incorporate the embedded player-page hyperlinks into a new column in the same row as each player. I have done the same when scraping other pages, but unfortunately, due to the many tables and their indeterminate order, my prior method returns many random links from around the page. Below is my code; the 'Playerlinks' column is the one in question. While those links are acceptably formatted, they are simply not the correct ones, as stated above. The table itself is perfectly fine; this column is the problem.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/awards/awards_2022.html'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, "html.parser")
tabs = soup.select('table[id*="mvp"]')

for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        players.append(player)

    links = []
    for link in soup.findAll('table')[1].findAll('a'):
        url = link.get('href')
        links.append(url)

    my_list = links
    substring = '/friv/'
    new_list = [item for item in my_list if not item.startswith(substring)]
    for item in my_list.copy():
        if substring in item:
            my_list.remove(item)

    max_length = max(len(player) for player in players)  # longest row, used to pad shorter ones
    players_plus = [player + [""] * (max_length - len(player)) for player in players]
    df = pd.DataFrame(players_plus, columns=cols)
    df['Playerlinks'] = my_list
    print(df.to_markdown())
So, essentially I am asking whether anyone knows a method to scrape just these hyperlinks (12 in the given example) and put them into an ordered list (to be placed into a column), or any better approach. My expected output for this particular link would be a first-row value of "/players/j/jokicni01.html", a second of "/players/e/embiijo01.html", etc., corresponding to the respective players. I have tried many methods using ids, find_alls, and others, but unfortunately my HTML knowledge is very limited and I am starting to go in circles. Thank you in advance for any help you can provide.
Does something like this work for you? You can extract the profile links while you are iterating through the player rows.
for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        # get the player's profile link from the first anchor in the row
        player.append(j.find('a')['href'])
        players.append(player)
    df = pd.DataFrame(players, columns=cols + ["Playerlinks"])
    print(df)
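One caveat, in case a row (a footer row, for example) has no anchor: j.find('a') would return None and the ['href'] lookup would raise a TypeError. A small hedged guard for that case:
anchor = j.find('a')
# fall back to an empty string if the row has no link
player.append(anchor['href'] if anchor else "")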
When I run this code to get the titles and links, I get ten times the number of results. Any idea what I am doing wrong? Is there a way to stop the scraping when we reach the last result on the page?
Thanks!
while True:
    web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
    driver.get(web)
    time.sleep(3)
    titleContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    linkContainers = driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]')
    if (len(titleContainers) != 0):
        for i in range(len(titleContainers)):
            counter = counter + 1
            print("Counter: " + str(counter))
            titles.append(titleContainers[i].text)
            links.append(linkContainers[i].get_attribute("href"))
    else:
        break
You put yourself in an infinite loop with that while True statement. The condition if (len(titleContainers) != 0): will always evaluate to True once the elements are found on the page (there are 100 of them), so the same results are appended over and over. You're not posting your full code; I imagine counter, titles and links are lists defined somewhere in it. You may want to test whether counter is less than or equal to the length of titleContainers.
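For illustration, a minimal sketch without the while True, so each result is collected exactly once. It assumes the class name DY5T1d RZIKme still matches the title anchors and that the same anchor carries both the title text and the href:
from selenium import webdriver
import time

driver = webdriver.Chrome()  # any WebDriver works here
titles, links, counter = [], [], 0

web = 'https://news.google.com/search?q=weather&hl=en-US&gl=US&ceid=US%3Aen'
driver.get(web)
time.sleep(3)

# the page is loaded once, so every anchor is visited exactly once
for container in driver.find_elements(by='xpath', value='//*[@class="DY5T1d RZIKme"]'):
    counter = counter + 1
    print("Counter: " + str(counter))
    titles.append(container.text)
    links.append(container.get_attribute("href"))

driver.quit()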
As an exercise for learning Python and Selenium, I'm trying to write a script that checks a web page with all kinds of commercial deals, finds all the specific food deals (class name 'tag-food'), puts them in a list (elem), then checks which ones contain the text 'sushi', and for those elements extracts the HTML element which contains the price, then prints the results.
I have:
elem = driver.find_elements_by_class_name('tag-food')
i = 0
while i < len(elem):
    source_code = elem[i].get_attribute("innerHTML")
    # ?? how to check if source_code contains 'sushi'?
    # ?? if true, how to extract the price data?
    i = i + 1
driver.quit()
What's the best and most direct way to do these checks? Thanks! 🙏
I don't think you need a while loop for this. Also, you would be looking for a text value, not innerHTML.
You can make it more simple like this:
for row in driver.find_elements_by_class_name('tag-food'):
    if "sushi" in row.get_attribute("innerText"):
        print("Yes this item has sushi")
        # find element to grab price, store in variable to do something else with
    else:
        print("No sushi in this item")
Or even just this, depending on how the text in the HTML is structured:
for row in driver.find_elements_by_class_name('tag-food'):
    if "sushi" in row.text:
        print("Yes this item has sushi")
        # find element to grab price, store in variable to do something else with
    else:
        print("No sushi in this item")
I am trying to scrape sales data for recently sold items from eBay with BeautifulSoup in Python, and it works very well with the following code, which finds all prices and all dates of sold items.
price = []
try:
    p = soup.find_all('span', class_='POSITIVE')
except:
    p = 'nan'
for x in p:
    x = str(x)
    x = x.replace(' ', '"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
Now I am running into a problem, though. As seen in the screenshot for this URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=babe+ruth+1933+goudey+149+psa+%281.5%29&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=babe+ruth+1933+goudey+149+psa+1.5&LH_Complete=1&rt=nc&LH_Sold=1), eBay sometimes suggests other search results if there are not enough for the specific search query.
Because of that, my code finds not only the correct prices but also those of the suggested results below the warning. I tried to find out where the warning message is located and to delete every listing found after it, but I cannot figure it out. I also thought I could search for the prices one by one, but even then I cannot figure out how to detect when the warning appears.
Is there any other way you can think of to solve this?
I am aware that this is really specific.
You can scrape the number of results (shown in the picture) and loop over that range of results.
The code will be something like:
results = soup.find...
# You have to make the variable an int, so strip everything extra
results = int(results)
for i in range(1, results):
    price[i] = str(price[i])
    price[i] = price[i].replace(' ', '"')
    price[i] = price[i].split('"')
    if '>Sold' in price[i]:
        continue
    else:
        pass  # keep this entry in the price list
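An alternative sketch, in case walking the page in document order is easier: stop collecting as soon as the banner that introduces the suggested results is reached. The selector names below (li.s-item and div.srp-river-answer) are assumptions about eBay's current markup, so inspect the page and adjust them.
prices = []
# soup.select returns elements in document order, so everything
# after the warning banner can be skipped with a single break
for node in soup.select('li.s-item, div.srp-river-answer'):
    if 'srp-river-answer' in (node.get('class') or []):
        break  # listings below this banner are only suggested results
    tag = node.select_one('span.POSITIVE')
    if tag and '>Sold' not in str(tag):
        prices.append(tag.get_text())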
I am a Python newbie and wondering if someone could highlight where I am going wrong with the following web-scraping script.
I am trying to loop through the list of matches to pull a cumulative value (a metric) for each match.
My problem is that it returns the exact same value each time.
I've added comments to explain each of my steps; any help is appreciated.
# use Selenium & Beautiful Soup
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# define URL/driver
my_url = "https://www.bet365.com/#/IP/"
driver = webdriver.Edge()
driver.get(my_url)

# allow a sleep of 10 seconds
time.sleep(10)

# parse the page
pSource = driver.page_source
soup = BeautifulSoup(pSource, "html.parser")

# containers tag - per match
containers = soup.findAll("div", {"class": "ipn-TeamStack "})
for container in containers:
    # total match shots
    cumul_match_shots = 0
    match = container.find_all('div')
    for data in soup.findAll('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            a = result.text
            if len(a) > 0:
                cumul_match_shots += int(a)

# print out values
print(match)
print(cumul_match_shots)

# close the webpage
driver.close()
I think you need to change the indentation of print(match) and print(cumul_match_shots) (and move them up, inside the container loop); in the current state they will always print only the values from the last iteration of the loop.
I am also not sure the total is computed in the right place. The inner loop iterates over soup.findAll(...), i.e. the whole page, rather than over the current container, so it looks like you will get a cumulative value of the shots across ALL matches, repeated for every container.
As for match, it should be okay, as you do not modify it in the for loops.
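For reference, a minimal sketch of the corrected loop, under the assumption that each match's ml1-SoccerStatsBar element is nested inside its ipn-TeamStack container (worth verifying in the live DOM, since bet365's markup may place them elsewhere):
for container in containers:
    # total match shots, reset for each match
    cumul_match_shots = 0
    match = container.find_all('div')
    # search inside this container only, not the whole page
    for data in container.findAll('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            a = result.text
            if len(a) > 0:
                cumul_match_shots += int(a)
    # print inside the loop so every match's total is shown
    print(match)
    print(cumul_match_shots)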