Scraping all entries of a lazy-loading page using Python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How to get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver

url = ...  # the ECB press release page linked above
driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)

element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')

The data is loaded via JavaScript from another URL. You can use this example to load the releases for different years:
import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    # the releases are listed newest-first, so reverse to print chronologically
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],  # hrefs are relative, so prepend the host
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
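If you would rather stay with Selenium, the usual alternative is to keep scrolling so the page appends the older entries before you read the container. A minimal sketch, assuming the container grows as you scroll (the entry URL is the one from the question):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)  # the ECB press release page from the question

last_height = 0
while True:
    # scroll to the bottom to trigger the next lazy-loaded batch
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the JavaScript time to append new entries
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new appeared, we are done
        break
    last_height = new_height

html = driver.find_element(By.XPATH, '//*[@id="lazyload-container"]').get_attribute('innerHTML')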

Related

How to scrape yahoo finance news headers with BeautifulSoup?

I would like to scrape news from Yahoo Finance for a currency pair.
How does bs4's find() or find_all() work for this example, with this link? I'm trying to extract the data, but no data is scraped. Why? What's wrong?
I'm using this, but nothing is printed (except the tickers):
html = BeautifulSoup(source_s, "html.parser")
news_table_s = html.find_all("div", {"class": "Py(14px) Pos(r)"})
news_tables_s[ticker_s] = news_table_s
print("news_tables", news_tables_s)
I would like to extract the headers from a yahoo finance web page.
You have to iterate your ResultSet to get anything out.
for e in html.find_all("div", {"class": "Py(14px) Pos(r)"}):
    print(e.h3.text)
Recommendation - do not use dynamic classes to select elements; prefer more stable ids or the HTML structure, here selected via CSS selector:
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Example
from bs4 import BeautifulSoup
import requests

url = 'https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX'
html = BeautifulSoup(requests.get(url).text, 'html.parser')

for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Output
EUR/USD steadies, but bears sharpen claws as dollar feasts on Fed bets
EUR/USD Weekly Forecast – Euro Gives Up Early Gains for the Week
EUR/USD Forecast – Euro Plunges Yet Again on Friday
EUR/USD Forecast – Euro Gives Up Early Gains
EUR/USD Forecast – Euro Continues to Test the Same Indicator
Dollar gains as inflation remains sticky; sterling retreats
Siemens Issues Blockchain Based Euro-Denominated Bond on Polygon Blockchain
EUR/USD Forecast – Euro Rallies
FOREX-Dollar slips with inflation in focus; euro, sterling up on jobs data
FOREX-Jobs figures send euro, sterling higher; dollar slips before CPI
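If you also want the link behind each headline, the same selector exposes the anchor under the h3; a small sketch (assuming the hrefs are relative and need the host prepended):

for e in html.select('div:has(>h3>a)'):
    print(e.h3.text, "https://finance.yahoo.com" + e.h3.a["href"])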

Is there an error in this web-scraping script?

What's the error in this script?
from bs4 import BeautifulSoup
import requests

years = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
         2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019,
         2020, 2021, 2022]
web = 'https://www.uefa.com/uefaeuropaleague/history/seasons/2022/matches/'
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')
matches = soup.find_all('div', class_='pk-match-unit size-m')

for match in matches:
    print(match.find('div', class_='pk-match__base--team-home size-m').get_text())
    print(match.find('div', class_='pk-match__score size-m').get_text())
    print(match.find('div', class_='pk-match__base--team-away size-m').get_text())
I am not able to find the error; the purpose of the print is to obtain the data for the games of the last edition of the Europa League.
I attach a picture of the HTML for reference, since I do not see where the error is.
Keep in mind that I am only doing it for the year 2021.
I am trying to get the results from the group stage to the final.
Always and first of all, take a look at your soup to see if all the expected ingredients are there.
The issue here is that the data is loaded from an API, so you won't get it with BeautifulSoup if it is not in the response of requests. Take a look at your browser's dev tools on the XHR tab and use this API call to get the results as JSON.
Example
Inspect the whole JSON to pick the info that fits your needs:
import requests

json_data = requests.get('https://match.uefa.com/v5/matches?competitionId=14&seasonYear=2022&phase=TOURNAMENT&order=DESC&offset=0&limit=20').json()

for m in json_data:
    print(m['awayTeam']['internationalName'],
          f"{m['score']['total']['away']}:{m['score']['total']['home']}",
          m['homeTeam']['internationalName'])
Output
Rangers 1:1 Frankfurt
West Ham 0:1 Frankfurt
Leipzig 1:3 Rangers
Frankfurt 2:1 West Ham
Rangers 0:1 Leipzig
Braga 1:3 Rangers
West Ham 3:0 Lyon
Frankfurt 3:2 Barcelona
Leipzig 2:0 Atalanta
Rangers 0:1 Braga
...
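Note the offset and limit parameters in the URL, which suggest the endpoint is paginated to 20 matches per call. A sketch to collect the whole season, assuming the API keeps accepting larger offsets until the result set is exhausted:

import requests

url = 'https://match.uefa.com/v5/matches'
params = {'competitionId': 14, 'seasonYear': 2022, 'phase': 'TOURNAMENT',
          'order': 'DESC', 'offset': 0, 'limit': 20}

all_matches = []
while True:
    page = requests.get(url, params=params).json()
    all_matches.extend(page)
    if len(page) < params['limit']:  # a short page means we reached the end
        break
    params['offset'] += params['limit']

print(len(all_matches), 'matches fetched')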

How to make BeautifulSoup go to the specific webpage I want instead of a random one on the site?

I am trying to learn web scraping using BeautifulSoup by scraping UFC fight data off of the website Tapology. I have entered the URL of a specific fight's webpage, but every time I run the code it seems to jump to a new random fight on the page instead of this fight. Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
html_text = requests.get(url, timeout=5).text
soup = BeautifulSoup(html_text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')
print(fightresult, fightstats)
Honestly I have no idea how it could be switching to other webpages when I have a very specific URL like the one I am using.
I got the same results ("..every time I run the code it seems to jump to a new random fight...") when I tried your code. Like some of the comments suggested, it's probably an effort to evade bots. Maybe the right set of headers could resolve it, but I'm not very good at making requests imitate un-automated browsers, so in situations like these I sometimes use HTMLSession (or cloudscraper or even ScrapingAnt, and finally selenium if none of the others work).
from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
req = HTMLSession().get(url)
soup = BeautifulSoup(req.text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')

print('\n\n\n'.join('\n'.join(  # collapse whitespace in the text for better readability
    ' '.join(w for w in t.text.split() if w) for t in f if t.text.strip()
) for f in [fightresult, fightstats]))
For me, that prints
Keith Jardine defeats Chuck Liddell via 3 Round Decision #30 Biggest Upset of All Time #97 Greatest Light Heavy MMA Fight of All Time
13-3-1
Pro Record At Fight
20-4-0
Climbed to 14-3
Record After Fight
Fell to 20-5
+290 (Moderate Underdog)
Betting Odds
-395 (Moderate Favorite)
United States
Nationality
United States
Albuquerque, New Mexico
Fighting out of
San Luis Obispo, California
31 years, 10 months, 3 weeks, 1 day
Age at Fight
37 years, 9 months, 5 days
204.5 lbs (92.8 kgs)
Weigh-In Result
205.5 lbs (93.2 kgs)
6'1" (186cm)
Height
6'2" (188cm)
76.0" (193cm)
Reach
76.5" (194cm)
Jackson Wink MMA
Gym
The Pit
Invicta FC 50
11.16.2022, 9:00 PM ET
Bellator 288
11.18.2022, 6:00 PM ET
ONE on Prime Video 4
11.18.2022, 7:00 PM ET
LFA 147
11.18.2022, 6:00 PM ET
ONE Championship 163
11.19.2022, 4:30 AM ET
UFC Fight Night
11.19.2022, 1:00 PM ET
Cage Warriors 147: Unplugg...
11.20.2022, 12:00 PM ET
PFL 10
11.25.2022, 5:30 PM ET
Jiří "Denisa" Procházka
Glover Teixeira
Jan Błachowicz
Magomed Ankalaev
Aleksandar "Rocket" Rakić
Anthony "Lionheart" Smith
Jamahal "Sweet Dreams" Hill
Nikita "The Miner" Krylov
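For completeness, the cloudscraper fallback mentioned above is nearly a drop-in replacement for HTMLSession; a sketch using the same URL and parsing as before:

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()  # handles common anti-bot JS challenges
soup = BeautifulSoup(scraper.get(url).text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')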

Python/Pandas/NLTK: Iterating through a DataFrame, get value, transform it and add the new value to a new column

I scraped some data from google news into a dataframe:
DataFrame:
df
title link pubDate description source source_url
0 Australian research finds cost-effective way t... https://news.google.com/__i/rss/rd/articles/CB... Sat, 15 Oct 2022 23:51:00 GMT Australian research finds cost-effective way t... The Guardian https://www.theguardian.com
1 Something New Under the Sun: Floating Solar Pa... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 11:49:11 GMT Something New Under the Sun: Floating Solar Pa... Voice of America - VOA News https://www.voanews.com
2 Adapt solar panels for sub-Saharan Africa - Na... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 09:06:41 GMT Adapt solar panels for sub-Saharan AfricaNatur... Nature.com https://www.nature.com
3 Cost of living: The people using solar panels ... https://news.google.com/__i/rss/rd/articles/CB... Wed, 05 Oct 2022 07:00:00 GMT Cost of living: The people using solar panels ... BBC https://www.bbc.co.uk
4 Business Matters: Solar Panels on Commercial P... https://news.google.com/__i/rss/rd/articles/CB... Mon, 17 Oct 2022 09:13:35 GMT Business Matters: Solar Panels on Commercial P... Insider Media https://www.insidermedia.com
... ... ... ... ... ... ...
What I want to do now is basically to iterate through the "link" column and summarize every article with NLTK and add the summary to a new column. Here is an example:
from newspaper import Article  # newspaper3k provides Article

article = Article(df.iloc[4, 1])  # get the URL from the "link" column
article.download()
article.parse()
article.nlp()
article = article.summary
print(article)
Output:
North WestGemma Cornwall, Head of Sustainability of Anderton Gables, looks into the benefit of solar panels.
And, with the cost of solar panels continually dropping, it is becoming increasingly affordable for commercial property owners.
Reduce your energy spendMost people are familiar with solar energy, but many are unaware of the significant financial savings that can be gained by installing solar panels in commercial buildings.
As with all things, there are pros and cons to weigh up when considering solar panels.
If you’re considering solar panels for your property, contact one of the Anderton Gables team, who can advise you on the best course of action.
I tried a little bit, but I couldn't make it work...
Thanks for your help!
This will be a very slow solution with a for loop, but it might work for a small dataset: iterate through all the links, apply the transformations needed, and ultimately create a new column in the dataframe.
summaries = []
for l in df['link'].values:  # the "link" column holds the article URLs
    article = Article(l)
    article.download()
    article.parse()
    article.nlp()
    summaries.append(article.summary)

df['summaries'] = summaries
Or you could define a custom function and then use pd.apply:
def get_description(x):
    art = Article(x)
    art.download()
    art.parse()
    art.nlp()
    return art.summary

df['summary'] = df['link'].apply(get_description)
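Since article.download() or article.parse() can raise for pages that time out or block scraping (and article.nlp() needs NLTK's punkt tokenizer downloaded), a hedged variant that skips failures instead of aborting the whole column might look like this:

def get_description_safe(x):
    try:
        art = Article(x)
        art.download()
        art.parse()
        art.nlp()
        return art.summary
    except Exception:
        return None  # keep going if one article fails

df['summary'] = df['link'].apply(get_description_safe)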

Scraping SVG Chart using Selenium - Python

I am trying to scrape the SVG chart which contains the previous months prices of the house, from the following URL: https://www.zameen.com/Property/dha_defence_dha_defence_phase_2_1_kanal_neat_and_clean_upper_portion_for_rent-24195800-339-4.html
I am scraping the "Price Index" section, image attached:
The code snippet is below:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time

elements = driver.find_elements(by=By.XPATH, value="//*[local-name()='svg' and @class='ct-chart-line']//*[name()='g' and @class='ct-labels']//span[contains(@class, 'ct-horizontal')]")
print("WebElements: ", len(elements))

actions = ActionChains(driver)
for el in elements:
    actions.move_to_element(el).perform()
    time.sleep(1)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.chartist-tooltip").text)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.ct-axis-tooltip-x").text)
Following is the output I have received:
WebElements: 7
32,388,054 ==> Jun 2021
33,828,816 ==> Aug 2021
36,064,647 ==> Oct 2021
38,336,196 ==> Dec 2021
39,535,707 ==> Feb 2022
40,257,851 ==> Apr 2022
40,733,506 ==> May 2022
I am using the X-axis month labels as my elements, moving Selenium over each one, and capturing the price for that month. Unfortunately, the labels appear only every 2 months, so I get a total of 7 data points rather than 12 months.
What could be a better solution so I can get the complete 12-month prices?
import requests

cookies = {}
headers = {'x-requested-with': 'XMLHttpRequest'}
params = {
    'property_id': '24195800',
    'purpose': '2',
}
response = requests.get('https://www.zameen.com/nfpage/async/property/get_property_search_index',
                        params=params, cookies=cookies, headers=headers)
This will return all the data for the graph in nested dict format:
response.text
'{"status":"success","search_index_data":{"section_data":{"heading_txt":" Islamabad DHA Defence Phase 2, 1 Kanal Plots Price Index","heading_month":"May 2022"},"index_data":{"class":"inc","current_year":"2022","last_value":"235.60","ratio":135.6,"ratio_percent":135.6,"range":"100.00 - 235.60","weeks_range":"187.33 - 235.60","this_year_range":"225.89 - 235.60","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data_per_unit":{"class":"inc","current_year":"2022","last_value":"9,052","ratio":"5,210","ratio_percent":135.6,"range":"3,842 - 9,052","weeks_range":"7,197 - 9,052","this_year_range":"8,679 - 9,052","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data":{"class":"inc","current_year":"2022","last_value":"40,734,000","ratio":"23,445,000","ratio_percent":135.6,"range":"17,289,000 - 40,734,000","weeks_range":"32,386,500 - 40,734,000","this_year_range":"39,055,500 - 40,734,000","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"chart_data":{"0":{"moving_avg":"100.0000","org_moving_avg":"1.000000","period_end_date":"2018-01-31","slope":"1.000000"},"1":{"moving_avg":"101.5000","org_moving_avg":"1.015000","period_end_date":"2018-02-28","slope":"0.970000"},"2":{"moving_avg":"101.6667","org_moving_avg":"1.016667","period_end_date":"2018-03-31","slope":"0.980000"},"3":{"moving_avg":"102.3333","org_moving_avg":"1.023333","period_end_date":"2018-04-30","slope":"0.980000"},"4":{"moving_avg":"103.0000","org_moving_avg":"1.030000","period_end_date":"2018-05-31","slope":"0.950000"},"5":{"moving_avg":"104.3333","org_moving_avg":"1.043333","period_end_date":"2018-06-30","slope":"0.940000"},"6":{"moving_avg":"105.6667","org_moving_avg":"1.056667","period_end_date":"2018-07-31","slope":"0.940000"},"7":{"moving_avg":"106.3333","org_moving_avg":"1.063333","period_end_date":"2018-08-31","slope":"0.940000"},"8":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2018-09-30","slope":"0.910000"},"9":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-10-31","slope":"0.930000"},"10":{"moving_avg":"109.0000","org_moving_avg":"1.090000","period_end_date":"2018-11-30","slope":"0.920000"},"11":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-12-31","slope":"0.930000"},"12":{"moving_avg":"108.0000","org_moving_avg":"1.080000","period_end_date":"2019-01-31","slope":"0.930000"},"13":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-02-28","slope":"0.930000"},"14":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-03-31","slope":"0.930000"},"15":{"moving_avg":"108.6667","org_moving_avg":"1.086667","period_end_date":"2019-04-30","slope":"0.910000"},"16":{"moving_avg":"110.0000","org_moving_avg":"1.100000","period_end_date":"2019-05-31","slope":"0.890000"},"17":{"moving_avg":"112.3333","org_moving_avg":"1.123333","period_end_date":"2019-06-30","slope":"0.870000"},"18":{"moving_avg":"113.3333","org_moving_avg":"1.133333","period_end_date":"2019-07-31","slope":"0.880000"},"19":{"moving_avg":"114.3333","org_moving_avg":"1.143333","period_end_date":"2019-08-31","slope":"0.870000"},"20":{"moving_avg":"114.6667","org_moving_avg":"1.146667","period_end_date":"2019-09-30","slope":"0.860000"},"21":{"moving_avg":"116.3333","org_moving_avg":"1.163333","period_end_date":"2019-10-31","slope":"0.850000"},"22":{"moving_avg":"118.0000","org_moving_avg":"1.180000","period_end_date":"2019-11-30","slope":"0.830000"},"2
3":{"moving_avg":"120.0000","org_moving_avg":"1.200000","period_end_date":"2019-12-31","slope":"0.820000"},"24":{"moving_avg":"121.0000","org_moving_avg":"1.210000","period_end_date":"2020-01-31","slope":"0.830000"},"25":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-02-29","slope":"0.840000"},"26":{"moving_avg":"120.3333","org_moving_avg":"1.203333","period_end_date":"2020-03-31","slope":"0.830000"},"27":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-04-30","slope":"0.820000"},"28":{"moving_avg":"122.3333","org_moving_avg":"1.223333","period_end_date":"2020-05-31","slope":"0.810000"},"29":{"moving_avg":"124.0000","org_moving_avg":"1.240000","period_end_date":"2020-06-30","slope":"0.790000"},"30":{"moving_avg":"125.0000","org_moving_avg":"1.250000","period_end_date":"2020-07-31","slope":"0.800000"},"31":{"moving_avg":"127.3333","org_moving_avg":"1.273333","period_end_date":"2020-08-31","slope":"0.760000"},"32":{"moving_avg":"131.3333","org_moving_avg":"1.313333","period_end_date":"2020-09-30","slope":"0.730000"},"33":{"moving_avg":"137.0000","org_moving_avg":"1.370000","period_end_date":"2020-10-31","slope":"0.700000"},"34":{"moving_avg":"142.3333","org_moving_avg":"1.423333","period_end_date":"2020-11-30","slope":"0.680000"},"35":{"moving_avg":"147.0000","org_moving_avg":"1.470000","period_end_date":"2020-12-31","slope":"0.660000"},"36":{"moving_avg":"152.3333","org_moving_avg":"1.523333","period_end_date":"2021-01-31","slope":"0.630000"},"37":{"moving_avg":"159.6667","org_moving_avg":"1.596667","period_end_date":"2021-02-28","slope":"0.590000"},"38":{"moving_avg":"168.0000","org_moving_avg":"1.680000","period_end_date":"2021-03-31","slope":"0.560000"},"39":{"moving_avg":"176.6667","org_moving_avg":"1.766667","period_end_date":"2021-04-30","slope":"0.540000"},"40":{"moving_avg":"182.6667","org_moving_avg":"1.826667","period_end_date":"2021-05-31","slope":"0.540000"},"41":{"moving_avg":"187.3333","org_moving_avg":"1.873333","period_end_date":"2021-06-30","slope":"0.520000"},"42":{"moving_avg":"191.3333","org_moving_avg":"1.913333","period_end_date":"2021-07-31","slope":"0.510000"},"43":{"moving_avg":"195.6667","org_moving_avg":"1.956667","period_end_date":"2021-08-31","slope":"0.500000"},"44":{"moving_avg":"202.0000","org_moving_avg":"2.020000","period_end_date":"2021-09-30","slope":"0.480000"},"45":{"moving_avg":"208.5988","org_moving_avg":"2.085988","period_end_date":"2021-10-31","slope":"0.463400"},"46":{"moving_avg":"216.2373","org_moving_avg":"2.162373","period_end_date":"2021-11-30","slope":"0.448600"},"47":{"moving_avg":"221.7375","org_moving_avg":"2.217375","period_end_date":"2021-12-31","slope":"0.441500"},"48":{"moving_avg":"225.8916","org_moving_avg":"2.258916","period_end_date":"2022-01-31","slope":"0.438100"},"49":{"moving_avg":"228.6755","org_moving_avg":"2.286755","period_end_date":"2022-02-28","slope":"0.432400"},"50":{"moving_avg":"231.0205","org_moving_avg":"2.310205","period_end_date":"2022-03-31","slope":"0.428200"},"51":{"moving_avg":"232.8524","org_moving_avg":"2.328524","period_end_date":"2022-04-30","slope":"0.427800"},"52":{"moving_avg":"235.6036","org_moving_avg":"2.356036","period_end_date":"2022-05-31","slope":"0.417500"}},"base_avg_price":3842,"calculated_value":4500,"selectedMonthData":[{"date":"2021-12-31","price_sqft":"8,519","price":"38,335,500","index":"221.74"},{"date":"2021-06-30","price_sqft":"7,197","price":"32,386,500","index":"187.33"},{"date":"2020-06-30","price_sqft":"4,764","price":"21,438
,000","index":"124.00"}]},"index_type":false}'
