Is there an error in this web-scraping script? - python

What's the error in this script?
from bs4 import BeautifulSoup
import requests
years = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
web = 'https://www.uefa.com/uefaeuropaleague/history/seasons/2022/matches/'
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')
matches = soup.find_all('div', class_='pk-match-unit size-m')
for match in matches:
    print(match.find('div', class_='pk-match__base--team-home size-m').get_text())
    print(match.find('div', class_='pk-match__score size-m').get_text())
    print(match.find('div', class_='pk-match__base--team-away size-m').get_text())
I am not able to find the error. The purpose of the prints is to obtain the data for the matches of the last edition of the Europa League.
I attach a picture of the HTML for reference, since I do not see where the error is.
Keep in mind that I am only doing it for the year 2021.
I am trying to get the results from the group stage through to the final.

Always and first of all, take a look at your soup to see if all the expected ingredients are there.
The issue here is that the data is loaded from an API, so you won't get it with BeautifulSoup if it is not in the response of requests. Take a look at your browser's dev tools on the XHR tab and use this API call to get the results as JSON.
Example
Inspect the whole JSON to pick the info that fits your needs:
import requests
json_data = requests.get('https://match.uefa.com/v5/matches?competitionId=14&seasonYear=2022&phase=TOURNAMENT&order=DESC&offset=0&limit=20').json()
for m in json_data:
    print(m['awayTeam']['internationalName'],
          f"{m['score']['total']['away']}:{m['score']['total']['home']}",
          m['homeTeam']['internationalName'])
Output
Rangers 1:1 Frankfurt
West Ham 0:1 Frankfurt
Leipzig 1:3 Rangers
Frankfurt 2:1 West Ham
Rangers 0:1 Leipzig
Braga 1:3 Rangers
West Ham 3:0 Lyon
Frankfurt 3:2 Barcelona
Leipzig 2:0 Atalanta
Rangers 0:1 Braga
...
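The call above returns at most 20 matches (limit=20). If you want the whole season, a sketch that pages through the offset parameter is shown below; it assumes the endpoint simply returns an empty list once the offset passes the last match, which I have not verified.
import requests

base = ('https://match.uefa.com/v5/matches?competitionId=14&seasonYear=2022'
        '&phase=TOURNAMENT&order=DESC&offset={offset}&limit=20')

all_matches = []
offset = 0
while True:
    page = requests.get(base.format(offset=offset)).json()
    if not page:  # assumption: an empty list means there are no more matches
        break
    all_matches.extend(page)
    offset += 20

for m in all_matches:
    print(m['homeTeam']['internationalName'],
          f"{m['score']['total']['home']}:{m['score']['total']['away']}",
          m['awayTeam']['internationalName'])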

Related

How to make BeautifulSoup go to the specific webpage I want instead of a random one on the site?

I am trying to learn web scraping with BeautifulSoup by scraping UFC fight data off the website Tapology. I have entered the URL of a specific fight's webpage, but every time I run the code it seems to jump to a new random fight on the site instead of this fight. Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
html_text = requests.get(url, timeout=5).text
soup = BeautifulSoup(html_text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')
print(fightresult, fightstats)
Honestly I have no idea how it could be switching to other webpages when I have a very specific URL like the one I am using.
I got the same results ("...every time I run the code it seems to jump to a new random fight...") when I tried your code. Like some of the comments suggested, it's probably an effort to evade bots. Maybe the right set of headers could resolve it, but I'm not very good at making requests imitate un-automated browsers, so in situations like these I sometimes use HTMLSession (or cloudscraper or even ScrapingAnt, and finally selenium if none of the others work).
from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
req = HTMLSession().get(url)
soup = BeautifulSoup(req.text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')

# remove some whitespace from the text for better readability
print('\n\n\n'.join('\n'.join(
    ' '.join(w for w in t.text.split() if w) for t in f if t.text.strip()
) for f in [fightresult, fightstats]))
For me, that prints
Keith Jardine defeats Chuck Liddell via 3 Round Decision #30 Biggest Upset of All Time #97 Greatest Light Heavy MMA Fight of All Time
13-3-1
Pro Record At Fight
20-4-0
Climbed to 14-3
Record After Fight
Fell to 20-5
+290 (Moderate Underdog)
Betting Odds
-395 (Moderate Favorite)
United States
Nationality
United States
Albuquerque, New Mexico
Fighting out of
San Luis Obispo, California
31 years, 10 months, 3 weeks, 1 day
Age at Fight
37 years, 9 months, 5 days
204.5 lbs (92.8 kgs)
Weigh-In Result
205.5 lbs (93.2 kgs)
6'1" (186cm)
Height
6'2" (188cm)
76.0" (193cm)
Reach
76.5" (194cm)
Jackson Wink MMA
Gym
The Pit
Invicta FC 50
11.16.2022, 9:00 PM ET
Bellator 288
11.18.2022, 6:00 PM ET
ONE on Prime Video 4
11.18.2022, 7:00 PM ET
LFA 147
11.18.2022, 6:00 PM ET
ONE Championship 163
11.19.2022, 4:30 AM ET
UFC Fight Night
11.19.2022, 1:00 PM ET
Cage Warriors 147: Unplugg...
11.20.2022, 12:00 PM ET
PFL 10
11.25.2022, 5:30 PM ET
Jiří "Denisa" Procházka
Glover Teixeira
Jan Błachowicz
Magomed Ankalaev
Aleksandar "Rocket" Rakić
Anthony "Lionheart" Smith
Jamahal "Sweet Dreams" Hill
Nikita "The Miner" Krylov
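If you want to stay with plain requests, a rough sketch of the headers idea mentioned above is below. Whether any particular header set actually gets past the bot detection is an assumption; fall back to the tools listed above if it doesn't.
import requests
from bs4 import BeautifulSoup

url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'

# assumption: a browser-like User-Agent may be enough; if not, use HTMLSession/cloudscraper/selenium
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
html_text = requests.get(url, headers=headers, timeout=5).text
soup = BeautifulSoup(html_text, 'html.parser')
fightresult = soup.find_all('div', class_='boutResultHolder')
print(fightresult)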

Python/Pandas/NLTK: Iterating through a DataFrame, get value, transform it and add the new value to a new column

I scraped some data from Google News into a dataframe:
DataFrame:
df
title link pubDate description source source_url
0 Australian research finds cost-effective way t... https://news.google.com/__i/rss/rd/articles/CB... Sat, 15 Oct 2022 23:51:00 GMT Australian research finds cost-effective way t... The Guardian https://www.theguardian.com
1 Something New Under the Sun: Floating Solar Pa... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 11:49:11 GMT Something New Under the Sun: Floating Solar Pa... Voice of America - VOA News https://www.voanews.com
2 Adapt solar panels for sub-Saharan Africa - Na... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 09:06:41 GMT Adapt solar panels for sub-Saharan AfricaNatur... Nature.com https://www.nature.com
3 Cost of living: The people using solar panels ... https://news.google.com/__i/rss/rd/articles/CB... Wed, 05 Oct 2022 07:00:00 GMT Cost of living: The people using solar panels ... BBC https://www.bbc.co.uk
4 Business Matters: Solar Panels on Commercial P... https://news.google.com/__i/rss/rd/articles/CB... Mon, 17 Oct 2022 09:13:35 GMT Business Matters: Solar Panels on Commercial P... Insider Media https://www.insidermedia.com
... ... ... ... ... ... ...
What I want to do now is basically iterate through the "link" column, summarize every article with NLTK, and add the summary to a new column. Here is an example:
from newspaper import Article  # newspaper3k

article = Article(df.iloc[4, 1])  # get the url from the link column
article.download()
article.parse()
article.nlp()
article = article.summary
print(article)
Output:
North WestGemma Cornwall, Head of Sustainability of Anderton Gables, looks into the benefit of solar panels.
And, with the cost of solar panels continually dropping, it is becoming increasingly affordable for commercial property owners.
Reduce your energy spendMost people are familiar with solar energy, but many are unaware of the significant financial savings that can be gained by installing solar panels in commercial buildings.
As with all things, there are pros and cons to weigh up when considering solar panels.
If you’re considering solar panels for your property, contact one of the Anderton Gables team, who can advise you on the best course of action.
I tried a little bit, but I couldn't make it work...
Thanks for your help!
This will be a very slow solution with a for loop, but it might work for a small dataset: iterate through all the links, apply the transformations needed, and ultimately create a new column in the dataframe.
summaries = []
for l in df['link'].values:  # the question asks for the "link" column
    article = Article(l)
    article.download()
    article.parse()
    article.nlp()
    summaries.append(article.summary)
df['summaries'] = summaries
Or you could define a custom function and then use pd.apply:
def get_description(x):
    art = Article(x)
    art.download()
    art.parse()
    art.nlp()
    return art.summary

df['summary'] = df['link'].apply(get_description)
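newspaper's download/parse steps can raise for links that fail to resolve (some Google News redirect URLs do), so a defensive variant of the apply approach might look like the sketch below; the empty-string fallback is just an assumption about what you want for failed rows.
def get_description_safe(url):
    try:
        art = Article(url)
        art.download()
        art.parse()
        art.nlp()
        return art.summary
    except Exception:
        # assumption: an empty string is an acceptable placeholder for articles that fail to download
        return ''

df['summary'] = df['link'].apply(get_description_safe)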

Scraping all entries of lazyloading page using python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How do I get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')
The data is loaded via JavaScript from another URL. You can use this example to load the releases from different years:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...

Why does my markov chain produce identical sentences from corpus?

I am using the markovify Markov chain generator in Python, and when using the example code given there it produces a lot of duplicate sentences for me, and I don't know why.
The code is as follows:
import markovify

# Get raw text as string.
with open("testtekst.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print twenty randomly-generated sentences
for i in range(20):
    print(text_model.make_sentence())
This gives me output of:
Time included him on their list of the world's highest-paid athlete by ESPN from 2016 to 2019.
He assumed full captaincy of the world's most marketable and famous athletes, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
One of the tournament.
He also led them to victory in the world in 2014.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
In 2015, Ronaldo was ranked the world's most famous athlete by ESPN from 2016 to 2019.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2004, where he helped Portugal reach the final.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's highest-paid athlete by ESPN from 2016 to 2019.
Time included him on their list of the national team in July 2008.
He is the first footballer and the FIFA Club World Cup at age 23, he won his first season.
Time included him on their list of the national team in July 2008.
He also led them to victory in the world in 2014.
The following year, he led Portugal to their first major tournament title at Euro 2004, where he helped Portugal reach the final.
One of the world's most marketable and famous athletes, Ronaldo was ranked the world's most famous athlete by Forbes in 2016 and 2017 and the FIFA Club World Cup at age 23, he won his first international goal at Euro 2016, and received the Silver Boot as top scorer of Euro 2020.
The testtekst.txt is in ANSI encoding and has the following corpus:
Born and raised in Madeira, Ronaldo began his senior club career
playing for Sporting CP, before signing with Manchester United in
2003, aged 18, winning the FA Cup in his first season. He would also
go onto win three consecutive Premier League titles, the Champions
League and the FIFA Club World Cup at age 23, he won his first Ballon
d'Or. Ronaldo was the subject of the then-most expensive association
football transfer when he signed for Real Madrid in 2009 in a transfer
worth €94 million (£80 million), where he won 15 trophies, including
two La Liga titles, two Copa del Rey and four Champions Leagues, and
became the club's all-time top goalscorer. He also finished runner-up
for the Ballon d'Or three times, behind Lionel Messi (his perceived
career rival), and won back-to-back Ballons d'Or in 2013 and 2014, and
again in 2016 and 2017. In 2018, he signed for Juventus in a transfer
worth an initial €100 million (£88 million), the most expensive
transfer for an Italian club and the most expensive transfer for a
player over 30 years old. He won two Serie A titles, two Supercoppe
Italiana and a Coppa Italia, before returning to Manchester United in
2021. Ronaldo made his senior international debut for Portugal in 2003 at the age of 18 and has since earned over 180 caps, making him
Portugal's most-capped player. With more than 100 goals at
international level, he is also the nation's all-time top goalscorer.
He has played in and scored at 11 major tournaments, he scored his
first international goal at Euro 2004, where he helped Portugal reach
the final. He assumed full captaincy of the national team in July
2008. In 2015, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation. The following year, he led
Portugal to their first major tournament title at Euro 2016, and
received the Silver Boot as the second-highest goalscorer of the
tournament. He also led them to victory in the inaugural UEFA Nations
League in 2019, and later received the Golden Boot as top scorer of
Euro 2020. One of the world's most marketable and famous athletes,
Ronaldo was ranked the world's highest-paid athlete by Forbes in 2016
and 2017 and the world's most famous athlete by ESPN from 2016 to
2019. Time included him on their list of the 100 most influential people in the world in 2014. He is the first footballer and the third
sportsman to earn US $1 billion in his career.
As you can see in the output, there are several identical sentences printed out and I have no idea why. The default state size should be 2.
The answer is that my state size was too big; after setting it to 1, it produced unique sentences. I also did not know that markovify always starts generating new sentences with the first words of the sentences in the corpus.
That's right, markovify always starts generating new sentences with the first words of sentences in the corpus, and your state size was too big, so you have answered it yourself. You still got a good result, though. I have carefully read the text on Ronaldo. Well done.
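For illustration, the fix in code: state_size is the markovify parameter discussed above, and a smaller value chains on less context, so the generated sentences copy less of the original phrasing verbatim (at the cost of some coherence).
import markovify

with open("testtekst.txt") as f:
    text = f.read()

# state_size=1 chains on single words instead of word pairs (the default of 2)
text_model = markovify.Text(text, state_size=1)

for i in range(20):
    print(text_model.make_sentence())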

Using regex separators with read_csv() in python?

I have a lot of csv files formated as such:
date1::tweet1::location1::language1
date2::tweet2::location2::language2
date3::tweet3::location3::language3
and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:
try:
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print 'Number of tweets: ' + str(len(data))
except BaseException, e:
    print 'Error: ', str(e)
I get the following error thrown at me
Error: expected 4 fields in line 4581, saw 5
I tried setting error_bad_lines=False, manually deleting the lines that break the parser, and setting nrows to a lower number... and I still get those "expected fields" errors for seemingly random lines. Say I delete the bottom half of the file: I get the same error but for line 1787, which doesn't make sense to me, as that line was processed correctly before. Visually inspecting the csv files doesn't reveal abnormal patterns that suddenly appear in the buggy line either.
The date fields and tweets contain colons, URLs and so on, so perhaps a regex would make sense?
Can someone help me figure out what I'm doing wrong? Many thanks in advance!
Sample of the data as requested below:
Fri Apr 22 21:41:03 +0000 2016::RT #TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::RT #JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en
Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en
Start with this:
pd.read_csv(tweets_data_path, sep="::", header = None, usecols = [0,1,2,3])
The above should bring in 4 columns, then you can figure out how many lines were dropped, and if the data makes sense.
Use this pattern:
data["lang"].unique()
Since you have a problem somewhere in the data and don't know where it is, you need to step back and use Python's csv reader. This should get you started.
import csv
import pandas as pd

reader = csv.reader(open(tweets_data_path))
tweetList = []
for row in reader:
    try:
        tweetList.append(row[0].split('::'))
    except BaseException, e:
        print 'Error: ', str(e)
print tweetList
tweetsDf = pd.DataFrame(tweetList)
print tweetsDf
0 \
0 Fri Apr 22 21:41:03 +0000 2016
1 Fri Apr 22 21:41:07 +0000 2016
2 Fri Apr 22 21:41:07 +0000 2016
3 Fri Apr 22 21:41:08 +0000 2016
1 2 3
0 RT #TalOfer: Barack Obama: Brexit would put UK... United Kingdom en
1 RT #JamieRoss7: It must be awful to strongly b... The United Kingdom en
2 Whether or not it rains on June 23rd will hav... Dublin None
3 FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi... Mardan None
Have you tried read_table instead? I got this kind of error when I tried to use read_csv before, and I solved the problem by switching to it. Please refer to this post; it might give you some ideas about how to solve the error. And maybe also try sep=r":{2}" as the delimiter.
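For reference, a minimal sketch of the regex-separator idea with current pandas; the column names and the decision to simply skip malformed lines are assumptions.
import pandas as pd

data = pd.read_csv(
    tweets_data_path,
    sep=r"::",               # two literal colons; multi-character separators need engine='python'
    engine="python",
    header=None,
    names=["timestamp", "tweet", "location", "lang"],
    on_bad_lines="skip",     # assumption: rows with extra '::' can be dropped (pandas >= 1.3)
)
print("Number of tweets:", len(data))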
