So, I need help here. This is my code:
import re
import requests
from bs4 import BeautifulSoup

results = []
for i in popup_linkz:  # N links like https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA== that I iterate through and scrape
    url = i  # the URL being scraped on this iteration
    response = requests.get(url)
    print('url:', response.url)
    # print('status:', response.status_code)
    soup = BeautifulSoup(response.content, "html.parser")
    results = []  # note: this re-creates the list on every iteration
    # json_res = json.loads(res.text)
    # print(json_res[0]['price'])
    item_1 = 'grvProducto_ctl02_lblCategoria'
    for line in soup.findAll('span', attrs={'id': 'grvProducto_ctl02_lblCategoria'}):
        results.append(line.text)
    # This gets the first code, but I don't know how to iterate over the others,
    # and it doesn't store every code: each print shows a single value instead of accumulating them.
    print('id', results)
I am trying to get data from this sample URL: https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA==
In practice the loop iterates over anywhere from 2 to 10,000 such URLs.
There is information on the page that I want but cannot get. I am not sure how to use this

for line in soup.findAll('span', attrs={'id': 'grvProducto_ctl02_lblCategoria'}):
    results.append(line.text)

within the same loop to get the other information from the page's underlying data.
Could you enlighten me, please?
Try:
import requests
from bs4 import BeautifulSoup

url = "https://www.mercadopublico.cl/Procurement/Modules/RFB/DetailsAcquisition.aspx?qs=uEap3sWEgifS2G+m9xvYiA=="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

licitation_number = soup.select_one("#lblNumLicitacion").text
responsable = soup.select_one("#lblResponsable").text
ficha = soup.select_one("#lblFicha2Reclamo").text

print(f"{licitation_number=}")
print(f"{responsable=}")
print(f"{ficha=}")
print("-" * 80)

for t in soup.select("#grvProducto .borde_tabla00"):
    categoria = t.select_one('[id$="lblCategoria"]').text
    cantidad = t.select_one('[id$="lblCantidad"]').text
    descripcion = t.select_one('[id$="lblDescripcion"]').text
    print(f"{categoria=} {cantidad=}")
    print(f"{descripcion=}")
    print()
Prints:
licitation_number='1549-5-LR22'
responsable='SERVICIO DE SALUD METROPOLITANA NORTE HOSPITAL SAN JOSE, Hospital San José'
ficha='107'
--------------------------------------------------------------------------------
categoria='42221501' cantidad='130'
descripcion='(226-2001) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO EVEROLIMUS'
categoria='42221501' cantidad='360'
descripcion='(226-2002) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO ZOTAROLIMUS'
categoria='42221501' cantidad='120'
descripcion='(226-2004) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO SIROLIMUS, CON STRUT DE 0.80'
categoria='42221501' cantidad='240'
descripcion='(226-2003) STENT CORONARIO DE CROMO COBALTO, LIBERADOR DE FÁRMACO SIROLIMUS, CON STRUT DE 0.60'
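If you need to run this over your whole list of links, a rough sketch (assuming popup_linkz is your list of DetailsAcquisition.aspx URLs, and reusing the selectors above) would be:

import requests
from bs4 import BeautifulSoup

results = []  # accumulate across all URLs instead of resetting inside the loop
for url in popup_linkz:  # assumption: popup_linkz holds the detail-page URLs
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for t in soup.select("#grvProducto .borde_tabla00"):
        results.append({
            "url": url,
            "categoria": t.select_one('[id$="lblCategoria"]').text,
            "cantidad": t.select_one('[id$="lblCantidad"]').text,
            "descripcion": t.select_one('[id$="lblDescripcion"]').text,
        })

print(len(results), "rows collected")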
I am new to Python and I am trying to scrape this website. What I am trying to do is get just the dates and the article titles. I followed a procedure I found on SO, which is as follows:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)
movies = soup.select(".title a , .date")
print(movies)
movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)
I got .title a , .date using SelectorGadget on the URL I shared. However, print(movies) is empty. What am I doing wrong?
Can anyone help me?
Thanks!
The content is not part of index.en.html; it is loaded in by JavaScript from
https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
Also, as far as I know you can't select them as pairs, so you need to select the titles and dates separately:
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
Then you can print them out like this:
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
Result:
['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
Full code:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
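Since pairs also holds the date tag for each title, you can pull the date text out the same way if you need it:

# Each pair is (title_tag, date_tag), so the dates are available as well.
dates_text = [pair[1].text for pair in pairs]
print(dates_text)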
I would recommend using Python Selenium.
Try something like this:
from selenium.webdriver import Chrome
from selenium.common.exceptions import NoSuchElementException

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"

browser = Chrome()
browser.get(url)

interviews = browser.find_elements_by_class_name('title')
links = []
for interview in interviews:
    try:
        anchor = interview.find_element_by_tag_name('a')
        link = anchor.get_attribute('href')
        links.append(link)
    except NoSuchElementException:
        pass
links will now contain the links to all the interviews. You can do something similar for the dates.
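For the dates, a rough sketch could look like this (assuming the date elements carry the .date class used in the BeautifulSoup answer above):

# Assumption: each date element has the CSS class "date", as in the selector above.
date_elements = browser.find_elements_by_class_name('date')
dates = [d.text for d in date_elements]

# If the page lists one date per interview in the same order, they can be paired up:
interviews_with_dates = list(zip(dates, links))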
I am still a beginner, so I'm sorry if this is a stupid question. I am trying to scrape some news articles for my master's analysis through a Jupyter notebook, but I am struggling with pagination. How can I fix that?
Here is the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

danas = []

base_url = 'https://www.danas.rs/tag/izbori-2020/page/'
r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c, "html.parser")

paging = soup.find("div", {"column is-8"}).find("div", {"nav-links"}).find_all("a")
start_page = paging[1].int
last_page = paging[len(paging) - 1].int

web_content_list = []

for page_number in range(int(float(start_page)), int(float(last_page)) + 1):
    url = base_url + str(page_number) + "/.html"
    r = requests.get(base_url + str(page_number))
    c = r.content
    soup = BeautifulSoup(c, "html.parser")

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        try:
            headline = soup.find('h1', {'class': 'post-title'}).text.strip()
        except:
            headline = None
        try:
            time = soup.find('time', {'class': 'entry-date published'}).text.strip()[:17]
        except:
            time = None
        try:
            descr = soup.find('div', {'class': 'post-intro-content content'}).text.strip()
        except:
            descr = None
        try:
            txt = soup.find('div', {'class': 'post-content content'}).text.strip()
        except:
            txt = None

        # create a list with all scraped info
        danas = [headline,
                 date,
                 time,
                 descr,
                 txt]

        web_content_list.append(danas)
    else:
        print('Oh No! ' + l)

dh = pd.DataFrame(danas)
dh.head()
And here is the error that pops out:

AttributeError                            Traceback (most recent call last)
<ipython-input-10-1c9e3a7e6f48> in <module>
     11 soup = BeautifulSoup(c,"html.parser")
     12
---> 13 paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
     14 start_page = paging[1].int
     15 last_page = paging[len(paging)-1].int

AttributeError: 'NoneType' object has no attribute 'find'
Well, one issue is that 'https://www.danas.rs/tag/izbori-2020/page/' returns Greška 404: Tražena stranica nije pronađena ("Error 404: the requested page was not found") on the initial request, so you will need to address that.
The second issue is pulling in the start page and end page. Just curious, why would you search for a start page? All pages start at 1.
Another question: why convert to float, then int? Just get the page number as an int.
3rd, you never declare your variable date.
4th, you are only grabbing the first article on each page. Is that what you want, or do you want all the articles on the page? I left your code as is, since your question is about iterating through the pages.
5th, if you want the full text of the articles, you'll need to follow each of the article links (see the sketch after the output below).
There are a few more issues with the code too. I commented on them so you can see them. Compare this code to yours, and if you have questions, let me know:
Code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.danas.rs/tag/izbori-2020/'
r = requests.get(base_url)
c = r.text
soup = BeautifulSoup(c, "html.parser")

paging = soup.find("div", {"column is-8"}).find("div", {"nav-links"}).find_all("a")
start_page = 1
last_page = int(paging[1].text)

web_content_list = []

for page_number in range(int(start_page), int(last_page) + 1):
    url = base_url + 'page/' + str(page_number)  # <-- fixed this
    r = requests.get(url)
    c = r.text
    soup = BeautifulSoup(c, "html.parser")

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        articles = soup.find_all('article')
        for article in articles:
            w = 1
            try:
                headline = soup.find('h2', {'class': 'article-post-title'}).text.strip()
            except:
                headline = None
            try:
                time = soup.find('time')['datetime']
            except:
                time = None
            try:
                descr = soup.find('div', {'class': 'article-post-excerpt'}).text.strip()
            except:
                descr = None

            # create a list with all scraped info <--- changed to dictionary so that you have column:value when you create the dataframe
            danas = {'headline': headline,
                     'time': time,
                     'descr': descr}

            web_content_list.append(danas)

        print('Collected: %s of %s' % (page_number, last_page))
    else:
        # print('Oh No! ' + l) #<--- what is l?
        print('Oh No!')

dh = pd.DataFrame(web_content_list)  # <-- need the full appended list, not danas, as that is overwritten on each iteration
dh.head()
Output:
print(dh.head().to_string())
headline time descr
0 Vučić saopštava ime mandatara vlade u 20 časova 2020-10-05T09:00:05+02:00 Predsednik Aleksandar Vučić će u danas, nakon sastanka Predsedništva Srpske napredne stranke (SNS) saopštiti ime mandatara za sastav nove Vlade Srbije. Vučić će odluku saopštiti u 20 sati u Palati Srbija, rečeno je FoNetu u kabinetu predsednika Srbije.
1 Nova skupština i nova vlada 2020-08-01T14:00:13+02:00 Saša Radulović biće poslanik još nekoliko dana i prvi je objavio da se vrši dezinfekcija skupštinskih prostorija, što govori u prilog tome da će se novi saziv, izabran 21. juna, ipak zakleti u Domu Narodne skupštine.
2 Brnabić o novom mandatu: To ne zavisi od mene, SNS ima dobre kandidate 2020-07-15T18:59:43+02:00 Premijerka Ana Brnabić izjavila je danas da ne zavisi od nje da li će i u novom mandatu biti na čelu Vlade Srbije.
3 Državna izborna komisija objavila prve rezultate, HDZ ubedljivo vodi 2020-07-05T21:46:56+02:00 Državna izborna komisija (DIP) objavila je večeras prve nepotpune rezultate po kojima vladajuća Hrvatska demokratska zajednica (HDZ) osvaja čak 69 mandata, ali je reč o rezultatima na malom broju prebrojanih glasova.
4 Analiza Pravnog tima liste „Šabac je naš“: Ozbiljni dokazi za krađu izbora 2020-07-02T10:53:57+02:00 Na osnovu izjave 123 birača, od kojih je 121 potpisana i sa matičnim brojem, prikupljenim u roku od 96 sati nakon zatvaranja biračkih mesta u nedelju 21. 6. 2020. godine u 20 časova, uočena su 263 kršenja propisa na 55 biračkih mesta, navodi se na početku Analize koju je o kršenju izbornih pravila 21. juna i uoči izbora sačinio pravni tim liste „Nebojša Zelenović – Šabac je naš“.
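For the fifth point (getting the full text), here is a rough sketch of following each article link. It assumes every <article> in the listing contains an <a> whose href points to the article page, and it reuses the single-article selectors from your original code (post-title, entry-date published, post-intro-content content, post-content content); none of these are verified here.

import requests
from bs4 import BeautifulSoup


def scrape_article(article_url):
    # Fetch one article page and pull the fields used in the original question.
    r = requests.get(article_url)
    s = BeautifulSoup(r.content, 'html.parser')

    def text_or_none(tag):
        return tag.text.strip() if tag else None

    return {
        'headline': text_or_none(s.find('h1', {'class': 'post-title'})),
        'time': text_or_none(s.find('time', {'class': 'entry-date published'})),
        'descr': text_or_none(s.find('div', {'class': 'post-intro-content content'})),
        'txt': text_or_none(s.find('div', {'class': 'post-content content'})),
    }


# Inside the page loop above, instead of reading the listing page itself:
# for article in articles:
#     link = article.find('a')  # assumption: the first <a> holds the article URL
#     if link and link.get('href'):
#         web_content_list.append(scrape_article(link['href']))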
I have some difficulties in saving the results that I am scraping.
Please refer to this code (slightly changed for my specific case):
import bs4, requests
import pandas as pd
import re
import time

headline = []
corpus = []
dates = []
tag = []

start = 1
url = "https://www.imolaoggi.it/category/cron/"

while True:
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html')

    headlines = soup.find_all('h3')
    corpora = soup.find_all('p')
    dates = soup.find_all('time', attrs={'class': 'entry-date published updated'})
    tags = soup.find_all('span', attrs={'class': 'cat-links'})

    for t in headlines:
        headline.append(t.text)
    for s in corpora:
        corpus.append(s.text)
    for d in date:
        dates.append(d.text)
    for c in tags:
        tag.append(c.text)

    if soup.find_all('a', attrs={'class': 'page-numbers'}):
        url = f"https://www.imolaoggi.it/category/cron/page/{page}"
        page += 1
    else:
        break
Create the dataframe:

df = pd.DataFrame(list(zip(date, headline, tag, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
I would like to save all the pages from this link. The code works, but it seems that on every page it writes two identical sentences for the corpus. I think this is happening because of the tag I chose:

corpora=soup.find_all('p')

This causes a misalignment of rows in my dataframe, since the data are saved in lists and the corpus only starts being scraped correctly later, compared to the other fields. I hope you can help me understand how to fix it.
You were close, but your selectors were off and you mis-named some of your variables.
I would use CSS selectors like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# soup for one page, e.g. the first listing page from the question
soup = BeautifulSoup(requests.get("https://www.imolaoggi.it/category/cron/").text, 'html.parser')

headline = []
corpus = []
date_list = []
tag_list = []

headlines = soup.select('h3.entry-title')
corpora = soup.select('div.entry-meta + p')
dates = soup.select('div.entry-meta span.posted-on')
tags = soup.select('span.cat-links')

for t in headlines:
    headline.append(t.text)
for s in corpora:
    corpus.append(s.text.strip())
for d in dates:
    date_list.append(d.text)
for c in tags:
    tag_list.append(c.text)

df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
df
Output:
Date Headlines Tags Corpus
0 30 Ottobre 2020 Roma: con spranga di ferro danneggia 50 auto i... CRONACA, NEWS Notte di vandalismi a Colli Albani dove un uom...
1 30 Ottobre 2020\n30 Ottobre 2020 Aggressione con machete: grave un 28enne, arre... CRONACA, NEWS Roma - Ha impugnato il suo machete e lo ha agi...
2 30 Ottobre 2020\n30 Ottobre 2020 Deep State e globalismo, Mons. Viganò scrive a... CRONACA, NEWS LETTERA APERTA\r\nAL PRESIDENTE DEGLI STATI UN...
3 30 Ottobre 2020 Meluzzi e Scandurra: “Sacrificare libertà per ... CRONACA, NEWS "Sacrificare la libertà per la sicurezza è un ...
# Alternative approach: request every listing page concurrently and build the DataFrame in one pass.
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def main(req, num):
    r = req.get("https://www.imolaoggi.it/category/cron/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.entry-content")]
    return goal


with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Tags", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)
Hello, I've created two functions that work well when called alone, but when I try to use them in a for loop I get a problem with my parameter.
First function, to search and get a link to pass to the second one:
USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

def searchsport(terme):
    url = 'https://www.verif.com/recherche/{}/1/ca/d/?ville=null'.format(terme)
    response = requests.get(url, headers=USER_AGENT)
    response.raise_for_status()
    return terme, response.text
def crawl(keyword):
    if __name__ == '__main__':
        try:
            keyword, html = searchsport(keyword)
            soup = bs(html, 'html.parser')
            table = soup.find_all('td', attrs={'class': 'verif_col1'})
            premier = []
            for result in table:
                link = result.find('a', href=True)
                premier.append(link)
            truelink = 'https://www.verif.com/' + str(premier[0]).split('"')[1]
            # print("le lien", truelink)
        except Exception as e:
            print(e)
        finally:
            time.sleep(10)
        return truelink
Second function to scrape a link.
def single_text(item_url):
    source_code = requests.get(item_url)
    print('nivo1 ok')
    plain_text = source_code.text  # the HTML page with all its tags
    soup = bs(plain_text, features="lxml")
    print('nivo2 ok')
    table = soup.find('table', {'class': "table infoGen hidden-smallDevice"})  # we only look for the table tag
    print('nivo1 ok', '\n', table)
    table_rows = table.find_all('tr')  # the table data is in the tr cells
    # print(table_rows)
    l = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = row = [tr.text.strip() for tr in td]
        l.append(row)
    # remove some useless characters
    df = pd.DataFrame(l)
    return df
All of these functions worked when I tested them on a single link.
Now I have a CSV file of company names; each name goes through searchsport() to search the website, and the returned link is passed to single_text() to scrape:
for keyword in list(pd.read_csv('sport.csv').name):
    l = crawl(keyword)
    print(l)        # this prints the link
    single_item(l)  # here I get the problem
Error:
nivo1 ok
nivo2 ok
nivo1 ok
None
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-55-263d95d6748c> in <module>
3 l = crawl(keyword)
4
----> 5 single_item(item_url=l)
<ipython-input-53-6d3b5c1b1ee8> in single_item(item_url)
7 table = soup.find('table',{'class':"table infoGen hidden-smallDevice"}) # on cherche que la balise table
8 print('nivo1 ok', '\n', table)
----> 9 table_rows = table.find_all('tr') # les données de tables sont dans les celulles tr
10 #print(table_rows)
11
AttributeError: 'NoneType' object has no attribute 'find_all'
When I run this directly I get a df:

single_item(item_url="https://www.verif.com/societe/COMPANYNAME-XXXXXXXXX/").head(1)

My expected result should be a DataFrame for every keyword.
Why doesn't it work?
I have noted, in the code below, some of the problems I saw with your code as posted.
Some things I noticed:
Not handling cases where something is not found, e.g. 'PARIS-SAINT-GERMAIN-FOOTBALL' as a search term will fail whereas 'PARIS SAINT GERMAIN FOOTBALL' will not.
Missed opportunities for simplification, e.g. creating a DataFrame by looping over tr then td when you could just use read_html on the table; using find_all when a single table or tag is needed.
Overwriting variables in loops, as well as typos, e.g.
for tr in table_rows:
    td = tr.find_all('td')
    row = row = [tr.text.strip() for tr in td]  # presumably a typo with row = row
Not testing whether a DataFrame is empty.
Risking generating incorrect URLs by concatenating onto 'https://www.verif.com/' when the next part already starts with "/".
Inconsistent variable naming, e.g. what is single_item? The function I see is called single_text.
These are just some observations, and there is certainly still room for improvement.
import requests, time
from bs4 import BeautifulSoup as bs
import pandas as pd


def searchsport(terme):
    url = f'https://www.verif.com/recherche/{terme}/1/ca/d/?ville=null'
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    response.raise_for_status()
    return terme, response.text


def crawl(keyword):
    try:
        keyword, html = searchsport(keyword)
        soup = bs(html, 'lxml')
        a_tag = soup.select_one('td.verif_col1 a[href]')
        # your code before, when looping tds, would just overwrite truelink if more than one was found. Instead:
        if a_tag is None:
            # handle case of no result, e.g. when using crawl('PARIS-SAINT-GERMAIN-FOOTBALL') instead of
            # crawl('PARIS SAINT GERMAIN FOOTBALL')
            truelink = ''
        else:
            # print(a_tag['href'])
            # adding to the list premier served no purpose. Using split on href would result in list index out of range
            truelink = f'https://www.verif.com{a_tag["href"]}'  # already a relative link, so no extra / after .com
    except Exception as e:
        print(e)
        truelink = ''  # handle case of 'other' failure. Make sure there is an assignment
    finally:
        time.sleep(5)
    return truelink  # unless try succeeded this would have failed with local variable referenced before assignment


def single_text(item_url):
    source_code = requests.get(item_url, headers={'User-Agent': 'Mozilla/5.0'})
    print('nivo1 ok')
    plain_text = source_code.text  # the HTML page with all its tags
    soup = bs(plain_text, features="lxml")
    print('nivo2 ok')
    table = soup.select_one('.table')  # we only look for the table tag
    # print('nivo1 ok', '\n', table)
    if table is None:
        df = pd.DataFrame()
    else:
        df = pd.read_html(str(table))[0]  # simplify to work directly with the table and pandas; avoids your loops
    return df


def main():
    terms = ['PARIS-SAINT-GERMAIN-FOOTBALL', 'PARIS SAINT GERMAIN FOOTBALL']
    for term in terms:
        item_url = crawl(term)
        if item_url:
            print(item_url)
            df = single_text(item_url)  # what is single_item in your question? There is single_text
            if not df.empty:  # test if dataframe is empty
                print(df.head(1))


if __name__ == '__main__':
    main()
Returning df from main(): searchsport(), crawl() and single_text() stay exactly as above; only main() and the bottom of the script change:

def main():
    terms = ['PARIS-SAINT-GERMAIN-FOOTBALL', 'PARIS SAINT GERMAIN FOOTBALL']
    for term in terms:
        item_url = crawl(term)
        if item_url:
            # print(item_url)
            df = single_text(item_url)
            return df


if __name__ == '__main__':
    df = main()
    print(df)
Your error suggests that you are trying to run find_all() against a variable which hasn't been populated, i.e. no tag was found to run find_all() against. I have dealt with this by including a statement testing for NoneType:

if VALUE is not None:
    ## code when the tag is found
else:
    ## code when the tag is not found
I think this is the bit you need to update, for example like this:

table = soup.find('table', {'class': "table infoGen hidden-smallDevice"})
if table is not None:
    table_rows = table.find_all('tr')
    l = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text.strip() for tr in td]
        l.append(row)
    # remove some useless characters
    df = pd.DataFrame(l)
else:
    ## code to run when the table isn't found
There's a more colourful example of this in action, where some XML is being parsed, here.
I have this:
from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id': 'nombreEquipo'})

aux = []
for equipo in equipos:
    aux.append(equipo)
If I do print(aux[0]) I get this:

Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
My problem is that I want to take the tag:
<h2 class="cintillo">Villarreal</h2>
and the tag:
1 Sergio Asenjo
and put them into a database.
How can I extract those?
Thanks
You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the inner HTML (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
h2 = equipo.find('h2', {'class':'cintillo'}).text
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
aux.append([h2,[j.text for j in jugadores]])
# format list for printing
print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/#glhr/55550385
You could create a dictionary with team names as keys and, as values, the entrenador paired with the list of players:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')

teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador: players}

print(teams)
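For putting this into a database, a rough sketch using the standard-library sqlite3 module and the teams dict built above (the table and column names here are just examples):

import sqlite3

# Example schema: one row per (team, entrenador, player).
conn = sqlite3.connect('equipos.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS jugadores (
        equipo TEXT,
        entrenador TEXT,
        jugador TEXT
    )
""")

for team_name, detail in teams.items():
    for entrenador, players in detail.items():
        conn.executemany(
            "INSERT INTO jugadores (equipo, entrenador, jugador) VALUES (?, ?, ?)",
            [(team_name, entrenador, player) for player in players]
        )

conn.commit()
conn.close()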