I am still a beginner, so I'm sorry if this is a stupid question. I am trying to scrape some news articles for my master's analysis through a Jupyter notebook, but I am struggling with pagination. How can I fix that?
Here is the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
danas = []
base_url = 'https://www.danas.rs/tag/izbori-2020/page/'
r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c,"html.parser")
paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
start_page = paging[1].int
last_page = paging[len(paging)-1].int
web_content_list = []
for page_number in range(int(float(start_page)),int(float(last_page)) + 1):
    url = base_url+str(page_number)+"/.html"
    r = requests.get(base_url+str(page_number))
    c = r.content
    soup = BeautifulSoup(c,"html.parser")
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        try:
            headline = soup.find('h1', {'class': 'post-title'}).text.strip()
        except:
            headline = None
        try:
            time = soup.find('time', {'class': 'entry-date published'}).text.strip()[:17]
        except:
            time = None
        try:
            descr = soup.find('div', {'class': 'post-intro-content content'}).text.strip()
        except:
            descr = None
        try:
            txt = soup.find('div', {'class': 'post-content content'}).text.strip()
        except:
            txt = None
        # create a list with all scraped info
        danas = [headline,
                 date,
                 time,
                 descr,
                 txt]
        web_content_list.append(danas)
    else:
        print('Oh No! ' + l)
dh = pd.DataFrame(danas)
dh.head()
And here is the error that pops out:
AttributeError Traceback (most recent call last)
<ipython-input-10-1c9e3a7e6f48> in <module>
11 soup = BeautifulSoup(c,"html.parser")
12
---> 13 paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
14 start_page = paging[1].int
15 last_page = paging[len(paging)-1].int
AttributeError: 'NoneType' object has no attribute 'find'
Well, one issue is that 'https://www.danas.rs/tag/izbori-2020/page/' returns Greška 404: Tražena stranica nije pronađena ("Error 404: The requested page was not found") on the initial request. So we will need to address that.
The second issue is pulling in the start page and end page. Just curious, why would you search for a start page? All pages start at 1.
Another question: why convert to float, then to int? Just get the page as an int.
3rd, you never declare your variable date.
4th, you are only grabbing the 1st article on the page. Is that what you want? Or do you want all the articles on the page? I left your code largely as is, since your question is about iterating through the pages.
5th, if you want the full text of the articles, you'll need to follow each of the article links (see the sketch after the example output below).
There are a few more issues with the code too. I tried to add comments so you can see them. Compare this code to yours, and if you have questions, let me know:
Code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
base_url = 'https://www.danas.rs/tag/izbori-2020/'
r = requests.get(base_url)
c = r.text
soup = BeautifulSoup(c,"html.parser")
paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
start_page = 1
last_page = int(paging[1].text)
web_content_list = []
for page_number in range(int(start_page),int(last_page) + 1):
    url = base_url+ 'page/' + str(page_number) #<-- fixed this
    r = requests.get(url)
    c = r.text
    soup = BeautifulSoup(c,"html.parser")
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        articles = soup.find_all('article')
        for article in articles:
            try:
                headline = article.find('h2', {'class': 'article-post-title'}).text.strip() #<-- search within this article, not the whole soup
            except:
                headline = None
            try:
                time = article.find('time')['datetime']
            except:
                time = None
            try:
                descr = article.find('div', {'class': 'article-post-excerpt'}).text.strip()
            except:
                descr = None
            # create a list with all scraped info <--- changed to dictionary so that you have column:value when you create the dataframe
            danas = {'headline':headline,
                     'time':time,
                     'descr':descr}
            web_content_list.append(danas)
        print('Collected: %s of %s' %(page_number, last_page))
    else:
        #print('Oh No! ' + l) #<--- what is l?
        print('Oh No!')
dh = pd.DataFrame(web_content_list) #<-- need to get the full appended list, not the danas, as thats overwritten after each iteration
dh.head()
Output:
print(dh.head().to_string())
headline time descr
0 Vučić saopštava ime mandatara vlade u 20 časova 2020-10-05T09:00:05+02:00 Predsednik Aleksandar Vučić će u danas, nakon sastanka Predsedništva Srpske napredne stranke (SNS) saopštiti ime mandatara za sastav nove Vlade Srbije. Vučić će odluku saopštiti u 20 sati u Palati Srbija, rečeno je FoNetu u kabinetu predsednika Srbije.
1 Nova skupština i nova vlada 2020-08-01T14:00:13+02:00 Saša Radulović biće poslanik još nekoliko dana i prvi je objavio da se vrši dezinfekcija skupštinskih prostorija, što govori u prilog tome da će se novi saziv, izabran 21. juna, ipak zakleti u Domu Narodne skupštine.
2 Brnabić o novom mandatu: To ne zavisi od mene, SNS ima dobre kandidate 2020-07-15T18:59:43+02:00 Premijerka Ana Brnabić izjavila je danas da ne zavisi od nje da li će i u novom mandatu biti na čelu Vlade Srbije.
3 Državna izborna komisija objavila prve rezultate, HDZ ubedljivo vodi 2020-07-05T21:46:56+02:00 Državna izborna komisija (DIP) objavila je večeras prve nepotpune rezultate po kojima vladajuća Hrvatska demokratska zajednica (HDZ) osvaja čak 69 mandata, ali je reč o rezultatima na malom broju prebrojanih glasova.
4 Analiza Pravnog tima liste „Šabac je naš“: Ozbiljni dokazi za krađu izbora 2020-07-02T10:53:57+02:00 Na osnovu izjave 123 birača, od kojih je 121 potpisana i sa matičnim brojem, prikupljenim u roku od 96 sati nakon zatvaranja biračkih mesta u nedelju 21. 6. 2020. godine u 20 časova, uočena su 263 kršenja propisa na 55 biračkih mesta, navodi se na početku Analize koju je o kršenju izbornih pravila 21. juna i uoči izbora sačinio pravni tim liste „Nebojša Zelenović – Šabac je naš“.
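Regarding point 5: if you also want the full article text, you have to follow each article's link from the listing page. A minimal sketch of that extra step (the headline <a> and the 'post-content content' div are assumptions carried over from your original code, so adjust the selectors if the live markup differs):
import requests
from bs4 import BeautifulSoup

def scrape_full_text(article):
    """Follow an <article>'s headline link and return the article body text."""
    link_tag = article.find('h2', {'class': 'article-post-title'})
    if link_tag is None or link_tag.a is None:
        return None
    article_url = link_tag.a['href']
    article_soup = BeautifulSoup(requests.get(article_url).content, 'html.parser')
    body = article_soup.find('div', {'class': 'post-content content'})
    return body.text.strip() if body is not None else None
Inside the for article in articles: loop above you could then add danas['txt'] = scrape_full_text(article). Keep in mind this makes one extra request per article, so a short time.sleep() between calls is a good idea.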
So I wanted to extract all the Bollywood movies, and the project requires movie titles, cast, crew, IMDb ID etc. I am not able to get all the IMDb IDs; I get a NoneType error. When I used it on one page only it was working quite well; however, when I use it on multiple pages it shows an error. Please help.
Here is my code:
#importing the libraries needed
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
from time import sleep
from random import randint
#declaring the list of empty variables, So that we can append the data overall
movie_name = []
year = []
time=[]
rating=[]
votes = []
description = []
director_s = []
starList= []
imdb_id = []
#the whole core of the script
url = "https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start=1&ref_=adv_nxt"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
movie_data = soup.findAll('div', attrs = {'class': 'lister-item mode-advanced'})
for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)
    year_of_release = store.h3.find('span', class_ = "lister-item-year text-muted unbold").text
    year.append(year_of_release)
    runtime = store.p.find("span", class_ = 'runtime').text if store.p.find("span", class_ = 'runtime') else " "
    time.append(runtime)
    rate = store.find('div', class_ = "inline-block ratings-imdb-rating").text.replace('\n', '') if store.find('div', class_ = "inline-block ratings-imdb-rating") else " "
    rating.append(rate)
    value = store.find_all('span', attrs = {'name': "nv"})
    vote = value[0].text if store.find_all('span', attrs = {'name': "nv"}) else " "
    votes.append(vote)
    # Description of the Movies
    describe = store.find_all('p', class_ = 'text-muted')
    description_ = describe[1].text.replace('\n', '') if len(describe) > 1 else ' '
    description.append(description_)
    ## Director
    ps = store.find_all('p')
    for p in ps:
        if 'Director' in p.text:
            director = p.find('a').text
            director_s.append(director)
    ## ID
    imdbID = store.find('span','rating-cancel').a['href'].split('/')[2]
    imdb_id.append(imdbID)
    ## actors
    star = store.find("p", attrs={"class":""}).text.replace("Stars:", "").replace("\n", "").replace("Director:", "").strip()
    starList.append(star)
Error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17576/2711511120.py in <module>
63
64 ## IDs
---> 65 imdbID = store.find('span','rating-cancel').a['href'].split('/')[2] if store.find('span','rating-cancel').a['href'].split('/')[2] else ' '
66 imdb_id.append(imdbID)
67
AttributeError: 'NoneType' object has no attribute 'a'
Change your condition to the following, because first you have to check whether the <span> exists:
imdbID = store.find('span','rating-cancel').a.get('href').split('/')[2] if store.find('span','rating-cancel') else ' '
Example
Check the url; on that page some of the <span> elements are missing:
import requests
from bs4 import BeautifulSoup
#the whole core of the script
url = "https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=my_ratings,desc"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
movie_data = soup.find_all('div', attrs = {'class': 'lister-item mode-advanced'})
for store in movie_data:
    imdbID = store.find('span','rating-cancel').a.get('href').split('/')[2] if store.find('span','rating-cancel') else ' '
    print(imdbID)
Output
tt9900050
tt9896506
tt9861220
tt9810436
tt9766310
tt9766294
tt9725058
tt9700334
tt9680166
tt9602804
Even better, scrape the id via the image tag, because it is always there even when only the placeholder image is present:
imdbID = store.img.get('data-tconst')
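A minimal, self-contained sketch of the same loop using the image tag instead (this assumes the listing markup is the same as above, i.e. every lister item still carries an <img> with a data-tconst attribute):
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start=1&ref_=adv_nxt"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

imdb_id = []
for store in soup.find_all('div', attrs={'class': 'lister-item mode-advanced'}):
    # the poster <img> carries the title id in data-tconst, even when the
    # rating <span> (and therefore 'rating-cancel') is missing
    img = store.img
    imdb_id.append(img.get('data-tconst') if img is not None else ' ')

print(imdb_id[:10])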
Hello, I've created two functions that work well when called alone. But when I try to use a for loop with these functions, I get a problem with my parameter.
First function, to search and get a link to pass to the second one:
USER_AGENT = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
def searchsport(terme):
    url = 'https://www.verif.com/recherche/{}/1/ca/d/?ville=null'.format(terme)
    response = requests.get(url, headers= USER_AGENT)
    response.raise_for_status()
    return terme, response.text

def crawl(keyword):
    if __name__ == '__main__':
        try:
            keyword, html = searchsport(keyword)
            soup = bs(html,'html.parser')
            table = soup.find_all('td', attrs={'class': 'verif_col1'})
            premier = []
            for result in table:
                link = result.find('a', href=True)
                premier.append(link)
            truelink = 'https://www.verif.com/'+str(premier[0]).split('"')[1]
            #print("le lien", truelink)
        except Exception as e:
            print(e)
        finally:
            time.sleep(10)
        return truelink
Second function to scrape a link.
def single_text(item_url):
    source_code = requests.get(item_url)
    print('nivo1 ok')
    plain_text = source_code.text # the html page with all its tags
    soup = bs(plain_text,features="lxml" )
    print('nivo2 ok')
    table = soup.find('table',{'class':"table infoGen hidden-smallDevice"}) # we only want the table tag
    print('nivo1 ok', '\n', table)
    table_rows = table.find_all('tr') # the table data lives in the tr cells
    #print(table_rows)
    l = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = row = [tr.text.strip() for tr in td]
        l.append(row)
    # we strip out some useless characters
    df = pd.DataFrame(l)
    return df
All these functions worked when I tested them on a single link.
Now I have a CSV file with company names; for each name, searchsport() searches the website and the returned link is passed to single_text() to scrape.
for keyword in list(pd.read_csv('sport.csv').name):
    l = crawl(keyword)
    print(l) # THIS PRINTS THE LINK
    single_item(l) # HERE I GET THE PROBLEM
Error:
nivo1 ok
nivo2 ok
nivo1 ok
None
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-55-263d95d6748c> in <module>
3 l = crawl(keyword)
4
----> 5 single_item(item_url=l)
<ipython-input-53-6d3b5c1b1ee8> in single_item(item_url)
7 table = soup.find('table',{'class':"table infoGen hidden-smallDevice"}) # on cherche que la balise table
8 print('nivo1 ok', '\n', table)
----> 9 table_rows = table.find_all('tr') # les données de tables sont dans les celulles tr
10 #print(table_rows)
11
AttributeError: 'NoneType' object has no attribute 'find_all'
When I run this I got a df.
single_item(item_url="https://www.verif.com/societe/COMPANYNAME-XXXXXXXXX/").head(1)
My expected result should be two DataFrames, one for each keyword.
Why doesn't it work?
I have noted, in comments throughout the code below, some of the problems I saw with your code as posted.
Some things I noticed:
Not handling cases where something is not found, e.g. 'PARIS-SAINT-GERMAIN-FOOTBALL' as a search term will fail whereas 'PARIS SAINT GERMAIN FOOTBALL' will not
Missed opportunities for simplification, e.g. creating a dataframe by looping tr then td when you could just use read_html on the table; using find_all when only a single table or tag is needed
Overwriting variables in loops as well as typos e.g.
for tr in table_rows:
    td = tr.find_all('td')
    row = row = [tr.text.strip() for tr in td] # presumably a typo with row = row
Not testing if a dataframe is empty
Risking generating incorrect urls by using 'https://www.verif.com/' as the next part you concatenate on starts with "/" as well
Inconsistent variable naming e.g. what is single_item? The function I see is called single_text.
These are just some observations and there is certainly still room for improvement.
import requests, time
from bs4 import BeautifulSoup as bs
import pandas as pd
def searchsport(terme):
    url = f'https://www.verif.com/recherche/{terme}/1/ca/d/?ville=null'
    response = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
    response.raise_for_status()
    return terme, response.text

def crawl(keyword):
    try:
        keyword, html = searchsport(keyword)
        soup = bs(html,'lxml')
        a_tag = soup.select_one('td.verif_col1 a[href]')
        # your code before, when looping tds, would just overwrite truelink if more than one was found. Instead:
        if a_tag is None:
            # handle the case of no result e.g. when using crawl('PARIS-SAINT-GERMAIN-FOOTBALL') instead of
            # crawl('PARIS SAINT GERMAIN FOOTBALL')
            truelink = ''
        else:
            # print(a_tag['href'])
            # adding to the list premier served no purpose. Using split on href would result in list index out of range
            truelink = f'https://www.verif.com{a_tag["href"]}' # relative link already, so no extra / after .com
    except Exception as e:
        print(e)
        truelink = '' # handle the case of an 'other' fail. Make sure there is an assignment
    finally:
        time.sleep(5)
    return truelink # unless the try succeeded, the original would have failed with local variable referenced before assignment

def single_text(item_url):
    source_code = requests.get(item_url, headers = {'User-Agent':'Mozilla/5.0'})
    print('nivo1 ok')
    plain_text = source_code.text # the html page with all its tags
    soup = bs(plain_text,features="lxml")
    print('nivo2 ok')
    table = soup.select_one('.table') # we only want the table tag
    #print('nivo1 ok', '\n', table)
    if table is None:
        df = pd.DataFrame()
    else:
        df = pd.read_html(str(table))[0] # simplify to work directly with the table and pandas; avoids your loops
    return df

def main():
    terms = ['PARIS-SAINT-GERMAIN-FOOTBALL', 'PARIS SAINT GERMAIN FOOTBALL']

    for term in terms:
        item_url = crawl(term)
        if item_url:
            print(item_url)
            df = single_text(item_url) # what is single_item in your question? The function I see is single_text
            if not df.empty: # test if the dataframe is empty
                print(df.head(1))

if __name__ == '__main__':
    main()
Returning df from main()
import requests, time
from bs4 import BeautifulSoup as bs
import pandas as pd
def searchsport(terme):
    url = f'https://www.verif.com/recherche/{terme}/1/ca/d/?ville=null'
    response = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
    response.raise_for_status()
    return terme, response.text

def crawl(keyword):
    try:
        keyword, html = searchsport(keyword)
        soup = bs(html,'lxml')
        a_tag = soup.select_one('td.verif_col1 a[href]')
        # your code before, when looping tds, would just overwrite truelink if more than one was found. Instead:
        if a_tag is None:
            # handle the case of no result e.g. when using crawl('PARIS-SAINT-GERMAIN-FOOTBALL') instead of
            # crawl('PARIS SAINT GERMAIN FOOTBALL')
            truelink = ''
        else:
            # print(a_tag['href'])
            # adding to the list premier served no purpose. Using split on href would result in list index out of range
            truelink = f'https://www.verif.com{a_tag["href"]}' # relative link already, so no extra / after .com
    except Exception as e:
        print(e)
        truelink = '' # handle the case of an 'other' fail. Make sure there is an assignment
    finally:
        time.sleep(5)
    return truelink # unless the try succeeded, the original would have failed with local variable referenced before assignment

def single_text(item_url):
    source_code = requests.get(item_url, headers = {'User-Agent':'Mozilla/5.0'})
    print('nivo1 ok')
    plain_text = source_code.text # the html page with all its tags
    soup = bs(plain_text,features="lxml")
    print('nivo2 ok')
    table = soup.select_one('.table') # we only want the table tag
    #print('nivo1 ok', '\n', table)
    if table is None:
        df = pd.DataFrame()
    else:
        df = pd.read_html(str(table))[0] # simplify to work directly with the table and pandas; avoids your loops
    return df

def main():
    terms = ['PARIS-SAINT-GERMAIN-FOOTBALL', 'PARIS SAINT GERMAIN FOOTBALL']

    for term in terms:
        item_url = crawl(term)
        if item_url:
            #print(item_url)
            df = single_text(item_url) # what is single_item in your question? The function I see is single_text
    return df

if __name__ == '__main__':
    df = main()
    print(df)
Your error suggests that you are trying to run find_all() against a variable which hasn't been populated, i.e. no tag was found on which you could run find_all(). I have dealt with this by including a statement testing for NoneType:
if VALUE is not None:
    ## code when the tag is found
else:
    ## code when the tag is not found
I think this is the bit where you need to make an update like this:
for tr in table_rows:
    if tr is not None:
        td = tr.find_all('td')
        row = row = [tr.text.strip() for tr in td]
        l.append(row)
        # we strip out some useless characters
        df = pd.DataFrame(l)
    else:
        ## code to run when tr isn't populated
There's a more colourful example of this in action, where some XML is being parsed, here.
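In the traceback shown above, the variable that is actually None is table rather than an individual tr (the page returned no matching <table>), so the same is-not-None pattern can also be applied one level up. A sketch of that, reusing the selectors from the question's single_text():
table = soup.find('table', {'class': "table infoGen hidden-smallDevice"})
if table is not None:
    rows = []
    for tr in table.find_all('tr'):   # the table data lives in the tr cells
        rows.append([td.text.strip() for td in tr.find_all('td')])
    df = pd.DataFrame(rows)
else:
    df = pd.DataFrame()               # empty frame when the table isn't found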
I have this:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    aux.append(equipo)
If I do print(aux[0]) I get this:
,
Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
And my problem is that I want to take the tag:
<h2 class="cintillo">Villarreal</h2>
And the tag:
<span class="dorsal-jugador">1 Sergio Asenjo</span>
And put it into a database.
How can I take that?
Thanks
You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the inner HTML (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    h2 = equipo.find('h2', {'class':'cintillo'}).text
    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    aux.append([h2,[j.text for j in jugadores]])
# format list for printing
print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/#glhr/55550385
You could create a dictionary with team names as keys and, for each team, a {entrenador: players} dictionary as the value:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')
teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador : players}
print(teams)
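Since the stated goal was to put the data into a database, a minimal sketch of that last step, assuming an SQLite file and a simple (equipo, entrenador, jugador) table layout (both are illustrative choices, not anything the site dictates):
import sqlite3

# `teams` is the {team_name: {entrenador: [players]}} dict built above
conn = sqlite3.connect('equipos.db')
conn.execute('CREATE TABLE IF NOT EXISTS jugadores (equipo TEXT, entrenador TEXT, jugador TEXT)')
for team_name, squad in teams.items():
    for entrenador, players in squad.items():
        conn.executemany(
            'INSERT INTO jugadores VALUES (?, ?, ?)',
            [(team_name, entrenador, player) for player in players],
        )
conn.commit()
conn.close()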
I'm working with a public bus website, looking at a specific bus stop (see the variable 'url'), and I want to parse every column (Bus Line - Depart time - ETA) into its own list, but I'm getting weird results with this code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.stcp.pt/pt/itinerarium/soapclient.php?codigo=AAL1'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
buses = []
for table in soup.find_all('table', attrs={'id': 'smsBusResults'}):
    for row in table.find_all('tr', attrs={'class': 'even'}):
        for col in row.find_all('td'):
            buses.append(row.get_text().strip())
print(buses)
Note: If you see "a passar", that means "passing by"
try this
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = requests.get('http://www.stcp.pt/pt/itinerarium/soapclient.php?codigo=AAL1').content
soup = BeautifulSoup(data)
table = soup.find_all('table', {'id':'smsBusResults'})
tr = table[0].find_all('tr')
headers = []
for td in tr[0].find_all('th'):
    headers.append(td.text)

temp_df = pd.DataFrame(columns=headers)
pos = 0
for i in range(1,len(tr)):
    temp_list = []
    for td in tr[i].find_all('td'):
        value = (td.text).replace('\n','')
        value = value.replace('\t','')
        temp_list.append(value)
    temp_df.loc[pos] = temp_list
    pos+=1
print(temp_df)
Output
Linha Hora Prevista Tempo de Espera
0 600 AV. ALIADOS 16:29 1min
1 202 AV.ALIADOS - 16:34 6min
2 600 AV. ALIADOS 16:41 12min
3 600 AV. ALIADOS 16:50 21min
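If the end goal is still one list per column, each column of temp_df converts straight to a plain Python list (the column names here are simply the headers printed above):
bus_lines    = temp_df['Linha'].tolist()
depart_times = temp_df['Hora Prevista'].tolist()
etas         = temp_df['Tempo de Espera'].tolist()
print(bus_lines)
print(depart_times)
print(etas)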
I'm trying to parse the data for the details which are under the same tags, but I'm unable to do it.
The script which I tried:
import re
import pytz
import requests
import datetime
from flask import url_for
from bs4 import BeautifulSoup
from urllib.parse import urljoin
bigbash_article_link = "http://www.espncricinfo.com/ci/content/squad/1134829.html"
r = requests.get(bigbash_article_link)
bigbash_article_html = r.text
soup = BeautifulSoup(bigbash_article_html, "html.parser")
items = soup.find_all("div",{"class":"large-7 medium-7 small-7 columns"})
items1 = soup.find_all("h3")
items2 = soup.find_all("span")
bigbash_article_dict = []
for div in items:
    a = div.find('img')['src']
    b = 'http://www.espncricinfo.com/'
    c = urljoin(b,a)
    print(c)
    #c[bigbash_article_dict]
    #print(bigbash_article_dict)

for div in items1:
    a = div.find('a').string
    print(a)

for div in items2:
    a = (div.find('span')).text
    print(a)
I get the output as follows:
http://www.espncricinfo.com/inline/content/image/1099912.html?alt=icon
http://www.espncricinfo.com/inline/content/image/751925.html?alt=icon
http://www.espncricinfo.com/inline/content/image/599004.html?alt=icon
http://www.espncricinfo.com/inline/content/image/549144.html?alt=icon
http://www.espncricinfo.com/inline/content/image/986769.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099468.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1100136.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1100133.html?alt=icon
http://www.espncricinfo.com/inline/content/image/721225.html?alt=icon
http://www.espncricinfo.com/inline/content/image/818215.html?alt=icon
http://www.espncricinfo.com/inline/content/image/443920.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1080507.html?alt=icon
http://www.espncricinfo.com/inline/content/image/986785.html?alt=icon
http://www.espncricinfo.com/inline/content/image/517833.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099482.html?alt=icon
http://www.espncricinfo.com/inline/content/image/708777.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1093893.html?alt=icon
http://www.espncricinfo.com/inline/content/image/818165.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099914.html?alt=icon
Virat Kohli
Moeen Ali
Murugan Ashwin
Yuzvendra Chahal
Aniket Choudhary
Nathan Coulter-Nile
Colin de Grandhomme
Quinton de Kock
Pavan Deshpande
AB de Villiers
Aniruddha Joshi
Sarfaraz Khan
Kulwant Khejroliya
Brendon McCullum
Mandeep Singh
Mohammed Siraj
Pawan Negi
Parthiv Patel
Navdeep Saini
Tim Southee
Manan Vohra
Washington Sundar
Chris Woakes
Umesh Yadav
Traceback (most recent call last):
File "qwe.py", line 41, in <module>
a =(div.find('span')).text
AttributeError: 'NoneType' object has no attribute 'text'
I get an AttributeError if I try to parse the details inside the span tags. Is there any way to extract all the parsed details into one list of dictionaries?
The output I'm trying to get:
[
{'image':'http://www.espncricinfo.com/inline/content/image/1099912.html?alt=icon','name':'Virat Kohli','role':'captian','Age':'29 years 84 days','Playing role': 'Top-order batsman', 'Batting': 'Right-hand bat', 'Bowling': 'Right-arm medium'}
...
...
...
{'image':'http://www.espncricinfo.com/inline/content/image/1099914.html?alt=icon','name':'Umesh Yadav','role':'captian','Age':' 30 years 95 days','Playing role': 'Bowler', 'Batting': 'Right-hand bat', 'Bowling': 'Right-arm fast-medium'}
]
Try the following. I am iterating over the li tags instead:
details = soup.find("div",{"class":"large-20 medium-20 small-20 columns"})
list = details.find_all('li')
bigbash_article_dict = {}
for div in list:
    image_div = div.find("div", {"class": "large-7 medium-7 small-7 columns"})
    image_present = False
    image_sub_path = "http://www.espncricinfo.com/dummyImage"
    if image_div is not None:
        image_sub_path = image_div.find('img')['src']
        image_present = True
    domain = 'http://www.espncricinfo.com/'
    image_path = urljoin(domain,image_sub_path)
    bigbash_article_dict['image'] = image_path

    if image_present:
        details_div = div.find("div",{"class":"large-13 medium-13 small-13 columns"})
    else:
        details_div = div.find("div",{"class":"large-13 medium-13 small-20 columns"})

    name = details_div.find('a').text.strip()
    bigbash_article_dict['name'] = name

    for span in details_div.find_all('span'):
        info = span.text
        if ':' not in info:
            key = "Role"
            value = info
        else:
            key = info.split(':')[0]
            value = info.split(':')[1]
        bigbash_article_dict[key] = value

    print(bigbash_article_dict)
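If you want the list-of-dictionaries output shown in the question rather than one print per player, a condensed variation of the same loop (a sketch reusing the class names above, not re-tested against the live page) that collects a fresh dict per li:
bigbash_article_list = []
for div in list:
    player = {}                                            # fresh dict per player
    image_div = div.find("div", {"class": "large-7 medium-7 small-7 columns"})
    if image_div is not None:
        player['image'] = urljoin('http://www.espncricinfo.com/', image_div.find('img')['src'])
        details_div = div.find("div", {"class": "large-13 medium-13 small-13 columns"})
    else:
        player['image'] = 'http://www.espncricinfo.com/dummyImage'
        details_div = div.find("div", {"class": "large-13 medium-13 small-20 columns"})
    player['name'] = details_div.find('a').text.strip()
    for span in details_div.find_all('span'):
        info = span.text
        key, _, value = info.partition(':')
        if not value:                                      # no ':' means this span holds the role
            key, value = 'Role', info
        player[key] = value.strip()
    bigbash_article_list.append(player)

print(bigbash_article_list)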