I'm new to programming and can't figure out why this won't loop. It prints and converts the first item exactly how I want, but it stops after the first iteration.
from bs4 import BeautifulSoup
import requests
import re
import json
url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')
section = html.find_all('ol', class_='row')
for books in section:
    #Title Element
    header_element = books.find("article", class_='product_pod')
    title_element = header_element.img
    title = title_element['alt']
    #Price Element
    price_element = books.find(class_='price_color')
    price_str = str(price_element.text)
    price = price_str[1:]
    #Create JSON
    final_results_json = {"Title":title, "Price":price}
    final_result = json.dumps(final_results_json, sort_keys=True, indent=1)
    print(title)
    print(price)
    print()
    print(final_result)
First, clarify what you are looking for. Presumably you want to print the title, price and final_result for every book scraped from books.toscrape.com. The code is working exactly as written; it is the expectation that differs. Notice that you are finding all the "ol" tags with class name "row", and there is just one such element on the page, so section contains only one element and the for loop iterates just once.
How to debug it?
Check the type of section: type(section)
Print section to see what it actually contains
Add some print statements inside the for loop to understand what happens when
It isn't hard to debug this one (see the quick sketch below).
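For example, a couple of quick checks along those lines (reusing the section variable from the question) make the problem visible:

section = html.find_all('ol', class_='row')
print(type(section))   # <class 'bs4.element.ResultSet'>
print(len(section))    # 1 -- only one <ol class="row"> on the page
for books in section:
    print(books.name)  # 'ol': the loop runs once, over the list container, not over the books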
You need to change:
section = html.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
There is only one <ol> in that document.
I think you want
for book in section[0].find_all('li'):
An ol is an ordered list, of which there is only one in this case; there are many li (list item) elements inside that ol.
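Putting those two suggestions together, a minimal sketch of the corrected loop (assuming the page structure described above, i.e. one article.product_pod per book) might look like this:

from bs4 import BeautifulSoup
import requests
import json

url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')

# Iterate over the individual books, not the single <ol> that contains them
for book in html.find_all('article', class_='product_pod'):
    title = book.img['alt']                            # the title is stored in the image's alt text
    price = book.find(class_='price_color').text[1:]   # drop the leading currency symbol
    print(json.dumps({"Title": title, "Price": price}, sort_keys=True, indent=1))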
Related
I've written a simple python script for web scraping:
import requests
from bs4 import BeautifulSoup
for i in range(1,3):
url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg="+str(i)
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
list = soup.find_all("li",{"class":"column"})
for li in list:
name = li.div.a.h3.text.strip()
print(name)
link = li.div.a.get("href")
oldprice = li.find("div",{"class":"proDetail"}).find_all("a")[0].text.strip().strip('TL')
newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL')
print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")
It gives me a list index out of range error in the line newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL')
Why am I getting this error? How can I fix it?
As mentioned above, your code is not returning as many elements as you expect.
In newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL'), the find_all("a") is returning a list containing only one <a> tag, so index [1] does not exist.
Additionally, you should check which web page this is happening on. By that I mean:
for i in range(1,3):
url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg="+str(i)
It could also be the case that the code fails when i=1, when i=2, or both, so you should examine each web page as well.
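One defensive way to handle this, sketched below under the assumption that items without a discount expose only a single price link inside the proDetail container, is to check the length of the result before indexing:

import requests
from bs4 import BeautifulSoup

for i in range(1, 3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg=" + str(i)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for li in soup.find_all("li", {"class": "column"}):
        name = li.div.a.h3.text.strip()
        link = li.div.a.get("href")
        prices = li.find("div", {"class": "proDetail"}).find_all("a")
        # Guard the indexing: fall back to the old price when no second (discounted) price exists
        oldprice = prices[0].text.strip().strip('TL') if prices else None
        newprice = prices[1].text.strip().strip('TL') if len(prices) > 1 else oldprice
        print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")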
I'm a beginner with Python and am trying to create a program that scrapes the football/soccer schedule from skysports.com and sends it to my phone as an SMS through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I'm getting stuck on so far:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)
The code below should do the trick for you:
from bs4 import BeautifulSoup
import requests
a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text,features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
for i in soup.find_all(class_="swap-text--bp30")[1:]: #skips the first one because that's a heading
teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0,len(teams),2):
    print(teams[x] + " vs " + teams[x+1])
Let me further explain what I have done:
All the football fixtures have this class name: swap-text--bp30.
So we can use find_all to extract all the elements with that class.
Once we have our results we can put them into a list, teams = [], and append each one inside a for loop with teams.append(i.text); .text strips away the HTML.
Then we can get rid of the "\n" in the list by stripping it and print the strings out two by two.
That should give you the final output.
EDIT: To scrape the league titles we do pretty much the same thing:
league = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="fixres__header3"): # the league headings
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code, which essentially just prints the league and then the two teams over and over:
for x in range(0,len(teams),5):
    final.append(teams[x]+" vs "+ teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)
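If you would rather build the fixtures straight from the participant spans the question targeted, here is a minimal sketch; it assumes the class names from the question are still correct and simply switches from find to find_all so that every match, not just the first one, is returned:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.skysports.com/football-fixtures")
soup = BeautifulSoup(page.content, "html.parser")

# find_all returns every matching span, so the two lists can be zipped pairwise
team1 = soup.find_all('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
team2 = soup.find_all('span', {"class": "matches__item-col matches__participant matches__participant--side2"})

for home, away in zip(team1, team2):
    print(home.get_text(strip=True), "vs", away.get_text(strip=True))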
I'm using BS4 for the first time and need to scrape the items from an online catalogue to CSV.
I have set up my code, however when I run it the results just repeat the first item in the catalogue n times (where n is the number of items).
Can someone review my code and let me know where I am going wrong?
Thanks
import requests
from bs4 import BeautifulSoup
from csv import writer
#response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/27/anaesthetic-oxygen-and-resuscitation?CoreListRequest=BrowseCoreList')
response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text , 'html.parser')
items = soup.find_all(class_='productPrevDetails')
#print(items)
for item in items:
    ItemCode = soup.find(class_='product_npc ').get_text().replace('\n','')
    ItemNameS = soup.select('p')[58].get_text()
    ProductInfo = soup.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
You always see the first result because you are searching soup, not the item. Try
for item in items:
    ItemCode = item.find(class_='product_npc ').get_text().replace('\n','')
    ItemNameS = item.select('p')[58].get_text()
    ProductInfo = item.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
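Since the question imports csv.writer but never uses it, a minimal sketch of writing the scraped rows out to a file could look like this; the file name catalogue.csv is just a placeholder, and the per-item lookups are copied unchanged from the answer above on the assumption that they return what you expect:

import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text, 'html.parser')

with open('catalogue.csv', 'w', newline='') as f:
    csv_out = writer(f)
    csv_out.writerow(['ItemCode', 'ItemName', 'ProductInfo'])  # header row
    for item in soup.find_all(class_='productPrevDetails'):
        ItemCode = item.find(class_='product_npc ').get_text().replace('\n', '')
        ItemNameS = item.select('p')[58].get_text()
        ProductInfo = item.find(class_='product_key_info').get_text()
        csv_out.writerow([ItemCode, ItemNameS, ProductInfo])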
When I run my scraper it fetches the titles and the hrefs of those titles from a webpage. The page has a pagination option in the footer containing six next-page links, which are scraped by the second "print" in my scraper. At this point, though, I can't make use of those next-page links: I can't find a way to feed them back into the function so that I can grab the titles and hrefs from each next page as well. Sorry for any mistakes I've made, and thanks in advance for taking a look.
import requests
from lxml import html
Page_link="http://www.wiseowl.co.uk/videos/"
def GrabbingData(url):
base="http://www.wiseowl.co.uk"
response = requests.get(url)
tree = html.fromstring(response.text)
title = tree.xpath('//p[#class="woVideoListDefaultSeriesTitle"]/a/text()')
link = tree.xpath('//p[#class="woVideoListDefaultSeriesTitle"]/a/#href')
for i,j in zip(title,link):
print(i,j)
pagination=tree.xpath("//div[contains(concat(' ', #class, ' '), ' woPaging ')]//a[#class='woPagingItem' or #class='woPagingNext']/#href")
for nextp in pagination:
print(base + nextp)
GrabbingData(Page_link)
You can easily make it a recursive function, like this:
import requests
from lxml import html
Page_link="http://www.wiseowl.co.uk/videos/"
visited_links = []
def GrabbingData(url):
base="http://www.wiseowl.co.uk"
response = requests.get(url)
visited_links.append(url)
tree = html.fromstring(response.text)
title = tree.xpath('//p[#class="woVideoListDefaultSeriesTitle"]//a/text()')
link = tree.xpath('//p[#class="woVideoListDefaultSeriesTitle"]//a/#href')
for i,j in zip(title,link):
print(i,j)
pagination=tree.xpath("//div[contains(concat(' ', #class, ' '), ' woPaging ')]//a[#class='woPagingItem' or #class='woPagingNext']/#href")
for nextp in pagination:
url1 = str(base + nextp)
if url1 not in visited_links:
#print(url1)
GrabbingData(url1)
if __name__ == "__main__":
    GrabbingData(Page_link)
Since the HTML of a next-page URL will itself contain "Back" links, I also added a visited_links list so that you don't revisit pages you have already scraped and don't end up in an infinite loop.
The last part starting with
if __name__ == "__main__":
is commonly used to call a function only when the file is run directly (as opposed to being imported).
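If the site ever has enough pages that deep recursion becomes a concern, an equivalent iterative sketch is shown below; the function name grab_all is hypothetical, the XPath expressions are taken unchanged from the answer above, and a work queue replaces the recursion:

import requests
from collections import deque
from lxml import html

def grab_all(start_url):
    base = "http://www.wiseowl.co.uk"
    queue = deque([start_url])
    visited = set()
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        tree = html.fromstring(requests.get(url).text)
        titles = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/text()')
        links = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/@href')
        for title, link in zip(titles, links):
            print(title, link)
        # Queue every pagination link; the visited set prevents loops
        pagination = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem' or @class='woPagingNext']/@href")
        for nextp in pagination:
            queue.append(base + nextp)

grab_all("http://www.wiseowl.co.uk/videos/")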
This script is generating a CSV with the data from only one of the URLs fed into it. There should be 98 sets of results, however the for loop isn't getting past the first URL.
I've been working on this for 12+ hours today; what am I missing in order to get the correct results?
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup
        print(r.text)
with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        th = page.find('b',text="Category")
        td = th.findNext()
        for link in td.findAll('a',href=True):
            match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
            if match:
                web_address = link.text
        gyms = [name,address,phoneNum,email,web_address]
        gyms.append(gyms)
#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
    writer = csv.writer(file)
    for row in gyms:
        writer.writerow([row])
You have three for-loops in your code and do not specify which one causes the problem. I assume it is the one in the get_page_data() function.
You leave the loop on the very first run because of the return statement. That is why you never get to the second URL.
There are at least two possible solutions:
Append the parsed data for every URL to a list and return that list (sketched below).
Move your processing code into the loop and append the parsed data to gyms inside the loop.
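A quick sketch of the first option, under the assumption that you simply want one parsed page per URL collected before any processing happens:

import requests
from bs4 import BeautifulSoup

def get_page_data(urls):
    soups = []
    for url in urls:
        r = requests.get(url.strip())
        soups.append(BeautifulSoup(r.text, 'html.parser'))
    return soups  # one parsed page per URL, instead of returning on the first one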
As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup  # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.
You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data into a list and write it all in one go with csv.writer.writerows().
Also you should pass the URL list to get_page_data() rather than accessing it from a global variable.
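A minimal sketch of the accumulate-then-write variant, keeping the four span lookups from the snippet above and passing the open URL file into get_page_data() as suggested, might look like this:

import csv
import requests
from bs4 import BeautifulSoup

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        yield BeautifulSoup(r.text, 'html.parser')

rows = []
with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        rows.append([
            page.find("span", {"class": "wlt_shortcode_TITLE"}).text,
            page.find("span", {"class": "wlt_shortcode_map_location"}).text,
            page.find("span", {"class": "wlt_shortcode_phoneNum"}).text,
            page.find("span", {"class": "wlt_shortcode_EMAIL"}).text,
        ])

# Write everything in one go once all pages have been processed
with open("xgyms.csv", "w", newline="") as out_file:
    csv.writer(out_file).writerows(rows)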