BeautifulSoup - Scrape multiple pages


I want to scrape the names of the members from each page, then move on to the next page and do the same. My code works for only one page. I'm very new to this, so any advice would be appreciated. Thank you.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bodia.com/spa-members/page/1")
soup = BeautifulSoup(r.text,"html.parser")
lights = soup.findAll("span",{"class":"light"})
lights_list = []
for l in lights[0:]:
    result = l.text.strip()
    lights_list.append(result)
print (lights_list)
I tried this and it only gives me the members of page 3.
for i in range (1,4): #to scrape names of page 1 to 3
    r = requests.get("https://www.bodia.com/spa-members/page/"+ format(i))
soup = BeautifulSoup(r.text,"html.parser")
lights = soup.findAll("span",{"class":"light"})
lights_list = []
for l in lights[0:]:
    result = l.text.strip()
    lights_list.append(result)
print (lights_list)
Then I tried this :
i = 1
while i<5:
    r = requests.get("https://www.bodia.com/spa-members/page/"+str(i))
    i+=1
    soup = BeautifulSoup(r.text,"html.parser")
    lights = soup.findAll("span",{"class":"light"})
    lights_list = []
    for l in lights[0:]:
        result = l.text.strip()
        lights_list.append(result)
    print (lights_list)
It gives me the names of 4 members, but I don't know which page each one comes from:
['Seng Putheary (Nana)']
['Marco Julia']
['Simon']
['Ms Anne Guerineau']

Only one change is really needed to get it to scrape everything, plus one optional cleanup.
You were not looping over all the code. It appears only the requests.get line ran inside the for loop, so by the time the parsing code ran, r held just the last page (page 3) and only one set of names was printed. Indenting everything under the for loop fixes that.
r = requests.get("https://www.bodia.com/spa-members/page/"+ format(i)) does work, because the built-in format(i) returns the same string as str(i), but r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i)) is the more conventional way to build the URL.
import requests
from bs4 import BeautifulSoup
for i in range (1,4): #to scrape names of page 1 to 3
    r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i))
    soup = BeautifulSoup(r.text,"html.parser")
    lights = soup.findAll("span",{"class":"light"})
    lights_list = []
    for l in lights[0:]:
        result = l.text.strip()
        lights_list.append(result)
    print(lights_list)
The above code was spitting out a list of names every 3 seconds for the pages it scraped.
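Since the question also asks which page each name came from, here is a minimal sketch that keeps the names grouped by page number. The function names extract_names and scrape_pages, the fetch parameter, and the 10-second timeout are my own assumptions for illustration, not part of the original code; the URL pattern and the span.light selector are taken from the question.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.bodia.com/spa-members/page/{}"  # URL pattern from the question


def extract_names(html):
    """Return the stripped text of every <span class="light"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.text.strip() for span in soup.find_all("span", class_="light")]


def scrape_pages(first, last, fetch=None):
    """Map each page number in [first, last] to the names found on that page."""
    if fetch is None:
        # Default fetcher (hypothetical helper): one HTTP GET per page.
        def fetch(page):
            return requests.get(BASE_URL.format(page), timeout=10).text
    return {page: extract_names(fetch(page)) for page in range(first, last + 1)}


if __name__ == "__main__":
    # Only hits the network when run as a script.
    for page, names in scrape_pages(1, 3).items():
        print(page, names)
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without making any requests, and the dictionary keys answer the "which page" question directly.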

Related

Webscraping and finding elements

I am trying to find out when a game has been postponed and get the related team information or game number because I append the team abbreviation to a list. What currently happens is that it is only getting the items that are postponed, and skipping over the games that do not have a postponement. I think I need to change the soup.select line, or do something slightly different, but cannot figure it out.
The code does not throw any errors, but the list returned is [0,1,2,3]. However, if you open https://www.rotowire.com/baseball/daily-lineups.php, it should return [0,1,14,15] because those are the team elements with a game postponed.
from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
x = 0
gamesRemoved = []
for tag in soup.select(".lineup__main > div"):
    ppcheck = tag.text
    if "POSTPONED" in ppcheck:
        print(x)
        print('Postponement')
        first_team = x*2
        print(first_team)
        gamesRemoved.append(first_team)
        second_team = x*2+1
        gamesRemoved.append(second_team)
        x+=1
    else:
        x+=1
        continue
print(gamesRemoved)

You can use BeautifulSoup.select and check if 'is-postponed' exists as a class name in the lineup box:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://www.rotowire.com/baseball/daily-lineups.php').text, 'html.parser')
p = [j for i, a in enumerate(d.select('.lineup.is-mlb')) for j in [i*2, i*2+1] if 'is-postponed' in a['class']]
Output:
[0, 1, 14, 15]
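For readers who find the one-line comprehension hard to follow, the same idea can be sketched as an explicit loop. The function name postponed_team_indices and the sample HTML are made up for illustration; the selector and the 'is-postponed' class come from the answer above, and the real page markup may have changed since.

```python
from bs4 import BeautifulSoup


def postponed_team_indices(html):
    """Return the two team indices (2*i and 2*i + 1) for each postponed game i."""
    soup = BeautifulSoup(html, "html.parser")
    indices = []
    for game_number, box in enumerate(soup.select(".lineup.is-mlb")):
        # BeautifulSoup exposes the class attribute as a list of class names.
        if "is-postponed" in box.get("class", []):
            indices.extend([game_number * 2, game_number * 2 + 1])
    return indices


# Tiny made-up sample standing in for the real page.
sample = """
<div class="lineup is-mlb is-postponed"></div>
<div class="lineup is-mlb"></div>
<div class="lineup is-mlb is-postponed"></div>
"""
print(postponed_team_indices(sample))  # [0, 1, 4, 5]
```

The key difference from the question's code is that enumerate counts every lineup box, postponed or not, so the game number stays aligned with the page.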

Web scraping with bs4 python: How to display football matchups

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and will send it through SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)

The code below should do the trick for you:
from bs4 import BeautifulSoup
import requests
a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text,features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]: #skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0,len(teams),2):
    print (teams[x]+" vs "+ teams[x+1])
Let me explain what I have done:
All the team names share this class name: swap-text--bp30
So we can use find_all to extract every element with that class.
Once we have the results we can collect them in a list, "teams = []", by appending in a for loop with "teams.append(i.text)". ".text" drops the HTML tags and keeps only the text.
Then we strip the "\n" from each entry in the list and print the strings two by two.
This should be your final output:
EDIT: To scrape the title of the leagues we will do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="fixres__header3"): # every league header on the page
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code which is essentially just printing the league then the two teams over and over:
for x in range(0,len(teams),5):
    final.append(teams[x]+" vs "+ teams[x+1])
for i in league:
    print(i)
    for i in final:
        print(i)
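The "two by two" pairing step described in this answer can be sketched on its own, without touching the network. This is only an illustration under the assumption that the list alternates home and away names; pair_fixtures and the sample team names are made up. Indexing teams[x+1] raises an IndexError when the list has an odd length, whereas zip simply drops the unpaired last entry.

```python
def pair_fixtures(teams):
    """Turn a flat [home, away, home, away, ...] list into 'A vs B' strings."""
    cleaned = [name.strip() for name in teams]
    # Pair even positions with odd positions; a trailing unpaired name is dropped.
    return [f"{home} vs {away}" for home, away in zip(cleaned[0::2], cleaned[1::2])]


# Made-up names with the stray newlines that .text tends to leave behind.
sample = ["\nTeam A\n", "\nTeam B\n", "\nTeam C\n", "\nTeam D\n"]
for line in pair_fixtures(sample):
    print(line)
```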

Webscrape Python output doesn't match website

I am scraping a website and am trying to get the number of search results from my search so that I can use that number to determine how many pages to scrape. Here is an example:
#!/usr/bin/python3
from bs4 import BeautifulSoup
import requests
import csv
url_list = []
item_list = []
page_ctr = 0
item_ctr = 0
num_pages = 0;
my_url = 'https://www.walmart.com/search/?query=games&20lego'
get_page_num = requests.get(my_url)
num = get_page_num.content
num_soup = BeautifulSoup(num, 'lxml')
num_soup.prettify()
print(num_soup.prettify())
#num_sum = num_soup.find('div', {'class': 'result-summary-container'}).text()
#print(num_sum)
#num_pages = (num_sum[1]/40) + 1
When I inspect the element in Chrome, or just look at the page with my own eyes, I see 230 results, but when I look at my output I get something more like this:
</span> of 1,000+ results</div>
I'm very new to web scraping. Can anyone explain this?
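This does not explain the mismatch itself, but the page-count arithmetic in the commented-out lines can be sketched separately. It assumes 40 results per page, as in the question's commented code; parse_result_count and pages_needed are hypothetical helper names. Note that a summary such as "1,000+" is only a lower bound, so the page count derived from it is a lower bound too.

```python
import math
import re

ITEMS_PER_PAGE = 40  # taken from the commented-out line in the question


def parse_result_count(summary):
    """Extract the integer from text such as 'of 1,000+ results'."""
    match = re.search(r"(\d[\d,]*)\+?\s+results", summary)
    if match is None:
        raise ValueError(f"no result count found in {summary!r}")
    return int(match.group(1).replace(",", ""))


def pages_needed(result_count, per_page=ITEMS_PER_PAGE):
    """Round up so that a partly filled last page is still counted."""
    return math.ceil(result_count / per_page)


print(pages_needed(parse_result_count("of 1,000+ results")))  # 25
print(pages_needed(230))  # 6
```

Rounding up with math.ceil avoids the extra page that (n / 40) + 1 would add whenever the count is an exact multiple of 40.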

Python Beautiful Soup not looping results

I'm using BS4 for the first time and need to scrape the items from an online catalogue to CSV.
I have set up my code; however, when I run it, the results only repeat the first item in the catalogue n times (where n is the number of items).
Can someone review my code and let me know where I am going wrong?
Thanks
import requests
from bs4 import BeautifulSoup
from csv import writer
#response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/27/anaesthetic-oxygen-and-resuscitation?CoreListRequest=BrowseCoreList')
response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text , 'html.parser')
items = soup.find_all(class_='productPrevDetails')
#print(items)
for item in items:
    ItemCode = soup.find(class_='product_npc ').get_text().replace('\n','')
    ItemNameS = soup.select('p')[58].get_text()
    ProductInfo = soup.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)

You always see the first result because you are searching soup (the whole page), not the item. Try:
for item in items:
    ItemCode = item.find(class_='product_npc ').get_text().replace('\n','')
    # [58] counted <p> tags across the whole page; within one item the index will differ, so adjust it
    ItemNameS = item.select('p')[58].get_text()
    ProductInfo = item.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
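The difference between searching soup and searching item can be seen on a small made-up snippet. The sample HTML below is invented for illustration and only borrows the class names from the question; the real catalogue markup will be richer.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for two catalogue entries.
sample = """
<div class="productPrevDetails"><span class="product_npc">A001</span>
  <span class="product_key_info">First item</span></div>
<div class="productPrevDetails"><span class="product_npc">B002</span>
  <span class="product_key_info">Second item</span></div>
"""

soup = BeautifulSoup(sample, "html.parser")
items = soup.find_all(class_="productPrevDetails")

# Searching the whole soup inside the loop finds the first match every time.
wrong = [soup.find(class_="product_npc").get_text(strip=True) for _ in items]

# Searching each item restricts the lookup to that item's own subtree.
rows = []
for item in items:
    code = item.find(class_="product_npc").get_text(strip=True)
    info = item.find(class_="product_key_info").get_text(strip=True)
    rows.append((code, info))

print(wrong)  # ['A001', 'A001']
print(rows)   # [('A001', 'First item'), ('B002', 'Second item')]
```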

Parsing error with Beautiful Soup 4 and Python

I need to get the list of the rooms from this website: http://www.studentroom.ch/en/dynasite.cfm?dsmid=106547
I'm using Beautiful Soup 4 in order to parse the page.
This is the code I wrote until now:
from bs4 import BeautifulSoup
import urllib
pageFile = urllib.urlopen("http://studentroom.ch/dynasite.cfm?dsmid=106547")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
roomsNoFilter = soup.find('div', {"id": "ImmoListe"})
rooms = roomsNoFilter.table.find_all('tr', recursive=False)
for room in rooms:
    print room
    print "----------------"
print len(rooms)
For now I'm only trying to get the rows of the table.
But I get only 7 rows instead of 78 (or 77).
At first I thought that I was receiving only partial HTML, but I printed the whole HTML and I'm receiving it correctly.
There are no Ajax calls that load new rows after the page has loaded...
Could someone please help me find the error?

This is working for me
soup = BeautifulSoup(pageHtml)
div = soup.select('#ImmoListe')[0]
table = div.select('table > tbody')[0]
k = 0
for room in table.find_all('tr'):
    if 'onmouseout' in str(room):
        print room
        k = k + 1
print "Total ",k
Let me know the status
