I am trying to loop through multiple pages and my code doesn't extract anything. I am kind of new to scraping so bear with me. I made a container so I can target each listing. I also made a variable to target the anchor tag that you would press to go to the next page. I would really appreciate any help I could get. Thanks.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0,25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "Nambusiness_name, business_address, business_city, business_region, business_phone_number\n"
    f.write(Headers)
    my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grabs each listing
    containers = page_soup.findAll("div",{"class": "result"})
    new = page_soup.findAll("a", {"class":"next ajax-page"})
    for i in new:
        try:
            for container in containers:
                b_name = i.find("container.h2.span.text").get_text()
                b_addr = i.find("container.p.span.text").get_text()
                city_container = container.findAll("span",{"class": "locality"})
                b_city = i.find("city_container[0].text ").get_text()
                region_container = container.findAll("span",{"itemprop": "postalCode"})
                b_reg = i.find("region_container[0].text").get_text()
                phone_container = container.findAll("div",{"itemprop": "telephone"})
                b_phone = i.find("phone_container[0].text").get_text()
                print(b_name, b_addr, b_city, b_reg, b_phone)
                f.write(b_name + "," +b_addr + "," +b_city.replace(",", "|") + "," +b_reg + "," +b_phone + "\n")
        except: AttributeError
    f.close()
If you are using BS4, try find_all (findAll is the older name for the same method).
Try dropping into a debugger with import pdb; pdb.set_trace() and inspect what is being selected in the for loop.
Also, some content may be missing from the HTML if it is loaded via JavaScript.
Each anchor tag or href that you would "click" is just another network request. If you plan to follow the links, consider adding a delay between requests so you don't get blocked.
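The advice about spacing out requests can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_politely`, its `delay` parameter, and the injected `fetch` and `sleep` callables are hypothetical names, and the one-second default is an arbitrary choice, not a value any site documents.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    `fetch` and `sleep` are injected so the sketch can be tried without
    network access; both are assumptions made for illustration.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Example with a stand-in fetch function instead of a real HTTP call.
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: url.upper(),
    delay=0.0,
)
print(pages)
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`, with a delay chosen to suit the site.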
You can try something like the script below. It traverses the pages through pagination and collects the name and phone number from each container.
import requests
from bs4 import BeautifulSoup
my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"
for link in [my_url.format(page) for page in range(1,5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""
        print(name,phone)
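The script above only prints its results. Since the original question wanted a CSV file, here is a hedged sketch of writing rows with the standard `csv` module, which quotes fields that contain commas so that a manual `replace(",", "|")` is not needed. The `rows_to_csv` helper and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write (name, phone) pairs to the file-like object `out` as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["business_name", "business_phone_number"])
    writer.writerows(rows)


# Made-up sample data; in the scraper these would come from the page loop.
sample_rows = [
    ("Example Store, Inc.", "(312) 555-0100"),
    ("Another Shop", ""),
]

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
rows_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

Opening the file once, before the page loop, also avoids overwriting it on every iteration.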
Hey, how can I change this code to go through each page and get the info I want from this URL (the book name and the URL of the book)?
I wrote this code (with Google's help), but I want to get all the books from all the pages (50 pages).
# import web grabbing client and
# HTML parser
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
# variable to store website link as string
booksURL = 'http://books.toscrape.com/'
# grab website and store in variable urlClient
urlClient = uReq(booksURL)
# read and close HTML
page_html = urlClient.read()
urlClient.close()
# call BeautifulSoup for parsing
page_soup = soup(page_html, "html.parser")
# grabs all the products under list tag
bookshelf = page_soup.findAll(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})
for books in bookshelf:
    # collect title of all books
    book_title = books.h3.a["title"]
    book_url = books.find("a")["href"]
    #books_url = books.h3.a["url"]
    print(book_title + "-" +booksURL+book_url)
I tried to add this code, but I don't know how to add it to my script:
for i in range(51): # Number of pages plus one
    url = "https://books.toscrape.com/catalogue/page-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
This might work. I have removed uReq because I prefer using requests ;)
# import web grabbing client and
# HTML parser
from bs4 import BeautifulSoup as soup
import requests
for i in range(1, 51): # Number of pages plus one
    url = "https://books.toscrape.com/catalogue/page-{}.html".format(i)
    response = requests.get(url)
    # call BeautifulSoup for parsing
    page_soup = soup(response.content, "html.parser")
    # grabs all the products under list tag
    bookshelf = page_soup.findAll(
        "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})
    for books in bookshelf:
        # collect title of all books
        book_title = books.h3.a["title"]
        book_url = books.find("a")["href"]
        print(book_title + " - " + book_url)
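Note that the `href` values collected above are relative to the page they were found on. One way to turn them into full URLs is `urllib.parse.urljoin` from the standard library. The sketch below assumes hrefs of roughly this shape; the sample value is illustrative, not scraped from the site.

```python
from urllib.parse import urljoin

# URL of the listing page the link was found on.
page_url = "https://books.toscrape.com/catalogue/page-2.html"

# Illustrative relative href, assumed to resemble the ones on the site.
relative_href = "some-book_1/index.html"

# urljoin resolves the href against the page URL, the way a browser would.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

Inside the loop, this would amount to calling `urljoin(url, book_url)` before printing.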
What I'm trying to do is:
Take multiple URLs.
Take the h2 text from every URL.
Merge the h2 texts and then write them to a CSV.
In this code, I did the following:
Take one URL. Take the h2 text from that URL.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://example.com/ekonomi/20200108/"
#i am trying to do | urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/]
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each product from the store page
containers = page_soup.findAll("div", {"class": "b-plainlist__info"})
out_filename = "output.csv"
headers = "title \n"
f = open(out_filename, "w")
f.write(headers)
container = containers[0]
for container in containers:
    title = container.h2.get_text()
    f.write(title.replace(",", " ") + "\n")
f.close() # Close the file
Provided your iteration through the containers is correct, this should work.
You want to iterate through the URLs: for each URL, grab the titles and append them to a list. Then create a DataFrame from that list and write it to CSV with pandas:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import pandas as pd
urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']
titles = []
for page_url in urls:
    uClient = uReq(page_url)
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()
    # finds each product from the store page
    containers = page_soup.findAll("div", {"class": "b-plainlist__info"})
    for container in containers:
        titles.append(container.h2.get_text())
df = pd.DataFrame(titles, columns=['title'])
df.to_csv("output.csv", index=False)
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1'
#opening connection , downloading page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
# catch each events
card = page_soup.findAll("div",{"class":"eds-media-card-content__content"})
filename = "Data_Events.csv"
f = open(filename, "w")
headers = "events_name, events_dates, events_location, events_fees\n"
f.write(headers)
for activity in card:
    event_activity = activity.findAll("div",{"class":"eds-event-card__formatted-name--is-clamped"})
    events_name = event_activity[0].text
    event_date = activity.findAll("div",{"class":"eds-text-bs--fixed eds-text-color--grey-600 eds-l-mar-top-1"})
    events_dates = event_date[0].text
    events_location = event_date[1].text
    events_fees = event_date[2].text
    print("events_name: " + events_name)
    print("events_dates: " + events_dates)
    print("events_location: " + events_location)
    print("events_fees: " + events_fees)
    f.write(events_name + "," + events_dates + "," + events_location + "," + events_fees + "\n")
f.close()
Hi, I am still a beginner in Python, and I would like to know how to make this script fetch data from the next pages of the website.
I have tried something like:
for pages in page (1, 49)
    my_url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1'
Any advice would be appreciated.
import itertools
import requests
from bs4 import BeautifulSoup
def parse_page(url, page):
    params = dict(page=page)
    resp = requests.get(url, params=params)  # requests appends `?page=#` to the url
    soup = BeautifulSoup(resp.text, 'html.parser')
    ...  # parse data from page

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events'
for page in itertools.count(start=1):  # don't need to know total pages
    try:
        parse_page(url, page)
    except Exception:
        # `parse_page` was designed for the results layout, so it will
        # fail once there are no more pages to scrape; we break here
        break
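An alternative to relying on an exception to end the loop is to stop as soon as a page yields no items. Below is a minimal sketch, assuming a hypothetical `get_items(page)` callable that returns the list of items parsed from one page (empty when there are no more results). The `scrape_all` name and the `max_pages` safety cap are also assumptions, not part of the answer above.

```python
import itertools


def scrape_all(get_items, max_pages=100):
    """Collect items page by page until a page comes back empty."""
    collected = []
    for page in itertools.count(start=1):
        if page > max_pages:
            break  # safety cap so a misbehaving site cannot loop forever
        items = get_items(page)
        if not items:
            break  # an empty page is treated as the end of the results
        collected.extend(items)
    return collected


# Stand-in for a real parser: three pages of fake data, then nothing.
fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
all_items = scrape_all(lambda page: fake_site.get(page, []))
print(all_items)
```

This keeps genuine errors (network failures, layout changes) visible instead of silently treating them as the last page.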
I want to crawl all the movie reviews on this page.
Specifically, the part circled in red.
I tried to crawl it with this code (I used Jupyter Notebook with Anaconda3):
import requests
from bs4 import BeautifulSoup
test_url = "https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=174903&type=after&page=1"
resp = requests.get(test_url)
soup = BeautifulSoup(resp.content, 'html.parser')
soup
score_result = soup.find('div', {'class': 'score_result'})
lis = score_result.findAll('li')
lis[:3]
from urllib.request import urljoin #When I ran this block and next block it didn't save any reviews.
review_text=[]
#review_text = lis[0].find('p').getText()
list_soup =soup.find_all('li', 'p')
for item in list_soup:
    review_text.append(item.find('p').get_text())
review_text[:5] #Nothing was saved.
As I noted in the third and fourth blocks, nothing was saved. What is the problem?
This will get what you want. Tested in python within Jupyter Notebook (latest)
import requests
from bs4 import BeautifulSoup
from bs4.element import NavigableString
test_url = "https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=174903&type=after&page=1"
resp = requests.get(test_url)
soup = BeautifulSoup(resp.content, 'html.parser')
movie_lst = soup.select_one('div.score_result')
ul_movie_lst = movie_lst.ul
for movie in ul_movie_lst:
    if isinstance(movie, NavigableString):
        continue
    score = movie.select_one('div.star_score em').text
    name = movie.select_one('div.score_reple p span').text
    review = movie.select_one('div.score_reple dl dt em a span').text
    print(score + "\t" + name)
    print("\t" + review)
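As for why the original attempt collected nothing: as far as I can tell, a string passed as the second positional argument of `find_all` is matched against the CSS class, so `find_all('li', 'p')` looks for `<li class="p">` rather than for `<p>` tags inside `<li>` tags. The sketch below shows the difference on a small made-up HTML snippet (not the real Naver markup).

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<ul>
  <li><p>Great movie</p></li>
  <li><p>Too long</p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A string in the second position is treated as a class name,
# so this searches for <li class="p"> and finds nothing here.
by_class = soup.find_all("li", "p")

# A CSS selector expresses "p inside li" directly.
reviews = [p.get_text() for p in soup.select("li p")]

print(len(by_class))
print(reviews)
```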
I am practicing here and my goal is to retrieve these data from the page in the url variable:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
url = "https://www.newegg.com/global/bg-en/PS4-Accessories/SubCategory/ID-3142"
# opening connection, grabing the page
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div", {"class": "item-container"})
for container in containers:
    brand = container.select("div.item-info")[0].a.img["title"]
    name = container.findAll("a", {"class": "item-title"})[0].text.strip()
    shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip()
    print("brand " + brand)
    print("name " + name)
    print("shipping " + shipping)
There is not much more to say :) It is as simple as that, but I still can't figure out why no data is retrieved. I would be thankful for any advice!
You may be invoking the find_all method with the wrong arguments.
Try using the "class_" argument, as described in the documentation here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
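For reference, that documentation section describes both the `class_` keyword and an `attrs` dictionary for searching by CSS class. A small sketch on made-up HTML (not the actual Newegg page) suggests the two forms select the same elements, so if nothing is found on the real page it may also be worth checking what HTML the server actually returned.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="item-container"><a class="item-title">Controller</a></div>
<div class="item-container"><a class="item-title">Headset</a></div>
<div class="other">Not a product</div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# Keyword form (class is a reserved word in Python, hence the underscore).
with_keyword = page_soup.find_all("div", class_="item-container")

# Dictionary form, as used in the question.
with_dict = page_soup.find_all("div", {"class": "item-container"})

names = [c.find("a", class_="item-title").get_text() for c in with_keyword]
print(names)
print(len(with_keyword), len(with_dict))
```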