Can't extract text with find_all in BeautifulSoup - Python

I want to extract all the available items in the équipements (amenities) section, but I can only get the first four items, followed by '+ Plus'.
import urllib2
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url = url, headers = headers)
html = urllib2.urlopen(req)
bsobj = BeautifulSoup(html.read(),'lxml')
b = bsobj.findAll("div",{"class": "row amenities"})
The result in b does not contain all the list items inside the tag, and its last element is '+ Plus', which looks like the following:
<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]

This is because the data is filled in by React after the page loads, so if you download the page via requests you can't see the data.
Instead, you have to use the Selenium WebDriver to open the page and execute all the JavaScript. Then you can get access to all the data you expect.
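A minimal sketch of that approach: let Selenium render the page, then hand the rendered HTML to BeautifulSoup. The browser choice and the span-based extraction are assumptions, since the exact rendered markup of the amenities list isn't shown in the question.

```python
# Sketch: let the browser execute the React JavaScript, then parse the
# fully rendered HTML with BeautifulSoup.
from bs4 import BeautifulSoup

def amenity_texts(page_source):
    # Collect the text of every <span> inside the amenities rows.
    # The "row amenities" class comes from the question; the <span>
    # structure is an assumption about the rendered markup.
    soup = BeautifulSoup(page_source, "html.parser")
    rows = soup.find_all("div", {"class": "row amenities"})
    return [span.get_text(strip=True)
            for row in rows
            for span in row.find_all("span")]

if __name__ == "__main__":
    from selenium import webdriver  # pip install selenium
    driver = webdriver.Firefox()    # any driver works; Firefox is one example
    driver.get("https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A")
    print(amenity_texts(driver.page_source))
    driver.quit()
```

With a rendered page, all amenities should be present instead of only the first four and '+ Plus'.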

webscraping python not showing all tags

I'm new to web scraping. I was trying to make a script that gets data from a balance sheet (here the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I look at the source code in my browser, I'm able to find the tag and the correct value, but once I write a script with bs4, I don't get anything.
I'm trying to get information from the balance sheet: Products, Services, Cost of sales... and the data contained in the first table. (I'm sorry, but I can't post the image. Anyway, it's the first table you see when scrolling down.)
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data,"lxml")
names = soup_data.find_all("td")
for name in names:
    print(name)
Thanks for your time.
Try the direct document URL (without the /ix?doc= JavaScript viewer), and include headers to get the data:
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.
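As a sketch of the next step, assuming the balance-sheet figures sit in ordinary <td> cells (as the question's output suggests), you can filter the cells down to their non-empty text:

```python
# Sketch: extract the non-empty text of every <td> cell in document order.
# The exact cell layout of the filing is an assumption.
from bs4 import BeautifulSoup

def td_texts(html):
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True)
            for td in soup.find_all("td")
            if td.get_text(strip=True)]
```

With the answer's code above, `td_texts(req.text)` would yield the labels (Products, Services, ...) and figures in reading order.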

Scraping product URLs under specific ZIP Code

I'm trying to scrape the product links under ZIP code 08041. I have written the code to scrape the products without a ZIP code, but I don't know how to send the request for the products under 08041.
Here is my code:
import requests
import random
import time
from bs4 import BeautifulSoup
import wget
import csv
from fp.fp import FreeProxy
def helloworld(url):
    r = requests.get(url)
    print ('Status',r.status_code)
    #time.sleep(8)
    soup = BeautifulSoup(r.content,'html.parser')
    post = soup.find_all('a',"name")
    for href in post:
        if ( href.get('href')[1] == 'p'):
            href = href.get('href')
            print (href)

def page_counter():
    url1 = "https://soysuper.com/c/aperitivos#products"
    print (url1,'\n')
    helloworld(url1)

page_counter()
You can use the back-end endpoints to mimic a request with a given ZIP code.
Note: The cookie is hard-coded but valid for a year.
Here's how:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Cookie": "soysuper=eyJjYXJ0IjoiNjA2NWNkMzg5ZDI5YzkwNDU1NjI3MzYzIiwiZXhwaXJlcyI6MTY0ODg0MTMzOSwib3JpZCI6IkM2NzgwOUYyLTkyRUYtMTFFQi04NjNELTgzMTBCMUUwMTM2NiIsInNtIjoiIiwidXVpZCI6IkIwQjYxQzRFLTkyRUYtMTFFQi05MjRCLTA5MTFCMUUwMTM2NiIsIndoIjpbIjU0MDQ5MjEwMDk1Y2ZhNTQ2YzAwMDAwMCIsIjRmZjMwZTZhNTgzMmU0OGIwMjAwMDAwMCIsIjU5Y2JhZmE2OWRkNGU0M2JmMzIwODM0MiIsIjRmMzEyNzU4ZTNjNmIzMDAzMjAwMDAwMCIsIjVhMTZmNjdhMjUwOGMxNGFiMzE0OTY4MyIsIjYwMjQxNTEzNzIyZDZhNTZkNDZlMjhmNyIsIjRmZjMwZTJkYzI3ZTk1NTkwMjAwMDAwMSIsIjU5ZjcxYTZlNjI4YWIwN2UyYjJjZmJhMSIsIjU5Y2JhZjNjOWRkNGU0M2JmMzIwODM0MSIsIjVhMGU0NDFhNTNjOTdiM2UxNDYyOGEzNiIsIjRmMmJiZmI3ZWJjYjU1OGM3YjAwMDAwMCIsIjYwNDExZjJlNzIyZDZhMTEyZDVjYTNlYiIsIjViMWZmZjAyNzI1YTYxNzBjOTIxMjc0MSIsIjVlNzk2NWUwZDc5MTg3MGU0NTA1MGMwMCIsIjVkMTI0NDQ2OWRkNGU0NGFkMDU3MmMxMSJdLCJ6aXAiOiIwODA0MSJ9--166849121eece159a6fdb0c0fe8341032321d9b1;"
}
with requests.Session() as connection:
    r = connection.get("https://soysuper.com/supermarket?zipcode=08041", headers=headers)
    headers["Request-Id"] = r.headers["Next-Request-Id"]
    headers["Referer"] = "https://soysuper.com/c/aperitivos"
    products_data = connection.get("https://soysuper.com/c/aperitivos?products=1&page=1", headers=headers).json()
    print(products_data["products"]["total"])
Output (the total number of products for ZIP code 08041):
2923
What you're effectively getting is a JSON document with all the product data for a given page (this is the same response you can see in the browser's Network tab).
Notice the pager key: use it to paginate the API and get more product info.
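A sketch of how that pagination could look. The `?products=1&page=N` endpoint shape comes from the request above; the total page count would come from the pager key in the JSON, whose exact fields aren't shown here, so treat this helper as an assumption:

```python
# Sketch: build the paginated product-endpoint URLs for one category,
# mirroring the "?products=1&page=N" request shape used above.
def page_urls(category_url, total_pages):
    return [f"{category_url}?products=1&page={n}"
            for n in range(1, total_pages + 1)]
```

Each URL can then be fetched with the same session and headers as in the answer above.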

HTML links not found by BeautifulSoup

I'm trying to write a program that downloads the most upvoted picture from a subreddit, but for some reason BeautifulSoup does not find all the links on the page. I know I could try other methods, but I'm curious why it isn't finding all the links every time.
Here is the code as well.
from PIL import Image
import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/r/wallpaper/top/'
result = requests.get(url)
soup = BeautifulSoup(result.text,'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
The site is loaded with JavaScript, and bs4 will not be able to render JavaScript. However, I've been able to locate the data within a script tag:
import requests
import re
import json
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
def main(url):
    r = requests.get(url, headers=headers)
    match = re.search(r"window.___r = ({.+})", r.text).group(1)
    data = json.loads(match)
    # print(data.keys())
    # humanreadable = json.dumps(data, indent=4)

main("https://www.reddit.com/r/wallpaper/top/")
Shorter version:
match = re.finditer(r'permalink":"(.+?)"', r.text)
for item in match:
    print(item.group(1))
Output:
https://www.reddit.com/r/wallpaper/comments/fv9ubr/khyber_pakhtunkhwa_pakistan_balakot_1920x1024/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/?instanceId=t3_p%3DgAAAAABeiiTtw4FM0zBerf9DDiq5tmonjJbAwzQb_UwA-VHlw2J8zUxw-y6Doa6j-jPP0qt05lRZfyReQwnLH9pN6wdSBBvqhgxgRS3uKyKCRvkk6WNwns5wpad0ijMgHwqVnZSGMT0KWP4WB15zBNkb3j96ifm23pT4uACb6cpNVh-TE05GiTtDnD9UUMir02Z7hOr0x4f_wLJEIplafXRp2yiAFPh5VzH_4VSsPx9zV7v3IJwN5ctYLfIcdCW5Z3W-z3bbOVUCU2HqqRAoh0XEj0LrgdicMexa9fzPbtWOshfx3kIazwFhYXoSowPBRZUquSs9zEaQwP1B-wg951edNb7RSjYTrDpQ75zsMfIkasKvAOH-V58%3D
https://www.reddit.com/r/wallpaper/comments/fv6wew/lone_road_in_nowhere_arizona_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvaqaa/the_hobbit_house_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fvcs4j/something_i_made_in_illustrator_5120_2880/
https://www.reddit.com/r/wallpaper/comments/fv09u2/bath_time_in_rocky_mountain_national_park_1280x720/
https://www.reddit.com/r/wallpaper/comments/fuyomz/up_is_still_my_favorite_film_grandpa_carl_cams/
https://www.reddit.com/r/wallpaper/comments/fvagex/beautiful_and_colorful_nature_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv3nnn/maroon_bells_co_photo_credit_to/
https://www.reddit.com/r/wallpaper/comments/fuyg0z/volcano_lightening_19201080/
https://www.reddit.com/r/wallpaper/comments/fvgohk/doctor_strange1920x1080/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/?instanceId=t3_p%3DgAAAAABeiiTt9isPY03zwoimtzcC7w3uLzUDCuoD5cU6ekeEYt48cRAqoMsc1ZDBJ6OeK1U3Bs2Zo1ZSWzdQ4DOux21vGvWzJkxNWQ14XzDWag_GlrE-t_4rpFA_73kW94xGUQchsXL7f4VkbbHIyn8SMlUlTtt3j3lJCViwINOQgIF3p5N8Q4ri-swtJC-JyEUYa4dJazlZ9xLYyOHSvMkiR3k9lDx0NEKqpqfbQ9__f3xLUzgS4yF4OngMDFUVFa5nyH3I32mkP3KezXLxOR6H8CSGI_jqRA4dBV-AnHLuzPlgENRpfaMhWJ04vTEOjmG4sm4xs65OZCumqNstzlDEvR7ryFwL6LeH02a9E3czck5jfKY7HXQ%3D
https://www.reddit.com/r/wallpaper/comments/fuzjza/ghost_cloud_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvg88o/park_autumn_tress_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv47r8/audi_quattro_s1_3840x2160_fh4/
https://www.reddit.com/r/wallpaper/comments/fuybjs/spacecrafts_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fv043i/dragonfly_1280x720/
https://www.reddit.com/r/wallpaper/comments/fv06ud/muskrat_swim_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvdafk/natural_beauty_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbnuc/cigar_man_19201080/
https://www.reddit.com/r/wallpaper/comments/fvcww4/thunder_road_3840_x_2160/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/?instanceId=t3_p%3DgAAAAABeiiTtxVzGp9KwvtRNa1pOVCgz2IBkTGRxqdyXk4WTsjAkWS9wzyDVF_1aSOz36HqHOVrngfj3z_9O1cAkzz-0fwhxyJ_8jePT3F88mrveLChf_YRIbAtxb-Ln_OaeeXUnyrFVl-OPN7cqXvtgh3LoymBx3doL-bEVnECOWkcSXvUIwpMn-flVZ5uNcGL1nKEiszUcORqq1oQ32BnrmWHomrDb3Q%3D%3D
https://www.reddit.com/r/wallpaper/comments/fv3xqs/social_distancing_log_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbcpl/neon_city_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fvbhdb/sunrise_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fv2eno/second_heavy_bike_in_ghost_recon_breakpoint/
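As an alternative sketch: Reddit also serves listings as JSON if you append .json to the listing URL, which avoids regexing the embedded script blob. The key path below (data → children → data → url) follows Reddit's public listing format; the User-Agent string is a made-up example (Reddit throttles the default one).

```python
# Sketch: read the subreddit listing via Reddit's .json endpoint instead
# of scraping the JavaScript-rendered HTML.
def top_image_urls(listing_json):
    # Each child in a Reddit listing is one post; "url" is usually the
    # direct media link for image posts.
    return [child["data"]["url"]
            for child in listing_json["data"]["children"]]

if __name__ == "__main__":
    import requests
    headers = {"User-Agent": "wallpaper-example/0.1"}  # example UA, not special
    r = requests.get("https://www.reddit.com/r/wallpaper/top/.json",
                     headers=headers)
    print(top_image_urls(r.json()))
```

From there, downloading the first URL with requests (or PIL, as the question's imports suggest) would fetch the most upvoted picture.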

Trying to parse a div class but can't get the correct results

I'm trying to parse the div class titled "dealer-info" from the URL below.
https://www.nissanusa.com/dealer-locator.html
I tried this:
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Normally, I would expect that to work, but I'm getting this result: HTTPError: Forbidden
I also tried this:
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read() # the data you need
print(data)
That gives me all the HTML on the site, but it's pretty ugly to look at, or make any sense of at all.
I'm trying to get a structured data set, of "dealer-info". I am using Python 3.6.
You might be being rejected by the server in your first example due to not pretending to be an ordinary browser. You should try combining the user agent code from the second example with the Beautiful Soup code from the first:
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
text = response.read()
soup = BeautifulSoup(text, "lxml")
data = soup.findAll('div',attrs={'class':'dealer-info'})
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Keep in mind that if the web site is explicitly trying to keep Beautiful Soup or other non-recognized user agents out, they may take issue with you scraping their web site data. You should consult and obey https://www.nissanusa.com/robots.txt as well as any terms of use or terms of service agreements you may have agreed to.

Requests in python return error, while opening link manually works perfect

import requests
a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')
b = requests.get(a)
print(b.text)
print(b.url)
If you copy the printed URL and paste it into a browser, the site opens with no problem, but with requests.get I get what look like token errors. Via requests.get I get a page back, but no data; it says: <html><head><TITLE>TESS -- Error</TITLE></head><body>. Is there anything I can do?
First of all, make sure you follow the website's Terms of Use and usage policies.
This is a little more complicated than it may seem. You need to maintain a certain state throughout the web-scraping session, and you'll need an HTML parser, like BeautifulSoup, along the way:
from urllib.parse import parse_qs, urljoin
import requests
from bs4 import BeautifulSoup
SEARCH_TERM = 'coca-cola'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}

    # get the current search state
    response = session.get("https://tmsearch.uspto.gov/")
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]
    session.get(urljoin(response.url, link))
    state = parse_qs(link)['state'][0]

    # perform a search
    response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
        'f': 'toc',
        'state': state,
        'p_search': 'search',
        'p_s_All': '',
        'p_s_ALL': SEARCH_TERM + '[COMB]',
        'a_default': 'search',
        'a_search': 'Submit'
    })

    # print search results
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find("font", color="blue").get_text())
    table = soup.find("th", text="Serial Number").find_parent("table")
    for row in table('tr')[1:]:
        print(row('td')[1].get_text())
It prints all the serial number values from the first search results page, for demonstration purposes.
