BeautifulSoup can't read HTML of the webpage - python

I want to get real estate data from https://www.realtor.com/
I use this code:
from bs4 import BeautifulSoup as bs
import requests
main_url='https://www.realtor.com/realestateandhomes-search/New-York_NY'
page=requests.get(main_url).content
bs(page,'html.parser')
It does not output the full HTML of the page, so I can't find the tags I am interested in.
Is there another way to get the full HTML?

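from bs4 import BeautifulSoup as bs
import requests
main_url='https://www.realtor.com/realestateandhomes-search/New-York_NY'
page=requests.get(main_url)
# Parse the raw response body and print the resulting document
results = bs(page.content,'html.parser')
print(results)
This should work.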

Related

I am trying to scrape the name, website and some data, but it is showing me None

import requests
from bs4 import BeautifulSoup
url='https://dashboard.slintel.com/#/technologies-details/5bb416601e3d6672781be96f'
a=requests.get(url)
soup=BeautifulSoup(a.text,'html.parser')
data=soup.find('table', class_="table")
print(data)
Please help me find the solution. I have tried the "tbody" tag as well, but it gives me the same result.
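One thing that may explain the None: everything after the `#` in a URL is a fragment, which the client keeps to itself and does not send to the server. So requests most likely fetches only the dashboard's base page, and the table is presumably built afterwards by JavaScript (that last part is an assumption, I have not inspected this site). A small sketch using only the standard library to see what part of the URL actually reaches the server:

```python
from urllib.parse import urlsplit

url = "https://dashboard.slintel.com/#/technologies-details/5bb416601e3d6672781be96f"
parts = urlsplit(url)

# The path is what the server is asked for.
print(parts.path)      # /
# The fragment stays on the client side and is handled by the page's scripts.
print(parts.fragment)  # /technologies-details/5bb416601e3d6672781be96f
```

If the requested path is just `/`, the HTML that comes back cannot contain data specific to the details page.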

How can I get data from this link into a JSON?

I am trying to extract the search results from this link into a JSON file with Python, but normal request methods do not seem to work in this case. How can I extract all the results?
url= https://apps.usp.org/app/worldwide/medQualityDatabase/reportResults.html?country=Ethiopia%2BGhana%2BKenya%2BMozambique%2BNigeria%2BCambodia%2BLao+PDR%2BPhilippines%2BThailand%2BViet+Nam%2BBolivia%2BColombia%2BEcuador%2BGuatemala%2BGuyana%2BPeru&period=2017%2B2016%2B2015%2B2014%2B2013%2B2012%2B2011%2B2010%2B2009%2B2008%2B2007%2B2006%2B2005%2B2004%2B2003&conclusion=Both&testType=Both&counterfeit=Both&recordstart=50
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
# Parse the body of the response stored in r (the name response was never defined)
results_page = BeautifulSoup(r.content,'lxml')
Why am I not getting the full source code of the page?

bs4 won't open locally stored html page correctly

When I attempt to parse a locally stored copy of a webpage, BeautifulSoup returns gibberish. I don't understand why, as I've never faced this problem when using the requests and bs4 modules together for scraping tasks.
Here's my code:
import requests
from bs4 import BeautifulSoup as BS
import os
url_2 = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(url_2)
f = open('re_2.html')
soup = BS(url_2, "lxml")
f.close()
print soup
This code returns the following:
<html><body><p>/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/</p></body></html>
I wasn't able to find a similar problem online, so I've posted it here. Any help would be much appreciated.
You are passing the path (which you named url_2) to BeautifulSoup, so it treats that string as the page's markup and returns it, neatly wrapped in some minimal HTML. That is the expected behavior for that input.
Try constructing the soup from the file's contents instead. See here how it works: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
soup = BS(f, "lxml")
should do it, as long as it runs before f.close().
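To make the difference concrete, here is a minimal sketch of parsing a saved page by handing BeautifulSoup the open file object rather than the path string. The temporary file and its contents are made up so the example is self-contained (in the question it would be the saved re_2.html), and it uses the built-in "html.parser" so lxml is not required:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a small stand-in HTML file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
html_path = os.path.join(tmp_dir, "example.html")
with open(html_path, "w", encoding="utf-8") as out:
    out.write("<html><head><title>Saved page</title></head>"
              "<body><p>Hello</p></body></html>")

# Pass the open file object, not the path string, to BeautifulSoup.
with open(html_path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)  # Saved page
print(soup.p.get_text())  # Hello
```

The `with` block closes the file automatically once the soup has been built, so there is no ordering problem with `f.close()`.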

BeautifulSoup html -- load from memory?

I'm using BeautifulSoup in Python 3.5 to parse HTML. While I can load it from a file, I need to load it from memory because I get it from an HTTP request. I've googled but found nothing about loading HTML into BeautifulSoup from memory. Is it possible?
If you are using version 4 of BeautifulSoup, try passing the response text to it:
from bs4 import BeautifulSoup
import requests
# replace the following URL
response = requests.get("https://www.python.org")
soup = BeautifulSoup(response.text,"html.parser")
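# The old "from BeautifulSoup import BeautifulSoup" form is BeautifulSoup 3, which runs on Python 2 only
from bs4 import BeautifulSoup
import requests
data = requests.get('https://google.com').text
soup = BeautifulSoup(data, 'html.parser')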
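Both answers come down to the same point: BeautifulSoup accepts markup that is already in memory, either as a string (like `response.text`) or as bytes (like `response.content`). A minimal sketch without any network access, where the HTML literal is a made-up stand-in for a response body:

```python
from bs4 import BeautifulSoup

# Stand-ins for what an HTTP response would hold (hypothetical content).
html_text = "<html><body><h1>In memory</h1></body></html>"  # like response.text
html_bytes = html_text.encode("utf-8")                      # like response.content

# Either form can be passed directly to the constructor.
soup_from_text = BeautifulSoup(html_text, "html.parser")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")

print(soup_from_text.h1.string)   # In memory
print(soup_from_bytes.h1.string)  # In memory
```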

Scraping Product Names using BeautifulSoup

I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between 'h1' tags. Can't figure out why the code I've written isn't working!
from urllib2 import urlopen
from bs4 import BeautifulSoup
import re
TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage)
ProductNames = soup.find_all('h1')
print ProductNames
I got this working using requests (http://docs.python-requests.org/en/latest/):
from bs4 import BeautifulSoup
import requests
# TopShop_URL is the product page URL, as read from input in the question's code
content = requests.get(TopShop_URL).content
soup = BeautifulSoup(content, "html.parser")
product_names = soup.find_all("h1")
print(product_names)
Your code is correct, but the problem is that the div which includes the product name is generated dynamically via JavaScript.
To parse this element successfully, consider using Selenium or a similar tool, which lets you parse the webpage after the DOM has been fully loaded.
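A small sketch of why the parser comes back empty in that situation. Both HTML strings below are invented for illustration (they are not TopShop's real markup): the first imitates what a plain HTTP client might receive, the second imitates the DOM after a browser has run the script. BeautifulSoup does not execute JavaScript, so it only sees what is literally in the markup it is given:

```python
from bs4 import BeautifulSoup

# Hypothetical markup as fetched over HTTP: the product name is not there yet.
static_html = """
<html><body>
  <div id="product"></div>
  <script>
    var heading = document.createElement("h1");
    heading.textContent = "Example Dress";
    document.getElementById("product").appendChild(heading);
  </script>
</body></html>
"""

# Hypothetical markup after a browser has executed the script.
rendered_html = """
<html><body>
  <div id="product"><h1>Example Dress</h1></div>
</body></html>
"""

def h1_texts(html):
    """Return the stripped text of every <h1> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h1.get_text(strip=True) for h1 in soup.find_all("h1")]

print(h1_texts(static_html))    # []
print(h1_texts(rendered_html))  # ['Example Dress']
```

With Selenium, the rendered markup is available as `driver.page_source` once the page has loaded, and that string can be passed to BeautifulSoup in the same way.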
