I wrote some code to scrape Zillow data and it works fine. The only problem I have is that it's limited to 20 pages even though there are many more results. Is there a way to get around this page limitation and scrape all the data?
I also wanted to know if there is a general solution to this problem, since I run into it on practically every site that I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)

pages_number = 19

def OnePage():
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []
    for i in range(len(all_data)):
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":' + zp_id + ',"contactFormRenderParameter":{"zpid":' + zp_id + ',"platform":"desktop","isDoubleScroll":true}}']["property"]

        home_info = {}  # one dict per property
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)

    return result

data = pd.DataFrame()
all_page_property_info = []

for page in range(pages_number):
    property_info_one_page = OnePage()
    search_link = 'https://www.zillow.com/homes/Florida--/' + str(page + 2) + '_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info + property_info_one_page

data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
Actually, you can't grab most of Zillow's data with bs4 alone, because it is loaded dynamically by JavaScript and bs4 can't render JavaScript; only 6 to 8 data items are static. The data you need is lying in a script tag, inside an HTML comment, as JSON. How do you pull the required data? You can follow the example below.
That way you can extract all the items. Pulling the rest of the data items is up to you; just add the keys you need where the price is extracted (there is a short sketch after the output below).
Zillow is one of the better-known and more sophisticated websites, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd
url='https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'
lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
    for item in data['cat1']['searchResults']['listResults']:
        price = item['price']
        lst.append({'price': price})

df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
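If you want more than the price per listing, here is a minimal sketch of the same loop with extra fields. price and detailUrl are keys that already appear in this thread; statusText is an assumed key name, so print one item first to confirm which fields exist. The simplified paging URL is also an assumption; the full searchQueryState URL above works the same way.

import re
import json
import requests
import pandas as pd

# Assumed simplified paging URL; the long searchQueryState URL above can be substituted.
url = 'https://www.zillow.com/fl/{page}_p/'
headers = {'User-Agent': 'Mozilla/5.0'}

lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers=headers)
    m = re.search(r'!--(\{"queryState".*?)-->', r.text)
    if not m:  # blocked or the page layout changed
        continue
    data = json.loads(m.group(1))
    for item in data['cat1']['searchResults']['listResults']:
        lst.append({
            'price': item.get('price'),           # shown in the example above
            'detailUrl': item.get('detailUrl'),   # used in the question's own code
            'statusText': item.get('statusText')  # assumed key -- verify by printing one item
        })

pd.DataFrame(lst).to_csv('out_extended.csv', index=False)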
I am trying to use Python to scrape the US News Ranking for universities, and I'm struggling. I normally use Python "requests" and "BeautifulSoup".
The data is here:
https://www.usnews.com/education/best-global-universities/rankings
Right-clicking and inspecting shows a bunch of links, and I don't even know which one to pick. I followed an example I found on the web, but it just gives me empty data:
import requests
import urllib.request
import time
import re
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO

url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page='
css = '#resultsMain :nth-child(1)'
npage = 20
urlst = [url] + [urltmplt + str(r) for r in range(2, npage + 1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return [link.text_content() for link in doc.cssselect(css)]

usng = []
for u in urlst:
    print(u)
    ts = [re.sub("\n *", " ", t) for t in scrapevec(u, css) if t != ""]
This doesn't work: scrapevec(u, css) returns an empty list, so ts ends up empty.
I'd really appreciate any help.
The MWE you posted is not working at all: urlst is never defined, so it cannot be iterated over. I strongly suggest you look for basic scraping tutorials (in Python, Java, etc.): there are plenty of them and they are a good starting point.
Below you can find a snippet of code that prints the universities' names listed on page 1; you'll be able to extend it to all 150 pages with a for loop.
import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'

page1 = requests.get(baseurl, headers=newheaders)  # change headers or get blocked
soup = BeautifulSoup(page1.text, 'lxml')
res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table

for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
    if a < 10:  # there are 10 listed universities per page
        print(univ.text)
Edit: the example now runs but, as you say in your question, it only returns empty lists. Below is an edited version of the code that returns a list of all the universities (pp. 1-150):
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table
    res = []
    for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
        if a < 10:  # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
Re-edit following QHarr's suggestion (thanks!): same output, but a shorter and more "pythonic" solution:
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
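If you then want the list on disk, a minimal follow-up sketch (the file name is just an example):

import pandas as pd

# univs comes from the snippet above: 10 names per page x 150 pages
pd.DataFrame({'university': univs}).to_csv('usnews_rankings.csv', index=False)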
I am working with lobbying data from opensecrets.org, in particular industry data. I want a time series of lobbying expenditures for each industry going back to the 1990s.
I want to scrape the data automatically. The URLs where the data lives have the following format:
https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
which is pretty easy to embed in a loop. The problem is that the data I need is not in an easy format on the page: it sits inside a bar graph, and when I inspect the graph I cannot find the data in the HTML code. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
If there is an API, that's your best bet, as mentioned above. But the data can still be parsed, provided you hit the right URL with the right query parameters.
I've iterated through the links for you so you can grab each table. I stored everything in a dictionary with the firm name as the key and the table/data as the value. You can change that to whatever suits you: store it as JSON, or save each table as a CSV.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
links = soup.find_all('a', href=True)

# build the chart-data url for each firm linked from the industry page
root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
links_dict = {}
for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link

all_tables = {}
n = 1
tot = len(links_dict)
for firms, link in links_dict.items():
    print('%s of %s ---- %s' % (n, tot, firms))
    data = requests.get(link)
    soup = BeautifulSoup(data.text, 'html.parser')

    results = pd.DataFrame()
    graph = soup.find_all('set')  # the bar-graph values are <set label="..." value="..."> elements
    for each in graph:
        year = each['label']
        total = each['value']
        temp_df = pd.DataFrame([[year, total]], columns=['year', '$mil'])
        results = results.append(temp_df, sort=True).reset_index(drop=True)

    all_tables[firms] = results
    n += 1
Output:
Not going to print them all as there are 347 tables, but just so you see the structure: each value in all_tables is a small DataFrame with 'year' and '$mil' columns.
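As a follow-up to the "save each as csv" suggestion above, here is a minimal sketch that either stacks all the tables into one long DataFrame with a firm column, or writes one file per firm. It assumes all_tables from the code above; the file names are just examples.

import pandas as pd

# all_tables comes from the code above: {firm name: DataFrame with 'year' and '$mil' columns}
combined = pd.concat(
    [df.assign(firm=firm) for firm, df in all_tables.items()],
    ignore_index=True
)
combined.to_csv('lobbying_by_firm.csv', index=False)

# or one CSV per firm (slashes in firm names would break the path, so replace them)
for firm, df in all_tables.items():
    df.to_csv(firm.replace('/', '_') + '.csv', index=False)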
We are trying to scrape every product in every category on Forever 21's website. Given a product page, we know how to extract the information we need, and given a category, we can extract every product. However, we do not know how to crawl through every product category. Here is our code for getting every product in a given category:
import requests
from bs4 import BeautifulSoup
import json
#import re

params = {"action": "getcategory",
          "br": "f21",
          #"category": re.compile('\S+'),
          "category": "dress",
          "pageno": 1,
          "pagesize": "",
          "sort": "",
          "fsize": "",
          "fcolor": "",
          "fprice": "",
          "fattr": ""}

url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
js = requests.get(url, params=params).json()
soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")

i = 0
j = 0
while len(soup.select("div.item_pic a")) != 0:
    for a in soup.select("div.item_pic a"):
        #print(a["href"])
        i = i + 1
    params["pageno"] = params["pageno"] + 1
    j = j + 1
    js = requests.get(url, params=params).json()
    soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")

print(i)
print(j)
As you can see in the comments, we tried to use regular expressions for the category but had no success. i and j are just product and page counters. Any suggestions on how to modify/add to this code to get every product category?
You can scrape the category page and get all subcategories from the navigation menu:
import requests
from bs4 import BeautifulSoup
url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=app-main"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
menues = [li["class"][0] for li in soup.select("#has_sub .white nav ul > li")]
print(menues)
Prints:
[u'women-new-arrivals', u'want_list', u'dress', u'top_blouses', u'outerwear_coats-and-jackets', u'bottoms', u'intimates_loungewear', u'activewear', u'swimwear_all', u'acc', u'shoes', u'branded-shop-women-clothing', u'sale_women|women', u'women-new-arrivals-clothing-dresses', u'women-new-arrivals-clothing-tops', u'women-new-arrivals-clothing-outerwear', u'women-new-arrivals-clothing-bottoms', u'women-new-arrivals-clothing-intimates-loungewear', u'women-new-arrivals-clothing-swimwear', u'women-new-arrivals-clothing-activewear', u'women-new-arrivals-accessories|women-new-arrivals', u'women-new-arrivals-shoes|women-new-arrivals', u'promo-web-exclusives', u'promo-best-sellers-app', u'backinstock-women', u'promo-shop-by-outfit-women', u'occasion-shop-wedding', u'contemporary-main', u'promo-basics', u'21_items', u'promo-summer-forever', u'promo-coming-soon', u'dress_casual', u'dress_romper', u'dress_maxi', u'dress_midi', u'dress_mini', u'occasion-shop-dress', u'top_blouses-off-shoulder', u'top_blouses-lace-up', u'top_bodysuits-bustiers', u'top_graphic-tops', u'top_blouses-crop-top', u'top_t-shirts', u'sweater', u'top_blouses-sweatshirts-hoodies', u'top_blouses-shirts', u'top_plaids', u'outerwear_bomber-jackets', u'outerwear_blazers', u'outerwear_leather-suede', u'outerwear_jean-jackets', u'outerwear_lightweight', u'outerwear_utility-jackets', u'outerwear_trench-coats', u'outerwear_faux-fur', u'promo-jeans-refresh|bottoms', u'bottoms_pants', u'bottoms_skirt', u'bottoms_shorts', u'bottoms_shorts-active', u'bottoms_leggings', u'bottoms_sweatpants', u'bottom_jeans|', u'intimates_loungewear-bras', u'intimates_loungewear-panties', u'intimates_loungewear-bodysuits-slips', u'intimates_loungewear-seamless', u'intimates_loungewear-accessories', u'intimates_loungewear-sets', u'activewear_top', u'activewear_sports-bra', u'activewear_bottoms', u'activewear_accessories', u'swimwear_tops', u'swimwear_bottoms', u'swimwear_one-piece', u'swimwear_cover-ups', u'acc_features', u'acc_jewelry', u'acc_handbags', u'acc_glasses', u'acc_hat', u'acc_hair', u'acc_legwear', u'acc_scarf-gloves', u'acc_home-and-gift-items', u'shoes_features', u'shoes_boots', u'shoes_high-heels', u'shoes_sandalsflipflops', u'shoes_wedges', u'shoes_flats', u'shoes_oxfords-loafers', u'shoes_sneakers', u'Shoes_slippers', u'branded-shop-new-arrivals-women', u'branded-shop-women-clothing-dresses', u'branded-shop-women-clothing-tops', u'branded-shop-women-clothing-outerwear', u'branded-shop-women-clothing-bottoms', u'branded-shop-women-clothing-intimates', u'branded-shop-women-accessories|branded-shop-women-clothing', u'branded-shop-women-accessories-jewelry|', u'branded-shop-shoes-women|branded-shop-women-clothing', u'branded-shop-sale-women', u'/brandedshop/brandlist.aspx', u'promo-branded-boho-me', u'promo-branded-rare-london', u'promo-branded-selfie-leslie', u'sale-newly-added', u'sale_dresses', u'sale_tops', u'sale_outerwear', u'sale_sweaters', u'sale_bottoms', u'sale_intimates', u'sale_swimwear', u'sale_activewear', u'sale_acc', u'sale_shoes', u'the-outlet', u'sale-under-5', u'sale-under-10', u'sale-under-15']
Note the values of br and category GET parameters. f21 is the "Women" category, app-main is the main page for a category.
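If it helps, here is a rough sketch of how the scraped category list could be fed back into the question's own Ajax_Category.aspx loop. The clean-up of the entries (dropping the parent|child suffix and skipping plain URLs like /brandedshop/brandlist.aspx) is an assumption based on the list printed above, so treat it as a starting point rather than a tested crawler.

import requests
from bs4 import BeautifulSoup

ajax_url = "http://www.forever21.com/Ajax/Ajax_Category.aspx"
product_links = []

for category in menues:                   # menues comes from the snippet above
    if category.startswith("/"):          # skip entries that are plain URLs
        continue
    category = category.split("|")[0]     # drop the parent part of entries like 'sale_women|women'
    params = {"action": "getcategory", "br": "f21", "category": category,
              "pageno": 1, "pagesize": "", "sort": "", "fsize": "",
              "fcolor": "", "fprice": "", "fattr": ""}
    while True:
        js = requests.get(ajax_url, params=params).json()
        soup = BeautifulSoup(js[u'CategoryHTML'], "html.parser")
        anchors = soup.select("div.item_pic a")
        if not anchors:
            break
        product_links.extend(a["href"] for a in anchors)
        params["pageno"] += 1

print(len(product_links))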
I'm doing a web scrape of a website with 122 different pages and 10 entries per page. The code breaks on random pages, on random entries, each time it is run. I can run the code on a URL once and it works, while other times it does not.
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return soup

def from_soup(soup, myCellsList):
    cellsList = soup.find_all('li', {'class': 'product clearfix'})
    for i in range(len(cellsList)):
        ottdDict = {}
        ottdDict['Name'] = cellsList[i].h3.text.strip()
This is only a piece of my code, but it is where the error occurs. The problem is that the h3 tag does not always appear in every item of cellsList, which results in a NoneType error when the last line runs. However, the h3 tag is always there when I inspect the page's HTML.
(Screenshots: cellsList vs. the page HTML, and the same comparison made from a subsequent soup request.)
What could be causing these differences, and how can I avoid the problem? I was able to run the code successfully for a while, and then it suddenly stopped working. It scrapes some pages without a problem but randomly fails to find the h3 tags on random entries of random pages.
There are slight discrepancies in the HTML for various elements as you progress through the site's pages; the best way to get the name is actually to select the outer div and extract the text from the anchor.
The following will get all the info for each product and put it into dicts where the keys are 'Tissue', 'Cell', etc. and the values are the related descriptions:
import requests
from bs4 import BeautifulSoup
from time import sleep

def from_soup(url):
    with requests.Session() as s:
        s.headers.update({
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
        # id for the next-page anchor.
        id_ = "#layoutcontent_2_middlecontent_0_threecolumncontent_0_content_ctl00_rptCenterColumn_dcpCenterColumn_0_ctl00_0_productRecords_0_bottomPaging_0_liNextPage_0"
        soup = BeautifulSoup(s.get(url).content, "html.parser")
        for li in soup.select("ul.product-list li.product.clearfix"):
            name = li.select_one("div.product-header.clearfix a").text.strip()
            d = {"name": name}
            for div in li.select("div.search-item"):
                k = div.strong.text
                d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
            yield d
        # get the anchor for the next page and loop until it is no longer there.
        nxt = soup.select_one(id_)
        while nxt:
            # sleep between requests
            sleep(.5)
            resp = s.get(nxt.a["href"])
            soup = BeautifulSoup(resp.content, "html.parser")
            for li in soup.select("ul.product-list li.product.clearfix"):
                name = li.select_one("div.product-header.clearfix a").text.strip()
                d = {"name": name}
                for div in li.select("div.search-item"):
                    k = div.strong.text
                    d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
                yield d
            nxt = soup.select_one(id_)  # re-check for a next-page anchor
After running:
for ind, h in enumerate(from_soup(
        "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb")):
    print(ind, h)
You will see 1211 dicts with all the data.
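If you'd rather end up with a spreadsheet than a stream of printed dicts, a minimal follow-up sketch (the file name is just an example):

import pandas as pd

# from_soup is the generator defined above; collect its dicts and write them out
rows = list(from_soup(
    "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb"))
pd.DataFrame(rows).to_csv("atcc_products.csv", index=False)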