I wrote code to scrape Zillow data and it works fine. The only problem is that it is limited to 20 pages even though there are many more results. Is there a way to get around this page limit and scrape all the data?
I also wanted to know if there is a general solution to this problem since I encounter it practically in every site that I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd  # needed for the DataFrame/CSV export below

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}
search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)
pages_number = 19

def OnePage():
    # Parses the search page currently held in the global `response`
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']
    result = []
    for i in range(len(all_data)):
        home_info = {}  # one dict per listing (a list cannot take string keys)
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":'+zp_id+',"contactFormRenderParameter":{"zpid":'+zp_id+',"platform":"desktop","isDoubleScroll":true}}']["property"]
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)
    return result

data = pd.DataFrame()
all_page_property_info = []
for page in range(pages_number):
    property_info_one_page = OnePage()
    # Fetch the next search page so the next OnePage() call parses it
    search_link = 'https://www.zillow.com/homes/Florida--/'+str(page+2)+'_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info+property_info_one_page
data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
You can't get most of this data from Zillow's rendered HTML with bs4 alone: the listings are loaded dynamically by JavaScript, which bs4 can't execute, and only 6 to 8 items are static. The full data sits in a script tag as JSON wrapped in an HTML comment, so you can pull it out with a regular expression and parse it with json, as in the following example.
The example extracts only the price; to get other fields, add them inside the loop in the same way.
Zillow is a well-known and fairly sophisticated website, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd
url='https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'
lst=[]
for page in range(1,21):
    r = requests.get(url.format(page=page),headers = {'User-Agent':'Mozilla/5.0'})
    # The listing JSON is embedded in an HTML comment inside a script tag
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
    for item in data['cat1']['searchResults']['listResults']:
        price= item['price']
        lst.append({'price': price})
df = pd.DataFrame(lst)
df.to_csv('out.csv',index=False)  # to_csv returns None when given a path, so keep df separate
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
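On the general question of capped result pages: one common workaround, which is not part of the answer above, is to split a search into narrower sub-searches (for example by price range) until each one fits under the cap, and then scrape each sub-search separately. The sketch below shows only the splitting logic. `count_results` is a hypothetical callback standing in for whatever request reports the number of matches for a range, and the cap of 800 is an assumption based on the 20 pages of 40 results seen in the output.

```python
from typing import Callable, List, Tuple


def split_ranges(lo: int, hi: int, cap: int,
                 count_results: Callable[[int, int], int]) -> List[Tuple[int, int]]:
    """Split [lo, hi) into sub-ranges whose result count is at most `cap`."""
    # Stop when the range fits under the cap or cannot be split any further.
    if count_results(lo, hi) <= cap or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_ranges(lo, mid, cap, count_results)
            + split_ranges(mid, hi, cap, count_results))


# Demo with a fake in-memory data set standing in for a site (made-up prices).
fake_prices = [100_000 + 1_000 * i for i in range(3_000)]


def fake_count(lo: int, hi: int) -> int:
    """Hypothetical stand-in for a request that returns the total match count."""
    return sum(lo <= p < hi for p in fake_prices)


ranges = split_ranges(0, 4_000_000, 800, fake_count)
for lo, hi in ranges:
    print(lo, hi, fake_count(lo, hi))
```

Each resulting range can then be scraped with the normal page loop. Whether a given site exposes a usable filter and a total count is something to check per site.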
I am trying to webscrape the "Active Positions" table from the following website:
https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings
My code is below:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings')
soup = BeautifulSoup(html_text, 'lxml')
job1 = soup.find('div', classs_ = 'dialog-off-canvas-main-canvas')
job2 = job1.find('div', class_ = 'page with-primary-nav hide-more-videos')
job3 = job2.find('div', class_ = 'page__main')
job4 = job3.find('div', class_ = 'page__content')
job5 = job4.find('div', class_ = 'quote-subdetail__content quote-subdetail__content--new')
job6 = job5.findAll('div', class_ = 'layout layout--2-col-large')
job7 = job6.find('div', class_ = 'institutional-holdings institutional-holdings--paginated')
job8 = job7.find('div', class_ = 'institutional-holdings__section institutional-holdings__section--active-positions')
job9 = job8.find('div', class_ = 'institutional-holdings__table-container')
job10 = job9.find('table', class_ = 'institutional-holdings__table')
job11 = job10.find('tbody', class_ = 'institutional-holdings__body')
job12 = job11.findAll('tr', class_ = 'institutional-holdings__row').text
print(job12)
I included nearly every class in the path to try to speed up execution, since including only a couple took up to 10 minutes before I decided to interrupt it. However, I still get the same long execution with no output. Is there something wrong with my code, or can I improve it by doing something I haven't thought of? Thanks.
The data is loaded into the page by JavaScript XHR calls, so it is not in the HTML that requests receives. Here is a way of getting the Active Positions table by calling the API endpoint directly:
import requests
import pandas as pd
url = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings?limit=15&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'
headers = {
    'accept': 'application/json, text/plain, */*',
    'origin': 'https://www.nasdaq.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get(url, headers=headers)
df = pd.json_normalize(r.json()['data']['activePositions']['rows'])
print(df)
Result in terminal:
positions holders shares
0 Increased Positions 1,780 239,170,203
1 Decreased Positions 2,339 209,017,331
2 Held Positions 283 8,965,339,255
3 Total Institutional Shares 4,402 9,413,526,789
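The holders and shares values in this output carry thousands separators, which suggests they arrive as strings and need converting before any arithmetic. A minimal sketch of that conversion, using rows copied from the output above; the helper name `to_int` is made up for illustration, and the assumption is that every value is a plain comma-grouped integer:

```python
rows = [
    {"positions": "Increased Positions", "holders": "1,780", "shares": "239,170,203"},
    {"positions": "Decreased Positions", "holders": "2,339", "shares": "209,017,331"},
    {"positions": "Held Positions", "holders": "283", "shares": "8,965,339,255"},
    {"positions": "Total Institutional Shares", "holders": "4,402", "shares": "9,413,526,789"},
]


def to_int(text: str) -> int:
    """Convert a comma-grouped number such as '1,780' to an int."""
    return int(text.replace(",", ""))


clean = [
    {
        "positions": r["positions"],
        "holders": to_int(r["holders"]),
        "shares": to_int(r["shares"]),
    }
    for r in rows
]

for r in clean:
    print(r["positions"], r["holders"], r["shares"])
```

The cleaned list of dicts can be passed to `pd.DataFrame` in the same way as the raw rows.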
In case you want to scrape the big 4,402 Institutional Holders table, there are ways for that too.
EDIT: Here is how you can save the data to a json file:
df.to_json('active_positions.json')
Although it might make more sense to save it as tabular data (csv):
df.to_csv('active_positions.csv')
Pandas docs: https://pandas.pydata.org/docs/
Here's the link I'm scraping: https://stockanalysis.com/stocks/
I'm trying to get all the rows of the table (6000+ rows), but I only get the first 500 results. I guess it has to do with the setting for how many rows to display.
I have tried almost everything I can think of. I'm also a beginner in web scraping.
My code :
# Importing libraries
import numpy as np # numerical computing library
import pandas as pd # panel data library
import requests # http requests library
from bs4 import BeautifulSoup

url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
# headers must be passed by keyword; the second positional argument is params
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html')
league_table = soup.find('table', class_ = 'symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
    # i = 1
    rows = team.find_all('tr')
    df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
    df.columns = col_df
    for i, row in enumerate(rows):
        s_symbol = row.find_all('td')[0].text
        s_company_name = row.find_all('td')[1].text
        s_industry = row.find_all('td')[2].text
        s_market_cap = row.find_all('td')[3].text
        df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]
len(df) # should > 6000
What should I do?
Take a look down the bottom of the html and you will see this
<script id="__NEXT_DATA__" type="application/json">
Try using bs4 to find this tag and load the JSON inside it; I think it contains everything you need.
As stated, it's in the <script> tags. Pull it and read it in.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])
Output:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]
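The same idea can be tried offline on a tiny inline HTML sample, which makes it easier to see what is being extracted. The sample page and its two rows are made up for illustration. Only the `__NEXT_DATA__` id and the `props` / `pageProps` / `stocks` path come from the answer above, and the short keys `s` and `i` are taken from the printed output.

```python
import json
import re

# Made-up page: a visible table with one row, plus the full data in a script tag.
sample_html = """
<html><body>
<table><tr><td>A</td></tr></table>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"stocks": [
  {"s": "A", "i": "Life Sciences Tools & Services"},
  {"s": "AA", "i": "Metals & Mining"}
]}}}
</script>
</body></html>
"""

# Capture everything between the opening and closing script tag.
match = re.search(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    sample_html,
    flags=re.DOTALL,
)
next_data = json.loads(match.group(1))
stocks = next_data["props"]["pageProps"]["stocks"]
symbols = [row["s"] for row in stocks]
print(symbols)
```

The visible table holds one row while the embedded JSON holds both, which mirrors the 500 versus 6000+ situation in the question.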
New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script is running without errors. However, my Pandas exported csv only contains 1 row with the first extracted value. At the time of this posting, the first value is:
14.01 Acres  Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            }
           )
n_pages = 0
desc = []
for page in range(1,900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r=get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))
import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding = 'utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True) and have tried changing the line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time it loops, the value in desc is replaced, not added on. To add items into the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
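The difference between the two forms shows up in a small offline sketch. The listing strings here are invented stand-ins for what `container.getText(strip=True)` would return:

```python
# Made-up stand-ins for the text of three listing containers.
containers = ["14.01 Acres Vestaburg", "5 Acres Lakeview", "40 Acres Stanton"]

# Assignment: each iteration overwrites the previous value,
# so the name ends up bound to a single string.
desc_overwritten = []
for text in containers:
    desc_overwritten = text

# Append: each iteration adds one item to the list.
desc_collected = []
for text in containers:
    desc_collected.append(text)

print(desc_overwritten)
print(len(desc_collected))
```

This also explains why `len(desc)` in the original print statement reports a character count rather than a number of properties.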
The cause is that nothing is accumulated inside the loop, so only the last value is saved. For testing purposes, the code below only scrapes the first two pages, so adjust the range as needed.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            }
           )
n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])
for page in range(1,3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r=get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            # Add one row per listing to the accumulated frame
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break
all_data.to_csv('test4.csv', encoding = 'utf-8')
# Count rows in all_data; desc is a single string at this point
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))
I am working with lobbying data from opensecrets.org, in particular industry data. I want to build a time series of lobbying expenditures for each industry going back to the 1990s.
I want to scrape the data automatically. The URLs where the data lives have the following format:
https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
These are easy to generate in a loop. The problem is that the data I need is not in an easy format on the page: it is inside a bar graph, and when I inspect the graph I do not know how to get the data, since it is not in the HTML. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
If there is an API, that is your best bet, as mentioned above. But the data can be parsed anyway, provided you use the right URL and query parameters.
I've iterated through the links so you can grab each table. I stored the results in a dictionary with the firm name as the key and the table as the value. You can change that to whatever suits you, for example storing everything as JSON or saving each table as a CSV.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
links = soup.find_all('a', href=True)
root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
# Map each firm name to the URL that serves its chart data
links_dict = {}
for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link

all_tables = {}
n=1
tot = len(links_dict)
for firms, link in links_dict.items():
    print ('%s of %s ---- %s' %(n, tot, firms))
    data = requests.get(link)
    soup = BeautifulSoup(data.text, 'html.parser')
    # Each <set> element in the chart data carries one year's total
    rows = []
    graph = soup.find_all('set')
    for each in graph:
        year = each['label']
        total = each['value']
        rows.append([year, total])
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the rows
    results = pd.DataFrame(rows, columns=['year','$mil'])
    all_tables[firms] = results
    n+=1
Output:
Not going to print as there are 347 tables, but just so you see the structure:
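As an offline illustration of the parsing step in the inner loop only, the sketch below reads `label` and `value` attributes from `set` elements in a small made-up XML snippet. The snippet, its `chart` wrapper and its numbers are invented for illustration; only the `set` tag and the two attribute names come from the code above, and the real response may be shaped differently.

```python
import xml.etree.ElementTree as ET

# Made-up chart data; not copied from the real site.
sample_xml = """
<chart>
  <set label="2017" value="1.2" />
  <set label="2018" value="1.5" />
  <set label="2019" value="0.9" />
</chart>
"""

root = ET.fromstring(sample_xml.strip())
# One (year, amount) pair per <set> element, with the amount parsed as a float.
rows = [(s.get("label"), float(s.get("value"))) for s in root.iter("set")]
by_year = dict(rows)
print(by_year)
```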
I am trying to scrape this website to get the reviews, but I am facing an issue: the page loads only 50 reviews.
To load more you have to click "Show More Reviews". I don't know how to get all the data, as there is no page link and "Show More Reviews" doesn't have a URL to explore; the address remains the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
a = []
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class":"review-comments"})
#print(table)
for x in table:
    a.append(x.text)
df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an AJAX call that returns the additional reviews. All you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []
#Each page equivalent to 50 comments:
MaximumCommentPages = 3
with requests.Session() as session:
    #Send the user-agent defined above with every request in this session
    session.headers.update(headers)
    info = session.get(url)
    #Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)
    #Extract info from main data
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class":"review-comments"})
    for x in table:
        Data.append(x)
    #Number of pages to get:
    #Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while(MaximumCommentPages > 1): # number 1 because one of them was the main page data which we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)
        #Extract info for additional info:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class":"review-comments"})
        for x in table:
            Data.append(x)
#Extract data the old fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save data using your own method.
Edit 3: Changed some stuff; it now gets any number of pages for you and saves to a file with good ol' open().
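The hard-coded page count can also be replaced by a loop that keeps requesting pages until one comes back empty. This is a generic sketch rather than something tested against Capterra: `fetch_page` is a hypothetical callback standing in for the `session.get(...)` call plus the parsing step above, and the `max_pages` safety limit is an assumption.

```python
from typing import Callable, List


def collect_all(fetch_page: Callable[[int], List[str]],
                max_pages: int = 1000) -> List[str]:
    """Call fetch_page(1), fetch_page(2), ... until a page returns no items."""
    items: List[str] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page means there is nothing left
            break
        items.extend(batch)
    return items


# Demo with an in-memory stand-in for the remote endpoint (made-up data).
fake_reviews = [f"review {i}" for i in range(120)]


def fake_fetch(page: int, size: int = 50) -> List[str]:
    """Hypothetical endpoint: returns up to `size` reviews for a 1-based page."""
    start = (page - 1) * size
    return fake_reviews[start:start + size]


all_reviews = collect_all(fake_fetch)
print(len(all_reviews))
```

The stopping rule assumes the endpoint returns an empty page past the end; some sites repeat the last page instead, in which case comparing each batch with the previous one is a safer check.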