Level2StockQuotes.com offers free real-time top of book quotes that I would like to capture in python using BeautifulSoup. The issue is even though I can see the actual data values in a browser inspector, I can't scrape these values into python.
BeautifulSoup returns all data rows with each data element blank. Pandas returns a dataframe with NaN for each data element.
import bs4 as bs
import urllib.request
import pandas as pd

symbol = 'AAPL'
url = 'https://markets.cboe.com/us/equities/market_statistics/book/' + symbol + '/'
page = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(page, 'lxml')

rows = soup.find_all('tr')
print(rows)
for tr in rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

# using pandas to get a dataframe
dfs = pd.read_html(url)
for df in dfs:
    print(df)
Can someone more experienced than I tell me how to pull this data?
Thanks!
The page is dynamic. You'll either need to use Selenium to simulate a browser and let the page render before grabbing the HTML, or you can get the data straight from the JSON XHR.
import requests
import pandas as pd

url = 'https://markets.cboe.com/json/bzx/book/AAPL'

headers = {
    'Referer': 'https://markets.cboe.com/us/equities/market_statistics/book/AAPL/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'}

jsonData = requests.get(url, headers=headers).json()

df_asks = pd.DataFrame(jsonData['data']['asks'], columns=['Shares', 'Price'])
df_bids = pd.DataFrame(jsonData['data']['bids'], columns=['Shares', 'Price'])
df_trades = pd.DataFrame(jsonData['data']['trades'], columns=['Time', 'Price', 'Shares', 'Time_ms'])
df_list = [df_asks, df_bids, df_trades]
for df in df_list:
    print(df)

Output:
Shares Price
0 40 209.12
1 100 209.13
2 200 209.14
3 100 209.15
4 24 209.16
Shares Price
0 200 209.05
1 200 209.02
2 100 209.01
3 200 209.00
4 100 208.99
Time Price Shares Time_ms
0 10:45:57 300 209.0700 10:45:57.936000
1 10:45:57 300 209.0700 10:45:57.936000
2 10:45:55 29 209.1100 10:45:55.558000
3 10:45:52 45 209.0900 10:45:52.265000
4 10:45:52 50 209.0900 10:45:52.265000
5 10:45:52 5 209.0900 10:45:52.265000
6 10:45:51 100 209.1100 10:45:51.902000
7 10:45:48 100 209.1400 10:45:48.528000
8 10:45:48 100 209.1300 10:45:48.528000
9 10:45:48 200 209.1300 10:45:48.528000
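For completeness, the Selenium route mentioned above would look roughly like the sketch below. This is a minimal sketch, assuming Selenium 4 with a local Chrome install; the fixed sleep is a crude stand-in for an explicit WebDriverWait.
from io import StringIO
from time import sleep

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()  # Selenium 4 resolves the driver automatically
driver.get('https://markets.cboe.com/us/equities/market_statistics/book/AAPL/')
sleep(3)  # crude wait for the book to render; WebDriverWait would be more robust
html = driver.page_source
driver.quit()

# the rendered tables now contain the values the static download was missing
for df in pd.read_html(StringIO(html)):
    print(df)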
The database at https://aviation-safety.net/wikibase/ covers the years 1902 to 2022.
I am trying to scrape the table, narrative, probable cause and classification for every accident in 2015 and 2016: https://aviation-safety.net/database/dblist.php?Year=2015. With the code below I am able to scrape the table only:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import concurrent.futures
import itertools
from random import randint
from time import sleep


def scraping(year):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    url = f'https://aviation-safety.net/database/dblist.php?Year={year}&sorteer=datekey&page=1'
    #sleep(randint(1,3))
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    page_container = soup.find('div', {'class': 'pagenumbers'})
    pages = max([int(page['href'].split('=')[-1]) for page in page_container.find_all('a')])
    #info = []
    tl = []
    for page in range(1, pages + 1):
        new_url = f'https://aviation-safety.net/database/dblist.php?Year={year}&lang=&page={page}'
        print(new_url)
        #sleep(randint(1,3))
        data = requests.get(new_url, headers=headers)
        soup = BeautifulSoup(data.text, 'html.parser')
        table = soup.find('table')
        for index, row in enumerate(table.find_all('tr')):
            if index == 0:
                continue
            link_ = 'https://aviation-safety.net/' + row.find('a')['href']
            #sleep(randint(1,3))
            new_page = requests.get(link_, headers=headers)
            new_soup = BeautifulSoup(new_page.text, 'lxml')
            table1 = new_soup.find('table')
            for i in table1.find_all('tr'):
                title = i.text
                tl.append(title)
    df = pd.DataFrame(tl)
    df.columns = ['status']
    df.to_csv(f'{year}_aviation-safety_new.csv', encoding='utf-8-sig', index=False)


if __name__ == "__main__":
    START = 2015
    STOP = 2016
    years = [year for year in range(START, STOP + 1)]
    print(f'Scraping {len(years)} years of data')
    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        final_list = executor.map(scraping, years)
But the data is not organized. The dataframe looks like this:
The outcome should be like this:
It looks like the values of tl are strings, e.g. 'Status:Accident investigation report completed and information captured'.
Converting the list of strings into a pd.DataFrame gets you a single column with all the values in the list.
If you want to use the "name" of the string, e.g. Status as a column header, you'll need to separate it from the rest of the text.
# maxsplit of 1 so we don't accidentally split up the values, e.g. time
title, text = title.split(":", maxsplit=1)
This looks like
('Status', 'Accident investigation report completed and information captured')
Now we create a dictionary
row_dict[title] = text
Giving us
{'Status': 'Accident investigation report completed and information captured'}
We will add to this same dictionary in the last loop
# old
for i in table1.find_all('tr'):
    title = i.text
    tl.append(title)

# new
row_dict = {}
for i in table1.find_all('tr'):
    title = i.text
    title, text = title.split(":", maxsplit=1)
    row_dict[title] = text
After we've gathered all the data from the page, i.e. completed the row_dict loop, we append to tl.
row_dict = {}
for i in table1.find_all('tr'):
    title = i.text
    title, text = title.split(":", maxsplit=1)
    row_dict[title] = text
tl.append(row_dict)
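As a quick illustration of why a list of dicts helps: pandas turns each dict key into a column and each dict into a row. A toy example with made-up values, not the real scraped data:
import pandas as pd

tl = [
    {'Status': 'Accident investigation report completed', 'Date': 'Saturday 3 January 2015'},
    {'Status': 'Preliminary report', 'Date': 'Sunday 4 January 2015'},
]
print(pd.DataFrame(tl))  # two rows, with 'Status' and 'Date' as the columns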
All together now
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import concurrent.futures
import itertools
from random import randint
from time import sleep


def scraping(year):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    url = f'https://aviation-safety.net/database/dblist.php?Year={year}&sorteer=datekey&page=1'
    #sleep(randint(1,3))
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    page_container = soup.find('div', {'class': 'pagenumbers'})
    pages = max([int(page['href'].split('=')[-1]) for page in page_container.find_all('a')])
    #info = []
    tl = []
    for page in range(1, pages + 1):
        new_url = f'https://aviation-safety.net/database/dblist.php?Year={year}&lang=&page={page}'
        print(new_url)
        #sleep(randint(1,3))
        data = requests.get(new_url, headers=headers)
        soup = BeautifulSoup(data.text, 'html.parser')
        table = soup.find('table')
        for index, row in enumerate(table.find_all('tr')):
            if index == 0:
                continue
            link_ = 'https://aviation-safety.net/' + row.find('a')['href']
            #sleep(randint(1,3))
            new_page = requests.get(link_, headers=headers)
            new_soup = BeautifulSoup(new_page.text, 'lxml')
            table1 = new_soup.find('table')

            # make changes here!!!!!!!
            row_dict = {}
            for i in table1.find_all('tr'):
                title = i.text
                title, text = title.split(":", maxsplit=1)
                row_dict[title] = text
            tl.append(row_dict)
    df = pd.DataFrame(tl)
    df.to_csv(f'{year}_aviation-safety_new.csv', encoding='utf-8-sig', index=False)


if __name__ == "__main__":
    START = 2015
    STOP = 2016
    years = [year for year in range(START, STOP + 1)]
    print(f'Scraping {len(years)} years of data')
    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        final_list = executor.map(scraping, years)
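One caveat with the split: if a detail-page row ever lacks a colon, the tuple unpacking raises a ValueError. A defensive variant of the inner loop, as a sketch (skipping such rows and stripping whitespace are my assumptions about what you'd want):
row_dict = {}
for i in table1.find_all('tr'):
    if ':' not in i.text:
        continue  # skip rows that don't follow the 'Name:value' pattern
    title, text = i.text.split(':', maxsplit=1)
    row_dict[title.strip()] = text.strip()
tl.append(row_dict)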
The read_html() method offers convenient access to such datasets.
>>> url = "https://web.archive.org/web/20221027040903/https://aviation-safety.net/database/dblist.php?Year=2015"
>>>
>>> dfs = pd.read_html(url)
>>>
>>> df = dfs[1].drop(columns="operator").dropna(axis=1, how="all")
>>> df["date"] = pd.to_datetime(df.date.str.replace("??-", "01-", regex=False), format="%d-%b-%Y")
>>> df.set_index("date")
type registration fat. location cat
date
2015-01-02 Saab 340B G-LGNL 0 Stornoway Ai... A1
2015-01-03 Antonov An-26B-100 RA-26082 0 Magadan-Soko... A1
2015-01-04 Fokker 50 5Y-SIB 0 Nairobi-Jomo... A1
2015-01-08 Bombardier Challenger 300 PR-YOU 0 São Paulo-Co... O1
2015-01-09 Cessna 208B Grand Caravan 8R-GAB 0 Matthews Rid... A2
... ... ... ... ... ..
2015-06-11 Eclipse 500 N508JA 0 Sacramento-E... A2
2015-06-11 Hawker 800XP N497AG 0 Port Harcour... A1
2015-06-12 Boeing 737-33A VH-NLK 0 near Kosrae Airpo... I2
2015-06-15 Antonov An-2R RA-84553 0 Tatsinsky di... A1
2015-06-16 Boeing 737-322 (WL) LY-FLB 0 Aktau Airpor... O1
[100 rows x 5 columns]
It's hard to control the user-agent header with read_html(), so either use a cooperative site, or do a bit of extra work with requests or curl to obtain the HTML text beforehand.
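That extra work might look like the sketch below: fetch the HTML yourself so you control the user-agent, then hand the text to read_html. The header value is a placeholder, and the live site may still apply other bot checks.
from io import StringIO

import pandas as pd
import requests

url = "https://aviation-safety.net/database/dblist.php?Year=2015"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent

html = requests.get(url, headers=headers, timeout=30).text
dfs = pd.read_html(StringIO(html))  # same parsing as before, but on text we fetched ourselves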
I have a fairly basic Python script that scrapes a property website, and stores the address and price in a csv file. There are over 5000 listings to go through but I find my current code times out after a while (about 2000 listings) and the console shows 302 and CORS policy errors.
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date

url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))

# this script scrapes all pages and records all listings and their prices in daily csv
As you can see I added sleep(randint(1, 5)) to add random intervals but I possibly need to do more. Of course I want to scrape the page in its entirety as quickly as possible but I also want to be respectful to the site that is being scraped and minimise burdening them.
Can anyone suggest updates? PS: forgive rookie errors, I'm very new to Python/scraping!
This is one way of getting that data. Bear in mind there are only 251 pages, with 12 properties on each of them, not over 5k:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}

s = requests.Session()
s.headers.update(headers)

big_list = []

for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))

big_df = pd.DataFrame(big_list, columns=['Property', 'Price', 'Url'])
print(big_df)
Result printed in terminal:
100% 251/251 [03:41<00:00, 1.38it/s]
Property Price Url
0 22 Erinvale Gardens, Belfast, BT10 0FS Asking price£165,000 https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1 Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ Guide price£725,000 https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2 19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH Guide price£265,000 https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3 7b Conway Street, Lisburn, BT27 4AD Offers around£299,950 https://www.propertypal.com/7b-conway-street-lisburn/779833
4 Hartley Hall, Greenisland From£280,000to£397,500 https://www.propertypal.com/hartley-hall-greenisland/d850
... ... ... ...
3007 8 Shimna Close, Newtownards, BT23 4PE Offers around£99,950 https://www.propertypal.com/8-shimna-close-newtownards/756825
3008 7 Barronstown Road, Dromore, BT25 1NT Guide price£380,000 https://www.propertypal.com/7-barronstown-road-dromore/756539
3009 39 Tamlough Road, Randalstown, BT41 3DP Offers around£425,000 https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010 Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY Offers over£180,000 https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011 Walnut Road, Larne, BT40 2WE Offers around£169,950 https://www.propertypal.com/walnut-road-larne/749733
3012 rows × 3 columns
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
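To be gentler on the site and ride out transient errors, the session above can be given retries with backoff plus a short pause between pages. A sketch, reusing the headers dict from the answer; the delay and retry counts are arbitrary choices:
from time import sleep

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
s.headers.update(headers)
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retry))  # retry failed page requests automatically

for x in range(1, 252):
    resp = s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}')
    # ... parse resp.text with BeautifulSoup as in the answer above ...
    sleep(1)  # a one-second pause keeps the request rate modest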
I created a script to scrape Zillow data and it works fine. The only problem I have is that it's limited to 20 pages even though there are many more results. Is there a way to get around this page limitation and scrape all the data?
I also wanted to know if there is a general solution to this problem, since I encounter it on practically every site that I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)

pages_number = 19


def OnePage():
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []
    for i in range(len(all_data)):
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":' + zp_id + ',"contactFormRenderParameter":{"zpid":' + zp_id + ',"platform":"desktop","isDoubleScroll":true}}']["property"]

        home_info = {}
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)

    return result


all_page_property_info = []

for page in range(pages_number):
    property_info_one_page = OnePage()
    search_link = 'https://www.zillow.com/homes/Florida--/' + str(page + 2) + '_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info + property_info_one_page

data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
Actually, you can't grab most of this data from Zillow using bs4 alone, because it is loaded dynamically by JavaScript and bs4 can't render JS; only 6 to 8 data items are static. The full data sits in a script tag, inside an HTML comment, in JSON format. How do you pull the required data? In this case you can follow the next example.
This way you can extract all the items. Pulling the rest of the data items is up to you; just add the fields you need inside the loop.
Zillow is a well-known and fairly sophisticated website, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd

url = 'https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'

lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))

    for item in data['cat1']['searchResults']['listResults']:
        price = item['price']
        lst.append({'price': price})

df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
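If you want more than the price, the same item dict can be mined for other fields. A sketch of the loop body; only detailUrl is confirmed by the question's own code, the address key is an assumption, hence the .get() calls:
for item in data['cat1']['searchResults']['listResults']:
    lst.append({
        'price': item.get('price'),
        'address': item.get('address'),      # assumed key, may be absent
        'detailUrl': item.get('detailUrl'),  # key used in the question's code
    })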
Can't get the row format correct when using pandas read_html(). I'm looking for adjustments either to the method itself or the underlying html (scraped via bs4) to get the desired output.
Current output:
(note it is one row containing two types of data; ideally it should be separated into two rows as below)
Desired:
code to replicate the issue:
import requests
import pandas as pd
from bs4 import BeautifulSoup  # alternatively

url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"

df = pd.read_html(url)[-1]  # last table
df.columns = [str(i) for i in range(len(df.columns))]

# to get the html via bs4
headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]
How to (quickly) fix it with BeautifulSoup
You can create a dict keyed by the table headers and then iterate over each td, appending the list of values stored in its p tags:
data = {}
header = [x.text.strip() for x in table_html.select('tr th')]

for i, td in enumerate(table_html.select('tr:has(td) td')):
    data[header[i]] = [x.text.strip() for x in td.select('p')]

pd.DataFrame.from_dict(data)
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup  # alternatively

url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"

# to get the html via bs4
headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]

data = {}
header = [x.text.strip() for x in table_html.select('tr th')]

for i, td in enumerate(table_html.select('tr:has(td) td')):
    data[header[i]] = [x.text.strip() for x in td.select('p')]

pd.DataFrame.from_dict(data)
Output
        Fighter  Sig. str Sig. str. %      Head    Body       Leg  Distance  Clinch    Ground
0   Joanne Wood  27 of 68         39%   8 of 36  3 of 7  16 of 25  26 of 67  1 of 1    0 of 0
1  Taila Santos  30 of 60         50%  21 of 46  3 of 7    6 of 7  19 of 42  0 of 0  11 of 18
A similar idea: use enumerate to determine the number of rows, but use :-soup-contains to target the table, then an nth-child selector to extract the relevant row inside a list comprehension, with pandas converting the resultant list of lists into a DataFrame. It assumes any additional rows would follow the same pattern as the current two.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('http://ufcstats.com/fight-details/bb15c0a2911043bd')
soup = bs(r.content, 'lxml')

table = soup.select_one(
    '.js-fight-section:has(p:-soup-contains("Significant Strikes")) + table')

df = pd.DataFrame(
    [[i.text.strip() for i in table.select(f'tr:nth-child(1) td p:nth-child({n+1})')]
     for n, _ in enumerate(table.select('tr:nth-child(1) > td:nth-child(1) > p'))],
    columns=[i.text.strip() for i in table.select('th')])

print(df)
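For readability, the nested list comprehension can be unrolled into an explicit loop that builds the same rows. A sketch, assuming the table variable and imports from the answer above:
# one <p> per fighter in the first cell tells us how many rows to build
n_fighters = len(table.select('tr:nth-child(1) > td:nth-child(1) > p'))

rows = []
for n in range(n_fighters):
    # the (n+1)-th <p> of every <td> holds that fighter's value for each column
    cells = table.select(f'tr:nth-child(1) td p:nth-child({n + 1})')
    rows.append([c.text.strip() for c in cells])

df = pd.DataFrame(rows, columns=[th.text.strip() for th in table.select('th')])
print(df)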
I have been working with BeautifulSoup to try and organize some data that I am pulling from a website (HTML). I have been able to boil the data down but am getting stuck on how to:
eliminate info that isn't needed
organize the remaining data so it can be put into a pandas dataframe
Here is the code I am working with:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text)

names = soup.body.findAll('tr')
function_names = re.findall('th class="\w+', str(names))
function_names = [item[10:] for item in function_names]

description = soup.body.findAll('td')
#description = re.findall('td class="\w+', str(description))

data = pd.DataFrame({'Title': function_names, 'Info': description})
The error I have been getting is that the array lengths don't match up, which I know to be true, but when I uncomment the second description line it removes the numbers I want, and even then the table isn't organized properly.
What I would like the output to look like is:
(headers) title: location | studio | 1 BR | 2 BR | 3 BR
(new line) data : Lehi, UT| $1,335 |$1,309|$1,454|$1,580
That is really all that I need but I can't get BS or Pandas to do it properly.
Any help would be greatly appreciated!
Try the following approach. It first extracts all of the data in the table and then transposes it (columns swapped with rows):
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text, 'lxml')

table = soup.find("table", class_="rentTrendGrid")
rows = []
for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all(['th', 'td'])])

#header_row = rows[0]
rows = list(zip(*rows[1:]))  # transpose the table
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Giving you the following kind of output:
Studio 1 BR 2 BR 3 BR
0 0 729 1,041 1,333
1 $1,335 $1,247 $1,464 $1,738
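The key step in that answer is zip(*rows), which groups the i-th element of every row into a new tuple, i.e. it swaps rows and columns. A tiny standalone illustration with made-up values:
rows = [['Studio', '1 BR', '2 BR'],
        ['729', '1,041', '1,333'],
        ['$1,335', '$1,247', '$1,464']]

print(list(zip(*rows)))
# [('Studio', '729', '$1,335'), ('1 BR', '1,041', '$1,247'), ('2 BR', '1,333', '$1,464')]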