I can't get the row format correct when using pandas read_html(). I'm looking for adjustments either to the method itself or to the underlying html (scraped via bs4) to get the desired output.
Current output: a single row containing two kinds of data (both fighters' values mixed together in each cell); ideally it should be separated into two rows, one per fighter.
Desired: one row per fighter, i.e. the shape shown in the Output table further down.
Code to replicate the issue:
import requests
import pandas as pd
from bs4 import BeautifulSoup  # for the bs4 route further down

url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"
df = pd.read_html(url)[-1]  # last table on the page
df.columns = [str(i) for i in range(len(df.columns))]

# to get the html via bs4
headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]
How to (quickly) fix it with BeautifulSoup
You can build a dict keyed by the table headers and then iterate over each td, appending the list of values stored in its p tags:
data = {}
header = [x.text.strip() for x in table_html.select('tr th')]

for i, td in enumerate(table_html.select('tr:has(td) td')):
    data[header[i]] = [x.text.strip() for x in td.select('p')]

pd.DataFrame.from_dict(data)
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"

# to get the html via bs4
headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]

data = {}
header = [x.text.strip() for x in table_html.select('tr th')]

for i, td in enumerate(table_html.select('tr:has(td) td')):
    data[header[i]] = [x.text.strip() for x in td.select('p')]

pd.DataFrame.from_dict(data)
Output
Fighter        Sig. str   Sig. str. %   Head       Body     Leg        Distance   Clinch   Ground
Joanne Wood    27 of 68   39%           8 of 36    3 of 7   16 of 25   26 of 67   1 of 1   0 of 0
Taila Santos   30 of 60   50%           21 of 46   3 of 7   6 of 7     19 of 42   0 of 0   11 of 18
A similar idea: use enumerate to determine the number of rows, but use :-soup-contains to target the table, then an nth-child selector to extract the relevant row inside a list comprehension, and pandas to convert the resulting list of lists into a DataFrame. This assumes rows are added in the same pattern as the current two.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('http://ufcstats.com/fight-details/bb15c0a2911043bd')
soup = bs(r.content, 'lxml')

table = soup.select_one(
    '.js-fight-section:has(p:-soup-contains("Significant Strikes")) + table')

df = pd.DataFrame(
    [[i.text.strip() for i in table.select(f'tr:nth-child(1) td p:nth-child({n+1})')]
     for n, _ in enumerate(table.select('tr:nth-child(1) > td:nth-child(1) > p'))],
    columns=[i.text.strip() for i in table.select('th')])

print(df)
Related
I want to scrape the product images of each product from https://society6.com/art/i-already-want-to-take-a-nap-tomorrow-pink.
Step 1: I go into the div with class_='card_card__l44w' (which holds each product link).
Step 2: I parse the href of each product.
But it only returns the first 15 product links instead of all 44.
The second issue: when I parse each product link and grab the JSON from ['product']['response']['product']['data']['attributes']['media_map'], under the media_map key there are many other keys like b, c, d, e, f, g (all containing a src: with an image link); I only want the .jpg image from every key.
Below is my code:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://society6.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

r = requests.get('https://society6.com/art/flamingo-cone501586', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

productslist = soup.find_all('div', class_='card_card__l44w')

productlinks = []
for item in productslist:
    for link in item.find_all('a', href=True):
        productlinks.append(baseurl + link['href'])

newlist = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    scripts = soup.find_all('script')[9].text.strip()[24:]
    data = json.loads(scripts)
    url = data['product']['response']['product']['data']['attributes']['media_map']
    detail = {
        'links': url
    }
    newlist.append(detail)
    print('saving')

df = pd.DataFrame(newlist)
df.to_csv('haja.csv')
All the information is loaded on the first visit, and all 66 products are stored in window.__INITIAL_STATE.
If you scroll almost to the end of the page source you can see it.
You can use that to parse the information.
import re
import json

data = json.loads((soup
                   .find("script", text=re.compile("^window.__INITIAL_STATE"))
                   .text
                   .replace("</script>", "")
                   .replace("window.__INITIAL_STATE = ", "")))

products = data["designDetails"]["response"]["designDetails"]["data"]["products"]
products is a list with 66 items. Example:
{'sku': 's6-7120491p92a240v826',
'retail_price': 29,
'discount_price': 24.65,
'image_url': 'https://ctl.s6img.com/society6/img/yF7u4l5D3MODQBBerUQBHdYsfN8/h_264,w_264/acrylic-boxes/small/top/~artwork,fw_1087,fh_1087,fx_-401,fy_-401,iw_1889,ih_1889/s6-original-art-uploads/society6/uploads/misc/f7916751f46d4d9c9fb7f6fe4e5d5729/~~/flamingo-cone501586-acrylic-boxes.jpg',
'product_type': {'id': 92,
'title': 'Acrylic Box',
'slug': 'acrylic-box',
'slug_plural': 'acrylic-boxes'},
'department': {'id': 83,
'title': 'Office'},
'sort': 0}
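If, as in the original code, the goal is a CSV of just the .jpg image links, a minimal sketch building on the products list above (assuming each item carries an image_url ending in .jpg, like the example) could be:
import pandas as pd

image_links = [
    {"links": p["image_url"]}                   # one .jpg link per product
    for p in products
    if p.get("image_url", "").endswith(".jpg")  # keep only .jpg images
]
pd.DataFrame(image_links).to_csv("haja.csv", index=False)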
I wrote some code to scrape Zillow data and it works fine. The only problem is that it's limited to 20 pages even though there are many more results. Is there a way to get around this page limitation and scrape all the data?
I'd also like to know whether there is a general solution to this problem, since I run into it on practically every site I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)

pages_number = 19

def OnePage():
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []
    for i in range(len(all_data)):
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":'+zp_id+',"contactFormRenderParameter":{"zpid":'+zp_id+',"platform":"desktop","isDoubleScroll":true}}']["property"]

        home_info = {}
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)

    return result

data = pd.DataFrame()
all_page_property_info = []

for page in range(pages_number):
    property_info_one_page = OnePage()
    search_link = 'https://www.zillow.com/homes/Florida--/'+str(page+2)+'_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info + property_info_one_page

data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
Actually, you can't grab most of this data from Zillow's rendered HTML with bs4, because it is loaded dynamically by JavaScript and bs4 can't render JS; only 6 to 8 data items are static. However, all of the data sits in a script tag, inside an HTML comment, as JSON. To pull the required data you can follow the next example.
This way you can extract all of the items; pulling the rest of the data fields is just a matter of adding them in the same way.
Zillow is a well-known and fairly sophisticated website, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd

url = 'https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'

lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
    for item in data['cat1']['searchResults']['listResults']:
        price = item['price']
        lst.append({'price': price})

df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
I'm trying to get the transfer history of the top 500 most valuable players on Transfermarkt. I've managed (with some help) to loop through each player's profile and scrape the image and name. Now I want the transfer history, which can be found in a table on each player's profile: Player Profile
I want to save the table in a DataFrame using pandas and then write it to a CSV, with Season, Date, etc. as headers. For Monaco and PSG, for example, I just want the names of the clubs, not pictures or nationality. But right now, all I get is this:
Empty DataFrame
Columns: []
Index: []
Expected output:
Season Date Left Joined MV Fee
0 18/19 Jul 1, 2018 Monaco PSG 120.00m 145.00m
I've viewed the source and inspected the page, but can't find anything that helps me, apart from the tbody and tr. The way I'm doing it, I want to target that specific table, since there are several others on the page.
This is my code:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []

def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
            result.extend([
                {
                    "Season": t[1].text.strip()
                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])

df = pd.DataFrame(result)
print(df)
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    with requests.Session() as req:
        links = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            urls = [f"{url[:29]}{item.get('href')}" for item in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            ns = [item.text for item in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
    return links, names

def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            df = pd.read_html(r.content)[1]
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)

parser()
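If the goal is one CSV containing every player's transfer history, a small variation of parser() could collect the frames and concatenate them. This is a sketch reusing links/names from main() above; the 'Player' column and the output filename are just placeholders:
def parser_to_csv():
    links, names = main(site)
    frames = []
    with requests.Session() as req:
        for link, name in zip(links, names):
            r = req.get(link, headers=headers)
            df = pd.read_html(r.content)[1]  # transfer-history table, as above
            df["Player"] = name              # placeholder column tying rows to the player
            frames.append(df)
    pd.concat(frames, ignore_index=True).to_csv("transfers.csv", index=False)

parser_to_csv()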
Level2StockQuotes.com offers free real-time top-of-book quotes that I would like to capture in Python using BeautifulSoup. The issue is that even though I can see the actual data values in a browser inspector, I can't scrape those values into Python.
BeautifulSoup returns all the data rows, but each data element is blank. Pandas returns a DataFrame with NaN for each data element.
import bs4 as bs
import urllib.request
import pandas as pd

symbol = 'AAPL'
url = 'https://markets.cboe.com/us/equities/market_statistics/book/' + symbol + '/'

page = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(page, 'lxml')

rows = soup.find_all('tr')
print(rows)

for tr in rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

# using pandas to get dataframe
dfs = pd.read_html(url)
for df in dfs:
    print(df)
Can someone more experienced than I tell me how to pull this data?
Thanks!
The page is dynamic. You'll either need to use Selenium to simulate a browser and let the page render before grabbing the html, or you can get the data straight from the json XHR.
import requests
import pandas as pd
from pandas.io.json import json_normalize

url = 'https://markets.cboe.com/json/bzx/book/AAPL'

headers = {
    'Referer': 'https://markets.cboe.com/us/equities/market_statistics/book/AAPL/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'}

jsonData = requests.get(url, headers=headers).json()

df_asks = pd.DataFrame(jsonData['data']['asks'], columns=['Shares','Price'])
df_bids = pd.DataFrame(jsonData['data']['bids'], columns=['Shares','Price'])
df_trades = pd.DataFrame(jsonData['data']['trades'], columns=['Time','Price','Shares','Time_ms'])
Output:
df_list = [df_asks, df_bids, df_trades]
for df in df_list:
    print(df)
Shares Price
0 40 209.12
1 100 209.13
2 200 209.14
3 100 209.15
4 24 209.16
Shares Price
0 200 209.05
1 200 209.02
2 100 209.01
3 200 209.00
4 100 208.99
Time Price Shares Time_ms
0 10:45:57 300 209.0700 10:45:57.936000
1 10:45:57 300 209.0700 10:45:57.936000
2 10:45:55 29 209.1100 10:45:55.558000
3 10:45:52 45 209.0900 10:45:52.265000
4 10:45:52 50 209.0900 10:45:52.265000
5 10:45:52 5 209.0900 10:45:52.265000
6 10:45:51 100 209.1100 10:45:51.902000
7 10:45:48 100 209.1400 10:45:48.528000
8 10:45:48 100 209.1300 10:45:48.528000
9 10:45:48 200 209.1300 10:45:48.528000
I'm using BeautifulSoup to try to get the whole table of all 2000 companies from this URL:
https://www.forbes.com/global2000/list/#tab:overall.
This is the code I have written:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

html_content = urllib.request.urlopen('https://www.forbes.com/global2000/list/#header:position')
soup = BeautifulSoup(html_content, 'lxml')

table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0, 7), index=[0])

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

new_table
In the result, I get only the names of the columns, but not the table itself.
How can I get the whole table?
The content is generated via JavaScript, so you must either use Selenium to mimic a browser and its scroll movements and then parse the page source with BeautifulSoup, or, in some cases like this one, you can access those values by querying the site's ajax API:
import requests
import json

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
target = 'https://www.forbes.com/ajax/list/data?year=2017&uri=global2000&type=organization'

with requests.Session() as s:
    s.headers = headers
    data = json.loads(s.get(target).text)

print([x['name'] for x in data[:5]])
Output (first 5 items):
['3M', '3i Group', '77 Bank', 'AAC Technologies Holdings', 'ABB']
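Since each element of data is a dict describing one organization, the whole list can also be dropped straight into a DataFrame (a sketch; the exact columns depend on the fields the endpoint returns):
import pandas as pd

df = pd.DataFrame(data)  # one row per company, one column per JSON field
print(df.shape)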