In particular, I'm trying to scrape this website.
I would like to set the button menu item to "50" rows per page.
My current code is the following:
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@class='btn btn-default dropdown-toggle']")))).select_by_visible_text('50')
Where is my mistake? Can you help me?
Thank you in advance for your time!
Select() only works on real <select> elements; this control is a Bootstrap dropdown, so you have to click the button first and then click the menu item. This should work for your case:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

s = Service('/path/to/geckodriver')  # adjust to your geckodriver location
driver = webdriver.Firefox(service=s)
driver.get('https://whalewisdom.com/filer/fisher-asset-management-llc#tabholdings_tab_link')
# Open the Bootstrap dropdown, then click the "50" menu item
button = driver.find_element(By.CSS_SELECTOR, '.btn-group.dropdown')
button.click()
element = driver.find_element(By.XPATH, '//li[@role="menuitem"]/a[contains(text(), "50")]')
element.click()
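If the page renders the menu slowly, here is a hedged variant of the same two clicks wrapped in explicit waits (same locators as above):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for each element to be clickable before clicking it
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-group.dropdown'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//li[@role="menuitem"]/a[contains(text(), "50")]'))).click()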
You can try this simpler approach, which doesn't need Selenium: it calls the site's data API directly with requests.
Note the limit argument at the end of the query string, which sets the page size to 50 rows, as you want. If you want to scrape the next 50 items, just increase offset to 50, then 100, 150, etc. (see the sketch after the printout); this will get you all the available data.
import requests
import pandas as pd
import json
url = "https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking=&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc&offset=0&limit=50"
raw = requests.get(url)
data = json.loads(raw.content)
df = pd.DataFrame(data["rows"])
df.head()
Print out:
symbol permalink security_type name sector industry current_shares previous_shares shares_change position_change_type ... percent_ownership quarter_first_owned quarter_id_owned source_type source_date filing_date avg_price recent_price quarter_end_price id
0 AAPL aapl SH Apple Inc INFORMATION TECHNOLOGY COMPUTERS & PERIPHERALS 8.909234e+08 8.871356e+08 3787856.0 addition ... 5.5045625 Q1 2016 61 13F 2022-03-31 2022-05-16 36.6604 160.01 174.61 None
1 BAC bac SH Bank of America Corp. (North Carolina National... FINANCE BANKS 1.010101e+09 1.010101e+09 0.0 None ... 12.5371165 Q3 2017 67 13F 2022-03-31 2022-05-16 25.5185 33.04 41.22 None
2 AXP axp SH American Express Co FINANCE CONSUMER FINANCE 1.516107e+08 1.516107e+08 0.0 None ... 20.1326115 Q1 2001 1 13F 2022-03-31 2022-05-16 39.3110 151.60 187.00 None
3 CVX cvx SH Chevron Corp. (Standard Oil of California) ENERGY INTEGRATED OIL & GAS 1.591781e+08 3.824504e+07 120933081.0 addition ... 8.1014366 Q4 2020 80 13F 2022-03-31 2022-05-16 125.3424 159.14 162.83 None
4 KO ko SH Coca Cola Co. CONSUMER STAPLES
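Here is a minimal sketch of the offset loop described above; the stop condition (an empty "rows" list) is an assumption about the API's behavior:
import requests
import pandas as pd

# Hypothetical pager: bump offset by 50 until the API returns no rows
base = ("https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc"
        "&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking="
        "&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc"
        "&offset={}&limit=50")
frames = []
offset = 0
while True:
    rows = requests.get(base.format(offset)).json()["rows"]
    if not rows:  # assumed: an empty page means all holdings were fetched
        break
    frames.append(pd.DataFrame(rows))
    offset += 50
df_all = pd.concat(frames, ignore_index=True)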
I am trying to extract this immigration data from the Canada immigration website.
I tried extracting the whole table and individual td tags, but everything returns an empty list.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import requests

s = HTMLSession()
data = []
url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"

def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def gettabledata(soup):
    table = soup.find_all('td', {'class': 'sorting_1'})
I also tried finding tr tags in the table:
table = soup.find('table', class_='table')
print("\n TABLE \n")
print(table)
table_body = table.find('tbody')
print("\n TABLEBODY \n")
print(table_body)
rows = table_body.find_all('tr')
print("\n ROWS \n")
print(rows)
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
print(data)
I also tried the following, but it doesn't work:
divs = soup.find_all('div', class_='dataTables_wrapper')
print("divs \n", divs)
for div in divs:
    table = div.find('tbody', class_='wb-data-json-inited')
    print("\n TABLE \n")
    print(table)
    table_body = table.find('tbody')
    print("\n TABLEBODY \n")
    print(table_body)
    rows = table.find_all('tr')
    print("\n ROWS \n")
    print(rows)
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
print(data)
What am I missing, and how can I extract the table data?
Another question I have is how to access the program name and values that are enclosed in td tags with no attributes. Each row in the table is a tr tag with a number of td tags inside it:
<tr>
<td>219</td>
<td class="nowrap">March 30, 2022</td>
<td>Provincial Nominee Program</td>
<td>919</td>
<td>785</td>
<td>30-03-2022</td>
<td><p>Provincial Nominee Program</p></td>
</tr>
That table is loaded dynamically on the page, which is why your parsers return empty lists: the HTML you fetch doesn't contain it yet. You can inspect the Network tab in your browser's Dev Tools to see which API is being called to pull the data.
This is one way of obtaining the data in that table:
import requests
import pandas as pd
# The JSON feed that populates the table
r = requests.get('https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json')
df = pd.DataFrame(r.json()['rounds'])
print(df)
This will return a dataframe:
drawNumber drawNumberURL drawDate drawDateFull drawName drawSize drawCRS mitext DrawText1 drawText2 drawDateTime drawCutOff drawDistributionAsOn dd1 dd2 dd3 dd4 dd5 dd6 dd7 dd8 dd9 dd10 dd11 dd12 dd13 dd14 dd15 dd16 dd17 dd18
0 229 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=229'>229</a> 2022-08-17 August 17, 2022 No Program Specified 2,250 525 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=229'>Invitations to apply for permanent residence under the Express Entry system #229</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=229'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program August 17, 2022 at 13:43:47 UTC December 28, 2021 at 11:03:15 UTC August 15, 2022 538 8,221 62,753 5,435 9,129 18,831 16,465 12,893 58,113 12,200 12,721 9,801 11,138 12,253 68,440 35,745 5,137 238,947
1 228 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=228'>228</a> 2022-08-03 August 3, 2022 No Program Specified 2,000 533 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=228'>Invitations to apply for permanent residence under the Express Entry system #228</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=228'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program August 03, 2022 at 15:16:24 UTC January 06, 2022 at 14:29:50 UTC August 2, 2022 640 8,975 62,330 5,343 9,044 18,747 16,413 12,783 57,987 12,101 12,705 9,747 11,117 12,317 68,325 35,522 5,145 238,924
2 227 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=227'>227</a> 2022-07-20 July 20, 2022 No Program Specified 1,750 542 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=227'>Invitations to apply for permanent residence under the Express Entry system #227</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=227'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program July 20, 2022 at 16:32:49 UTC December 30, 2021 at 15:29:35 UTC July 18, 2022 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 226 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=226'>226</a> 2022-07-06 July 6, 2022 No Program Specified 1,500 557 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=226'>Invitations to apply for permanent residence under the Express Entry system #226</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=226'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program July 6, 2022 at 14:34:34 UTC November 13, 2021 at 02:20:46 UTC July 11, 2022 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 225 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=225'>225</a> 2022-06-22 June 22, 2022 Provincial Nominee Program 636 752 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=225'>Invitations to apply for permanent residence under the Express Entry system #225</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations.html?q=225'>See full text of Ministerial Instruction</a> Provincial Nominee Program June 22, 2022 at 14:13:57 UTC April 19, 2022 at 13:45:45 UTC June 20, 2022 664 8,017 55,917 4,246 7,845 16,969 15,123 11,734 53,094 10,951 11,621 8,800 10,325 11,397 64,478 33,585 4,919 220,674
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
225 5 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-5.html'>5</a> 2015-03-20 March 20, 2015 No Program Specified 1,620 481 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-5.html'>Invitations to apply for permanent residence under the Express Entry system #5</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-5.html'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program March 20, 2015 at 23:59:53 UTC March 20, 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
226 4 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-4.html'>4</a> 2015-02-27 February 27, 2015 No Program Specified 1,187 735 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-4.html'>Invitations to apply for permanent residence under the Express Entry system #4</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-4.html'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program February 27, 2015 at 23:59:51 UTC February 27, 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
227 3 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-3.html'>3</a> 2015-02-20 February 20, 2015 Canadian Experience Class 849 808 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-3.html'>Invitations to apply for permanent residence under the Express Entry system #3</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-3.html'>See full text of Ministerial Instruction</a> Canadian Experience Class February 20, 2015 at 11:59:47 UTC February 20, 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
228 2 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-2.html'>2</a> 2015-02-07 February 7, 2015 No Program Specified 779 818 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-2.html'>Invitations to apply for permanent residence under the Express Entry system #2</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-2.html'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program February 7, 2015 at 11:59:59 UTC February 7, 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
229 1 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-1.html'>1</a> 2015-01-31 January 31, 2015 No Program Specified 779 886 <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-1.html'>Invitations to apply for permanent residence under the Express Entry system #1</a> <a href='/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds/invitations-1.html'>See full text of Ministerial Instruction</a> Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program January 31, 2015 at 11:59:48 UTC January 31, 2015 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
230 rows × 31 columns
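For the second question about cells enclosed in plain td tags: with BeautifulSoup you can address the td tags by position within each row. Here is a minimal sketch against the HTML snippet in the question (the column positions are an assumption based on that snippet):
from bs4 import BeautifulSoup

row_html = """<tr><td>219</td><td class="nowrap">March 30, 2022</td>
<td>Provincial Nominee Program</td><td>919</td><td>785</td>
<td>30-03-2022</td><td><p>Provincial Nominee Program</p></td></tr>"""
row = BeautifulSoup(row_html, 'html.parser').tr
cells = row.find_all('td')
program = cells[2].get_text(strip=True)  # third cell holds the program name
values = [c.get_text(strip=True) for c in cells]
print(program, values)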
When I am scraping a table from a website, it is missing the bottom 5 rows of data and I do not know how to pull them. I am using a combination of BeautifulSoup and Selenium. I thought that they were not loading, so I tried scrolling to the bottom with Selenium, but that still did not work.
Code trials:
import pandas as pd
import bs4 as bs
from selenium import webdriver

site = 'https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One'
PATH = my_path
driver = webdriver.Chrome(PATH)
driver.get(site)
webpage = bs.BeautifulSoup(driver.page_source, features='html.parser')
table = webpage.find('table', {'class': 'stats_table sortable min_width now_sortable'})
print(table.prettify())
df = pd.read_html(str(table))[0]
print(df.tail())
Please could you help with scraping the full table?
To pull all the rows from the table using only Selenium, you need to induce WebDriverWait for visibility_of_element_located(), then hand the table's outerHTML to pandas' read_html(). You can use either of the following locator strategies:
Using CSS_SELECTOR:
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.stats_table.sortable.min_width.now_sortable"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
Using XPATH:
driver.get('https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One')
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='stats_table sortable min_width now_sortable']"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
[ Round Wk Day ... Referee Match Report Notes
0 Regular Season 1 Sat ... Charles Breakspear Match Report NaN
1 Regular Season 1 Sat ... Andrew Davies Match Report NaN
2 Regular Season 1 Sat ... Kevin Johnson Match Report NaN
3 Regular Season 1 Sat ... Anthony Backhouse Match Report NaN
4 Regular Season 1 Sat ... Marc Edwards Match Report NaN
.. ... ... ... ... ... ... ...
685 Semi-finals NaN Tue ... Robert Madley Match Report Leg 1 of 2
686 Semi-finals NaN Wed ... Craig Hicks Match Report Leg 1 of 2
687 Semi-finals NaN Fri ... Keith Stroud Match Report Leg 2 of 2; Blackpool won
688 Semi-finals NaN Sat ... Michael Salisbury Match Report Leg 2 of 2; Lincoln City won
689 Final NaN Sun ... Tony Harrington Match Report NaN
[690 rows x 13 columns]]
I would like to automatically download the table from this link:
https://www.barchart.com/options/iv-rank-percentile/stocks
In order to do this, with the help of some tutorials, I wrote this piece of code:
# Import libraries
from urllib.request import Request, urlopen
import requests
from bs4 import BeautifulSoup as soup
# Set the URL you want to webscrape from
url = 'https://www.barchart.com/options/iv-rank-percentile/stocks?viewName=main'
# Connect to the URL
response = requests.get(url)
print(response)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
print(req)
# Parse HTML and save to BeautifulSoup object
page_soup = soup(webpage, "html.parser")
#print(page_soup)
containers = page_soup.findAll("a", "toolbar-button download")
for container in containers:
    print(container)
    url = container.get('href')
    print(url)
The results I printed are the following:
<Response [403]>
<urllib.request.Request object at 0x030766F0>
<a class="toolbar-button download" data-bc-download-button=" Stocks IV Rank and IV Percentile "> <i class="bc-glyph-download"></i> <span>download</span></a>
None
It seems that I'm not able to find the 'href'.
At this point, I'm struggling a bit with the next steps, because I don't really know how I can download the file (since the 'href' can't be found).
Maybe someone can help, or propose another solution?
Thanks a lot in advance,
Marketwizard
The data is loaded dynamically via JavaScript from a different URL. You can use this example to load the data:
import json
import requests
from urllib.parse import unquote

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
}

url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?list=options.mostActive.us&fields=symbol,symbolName,lastPrice,priceChange,percentChange,optionsTotalVolume,optionsWeightedImpliedVolatility,optionsImpliedVolatilityRank1y,optionsImpliedVolatilityPercentile1y,optionsWeightedImpliedVolatilityHigh1y,tradeTime,symbolCode,symbolType,hasOptions&between(lastPrice,.10,)=&between(tradeTime,2021-03-22,2021-03-23)=&orderBy=optionsTotalVolume&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1"

with requests.Session() as s:
    # get all cookies
    s.get(
        "https://www.barchart.com/options/iv-rank-percentile/stocks",
        headers=headers,
    )
    # use one cookie as HTTP header
    headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
    data = s.get(url, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data["data"]:
    print("{:<8}{:<50}{}".format(d["symbol"], d["symbolName"], d["lastPrice"]))
Prints:
AAPL Apple Inc 123.39
TSLA Tesla Inc 670.00
FB Facebook Inc 293.54
AMC Amc Entertainment Holdings Inc 12.49
PLTR Palantir Technologies Inc Cl A 24.22
NIO Nio Inc 42.94
AMD Adv Micro Devices 80.30
F Ford Motor Company 12.85
SNDL Sundial Growers Inc 1.3000
BAC Bank of America Corp 37.66
MSFT Microsoft Corp 235.99
BABA Alibaba Group Holding 237.12
BA Boeing Company 251.23
GE General Electric Company 13.13
AAL American Airlines Gp 23.83
DKNG Draftkings Inc 71.72
WFC Wells Fargo & Company 38.97
AMZN Amazon.com Inc 3,110.87
GM General Motors Company 58.10
INTC Intel Corp 65.63
GME Gamestop Corp 194.49
SNAP Snap Inc 58.16
SOS Sos Ltd 6.90
PFE Pfizer Inc 36.00
NOK Nokia Corp 4.06
T AT&T Inc 29.99
CCL Carnival Corp 27.48
NVDA Nvidia Corp 527.11
MARA Marathon Digital Hldgs Inc 39.97
FTCH Farfetch Ltd Cl A 62.00
UBER Uber Technologies Inc 55.69
TLRY Tilray Inc 23.90
DIS Walt Disney Company 192.86
FCEL Fuelcell Energy Inc 15.04
QS Quantumscape Corp 64.29
SQ Square 226.13
CCIV Churchill Capital IV Cl A 26.15
V Visa Inc 208.00
CSCO Cisco Systems Inc 50.30
XOM Exxon Mobil Corp 55.91
FCX Freeport-Mcmoran Inc 35.01
JPM JP Morgan Chase & Company 150.97
PLUG Plug Power Inc 38.91
NFLX Netflix Inc 523.11
VALE Vale S.A. 17.01
TEVA Teva Pharmaceutical Industries Ltd 11.93
CLF Cleveland-Cliffs Inc 15.86
MU Micron Technology 91.27
BOX Box Inc 23.65
TSM Taiwan Semiconductor Manufacturing 117.18
RIOT Riot Blockchain Inc 56.01
BLNK Blink Charging Company 40.66
VZ Verizon Communications Inc 56.59
UAL United Airlines Holdings Inc 58.33
QCOM Qualcomm Inc 134.09
CLVS Clovis Oncology Inc 7.47
RLX Rlx Technology Inc ADR 10.15
LUMN Centurylink 14.37
WMT Wal-Mart Stores 132.37
TWTR Twitter Inc 65.21
NCLH Norwegian Cruise Ord 28.65
GOOGL Alphabet Cl A 2,030.69
C Citigroup Inc 71.96
JD Jd.com Inc Ads 84.97
BB Blackberry Ltd 10.71
X United States Steel Corp 21.79
RKT Rocket Companies Inc Cl A 22.99
PDD Pinduoduo Inc ADR 137.15
NLY Annaly Capital Management Inc 8.92
FUBO Fubotv Inc 31.53
MO Altria Group 51.64
DASH Doordash Inc Cl A 135.91
UWMC Uwm Hldg Corp 8.78
KSS Kohl's Corp 58.74
DAL Delta Air Lines Inc 47.97
NKLA Nikola Corp 15.55
LYFT Lyft Inc Cl A 64.13
WKHS Workhorse Grp 15.63
PENN Penn Natl Gaming Inc 113.16
CRM Salesforce.com Inc 215.17
XPEV Xpeng Inc ADR 37.88
BCRX Biocryst Pharma Inc 11.80
ET Energy Transfer LP 8.10
PTON Peloton Interactive Inc 109.54
BIDU Baidu Inc 266.13
NKE Nike Inc 138.27
PSTH Pershing Square Tontine Hldgs Cl A 25.89
ACB Aurora Cannabis Inc 9.70
PYPL Paypal Holdings 244.38
TME Tencent Music Entertainment Group ADR 30.87
CAN Canaan Inc ADR 22.97
GOLD Barrick Gold Corp 20.62
SPCE Virgin Galactic Holdings Inc 32.24
ZM Zoom Video Communications Cl A 328.50
NNDM Nano Dimension Ads 9.83
CVX Chevron Corp 102.54
SPRT Support.com Inc 7.10
OXY Occidental Petroleum Corp 27.46
COST Costco Wholesale 334.49
USAT USA Technologies Inc 12.45
Thank you for providing this example!
It's working; however, I have two questions:
Where does this URL come from?
url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?list=options.mostActive.us&fields=symbol,symbolName,lastPrice,priceChange,percentChange,optionsTotalVolume,optionsWeightedImpliedVolatility,optionsImpliedVolatilityRank1y,optionsImpliedVolatilityPercentile1y,optionsWeightedImpliedVolatilityHigh1y,tradeTime,symbolCode,symbolType,hasOptions&between(lastPrice,.10,)=&between(tradeTime,2021-03-22,2021-03-23)=&orderBy=optionsTotalVolume&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1"
Under the link https://www.barchart.com/options/iv-rank-percentile/stocks?viewName=main there are several pages of results (5 pages with 100 results each), and there is the option to click the "show all" button (to show all 500 results). In the example you provided, only the content of the first page is read. How can I access the full results?
Concerning the "download" button at the top right-hand side of the table, is it possible to "click" on it to download the table?
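For the pagination question, here is a hedged sketch that assumes the API honors the page and limit parameters visible in its query string the same way the on-site pager does:
import requests
from urllib.parse import unquote

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"}
all_rows = []
with requests.Session() as s:
    s.get("https://www.barchart.com/options/iv-rank-percentile/stocks", headers=headers)
    headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
    for page in range(1, 6):  # assumed: 5 pages of 100 rows, per the question
        paged = url.replace("page=1", "page={}".format(page))  # url as defined in the answer above
        all_rows.extend(s.get(paged, headers=headers).json()["data"])
print(len(all_rows))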
I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK Parliament. This is the precise link I'm trying to scrape: in a nutshell, I want to scrape every "mention" container on this page and the 50 pages that follow.
But I find that when my scraper is "finished," it has only collected data on 990 containers, not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought this was perhaps because I wasn't giving the URLs enough time to load, so I added some delays to the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
import time
import numpy as np
import pandas as pd
from random import randint
from requests import get
from bs4 import BeautifulSoup

topics, houses, names, dates = [], [], [], []  # result lists (initialisation missing from the excerpt)
pages = np.arange(0, 52)
for page in pages:
    hansard_url = "https://hansard.parliament.uk/search/Contributions?searchTerm=%22civilian%20casualties%22&startDate=01%2F01%2F1988%2000%3A00%3A00&endDate=07%2F14%2F2020%2000%3A00%3A00"
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    page = get(full_url)
    html_soup = BeautifulSoup(page.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2, 10))
    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")
        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)
        date = mention.find('div', class_="").text
        dates.append(date)
    time.sleep(randint(2, 10))

hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics})
print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())
hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in solving this problem is appreciated.
The server returns an empty container on page 48, so there are 1000 results in total from pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://hansard.parliament.uk/search/Contributions'
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}

all_data = []
for page in range(1, 52):
    params['page'] = page
    print('Page {}...'.format(page))
    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
    mention_containers = soup.find_all('div', class_="result contribution")
    if not mention_containers:
        print('Empty container!')
    for mention in mention_containers:
        topic = mention.div.span.text
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            house = "House of Lords"
        elif house == "Commons Portcullis":
            house = "House of Commons"
        else:
            house = "N/A"
        name = mention.find('div', class_="secondaryTitle").text
        date = mention.find('div', class_="").get_text(strip=True)
        all_data.append({'Date': date, 'House': house, 'Speaker': name, 'Topic': topic})

df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]
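If you also want the CSV file your original script produced, the standard pandas call works on the assembled dataframe:
df.to_csv('hansard.csv', index=False, sep="#")  # same filename and separator as in the question's script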
I'm trying to read in a table using read_html
import requests
import pandas as pd
import numpy as np
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
resp = requests.get(url)
tables = pd.read_html(resp.text)
But I get this error:
IndexError: list index out of range
Other Wikipedia pages work fine. What's wrong with this page, and how do I solve the above error?
It seems the table can't be read because of the jQuery table sorter.
When you're dealing with jQuery rather than plain HTML, it's easier to read tables into a dataframe with the selenium library. You'll still need to do some cleanup, but this will get the table into a df.
You'll also need to install the selenium library and download a web browser driver.
import pandas as pd
from selenium import webdriver

driver_path = r'C:\chromedriver_win32\chromedriver.exe'
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
driver = webdriver.Chrome(driver_path)
driver.get(url)
# The target table is nested inside the page's content div
the_table = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr/td[2]/table')
data = the_table.text
df = pd.DataFrame([x.split() for x in data.split('\n')])
driver.close()
print(df)
0 1 2 3 4 5 \
0 Country (or dependent territory, None None
1 subnational area, etc.) Region Subregion Rate
2 listed Source None None None None
3 None None None None None None
4 Burundi Africa Eastern Africa 6.02 635
5 Comoros Africa Eastern Africa 7.70 60
6 Djibouti Africa Eastern Africa 6.48 60
7 Eritrea Africa Eastern Africa 8.04 390
8 Ethiopia Africa Eastern Africa 7.56 7,552
9 Kenya Africa Eastern Africa 5.00 2,466
10 Madagascar Africa Eastern Africa 7.69 1,863
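As a hedged alternative to splitting the element's text (which, as the output above shows, scatters multi-word country names across columns), you could hand the table's outerHTML straight to pandas. A sketch, assuming the same XPath still matches:
from io import StringIO
import pandas as pd

html = the_table.get_attribute('outerHTML')  # the_table from the snippet above
df = pd.read_html(StringIO(html))[0]         # parse the real <table> markup instead of whitespace-splitting
print(df.head())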