I would like to automatically download the table from this link:
https://www.barchart.com/options/iv-rank-percentile/stocks
In order to do this, with the help of some tutorials, I wrote this piece of code:
# Import libraries
from urllib.request import Request, urlopen
import requests
from bs4 import BeautifulSoup as soup
# Set the URL you want to webscrape from
url = 'https://www.barchart.com/options/iv-rank-percentile/stocks?viewName=main'
# Connect to the URL
response = requests.get(url)
print(response)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
print(req)
# Parse HTML and save to BeautifulSoup object
page_soup = soup(webpage, "html.parser")
#print(page_soup)
containers = page_soup.findAll("a", "toolbar-button download")
for container in containers:
    print(container)
    url = container.get('href')
    print(url)
The results I printed are the following:
<Response [403]>
<urllib.request.Request object at 0x030766F0>
<a class="toolbar-button download" data-bc-download-button=" Stocks IV Rank and IV Percentile "> <i class="bc-glyph-download"></i> <span>download</span></a>
None
It seems that I'm not able to find the 'href'.
At this point I'm struggling a bit with the next steps, because I don't know how to download the file (since the 'href' can't be found).
Maybe someone can help, or propose another solution?
Thanks a lot in advance,
Marketwizard
The data is loaded dynamically via JavaScript from a different URL. You can use this example to see how to load the data:
import json
import requests
from urllib.parse import unquote
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
}
url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?list=options.mostActive.us&fields=symbol,symbolName,lastPrice,priceChange,percentChange,optionsTotalVolume,optionsWeightedImpliedVolatility,optionsImpliedVolatilityRank1y,optionsImpliedVolatilityPercentile1y,optionsWeightedImpliedVolatilityHigh1y,tradeTime,symbolCode,symbolType,hasOptions&between(lastPrice,.10,)=&between(tradeTime,2021-03-22,2021-03-23)=&orderBy=optionsTotalVolume&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1"
with requests.Session() as s:
    # get all cookies
    s.get(
        "https://www.barchart.com/options/iv-rank-percentile/stocks",
        headers=headers,
    )
    # use one cookie as HTTP header
    headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
    data = s.get(url, headers=headers).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]:
        print("{:<8}{:<50}{}".format(d["symbol"], d["symbolName"], d["lastPrice"]))
Prints:
AAPL Apple Inc 123.39
TSLA Tesla Inc 670.00
FB Facebook Inc 293.54
AMC Amc Entertainment Holdings Inc 12.49
PLTR Palantir Technologies Inc Cl A 24.22
NIO Nio Inc 42.94
AMD Adv Micro Devices 80.30
F Ford Motor Company 12.85
SNDL Sundial Growers Inc 1.3000
BAC Bank of America Corp 37.66
MSFT Microsoft Corp 235.99
BABA Alibaba Group Holding 237.12
BA Boeing Company 251.23
GE General Electric Company 13.13
AAL American Airlines Gp 23.83
DKNG Draftkings Inc 71.72
WFC Wells Fargo & Company 38.97
AMZN Amazon.com Inc 3,110.87
GM General Motors Company 58.10
INTC Intel Corp 65.63
GME Gamestop Corp 194.49
SNAP Snap Inc 58.16
SOS Sos Ltd 6.90
PFE Pfizer Inc 36.00
NOK Nokia Corp 4.06
T AT&T Inc 29.99
CCL Carnival Corp 27.48
NVDA Nvidia Corp 527.11
MARA Marathon Digital Hldgs Inc 39.97
FTCH Farfetch Ltd Cl A 62.00
UBER Uber Technologies Inc 55.69
TLRY Tilray Inc 23.90
DIS Walt Disney Company 192.86
FCEL Fuelcell Energy Inc 15.04
QS Quantumscape Corp 64.29
SQ Square 226.13
CCIV Churchill Capital IV Cl A 26.15
V Visa Inc 208.00
CSCO Cisco Systems Inc 50.30
XOM Exxon Mobil Corp 55.91
FCX Freeport-Mcmoran Inc 35.01
JPM JP Morgan Chase & Company 150.97
PLUG Plug Power Inc 38.91
NFLX Netflix Inc 523.11
VALE Vale S.A. 17.01
TEVA Teva Pharmaceutical Industries Ltd 11.93
CLF Cleveland-Cliffs Inc 15.86
MU Micron Technology 91.27
BOX Box Inc 23.65
TSM Taiwan Semiconductor Manufacturing 117.18
RIOT Riot Blockchain Inc 56.01
BLNK Blink Charging Company 40.66
VZ Verizon Communications Inc 56.59
UAL United Airlines Holdings Inc 58.33
QCOM Qualcomm Inc 134.09
CLVS Clovis Oncology Inc 7.47
RLX Rlx Technology Inc ADR 10.15
LUMN Centurylink 14.37
WMT Wal-Mart Stores 132.37
TWTR Twitter Inc 65.21
NCLH Norwegian Cruise Ord 28.65
GOOGL Alphabet Cl A 2,030.69
C Citigroup Inc 71.96
JD Jd.com Inc Ads 84.97
BB Blackberry Ltd 10.71
X United States Steel Corp 21.79
RKT Rocket Companies Inc Cl A 22.99
PDD Pinduoduo Inc ADR 137.15
NLY Annaly Capital Management Inc 8.92
FUBO Fubotv Inc 31.53
MO Altria Group 51.64
DASH Doordash Inc Cl A 135.91
UWMC Uwm Hldg Corp 8.78
KSS Kohl's Corp 58.74
DAL Delta Air Lines Inc 47.97
NKLA Nikola Corp 15.55
LYFT Lyft Inc Cl A 64.13
WKHS Workhorse Grp 15.63
PENN Penn Natl Gaming Inc 113.16
CRM Salesforce.com Inc 215.17
XPEV Xpeng Inc ADR 37.88
BCRX Biocryst Pharma Inc 11.80
ET Energy Transfer LP 8.10
PTON Peloton Interactive Inc 109.54
BIDU Baidu Inc 266.13
NKE Nike Inc 138.27
PSTH Pershing Square Tontine Hldgs Cl A 25.89
ACB Aurora Cannabis Inc 9.70
PYPL Paypal Holdings 244.38
TME Tencent Music Entertainment Group ADR 30.87
CAN Canaan Inc ADR 22.97
GOLD Barrick Gold Corp 20.62
SPCE Virgin Galactic Holdings Inc 32.24
ZM Zoom Video Communications Cl A 328.50
NNDM Nano Dimension Ads 9.83
CVX Chevron Corp 102.54
SPRT Support.com Inc 7.10
OXY Occidental Petroleum Corp 27.46
COST Costco Wholesale 334.49
USAT USA Technologies Inc 12.45
Thank you for providing this example!
It's working; however, I have two questions:
Where does this URL come from:
url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?list=options.mostActive.us&fields=symbol,symbolName,lastPrice,priceChange,percentChange,optionsTotalVolume,optionsWeightedImpliedVolatility,optionsImpliedVolatilityRank1y,optionsImpliedVolatilityPercentile1y,optionsWeightedImpliedVolatilityHigh1y,tradeTime,symbolCode,symbolType,hasOptions&between(lastPrice,.10,)=&between(tradeTime,2021-03-22,2021-03-23)=&orderBy=optionsTotalVolume&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1"
Under this link https://www.barchart.com/options/iv-rank-percentile/stocks?viewName=main there are several pages of results (5 pages with 100 results each), and there is the option to click the "show all" button (to show all 500 results). In the example you provided, only the content of the first page is read. How can I get access to the full results?
Concerning the "download" button at the top right-hand side of the table: is it possible to "click" it to download the table?
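For the paging question, note that the API query string above already carries `page` and `limit` parameters, so they can be rewritten instead of clicking through pages. A minimal sketch (assuming the endpoint honors the rewritten values; note that `urlencode` percent-encodes the `between(...)` keys, which the API may or may not require):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_paging(api_url: str, page: int, limit: int) -> str:
    """Return a copy of the API URL with its page/limit parameters replaced."""
    parts = urlparse(api_url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params["page"] = [str(page)]
    params["limit"] = [str(limit)]
    return urlunparse(parts._replace(query=urlencode(params, doseq=True)))

# e.g. loop page=1..5 (or request limit=500 once) with the same session and
# headers as above; since the response is JSON, you can write data["data"]
# out with csv.DictWriter instead of clicking the site's download button.
```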
The first page is loaded and parsed as expected, but after clicking on Next page, BS4 does not get the new page from driver.page_source.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

def parse_html(pagesource, count):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tables = soup.findChildren('table')
    # This will get the first (and only) table. Your page may have more.
    my_table = tables[0]
    table_body = my_table.find('tbody')
    all_rows = table_body.find_all('tr')
    # print(all_rows[0])
    for row in all_rows:
        print(count)
        count += 1
        try:
            path_body = row.find("td", class_="views-field-company-name")
            path = path_body.find("a")['href']
            company_name = path_body.find("a").text
            company_name = company_name.strip()
            print(company_name)
            issue_datetime = row.find("td", class_="views-field-field-letter-issue-datetime")
            # print(type(issue_datetime.find("time")['datetime']))
            issue_recepient_office = row.find("td", class_="views-field-field-building").string
            issue_recepient_office = issue_recepient_office.strip()
            # print(issue_recepient_office)
            detailed_description = row.find("td", class_="views-field-field-detailed-description-2").string
            if detailed_description:
                detailed_description = detailed_description.strip()
            else:
                detailed_description = ""
            # print(detailed_description)
        except:
            pass

url = 'https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/warning-letters'
driver.get(url)
count = 1
parse_html(driver.page_source, count)
for i in range(0, 3):
    time.sleep(10)
    # print(driver.page_source.encode('utf-8'))
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#datatable_next a'))).click()
    time.sleep(30)
    parse_html(driver.page_source, count)
driver.quit()
Output:
1
Ruth Special Food Store LLC
Foreign Supplier Verification Program (FSVP)
2
EarthLab, Inc., dba Wise Woman Herbals
3
Big Olaf Creamery LLC dba Big Olaf
CGMP/Food/Prepared, Packed or Held Under Insanitary Conditions/Adulterated/L. monocytogenes
4
Bainbridge Beverage West, LLC
Juice HACCP/CGMP for Foods/Adulterated/Insanitary Conditions
5
VapeL1FE, LLC
Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded
6
Mike Millenkamp Dairy Cattle
7
Empowered Diagnostics LLC
Unapproved Products Related to the Coronavirus Disease 2019 (COVID-19)
8
RoyalVibe Health Ltd.
CGMP/QSR/Medical Devices/PMA/Adulterated/Misbranded
9
Land View, Inc.
CGMP/Medicated Feeds/Adulterated
10
Green Pharmaceuticals Inc.
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
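Two details in the code above are worth noting: parse_html ignores its pagesource argument and re-reads driver.page_source, and count is a local copy that is never returned, which is why the counter restarts at 1 on every page in the output. A minimal sketch of a corrected helper (same table layout assumed):

```python
from bs4 import BeautifulSoup

def parse_html(pagesource, count):
    """Parse the HTML that was passed in (not driver.page_source)
    and return the updated counter so the caller can keep it."""
    page = BeautifulSoup(pagesource, 'html.parser')
    table_body = page.find('table').find('tbody')
    for row in table_body.find_all('tr'):
        print(count)
        count += 1
        # ... per-row extraction as in the original code ...
    return count

# caller: count = parse_html(driver.page_source, count) after each click
```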
A non-Selenium solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd

PAGE_LENGHT = 50

def get_letters(page: int):
    start = page * PAGE_LENGHT
    url = f"https://www.fda.gov/datatables/views/ajax?field_letter_issue_datetime=All&field_change_date_2=All&draw={page}&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=2&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=3&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=4&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=5&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=6&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=7&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=false&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&start={start}&length={PAGE_LENGHT}&search%5Bregex%5D=false&_drupal_ajax=1&_wrapper_format=drupal_ajax&view_base_path=inspections-compliance-enforcement-and-criminal-investigations%2Fcompliance-actions-and-activities%2Fwarning-letters%2Fdatatables-data&view_display_id=warning_letter_solr_block&view_dom_id=4605f153788b3a17043d0e031eb733846503177581602cd9fd58ecd78629801b&view_name=warning_letter_solr_index&view_path=%2Finspections-compliance-enforcement-and-criminal-investigations%2Fcompliance-actions-and-activities%2Fwarning-letters&total_items=3433"
    letters = []
    for letter in requests.get(url).json()['data']:
        letters.append([BeautifulSoup(row, 'lxml').get_text(strip=True) for row in letter])
    return letters

result = []
for i in range(0, 5):
    result += get_letters(i)

df = pd.DataFrame(result)
print(df)
OUTPUT:
0 1 2 ... 5 6 7
0 12/27/2022 11/07/2022 Land View, Inc. ...
1 12/27/2022 11/22/2022 MD Pharmaceutical Supply, LLC ...
2 12/27/2022 06/01/2022 Supreme Fruit Produce, Inc. ...
3 12/27/2022 10/06/2022 Empowered Diagnostics LLC ...
4 12/27/2022 11/18/2022 RoyalVibe Health Ltd. ...
.. ... ... ... ... .. .. ..
245 08/11/2022 08/11/2022 The Juice Bar ...
246 08/09/2022 06/16/2022 InfuTronix LLC ...
247 08/09/2022 07/12/2022 Zyno Medical LLC ...
248 08/09/2022 07/28/2022 Vitti Labs, LLC ...
249 08/09/2022 07/22/2022 Muscle Feast, LLC ...
UPDATE
To find the request, use dev tools (F12 by default in Chrome).
Now we need to figure out how to work with this data; it is plain HTML text, and bs4 will help us with that. If you need the link, you can change letters.append to:
letters.append({
    'Posted Date': BeautifulSoup(letter[0], 'lxml').get_text(strip=True),
    'Letter Issue Date': BeautifulSoup(letter[1], 'lxml').get_text(strip=True),
    'Company Name': BeautifulSoup(letter[2], 'lxml').get_text(strip=True),
    'Issuing Office': BeautifulSoup(letter[3], 'lxml').get_text(strip=True),
    'Subject': BeautifulSoup(letter[4], 'lxml').get_text(strip=True),
    'Link': 'https://www.fda.gov/' + BeautifulSoup(letter[2], 'lxml').find('a').get('href'),
})
And new output looks like:
Posted Date Letter Issue Date Company Name Issuing Office Subject Link
0 12/27/2022 11/07/2022 Land View, Inc. Division of Human and Animal Food Operations West VI CGMP/Medicated Feeds/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/land-view-inc-638704-11072022
1 12/27/2022 11/22/2022 MD Pharmaceutical Supply, LLC Division of Pharmaceutical Quality Operations I CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/md-pharmaceutical-supply-llc-637815-11222022
2 12/27/2022 06/01/2022 Supreme Fruit Produce, Inc. Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/supreme-fruit-produce-inc-631972-06012022
3 12/27/2022 10/06/2022 Empowered Diagnostics LLC Center for Devices and Radiological Health Unapproved Products Related to the Coronavirus Disease 2019 (COVID-19) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/empowered-diagnostics-llc-638164-10062022
4 12/27/2022 11/18/2022 RoyalVibe Health Ltd. Center for Devices and Radiological Health CGMP/QSR/Medical Devices/PMA/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/royalvibe-health-ltd-639553-11182022
5 12/27/2022 11/28/2022 Bainbridge Beverage West, LLC Division of Human and Animal Food Operations West V Juice HACCP/CGMP for Foods/Adulterated/Insanitary Conditions https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/bainbridge-beverage-west-llc-638942-11282022
6 12/27/2022 12/16/2022 Green Pharmaceuticals Inc. Division of Pharmaceutical Quality Operations IV Drug Product/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/green-pharmaceuticals-inc-635162-12162022
7 12/27/2022 12/16/2022 VapeL1FE, LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapel1fe-llc-648624-12162022
8 12/27/2022 12/09/2022 Ruth Special Food Store LLC Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/ruth-special-food-store-llc-644551-12092022
9 12/27/2022 11/28/2022 Mike Millenkamp Dairy Cattle Division of Human and Animal Food Operations West II New Animal Drug/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/mike-millenkamp-dairy-cattle-640782-11282022
10 12/27/2022 11/10/2022 EarthLab, Inc., dba Wise Woman Herbals Division of Human and Animal Food Operations West VI CGMP/Dietary Supplement/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/earthlab-inc-dba-wise-woman-herbals-634872-11102022
11 12/27/2022 12/09/2022 Big Olaf Creamery LLC dba Big Olaf Division of Human and Animal Food Operations East IV CGMP/Food/Prepared, Packed or Held Under Insanitary Conditions/Adulterated/L. monocytogenes https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/big-olaf-creamery-llc-dba-big-olaf-642758-12092022
12 12/22/2022 12/22/2022 BS Vapes LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/bs-vapes-llc-647308-12222022
13 12/22/2022 12/22/2022 JP & SN Enterprises Inc. d/b/a eCigs International Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jp-sn-enterprises-inc-dba-ecigs-international-647315-12222022
14 12/20/2022 11/08/2022 Dollar Tree, Inc. Office of Human and Animal Food Operations – West Division 3 Interstate Commerce/Food/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/dollar-tree-inc-629509-11082022
15 12/20/2022 07/27/2022 Sagent Pharmaceuticals, Inc. Division Pharmaceutical Quality Operations I CGMP/Drugs/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sagent-pharmaceuticals-inc-636636-07272022
16 12/20/2022 11/21/2022 Nature’s Way Farms, LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/natures-way-farms-llc-641201-11212022
17 12/20/2022 12/08/2022 Nortec Quimica SA Center for Drug Evaluation and Research | CDER CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/nortec-quimica-sa-639894-12082022
18 12/20/2022 11/30/2022 CHS Inc./CHS River Plains Division of Human and Animal Food Operations West I CGMP/Medicated Feeds/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/chs-incchs-river-plains-642790-11302022
19 12/20/2022 12/02/2022 DuPont Nutrition USA Inc. Division of Pharmaceutical Quality Operations I CGMP/Drug Products/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/dupont-nutrition-usa-inc-627211-12022022
20 12/20/2022 11/01/2022 Del Valle Import Corp. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/del-valle-import-corp-642784-11012022
21 12/20/2022 08/25/2022 Sree Nidhi Corp Center for Food Safety and Applied Nutrition (CFSAN) Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sree-nidhi-corp-634266-08252022
22 12/20/2022 12/14/2022 Adarsh Daswani, M.D. Center for Drug Evaluation and Research | CDER Clinical Investigator https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/adarsh-daswani-md-648606-12142022
23 12/15/2022 12/15/2022 Vape King Inc. Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vape-king-inc-646625-12152022
24 12/15/2022 12/15/2022 Vapor E-Cigarette, L.L.C. Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-e-cigarette-llc-646876-12152022
25 12/13/2022 12/02/2022 SV3, LLC d/b/a Mi-One Brands Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sv3-llc-dba-mi-one-brands-647624-12022022
26 12/13/2022 12/07/2022 Centrient Pharmaceuticals India Private Limited Center for Drug Evaluation and Research | CDER CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/centrient-pharmaceuticals-india-private-limited-640196-12072022
27 12/13/2022 11/22/2022 Cecilia Alvarez Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cecilia-alvarez-643706-11222022
28 12/13/2022 11/29/2022 Gobwa Exotic Imports Inc. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/gobwa-exotic-imports-inc-641031-11292022
29 12/13/2022 12/05/2022 Thriftmaster Texas, LLC. d/b/a ThriftMaster Global Holdings, Inc. and TM Global Biosciences, LLC Center for Drug Evaluation and Research | CDER Finished Pharmaceuticals/Unapproved New Drug/Misbranded/Adulterated Human Foods https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/thriftmaster-texas-llc-dba-thriftmaster-global-holdings-inc-and-tm-global-biosciences-llc-641057
30 12/13/2022 11/21/2022 Euphoria Fancy Food Inc. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/euphoria-fancy-food-inc-641801-11212022
31 12/08/2022 12/08/2022 Cloud House Vape Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cloud-house-vape-647544-12082022
32 12/08/2022 12/08/2022 Vapors of Ohio Inc d/b/a Nostalgic Vapes Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapors-ohio-inc-dba-nostalgic-vapes-644739-12082022
33 12/06/2022 11/28/2022 AG Hair Limited Center for Drug Evaluation and Research | CDER CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/ag-hair-limited-638646-11282022
34 12/06/2022 11/22/2022 Glenmark Pharmaceuticals Limited Center for Drug Evaluation and Research | CDER CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/glenmark-pharmaceuticals-limited-637314-11222022
35 12/06/2022 09/23/2022 Saffron USA LLC Division of Human and Animal Food Operations East IV Unapproved New Drugs/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/saffron-usa-llc-629821-09232022
36 12/06/2022 10/24/2022 Cryos International USA LLC Division of Biological Products Operations I Deviations/CFR/Regulations for Human Cells, Tissues & Cellular Products (HCT/Ps) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cryos-international-usa-llc-639696-10242022
37 12/06/2022 10/17/2022 Zuland Distributor Corp Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/zuland-distributor-corp-638899-10172022
38 12/06/2022 11/07/2022 Manzela USA, LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/manzela-usa-llc-642268-11072022
39 12/06/2022 11/07/2022 Maliba African Market Corp. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/maliba-african-market-corp-642698-11072022
40 12/06/2022 11/30/2022 Kari Gran Inc. Division of Pharmaceutical Quality Operations IV CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/kari-gran-inc-640035-11302022
41 12/01/2022 12/01/2022 Vapor Candy Inc d/b/a The Vape Stop Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-candy-inc-dba-vape-stop-645475-12012022
42 11/30/2022 11/30/2022 Jayde's Vapor Lounge Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jaydes-vapor-lounge-645085-11302022
43 11/29/2022 11/10/2022 Vapor Plus OK LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-plus-ok-llc-646225-11102022
44 11/29/2022 11/18/2022 "David M. Lubeck, M.D./Arbor Centers for EyeCare Center for Drug Evaluation and Research | CDER Clinical Investigator (Sponsor) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/david-m-lubeck-mdarbor-centers-eyecare-643531-11182022
45 11/29/2022 06/01/2022 Jam Jam Services, Inc. Division of Southeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jam-jam-services-inc-630847-06012022
46 11/29/2022 09/19/2022 La Serranita Import and Export LLC Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/la-serranita-import-and-export-llc-633743-09192022
47 11/29/2022 11/09/2022 J R Imports LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/j-r-imports-llc-643214-11092022
48 11/29/2022 09/01/2022 Shuzy Rock Inc. Division of Pharmaceutical Quality Operations I CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/shuzy-rock-inc-630110-09012022
49 11/22/2022 10/19/2022 Pepe’s Foods Inc. Division of West Coast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/pepes-foods-inc-640716-10192022
50 11/22/2022 11/14/2022 yourtramadol.com Center for Drug Evaluation and Research | CDER Finished Pharmaceuticals/Unapproved New Drug/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/yourtramadolcom-639959-11142022
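Note that the range(0, 5) loop above only fetches the first 250 letters. The query string already carries total_items=3433, so the number of requests can be derived instead of hard-coded; a small sketch, assuming 50 rows per request as above:

```python
import math

def page_count(total_items: int, page_length: int = 50) -> int:
    """Number of paged requests needed to cover all rows."""
    return math.ceil(total_items / page_length)

# e.g. for i in range(page_count(3433)):
#          result += get_letters(i)
```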
In particular, I am trying to scrape this web site.
I would like to set the rows-per-page menu button to "50" rows per page:
My current code is the following:
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@class='btn btn-default dropdown-toggle']")))).select_by_visible_text('50')
Where is my mistake? Can you help me?
Thank you in advance for your time!
This should work for your case:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

s = Service(GeckoDriverManager().install())
driver = webdriver.Firefox(service=s)
driver.get('https://whalewisdom.com/filer/fisher-asset-management-llc#tabholdings_tab_link')
button = driver.find_element(By.CSS_SELECTOR, '.btn-group.dropdown')
button.click()
element = driver.find_element(By.XPATH, '//li[@role="menuitem"]/a[contains(text(), "50")]')
element.click()
You can try this easier code which doesn't need Selenium but rather directly makes a call to the data API of the site with requests.
Please note the argument limit at the end of the query string that sets the limit to 50 rows, as you want. If you want to scrape the next 50 items just increase the offset to 50 then 100, 150, etc. This will get you all the available data.
import requests
import pandas as pd
import json
url = "https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking=&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc&offset=0&limit=50"
raw = requests.get(url)
data = json.loads(raw.content)
df = pd.DataFrame(data["rows"])
df.head()
Print out:
symbol permalink security_type name sector industry current_shares previous_shares shares_change position_change_type ... percent_ownership quarter_first_owned quarter_id_owned source_type source_date filing_date avg_price recent_price quarter_end_price id
0 AAPL aapl SH Apple Inc INFORMATION TECHNOLOGY COMPUTERS & PERIPHERALS 8.909234e+08 8.871356e+08 3787856.0 addition ... 5.5045625 Q1 2016 61 13F 2022-03-31 2022-05-16 36.6604 160.01 174.61 None
1 BAC bac SH Bank of America Corp. (North Carolina National... FINANCE BANKS 1.010101e+09 1.010101e+09 0.0 None ... 12.5371165 Q3 2017 67 13F 2022-03-31 2022-05-16 25.5185 33.04 41.22 None
2 AXP axp SH American Express Co FINANCE CONSUMER FINANCE 1.516107e+08 1.516107e+08 0.0 None ... 20.1326115 Q1 2001 1 13F 2022-03-31 2022-05-16 39.3110 151.60 187.00 None
3 CVX cvx SH Chevron Corp. (Standard Oil of California) ENERGY INTEGRATED OIL & GAS 1.591781e+08 3.824504e+07 120933081.0 addition ... 8.1014366 Q4 2020 80 13F 2022-03-31 2022-05-16 125.3424 159.14 162.83 None
4 KO ko SH Coca Cola Co. CONSUMER STAPLES
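To page through the full holdings list, the offset parameter can be advanced in steps of limit until the API returns fewer rows than requested. A sketch (the query string here is trimmed relative to the full URL above, and the "rows" key is assumed from the response shown; both are untested assumptions):

```python
# Template for one page of the holdings API; offset/limit are filled in per page.
BASE = ("https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc"
        "&q1=-1&type_filter=1,2,3,4&sc=true&sort=current_mv&order=desc"
        "&offset={offset}&limit={limit}")

def page_url(page: int, limit: int = 50) -> str:
    """URL for the Nth page of results (page 0 starts at offset 0)."""
    return BASE.format(offset=page * limit, limit=limit)

# Sketch of the fetch loop (needs network access, so not run here):
# import requests
# rows = []
# for page in range(200):
#     batch = requests.get(page_url(page)).json()["rows"]
#     rows.extend(batch)
#     if len(batch) < 50:   # short page means we reached the end
#         break
```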
I am trying to scrape this website, and this is my code thus far:
import click
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
table_rows = []
url = 'https://www.iagco.agco.ca/prod/pub/en/Default.aspx?PossePresentation=PublicNoticeSearch'
driver = webdriver.Chrome('/Applications/Python 3.9/chromedriver')
driver.get(url)
driver.find_element_by_xpath("/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[5]/div/table/tbody/tr[5]/td[3]/span/select/option[2]").click()
driver.implicitly_wait(1)
driver.find_element_by_xpath("/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[6]/div/table/tbody/tr/td/div/a").click()
driver.implicitly_wait(1)
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
driver.implicitly_wait(2)
for table in driver.find_elements_by_xpath('/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[5]/div/table/tbody/tr/td'):
    data = [item.text for item in table.find_elements_by_xpath('/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[5]/div/table/tbody/tr/td')]
    print(data)
What prints out is a long list, and I am trying to figure out how to get it into a table format.
['City Premises Deadline for Objections / Submissions File Number Application Type Areas\nBRAMPTON Tweed\n10010 MCLAUGHLIN RD N\nBRAMPTON, ON L7A2X6 2021-09-24 1248893 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nBRAMPTON Tweed\n1990 STEELES AVE W\nBRAMPTON, ON L6Y0R4 2021-09-24 1250690 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nCAMPBELLVILLE Welcome Cannabis Campbellville\n6 MAIN ST N\nCAMPBELLVILLE, ON L0P1B0 2021-09-29 1272273 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nCANNINGTON Bud Runners Cannabis\n17 CAMERON ST W.\nCANNINGTON, ON L0E 1E0 2021-09-23 1271708 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nETOBICOKE Cannabis 151\n188 THE QUEENSWAY\nETOBICOKE, ON M8Y1J3 2021-09-20 1157846 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nETOBICOKE Fire & Flower Cannabis Co.\n764 THE QUEENSWAY\nETOBICOKE, ON M8Z0E8 2021-09-30 1211417 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nGUELPH Tweed\n138 COLLEGE AVE W UNIT A\nGUELPH, ON N1G1S4 2021-09-28 1267278 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nHAMILTON HARVEST CANNABIS CO\n318 QUEENSTON RD UNIT G\nHAMILTON, ON L8K1H5 2021-09-23 1283383 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nHAMILTON Lady Leaf\n372 KING ST E SUITE 101\nHAMILTON, ON L8N1C3 2021-09-25 1188839 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nHAVELOCK Sunfish Cannabis Trainstation\n30 ONTARIO STREET, EAST UNIT\nHAVELOCK, ON K0L1Z0 2021-10-02 1285465 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nKITCHENER THE CANNABIST SHOP – KING E\n325 KING ST E\nKITCHENER, ON N2G2L2 
2021-10-04 1297162 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nMORRISBURG Morrisburg cannabis\n137 MAIN STREET\nMORRISBURG, ON K0C1X0 2021-10-01 1196780 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nMORRISBURG, SOUTH DUNDAS The Oz Store\n147 MAIN ST., UNIT #2\nMORRISBURG, SOUTH DUNDAS, ON K0C 1X0 2021-09-30 1190679 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nMORRISTON Welcome Cannabis\n3 BADENOCH ST\nMORRISTON, ON N0B 2C0 2021-10-01 1256018 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nNORTH YORK Taste Buds Cannabis\n1193 LAWRENCE AVE W\nNORTH YORK, ON M6A1E2 2021-09-25 1266788 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nOTTAWA BlueBird Cannabis Co\n27 YORK ST\nOTTAWA, ON K1N5S7 2021-10-03 1221838 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nOTTAWA Planet Earth Cannabis\n1666 BANK ST SUITE 600\nOTTAWA, ON K1V7Y6 2021-09-27 1215970 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nSTITTSVILLE SHINYBUD CANNABIS CO. 
STITTSVILLE\n1261 MAIN ST UNIT 2\nSTITTSVILLE, ON K2S2E4 2021-10-01 1167849 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nTORONTO BLACKSTAR CANNABIS SHUTER\n985 DOVERCOURT RD\nTORONTO, ON M6H2X6 2021-09-30 1277446 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nTORONTO Canna North Cannabis Store\n117 YONGE ST\nTORONTO, ON M5C1W4 2021-10-02 1001902 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nTORONTO Olive Jar\n554 ANNETTE ST\nTORONTO, ON M6S2C2 2021-10-02 1196618 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nWATERLOO The Cannabist Shop - Bridgeport W\n10 BRIDGEPORT RD W\nWATERLOO, ON N2L2Y1 2021-10-04 1294285 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nWELLINGTON The Community Store\n186 MAIN STREET UNIT 3\nWELLINGTON, ON K0K 3L0 2021-09-23 1252887 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission\nWINCHESTER BlueBird Cannabis Co - Winchester\n507 ST LAWRENCE ST LOWER LEVEL\nWINCHESTER, ON K0C2K0 2021-10-01 1299098 New Application\nCannabis Retail Store Authorization Indoor Area File Objection / Submission']
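If all you have is that single flattened string, one way to cut it back into one record per store is to split on the phrase that ends every entry. A sketch using a shortened, made-up excerpt of the output:

```python
# Hypothetical shortened excerpt of the flattened page text.
flat = ("BRAMPTON Tweed 2021-09-24 1248893 New Application "
        "Cannabis Retail Store Authorization Indoor Area File Objection / Submission "
        "GUELPH Tweed 2021-09-28 1267278 New Application "
        "Cannabis Retail Store Authorization Indoor Area File Objection / Submission")

# Every record ends with the same phrase, so use it as a delimiter.
marker = "File Objection / Submission"
records = [part.strip() for part in flat.split(marker) if part.strip()]
```

Each element of `records` is then one store's row, which can be split further into fields.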
I'm getting the following output in table format.
Code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

data = []
driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
url = 'https://www.iagco.agco.ca/prod/pub/en/Default.aspx?PossePresentation=PublicNoticeSearch'
driver.get(url)
time.sleep(8)
driver.find_element_by_xpath(
    "/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[5]/div/table/tbody/tr[5]/td[3]/span/select/option[2]").click()
driver.implicitly_wait(1)
driver.find_element_by_xpath(
    "/html/body/div[1]/form/div[3]/div[2]/div/div/div/div[6]/div/table/tbody/tr/td/div/a").click()
driver.implicitly_wait(1)
soup = BeautifulSoup(driver.page_source, 'lxml')
t = soup.find('table', class_='possegrid')
# Get all the rows from the table
trs = t.select('tr')
for tr in trs:
    # materialise the generator, otherwise the DataFrame sees exhausted iterators
    data.append(list(tr.stripped_strings))
df = pd.DataFrame(data)
print(df)
Output:
0 City Premises ... Indoor Area File Objection / Submission
1 City Premises ... None None
2 BRAMPTON Tweed ... None None
3 BRAMPTON Tweed ... None None
4 CAMPBELLVILLE Welcome Cannabis Campbellville ... None None
5 CANNINGTON Bud Runners Cannabis ... None None
6 ETOBICOKE Cannabis 151 ... None None
7 ETOBICOKE Fire & Flower Cannabis Co. ... None None
8 GUELPH Tweed ... None None
9 HAMILTON HARVEST CANNABIS CO ... None None
10 HAMILTON Lady Leaf ... None None
11 HAVELOCK Sunfish Cannabis Trainstation ... None None
12 KITCHENER THE CANNABIST SHOP – KING E ... None None
13 MORRISBURG Morrisburg cannabis ... None None
14 MORRISBURG, SOUTH DUNDAS The Oz Store ... None None
15 MORRISTON Welcome Cannabis ... None None
16 NORTH YORK Taste Buds Cannabis ... None None
17 OTTAWA BlueBird Cannabis Co ... None None
18 OTTAWA Planet Earth Cannabis ... None None
19 STITTSVILLE SHINYBUD CANNABIS CO. STITTSVILLE ... None None
20 TORONTO BLACKSTAR CANNABIS SHUTER ... None None
21 TORONTO Canna North Cannabis Store ... None None
22 TORONTO Olive Jar ... None None
23 WATERLOO The Cannabist Shop - Bridgeport W ... None None
24 WELLINGTON The Community Store ... None None
25 WINCHESTER BlueBird Cannabis Co - Winchester ... None None
[26 rows x 246 columns]
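The 246 columns likely come from `stripped_strings`, which yields a different number of strings per row; selecting the `th`/`td` cells explicitly gives every row a consistent width. A minimal sketch on a hand-written table (the class name matches the site, the data is made up):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the site's table structure.
html = """
<table class="possegrid">
  <tr><th>City</th><th>Premises</th></tr>
  <tr><td>BRAMPTON</td><td>Tweed</td></tr>
  <tr><td>GUELPH</td><td>Tweed</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One list per row, one string per cell.
rows = [[cell.get_text(strip=True) for cell in tr.select("th,td")]
        for tr in soup.select("tr")]
```

Feeding `rows` to `pd.DataFrame` then produces one column per cell rather than per stray string.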
I have been trying to parse out a list from TradingView for quite a while and have tried everything. This is my code so far:
import time
from selenium import webdriver

WEBDRIVER_PATH = 'chromedriver.exe'
driver = webdriver.Chrome(WEBDRIVER_PATH)
URL = 'https://tradingview.com/markets/stocks-usa/market-movers-most-volatile/'
driver.get(URL)
print(driver.title)
# waiting for the data to be loaded
time.sleep(5)
stocks = []
for result in driver.find_elements_by_xpath('//*[@id="js-screener-container"]/div/table/tbody/tr'):
    stock = result.find_element_by_xpath('//*[@id="js-screener-container"]/div/table/tbody/tr/td/div/div/span[2]').text
    stocks.append({'stock': stock})
print(stocks)
The problem is that it only repeats the first item in the list the correct number of times. I've seen plenty of cases where this is solved by rewriting this part to start with a dot ".":
stock = result.find_element_by_xpath('//*[@id="js-screener-container"]/div/table/tbody/tr/td/div/div/span[2]').text
So it looks like this:
stock = result.find_element_by_xpath('.//*[@id="js-screener-container"]/div/table/tbody/tr/td/div/div/span[2]').text
But that breaks the code and gives me this error instead:
stock = result.find_element_by_xpath('.//*[@id="js-screener-container"]/div/table/tbody/tr/td/div/div/span[2]').text
File "C:\Python\lib\site-packages\selenium\webdriver\remote\webelement.py", line 351, in find_element_by_xpath
return self.find_element(by=By.XPATH, value=xpath)
File "C:\Python\lib\site-packages\selenium\webdriver\remote\webelement.py", line 659, in find_element
{"using": by, "value": value})['value']
File "C:\Python\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "C:\Python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//*[@id="js-screener-container"]/div/table/tbody/tr/td/div/div/span[2]"}
(Session info: chrome=91.0.4472.77)
Process finished with exit code 1
Can someone help me go further?
Kind regards
Try this
for result in driver.find_elements_by_xpath('//*[@id="js-screener-container"]/div/table/tbody/tr'):
    stock = result.find_element_by_xpath('.//td/div/div/span[2]').text
    stocks.append({'stock': stock})
print(stocks)
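For what it's worth, the `.//*[@id=...]` version fails because the `id` lives on an ancestor *outside* each row, so nothing inside the row can match; a relative path has to start below the row, without the `id` predicate. A standalone ElementTree sketch of the same idea (Selenium uses the browser's XPath engine, but relative vs. absolute scoping behaves the same way):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the screener table's row structure.
tbody = ET.fromstring(
    "<tbody>"
    "<tr><td><div><div><span>1</span><span>AAPL</span></div></div></td></tr>"
    "<tr><td><div><div><span>2</span><span>MSFT</span></div></div></td></tr>"
    "</tbody>"
)

stocks = []
for row in tbody.findall("tr"):
    # relative search scoped to this row only: yields a different symbol per row
    stocks.append(row.find("./td/div/div/span[2]").text)
```

Searching each row with a path relative to that row is what makes the loop pick up a different element on every iteration.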
browser = webdriver.Edge('msedgedriver.exe')
browser.get('https://tradingview.com/markets/stocks-usa/market-movers-most-volatile/')
stocks = browser.find_element_by_css_selector('#js-screener-container > div.tv-screener__content-pane > table > tbody').text
stocks = stocks.split('\n')
stocks2 = stocks[::4]
stocks_list = []
for stock in stocks2:
    stocks_list.append(stock)  # append each symbol, not the whole list
this worked for me, hope it helps
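The `[::4]` slice above assumes each screener row contributes exactly four lines to the tbody's text. A sketch with dummy data showing what the slice picks out:

```python
# Hypothetical tbody text: four lines per row (symbol, name, price, change).
tbody_text = ("UVXY\nPROSHARES TRUST II\n12.34\n+5.6%\n"
              "GME\nGAMESTOP CORPORATION\n222.00\n-3.1%")

lines = tbody_text.split('\n')
symbols = lines[::4]  # every 4th line, starting from the first, is a ticker
```

If the site changes how many text lines a row renders, the stride needs to change with it, which is why the per-row XPath approach is more robust.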
You need to move to the elements to get all the names.
Code:
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://tradingview.com/markets/stocks-usa/market-movers-most-volatile/')
sleep(5)
stocks = []
for result in driver.find_elements_by_css_selector(".tv-screener__description"):
    ActionChains(driver).move_to_element(result).click().perform()
    print(result.text)
Output:
PROSHARES TRUST II ULTRA VIX SHT TRM FUTR ETF ETF
LARIMAR THERAPEUTICS INC
LIZHI INC SPON ADS EACH REP 20 ORD SHS CL A DR
PROTAGENIC THERAPEUTICS INC
NABRIVA THERAPEUTICS PLC
PIERIS PHARMACEUTICALS INC
KINTARA THERAPEUTICS INC
ASENSUS SURGICAL INC
PLANET GREEN HOLDINGS CORP
TAKUNG ART CO LTD
BUILD A BEAR WORKSHOP INC
OWENS & MINOR INC
INNSUITES HOSPITALITY TRUST TF
EXPRESS INC
ASSERTIO HOLDINGS INC
INTRUSION INC
EZGO TECHNOLOGIES LTD
CARLOTZ INC
UCLOUDLINK GROUP INC SPON ADS EACH REP 10 ORD SHS CL A DR
IT TECH PACKAGING INC
OHIO VALLEY BANCORP
ORIENTAL CULTURE HLDG LTD
VALLON PHARMACEUTICALS INC
LIPOCINE INC
PDS BIOTECHNOLOGY CORPORATION
PDL COMMUNITY BANCORP
ONDAS HOLDINGS INC
GSX TECHEDU INC SPON ADS EACH REP 0.666 ORD SHS CL A DR
AKERNA CORP
INTERCEPT PHARMACEUTICALS INC
TRAVERE THERAPEUTICS INC
MOXIAN INC
SEACOR MARINE HOLDINGS INC
GREENBOX POS
21VIANET GROUP INCORPORATED SPON ADS EACH REPR 6 ORD SHS A DR
ARCIMOTO INC
BAOSHENG MEDIA GROUP HLDGS LTD
LA JOLLA PHARMACEUTICAL CO
READING INTERNATIONAL INC
RENESOLA LTD SPON ADS EACH REP 10 ORD DR
AMC ENTERTAINMENT HOLDINGS INC
APTEVO THERAPEUTICS INC
SCOPUS BIOPHARMA INC
NOVAN INC
INSPIRED ENTERTAINMENT INC
DYADIC INTERNATIONAL INC
HUDSON TECHNOLOGIES INC
LIQUID MEDIA GROUP LTD
EHANG HOLDINGS LIMITED SPON ADS EACH REP 2 ORD SHS CL A DR
PAINREFORM LTD
YIELD10 BIOSCIENCE INC
BEAM GLOBAL
PROVENTION BIO INC
SELECTA BIOSCIENCES INC
LI AUTO INC SPN ADS ECH REP 2 ORD SHS CL A DR
REX AMERICAN RESOURCES CORPORATION
WALDENCAST ACQUISITION CORP
KUKE MUSIC HOLDING LIMITED SPON ADS REPRESENTING ONE CL A ORD SHS DR
SKYLINE CHAMPION CORPORATION
TDH HOLDINGS INC
FUBOTV INC
GAMESTOP CORPORATION
UP FINTECH HOLDING LIMITED SPON ADS EACH REP 15 ORD SHS CL A DR
RA MEDICAL SYSTEMS INC
DICKS SPORTING GOODS INC
MICROVISION INC WASHINGTON
VISTAGEN THERAPEUTICS INC
TELLURIAN INC
JAGUAR HEALTH INC
SCORE MEDIA AND GAMING INC
BOQII HOLDING LIMITED SPON ADS EACH REP 0.75 CL A ORD SHS DR
APPLIED THERAPEUTICS INC
ELECTRAMECCANICA VEHS CORP
RANPAK HOLDINGS CORP
BURNING ROCK BIOTECH LTD SPON ADS EACH REP 1 ORD SHS CL A DR
ASIA PACIFIC WIRE & CABLE
BELLICUM PHARMACEUTICALS
DOUBLEVERIFY HLDGS INC
EBANG INTERNATIONAL HLDGS INC
CANAAN INC SPON ADS EACH REP 15 ORD SHS CL A DR
URBAN OUTFITTERS INC
AGEAGLE AERIAL SYSTEMS INC
ABCELLERA BIOLOGICS INC
DOLPHIN ENTERTAINMENT INC
ZK INTERNATIONAL GROUP CO LTD
CELLECT BIOTECHNOLOGY LTD SPON ADS EACH REP 100 ORD SHS(POST SPLT) DR
REGIONAL HEALTH PROPERTIES INC REIT
ANNOVIS BIO INC
TUYA INC SPON ADS EACH REP 1 CL A ORD SHS DR
RLX TECHNOLOGY INC SPON ADS EACH REP 1 CL A ORD SHS DR
CINEDIGM CORPORATION
BIT DIGITAL INC
NAM TAI PPTY INC
EXAGEN INC
WRAP TECHNOLOGIES INC
IMAGE SENSING SYSTEMS INC
WUNONG NET TECHNOLOGY CO LTD
WORKHORSE GROUP INC
BIG 5 SPORTING GOODS CORP
KOSS CORP
I am in the process of scraping a website and it pulls the contents of the page, but there are categories with headers that are technically empty, but it still shows the header. I would like to only see categories with events in them. Ideally I could even have the components of each transactions so I can choose which elements I want displayed.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for events in NHsoup.findAll('tr')[2:]:
    print(events.text)
    NHlist.append(events.text)
print(' '.join(NHlist))
Like I said, this works to get all of the information, but there are a lot of headers and empty spaces that don't need to be pulled. For example, at the time I'm writing this, the 'acquisitions', 'conversions', and 'change in control' categories are empty, but the headers still come through, and there's a relatively large blank space after each of them. I feel like I need some sort of loop to go through each header ('th') and then get its rows ('tr'), but I'm just not quite sure how to do it.
You can use itertools.groupby to group elements and then filter out empty rows:
import requests
from itertools import groupby
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for _, g in groupby(NHsoup.select('tr'), lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.select('th') else (None, d['g'])):
    s = [tag.get_text(strip=True, separator=' ') for tag in g]
    if any(i == '' for i in s):
        continue
    NHlist.append(s)

# This is just pretty printing, all the data are already in NHlist:
l = max(map(len, (j for i in NHlist for j in i))) + 5
for item in NHlist:
    print('{: <4} {}'.format(' ', item[0]))
    print('-' * l)
    for i, ev in enumerate(item[1:], 1):
        print('{: <4} {}'.format(i, ev))
    print()
Prints:
Scraping NH Dept of Banking...
New Bank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/11/18 The Millyard Bank
Interstate Bank Combination
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/16/19 Optima Bank & Trust Company with and into Cambridge Trust Company Portsmouth, NH 03/29/19
Amendment to Articles of Agreement or Incorporation; Business or Capital Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
2 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
3 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
4 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
5 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
6 5/10/19 AB Trust Company New York, NY 06/04/19
Reduction in Capital
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 03/07/19 Primary Bank Bedford, NH 04/10/19
Amendment to Bylaws
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
2 2/25/19 Members First Credit Union Manchester, NH 04/05/19
3 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
4 6/28/19 Bellwether Community Credit Union
Interstate Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
2 03/08/19 One Credit Union Newport, NH 03/29/19
3 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
4 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
5 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
6 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
7 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
Interstate Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
New Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
2 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
3 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
4 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
New Loan Production Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH
03766-1430 04/15/19
Loan Production Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
Trade Name Requests
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Management Trust Company" 04/24/19
New Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/19/19 Janney Trust Co., LLC
2 02/25/19 Darwin Trust Company of New Hampshire, LLC
3 07/15/`9 Harbor Trust Company
Dissolution of Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
Trust Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/10/19 Charter Trust Company Rochester, NH 05/20/19
New Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 06824 03/22/19
Relocation of Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Floor, Boston MA To: 100 Summer Street, 12th Flr, Boston, MA 02/01/19
2 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New York, NY 10017 To: 410 Park Avenue, Suite 900 New York, NY 10022 03/29/19
3 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY 10022 To: One Grand Central Place 60 East 42nd Street, Ste 1550 New York, NY 10165 04/23/19
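The `groupby` key function above leans on a mutable default argument to number the groups, which is dense; the same grouping can be written as a plain loop. A minimal sketch over simulated `(kind, text)` pairs, since the real input is BeautifulSoup tags:

```python
# Simulated rows: 'th' marks a category header, 'td' a data row (made-up data).
rows = [
    ("th", "New Bank"),
    ("td", "12/11/18 The Millyard Bank"),
    ("th", "Acquisitions"),   # empty category: header followed by no events
    ("td", ""),               # blank filler row under the empty category
    ("th", "Reduction in Capital"),
    ("td", "03/07/19 Primary Bank Bedford, NH 04/10/19"),
]

groups = []
for kind, text in rows:
    if kind == "th":
        groups.append([text])       # start a new category at each header
    elif text.strip():
        groups[-1].append(text)     # attach non-blank rows to the current category

# keep only categories that actually contain events
groups = [g for g in groups if len(g) > 1]
```

Each surviving group is `[header, event, event, ...]`, the same shape `NHlist` ends up with.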
You could test which rows contain all '\xa0' (i.e. appear blank) and exclude them. I append to a list and convert to a pandas DataFrame, but you could just print the rows directly.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.nh.gov/banking/corporate-activities/index.htm')
soup = bs(r.content, 'lxml')
results = []
for tr in soup.select('tr'):
    row = [i.text for i in tr.select('th,td')]
    if row.count('\xa0') != len(row):
        results.append(row)
pd.set_option('display.width', 100)
df = pd.DataFrame(results)
df.style.set_properties(**{'text-align': 'left'})
df.columns = df.iloc[0]
df = df[1:]
df.fillna(value='', inplace=True)
print(df.head(20))
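The blank-row test in that loop just checks whether every cell is a non-breaking space. In isolation, with hypothetical row data:

```python
# Two simulated rows of cell text: one fully blank, one partially filled.
rows = [
    ["\xa0", "\xa0", "\xa0"],                   # renders as an empty row: drop it
    ["12/11/18", "The Millyard Bank", "\xa0"],  # has real content: keep it
]

# A row is dropped only when *every* cell is '\xa0'.
kept = [r for r in rows if r.count('\xa0') != len(r)]
```

Rows with a mix of real text and '\xa0' padding survive, which is the behaviour you want for sparsely filled tables.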
Not sure if this is how you want it, and there is probably a more elegant way, but what I basically did was:
1. Use pandas to get the table
2. Pandas automatically assigns columns, so move the column headers into the first row
3. Find which rows are all nulls
4. Drop the rows with all nulls, plus the previous row (its sub-header)
import pandas as pd
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
df = df.append(top_row, sort=True).sort_index().reset_index(drop=True)
null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
drop_hdr_rows = [x - 1 for x in null_rows ]
drop_rows = drop_hdr_rows + null_rows
new_df = df[~df.index.isin(drop_rows)]
Output:
print (new_df.to_string())
0 1 2 3
2 New Bank New Bank New Bank New Bank
3 12/11/18 The Millyard Bank NaN NaN
4 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
5 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
12 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
13 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
14 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
15 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
16 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
17 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
18 5/10/19 AB Trust Company New York, NY 06/04/19
19 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
20 03/07/19 Primary Bank Bedford, NH 04/10/19
21 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
22 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
23 2/25/19 Members First Credit Union Manchester, NH 04/05/19
24 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
25 6/28/19 Bellwether Community Credit Union NaN NaN
26 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
27 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
28 03/08/19 One Credit Union Newport, NH 03/29/19
29 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
30 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
31 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
32 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
33 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
34 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
35 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
36 New Branch Office New Branch Office New Branch Office New Branch Office
37 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
38 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
39 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
40 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
41 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
42 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
43 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
44 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
45 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
46 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
51 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
52 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
53 New Trust Company New Trust Company New Trust Company New Trust Company
54 02/19/19 Janney Trust Co., LLC NaN NaN
55 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
56 07/15/`9 Harbor Trust Company NaN NaN
57 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
58 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
59 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
60 5/10/19 Charter Trust Company Rochester, NH 05/20/19
61 New Trust Office New Trust Office New Trust Office New Trust Office
62 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
63 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
64 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
65 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
66 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19
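For reference, the null-row/sub-header dropping logic from that answer can be exercised on a toy frame (the column values here are made up):

```python
import pandas as pd

# Hypothetical miniature of the scraped table after read_html.
df = pd.DataFrame([
    ["New Bank", "New Bank", "New Bank"],              # category with an event
    ["12/11/18", "The Millyard Bank", None],
    ["Acquisitions", "Acquisitions", "Acquisitions"],  # empty category header
    [None, None, None],                                # the blank row under it
])

# Rows where every cell is null mark empty categories.
null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
# The row just above each all-null row is that category's header.
drop_hdr_rows = [x - 1 for x in null_rows]
new_df = df[~df.index.isin(drop_hdr_rows + null_rows)]
```

Only the populated category and its event row survive; the empty "Acquisitions" header and its blank filler row are both dropped.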