I am trying to scrape the table below using Python. I tried pulling HTML tags to find the element id_dt1_NGY00 and so on, but I cannot find it once the page is populated, so someone told me to use Selenium, and I did manage to scrape some data.
https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=13&y=8
The numbers are updated every 10 minutes, so this website is dynamic. I used the code below, but it prints everything out in a linear format rather than as rows and columns. Two sections of sample output are included below:
Contract
Last
Change
Open
High
Low
Volume
Prev. Stl.
Time
Links
May '21 (NGK21)
2.550s
+0.006
2.550
2.550
2.550
1
2.544
05/21/18
Q / C / O
Jun '21 (NGM21)
2.576s
+0.006
0.000
2.576
2.576
0
2.570
05/21/18
Q / C / O
Code below
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome(executable_path=r"C:\Users\siddk\PycharmProjects\WebSraping\venv\selenium\webdriver\chromedriver.exe")
browser.get("https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=14&y=16")

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

th_tags = soup.find_all('tr')
for th in th_tags:
    print(th.get_text())
I want to extract this data into pandas and analyze averages etc. on a daily basis. Please help; I have exhausted my strength trying to do this myself through multiple iterations of the code.
Try the script below to get the tabular data. The key is to find the right URL, one that serves the same table but is not generated dynamically, so that you can do your operation without using any browser simulator.
Give it a go:
from bs4 import BeautifulSoup
import requests

url = "https://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=ng&x=13&y=8&domain=if&display_ice=1&enabled_ice_exchanges=&tz=0&ed=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for tr in soup.find(class_="bcQuoteTable").find_all("tr"):
    data = [item.get_text(strip=True) for item in tr.find_all(["th", "td"])]
    print(data)
Results are like:
['Contract', 'Last', 'Change', 'Open', 'High', 'Low', 'Volume', 'Prev. Stl.', 'Time', 'Links']
['Cash (NGY00)', '2.770s', '+0.010', '0.000', '2.770', '2.770', '0', '2.760', '05/21/18', 'Q/C/O']
["Jun \\'18 (NGM18)", '2.901', '-0.007', '2.902', '2.903', '2.899', '138', '2.908', '17:11', 'Q/C/O']
["Jul \\'18 (NGN18)", '2.927', '-0.009', '2.928', '2.930', '2.926', '91', '2.936', '17:11', 'Q/C/O']
["Aug \\'18 (NGQ18)", '2.944', '-0.008', '2.945', '2.947', '2.944', '42', '2.952', '17:10', 'Q/C/O']
I have a CSV file with about 500 links from Google Patents, and I iterate over them in Scrapy in order to download the CSV file from each link (there is a download link on each page). I have successfully implemented this, but what I want now is a way to discover, from the HTML markup, the name of each downloaded file so that I can edit it with Python. An example link is this: https://patents.google.com/?q=O1C(%3dCCCC1C)C&oq=O1C(%3dCCCC1C)C. The name of the downloaded file is generated dynamically, so is there a way to find it out?
The name is just a timestamp: gp-search-20210816-142027.csv corresponds to 2021-08-16 14:20:27.
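Since the name is only a timestamp, one pragmatic alternative to predicting it from the markup is to look in the download directory after the download finishes and take the newest file matching the gp-search-*.csv pattern. A rough sketch; the download directory here is an assumption you would replace with wherever your browser or Scrapy actually saves the files:
import glob
import os

download_dir = os.path.expanduser("~/Downloads")  # assumed location, adjust to your setup

# newest file matching Google Patents' gp-search-<timestamp>.csv naming scheme
candidates = glob.glob(os.path.join(download_dir, "gp-search-*.csv"))
if candidates:
    latest_csv = max(candidates, key=os.path.getmtime)
    print(latest_csv)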
As a demo of what you may want to do, and if I'm understanding the question correctly, you could follow the direction in the code below. Note: it is only a suggested idea, and it only scrapes the PDF links from the first page to show the approach.
Code:
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

# next 12 lines have the job of getting the links of the PDF files from the URL below
# just the FIRST PAGE as a demo
url = "https://patents.google.com/?q=O1C(%3dCCCC1C)C&oq=O1C(%3dCCCC1C)C"
path = r'chromedriver'
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(path, options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

links = []
print('# this just gets the FIRST PAGE for a demo')
for link in soup.find_all('a', attrs={'class': 'pdfLink style-scope search-result-item'}):
    print(link['href'])
    links.append(link['href'])

# next 11 lines cover creating a dataframe from the downloaded CSV file on the Google search page
# and a 2nd frame for the links scraped. The two frames are eventually joined by a partial match
# of the 'result_link' from the first frame and a partial match of the filename of the pdf URL from the 2nd frame
pattern = r'/([A-Z]{2}\w{7})'
df = pd.read_csv('gp-search-20210816-190925.csv', skiprows=1)
df.columns = df.columns.str.replace(' ', '_')
df['partial_file_name'] = df['result_link'].str.extract(pattern)

df1 = pd.DataFrame(links, columns=['pdf_link'])
df1['partial_file_name'] = df1['pdf_link'].str.extract(pattern)

df = pd.concat([df, df1], axis=1)
df['filename'] = df['pdf_link'].str.extract(r'/([A-Z]{2}\w+)\.')
del df['partial_file_name']

print('\n\n', df.columns)
# 12 columns in total but for demo showing five
df[['filename', 'id', 'title', 'filing/creation_date', 'pdf_link']].head()
Outputs:
# this just gets the FIRST PAGE for a demo
https://patentimages.storage.googleapis.com/01/d1/77/6b0b7640eaccda/US7550931.pdf
https://patentimages.storage.googleapis.com/a2/32/15/69cf7713e8e2bf/JP2008525498A.pdf
https://patentimages.storage.googleapis.com/7e/6b/b7/001a8040e216ee/TWI686424B.pdf
https://patentimages.storage.googleapis.com/0f/14/fc/ecb56564f14f6b/WO2005009447A1.pdf
https://patentimages.storage.googleapis.com/95/fd/d5/ed4fe960bdec1c/KR20140096378A.pdf
https://patentimages.storage.googleapis.com/7e/29/01/231cc0813a0f6a/US5026677.pdf
https://patentimages.storage.googleapis.com/ff/f9/c9/7b775d6534d9cb/EP0628427A1.pdf
https://patentimages.storage.googleapis.com/bd/b3/ba/f38866e0b298e2/KR960004857B1.pdf
https://patentimages.storage.googleapis.com/79/e2/11/78aea87078687f/US5942486.pdf
https://patentimages.storage.googleapis.com/62/f5/da/f291e7552a45a6/US5142089.pdf
Index(['id', 'title', 'assignee', 'inventor/author', 'priority_date',
'filing/creation_date', 'publication_date', 'grant_date', 'result_link',
'representative_figure_link', 'pdf_link', 'filename'],
dtype='object')
filename id title filing/creation_date pdf_link
0 US7550931 US-7550931-B2 Controlled lighting methods and apparatus 2007-03-15 https://patentimages.stora.....ccda/US7550931.pdf
1 JP2008525498A JP-2008525498-A Enzyme modulators and therapy 2005-12-23 https://patentimages.stora....f/JP2008525498A.pdf
2 TWI686424B TW-I686424-B Polymer containing triazine ring and compositi... 2016-01-15 https://patentimages.storage.googleapis.com/7e...
3 WO2005009447A1 WO-2005009447-A1 Single dose fast dissolving azithromycin 2004-07-22 https://patentimages.storage.googleapis.com/0f...
4 KR20140096378A KR-20140096378-A Low chloride compositions of olefinically func... 2012-11-19 https://patentimages.storage.googleapis.com/95...
It shows a way you can get the filenames, links, and other fields aligned.
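One caveat: pd.concat(..., axis=1) lines the two frames up by row position, so it assumes the CSV rows and the scraped links arrive in the same order. If that is not guaranteed, a merge on the shared partial key is a safer variant. A sketch using the df (from the CSV) and df1 (from the links) as built above, before the concat/delete steps:
# join on the extracted patent number instead of relying on row order
merged = df.merge(df1, on='partial_file_name', how='left')
merged['filename'] = merged['pdf_link'].str.extract(r'/([A-Z]{2}\w+)\.')
print(merged[['id', 'title', 'filename', 'pdf_link']].head())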
I am trying to use this code to scrape information from Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)), specifically names of executives. The code I have here is only giving me the following output:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
I am trying to get the names as well. Here is my current code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter")]
The very last line is the one where I believe the problem lies. I have tried removing and adding parts of the selector "table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter" but keep getting the same result. I copied it straight from the site, so I'm not sure what I'm missing. If it isn't that, is there something wrong with the rest of the code on that line? Thank you!
There's a simpler way to do it. Taking one of your urls at random, try this:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
Output:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
etc.
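Extending that to all four pages is mostly a loop. Hard-coding tables[4] may not hold on every page, so the sketch below uses read_html's match argument to pick the table by a heading it contains; the assumption is that the other three pages use the same 'Date assumed office' column heading as the Governor page:
import pandas as pd

urls = ['https://ballotpedia.org/Governor_(state_executive_office)',
        'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)',
        'https://ballotpedia.org/Secretary_of_State_(state_executive_office)',
        'https://ballotpedia.org/Attorney_General_(state_executive_office)']

frames = []
for url in urls:
    # match keeps only the tables whose text contains the given string
    tables = pd.read_html(url, match="Date assumed office")
    frames.append(tables[0].assign(source=url.split('/')[-1]))

officials = pd.concat(frames, ignore_index=True)
print(officials.head())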
You could try to reach the table via its id selector:
import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
Use the following CSS selector to find the table first, then use pandas read_html() to load it into a dataframe.
This will give you all the data in a single dataframe.
import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

df1 = pd.DataFrame()
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    df1 = df1.append(df, ignore_index=True)

print(df1)
If you want to fetch individual dataframes, then try this.
import pandas as pd
import requests
from bs4 import BeautifulSoup

listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']

for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    print(df)
I am trying to use the United States Securities and Exchange (SEC) database, to look at company financial reports (known as 10k’s) to pull out a list of the executive committee members for each filing. I am currently using the most recent files for Microsoft (stock ticker: MSFT) and Walmart (stock ticker: WMT). I know I can look up this information elsewhere on finance websites but I am trying to make a flexible database for personal use. My issue:
The table index position is different in each report; on one company's report the table I want may be table 38, and on another it may be table 45, so a static index/position count will not work across multiple filings.
The specific attributes in each HTML table tag change, so I cannot search for a common attribute. In some cases I find common attributes and sometimes I do not.
I am starting to think I may not be able to automate this due to the lack of identifiers that are unique within each file and common across all files. I've banged my head against many Python web scraping tutorials and videos over the last few weeks. Any suggestions are appreciated; full automation would be ideal so I can loop through multiple filings, but partial help works too, I'm here to learn. I might be bumping into something that is too diverse to automate.
Microsoft Link:
https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm
Desired Table:
<table border="0" cellspacing="0" cellpadding="0" align="center" style="border-collapse:collapse; width:100%;">
Walmart Link:
https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm
Desired Table:
<table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
Code to Count Number of Tables in Each Page:
from selenium import webdriver
from bs4 import BeautifulSoup
chrome_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)
#Microsoft
browser.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
msft = browser.page_source
page_msft = BeautifulSoup(msft, 'html.parser')
tables_msft = page_msft.find_all("table")
#Walmart
browser.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
wmt = browser.page_source
page_wmt = BeautifulSoup(wmt, 'html.parser')
tables_wmt = page_wmt.find_all("table")
print("MSFT Result Table Count: " + str(len(tables_msft)))
print("Walmart Result Table Count: " + str(len(tables_wmt)))
Results:
MSFT Result Table Count: 263
Walmart Result Table Count: 258
Process finished with exit code 0
Firstly, you don't need Selenium; the requests library will be faster and avoids the overhead. I was able to partially figure out a way to extract the required data, but since the number of columns is different, the two results cannot be combined together (for Microsoft and Walmart).
The code below generates the two required dataframes, one for Microsoft and one for Walmart.
You still need to manipulate the column names. The idea is to get the table that has a td value of 'Age', since that is a unique piece of table data. Let me know if you need any clarification:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

#Microsoft
page = requests.get("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
soup = BeautifulSoup(page.text, 'html')

resmsft = []
tables_msft = soup.find(text="Age").find_parent("table")
for row in tables_msft.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        resmsft.append([cell.get_text(strip=True) for cell in row.find_all("td")])

non_empty = [sublist for sublist in resmsft if any(sublist)]
df_msft = pd.DataFrame.from_records(non_empty)
df_msft[df_msft == ''] = np.nan
df_msft = df_msft.dropna(axis=1, how='all')

#Walmart
page = requests.get("https://www.sec.gov/Archives/edgar/data/104169/000010416919000016/wmtform10-kx1312019.htm")
soup = BeautifulSoup(page.text, 'html')
#page_wmt = BeautifulSoup(soup, 'html.parser')

tables_wmt = soup.find(text="Age").find_parent("table")
reswmt = []
for row in tables_wmt.find_all("tr")[1:]:
    # print([cell.get_text(strip=True) for cell in row.find_all("td")])
    if row:
        reswmt.append([cell.get_text(strip=True) for cell in row.find_all("td")])

non_empty_wmt = [sublist for sublist in reswmt if any(sublist)]
df_wmt = pd.DataFrame.from_records(non_empty_wmt)
df_wmt[df_wmt == ''] = np.nan
df_wmt = df_wmt.dropna(axis=1, how='all')
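For the column names mentioned above, one option is to reuse the header row of the matched table itself. This is only a sketch; whether the counts line up depends on how many empty spacer cells the filing uses, hence the length check:
# take the header cells from the first row of the matched table (the row that contains 'Age')
header_cells = [cell.get_text(strip=True) for cell in tables_msft.find_all("tr")[0].find_all(["td", "th"])]
header_cells = [h for h in header_cells if h]  # drop empty spacer cells
if len(header_cells) == len(df_msft.columns):
    df_msft.columns = header_cells
print(df_msft.head())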
I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests, urllib can fetch the page instead; it should not cause any issue.
soup = BeautifulSoup(html, "lxml")

res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on links of senators, like 'Richard Shelby' which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy

import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"

with requests.Session() as session:
    html = session.get(link).text

    soup = BeautifulSoup(html, "lxml")

    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]

    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you want to, then, follow the links to senator "profile" pages, you would first need to extract the link out of the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
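For the "# TODO: parse" part, the question asks specifically for the value under "Assumed office". The infobox markup varies between articles, so the following is only a rough sketch that locates the literal label in the profile page's soup (the soup built in the snippet above) and reads the first piece of text that follows it:
import re

label = soup.find(string=re.compile(r"Assumed office"))
if label:
    # the date usually follows the label as the next text node in the infobox
    value = label.find_next(string=re.compile(r"\w"))
    if value:
        print("Assumed office:", value.strip())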
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to get the tabular data parsed - .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.
I am trying to get some information from this page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275
where I am particularly interested in extracting the Characteristics data as follows:
group_id: xxx
medicore_id: xxxxxxx
date_of_visit_sample_drawn_date: xxxxxxx
rin: xxxxxx
donor_id: xxxxx
sle_visit_designation: xxxxxxx
bold_shipment_batch: xxxxxx
rna_concentrated: xxxxxx
subject_type: xxxxxxx
so on and so forth.
Upon inspecting the page, I realize that this information is deeply nested within other larger tables and that there is no special class/id for me to effectively parse for the characteristics information.
I have been unsuccessfully trying to look for tables within tables, but I find that sometimes not all tables are being read. This is what I have so far:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275").text
soup = BeautifulSoup(source, 'lxml')

table = soup.find_all('table')
for i in table:
    print(i.prettify())
print(len(table))            # 22 tables
print(table[6].prettify())   # narrow down on relevant table

table = table[6]
table_subtables = table.find_all('table')
for i in table_subtables:
    print(i.prettify())
print(len(table_subtables))  # 14 tables

tbb = table_subtables[1]
tbb_subtable = tbb.find_all('table')
for i in tbb_subtable:
    print(i.prettify())
print(len(tbb_subtable))     # 12 tables

tbbb = tbb_subtable[5]
tbbb_subtable = tbbb.find_all('table')
for i in tbbb_subtable:
    print(i.prettify())
print(len(tbbb_subtable))    # 6 tables
so on and so forth. However, as I keep doing this, I find that not all tables are being read. Can someone point me to a better solution?
You can use regular expressions and urllib to specifically scrape the keywords and their corresponding values:
import re
import urllib
data = str(urllib.urlopen('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275').read())
target_vals = ['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']
final_data = {i:re.findall('(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
Output:
{
'date_of_visit_sample_drawn_date': '2009',
'rna_concentrated': 'No',
'sle_visit_designation': 'Baseline',
'rin': '8',
'subject_type': 'Patient',
'donor_id': '19',
'bold_shipment_batch': '1',
'medicore_id': 'B0019V1',
'group_id': 'A'
}
Edit: given multiple links, you can create a pandas dataframe out of the generated data for each:
import re
import urllib
import pandas as pd
def get_data_from_links(link, target_vals=['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']):
    data = str(urllib.urlopen(link).read())
    return {i:re.findall('(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
returned_data = get_data_from_links('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275')
df = pd.DataFrame([returned_data])
Output:
bold_shipment_batch date_of_visit_sample_drawn_date donor_id group_id \
0 1 2009 19 A
medicore_id rin rna_concentrated sle_visit_designation subject_type
0 B0019V1 8 No Baseline Patient
If you have a list of links you would like to retrieve your data from, you can build a table by constructing a nested dictionary of the resulting data and passing it to DataFrame.from_dict:
link_lists = ['link1', 'link2', 'link3']
final_data = {i:get_data_from_links(i) for i in link_lists}
new_table = pd.DataFrame.from_dict(final_data, orient='index')
Output (assuming the first link is 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275'):
rin rna_concentrated date_of_visit_sample_drawn_date \
link1 8 No 2009
sle_visit_designation bold_shipment_batch group_id subject_type \
link1 Baseline 1 A Patient
medicore_id donor_id
link1 B0019V1 19
The way Ajax1234 has shown in his solution is definitely the best way to go. However, if a hardcoded index is not a barrier and you wish to avoid regex to achieve the same thing, this is another approach you might try:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275")
soup = BeautifulSoup(res.content, 'lxml')
for items in soup.select("td[style*='justify']")[2:3]:
    data = '\n'.join([item for item in items.strings][:9])
    print(data)
Output:
group_id: A
medicore_id: B0019V1
date_of_visit_sample_drawn_date: 2009-09-14
rin: 8.5
donor_id: 19
sle_visit_designation: Baseline
bold_shipment_batch: 1
rna_concentrated: No
subject_type: Patient