I've written a Python script to get all the names out of a table on a webpage. The names in that table are present in the page source, so they are static content. However, when I run the script below, I get only some of them (up to 2012 Topps Heritage Run), whereas the table contains many more.
Site address
How can I get all the names from the table under Company Sets header using requests?
What I've tried so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.psacard.com/psasetregistry/baseball/company-sets/16"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".dataTable tr td a[href*='/baseball/company-sets/']"):
    print(item.text)
Can you try the following:
print([inner_tag.find('a').text for inner_tag in soup.findAll('table')[0].findAll('td') if inner_tag.find('a')])
Explanation:
There are actually two tables on the page, and your code was extracting values from both of them. That is why the last value you got was 2012.
The code above extracts the text only from the first table, the one named Company Sets.
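The scoping idea in the explanation above can be sketched on a small inline document. The HTML here is a made-up stand-in for the real page (the table contents and the second table are assumptions), and html.parser is used only so the sketch needs no extra parser:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables, each with linked names.
SAMPLE_HTML = """
<table><tr><td><a href="/a">Set A</a></td></tr>
       <tr><td><a href="/b">Set B</a></td></tr></table>
<table><tr><td><a href="/c">Other C</a></td></tr></table>
"""


def link_texts_in_first_table(html):
    """Return the link texts found in the first <table> only."""
    soup = BeautifulSoup(html, "html.parser")
    first_table = soup.find("table")
    if first_table is None:
        return []
    return [a.get_text(strip=True) for a in first_table.select("td a")]


names = link_texts_in_first_table(SAMPLE_HTML)
print(names)
```

Searching from the table element rather than from the whole soup is what keeps links from the second table out of the result.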
You could combine requests with pandas read_html
import pandas as pd
import requests
url = 'https://www.psacard.com/psasetregistry/baseball/company-sets/16'
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
tables = pd.read_html(r.content)
df = tables[0]
df.drop(df.index[[0]], inplace = True)
print(df)
Related
I'm new to Python and am trying to extract data from one particular table on https://www.screener.in/company/ABB/consolidated/ (the last table, which is Shareholding Pattern).
I'm using the BeautifulSoup library for this, but I do not know how to go about it.
Below is my code so far. I am failing to pick the right table because the page has multiple tables that all share common classes and IDs, which makes it difficult to filter for the one table I want.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content,"html.parser")
# print(soup)
# data_table = soup.find('table', class_="data-table")
# print(data_table)
table_needed = soup.find("<h2>ShareholdingPattern</h2>")
# sub = table_needed.contents[0]
print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how:
import pandas as pd
import requests
df = pd.read_html(
    requests.get("https://www.screener.in/company/ABB/consolidated/").text,
    flavor="bs4",
)
df[-1].to_csv("last_table.csv", index=False)
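If the table's position on the page ever changes, another option is to locate it through its heading instead of by index. This is only a sketch on made-up markup: the section layout, the h2 headings and the cell values are assumptions, not the site's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: several tables share one class, each under its own heading.
SAMPLE_HTML = """
<section><h2>Quarterly Results</h2>
  <table class="data-table"><tr><td>Row X</td><td>1</td></tr></table></section>
<section><h2>Shareholding Pattern</h2>
  <table class="data-table"><tr><td>Row Y</td><td>2</td></tr></table></section>
"""


def table_after_heading(html, heading_text):
    """Return the first <table> that follows an <h2> with the given text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", string=lambda s: s and s.strip() == heading_text)
    return heading.find_next("table") if heading else None


table = table_after_heading(SAMPLE_HTML, "Shareholding Pattern")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)
```

Note that soup.find takes a tag name and filters, not an HTML string, which is why the lookup in the question returned nothing.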
I have set up BeautifulSoup to find a specific class on two webpages.
I would like to know how to write each URL's result to its own cell in a single CSV file.
Also, is there a limit to the number of URLs I can read? I would like to expand this to about 200 URLs once I get it working.
The class is always the same and I don't need any formatting, just the raw HTML in one cell per URL.
Thanks for any ideas.
from bs4 import BeautifulSoup
import requests
urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']
for u in urls:
    response = requests.get(u)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
    soup.find('div', class_="block")
Use pandas to work with tabular data: pd.DataFrame to create a table and DataFrame.to_csv to save it as a CSV file (the documentation is worth checking too, for example for append mode).
That's basically it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
def func(urls):
    for url in urls:
        data = requests.get(url).text
        soup = BeautifulSoup(data, 'lxml')
        # str() stores the element's raw HTML instead of the Tag object
        yield {
            "url": url, "raw_html": str(soup.find('div', class_="block"))
        }
urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']
data = func(urls)
table = pd.DataFrame(data)
table.to_csv("output.csv", index=False)
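On the "one cell per URL" part: CSV quoting lets a single cell hold raw HTML that contains commas, quotes and line breaks. Here is a minimal sketch of that with the standard csv module, writing to an in-memory buffer. The URLs and HTML snippets are invented placeholders, not real scrape results.

```python
import csv
import io

# Hypothetical results: (url, raw_html) pairs as a scraper might collect them.
results = [
    ("https://example.com/a", '<div class="block">Hello, "world"</div>'),
    ("https://example.com/b", "<div class=\"block\">\n  two lines\n</div>"),
]


def write_rows(rows, handle):
    """Write one CSV row per URL; the writer quotes cells that need it."""
    writer = csv.writer(handle)
    writer.writerow(["url", "raw_html"])
    writer.writerows(rows)


buffer = io.StringIO(newline="")
write_rows(results, buffer)

# Reading the text back yields the original cells unchanged.
buffer.seek(0)
round_trip = list(csv.reader(buffer))
print(len(round_trip))
```

With a real file you would pass open("output.csv", "w", newline="", encoding="utf-8") as the handle. The row count is not limited by the writer itself.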
I am attempting to scrape tables from the website spotrac.com and save the data to a pandas dataframe. For whatever reason, if the table I am scraping is over 100 rows, the BeautifulSoup object only appears to grab the first 100 rows of the table. If you run my code below, you'll see that the resulting dataframe has only 100 rows, and ends with "David Montgomery." If you visit the webpage (https://www.spotrac.com/nfl/rankings/2019/base/running-back/) and ctrl+F "David Montgomery", you'll see that there are additional rows. If you change the webpage in the get row of the code to "https://www.spotrac.com/nfl/rankings/2019/base/wide-receiver/" you'll see that the same thing happens. Only the first 100 rows are included in the BeautifulSoup object and in the dataframe.
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup
# Begin requests session
with requests.session() as s:
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
I have read that changing the parser can help. I have tried using different parsers by replacing the BeautifulSoup object code with the following:
soup = BeautifulSoup(r.content,'html5lib')
soup = BeautifulSoup(r.content,'html.parser')
Neither of these changes worked. I have run "pip install html5lib" and "pip install lxml" and confirmed that both were already installed.
This page uses JavaScript to load the extra data.
In DevTools in Firefox/Chrome you can see that it sends a POST request with the extra information {'ajax': True, 'mobile': False}.
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup
with requests.session() as s:
    r = s.post('https://www.spotrac.com/nfl/rankings/2019/base/running-back/', data={'ajax': True, 'mobile': False})
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
print(df)
I suggest you use requests-html:
import pandas as pd
from bs4 import BeautifulSoup
from requests_html import HTMLSession
if __name__ == "__main__":
    # Begin requests session
    s = HTMLSession()
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    r.html.render()
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.html.html, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
Then you will get 140 rows.
I am working on web scraping with Python and BeautifulSoup. My goal is to pull member data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source in Chrome, I cannot find the table; it seems the data is pulled in dynamically. But when I use Chrome's Inspect option, I can find the "membersTable" table I need inside a div.
How can I use BeautifulSoup to access the membersTable that I can see in Inspect?
You can mimic the POST request the page makes for its content, then use hjson to handle the unquoted keys in the string pulled out of the response:
import requests, hjson
import pandas as pd
data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)
data = hjson.loads(r.text.replace('while(1); ',''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)
df = pd.DataFrame(
    [[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]],
    columns=['Organisation', 'City', 'State', 'Country'],
)
print(df)
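The decoding in the code above happens in stages: strip the "while(1); " guard, parse the outer object, then parse the JSON text stored under JsonStructure. The sketch below walks through those stages on a made-up payload. It uses the standard json module only because the sample has quoted keys; the answer uses hjson because the real response has unquoted keys. The key names c1/c2 and all values are hypothetical.

```python
import json

PREFIX = "while(1); "

# Hypothetical response body shaped like the one the answer describes.
inner = json.dumps({"members": [[
    {"c1": [{"v": "Org One"}], "c2": [{"v": "Town A"}]},
    {"c1": [{"v": "Org Two"}], "c2": [{"v": "Town B"}]},
]]})
sample_body = PREFIX + json.dumps({"TotalCount": 2, "JsonStructure": inner})


def decode_members(body):
    """Strip the guard prefix, then decode the outer and the nested JSON."""
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    outer = json.loads(body)
    members = json.loads(outer["JsonStructure"])["members"][0]
    rows = [[cells[0]["v"] for cells in member.values()] for member in members]
    return outer["TotalCount"], rows


total, rows = decode_members(sample_body)
print(total, rows)
```

Comparing the decoded row count against TotalCount is a cheap way to notice when a response holds only part of the directory.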
Try this one
import requests
from bs4 import BeautifulSoup
url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
table = soup.find('table', attrs={'class': 'membersTable'})
row_list = []
for row in table.findAll('tr', {'class': ['normal']}):
    data = []
    for cell in row.findAll('td'):
        data.append(cell.text)
    row_list.append(data)
print(row_list)
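One caveat, going by the question: the table is not in the page source, so find() would return None for the downloaded HTML and the loop would raise AttributeError. Here is a minimal sketch of a guard for that case. Both HTML strings are invented stand-ins, one without the table and one with it.

```python
from bs4 import BeautifulSoup

STATIC_HTML = "<div id='members'></div>"  # table not present, as in the page source
RENDERED_HTML = """
<table class="membersTable">
  <tr class="normal"><td>Org One</td><td>Town A</td></tr>
</table>
"""


def rows_from_members_table(html):
    """Return cell texts per row, or an empty list when the table is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="membersTable")
    if table is None:  # find() returns None when nothing matches
        return []
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr", class_="normal")]


print(rows_from_members_table(STATIC_HTML))
print(rows_from_members_table(RENDERED_HTML))
```

An empty result from the first call is the signal that the data has to come from the request the page makes, not from the initial HTML.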
I am trying to scrape tables from wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas data frame.
This is the code
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
print soup
# Select the table element by its class attribute
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
rank=[]
country=[]
pop=[]
date=[]
per=[]
source=[]
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)  # was col2, which duplicated the country column
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)
columns={'Rank':rank,'Country':country,'Population':pop,'Date':date,'Percentage':per,'Source':source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it does not retrieve the table. The problem is in this section:
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has the class "wikitable sortable" but not jquery-tablesorter.
Make sure you know what element you are trying to select and check if your program sees the same elements you see, then make your selector.
The docs say you need to specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
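To see how class matching behaves, here is a small sketch on an inline fragment. The fragment is an assumption standing in for the served HTML, where the table carries only the two classes mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the served HTML carries two classes on the table.
HTML = '<table class="wikitable sortable"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(HTML, "html.parser")

# A multi-class string is compared against the whole class attribute value.
exact = soup.find("table", class_="wikitable sortable jquery-tablesorter")

# A CSS selector requires each listed class but tolerates extra ones.
by_selector = soup.select_one("table.wikitable.sortable")

print(exact is None, by_selector is not None)
```

The selector form keeps working whether or not a script later adds another class to the element.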