I'd like to zip some lists extracted from an HTML page. I use code like this:
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles = link.text
        links = f"Link : 'https://www.pds.com.ph/{link['href']}"
        dates = td.text
        for link, title, date in zip(links, titles, dates):
            dataframe = pd.DataFrame({'col1':title,'col2':link,'col3':date},index=[0])
            print(dataframe)
But the output is not what I expected:
col1 col2 col3
1 P L M
col1 col2 col3
1 D i a
...
What I EXPECT is:
Titles Links Dates
... ... ...
May I ask whether the syntax is correct, or what I could do to achieve that?
You can just pass the result from zip directly to pd.DataFrame, specifying the column names in a list:
df = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
If you are trying to create a dataframe from the extracted values, you need to collect them in lists before zipping them:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
titles = [] # to store extracted values in list
links = []
dates = []
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles.append(link.text)
        links.append(f"Link : 'https://www.pds.com.ph/{link['href']}")
        dates.append(td.text)
dataframe = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
# or you can use
# dataframe = pd.DataFrame({'Titles': titles, 'Links': links, 'Dates': dates})
print(dataframe)
# Titles Links Dates
# 0 RCBC Lists PHP 17.87257 Billion ASEAN Sustaina... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 31, 2021
# 1 Aboitiz Power Corporation Raises 8 Billion Fix... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 16, 2021
# 2 Century Properties Group, Inc Returns to PDEx ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 1, 2021
# 3 PDS Group Celebrates 2020’s Top Performers in ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 29, 2021
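To see why the original loop printed one character per column, it may help to compare zipping strings with zipping lists. The snippet below is a minimal sketch; the values `PDS`, `Link`, and `Mar` are placeholders assumed only for illustration, not scraped data:

```python
# Placeholder values (assumptions for illustration, not scraped data)
title = "PDS"
link = "Link"
date = "Mar"

# zip() over strings walks them character by character and stops
# at the shortest string, so each tuple holds single characters.
char_rows = list(zip(title, link, date))
print(char_rows)

# zip() over lists pairs up whole values, one tuple per scraped row.
titles = [title]
links = [link]
dates = [date]
rows = list(zip(titles, links, dates))
print(rows)
```

This is why the values have to be appended to lists inside the loop and zipped once, after the loop has finished.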
I am new to programming and have been practicing web scraping. I found an example where one of the columns I want in my output is part of the table header. I am able to extract all the table data I want, but I have been unable to get the Year values to show.
from bs4 import BeautifulSoup # this module helps in web scraping
import requests # this module helps us download a web page
import pandas as pd
url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
tables = soup.find_all('table')
len(tables)
for index,table in enumerate(tables):
    if ("Global annual population growth" in str(table)):
        table_index = index
print(table_index)
print(tables[table_index].prettify())
population_data = pd.DataFrame(columns=["Year","Population","Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if (col !=[]):
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population":Population,"Growth":Growth}, ignore_index= True)
population_data
You could use pandas directly here: pandas.read_html() scrapes the table and DataFrame.T transposes it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
Or get the same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text, 'html.parser')
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output
Population  Year  Years elapsed
1           1804  200,000+
2           1930  126
3           1960  30
4           1974  14
5           1987  13
6           1999  12
7           2011  12
8           2022  11
9           2037  15
10          2057  20
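The transpose step in the pandas answer can be tried offline. This is a sketch that assumes read_html returns a frame shaped like `raw` below, with fields as rows and milestones as columns; `raw` is a hand-built stand-in using two of the rows shown in the output, not the actual scraped table:

```python
import pandas as pd

# Hand-built stand-in (assumption): one row per field, one column per milestone.
raw = pd.DataFrame(
    [["Year", "1804", "1930"], ["Years elapsed", "200,000+", "126"]],
    columns=["Population", "1", "2"],
)

# Transpose so each milestone becomes a row, and turn the old header into a column.
df = raw.T.reset_index()

# The first row now holds the real column names: promote it, then drop it.
df.columns = df.loc[0]
df = df[1:]
print(df)
```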
That happens because you only collect the <td> cells in this line:
col = row.find_all('td')
If you look at a <tr> in the browser developer tools (F12), you can see that each row also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text
After that you can include it in the dictionary you append to population_data.
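A minimal sketch of that fix, run against a small inline table instead of the live page. The HTML and its values are made-up placeholders (an assumption about the table layout: year in a <th>, the other values in <td> cells), and the rows are collected in a list before building the frame rather than appended one by one:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder table (assumption): the year is in a <th>, the rest in <td>.
html = """
<table><tbody>
<tr><th>Year</th><th>Population</th><th>Growth</th></tr>
<tr><th>2001</th><td>1,000</td><td>1.0%</td></tr>
<tr><th>2002</th><td>1,010</td><td>0.9%</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.table.tbody.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has no <td>, so it is skipped
        records.append({
            "Year": row.find("th").text.strip(),
            "Population": cells[0].text.strip(),
            "Growth": cells[1].text.strip(),
        })

population_data = pd.DataFrame(records, columns=["Year", "Population", "Growth"])
print(population_data)
```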
I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row) # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range 249). However, soup.find() doesn't appear to work, just prints blank lists. soup.select() however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
While .find() returns only the first occurrence of an element, .select() / .find_all() give you a ResultSet you can iterate over.
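The difference can be checked on a small inline document. This is a sketch using a simplified, hypothetical version of the table (the real page has more cells per row); it also shows that find() treats its first argument as a tag name, so a CSS selector string matches nothing:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the country table.
html = """
<table id="countriesTable"><tbody>
<tr id="row0"><td>1</td><td>Afghanistan</td><td>AF</td></tr>
<tr id="row1"><td>2</td><td>Albania</td><td>AL</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() looks for a tag literally named "#row0 td": no match, so None.
not_found = soup.find("#row0 td")

# select_one() and select() take CSS selectors.
first_cell = soup.select_one("#row0 td")  # first matching <td>
all_cells = soup.select("#row0 td")       # every matching <td>

# find() works when given a tag name plus attribute filters.
same_row = soup.find("tr", id="row0")

print(not_found, first_cell.text, [c.text for c in all_cells], same_row["id"])
```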
There are many ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I select the table by its id and, close to your initial approach, the <tr> elements by their id as well, using a CSS selector with [id^="row"], which matches an id attribute whose value starts with row. In addition I use .stripped_strings to extract the text from the elements, store it in a list, and pick values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
Or, more precisely, select all <tr> in the <tbody> of the element with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
...
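As a rough offline illustration of the attribute-prefix selector and .stripped_strings, here is a sketch on a simplified, hypothetical table. Because this stand-in has fewer cells per row than the real page, the name and code end up at indexes 1 and 2 rather than 2 and 3:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup with extra whitespace and nested tags.
html = """
<table id="countriesTable">
  <thead><tr><th>#</th><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr id="row0"><td> 1 </td><td><span> Afghanistan </span></td><td><a>AF</a></td></tr>
    <tr id="row1"><td> 2 </td><td><span> Albania </span></td><td><a>AL</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
# [id^="row"] matches only rows whose id starts with "row", so the header row is skipped.
for row in soup.select('#countriesTable tr[id^="row"]'):
    strings = list(row.stripped_strings)  # whitespace-only strings are dropped
    pairs.append((strings[1], strings[2]))

print(pairs)
```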
An alternative, and in my opinion the best way to scrape tables, is pandas.read_html(), which can use lxml or BeautifulSoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
   Name            ISO 2
0  Afghanistan     AF
1  Åland Islands   AX
2  Albania         AL
3  Algeria         DZ
4  American Samoa  AS
5  Andorra         AD
...
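What the .dropna(...).iloc[...] chain does can be seen on a hand-built frame. This sketch assumes the parsed table contains one completely empty column and one trailing non-data row; both the column name `Unnamed: 0` and the footer row are hypothetical stand-ins, not taken from the real page:

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in for the frame read_html might return.
raw = pd.DataFrame({
    "Unnamed: 0": [nan, nan, nan],                       # entirely empty column
    "#": [1, 2, nan],
    "Name": ["Afghanistan", "Albania", "footer text"],   # last row is not data
    "ISO 2": ["AF", "AL", "footer text"],
})

result = (
    raw.dropna(axis=1, how="all")  # drop columns where every value is NaN
       .iloc[:-1, [1, 2]]          # drop the last row, keep columns at positions 1 and 2
)
print(result)
```

Selecting by position with iloc is brittle if the site adds or reorders columns; selecting by label, for example result[["Name", "ISO 2"]], may be more robust.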
find() expects its first argument to be the name of the tag you are searching for; it does not accept CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
To get the tr with the desired ID.
Then, to get the 2-letter country code, find the first a with the title ISO 3166-1 alpha-2 code and take its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no class name to search for, so I'd take the third td (index 2) and then pick the span inside it that contains the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })
    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan AF
Åland Islands AX
Albania AL
find_all() and select() return a list of matches, while find() and select_one() return only a single element.
import requests
import bs4
import pandas as pd
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
data=[]
for row in soup.select('.tablesorter.mark > tbody tr'):
    name=row.find("span",class_="sortkey").text
    country_code=row.select_one('td:nth-child(4)').text.replace('\n','').strip()
    data.append({
        'name':name,
        'country_code':country_code})
df= pd.DataFrame(data)
print(df)
Output:
name country_code
0 afghanistan AF
1 aland-islands AX
2 albania AL
3 algeria DZ
4 american-samoa AS
.. ... ...
244 wallis-and-futuna WF
245 western-sahara EH
246 yemen YE
247 zambia ZM
248 zimbabwe ZW
[249 rows x 2 columns]
I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below
And put it in a pandas dataframe like this
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks
You are almost there, though you don't really need beautifulsoup for that; just pandas.
Try this:
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:,[0,2,3,4,5]]
target
Output:
Season P W D L
Season League League League League
0 1886–87 NaN NaN NaN NaN
1 1888–89[9] 12 8 2 2
2 1889–90 22 9 2 11
etc. And you can take it from there.
New to python here and I have a question about creating a table from a scrape using Beautiful soup. Here is the code I am using:
import requests
page=requests.get("https://www.opensecrets.org/lobby/lobbyist.php?id=Y0000008510L&year=2018")
from bs4 import BeautifulSoup
soup=BeautifulSoup(page.content, 'lxml')
table=soup.find('table',{'id':'lobbyist_summary'})
for row in table:
    cells=row.find_all('a')
    rn=cells[0].get_text()
Error is:
AttributeError: 'NavigableString' object has no attribute 'find_all'
print(table) looks like this:
[Ballard Partners, Advanced Roofing Inc, Africell Holding, Amazon.com, ...]
I would like to (eventually) end up with a table that has each element of interest in a separate column so that it looks like:
[[firmsum,D000037635,2018,Ballard Partners],[clientsum,F203227,2018,Advanced Roofing Inc],[clientsum,F214670,2018,Africell Holding],[clientsum,D000023883, 2018, Amazon.com]...]
Assign 4 empty lists:
col1List = list()
col2List = list()
col3List = list()
col4List = list()
First, let's get the column 4 values:
trs = table.find_all('tr')[1]
tds = trs.find_all('a')
for i in range(len(tds)):
    col4List.append(tds[i].get_text())
This gives:
['Ballard Partners', 'Advanced Roofing Inc', 'Africell Holding',....]
Now, let us extract the values for first 3 columns from href:
hrefVal = trs.find_all('a')
for i in hrefVal:
    hVal = i.get('href')
    col11 = hVal.split('.php?id=', 1)
    col1 = col11[0]
    col1List.append(col1)
    col22 = col11[1].split('&', 1)
    col2 = col22[0]
    col2List.append(col2)
    col33 = col22[1].split('=', 1)
    col3 = col33[1]
    col3List.append(col3)
Now, let us put all the lists in a dataframe to make it look neat:
import pandas as pd
df = pd.DataFrame()
df['Col1'] = col1List
df['Col2'] = col2List
df['Col3'] = col3List
df['Col4'] = col4List
If I output the first few rows, it will look like how you want it:
Col1 Col2 Col3 Col4
firmsum D000037635 2018 Ballard Partners
clientsum F203227 2018 Advanced Roofing Inc
clientsum F214670 2018 Africell Holding
clientsum D000023883 2018 Amazon.com
clientsum D000000192 2018 American Health Care Assn
clientsum D000021839 2018 American Road & Transport Builders Assn
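As an alternative to chained split() calls, the same pieces could be pulled out with the standard library's URL parser. This is only a sketch: it assumes each href has the form `firmsum.php?id=...&year=...`, which is inferred from the split logic above, and `split_href` is a hypothetical helper name:

```python
import posixpath
from urllib.parse import urlparse, parse_qs

def split_href(href, name):
    """Return [page, id, year, name] for an href like 'firmsum.php?id=X&year=Y'."""
    parsed = urlparse(href)
    # 'firmsum.php' -> 'firmsum'
    page = posixpath.splitext(posixpath.basename(parsed.path))[0]
    # 'id=X&year=Y' -> {'id': ['X'], 'year': ['Y']}
    query = parse_qs(parsed.query)
    return [page, query["id"][0], query["year"][0], name]

rows = [
    split_href("firmsum.php?id=D000037635&year=2018", "Ballard Partners"),
    split_href("clientsum.php?id=F203227&year=2018", "Advanced Roofing Inc"),
]
print(rows)
```

Looking up `id` and `year` by name means the result does not depend on the order of the query parameters, which the positional splits do.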
I am trying to get the total assets values from the 10-K text filings. The problem is that the html format varies from one company to another.
Take Apple 10-K as an example:
Total assets is in a table that has a balance sheet header, and typical terms like cash, inventories, ... appear in some rows of that table. In the last row, there is a sum of assets: 290,479 for 2015 and 231,839 for 2014. I want to get the number for 2015 --> 290,479. I have not been able to find a way that
1) finds the relevant table that has some specific headings (like balance sheet) and words in rows (cash, ...)
2) get the value in the row that has the word total assets and belongs to the greater year (2015 for our example).
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml I can get
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using code
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')# "lxml")
# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
    # check text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get first `table` after `b`
        table = item.parent.findNext('table')
        # all rows in table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in row
            all_td = tr.find_all('td')
            # text in first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
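To go from the printed columns to a single number, one option is to keep only the cells that look like amounts and take the first one. The sketch below runs on an inline row that mirrors the column layout printed above; it assumes the most recent year is the first numeric column, which holds for this filing but may not hold for others:

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in mirroring the 'Total assets' row layout shown above.
html = """
<table>
<tr><th>CONSOLIDATED BALANCE SHEETS</th></tr>
<tr><td>Total assets</td><td></td><td>$</td><td>290,479</td><td></td>
    <td></td><td>$</td><td>231,839</td><td></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

total_assets = None
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Rows without <td> cells give an empty list and are skipped.
    if cells and cells[0] == "Total assets":
        # Keep cells made of digits and thousands separators, e.g. '290,479'.
        amounts = [int(c.replace(",", "")) for c in cells
                   if re.fullmatch(r"\d[\d,]*", c)]
        total_assets = amounts[0]  # assumption: first amount is the latest year
        break

print(total_assets)
```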