I have a small problem converting scraped data into a time series. Here are the steps that I carried out.
My code and the output it produces are as follows:
import json
import re

import requests
from bs4 import BeautifulSoup

url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=xxx&t=BBRI'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=xxx&t=BBRI'

# The response is wrapped as xxx(...): extract the JSON payload, then parse its HTML fragment
soup1 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url1).text)[0])['componentData'], 'lxml')
soup2 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url2).text)[0])['componentData'], 'lxml')

def print_table(soup):
    for i, tr in enumerate(soup.select('tr')):
        row_data = [td.text for td in tr.select('th, td') if td.text]
        if not row_data:
            continue
        # Rows with fewer than 12 cells get an 'X' label prepended
        if len(row_data) < 12:
            row_data = ['X'] + row_data
        for j, td in enumerate(row_data):
            if j == 0:
                print('{: >30}'.format(td))
            else:
                print('{: ^12}'.format(td))
        print()

print_table(soup1)
This produces the output:
X
2010-12
2011-12
2012-12
2013-12
2014-12
2015-12
2016-12
2017-12
2018-12
2019-12
TTM
Revenue IDR Mil
30,552,600
40,203,051
43,104,711
51,133,344
59,556,636
69,813,152
82,504,537
90,844,308
99,067,098
108,468,320
105,847,159
I need to convert it to a pandas DataFrame that looks like this:
data
X Revenue IDR Mil
2010-12 30,552,600
2011-12 40,203,051
2012-12 43,104,711
2013-12 51,133,344
2014-12 59,556,636
2015-12 69,813,152
2016-12 82,504,537
2017-12 90,844,308
2018-12 99,067,098
2019-12 108,468,320
2020-12 105,847,159
This is a bit simplified from what you are doing, but I think it gets you where you need to be (mostly borrowed from Bitto Bennichan):
from io import StringIO

import pandas as pd
import requests

url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?t=BBRI'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?t=BBRI'

lm_json = requests.get(url1).json()
# Wrap the HTML string in StringIO: newer pandas versions deprecate passing literal HTML
df_list = pd.read_html(StringIO(lm_json["componentData"]))
df_list[0].transpose()
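For the time-series part of the question, one possible follow-up is to index the transposed table by period. The sketch below works on a hand-copied subset of the values printed in the question rather than on the live response. The names `periods`, `revenue` and `ts` are illustrative, and dropping the TTM entry is an assumption about what is wanted.

```python
import pandas as pd

# Hand-copied subset of the printed table (assumption: the live data has the same shape)
periods = ["2010-12", "2011-12", "2012-12", "TTM"]
revenue = ["30,552,600", "40,203,051", "43,104,711", "105,847,159"]

df = pd.DataFrame({"X": periods, "Revenue IDR Mil": revenue})

# Drop the TTM row, which is not a calendar period
ts = df[df["X"] != "TTM"].copy()

# Strip thousands separators and convert to integers
ts["Revenue IDR Mil"] = (
    ts["Revenue IDR Mil"].str.replace(",", "", regex=False).astype("int64")
)

# Use a monthly PeriodIndex so the frame behaves like a time series
ts.index = pd.PeriodIndex(ts.pop("X"), freq="M")
ts.index.name = "X"

print(ts)
```

A PeriodIndex is only one option; `pd.to_datetime` on the same strings would give month-start timestamps instead.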
I'm trying to grab the batting-against tables for all pitchers found on this page.
I believe the problem lies with the data being inside an HTML comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The table content cannot be extracted directly because the table itself is stored inside a comment string.
After you have fetched your web page and loaded it into BeautifulSoup, you can work around this with the following steps:
1. Get the div with id = 'all_players_batting_pitching', which contains your table.
2. Extract the table from the comment using the decode_contents function, then reload that text into a new soup.
3. Extract each record of your table by looking for the tr tags, then each value by looking for the td tags, keeping only the values for your columns [1, 4, 13] (the code addresses them as td positions [0, 3, 12]).
4. Load your values into a pandas.DataFrame, ready to be used.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)

# extracting table from html
soup = bs(page.content, "html.parser")
table = soup.find(id='all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text, "html.parser")

# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)

# loading records into a DataFrame (column names taken from the output below)
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
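As an alternative to splitting the markup on '--', BeautifulSoup exposes comments as `Comment` nodes that can be located and re-parsed. The sketch below runs on a hypothetical miniature of the page (the `html` string is made up to mirror the structure described above), so it only illustrates the technique and has not been run against the live site.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: the table lives inside an HTML comment
html = """
<div id="all_players_batting_pitching">
  <!--
  <table>
    <tr><th>1</th><td>Cory Abbott</td><td>48.0</td><td>12</td></tr>
    <tr><th>2</th><td>Albert Abreu</td><td>38.2</td><td>5</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find(id="all_players_batting_pitching")

# Comment is a NavigableString subclass; find the first one inside the wrapper
comment = wrapper.find(string=lambda text: isinstance(text, Comment))

# Re-parse the comment body as ordinary HTML
inner = BeautifulSoup(comment, "html.parser")
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in inner.find_all("tr")
]
print(rows)
```

Searching for the `Comment` node avoids relying on '--' never appearing elsewhere inside the div.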
I am new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I want, but have been unable to get the Year values to show.
from bs4 import BeautifulSoup  # this module helps in web scraping
import requests  # this module helps us to download a web page
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
tables = soup.find_all('table')
len(tables)

for index, table in enumerate(tables):
    if "Global annual population growth" in str(table):
        table_index = index
print(table_index)
print(tables[table_index].prettify())

rows = []
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        # DataFrame.append was removed in pandas 2.0, so collect the rows in a list
        rows.append({"Population": Population, "Growth": Growth})
population_data = pd.DataFrame(rows, columns=["Year", "Population", "Growth"])
population_data
You could use pandas directly here to reach your goal: pandas.read_html() scrapes the table and DataFrame.T transposes it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or get the same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text, 'html.parser')
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output
Population  Year  Years elapsed
1           1804  200,000+
2           1930  126
3           1960  30
4           1974  14
5           1987  13
6           1999  12
7           2011  12
8           2022  11
9           2037  15
10          2057  20
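The transpose-and-promote-header step from the pandas answer can be tried offline on a small frame. This is a minimal sketch: `wide` is a hypothetical stand-in for what read_html returns, built by hand from the first two columns of the output above.

```python
import pandas as pd

# Hypothetical wide table: one row per attribute, one column per milestone
wide = pd.DataFrame(
    [["Year", 1804, 1930], ["Years elapsed", "200,000+", 126]],
    columns=["Population", 1, 2],
)

# Transpose, then turn the old column labels into an ordinary first column
df = wide.T.reset_index()

# Promote the first row to the header and drop it from the data
df.columns = df.loc[0]
df = df[1:]

print(df)
```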
Actually, it's because you are only scraping <td> cells in this line:
col = row.find_all('td')
But if you take a look at the <tr> in developer tools (F12), you can see that each row also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text
After that you can append it to population_data.
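Put together, the fix might look like the sketch below. It runs on a hypothetical two-row table with made-up placeholder values (not real population figures) in the shape described above, with the year in a <th> and the other values in <td> cells.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with placeholder values: year in <th>, data in <td>
html = """
<table><tbody>
  <tr><th>Year</th><th>Population</th><th>Growth</th></tr>
  <tr><th>1951</th><td>100</td><td>1.0%</td></tr>
  <tr><th>1952</th><td>102</td><td>2.0%</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row has no <td> cells, so it is skipped
        rows.append({
            "Year": row.find("th").text.strip(),
            "Population": col[0].text.strip(),
            "Growth": col[1].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=["Year", "Population", "Growth"])
print(population_data)
```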
I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row) # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range 249). However, soup.find() doesn't appear to work, just prints blank lists. soup.select() however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
While .find() returns only the first occurrence of an element, .select() / .find_all() give you a ResultSet you can iterate over.
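The difference is easy to see on a one-row document. In this minimal sketch the `html` string is a made-up stand-in for the real page: the same selector string that select() understands is treated by find() as a tag name.

```python
from bs4 import BeautifulSoup

# Made-up one-row table standing in for the real page
html = '<table><tr id="row0"><td>Afghanistan</td><td>AF</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find() takes a tag name, so this looks for a tag literally named "#row0 td"
by_find = soup.find("#row0 td")

# select() and select_one() interpret the same string as a CSS selector
by_select = soup.select("#row0 td")
by_select_one = soup.select_one("#row0 td")

# find() does locate the row when given a tag name plus an attribute filter
by_find_attrs = soup.find("tr", {"id": "row0"})

print(by_find)                         # None
print([td.text for td in by_select])   # ['Afghanistan', 'AF']
print(by_select_one.text)              # Afghanistan
```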
There are a lot of ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I selected the table by its id and, close to your initial approach, the <tr> also by its id, using a CSS selector with [id^="row"], which matches an id attribute whose value starts with row. In addition, I used .stripped_strings to extract the text from the elements, stored it in a list and picked the values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
...
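To see what the [id^="row"] selector and .stripped_strings do without hitting the site, here is a small sketch on hypothetical markup. The real page has more cells per row, which is why the answer picks indices 2 and 3, while this reduced example uses 0 and 1.

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced version of the table; real rows contain more cells
html = """
<table id="countriesTable">
  <thead><tr id="header"><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr id="row0"><td><span> Afghanistan </span></td><td><a>AF</a></td></tr>
    <tr id="row1"><td><span> Albania </span></td><td><a>AL</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
# Only <tr> elements whose id starts with "row" match; the header row does not
for row in soup.select('#countriesTable tr[id^="row"]'):
    # stripped_strings yields each non-empty text fragment with whitespace trimmed
    cells = list(row.stripped_strings)
    pairs.append((cells[0], cells[1]))

print(pairs)
```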
An alternative, and in my opinion the best, way to scrape tables is pandas.read_html(), which can use lxml or beautifulsoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or, to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
   Name            ISO 2
0  Afghanistan     AF
1  Åland Islands   AX
2  Albania         AL
3  Algeria         DZ
4  American Samoa  AS
5  Andorra         AD
...
find expects its first argument to be the name of the tag you're searching for; it won't work with CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
to get the tr with the desired ID.
Then, to get the 2-letter country code, find the first a with the title ISO 3166-1 alpha-2 code and get its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no class name to search for, so I'd take the td at index 2 and then search within it for the span containing the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })
    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan AF
Åland Islands AX
Albania AL
find_all() and select() return a list of matches, whereas find() and select_one() return only a single element.
import requests
import bs4
import pandas as pd
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
data=[]
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()
    data.append({
        'name': name,
        'country_code': country_code})
df= pd.DataFrame(data)
print(df)
Output:
name country_code
0 afghanistan AF
1 aland-islands AX
2 albania AL
3 algeria DZ
4 american-samoa AS
.. ... ...
244 wallis-and-futuna WF
245 western-sahara EH
246 yemen YE
247 zambia ZM
248 zimbabwe ZW
[249 rows x 2 columns]
If you look at this page, https://metals-api.com/currencies, there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify, I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data=[]
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    code = code.text
    data.append(code)
df=pd.DataFrame(data,columns=['code'])
#df.to_csv('code.csv',index=False)# to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
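Since the goal is a plain list for a dropdown menu, the DataFrame step can be skipped. The sketch below applies the same td:nth-child(1) selector to a hypothetical two-row table (the `html` string is made up to mirror the page) and collects the codes directly into a Python list.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the currencies table
html = """
<table class="table">
  <thead><tr><th>Code</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>XAU</td><td>Gold</td></tr>
    <tr><td>XAG</td><td>Silver</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# td:nth-child(1) matches the first cell of each body row, i.e. column 1
codes = [
    td.get_text(strip=True)
    for td in soup.select(".table tbody tr td:nth-child(1)")
]
print(codes)
```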
I sit corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works; apparently, with a very slight workaround (fetching the page with requests first), it can still work just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below
And put it in a pandas dataframe like this
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks.
You are almost there, though you don't really need beautifulsoup for that; pandas alone is enough.
Try this:
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:,[0,2,3,4,5]]
target
Output:
       Season       P       W       D       L
       Season  League  League  League  League
0     1886–87     NaN     NaN     NaN     NaN
1  1888–89[9]      12       8       2       2
2     1889–90      22       9       2      11
etc. And you can take it from there.
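One way to "take it from there" is to flatten the two-level header shown in the output. The sketch below builds a hypothetical frame by hand with the same kind of MultiIndex columns; the "Division" column and its values are placeholders, and only the season and P/W/D/L numbers are copied from the output above.

```python
import pandas as pd

# Hypothetical frame imitating the two-level header seen in the output
columns = pd.MultiIndex.from_tuples([
    ("Season", "Season"), ("Division", "League"),
    ("P", "League"), ("W", "League"), ("D", "League"), ("L", "League"),
])
table = pd.DataFrame(
    [["1888–89", "A", 12, 8, 2, 2], ["1889–90", "B", 22, 9, 2, 11]],
    columns=columns,
)

# Select columns by position, as in the answer
target = table.iloc[:, [0, 2, 3, 4, 5]].copy()

# Keep only the top header level to get flat column names
target.columns = target.columns.get_level_values(0)

print(target)
```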