I'm new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I wish, but have been unable to get the Year dates to show.
from bs4 import BeautifulSoup # this module helps in web scraping
import requests # this module helps us download a web page
import pandas as pd
"https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
tables = soup.find_all('table')
len(tables)
for index, table in enumerate(tables):
    if "Global annual population growth" in str(table):
        table_index = index
print(table_index)
print(tables[table_index].prettify())

population_data = pd.DataFrame(columns=["Year","Population","Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population":Population,"Growth":Growth}, ignore_index=True)
population_data
You could use pandas directly here to reach your goal: pandas.read_html() scrapes the table, and DataFrame.T transposes it so the header years become a column:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text, 'html.parser')

pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output:

  Population  Year Years elapsed
0          1  1804      200,000+
1          2  1930           126
2          3  1960            30
3          4  1974            14
4          5  1987            13
5          6  1999            12
6          7  2011            12
7          8  2022            11
8          9  2037            15
9         10  2057            20
Actually it's because you are scraping <td> in this line:
col = row.find_all('td')
But if you take a look at a <tr> in the developer tools (F12), you can see that the table also contains a <th> tag which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text
and after that you can append it to population_data.
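For completeness, a minimal sketch of the fixed loop, keeping the question's DataFrame.append style (note that append is deprecated in recent pandas; pd.concat is the modern replacement):

population_data = pd.DataFrame(columns=["Year", "Population", "Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        year = row.find('th').text.strip()  # the year lives in the <th> cell, not a <td>
        population = col[0].text.strip()
        growth = col[1].text.strip()
        population_data = population_data.append(
            {"Year": year, "Population": population, "Growth": growth},
            ignore_index=True)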
I am trying to append XML data to empty lists to create a dataframe. I was able to create all the lists besides three, because some tags have null values. I have tried an xpath function to grab all the text from the desired tag.
import requests
from lxml import objectify

URL = 'https://data.virginia.gov/api/views/xvir-sctz/rows.xml?accessType=DOWNLOAD'
response = requests.get(URL).content
root = objectify.fromstring(response)
Here are the empty lists I would like to append to:
households_served = []
individuals_served = []
pounds_of_food_distributed = []
month = []
I tried this to build the lists, and it works:
pounds_of_food_distributed = root.xpath('//response/row/row/pounds_of_food_distributed/text()')
individuals_served = root.xpath('//response/row/row/individuals_served/text()')
households_served = root.xpath('//response/row/row/households_served/text()')
month = root.xpath('//response/row/row/month/text()')
But when I try to build a pd.DataFrame with this code, I get an error.
table = pd.DataFrame(
    {'Month': month,
     'House': households_served,
     'People': individuals_served,
     'Pounds': pounds_of_food_distributed
     })
Any suggestions?
An alternative would be to use pandas directly with pandas.read_xml: set the xpath to select all the row children of the outer row elements and just slice the output to your needs. This will also take care of the missing / empty elements in the XML structure:
import pandas as pd
df = pd.read_xml('https://data.virginia.gov/api/views/xvir-sctz/rows.xml?accessType=DOWNLOAD', xpath='row//row')[['month','households_served','individuals_served','pounds_of_food_distributed']]
df.columns = ['Month','House','People','Pounds']
df
         Month  House People   Pounds
0      October    NaN    NaN   156644
1        April    NaN    NaN    21602
2       August    NaN    NaN    51338
3          May    627    270    67633
4          May    NaN    NaN    54561
...        ...    ...    ...      ...
4254    August     37     17   482661
4255    August   1974    783    29211
4256     April    485    259  16254.5
4257    August  34986   8583   561709
4258      June    749    258  31560.7

[4259 rows x 4 columns]
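If you would rather see 0 than NaN for the missing counts, you can fill them in afterwards (a one-liner, assuming 0 is an acceptable placeholder for this data):

df[['House', 'People']] = df[['House', 'People']].fillna(0)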
Your problem is that some elements do not exist in every row, while month always exists, so the four lists end up with different lengths and pd.DataFrame raises an error. One idea is to fill in the missing data with 0s, or anything else you want:
import requests
from lxml import objectify
import pandas as pd

URL = 'https://data.virginia.gov/api/views/xvir-sctz/rows.xml?accessType=DOWNLOAD'
response = requests.get(URL).content
root = objectify.fromstring(response)

households_served = []
individuals_served = []
pounds_of_food_distributed = []
month = []

for element in root.xpath('//row/row'):
    month.append(element["month"])  # month always exists
    individuals_served.append(element["individuals_served"] if hasattr(element, "individuals_served") else 0)
    households_served.append(element["households_served"] if hasattr(element, "households_served") else 0)
    pounds_of_food_distributed.append(element["pounds_of_food_distributed"] if hasattr(element, "pounds_of_food_distributed") else 0)
print(len(month))
print(len(individuals_served))
print(len(pounds_of_food_distributed))
print(len(households_served))
table = pd.DataFrame(
    {'Month': month,
     'House': households_served,
     'People': individuals_served,
     'Pounds': pounds_of_food_distributed
     })
print(table)
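One caveat with this loop, as a sketch on top of it: element["month"] appends lxml objectify elements rather than plain Python values. If you want strings and numbers in the lists, objectify's .text and .pyval accessors convert them:

for element in root.xpath('//row/row'):
    month.append(element["month"].text)  # plain string instead of an objectify element
    individuals_served.append(element["individuals_served"].pyval if hasattr(element, "individuals_served") else 0)
    # ...same pattern for the other two lists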
If you look at this page https://metals-api.com/currencies there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify: I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

for code in soup.select('.table tbody tr td:nth-child(1)'):
    code = code.text
    data.append(code)

df = pd.DataFrame(data, columns=['code'])
# df.to_csv('code.csv', index=False)  # to store the data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
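Since the stated goal is a dropdown menu rather than a dataframe, you can turn that single column straight into a plain Python list:

tickers = df['code'].tolist()  # e.g. ['XAU', 'XAG', 'XPT', ...]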
I stand corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works, but apparently with a very slight workaround it can still work just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
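As a side note on why the workaround helps: when pd.read_html(URL) fails on a URL directly, it is often because the site rejects pandas' default HTTP client. Fetching the page yourself with requests, optionally with a browser-like User-Agent header, and passing the content along usually gets around that. A sketch, where the header value is just an example:

page = requests.get(URL, headers={'User-Agent': 'Mozilla/5.0'})  # example UA string
df = pd.read_html(page.content)[0]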
For a project I'm scraping data from futbin players, and I would like to add that scraped data to a dict or pandas dataframe. I've been stuck for a couple of hours and would like some help if possible. I will put my code below showing what I have so far. This piece of code only prints out the data, and from there I'm clueless about what to do.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    info = soup.find('div', id='info_content')
    rows = info.find_all('td')
    for info in rows:
        print(info.text.strip())
The work you have already done to identify the table you want is good. From there:
- use read_html() to convert it to a dataframe
- apply basic transforms to turn it into columns rather than key/value pairs
- get the details of all wanted footballers with a list comprehension
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
def myhtml(url):
    # use BS4 to get the table that has the required data
    html = str(BeautifulSoup(requests.get(url).content, 'html.parser').find('div', id='info_content').find("table"))
    # read_html() returns a list; take the first one. The first column holds attribute names, so transpose to build the DF
    return pd.read_html(html)[0].set_index(0).T
df = pd.concat([myhtml(u) for u in urls])
   Name                      Club          Nation  League      Skills  Weak Foot  Intl. Rep  Foot   Height      Weight  Revision  Def. WR  Att. WR  Added on    Origin  R.Face  B.Type  DOB         Robert Lewandowski FIFA 21 Career Mode  Age
1  Edson Arantes Nascimento  FUT 21 ICONS  Brazil  Icons       5       4          5          Right  173cm 5'8"  70      Icon      Med      High     2020-09-10  Prime   NaN     Unique  23-10-1940  NaN                                     NaN
1  Robert Lewandowski        FC Bayern     Poland  Bundesliga  4       4          4          Right  184cm 6'0"  80      TOTY      Med      High     2021-01-22  TOTY    NaN     Unique  NaN         Robert Lewandowski FIFA 21 Career Mode  NaN
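Since the question mentions a dict as well, note that the combined frame converts directly; to_dict('records') gives one {attribute: value} dict per player:

records = df.to_dict('records')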
I would do it with open() and write():
file = open("filename.txt", "w")
The "w" specifies the following:
"w" - Write - Opens a file for writing, creates the file if it does not exist
And then:
file.write(text_to_save)
Note that os.path is only needed if you are building the file path yourself (e.g. with os.path.join); in that case be sure to import it:
import os.path
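As a minimal sketch of the more idiomatic form: a with block closes the file automatically, even if an error occurs mid-write:

with open("filename.txt", "w") as file:
    file.write(text_to_save)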
I am trying to use this code to scrape information from Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)), specifically names of executives. The code I have here is only giving me the following output:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
I am trying to get the names as well. Here is my current code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter")]
The very last line is the one in which I believe the problem exists. I have tried removing and adding code to the selector "table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter" but keep getting the same result. I copied it straight from the site, so I'm not sure what I'm missing. If not this, is there something wrong with the rest of the code in that line? Thank you!
There's a simpler way to do it. Taking one of your urls at random, try this:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
Output:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
etc.
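If you'd rather not hard-code the table index, read_html() also accepts a match argument and then returns only the tables whose text matches it (the match string here is just an example):

tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)", match="Date assumed office")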
You could try to reach the table via selector:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
Use the following CSS selector to find the table first, then use pandas read_html() to load it into a dataframe. This will give you all the data in a single dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
df1=pd.DataFrame()
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    df1 = df1.append(df, ignore_index=True)
print(df1)
If you want to fetch individual dataframe then try this.
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    print(df)
I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below and put it in a pandas dataframe like this.
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks!
You are almost there, though you don't really need beautifulsoup for that; just pandas.
Try this:
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:,[0,2,3,4,5]]
target
Output:
Season P W D L
Season League League League League
0 1886–87 NaN NaN NaN NaN
1 1888–89[9] 12 8 2 2
2 1889–90 22 9 2 11
etc. And you can take it from there.
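One last note: because the Wikipedia table has two header rows, target ends up with MultiIndex columns (hence the Season/League line in the output above). If you want plain single-level column names, you can drop the second level:

target.columns = target.columns.droplevel(1)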