Creating a table in Python with BeautifulSoup

New to Python here, and I have a question about creating a table from a scrape using BeautifulSoup. Here is the code I am using:
import requests
page=requests.get("https://www.opensecrets.org/lobby/lobbyist.php?id=Y0000008510L&year=2018")
from bs4 import BeautifulSoup
soup=BeautifulSoup(page.content, 'lxml')
table = soup.find('table', {'id': 'lobbyist_summary'})
for row in table:
    cells = row.find_all('a')
    rn = cells[0].get_text()
The error is:
AttributeError: 'NavigableString' object has no attribute 'find_all'
print(table) looks like this:
[Ballard Partners, Advanced Roofing Inc, Africell Holding, Amazon.com, ...]
I would like to (eventually) end up with a table that has each element of interest in a separate column so that it looks like:
[[firmsum,D000037635,2018,Ballard Partners],[clientsum,F203227,2018,Advanced Roofing Inc],[clientsum,F214670,2018,Africell Holding],[clientsum,D000023883, 2018, Amazon.com]...]

Assign 4 empty lists:
col1List = list()
col2List = list()
col3List = list()
col4List = list()
First, let's get the column 4 values:
trs = table.find_all('tr')[1]
tds = trs.find_all('a')
for i in range(len(tds)):
    col4List.append(tds[i].get_text())
This gives:
['Ballard Partners', 'Advanced Roofing Inc', 'Africell Holding',....]
Now, let us extract the values for the first three columns from the href:
hrefVal = trs.find_all('a')
for i in hrefVal:
    hVal = i.get('href')
    col11 = hVal.split('.php?id=', 1)
    col1 = col11[0]
    col1List.append(col1)
    col22 = col11[1].split('&', 1)
    col2 = col22[0]
    col2List.append(col2)
    col33 = col22[1].split('=', 1)
    col3 = col33[1]
    col3List.append(col3)
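As an aside, the chained split calls can also be written with the standard library's urllib.parse, which is less fragile if the order of the query parameters ever changes. A sketch, using a sample href in the shape described above:

```python
from urllib.parse import urlsplit, parse_qs

href = "firmsum.php?id=D000037635&year=2018"  # sample href in the page's format
parts = urlsplit(href)
query = parse_qs(parts.query)      # {'id': ['D000037635'], 'year': ['2018']}

col1 = parts.path.split('.php')[0]  # 'firmsum'
col2 = query['id'][0]               # 'D000037635'
col3 = query['year'][0]             # '2018'
```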
Now, let us put all the lists in a dataframe to make it look neat:
import pandas as pd
df = pd.DataFrame()
df['Col1'] = col1List
df['Col2'] = col2List
df['Col3'] = col3List
df['Col4'] = col4List
If you output the first few rows, it looks like what you want:
Col1 Col2 Col3 Col4
firmsum D000037635 2018 Ballard Partners
clientsum F203227 2018 Advanced Roofing Inc
clientsum F214670 2018 Africell Holding
clientsum D000023883 2018 Amazon.com
clientsum D000000192 2018 American Health Care Assn
clientsum D000021839 2018 American Road & Transport Builders Assn
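If you instead want the nested-list form shown in the question rather than a DataFrame, the same four lists can be zipped row-wise. A sketch with the first two rows hard-coded:

```python
col1List = ['firmsum', 'clientsum']
col2List = ['D000037635', 'F203227']
col3List = ['2018', '2018']
col4List = ['Ballard Partners', 'Advanced Roofing Inc']

# zip pairs up the i-th entry of each column list into one row
rows = [list(r) for r in zip(col1List, col2List, col3List, col4List)]
# rows[0] == ['firmsum', 'D000037635', '2018', 'Ballard Partners']
```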


How to scrape table data with th and td with BeautifulSoup?

I am new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I wish, but have been unable to get the Year dates to show.
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us to download a web page
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
tables = soup.find_all('table')
len(tables)
for index, table in enumerate(tables):
    if "Global annual population growth" in str(table):
        table_index = index
print(table_index)
print(tables[table_index].prettify())
population_data = pd.DataFrame(columns=["Year", "Population", "Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population": Population, "Growth": Growth}, ignore_index=True)
population_data
You could use pandas directly here to reach your goal: pandas.read_html() scrapes the table, and DataFrame.T transposes it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or the same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text)
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output

    Population  Year  Years elapsed
1            1  1804       200,000+
2            2  1930            126
3            3  1960             30
4            4  1974             14
5            5  1987             13
6            6  1999             12
7            7  2011             12
8            8  2022             11
9            9  2037             15
10          10  2057             20
Actually, it's because you are only scraping <td> in this line:
col = row.find_all('td')
But if you take a look at a <tr> in the developer tools (F12), you can see that the table also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line after the if condition:
year = row.find('th').text
and after that you can append it to population_data.
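That suggestion can be sketched on a minimal, self-contained table; the HTML fragment here is made up for illustration, not the Wikipedia markup:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>1950</th><td>2.5</td><td>1.8%</td></tr>
<tr><th>2000</th><td>6.1</td><td>1.3%</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.find_all('tr'):
    col = row.find_all('td')
    if col:
        year = row.find('th').text  # the header cell holds the year
        records.append({"Year": year,
                        "Population": col[0].text,
                        "Growth": col[1].text})
# records[0] == {'Year': '1950', 'Population': '2.5', 'Growth': '1.8%'}
```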

Data/Table Scraping from Website using Python

I'm trying to scrape a data from a table on a website.
However, I am continuously running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
my_data = pd.DataFrame(columns=headers)
my_data = my_data.iloc[:, :-4]
Here, I was able to make an empty dataframe with headers same as the table (I did iloc because there were some repeating columns at the end).
Now, I wanted to fill in the empty dataframe through:
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" in this line: length = len(my_data).
I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Rather than trying to fill an empty DataFrame, it would be simpler to utilize .read_html, which returns a list of DataFrames after parsing every table tag within the HTML.
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), 3 DataFrames are returned because the second table is split into two table tags between rows 12 and 13 for some reason; but they can all be combined into one DataFrame.
dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())
allTime = dfList[0].set_index(['rank', 'Youtuber'])
# (header row is in the 1st half, so the 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])
columns_ordered = [
    'started', 'category', 'subscribers', 'subscribers/year',
    'video views', 'Video views/Year', 'video count', 'Video count/Year'
]  # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
combinedDf should then hold both tables' columns, indexed by rank and Youtuber.
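The re-headering and concat step can be sketched with small hypothetical stand-ins for the two pieces of the split table (the column names and values here are made up, since the example can't fetch the live page):

```python
import pandas as pd

# first piece carries the header row; second piece was read headerless
top = pd.DataFrame({"rank": [1, 2], "Youtuber": ["A", "B"], "started": [2015, 2006]})
bottom = pd.DataFrame([[3, "C", 2016]])   # columns are just 0, 1, 2 here
bottom.columns = top.columns              # reuse the first piece's headers
perYear = pd.concat([top, bottom], ignore_index=True)
# perYear now has 3 rows and the columns rank, Youtuber, started
```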
You can try to use pd.read_html to read the table into a dataframe:
import pandas as pd
url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"
df = pd.read_html(url)[0]
print(df)
Prints:
rank Youtuber subscribers video views video count category started
0 1 ✿ Kids Diana Show 106000000 86400421379 1052 People & Blogs 2015
1 2 Movieclips 58500000 59672883333 39903 Film & Animation 2006
2 3 Ryan's World 34100000 53568277882 2290 Entertainment 2015
3 4 Toys and Colors 38300000 44050683425 901 Entertainment 2016
4 5 LooLoo Kids - Nursery Rhymes and Children's Songs 52200000 30758617681 605 Music 2014
5 6 LankyBox 22500000 30147589773 6913 Comedy 2016
6 7 D Billions 24200000 27485780190 582 NaN 2019
7 8 BabyBus - Kids Songs and Cartoons 31200000 25202247059 1946 Education 2016
8 9 FGTeeV 21500000 23255537029 1659 Gaming 2013
...and so on.

Python Zip list to Dataframe

I'd like to zip some lists from HTML. I use code like:
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles = link.text
        links = f"Link : 'https://www.pds.com.ph/{link['href']}"
        dates = td.text
        for link, title, date in zip(links, titles, dates):
            dataframe = pd.DataFrame({'col1': title, 'col2': link, 'col3': date}, index=[0])
            print(dataframe)
But the output is not what I expected:
col1 col2 col3
1 P L M
col1 col2 col3
1 D i a
...
What I EXPECT is:
Titles Links Dates
... ... ...
May I ask whether the syntax is correct, or what I could do to achieve that?
You can just pass the result from zip directly to pd.DataFrame, specifying the column names in a list:
df = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
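For context on why the original output came out as single characters: inside the loop, titles, links, and dates are plain strings, and zip iterates strings character by character. A minimal stand-alone demonstration:

```python
# titles, links and dates were strings, so zip pairs up their characters,
# which is exactly why the output showed rows like 'P L M' and 'D i a'
title, link, date = "PDS", "Link", "March"
print(list(zip(title, link, date))[:2])
# [('P', 'L', 'M'), ('D', 'i', 'a')]
```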
If you are trying to create a dataframe from the extracted values, then you need to store them in lists before performing the zip:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
titles = [] # to store extracted values in list
links = []
dates = []
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles.append(link.text)
        links.append(f"Link : 'https://www.pds.com.ph/{link['href']}")
        dates.append(td.text)
dataframe = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
# or you can use
# dataframe = pd.DataFrame({'Titles': titles, 'Links': links, 'Dates': dates})
print(dataframe)
# Titles Links Dates
# 0 RCBC Lists PHP 17.87257 Billion ASEAN Sustaina... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 31, 2021
# 1 Aboitiz Power Corporation Raises 8 Billion Fix... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 16, 2021
# 2 Century Properties Group, Inc Returns to PDEx ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 1, 2021
# 3 PDS Group Celebrates 2020’s Top Performers in ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 29, 2021

How can I add webscraped html data to dataframe having multiple values in python?

I am learning Python and trying to create a DataFrame with pandas. I want to take the data from the table on this website https://www.chilli-diy.com/chilikeimtabelle/ so I can later visualize it with Altair. I am having trouble with the column "Schärfe" because there are multiple value formats, I guess (1-10, 1 bis 4, 10+, ...).
So all I get is the "Brazilian Ghost", because it has no value?
Thanks in advance.
permalink = 'https://www.chilli-diy.com/chilikeimtabelle/'
chilis = requests.get(permalink).text
soup = bs4.BeautifulSoup(chilis, "html.parser")
tables = soup.find_all('table')
names = []
peps = []
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if peps is None:
            continue
    peps.append(cells[2].text)
    names.append(cells[0].text)
df = pd.DataFrame({"Chilisorte": names, "Schärfe": peps})
df
Out:
Chilisorte Schärfe
0 Brazilian Ghost
output
Your dataframe only has one row because names and peps are appended outside of the inner loop, so all they hold is the last element. Also, I guess you wanted to check whether the cell is None, not peps:
names = []
peps = []
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if cells:
            peps.append(cells[2].text)
            names.append(cells[0].text)
If you examine the DOM of the page in your browser, you'll see that all Chilisorte column cells (<td>) have the class column-1, and the Schärfe column cells have the class column-3.
The easiest way is to extract all cells with the relevant classes into two lists, take their .text values, and put them into a dataframe.
import requests
import bs4
import pandas as pd

permalink = 'https://www.chilli-diy.com/chilikeimtabelle/'
chilis = requests.get(permalink).text
soup = bs4.BeautifulSoup(chilis, "html.parser")
tables = soup.find_all('table')
col1_tds = tables[0].findAll('td', {"class": "column-1"})
col1 = [td.text for td in col1_tds]
col3_tds = tables[0].findAll('td', {"class": "column-3"})
col3 = [td.text for td in col3_tds]
pd.DataFrame({
    "Chilisorte": col1,
    "Schärfe": col3
})
Output:
Chilisorte Schärfe
0 Anaheim 1 bis 4
1 Habanero Chocolate 10
2 Habanero White 10
3 Bird Pepper Wild 9
4 Bhut(Bih) Jolokia Yellow 10+
... ... ...
272 Naglah 10++
273 Dorset Naga 10+++
274 Jigsaw 10+++
275 Black Naga 10+
276 Brazilian Ghost
277 rows × 2 columns
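The class-based extraction can be sketched on a tiny made-up fragment mimicking the page's class names; pulling the two columns by class keeps the lists the same length even when a cell is empty, which is why the "Brazilian Ghost" row no longer breaks the alignment:

```python
import bs4

# made-up fragment in the shape of the page's table markup
html = """<table>
<tr><td class="column-1">Anaheim</td><td class="column-3">1 bis 4</td></tr>
<tr><td class="column-1">Brazilian Ghost</td><td class="column-3"></td></tr>
</table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
names = [td.text for td in soup.find_all('td', {"class": "column-1"})]
heat = [td.text for td in soup.find_all('td', {"class": "column-3"})]
# names == ['Anaheim', 'Brazilian Ghost'], heat == ['1 bis 4', '']
```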

Is it possible to read HTML tables into pandas with style tag?

I am trying to use pandas read_html function to read the "Official List of the House of Representatives," located here.
Using
df_list = pd.read_html('http://clerk.house.gov/member_info/olmbr.aspx',header=0,encoding = "UTF-8")
house = df_list[0]
I do get a nice DataFrame with the representatives name, state, and district. The header is correct and the encoding also. So far so good.
However, the problem is the party. There is no column for the party. Instead, the party is denoted by the font (roman or italic). Looking at the HTML source, here's an entry for a democrat:
<tr><td><em>Adams, Alma S.</em></td><td>NC</td><td>12th</td></tr>
and here's an entry for a republican:
<tr><td>Aderholt, Robert B.</td><td>AL</td><td>4th</td></tr>
Republicans lack the <em></em> tags around their names.
How would one go about retrieving this information? Can it be done with pandas or do I need some more sophisticated HTML parsers? If so, which ones?
I think you need to create a parser:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://clerk.house.gov/member_info/olmbr.aspx"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html5lib')
table = soup.find_all('table')[0]
#print (table)
data = []
#skip the header row
rows = table.find_all('tr')[1:]
for row in rows:
    cols = row.find_all('td')
    #get all child tags of the first td
    childrens = cols[0].findChildren()
    #extract all tag names, joined by ,
    a = ', '.join([x.name for x in childrens]) if len(childrens) > 0 else ''
    cols = [ele.text.strip() for ele in cols]
    #add the tag value for each row
    cols.append(a)
    data.append(cols)
#DataFrame constructor
cols = ['Representative', 'State', 'District', 'Tag']
df = pd.DataFrame(data, columns=cols)
print (df.head())
Representative State District Tag
0 Abraham, Ralph Lee LA 5th
1 Adams, Alma S. NC 12th em
2 Aderholt, Robert B. AL 4th
3 Aguilar, Pete CA 31st em
4 Allen, Rick W. GA 12th
It is also possible to create 1/0 columns for all possible tags:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://clerk.house.gov/member_info/olmbr.aspx"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html5lib')
table = soup.find_all('table')[0]
#print (table)
data = []
rows = table.find_all('tr')[1:]
for row in rows:
    cols = row.find_all('td')
    childrens = cols[0].findChildren()
    a = '|'.join([x.name for x in childrens]) if len(childrens) > 0 else ''
    cols = [ele.text.strip() for ele in cols]
    cols.append(a)
    data.append(cols)
cols = ['Representative', 'State', 'District', 'Tag']
df = pd.DataFrame(data, columns=cols)
df = df.join(df.pop('Tag').str.get_dummies())
print (df.head())
Representative State District em strong
0 Abraham, Ralph Lee LA 5th 0 0
1 Adams, Alma S. NC 12th 1 0
2 Aderholt, Robert B. AL 4th 0 0
3 Aguilar, Pete CA 31st 1 0
4 Allen, Rick W. GA 12th 0 0
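A possible follow-up step, sketched on hard-coded rows: map the Tag column to a party label, assuming (as the question describes) that <em> marks Democrats. The Party column name and the D/R labels are my own choices, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({"Representative": ["Adams, Alma S.", "Aderholt, Robert B."],
                   "Tag": ["em", ""]})
# rows whose first td contained an <em> tag are Democrats, others Republicans
df["Party"] = df["Tag"].map(lambda t: "D" if "em" in t.split(', ') else "R")
# Party column: ['D', 'R']
```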
