First, thanks for taking the time to review my problem.
I am trying to read a web page into a list with BeautifulSoup and then pass it to a DataFrame.
# search for tbody and print each tr to a list
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
res = []
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    # print(cols)
    if row:
        res.append(row)
# transfer python list to dataframe
df_wocaps = pd.DataFrame(res, columns=["drop1", "Platz", "Vorher", "Wertpapier", "Kurs", "drop2", "Perf", "drop3", "1 Monat", "Suchanfragen", "drop4", "drop5", "drop6"])
df_wocaps.head()
# drop unused columns
df = df_wocaps.drop(['drop1', 'drop2', 'drop3', 'drop4', 'drop5', 'drop6'], axis=1)
df.head()
At the end it looks like this:
[dataframe picture]
I want to extract only the relevant information, without all the HTML stuff and brackets. Does anyone have an idea how to do that? Thanks in advance.
Use the variable cols (which holds the extracted cell text) instead of row when appending:
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    # print(cols)
    if row:
        res.append(cols)
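For reference, a minimal end-to-end sketch of the corrected flow; the URL is a placeholder, and it assumes the page has a single tbody:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# placeholder URL -- substitute the actual page being scraped
response = requests.get('https://example.com/table-page')
soup = BeautifulSoup(response.text, 'html.parser')

res = []
for row in soup.find('tbody').find_all('tr'):
    # collect the stripped text of every cell, not the raw <tr> tag
    cols = [x.text.strip() for x in row.find_all('td')]
    if cols:
        res.append(cols)

df = pd.DataFrame(res)  # pass columns=[...] once the layout is known
print(df.head())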
I'm trying to scrape data from investing.com. My code works except for the table header.
My columns variable has the names as data-col-name="abc", but I don't know how to extract them as column_names.
table_rows = soup.find("tbody").find_all("tr")
table = []
for i in table_rows:
    td = i.find_all("td")
    row = [cell.string for cell in td]
    table.append(row)

columns = soup.find("thead").find_all("th")
column_names =
df_temp = pd.DataFrame(data=table, columns=column_names)
df_dji = df_dji.append(df_temp)
You have to use .text instead of .string
columns = soup.find("thead").find_all("th")
#print(columns)
column_names = [cell.text for cell in columns]
print(column_names)
or use .get_text() or even .get_text(strip=True)
column_names = [cell.get_text() for cell in columns]
print(column_names)
The official documentation shows .string (.text is an unofficial method in new versions, though it probably was official in older versions), but here .string doesn't work, maybe because there is another element, a <span>, inside the <th>. get_text() gathers all strings from all elements inside the <th> and joins them into one string.
EDIT:
If you want to get the value from data-col-name=, then use one of:
cell['data-col-name']
cell.get('data-col-name')
cell.attrs['data-col-name']
cell.attrs.get('data-col-name')
(and the same works for cell['id'] or cell['class'])
column_names = [cell['data-col-name'] for cell in columns]
column_names = [cell.get('data-col-name') for cell in columns]
# etc.
attrs is a normal dictionary, so you can use attrs.get(key, default_value), attrs.keys(), attrs.items(), attrs.values(), or treat it like a dictionary in a for loop.
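A tiny self-contained demonstration of both approaches; the HTML snippet here is a made-up stand-in for the investing.com header markup:

from bs4 import BeautifulSoup

# made-up stand-in for the real <thead> markup
html = '<thead><tr><th data-col-name="abc"><span>ABC</span></th></tr></thead>'
soup = BeautifulSoup(html, 'html.parser')

columns = soup.find("thead").find_all("th")
print([cell.get_text(strip=True) for cell in columns])  # ['ABC']
print([cell['data-col-name'] for cell in columns])      # ['abc']
print([cell.attrs for cell in columns])                 # [{'data-col-name': 'abc'}]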
I am trying to scrape the table "List of chemical elements" from this website: https://en.wikipedia.org/wiki/List_of_chemical_elements
I then want to store the table data in a pandas DataFrame so that I can convert it into a CSV file. So far I have scraped and stored the headers of the table in a DataFrame, and I have also managed to retrieve each individual row of data from the table. However, I am having trouble storing the table data in the DataFrame. Below is what I have so far:
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')
table = soup.select_one('table.wikitable')
table_body = table.find('tbody')
#print(table_body)
rows = table_body.find_all('tr')
cols = [c.text.replace('\n', '') for c in rows[1].find_all('th')]
df2a = pd.DataFrame(columns = cols)
df2a
for row in rows:
    records = row.find_all('td')
    if records != []:
        records = [r.text.strip() for r in records]
        print(records)
Here I found all the header rows; the header is split into two parts, so the first and second rows of column labels are collected separately:
all_columns=soup.find_all("tr",attrs={"style":"vertical-align:top"})
first_column_data=[i.get_text(strip=True) for i in all_columns[0].find_all("th")]
second_column_data=[i.get_text(strip=True) for i in all_columns[1].find_all("th")]
Now, since we need 16 columns, take the appropriate labels from each part and add them to new_lst, which becomes the column list:
new_lst=[]
new_lst.extend(second_column_data[:3])
new_lst.extend(first_column_data[1:])
Now we have to find the row data: iterate through all tr tags with the anchor class, find the respective td tags in each, and append the resulting list of cell text to main_lst:
main_lst = []
for i in soup.find_all("tr", attrs={"class": "anchor"}):
    row_data = [row.get_text(strip=True) for row in i.find_all("td")]
    main_lst.append(row_data)
Output:
Atomic numberZ Symbol Name Origin of name[2][3] Group Period Block Standardatomicweight[a] Density[b][c] Melting point[d] Boiling point[e] Specificheatcapacity[f] Electronegativity[g] Abundancein Earth'scrust[h] Origin[i] Phase atr.t.[j]
0 1 H Hydrogen Greekelementshydro-and-gen, 'water-forming' 1 1 s-block 1.008 0.00008988 14.01 20.28 14.304 2.20 1400 primordial gas
....
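The answer stops at the two lists; a short hedged sketch of the final assembly step (it assumes each row in main_lst has exactly one entry per label in new_lst, and the file name is a placeholder):

import pandas as pd

# build the DataFrame from the collected rows and the stitched header list
df_elements = pd.DataFrame(main_lst, columns=new_lst)
df_elements.to_csv('chemical_elements.csv', index=False)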
Let pandas parse it for you:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
df.to_csv('file.csv', index=False)
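pd.read_html returns a list of DataFrames, one per table found (it needs lxml or html5lib installed), and [0] takes the first. If the page held several wikitables, the match parameter can narrow the result; a hedged variant, where the match string is an assumption based on the table's header text:

import pandas as pd

# keep only tables whose text contains the given string
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements',
                   match='Origin of name')
dfs[0].to_csv('file.csv', index=False)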
Goal: The goal of my project is to use BeautifulSoup (aka bs4) to scrape only the necessary data from an HTML file and import it into Excel. The HTML file is heavily formatted, so unfortunately I haven't been able to tailor more common solutions to my needs.
What I have tried: I have been able to parse the HTML file to the point where I am only pulling the tables I need, and I am able to detect every column of data and print it. For example, if there are a total of 18 columns and 3 rows of data, the code will output 54 times, with each piece of table data going from row 1, col 1 to row 3, col 18.
My code is as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd
url =
output =
#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True
#read the html
with open(url) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')
#table = soup.find_all('table')
all_tables = soup.find_all('table')
#print(len(table))
df = pd.DataFrame(columns=pks_col_names)
col_list = []
table_list = []
for tnum, tables in enumerate(all_tables):
    if iserror(all_tables[tnum].tbody): #Finds table with data
        table = tables.findAll('tr')
        #Loops through rows of each data table
        for rnum, row in enumerate(table):
            table_row = table[rnum].findAll('td')
            if len(table_row) == 17:
                #Loops through columns of each data table
                for col in table_row:
                    col_list.append(col.string)
            else:
                pass
    else:
        pass
Example of data output currently achieved
row 1 column 1 (first string in list)
row 1 column 2
row 1 column 3 ...
row 3 column 17
row 3 column 18 (last string in list)
The current code creates a single list with the data shown above, but I am unable to figure out a way to convert that list into a pandas DataFrame while tying each list entry to the appropriate row and column. Could anyone suggest how to do this, or how to otherwise rework my code to import this data into a DataFrame?
It's all messed up: your function iserror does in fact check whether there is no error (and I don't think it works at all as written). What you call tables are actually rows, and you don't need enumerate.
Since you haven't provided the data, I only ran rough tests, but this is a bit cleaner:
row_list = []
for table in all_tables:
    if is_ok(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols) == 17:
                #Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=pks_col_names)
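The snippet above calls a helper is_ok that the answer never spells out; a plausible definition, inverting the asker's iserror idea (this is my guess, not the answerer's code):

def is_ok(tbody):
    # a table counts as usable when it actually has a <tbody>
    return tbody is not None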
Thanks everyone for the help. I was able to achieve the desired DataFrame, and the final code looks like this:
url = 'insert url here'

#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

#All tables
all_tables = soup.findAll('table')

#Define column names
pks_col_names = ['TBD', 'Trade Date', 'Settle Date', 'Source', 'B/S', 'Asset Name', 'Security Description',
                 'Ticker', 'AccountID', 'Client Name', 'Shares', 'Price', 'Amount', 'Comms',
                 'Fees', 'Payout %', 'Payout']

row_list = []
for table in all_tables:
    if iserror(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols) == 17:
                #Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=pks_col_names)
df.to_csv(output, index=False, encoding='utf-8-sig')
I have parsed a table and would like to convert two of its variables to a pandas DataFrame to print to Excel.
FYI:
I did ask a similar question, however, it was not answered thoroughly. There was no suggestion on how to create a Pandas DataFrame. This was the whole point of my question.
Caution:
There is a small issue with the data that I parsed: it contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup
# make a webdriver object
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')
# driver.page_source
soup = BeautifulSoup(driver.page_source,'lxml')
#close driver
driver.close()
#find table
table = soup.find('table')
#find_all table rows
t_rows = table.find_all('tr')
#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    # print(row)
    # print(row[9])
    # print(row[1], row[9])
    team = row[1]
    sag = row[9]
    # print(team, sag)
    data = [(team, sag)]
    print(data)
Here is the final output that I would like printed to Excel using the pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
# Initial empty list
data = []

#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    team = row[1]
    sag = row[9]
    # Add tuple containing one row of data
    data.append((team, sag))

# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])

# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Added a line to remove the ("TEAM", "SA/G") rows from the df.
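Since the stated goal is printing to Excel, a short hedged follow-up (DataFrame.to_excel needs an engine such as openpyxl installed; the filename is a placeholder):

# write the cleaned DataFrame to an Excel file
df.to_excel('team_stats.xlsx', index=False)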
First, inside the for loop, append tuples to a list: instead of doing data = [(x, y)], declare the data variable before the loop as a list (data = list()) and append the tuples inside the loop (data.append((x, y))). Then do the following:
import pandas as pd
data=[("t1","sag1"),("t2","sag2"),("t3","sag3")]
df = pd.DataFrame(data,columns=['Team','SA/G'])
print(df)
The loop below is supposed to add multiple tables' rows (from an HTML page) into one DataFrame. The loop works fine and creates a DataFrame for each table one by one, but it also replaces the previous table's data in the DataFrame, which is what I want to fix. It should append each table's data into the same DataFrame, not replace the previous table's data. Please help me with this.
column_headers = ['state', 'sr_no', 'district_name', 'country']
headers = ['district_id']

district_link = [[li.get('href') for li in data_rows_link[i].findAll('a')]
                 for i in range(len(data_rows))]

district_data_02 = []  # create an empty list to hold all the data
for i in range(len(data_rows)):  # for each table row
    district_row = []  # create an empty list for each pick/player
    district_row.append("a")
    # for each table data element from each table row
    for li in data_rows[i].findAll('li'):
        # get the text content and append to the district_row
        district_row.append(li.getText())
    # then append each pick/player to the district_data matrix
    district_data_02.append(district_row)

district_data == district_data_02

#dataframe - districtlist
districtlist = pd.DataFrame(district_data, columns=column_headers)
districtid = pd.DataFrame(district_link, columns=headers)
#df_row_merged = pd.concat([df, df1])

#dataframe - districtid
final_districtlist = pd.concat([districtlist, districtid], axis=1)
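No answer is attached to this last question here, but the usual fix for the described symptom is to collect one DataFrame per table in a list and concatenate once at the end, rather than rebuilding the DataFrame inside the loop; a hedged sketch (all_tables and the per-row parsing are stand-ins for the asker's actual objects):

import pandas as pd

frames = []  # collect one DataFrame per table instead of overwriting
for table_rows in all_tables:  # all_tables: stand-in, one entry per HTML table
    rows = [[li.get_text(strip=True) for li in row.find_all('li')]
            for row in table_rows]
    frames.append(pd.DataFrame(rows, columns=column_headers))

# a single concat at the end appends every table's rows into one DataFrame
final_df = pd.concat(frames, ignore_index=True)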