Scrape table header from investing.com - python

I'm trying to scrape data from investing.com. My code works except for the table header.
My "columns" variable holds the names in the form data-col-name="abc", but I don't know how to extract them as column_names.
table_rows = soup.find("tbody").find_all("tr")
table = []
for i in table_rows:
    td = i.find_all("td")
    row = [cell.string for cell in td]
    table.append(row)
columns = soup.find("thead").find_all("th")
column_names =
df_temp = pd.DataFrame(data=table, columns=column_names)
df_dji = df_dji.append(df_temp)

You have to use .text instead of .string:
columns = soup.find("thead").find_all("th")
#print(columns)
column_names = [cell.text for cell in columns]
print(column_names)
Alternatively, use .get_text() or even .get_text(strip=True):
column_names = [cell.get_text() for cell in columns]
print(column_names)
The official documentation shows .string (.text appears to be an unofficial shortcut in newer versions, though it may have been official in older ones), but .string doesn't work here, probably because there is another object, a <span>, inside the <th>. When a tag has more than one child, .string returns None. get_text(), by contrast, collects the strings from all elements inside the th and joins them into one string.
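A minimal sketch of the difference, assuming a made-up header row (the HTML below is hypothetical and is not investing.com's real markup). The first th holds a text node plus a <span>, so .string cannot pick a single child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <th> has two children (text + <span>).
html = """
<table>
  <thead>
    <tr>
      <th data-col-name="name">Name<span class="sort"></span></th>
      <th data-col-name="last">Last</th>
    </tr>
  </thead>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# .string is None when a tag has more than one child.
via_string = [cell.string for cell in columns]

# get_text() joins every string found inside the tag.
via_get_text = [cell.get_text(strip=True) for cell in columns]

print(via_string)
print(via_get_text)
```

Using strip=True also removes surrounding whitespace, which tends to be useful for column names.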
EDIT:
If you want to get the value from data-col-name=, use any of:
cell['data-col-name']
cell.get('data-col-name')
cell.attrs['data-col-name']
cell.attrs.get('data-col-name')
(the same works with cell['id'] or cell['class'])
column_names = [cell['data-col-name'] for cell in columns]
column_names = [cell.get('data-col-name') for cell in columns]
# etc.
attrs is a normal dictionary, so you can use attrs.get(key, default_value), attrs.keys(), attrs.items(), attrs.values(), or iterate over it with a for loop like any other dictionary.
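The attribute lookups above can be sketched on a small, hypothetical header row (the markup and the "unknown" fallback are assumptions for illustration). The last th deliberately lacks the attribute, to show why .get() with a default may be safer than indexing:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the third <th> has no data-col-name attribute.
html = """
<table>
  <thead>
    <tr>
      <th data-col-name="name">Name</th>
      <th data-col-name="last">Last</th>
      <th>Chg. %</th>
    </tr>
  </thead>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
columns = soup.find("thead").find_all("th")

# Indexing raises KeyError on a missing attribute, so filter first.
strict_names = [
    cell["data-col-name"] for cell in columns if cell.has_attr("data-col-name")
]

# .get() returns the default instead of raising.
safe_names = [cell.get("data-col-name", "unknown") for cell in columns]

print(strict_names)
print(safe_names)
```

Either list could then be passed as columns= to pd.DataFrame, provided its length matches the number of cells per row.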

Related

Separate merged text and move one of the separated text in a new column in python

I'm extracting table data from this URL (https://www.tradingview.com/markets/stocks-india/market-movers-large-cap/). After some adjustments I finally got it to extract the data successfully, except for one problem: from the second row (just below the header row) to the last row, the text in the first column (ticker symbol and company name) comes out merged.
It turns out the two pieces have their own tags: one tag contains the ticker name and another contains the company name, and both sit inside just one of the several elements that make up each row. How do I split those two texts and move one of them into a new column named "company name"? That column doesn't exist in the table at the link above, so I would also like to know how to create it and add the data to it.
Here's the code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Create an URL object
url = 'https://www.tradingview.com/markets/stocks-india/market-movers-large-cap/'
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
# Obtain information from tag <table>
table1 = soup.find('table', class_='table-DR3mi0GH')
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
# Create a dataframe
mydata = pd.DataFrame(columns=headers)
# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
The words in the ticker column appear merged. By the way, I'm using the Python console in PyCharm.

Get tables from Wiki that match specific text

I'm quite new to Python and BeautifulSoup, and I've been trying to work this out for several hours...
Firstly, I want to extract all table data with "general election" in the title from the link below:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with the names of each table (e.g. "1961 general election", "1965 general election"), but I'm hoping to get away with just searching for "general election" in each table to confirm whether it's what I need.
I then want to get all the names that are in bold (which indicates they won), and finally I want another list of "Count 1" (or sometimes "1st Pref") values in the original order, which I want to compare to the "bold" list. I haven't even looked at this piece yet, as I haven't gotten past the first hurdle.
url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
my_tables = soup.find_all("table", {"class":"wikitable"})
for table in my_tables:
    rows = table.find_all('tr', text="general election")
    print(rows)
Any help on this would be greatly appreciated...
This page requires some gymnastics, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text,'lxml')
#first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    #next, filter out any table that doesn't involve general elections
    if "general election" in ttr[0].text:
        #clean up the rows
        s_ttr = ttr[1].text.replace('\n','xxx').strip()
        #find and clean up column headings
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip())>0 ]
        rows = [] #initialize a list to house the table rows
        for c in ttr[2:]:
            #from here, start processing each row and loading it into the list
            row = [a.text.strip() if len(a.text.strip())>0 else 'NA' for a in c.select('td') ]
            #drop an empty leading cell; "row and" guards rows that have no <td> at all
            if row and row[0]=="NA":
                row=row[1:]
            if len(row)>0:
                rows.append(row)
        #load the whole thing into a dataframe
        df = pd.DataFrame(rows,columns=columns)
        print(df)
The output should be all the general election tables on the page.

How to Create Pandas DataFrame out of Parsed Code using bs4/selenium on Python?

I have parsed a table and would like to convert two of its variables to a Pandas DataFrame to export to Excel.
FYI:
I did ask a similar question, however, it was not answered thoroughly. There was no suggestion on how to create a Pandas DataFrame. This was the whole point of my question.
Caution:
There is a small issue with the data that I parsed. The data contains "TEAM" and "SA/G" multiple times in the output.
The 1st variable that I would like in the DataFrame is 'TEAM'.
The 2nd variable that I would like in the DataFrame is 'SA/G'.
Here is my code so far:
# imports
from selenium import webdriver
from bs4 import BeautifulSoup
# make a webdriver object (raw string so the backslashes are not treated as escapes)
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
# open some page using get method - url -- > parameters
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')
# driver.page_source
soup = BeautifulSoup(driver.page_source,'lxml')
#close driver
driver.close()
#find table
table = soup.find('table')
#find_all table rows
t_rows = table.find_all('tr')
#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    # print(row)
    # print(row[9])
    # print(row[1], row[9])
    team = row[1]
    sag = row[9]
    # print(team, sag)
    data = [(team, sag)]
    print(data)
Here is the final output that I would like printed to excel using the Pandas DataFrame option:
Team SA/G
Nashville 30.1
Colorado 33.6
Washington 31.0
... ...
Thanks in advance for any help that you may offer. I am still learning and appreciate any feedback that I can get.
Looks like you want to create a DataFrame from a list of tuples, which has been answered here.
I would change your code like this:
import pandas as pd
# Initial empty list
data = []
#loop through tr to find_all td
for tr in t_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    team = row[1]
    sag = row[9]
    # Add tuple containing one row of data
    data.append((team, sag))
# Create df from list of tuples
df = pd.DataFrame(data, columns=['Team', 'SA/G'])
# Remove lines where Team value is "TEAM"
df = df[df["Team"] != "TEAM"]
EDIT: Add line to remove ("TEAM", "SA/G") rows in df
First, inside the for loop, append the tuples to a list. Instead of doing data = [(x, y)], declare data before the loop as a list (data = list()) and append each tuple inside the loop with data.append((x, y)). Then do the following:
import pandas as pd
data=[("t1","sag1"),("t2","sag2"),("t3","sag3")]
df = pd.DataFrame(data,columns=['Team','SA/G'])
print(df)

Python : looping through each row of html table based on headers

I'm new to Python (3.4) and have parsed an HTML table into headings[], rows[] and cells[].
I wish to eventually store each of these in a MySQL table, with the field names being the items in headings[].
There are 4 headings("data0","data1","data2","data3")
There are 6 rows
The code to get there is very rudimentary using BeautifulSoup:
soup = BeautifulSoup(r.text)
table = soup.find("table")
cells = []
rows = table.findAll('tr')
headings = [th.get_text().strip() for th in table.findAll("th")]
for row in rows:
    for cell in row.findAll('td'):
        cells.append(cell.get_text().strip())
I'm used to doing CASE statements or, heaven forbid, a number of IF statements. I would normally place them under the for cell in row.findAll('td') loop. But I would be working with a counter and doing something like:
for row = 0 to len(rows)
    for cell = 0 to len(headings)
        select case cell
            case = 0
                (save cell contents to field called headings[0])
            case = 1
                (save cell contents to field called headings[1])
            ...
I'm not too worried about the saving part (yet), but I can't wrap my head around not being able to use counters.
I realize this is a very basic question, but I sure would appreciate any insight (and so would my brain).
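One possible way to approach the counter problem described above, as a sketch rather than a definitive answer: zip() pairs each heading with the cell in the same position, and enumerate() supplies an index when one is still needed. The sample values below are hypothetical stand-ins for the parsed table, with cells grouped per row instead of collected into one flat list:

```python
# Hypothetical data standing in for the parsed headings and rows.
headings = ["data0", "data1", "data2", "data3"]
rows = [
    ["a0", "a1", "a2", "a3"],
    ["b0", "b1", "b2", "b3"],
]

records = []
for cells in rows:
    # zip pairs headings[0] with cells[0], headings[1] with cells[1], ...
    record = dict(zip(headings, cells))
    records.append(record)

# enumerate hands back a counter without managing it manually.
for row_index, record in enumerate(records):
    for col_index, (field, value) in enumerate(record.items()):
        print(row_index, col_index, field, value)
```

Each dictionary maps a field name to its cell value, which is a convenient shape for building an INSERT statement later.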

How can I get data from a specific cell in an HTML table in python?

This link contains the table I'm trying to parse.
I'm trying to use BeautifulSoup in Python. I'm very new to BeautifulSoup and HTML. This is my attempt to solve my problem.
soup = BeautifulSoup(open('BBS_student_grads.php'))
data = []
table = soup.find('table')
rows = table.find_all('tr') #array of rows in table
for x,row in enumerate(rows[1:]):# skips first row
    cols = row.find_all('td') # finds all cols in rows
    data.append([]) # one new list per table row
    for y,col in enumerate(cols): # iterates through col
        data[x].append(col) # puts table into a 2d array called data
print(data[0][0]) #prints top left corner
I'm trying to extract all the names in the table, then update the names in the list, and then update the table. I'm also using a local copy of this HTML as a temporary fix until I learn how to do more web programming.
Any help is much appreciated.
I think you need just the td elements in the tr element with class="searchbox_black".
You can use CSS Selectors to get to the desired td elements:
for cell in soup.select('tr.searchbox_black td'):
    print(cell.text)
It prints:
BB Salsa
Adams State University Alamosa, CO
Sensei: Oneyda Maestas
Raymond Breitstein
...
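A self-contained sketch of the same selector, assuming a tiny hand-written table (the markup and the "header" and "other" class names are hypothetical; only the searchbox_black class comes from the answer above):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only rows with class "searchbox_black" are wanted.
html = """
<table>
  <tr class="header"><td>Name</td></tr>
  <tr class="searchbox_black"><td> BB Salsa </td><td>Alamosa, CO</td></tr>
  <tr class="other"><td>skip me</td></tr>
  <tr class="searchbox_black"><td>Raymond Breitstein</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "tr.searchbox_black td" matches every <td> inside a <tr> with that class.
names = [
    cell.get_text(strip=True) for cell in soup.select("tr.searchbox_black td")
]
print(names)
```

Collecting the text into a list, rather than printing inside the loop, makes it easier to update the names afterwards.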
