Extracting Tables From Different Sites With BeautifulSoup IN A LOOP - python

I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop over several different URLs. If possible, I would like to extract these tables into different Excel documents, or into different sheets within one document.
I have been trying to run the code in a loop and append each df to a list:
from bs4 import BeautifulSoup
import requests
import pandas as pd

xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')
list1 = []

#this is what I can't figure out
for i in range(0,10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html,'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass

#Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer,'Sheet1')
writer.save()
I get a ValueError ('No tables found') for the first nine links, and only the last table shows up when I print list1. However, when I run the links one at a time, without the for loop, it works.
I can't append the value of df[i] because it says 'index out of range'.
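For what it's worth, a minimal sketch of one way to structure this, assuming the goal is a single workbook with one sheet per link (the 'Link' column and the 'Fees Earned' selector are taken from the code above, and the sheet names are just illustrative):

from bs4 import BeautifulSoup
import requests
import pandas as pd

xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')

tables = []
for i in range(0, 10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        soup = BeautifulSoup(html, 'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        # keep only the parsed DataFrame for this link
        tables.append(pd.read_html(str(table))[0])
    except ValueError:
        print('No table found at', url)

# write each collected table to its own sheet in one workbook
with pd.ExcelWriter('mytables.xlsx') as writer:
    for n, df in enumerate(tables):
        df.to_excel(writer, sheet_name='Sheet' + str(n + 1))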

Related

web scraping a dataframe

I'm currently trying to web scrape a dataframe (stock-exchange data for a company) from a website in order to build the same dataframe in Python.
I've tried scraping the rows of the table, storing them in a CSV file, and then using pandas.read_csv().
I'm running into trouble because the resulting CSV file is not as clean as I expected.
How can I get exactly the same dataframe in Python by web scraping it?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request as ur
import csv
import pandas as pd

url_danone = "https://www.boursorama.com/cours/1rPBN/"
our_url = ur.urlopen(url_danone)
soup = BeautifulSoup(our_url, 'html.parser')

with open('danone.csv', 'w') as filee:
    for ligne in soup.find_all("table", {"class": "c-table c-table--generic"}):
        row = ligne.find("tr", {"class": "c-table__row"}).get_text()
        writer = csv.writer(filee)
        writer.writerow(row)
The dataframe in the website
The csv file
You can use pd.read_html to read the required table:
import pandas as pd
url = "https://www.boursorama.com/cours/1rPBN/"
df = pd.read_html(url)[1].rename(columns={"Unnamed: 0": ""}).set_index("")
print(df)
df.to_csv("data.csv")
Prints and saves data.csv (screenshot from LibreOffice):
Please try this for loop instead:
rows = []
headers = []

# loop to get the values
for tr in soup.find_all("tr", {"class": "c-table__row"})[13:18]:
    row = [td.text.strip() for td in tr.select('td') if td.text.strip()]
    rows.append(row)

# get the header
for th in soup.find_all("th", {"class": "c-table__cell c-table__cell--head c-table__cell--dotted c-table__title / u-text-uppercase"}):
    head = th.text.strip()
    headers.append(head)
This will get your values and headers the way you want. Note that, since the tables don't have ids or any other unique identifiers, you need to establish properly which rows you want, considering all tables (see the [13:18] slice in the code above).
You can check your content by making a simple dataframe from the headers and rows, as below:
# write csv
df = pd.DataFrame(rows, columns=headers)
print(df.head())
Hope this helps.

How can I get the data of this table from HackerRank, filter it by country of origin and score, and then export it as a csv file?

I'm learning web scraping in Python and decided to test my skills on the HackerRank Leaderboard page, so I wrote the code below, expecting no errors, before adding the country restriction to the tester function and then exporting my csv file.
But then the Python console replied:
AttributeError: 'NoneType' object has no attribute 'find_all'
The error above corresponds to line 29 of my code (for i in table.find_all({'class':'ellipsis'}):), so I decided to come here to ask for assistance. I'm afraid there could be more syntax or logic errors, so it's better to clear up my doubts by getting feedback from experts.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint

pd.set_option('display.max_columns', None)

#Declaring a variable for looping over all the pages
pages = np.arange(1, 93, 1)
a = pd.DataFrame()

#loop cycle
for url in pages:
    #get html for each new page
    url ='https://www.hackerrank.com/leaderboard?page='+str(url)
    page = requests.get(url)
    sleep(randint(3,10))
    soup = BeautifulSoup(page.text, 'lxml')
    #get the table
    table = soup.find('header', {'class':'table-header flex'})
    headers = []
    #get the headers of the table and delete the "white space"
    for i in table.find_all({'class':'ellipsis'}):
        title = i.text.strip()
        headers.append(title)
    #set the headers to columns in a new dataframe
    df = pd.DataFrame(columns=headers)
    rows = soup.find('div', {'class':'table-body'})
    #get the rows of the table but omit the first row (which are headers)
    for row in rows.find_all('table-row-wrapper')[1:]:
        data = row.find_all('table-row-column ellipsis')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    #set the data of the Txn Count column to float
    Txn = df['SCORE'].values
    #combine all the data rows in one single dataframe
    a = a.append(pd.DataFrame(df))

def tester(mejora):
    mejora = mejora[(mejora['SCORE']>2250.0)]
    return mejora.to_csv('new_test_Score_Count.csv')

tester(a)
Do you guys have any ideas or suggestions that could fix the problem?
The error states that your table element is None. I'm guessing here, but you can't get that table from the page with bs4 because it is loaded afterwards with JavaScript. I would recommend using Selenium for this instead.
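For reference, a minimal Selenium sketch of that idea might look like this (assumptions: chromedriver is installed and on PATH, and the 'table-header flex' selector from the question is still what the rendered page uses):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.hackerrank.com/leaderboard?page=1')
# wait until the JavaScript-rendered leaderboard header is present
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'table-header')))
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table = soup.find('header', {'class': 'table-header flex'})
print(table)  # should no longer be None once the page has rendered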

Splitting up text into separate rows - BeautifulSoup

I'm trying to loop over a few hundred pages of a site and grab Buddhist quotes, then save them into a dataframe. I've mostly got the code working, but I'm struggling to parse some of the text appropriately. On each page I'm scraping there are 5 quotes and, from what I can tell in the HTML output, no obvious identifier for each. So I've attempted to loop over what I scrape from each page, but it either overwrites all previous quotes (i.e. quotes 1-4) or groups them all together into a single cell.
See set-up and code below:
# For data handling:
import pandas as pd

# Set Pandas output options
pd.set_option('display.max_colwidth', None)

# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur

# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})

# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    new_result = pd.DataFrame({
        "quote":[soup.find_all("div", class_="post-content")],
        "quote_date":[soup.find_all("div", class_="post-date")],
        "page_no": [str(i)]
    })
    emptydata = emptydata.append(new_result)
emptydata
As you can see from the image attached, this bundles all 5 quotes into a single cell and creates one row per page. Any thoughts on how I can split these up so I have one row per quote and date? I tried looping over soup.find_all("div", class_="post-content"), but I figure I must have been constructing the dataframe incorrectly, as that overwrote all but the last quote on each page.
what my dataframe currently looks like
Thanks in advance! Chris
How to fix?
You should add an additional for loop to reach your goal:
for post in soup.find_all("div", class_="quote-inner"):
    new_result = pd.DataFrame({
        "quote":[post.find("div", class_="post-content").get_text(strip=True)],
        "quote_date":[post.find_all("div", class_="post-date")[1].get_text()],
        "page_no": [str(i)]
    })
Example
# For data handling:
import pandas as pd

# Set Pandas output options
pd.set_option('display.max_colwidth', None)

# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur

# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})

# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    for post in soup.find_all("div", class_="quote-inner"):
        new_result = pd.DataFrame({
            "quote":[post.find("div", class_="post-content").get_text(strip=True)],
            "quote_date":[post.find_all("div", class_="post-date")[1].get_text()],
            "page_no": [str(i)]
        })
        emptydata = emptydata.append(new_result)
emptydata

How to Loop and Save Data from Each Iteration

I am trying to learn how to scrape data from a webpage in Python and am running into trouble with how to structure my nested loops. I received some assistance with how I was scraping in this question (How to pull links from within an 'a' tag). I am trying to have that code iterate through different weeks (and eventually years) of webpages. What I have currently is below, but it is not iterating through the two weeks I would like it to and saving the results.
import requests, re, json
from bs4 import BeautifulSoup
import pandas as pd

weeks=['1','2']
data = pd.DataFrame(columns=['Teams','Link'])
scripts_head = soup.find('head').find_all('script')
all_links = {}

for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/'+i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                        all_links[name] = gamecast
            #Save data to dataframe
            data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
        #Append new data to existing data
        data=data.append(data2,ignore_index = True)

#Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')
Edit: To add some clarification, it is creating duplicates of the data from one week and repeatedly appending them to the end. I also edited the code to include the proper libraries; it should be possible to copy, paste, and run it in Python.
The problem is in your loop logic:
if 'window.espn.scoreboardData' in script.text:
    ...
    data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
#Append new data to existing data
data=data.append(data2,ignore_index = True)
Your indentation on the last line is wrong. As given, you append data2 regardless of whether you have new scoreboard data. When you don't, you skip the if body and simply append the previous data2 value.
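Restating the snippet above with that fix applied (the rest of the loop unchanged), so that nothing is appended on iterations where no scoreboard data is found:

if 'window.espn.scoreboardData' in script.text:
    json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
    ...
    #Save data to dataframe
    data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
    #Append new data to existing data
    data=data.append(data2,ignore_index = True)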
So the workaround I came up with is below. I am still getting duplicate game IDs in my final dataset, but at least I am looping through the entire desired set and getting all of them. Then at the end I dedupe.
import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years=['2015','2016','2017','2018']
weeks=['1','2','3','4','5','6','7','8','9','10','11','12','13','14']
data = pd.DataFrame(columns=['Teams','Link'])
all_links = {}

for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/'+ year + '/seasontype/2/week/'+i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                            all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
                #Append new data to existing data
                data=data.append(data2,ignore_index = True)

#Save dataframe with all links to csv for future use
data_test=data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
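For completeness, a hedged variant of the same loop that rebuilds the links dict for every page, so nothing is re-appended across iterations and the final dedupe step becomes unnecessary (the output filename here is just illustrative):

import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

years = ['2015', '2016', '2017', '2018']
weeks = [str(w) for w in range(1, 15)]
frames = []

for year in years:
    for week in weeks:
        r = requests.get('https://www.espn.com/college-football/scoreboard/_/year/'
                         + year + '/seasontype/2/week/' + week)
        soup = BeautifulSoup(r.text, 'html.parser')
        page_links = {}  # rebuilt for every page, so nothing carries over
        for script in soup.find('head').find_all('script'):
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            page_links[event['name']] = link['href']
        # one DataFrame per page, appended exactly once
        frames.append(pd.DataFrame(list(page_links.items()), columns=['Teams', 'Link']))

data = pd.concat(frames, ignore_index=True)
data.to_csv(r'all_years_links.csv')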

web scraping table from multiple pages from a search and creating a pandas dataframe

I got this code working for the first page; I needed the user agent, as it didn't work otherwise.
The problem is that the search only returns the first page; the second page has "page=2" in the URL, and so on, so I need to scrape all (or as many as needed) pages from the search:
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

site= "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site,headers=hdr)
page = urlopen(req)

import pandas as pd
import numpy as np

soup = BeautifulSoup(page, 'lxml')
type(soup)

rows = soup.find_all('tr')
print(rows[:10])

for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

import re

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = (re.sub(clean, '',str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)

df = pd.DataFrame(list_rows)
df.head(10)

df1 = df[0].str.split(',', expand=True)
df1.head(10)
The output is a pandas DataFrame. I need to scrape all the pages to output one large dataframe.
Okay, so this problem ended up getting stuck in my head, so I worked it out.
import pandas as pd
import requests

hdr={'User-Agent':'Chrome/70.0.3538.110'}
table_dfs={}
for page_number in range(951):
    http= "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number+1)
    url= requests.get(http,headers=hdr)
    table_dfs[page_number]= pd.read_html(url.text)
It will return the first column (vessel) as a NaN value; that's the column for the image, so ignore it if you don't need it.
The next column will be called 'built'; it has the ship's name and the type of ship in it. You'll need to .split() to separate them, and then you can replace the (vessel) column with the ship's name.
If it works for you I'd love to boost my reputation with a nice green check mark.
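As for the .split() step described above, a rough sketch of the idea (the column position, the delimiter, and the new column names are all assumptions to adjust against the real table):

# take the first table pandas found for page 1
df = table_dfs[0][0]

# assumed: the second column holds the combined "name / type" text
combined = df.iloc[:, 1].astype(str)
parts = combined.str.split(n=1, expand=True)  # split on the first whitespace

df['Vessel'] = parts[0]  # ship's name, to use in place of the NaN image column
df['Type'] = parts[1]    # remainder of the combined text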
rows = soup.find_all('tr')
print(rows[:10])

for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
^ The code above is essentially the same thing as:
urls=['some list of urls you want to scrape']
table_dfs= [pd.read_html(url) for url in urls]
You can crawl through the URLs you're looking for and apply that, and then, if you want to do something with/to the tables, you can just go:
for table in table_dfs:
    table + 'the thing you want to do'
Note that the list comprehension above builds table_dfs as a plain list, which means you might not be able to tell which URL a table came from if the scrape is big enough. Pieca had a solution that can be used to iterate over the website's URLs and create a dictionary key for each page. Note that this solution may not apply to every website.
url_list = {page_number: "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number)
            for page_number in range(1, 953)}

table_dfs = {}
for page_number, url in url_list.items():
    # pd.read_html has no argument for HTTP headers, so fetch the page with
    # requests (using the hdr dict from above) and parse the response text
    resp = requests.get(url, headers=hdr)
    table_dfs[page_number] = pd.read_html(resp.text)
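Either way, the per-page results can then be combined into the single large dataframe the question asks for. A sketch for the dictionary version above, assuming the vessel listing is the first table read_html returns for each page (output filename is illustrative):

import pandas as pd

# stitch the first table from every page into one big dataframe
big_df = pd.concat([dfs[0] for dfs in table_dfs.values()], ignore_index=True)
big_df.to_csv('vessels_all_pages.csv', index=False)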
