web scraping a dataframe - python

i'm currently trying to web scraping a dataframe (about sctack exchange of a companie) in a website in order to make a new dataframe in python this data.
I've tried to scrap the row of the dataframe in order to store in a csv file and use the method pandas.read_csv().
I meet some trouble because the csv file is not as good as i thought.
How can i manage to get the exactly same dataframe in python with web-scraping it
Here's my code :
from bs4 import BeautifulSoup
import urllib.request as ur
import csv
import pandas as pd
url_danone = "https://www.boursorama.com/cours/1rPBN/"
our_url = ur.urlopen(url_danone)
soup = BeautifulSoup(our_url, 'html.parser')
with open('danone.csv', 'w') as filee:
for ligne in soup.find_all("table", {"class": "c-table c-table--generic"}):
row = ligne.find("tr", {"class": "c-table__row"}).get_text()
writer = csv.writer(filee)
writer.writerow(row)
The dataframe in the website
The csv file

You can use pd.read_html to read the required table:
import pandas as pd
url = "https://www.boursorama.com/cours/1rPBN/"
df = pd.read_html(url)[1].rename(columns={"Unnamed: 0": ""}).set_index("")
print(df)
df.to_csv("data.csv")
Prints and saves data.csv (screenshot from LibreOffice):

Please try this for loop instead:
rows = []
headers = []
# loop to get the values
for tr in soup.find_all("tr", {"class": "c-table__row"})[13:18]:
row = [td.text.strip() for td in tr.select('td') if td.text.strip()]
rows.append(row)
# get the header
for th in soup.find_all("th", {"class": "c-table__cell c-table__cell--head c-table__cell--dotted c-table__title / u-text-uppercase"}):
head = th.text.strip()
headers.append(head)
This would get your values and header in the way you want. Note that, since the tables don't have ids or any unique identifiers, you need to proper stabilish which rows you want considering all tables (see [13:18] in the code above).
You can check your content making a simple dataframe from the headers and rows as below:
# write csv
df = pd.DataFrame(rows, columns=headers)
print(df.head())
Hope this helps.

Related

How to retrieve and store 2nd and 3rd row elements in a dataframe

I am new to Pandas and Webscraping and BeautifulSoup in Python.
As I was learning to do some basic webscraping in Python by using requests and BeautifulSoup to scrape a webpage, I am confused with the task of assigning the 2nd and 3rd elements of an html table into a pandas dataframe.
Suppose I have this table:
Here is my code so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_data, 'html.parser')
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])
for row in soup.find_all('tbody')[3].find_all('tr'): #This line will make sure to get to the third table which is this "By market capitalization" table on the webpage and finding all the rows of this table
col = row.find_all('td') #This is to target individual column values in a particular row of the table
for j, cell in enumerate(col):
#Further code here
As it can be seen, I want to target all the 2nd and 3rd column values of a row and append to the
empty dataframe, data, so that data contains the Bank names and market cap values. How can I achieve that kind of functionality?
For tables I would suggest pandas:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(url)
df = tables[1]
When you prefer using beautifulsoup, you can try this to accomplish the same:
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser').find_all('table')
table = soup[1]
table_rows = table.find_all('tr')
table_header = [th.text.strip() for th in table_rows[0].find_all('th')]
table_data = []
for row in table_rows[1:]:
table_data.append([td.text.strip() for td in row.find_all('td')])
df = pd.DataFrame(table_data, columns=table_header)
When needed, you can set Rank as index with df.set_index('Rank', inplace=True. Image below is the unmodified dataframe.

Converting columns in panda dataframes to numerical values when exporting to excel

I've created a panda dataframe scraped from a website and have exported it into excel but the number values appear in text format in excel so wanted a quick way of converting all the number values into numbers that I can then analyse in excel automatically.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import load_workbook
import csv
import os
def url_scraper(url):
response=requests.get(url)
html=response.text
soup=BeautifulSoup(html,"html.parser")
return soup
def first_inns_bowling_scorecard_scraper(url):
soup=url_scraper(url)
for divs in soup.find_all("div",{"id":"gp-inning-00"}):
for bowling_div in soup.find_all("div",{"class":"scorecard-section bowling"}):
table_headers=bowling_div.find_all("th")
table_rows=bowling_div.find_all("tr")[1:]
headers=[]
for th in table_headers:
headers.append(th.text)
data = []
for tr in table_rows:
td = tr.find_all('td')
row = [tr.text for tr in td]
data.append(row)
df=pd.DataFrame(data, columns=headers)
df.drop(df.columns[[1,9]], axis = 1,inplace=True)
df.to_excel(r'C:\\Users\\nathang\\Downloads\\random.xlsx',index = None, header=True)
os.chdir('C:\\Users\\nathang\\Downloads')
os.system("start EXCEL.EXE random.xlsx")
return df
url="https://www.espncricinfo.com/series/19781/scorecard/1216418/afghanistan-vs-ireland-3rd-t20i-ireland-tour-of-india-2019-20"
first_inns_bowling_scorecard_scraper(url)
I've tried multiple different variations of the df.apply(pd.to_numeric) on individual columns, multiple columns, the whole dataset and so on but can't get anything to work for it. Ideally, I would like to just input the whole dataframe into it and if there is an error it ignores it.
This might solve your problem.
a = "5"
int(a) = 5
row = [int(tr.text) for tr in td]

How to store pandas dataframe information in a csv file

I am new to scraping and python. I am trying to scrape multiple tables from this URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes. I did the scraping and now I am trying to save the dataframe to a csv file. I tried but it just stores the the first table from the page.
code:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print ("Extracted {num} wikitables".format(num=len(wikitables)))
for line in range(7):
df= pd.DataFrame(wikitables[line].head())
df.to_csv('file1.csv')
You need to reshape the list of dataframes into a single dataframe and then you need to export it to csv file.
wikitable = wikitables[0]
for i in range(1,len(wikitables)):
wikitable = wikitable.append(wikitables[i],sort=True)
wikitable.to_csv('wikitable.csv')
You forgot
import pandas as pd
but you don't need it because read_html gives list of dataframes and you don't have to convert it o dataframes. You can write it directly.
from pandas.io.html import read_html
url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(url, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))
for i, dataframe in enumerate(wikitables):
dataframe.to_csv('file{}.csv'.format(i))

Extracting Tables From Different Sites With BeautifulSoup IN A LOOP

I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop with several different URL:s. If it is possible, I would like to extract these tables into different excel documents, or different sheets within a document.
I have been trying to put the code through a loop and appending the df
from bs4 import BeautifulSoup
import requests
import pandas as pd
xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')
#this is what I can't figure out
for i in range(0,10):
try:
url = link['Link'][i]
html = requests.get(url).content
df_list = pd.read_html(html)
soup = BeautifulSoup(html,'lxml')
table = soup.select_one('table:contains("Fees Earned")')
df = pd.read_html(str(table))
list1.append(df)
except ValueError:
print('Value')
pass
#Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer,'Sheet1')
writer.save()
I get a 'ValueError'(no tables found) for the first nine tables and only the last table is printed when I print mylist. However, when I print them without the for loop, one link at a time, it works.
I can't append the value of df[i] because it says 'index out of range'

web scraping table from multiple pages from a search and creating a pandas dataframe

I got this code working for the first page and needed the user agent as it didn't work otherwise.
The problem I get is the search brings the first page, but on the second you have "page=2" and continuing so need to scrape all or as much as needed from the search
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
site= "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site,headers=hdr)
page = urlopen(req)
import pandas as pd
import numpy as np
soup = BeautifulSoup(page, 'lxml')
type(soup)
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
row_td = row.find_all('td')
print(row_td)
type(row_td)
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
import re
list_rows = []
for row in rows:
cells = row.find_all('td')
str_cells = str(cells)
clean = re.compile('<.*?>')
clean2 = (re.sub(clean, '',str_cells))
list_rows.append(clean2)
print(clean2)
type(clean2)
df = pd.DataFrame(list_rows)
df.head(10)
df1 = df[0].str.split(',', expand=True)
df1.head(10)
Output is a Pandas DataFrame
need to scrape all pages to output a large dataframe
Okay, so this problem ended up getting stuck in my head, so I worked it out.
import pandas as pd
import requests
hdr={'User-Agent':'Chrome/70.0.3538.110'}
table_dfs={}
for page_number in range(951):
http= "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number+1)
url= requests.get(http,headers=hdr)
table_dfs[page_number]= pd.read_html(url.text)
it will return the first column (vessel) as a nan value. That's the column for the image, ignore it if you don't need it.
the next column will be called 'built' it has the ships name, and type of ship in it. You'll need to .split() to separate them, and then you can replace column(vessel) with the ships name.
If it works for you I'd love to boost my reputation with a nice green check mark.
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
row_td = row.find_all('td')
print(row_td)
type(row_td)
^this code above is the same thing as
urls=['some list of urls you want to scrape']
table_dfs= [pd.read_html(url) for url in urls]
you can crawl through the urls you're looking for and apply that, and then if you want to do something with/to the tables you can just go:
for table in table_dfs:
table + 'the thing you want to do'
Note that the in-line for loop of table_dfs is in a list. That means that you might not be able to discern which url it came from if the scrape is big enough. Pieca seemed to have a solution that could be used to iterate the websites urls, and create a dictionary key. Note that this solution may not apply to every website.
url_list = {page_number:"https://www.vesselfinder.com/vessels?page=
{}&minDW=20000&maxDW=300000&type=4".format(page_number) for page_number
in list(range(1, 953))}
table_dfs={}
for url in range(1,len(url_list)):
table_dfs[url]= pd.read_html(url_list[url],header=hdr)

Categories