How to store pandas dataframe information in a csv file - python

I am new to scraping and Python. I am trying to scrape multiple tables from this URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes. I did the scraping and now I am trying to save the dataframes to a csv file, but it just stores the first table from the page.
code:
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print ("Extracted {num} wikitables".format(num=len(wikitables)))
for line in range(7):
df= pd.DataFrame(wikitables[line].head())
df.to_csv('file1.csv')

You need to combine the list of dataframes into a single dataframe and then export it to a csv file.
wikitable = wikitables[0]
for i in range(1, len(wikitables)):
    wikitable = wikitable.append(wikitables[i], sort=True)
wikitable.to_csv('wikitable.csv')
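Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on current pandas the same combination is a single pd.concat call (a minimal sketch over the wikitables list from the question):
import pandas as pd

# concatenate the whole list of scraped tables in one call
wikitable = pd.concat(wikitables, sort=True)
wikitable.to_csv('wikitable.csv')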

You forgot
import pandas as pd
but you don't actually need it, because read_html already gives a list of dataframes, so you don't have to convert them to dataframes. You can write each one directly.
from pandas.io.html import read_html
url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
wikitables = read_html(url, index_col=0, attrs={"class":"wikitable plainrowheaders wikiepisodetable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))
for i, dataframe in enumerate(wikitables):
    dataframe.to_csv('file{}.csv'.format(i))
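If a single csv file is the goal, as in the original question, to_csv also accepts an open file handle, so the tables can be written back to back (a sketch; note that each table brings its own header row):
# append every scraped table to one csv file
with open('file1.csv', 'w', newline='') as f:
    for dataframe in wikitables:
        dataframe.to_csv(f)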

Related

How do I export a read_html df to Excel, when it relates to a table ID rather than data in the code?

I am experiencing this error with the code below:
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'to_excel'
I want to save the table I am scraping from Wikipedia to an Excel file, but I can't work out how to adjust the code to get the data list from the terminal to the Excel file using to_excel.
I can see it works for a similar problem when a dataset is set out as a DataFrame
(i.e. df = pd.DataFrame(data, columns=['Product', 'Price'])).
But I can't work out how to adjust my code for the df = pd.read_html(str(congress_table)) line, which I think is the issue (i.e. using read_html and sourcing the data from a table id).
How can I adjust the code to make it save an excel file to the path specified?
from bs4 import BeautifulSoup
import requests
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
congress_table = soup.find('table', attrs={'id': table_id})
df = pd.read_html(str(congress_table))
df.to_excel (r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index = False, header=True)
print(df)
I was expecting the data list to be saved to Excel at the folder path specified.
I tried following multiple guides, but they don't show the read_html item, only DataFrame solutions.
pandas.read_html() creates a list of dataframe objects, one per table, so you have to pick one by index, in your case [0]. You also do not need requests and BeautifulSoup separately; just go with pandas.read_html().
pd.read_html(wiki_url,attrs={'id': table_id})[0]
Example
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
df = pd.read_html(wiki_url, attrs={'id': table_id})[0]
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True)
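One assumption worth flagging (not part of the original answer): to_excel needs an Excel writer engine such as openpyxl installed, so if the write fails, install it and name it explicitly:
# pip install openpyxl
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True, engine='openpyxl')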

Web scraping a dataframe

I'm currently trying to scrape a table (stock exchange data for a company) from a website in order to build a new pandas dataframe from this data.
I've tried to scrape the rows of the table, store them in a csv file and then use pandas.read_csv().
I am running into trouble because the csv file does not come out as clean as I expected.
How can I get exactly the same dataframe in Python by web-scraping it?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request as ur
import csv
import pandas as pd
url_danone = "https://www.boursorama.com/cours/1rPBN/"
our_url = ur.urlopen(url_danone)
soup = BeautifulSoup(our_url, 'html.parser')
with open('danone.csv', 'w') as filee:
    for ligne in soup.find_all("table", {"class": "c-table c-table--generic"}):
        row = ligne.find("tr", {"class": "c-table__row"}).get_text()
        writer = csv.writer(filee)
        writer.writerow(row)
(Screenshots: the table on the website, and the resulting csv file.)
You can use pd.read_html to read the required table:
import pandas as pd
url = "https://www.boursorama.com/cours/1rPBN/"
df = pd.read_html(url)[1].rename(columns={"Unnamed: 0": ""}).set_index("")
print(df)
df.to_csv("data.csv")
Prints the dataframe and saves data.csv.
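Relying on [1] means the second table read_html finds on the page; if the site reorders its tables, matching on the class from the question's own find_all call may be more robust (a sketch, assuming the target is the first table with that exact class string):
# select the table by its class attribute instead of by position
df = pd.read_html(url, attrs={"class": "c-table c-table--generic"})[0]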
Please try this for loop instead:
rows = []
headers = []
# loop to get the values
for tr in soup.find_all("tr", {"class": "c-table__row"})[13:18]:
    row = [td.text.strip() for td in tr.select('td') if td.text.strip()]
    rows.append(row)
# get the header
for th in soup.find_all("th", {"class": "c-table__cell c-table__cell--head c-table__cell--dotted c-table__title / u-text-uppercase"}):
    head = th.text.strip()
    headers.append(head)
This gets the values and the header in the way you want. Note that, since the tables don't have ids or any other unique identifiers, you need to properly establish which rows you want, considering all the tables on the page (see [13:18] in the code above).
You can check your content by making a simple dataframe from the headers and rows, as below:
# write csv
df = pd.DataFrame(rows, columns=headers)
print(df.head())
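Since the goal was a csv file, the rebuilt dataframe can then be written out (the filename is just an example):
# save the reconstructed table without the numeric index column
df.to_csv('danone.csv', index=False)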
Hope this helps.

Having trouble putting data into a pandas dataframe

I am new to coding, so take it easy on me! I recently started a pet project which scrapes data from a table and will create a csv of the data for me. I believe I have successfully pulled the data, but trying to put it into a dataframe returns the error "Shape of passed values is (31719, 1), indices imply (31719, 23)". I have checked the lengths of my headers and my rows and those numbers are correct, but when I try to put them into a dataframe it appears that only one column is being pulled in. Again, I am very new to all of this, but I would appreciate any help! Code below:
from bs4 import BeautifulSoup
from pandas.core.frame import DataFrame
import requests
import pandas as pd
url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#pulling table from HTML
Table1 = soup.find('table', id = 'LeaderBoard1_dg1_ctl00')
#finding and filling table columns
headers = []
for i in Table1.find_all('th'):
    title = i.text
    headers.append(title)
#finding and filling table rows
rows = []
for j in Table1.find_all('td'):
    data = j.text
    rows.append(data)
#filling dataframe
df = pd.DataFrame(rows, columns=headers)
#show dataframe
print(df)
You are creating a dataframe with 692 rows and 23 columns as a new dataframe. However, looking at the rows array, you only have a one-dimensional array, so the shape of the passed values does not match the indices: you are passing 692 x 1 to a dataframe that expects 692 x 23, which won't work.
If you want to create a dataframe with the data as you have it, you should just use:
df = pd.DataFrame(rows, columns=headers[1:2])
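Alternatively, collecting the cells per <tr> instead of as one flat list of <td>s gives the dataframe its two-dimensional shape directly; a sketch reusing Table1 and pd from the question:
# build one list per table row so each row keeps its own cells
rows = []
for tr in Table1.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    if cells:  # header rows contain <th> rather than <td>, so they are skipped
        rows.append(cells)
df = pd.DataFrame(rows)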
Or you can achieve your goal directly by using pandas.read_html, which handles the HTML parsing for you:
pd.read_html(url, attrs={'id': 'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
attrs={'id': 'LeaderBoard1_dg1_ctl00'} selects the table by id
header=[1] adjusts the header, because there are multiple header rows
.iloc[:-1] removes the table footer with the pagination
Example
import pandas as pd
pd.read_html('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500',
             attrs={'id': 'LeaderBoard1_dg1_ctl00'},
             header=[1])[0]\
    .iloc[:-1]
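The question's end goal was a csv of the data, so the result can be written straight out (the output filename here is hypothetical):
import pandas as pd

url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
df = pd.read_html(url, attrs={'id': 'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
df.to_csv('leaderboard.csv', index=False)  # hypothetical output filename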

Converting columns in pandas dataframes to numerical values when exporting to Excel

I've created a pandas dataframe scraped from a website and exported it to Excel, but the number values appear in text format in Excel. I wanted a quick way of converting all the number values into actual numbers that I can then analyse in Excel automatically.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import load_workbook
import csv
import os
def url_scraper(url):
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    return soup
def first_inns_bowling_scorecard_scraper(url):
    soup = url_scraper(url)
    for divs in soup.find_all("div", {"id": "gp-inning-00"}):
        for bowling_div in soup.find_all("div", {"class": "scorecard-section bowling"}):
            table_headers = bowling_div.find_all("th")
            table_rows = bowling_div.find_all("tr")[1:]
            headers = []
            for th in table_headers:
                headers.append(th.text)
            data = []
            for tr in table_rows:
                td = tr.find_all('td')
                row = [tr.text for tr in td]
                data.append(row)
            df = pd.DataFrame(data, columns=headers)
            df.drop(df.columns[[1, 9]], axis=1, inplace=True)
            df.to_excel(r'C:\\Users\\nathang\\Downloads\\random.xlsx', index=None, header=True)
            os.chdir('C:\\Users\\nathang\\Downloads')
            os.system("start EXCEL.EXE random.xlsx")
            return df
url="https://www.espncricinfo.com/series/19781/scorecard/1216418/afghanistan-vs-ireland-3rd-t20i-ireland-tour-of-india-2019-20"
first_inns_bowling_scorecard_scraper(url)
I've tried multiple different variations of df.apply(pd.to_numeric) on individual columns, multiple columns, and the whole dataframe, but can't get anything to work. Ideally, I would like to just pass the whole dataframe in and have it ignore anything that raises an error.
This might solve your problem. int() converts a numeric string into an integer, e.g. int("5") gives 5, so each cell can be converted as it is read:
row = [int(tr.text) for tr in td]
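Since the question specifically mentions df.apply(pd.to_numeric), here is a sketch of the whole-dataframe version it was asking for, assuming df is the dataframe built above:
import pandas as pd

# errors='ignore' leaves a column unchanged when it cannot be parsed, which is
# the "skip on error" behaviour the question asks for
df = df.apply(pd.to_numeric, errors='ignore')

# errors='ignore' is deprecated in recent pandas; an explicit loop does the same
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # leave non-numeric columns as text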

Extract json data in web page using pd.read_json()?

Trying to extract the table from this page "https://www.hkex.com.hk/Market-Data/Statistics/Consolidated-Reports/Monthly-Bulletin?sc_lang=en#select1=0&select2=28". Using the inspect/network function of Chrome, the data request link is "https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485". This link looks like json when accessed directly. However, the code using this link does not work.
My code:
import pandas as pd
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485"
df = pd.read_json(url)
print(df.info(verbose=True))
print(df)
I also tried:
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?"
You can try downloading the json first and then converting it back to a DataFrame. pd.read_json expects a flat, table-like structure, while this file nests the actual table under tables/header/body, so it has to be unpacked by hand:
import pandas as pd
import urllib.request, json
url = 'https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
with urllib.request.urlopen(url) as r:
    data = json.loads(r.read().decode())
df = pd.DataFrame(data['tables'][0]['body'])
columns = [item['text'] for item in data['tables'][0]['header']]
row_count = max(df['row'])
new_df = pd.DataFrame(df.text.values.reshape((row_count, -1)), columns=columns)
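For what it's worth, the same download is a little shorter with requests, which decodes and parses the json body in one step (a sketch, assuming the nested tables/header/body layout shown above):
import pandas as pd
import requests

url = 'https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
data = requests.get(url).json()
df = pd.DataFrame(data['tables'][0]['body'])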
