How to convert data into csv file - python

I am unable to convert the data retrieved with bs4 into a meaningful CSV file. The file only contains the last set of data out of everything that is actually retrieved.
#Beautiful Soup or bs4 is a package I will be using to parse the HTML data which I will be retrieving from a website.
#parsing is the conversion of data from one format into another, structured form which the program can work with
#(converting data from one format to another) with bs4
from bs4 import BeautifulSoup
#requests is an HTTP library which allows me to send requests to websites to retrieve data using Python. This is helpful as
#the website is written in a different language, so it allows me to retrieve what I want and read it as well.
import requests
#import writer
url= "https://myanimelist.net/anime/season"
#requesting to get data using 'requests' and gain access as well.
#have to check the response before moving forward to ensure there is no problem retrieving data.
page= requests.get(url)
#print(page)
#<Response [200]> response was "200" meaning "Successful responses"
soup = BeautifulSoup(page.content, 'html.parser')
#here I parse the HTML I retrieved
#for this to identify the HTML code and determine what we will be producing (retrieving data) for each item on the page, we had to
#find the parent element which contains all the info we need to make our data categories.
lists = soup.select("[data-genre]")
#we add _ after class to make class_ because without the underscore the program identifies it as a Python class
#when really it is a CSS class
all_data = []
#must create a loop to find the titles separately as there are a lot that will come up
for list in lists:
    #identify and find the classes which include the title of the shows, show ratings, members watching, and release dates
    #added .text.replace in order to get rid of the "\n" spacing which was in the HTML
    title = list.find("a", class_="link-title").text.replace("\n", "")
    rating = list.find("div", class_="score").text.replace("\n", "")
    members = list.find("div", class_="scormem-item member").text.replace("\n", "")
    release_date = list.find("span", class_="item").text.replace("\n", "")
    all_data.append(
        [title.strip(), rating.strip(), members.strip(), release_date.strip()]
    )
print(*all_data, sep="\n")
#testing for errors and making sure locations are correct to withdraw/request the data
#this allows us to create and close a csv file, using 'w' to allow writing
from csv import writer
#organizing the chart: the header row and the final row of data
header=['Title', 'Show Rating', 'Members', 'Release Date']
info= [title.strip(), rating.strip(), members.strip(), release_date.strip()]
with open('shows.csv', 'w', encoding='utf8', newline='') as f:
    #will write onto our file 'f'
    writing = writer(f)
    #use our writer to write a row in the file
    writing.writerow(header)
    writing.writerow(info)
I tried changing the definition of list, but to no avail. This is currently what I get, even though it should be much longer.

Instead of writing just the last line with
writing.writerow(info)
you need to write all of the lines:
writing.writerows(all_data)
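For reference, a minimal sketch of the corrected writing block (using the same header and the all_data list built in the loop above) would be:
from csv import writer

header = ['Title', 'Show Rating', 'Members', 'Release Date']

with open('shows.csv', 'w', encoding='utf8', newline='') as f:
    writing = writer(f)
    writing.writerow(header)     # one header row
    writing.writerows(all_data)  # one row per show collected in the loop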

Related

Function to web scrape tables from several pages

I am learning Python and I am trying to create a function to web scrape tables of vaccination rates from several different web pages - a github repository for Our World in Data https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data and https://ourworldindata.org/about. The code works perfectly when web scraping a single table and saving it into a data frame...
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)
response
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]
But I have not had much luck when trying to create a function to scrape several pages. I have been following the tutorial on this website 3 in the section 'Scrape multiple pages with one script' and StackOverflow questions like 4 and 5 amongst other pages. I have tried creating a global variable first but I end up with errors like "Recursion Error: maximum recursion depth exceeded while calling a Python object". This is the best code I have managed as it doesn't generate an error but I've not managed to save the output to a global variable. I really appreciate your help.
import pandas as pd
from bs4 import BeautifulSoup
import requests
link_list = ['/Bangladesh.csv',
             '/Nepal.csv',
             '/Mongolia.csv']
def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    result = {}
    df = pd.read_html(str(vaccination_rates))
    vaccination_rates = df[0]
    df = pd.DataFrame(vaccination_rates)
    print(df)
    df.to_csv("testdata.csv", index=False)
for link in link_list:
    get_info(link)
Edit: I can view the data from the final webpage iterated, since it saves to a csv file, but not the data from the preceding links.
new = pd.read_csv('testdata6.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)
new
This is because in every iteration your 'testdata.csv' is overwritten with a new one.
So you can do:
df.to_csv(page_url[1:], index=False)
I'm guessing you're overwriting your 'testdata.csv' each time, hence why you can only see the final page. I would either use the enumerate function to add an identifier for a separate csv each time you scrape a page, e.g.:
for key, link in enumerate(link_list):
    get_info(link, key)
...
df.to_csv(f"testdata{key}.csv", index=False)
Or, open the csv as part of your get_info function; steps for that are available in "append new row to old csv file python".
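If the goal is one combined table rather than a file per country, a minimal sketch (assuming get_info is changed to return the DataFrame instead of writing it) could be:
import pandas as pd

# assumes get_info(page_url) is modified to end with `return df`
# instead of calling df.to_csv(...) itself
frames = [get_info(link) for link in link_list]

# stack the per-country tables and save once
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('all_countries.csv', index=False)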

Beautiful Soup web scraping complex html for data

Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the overwatch league website for stats etc, save them in a db and then pull from that db with a discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the html for the standings page.
As you can see, it's quite convoluted and hard to navigate, with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, then access the parent line and iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps me; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')

# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }

# print(page.find_all('span', re.compile("rank")))

# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts, and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could comment with some advice on how else to approach this, I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get it using BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json

url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.text)

tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]

result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]

print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: the keys are the titles of the 3 tabs and the values are the table data.
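If you then want each tab's table as a CSV, a short sketch (assuming each entry in a tab's "teams" list is a dict, which is worth verifying by printing one) might be:
import pandas as pd

# result is the list built above: one {tab_title: teams} dict per tab
for tab in result:
    for title, teams in tab.items():
        # json_normalize flattens any nested keys into dotted column names;
        # note this assumes the tab titles are safe to use as file names
        pd.json_normalize(teams).to_csv(f"{title}.csv", index=False)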

How should I scrape the text of website which is represented by 1 of the 'p' tags?

I'm a newbie at Python, and am practicing web scraping by extracting data from a news website.
I currently face 2 problems:
1. How do I scrape text that sits in one of the many 'p' tags on the web page? E.g. the first one I want is just before the author's name.
2. The CSV file I exported only contains the headers, but no text. Why? How do I fix this?
Here's the code, many thanks for your help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pandas import DataFrame
import csv
import re
f = open('nprtest1.csv', 'w', encoding='utf8', newline="")
writer = csv.writer(f, delimiter=',')
writer.writerow(['headline', 'date', 'author', 'body'])

#set the page you want to visit
url = "https://www.npr.org/2019/12/29/792241464/civil-rights-leader-rep-john-lewis-to-start-treatment-for-pancreatic-cancer"

#request page using the requests library
page = requests.get(url)

#create soup - parse HTML of webpage
soup = BeautifulSoup(page.content, 'html.parser')

headline = soup.find("h1").text
date = soup.find("time").text
body = soup.find_all('p')

### regex to remove tags and other irrelevant bits
date_final = re.sub("\n", "", date)

webdata = [headline, date_final, body]
writer.writerow(webdata)

df = pd.read_csv('webscraping_test1.csv')
For getting the specific text that you want, you can find the element by class. You can find more about that in this answer: How to find elements by class
For the CSV problem, since you are already using pandas, I think you'll be better off using pandas' to_csv function.
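A minimal sketch combining both suggestions; the class name below is hypothetical, so inspect the byline element in your browser and substitute the real one:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.npr.org/2019/12/29/792241464/civil-rights-leader-rep-john-lewis-to-start-treatment-for-pancreatic-cancer"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

headline = soup.find("h1").text.strip()
date = soup.find("time").text.strip()
# hypothetical class name -- replace it with the one you see in the page source
author_tag = soup.find("p", class_="byline__name")
author = author_tag.text.strip() if author_tag else ""
# join the paragraph texts instead of storing a list of Tag objects
body = " ".join(p.text.strip() for p in soup.find_all("p"))

df = pd.DataFrame([[headline, date, author, body]],
                  columns=['headline', 'date', 'author', 'body'])
df.to_csv('nprtest1.csv', index=False)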

Scraping HTML tables to CSV's using BS4 for use with Pandas

I have begun a pet-project creating what is essentially an indexed compilation of a plethora of NFL statistics with a nice simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables which can be exported to CSV format on the site and manually copied/pasted. I started doing this, and then using the Pandas library, began reading the CSVs into DataFrames to make use of the data.
This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling, specifically, to isolate individual tables, but also with having the CSV that is produced render in a readable/usable format.
Here is what the scraper looks like right now:
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])

table_Scrape()
This does properly send the request to the URL, but it doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly/not useful format, like so:
b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'
b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'
I know my issue with fetching the correct table lies within the line:
table = soup.select_one('table.stats_table')
I am what I would still consider a novice in Python, so if someone can help me be able to query and parse a specific table with BS4 into CSV format I would be beyond appreciative!
Thanks in advance!
The pandas solution didn't work for me due to the ajax load, but you can see in the console the URL each table is loading from, and request to it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving
You can then get the table directly using its id rushing_and_receiving.
This seems to work.
from bs4 import BeautifulSoup
import requests
import csv

def table_Scrape():
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    headers = [th.text for th in table.findAll("tr")[1]]
    body = table.find('tbody')
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.findAll("tr"):
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.findAll("td")])

table_Scrape()
I would bypass Beautiful Soup altogether since pandas works well for this site (at least for the first 4 tables I glanced over).
Documentation here.
import pandas as pd
url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)
# data is now a list of dataframes (spreadsheets) one dataframe for each table in the page
data[0].to_csv('somefile.csv')
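You can often still isolate one table with read_html alone: on this site the later tables appear to be wrapped in HTML comments (which is why read_html misses them), so a hedged sketch that strips the comment markers first and selects the table by the id used in the BeautifulSoup answer above would be:
import requests
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
html = requests.get(url).text

# the extra tables are served inside HTML comments; removing the
# comment markers makes them visible to read_html
html = html.replace('<!--', '').replace('-->', '')

# attrs selects the table by its HTML attributes, here its id
tables = pd.read_html(html, attrs={'id': 'rushing_and_receiving'})
tables[0].to_csv('rushing_and_receiving.csv', index=False)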
I wish I could credit both of these answers as correct, as they are both useful, but alas, the second answer using BeautifulSoup is the better answer since it allows for the isolation of specific tables, whereas the nature of the way the site is structured limits the effectiveness of the 'read_html' method in Pandas.
Thanks to everyone who responded!

Creating a csv file from an html that does not have a table element to use with BeautifulSoup

Thank you in advance for any help. I have a current CSV of historical data relating to this CFTC URL: https://www.cftc.gov/dea/options/other_lof.htm
I am looking to create a script to pull the data from this site once a week and update my historical data CSV automatically. I am currently stuck when trying to import only the "Random Length Lumber" data into a new CSV. The HTML code looks like this:
<pre> <!--ih:includeHTML file="other_lof.txt"-->PALLADIUM - NEW YORK MERCANTILE EXCHANGE...
<!--/ih:includeHTML-->
</pre>
and it then continues listing all the data for all of the commodities.
My python code starts like this:
from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.cftc.gov/dea/options/other_lof.htm'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
# table = soup.find('')
From here I would like to access only the Lumber data and export it to Excel; however, until I can select the data I want, I do not want to write all of the data over to Excel. Any help or guidance would be greatly appreciated. Thank you.
The URL that you provided is not in CSV format; it's a plain ASCII formatted table report.
cftc.gov provides a combined "Disaggregated Futures and Options Commitments" report in plain CSV format for all commodities here:
https://www.cftc.gov/dea/newcot/c_disagg.txt
and here you can find field names for this kind of report:
https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalViewable/CFTC_023168
Here is a sample Python code to parse those details:
import requests, csv, lxml.etree, io
def get_dissag():
output = []
feed_url = 'https://www.cftc.gov/dea/newcot/c_disagg.txt'
fields_url = 'https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalViewable/CFTC_023168'
fields_response = requests.get(fields_url)
doc = lxml.etree.HTML(fields_response.content.decode())
header = [field.split(' ')[1] for field in doc.xpath("//td/p/text()")]
response = requests.get(feed_url)
f = io.StringIO(response.content.decode())
csv_reader = csv.reader(f)
for row in csv_reader:
row_dict = {}
for index, value in enumerate(row):
row_dict[header[index]] = value.strip()
output.append(row_dict)
return output
print(get_dissag())
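To pull only the "Random Length Lumber" rows into their own CSV, a sketch filtering the output of get_dissag() could look like the following; the market-name key is an assumption, so print one row first to confirm the actual field name:
import csv

# 'Market_and_Exchange_Names' is an assumed key -- check one row of
# get_dissag() for the real market-name field
rows = [r for r in get_dissag()
        if 'LUMBER' in r.get('Market_and_Exchange_Names', '').upper()]

if rows:
    with open('lumber.csv', 'w', encoding='utf8', newline='') as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys())
        w.writeheader()
        w.writerows(rows)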
After fetching the HTML page source, try to extract all the list-tag items from it, since in general a table of contents is built using the 'li' tag in HTML. Study the source code and find the class of the list tag, which is used to give the element a unique identity:
tr = soup.findChildren('li', class_="toclevel-1")
for item in tr:
    print(item.text)
