Error while writing a file - python

I am trying to write a file but i am getting the following error: TypeError: a bytes-like object is required, not 'str'
import requests, pandas
from bs4 import BeautifulSoup
r = requests.get("https://www.basketball-reference.com/players/a/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
full_record_heading = soup.findAll("tr")
full_record = soup.findAll("tr")
playerdata = ""
playerdata_saved = ""
for record in full_record:
playerdata = ""
for player in record.findAll("td"):
playerdata = playerdata +","+player.text
playerdata_saved = playerdata_saved + playerdata[1:]+("\n")
# print(playerdata_saved)
header="From,To,Pos,Ht,Wt,Birth Date,College"
file=open("Basketball.csv","r+b")
file.write(header)
Can anyone tell me the reason for the error? How can we know the correct syntax of any command and documentation available? I am new to python

When you open a file in python, you must specify its "file mode" - read-only, write-only, read AND write, and if the file is binary. So, in this line:
open("Basketball.csv","r+b")
You opened your file as READ-ONLY, and set the file to be read as BINARY.
You should have opened the file as:
open("Basketball.csv","w")
As write and as STRING
Nevertheless, you are manually writting a CSV file - you do not have to do that in Pyhton! Look at this example:
import requests
import pandas # Always import in different lines
from bs4 import BeautifulSoup
r = requests.get("https://www.basketball-reference.com/players/a/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
full_record_heading = soup.findAll("tr")
full_record = soup.findAll("tr")
# Initialize your data buffer
my_data = []
# For each observation in your data source
for record in full_record:
# We extract a row of data
observation = record.findAll("td")
# Format the row as a dictionary - a "python hashmap"
dict_observation = {
"From": observation[0],
"To": observation[1],
"Pos": observation[2],
"Ht": observation[3],
"Wt": observation[4],
"Birth Date": observation[5],
"College": observation[6]
}
# Add the row to our DataFrame buffer
my_data.append(dict_observation)
# Now our DataFrame buffer contains all our data.
# We can format it as a Pandas DataFrame
dataframe = pandas.DataFrame().from_dict(my_data)
# Pandas DataFrames can be turned into CSVs seamlessly. Like:
dataframe.to_csv("Basketball.csv", index=False)
# Or even MS Excel:
dataframe.to_excel("Basketball.xlsx")
Use python data structures as often as you can!

If you want to write bytes you have to make it like below
file.write(bytes(header, encoding="UTF-8"))

Related

looping through JSON data and inserting all bikestations into a dictionary

I am trying to loop through a JSON file and add data for all the bike stations listed into the dictionary instead of only the final bikestation, which is 502. So if my code was working the way I want it to, The resulting dictionary would have JSON data for all 5 bikestations. I am very new to this and any help is appreciated. here is my code so far:
import json
import urllib.request, urllib.parse, urllib.error
import datetime
import pymongo
stations = (123, 258, 290, 501, 502)
chicagoBikesURL = "https://data.cityofchicago.org/resource/eq45-8inv.json?"
for station in stations:
paramD = dict()
paramD["id"] = station
paramD["$order"] = "timeStamp DESC"
paramD["$limit"] = 2
params = urllib.parse.urlencode(paramD)
print(chicagoBikesURL+params)
document = urllib.request.urlopen(chicagoBikesURL+params)
# get all of the text from the document
text = document.read().decode()
if document.getcode() != 200 :
print("Error code=",document.getcode(), chicagoBikesURL+params)
text = "{}"
# Load the JSON text from the URL into a dictionary using the json library
js = json.loads(text)
# Output first Record
print("\nFirst BikeStation")
print(js[0])
# Write JSON data to a file
fdumps = open('bike_data.txt', "w")
fdumps.write(json.dumps(js).strip())
bikeStation_list.append(js)
# Make sure you close the file otherwise data may not be saved
fdumps.close()
#Process JSON Data
bikeStation_list = []`enter code here`
"w" mode on the file open always creates a new file. So each one overwrites the previous so you will only see the final item.
To append to the file, use:
fdumps = open('bike_data.txt', "a")

Writing Printed Output to CSV - Numpy

I want this output written via CSV
['https://www.lendingclub.com/loans/personal-loans' '6.16% to 35.89%']
['https://www.lendingclub.com/loans/personal-loans' '1% to 6%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.marcus.com/us/en/personal-loans' '6.99% to 24.99%']
['https://www.discover.com/personal-loans/' '6.99% to 24.99%']
However when I run the code to write the output to CSV I only get the last line written to the CSV file:
['https://www.discover.com/personal-loans/' '6.99% to 24.99%']
Could it be because my printed output is not comma separated? I attempted to circumvent having to put a comma in there by using a space as the delimiter. Let me know your thoughts. Would love some help on this because I am having the hardest time reshaping this collected data.
plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
'https://www.marcus.com/us/en/personal-loans',
'https://www.discover.com/personal-loans/']
#cycle through links in array until it finds APR rates/fixed or variable using regex
for link in plcompetitors:
cdate = datetime.date.today()
l = r.get(link)
l.encoding = 'utf-8'
data = l.text
soup = bs(data, 'html.parser')
#captures Discover's rate perfectly but catches too much for lightstream/prosper
paragraph = soup.find_all(text=re.compile('[0-9]%'))
for n in paragraph:
matches = re.findall('(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
try:
irate = str(matches[0])
array = np.asarray(irate)
array2 = np.append(link,irate)
array2 = np.asarray(array2)
print(array2)
#with open('test.csv', "w") as csv_file:
# writer = csv.writer(csv_file, delimiter=' ')
# for line in test:
# writer.writerow(line)
except IndexError:
pass
When it comes to using csv file, pandas comes handy.
import datetime
import requests as r
from bs4 import BeautifulSoup as bs
import numpy as np
import regex as re
import pandas as pd
plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
'https://www.marcus.com/us/en/personal-loans',
'https://www.discover.com/personal-loans/']
df = pd.DataFrame({'Link':[],'APR Rate':[]})
#cycle through links in array until it finds APR rates/fixed or variable using regex
for link in plcompetitors:
cdate = datetime.date.today()
l = r.get(link)
l.encoding = 'utf-8'
data = l.text
soup = bs(data, 'html.parser')
#captures Discover's rate perfectly but catches too much for lightstream/prosper
paragraph = soup.find_all(text=re.compile('[0-9]%'))
for n in paragraph:
matches = re.findall('(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
irate = ''
try:
irate = str(matches[0])
df2 = pd.DataFrame({'Link':[link],'APR Rate':[irate]})
df = pd.concat([df,df2],join="inner")
except IndexError:
pass
df.to_csv('CSV_File.csv',index=False)
I have stored each link and it's irate value in a data frame df2 and I concatenate it to parent data frame df.
At the end, I write parent data frame df to a csv file.
I think the problem is that you are opening the file in write-mode (the "w" in open('test.csv', "w")), meaning that Python overwrites what's already written in the file. I think you're looking for append-mode:
# open the file before the loop, and close it after
csv_file = open("test.csv", 'a') # change the 'w' to an 'a'
csv_file.truncate(0) # clear the contents of the file
writer = csv.writer(csv_file, delimiter=' ') # make the writer beforehand for efficiency
for n in paragraph:
matches = re.findall('(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%', n.string)
try:
irate = str(matches[0])
array = np.asarray(irate)
array2 = np.append(link,irate)
array2 = np.asarray(array2)
print(array2)
for line in test:
writer.writerow(line)
except IndexError:
pass
# close the file
csv_file.close()
If this doesn't work, please let me know!

write list of paragraph tuples to a csv file

The following code is designed to write a tuple, each containing a large paragraph of text, and 2 identifiers behind them, to a single line per each entry.
import urllib2
import json
import csv
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}
writer = csv.writer(open("./data/events.csv", "a"))
writer.writerow(["description", "category_id", "subcategory_id"])
def format_event(event):
return event["description"]["text"].encode("utf-8").rstrip("\n\r"), event["category_id"], event["subcategory_id"]
for x in range(1, 2):
print "fetching page - {}".format(x)
formatted_url = base_url.format(str(x))
resp = urllib2.urlopen(formatted_url)
data = resp.read()
j_data = json.loads(data)
events = map(format_event, j_data["events"])
for event in events:
#print event
writer.writerow(event)
print "wrote out events for page - {}".format(x)
The ideal format would be to have each line contain a single paragraph, followed by the other fields listed above, yet here is a screenshot of how the data comes out.
If instead I this line to the following:
writer.writerow([event])
Here is how the file now looks:
It certainly looks much closer to what I want, but its got parenthesis around each entry which are undesirable.
EDIT
here is a snippet that contains a sample of the data Im working with.
Can you try writing to the CSV file directly without using using the csv module? You can write/append comma-delimited strings to the CSV file just like writing to typical text files. Also, the way you deal with removing \r and \n characters might not be working. You can use regex to find those characters and replace them with an empty string "":
import urllib2
import json
import re
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"
def format_event(event):
ws_to_strip = re.compile(r"(\r|\n)")
description = re.sub(ws_to_strip, "", event["description"]["text"].encode("utf-8"))
return [description, event["category_id"], event["subcategory_id"]]
with open("./data/events.csv", "a") as events_file:
events_file.write(",".join(["description", "category_id", "subcategory_id"]))
for x in range(1, 2):
print "fetching page - {}".format(x)
formatted_url = base_url.format(str(x))
resp = urllib2.urlopen(formatted_url)
data = resp.read()
j_data = json.loads(data)
events = map(format_event, j_data["events"])
for event in events:
events_file.write(",".join(event))
print "wrote out events for page - {}".format(x)
Change your csv writer to be DictWriter.
Make a few tweaks:
def format_event(event):
return {"description": event["description"]["text"].encode("utf-8").rstrip("\n\r"),
"category_id": event["category_id"],
"subcategory_id": event["subcategory_id"]}
May be a few other small things you need to do, but using DictWriter and formatting your data appropriately has been the easiest way to work with csv files that I've found.

Python - Web Scraping - BeautifulSoup & CSV

I am hoping to extract the change in cost of living from one city against many cities. I plan to list the cities I would like to compare in a CSV file and using this list to create the web link that would take me to the website with the information I am looking for.
Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
Unfortunately I am running into several challenges. Any assistance to the following challenges is greatly appreciated!
The output only shows the percentage, but no indication whether it is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' to add '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
(related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?
CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
PREFERRED CSV FORMAT:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup
#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'
i=0
while i<len(Textfilelistsplit):
url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
page = requests.get(url).text
soup_expatistan = BeautifulSoup(page)
#Prepare CSV writer.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
expatistan_table = soup_expatistan.find("table",class_="comparison")
expatistan_titles = expatistan_table.find_all("tr",class_="expandable")
for expatistan_title in expatistan_titles:
percent_difference = expatistan_title.find("th",class_="percent")
percent_difference_title = percent_difference.span['class']
if percent_difference_title == "expensiver":
WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
else:
WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
i+=1
Answers:
Question 1: the class of the span is a list, you need to check if expensiver is inside this list. In other words, replace:
if percent_difference_title == "expensiver"
with:
if "expensiver" in percent_difference.span['class']
Questions 2 and 3: you need to pass a list of column values to writerow(), not string. And, since you want only one record per city, call writerow() outside of the loop (over the trs).
Other issues:
open csv file for writing before the loop
use with context managers while working with files
try to follow PEP8 style guide
Here's the code with modifications:
import requests
import csv
from bs4 import BeautifulSoup
BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'
with open('City.txt') as input_file:
with open("Expatistan.csv", "w") as output_file:
writer = csv.writer(output_file)
writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
for line in input_file:
city = line.strip()
url = BASE_URL.format(home_city=home_city, city=city)
soup = BeautifulSoup(requests.get(url).text)
table = soup.find("table", class_="comparison")
differences = []
for title in table.find_all("tr", class_="expandable"):
percent_difference = title.find("th", class_="percent")
if "expensiver" in percent_difference.span['class']:
differences.append('+' + percent_difference.span.string)
else:
differences.append('-' + percent_difference.span.string)
writer.writerow([city] + differences)
For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
Make sure you understand what changes have I made. Let me know if you need further help.
csv.writer.writerow() takes a sequence and makes each element a column; normally you'd give it a list with columns, but you are passing in strings instead; that'll add individual characters as columns instead.
Just build a list, then write it to the CSV file.
First, open the CSV file once, not for every separate city; you are clearing out the file every time you open it.
import requests
import csv
from bs4 import BeautifulSoup
HomeCity = 'Phoenix'
with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(["City", "Food", "Housing", "Clothes",
"Transportation", "Personal Care", "Entertainment"])
for line in cities:
city = line.strip()
url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
HomeCity, city)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
titles = soup.select("table.comparison tr.expandable")
row = [city]
for title in titles:
percent_difference = title.find("th", class_="percent")
changeclass = percent_difference.span['class']
change = percent_difference.span.string
if "expensiver" in changeclass:
change = '+' + change
else:
change = '-' + change
row.append(change)
writer.writerow(row)
So, first of all, one passes the writerow method an iterable, and each object in that iterable gets written with commas separating them. So if you give it a string, then each character gets separated:
WriteResultsFile.writerow('hello there')
writes
h,e,l,l,o, ,t,h,e,r,e
But
WriteResultsFile.writerow(['hello', 'there'])
writes
hello,there
That's why you are getting results like
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
The rest of your problems are errors in your webscraping. First of all, when I scrape the site, searching for tables with CSS class "comparison" gives me None. So I had to use
expatistan_table = soup_expatistan.find("table","comparison")
Now, the reason your "if statement is broken" is because
percent_difference.span['class']
returns a list. If we modify that to
percent_difference.span['class'][0]
things will work the way you expect.
Now, your real issue is that inside the innermost loop you are finding the % changing in price for the individual items. You want these as items in your row of price differences, not individual rows. So, I declare an empty list items to which I append percent_difference.span.string, and then write the row outside the innermost loop Like so:
items = []
for expatistan_title in expatistan_titles:
percent_difference = expatistan_title.find("th","percent")
percent_difference_title = percent_difference.span["class"][0]
print percent_difference_title
if percent_difference_title == "expensiver":
items.append('+' + percent_difference.span.string)
else:
items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
The final error, is the in the while loop you re-open the csv file, and overwrite everything so you only have the final city in the end. Accounting for all theses errors (many of which you should have been able to find without help) leaves us with:
#Prepare CSV writer.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
i=0
while i<len(Textfilelistsplit):
url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
page = requests.get(url).text
print url
soup_expatistan = BeautifulSoup(page)
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
expatistan_table = soup_expatistan.find("table","comparison")
expatistan_titles = expatistan_table.find_all("tr","expandable")
items = []
for expatistan_title in expatistan_titles:
percent_difference = expatistan_title.find("th","percent")
percent_difference_title = percent_difference.span["class"][0]
print percent_difference_title
if percent_difference_title == "expensiver":
items.append('+' + percent_difference.span.string)
else:
items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
i+=1
YAA - Yet Another Answer.
Unlike the other answers, this treats the data as a series key-value pairs; ie: a list of dictionaries, which are then written to CSV. A list of wanted fields is provided to the csv writer (DictWriter), which discards additional information (beyond the specified fields) and blanks missing information. Also, should the order of the information on the original page change, this solution is unaffected.
I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for this to happen nicely (see dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see quoting parameter).
import csv
import requests
from bs4 import BeautifulSoup
#Read text file
with open("City.txt") as cities_h:
cities = cities_h.readlines()
home_city = "Phoenix"
city_data = []
for city in cities:
url = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, from_encoding = resp.encoding)
titles = soup.select("table.comparison tr.expandable")
if titles:
data = {}
for title in titles:
name = title.find("th", class_ = "clickable")
diff = title.find("th", class_ = "percent")
exp = bool(diff.find("span", class_ = "expensiver"))
data[name.text] = ("+" if exp else "-") + diff.span.text
data["City"] = soup.find("strong", class_ = "city-2").text
city_data.append(data)
with open("Expatistan.csv","w") as csv_h:
fields = \
[
"City",
"Food",
"Housing",
"Clothes",
"Transportation",
"Personal Care",
"Entertainment"
]
#Prepare CSV writer.
writer = csv.DictWriter\
(
csv_h,
fields,
quoting = csv.QUOTE_ALL,
extrasaction = "ignore",
dialect = "excel",
lineterminator = "\n",
)
writer.writeheader()
writer.writerows(city_data)

Scraping values from HTML header and saving as a CSV file in Python

All,
I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.
The data I'd like to access is obtainable in either the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from it data if the CSV link is clicked.
Using a couple of sources on this website I've managed to cobble together the code below which will pull the data out and save it to a file, but, it contains the \n designators. Try as I might, I can't get a correct CSV file to save out.
I am sure it's something simple but need a bit of help if possible!
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os
from string import replace
bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'
data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))
data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()
file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'
file = open(file_name,"wb")
file.write(data)
file.close()
Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!
Try starting like this:
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
Then you can use:
partition('=')[2] to cut off the "var gs_csv" bit.
strip(' \n"') to remove unwanted characters at each end (space, newline, ")
replace("\\n","\n") to sort out the new lines.
Incidentally, replace is a string method, so you don't have to import it separately, you can just do data.replace(....
Finally, you need to separate it as csv. You could save it and reopen it, then load it into a csv.reader. You could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:
for line in data.splitlines():
row = line.split(",")
SOLUTION
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time
bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"
bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6
data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()
csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))
for line in javascriptdata.splitlines():
row = line.split(",")
csvwriter.writerow(row)
del csvwriter

Categories