Python - Web Scraping - BeautifulSoup & CSV

I am hoping to extract the change in cost of living between one city and many other cities. I plan to list the cities I would like to compare in a CSV file and use this list to build the web link that takes me to the page with the information I am looking for.
Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
Unfortunately I am running into several challenges. Any assistance with the following is greatly appreciated!
1. The output only shows the percentage, with no indication of whether the city is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' that adds a '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write them as a list on one line.
3. (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?
CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
PREFERRED CSV FORMAT:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup
#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)
    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th",class_="percent")
        percent_difference_title = percent_difference.span['class']
        if percent_difference_title == "expensiver":
            WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
        else:
            WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1

Answers:
Question 1: the class attribute of the span is a list, so you need to check whether "expensiver" is in that list. In other words, replace:
if percent_difference_title == "expensiver":
with:
if "expensiver" in percent_difference.span['class']:
Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you want only one record per city, call writerow() outside of the inner loop (the one over the trs).
Other issues:
open csv file for writing before the loop
use with context managers while working with files
try to follow PEP8 style guide
Here's the code with modifications:
import requests
import csv
from bs4 import BeautifulSoup
BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'
with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)
            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)
For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
Make sure you understand what changes I have made. Let me know if you need further help.
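The question also asked for values without the percent sign. A minimal sketch of one way to do that, assuming the span text looks like "48%" and that the class list contains "expensiver" for the more expensive categories. The helper name signed_percent and the "cheaper" class name are made up for illustration and are not taken from the site:

```python
def signed_percent(text, classes):
    """Turn span text such as '48%' into a signed integer.

    `classes` stands for the list BeautifulSoup returns for span['class'].
    """
    value = int(text.strip().rstrip('%'))
    # Positive when the city is more expensive, negative otherwise.
    return value if "expensiver" in classes else -value


# "cheaper" is a hypothetical class name used only for this example.
print(signed_percent("48%", ["expensiver"]))
print(signed_percent("12%", ["cheaper"]))
```

In the loop above, differences.append(signed_percent(percent_difference.span.string, percent_difference.span['class'])) would then collect plain numbers instead of strings.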

csv.writer.writerow() takes a sequence and makes each element a column; normally you'd give it a list with columns, but you are passing in strings instead; that'll add individual characters as columns instead.
Just build a list, then write it to the CSV file.
First, open the CSV file once, not for every separate city; you are clearing out the file every time you open it.
import requests
import csv
from bs4 import BeautifulSoup
HomeCity = 'Phoenix'
with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])
    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
        titles = soup.select("table.comparison tr.expandable")
        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
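The string-versus-list behaviour of writerow() can be checked without touching the network. This is a small sketch written for Python 3 (the answers here target Python 2), using an in-memory buffer in place of the real output file:

```python
import csv
import io

buffer = io.StringIO()  # stands in for the opened Expatistan.csv
writer = csv.writer(buffer, lineterminator="\n")

writer.writerow("new-york-city+48%")        # a string: one column per character
writer.writerow(["new-york-city", "+48%"])  # a list: one column per element

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```

The first printed line has every character separated by commas, while the second has exactly two columns.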

So, first of all, one passes the writerow method an iterable, and each object in that iterable gets written with commas separating them. So if you give it a string, then each character gets separated:
WriteResultsFile.writerow('hello there')
writes
h,e,l,l,o, ,t,h,e,r,e
But
WriteResultsFile.writerow(['hello', 'there'])
writes
hello,there
That's why you are getting results like
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
The rest of your problems are errors in your webscraping. First of all, when I scrape the site, searching for tables with CSS class "comparison" gives me None. So I had to use
expatistan_table = soup_expatistan.find("table","comparison")
Now, the reason your "if statement is broken" is because
percent_difference.span['class']
returns a list. If we modify that to
percent_difference.span['class'][0]
things will work the way you expect.
Now, your real issue is that inside the innermost loop you are finding the % change in price for the individual items. You want these as items in your row of price differences, not as individual rows. So, I declare an empty list items, to which I append percent_difference.span.string, and then write the row outside the innermost loop, like so:
items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th","percent")
    percent_difference_title = percent_difference.span["class"][0]
    print percent_difference_title
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
The final error is that in the while loop you re-open the csv file and overwrite everything, so you only have the final city in the end. Accounting for all these errors (many of which you should have been able to find without help) leaves us with:
#Prepare CSV writer and write the header once, before the loop.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    print url
    soup_expatistan = BeautifulSoup(page)
    expatistan_table = soup_expatistan.find("table","comparison")
    expatistan_titles = expatistan_table.find_all("tr","expandable")
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th","percent")
        percent_difference_title = percent_difference.span["class"][0]
        print percent_difference_title
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i+=1

YAA - Yet Another Answer.
Unlike the other answers, this treats the data as a series of key-value pairs, i.e. a list of dictionaries, which are then written to CSV. A list of wanted fields is provided to the csv writer (DictWriter), which discards additional information (beyond the specified fields) and blanks missing information. Also, should the order of the information on the original page change, this solution is unaffected.
I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for this to happen nicely (see dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see quoting parameter).
import csv
import requests
from bs4 import BeautifulSoup
#Read text file
with open("City.txt") as cities_h:
    # strip the trailing newlines so they do not end up in the URL; skip blank lines
    cities = [line.strip() for line in cities_h if line.strip()]
home_city = "Phoenix"
city_data = []
for city in cities:
    url = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, from_encoding = resp.encoding)
    titles = soup.select("table.comparison tr.expandable")
    if titles:
        data = {}
        for title in titles:
            name = title.find("th", class_ = "clickable")
            diff = title.find("th", class_ = "percent")
            exp = bool(diff.find("span", class_ = "expensiver"))
            data[name.text] = ("+" if exp else "-") + diff.span.text
        data["City"] = soup.find("strong", class_ = "city-2").text
        city_data.append(data)
with open("Expatistan.csv","w") as csv_h:
    fields = \
    [
        "City",
        "Food",
        "Housing",
        "Clothes",
        "Transportation",
        "Personal Care",
        "Entertainment"
    ]
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(city_data)
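The effect of those DictWriter parameters can be seen on a small scale. The sketch below is written for Python 3 and uses made-up sample values (the "Taxes" key and the numbers are invented for illustration); it writes to an in-memory buffer rather than a file:

```python
import csv
import io

fields = ["City", "Food", "Housing"]
rows = [
    # "Taxes" is an extra key that is not listed in `fields`.
    {"City": "New York City", "Food": "+48%", "Housing": "+129%", "Taxes": "+5%"},
    # "Housing" is missing from this row.
    {"City": "Tucson", "Food": "-3%"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fields, quoting=csv.QUOTE_ALL,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
output = buffer.getvalue()
print(output)
```

The extra key is dropped, the missing field becomes an empty quoted cell, and every value is quoted.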

Related

Python 2.7: AttributeError: 'list' object has no attribute 'get'

I've built a script that crawls court listings in the UK, generates a list of links to each court's address page, and then want to scrape the address from said page.
It works pretty well so far but I am stuck at the "write to csv" bit. I think it's got to do with the iteritems()'s lack of get method, based on a similar problem. I get that an iterator doesn't have the same methods as an iterable (I am using an iterator in my code), but it didn't help me solve my particular problem.
Here's my code:
import csv
import time
import random
import requests
from bs4 import BeautifulSoup as bs
# lambda expression to request url and parse it through bs
soup = lambda url: bs((requests.get(url)).text, "html.parser")
def crawl_court_listings(base, buff, char):
    """ """
    # common URL segment + buffer URL segment + end character -> URL
    url = base + buff + str(chr(char))
    # soup lambda expression -> grab first unordered list
    links = (soup(url)).find('div', {'class', 'content inner cf'}).find('ul')
    # empty dictionary
    results = {}
    # loop through links, get link title and href
    for item in links.find_all('a', href=True):
        court_link = item['href']
        title = item.string
        # generate full court address page url from href
        full_court_link = base + court_link
        # save title and full URL to results
        results[title] = full_court_link
    # increment char var by 1
    char += 1
    # return results dict and incremented char value
    return results, char
def get_court_address(court_name, full_court_link):
    """ """
    # get horrible chunk of poorly formatted address(es)
    address_blob = (soup(full_court_link)).find('div', {'id': 'addresses'}).text
    # clean the blob
    clean_address = ("\n".join(line.strip() for line in address_blob.split("\n")))
    # write to csv
    with open('court_addresses.csv', 'w') as csvfile:
        fieldnames = [court_name, full_court_link, clean_address]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow(fieldnames)
if __name__ == "__main__":
    base = 'https://courttribunalfinder.service.gov.uk/'
    buff = 'courts/'
    # 65 = "A". Starting from Char "A", retrieve list of Titles and Links for Court Addresses. Return Char +1
    results, char = crawl_court_listings(base, buff, 65)
    # 90 = "Z". Until Z, pass title and list from results into get_court_address(), then wait a few seconds
    while char <= 90:
        for t, l in results.iteritems():
            get_court_address(t, l)
            time.sleep(random.randint(0,5))
When I run this, I get the following:
Traceback (most recent call last):
File ".\CourtScraper.py", line 63, in <module>
get_court_address(t, l)
File ".\CourtScraper.py", line 49, in get_court_address
writer.writerow(fieldnames)
File "c:\python27\Lib\csv.py", line 152, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "c:\python27\Lib\csv.py", line 149, in _dict_to_list
return [rowdict.get(key, self.restval) for key in self.fieldnames]
AttributeError: 'list' object has no attribute 'get'
Even though I get an error, it produces the csv file with cells A1 and A2 populated with title and full-court_link, but no address. The address (when printed) looks like this:
Write to us:
1st Floor
Piccadilly Exchange
Piccadilly Plaza
Manchester
Greater Manchester
M1 4AH
So my first thoughts were that I was trying to write multi-line text into a single cell which was causing the error, but not really sure how to confirm that. I used print(type(address)) which came back as unicode and not a list, so I don't think that's causing the issue. I don't understand where it's getting the list the issue relates to from, if that makes sense.
If it is the iteritems() method causing the issue, how do I go about resolving it?
Can someone explain the error and point me in the direction of solving it please?

Your problem is here:
writer.writerow(fieldnames)
"fieldnames" is a list of field names. You need to pass a dict of key-value pairs. So it should look more like this:
# write to csv
with open('court_addresses.csv', 'w') as csvfile:
    # note - these are strings, not variables
    fieldnames = ['court_name', 'full_court_link', 'clean_address']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({"court_name" : court_name,
                     "full_court_link" : full_court_link,
                     "clean_address" : clean_address})
PSST: you have another issue. You are re-opening your output file for every court that you parse. You probably want to open that file once (under __main__) and then pass the handle into get_court_address()
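One way to read the "open once and pass the handle" suggestion is sketched below for Python 3. The function name write_court and the example.invalid URLs are made up for illustration, and an in-memory buffer stands in for the real court_addresses.csv file:

```python
import csv
import io

FIELDNAMES = ["court_name", "full_court_link", "clean_address"]


def write_court(writer, court_name, full_court_link, clean_address):
    """Write one court as one row using an already-open DictWriter."""
    writer.writerow({
        "court_name": court_name,
        "full_court_link": full_court_link,
        "clean_address": clean_address,
    })


# io.StringIO stands in for open('court_addresses.csv', 'w', newline='').
csvfile = io.StringIO()
writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES, lineterminator="\n")
writer.writeheader()  # header written once, before any court is processed
write_court(writer, "Example Court", "https://example.invalid/courts/example",
            "1st Floor\nPiccadilly Exchange")
write_court(writer, "Another Court", "https://example.invalid/courts/another",
            "2 Some Street")
result = csvfile.getvalue()
```

Because the writer is created once, each call appends a row instead of overwriting the file, and the multi-line address is quoted so that it stays in a single cell.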

For each row you are writing, you need to pass in a dictionary - you are passing in the header list
https://docs.python.org/2/library/csv.html#csv.DictWriter
# write to csv
with open('court_addresses.csv', 'w') as csvfile:
    fieldnames = [court_name, full_court_link, clean_address]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow(fieldnames)
                    ^^^^^^^^^^ This should be a dict
The dict needs to look like:
{'court_name': X, 'full_court_link': Y, 'clean_address': Z}
HTH

with open('court_addresses.csv', 'w') as csvfile:
    fieldnames = ['court_name', 'full_court_link', 'clean_address']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'court_name': court_name, 'full_court_link': full_court_link, 'clean_address': clean_address})

write list of paragraph tuples to a csv file

The following code is designed to write a list of tuples to a CSV file, one line per entry. Each tuple contains a large paragraph of text followed by 2 identifiers.
import urllib2
import json
import csv
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"
writer = csv.writer(open("./data/events.csv", "a"))
writer.writerow(["description", "category_id", "subcategory_id"])
def format_event(event):
    return event["description"]["text"].encode("utf-8").rstrip("\n\r"), event["category_id"], event["subcategory_id"]
for x in range(1, 2):
    print "fetching page - {}".format(x)
    formatted_url = base_url.format(str(x))
    resp = urllib2.urlopen(formatted_url)
    data = resp.read()
    j_data = json.loads(data)
    events = map(format_event, j_data["events"])
    for event in events:
        #print event
        writer.writerow(event)
    print "wrote out events for page - {}".format(x)
The ideal format would be to have each line contain a single paragraph, followed by the other fields listed above, yet here is a screenshot of how the data comes out.
If I instead change this line to the following:
writer.writerow([event])
Here is how the file now looks:
It certainly looks much closer to what I want, but it has parentheses around each entry, which are undesirable.
EDIT
Here is a snippet that contains a sample of the data I'm working with.

Can you try writing to the CSV file directly without using the csv module? You can write/append comma-delimited strings to the CSV file just like writing to typical text files. Also, the way you deal with removing \r and \n characters might not be working. You can use regex to find those characters and replace them with an empty string "":
import urllib2
import json
import re
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"
def format_event(event):
    ws_to_strip = re.compile(r"(\r|\n)")
    description = re.sub(ws_to_strip, "", event["description"]["text"].encode("utf-8"))
    return [description, event["category_id"], event["subcategory_id"]]
with open("./data/events.csv", "a") as events_file:
    # end each record with a newline so rows do not run together
    events_file.write(",".join(["description", "category_id", "subcategory_id"]) + "\n")
    for x in range(1, 2):
        print "fetching page - {}".format(x)
        formatted_url = base_url.format(str(x))
        resp = urllib2.urlopen(formatted_url)
        data = resp.read()
        j_data = json.loads(data)
        events = map(format_event, j_data["events"])
        for event in events:
            events_file.write(",".join(event) + "\n")
        print "wrote out events for page - {}".format(x)

Change your csv writer to be DictWriter.
Make a few tweaks:
def format_event(event):
    return {"description": event["description"]["text"].encode("utf-8").rstrip("\n\r"),
            "category_id": event["category_id"],
            "subcategory_id": event["subcategory_id"]}
May be a few other small things you need to do, but using DictWriter and formatting your data appropriately has been the easiest way to work with csv files that I've found.
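A rough sketch of how the DictWriter suggestion might fit together, written for Python 3 and run against made-up sample data shaped like the three fields used in the question (no request to the API is made). It also collapses embedded newlines inside the description, which rstrip() alone does not remove:

```python
import csv
import io


def format_event(event):
    """Flatten one event dict into the three CSV columns."""
    # Collapse every run of whitespace, including embedded newlines, to one space.
    description = " ".join(event["description"]["text"].split())
    return {
        "description": description,
        "category_id": event["category_id"],
        "subcategory_id": event["subcategory_id"],
    }


# Made-up sample; real events would come from the JSON response.
sample_events = [
    {"description": {"text": "Line one,\nline two\r\n"},
     "category_id": "103", "subcategory_id": None},
]

buffer = io.StringIO()  # stands in for the opened events.csv
writer = csv.DictWriter(buffer, ["description", "category_id", "subcategory_id"],
                        lineterminator="\n")
writer.writeheader()
for event in sample_events:
    writer.writerow(format_event(event))
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes the description because it contains a comma, so each event stays on a single line with three columns.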

Previously working script now fails to generate csv file. Why?

The title can be misleading: the Python script runs, but it fails to generate a CSV file, which it previously had no problem doing.
Source:
import requests
import unicodecsv as csv
import json
api_url = 'http://api.indeed.com/ads/apisearch?publisher=8710117352111766&v=2&limit=100000&format=json'
number= 0
SearchTerm = 'McKinsey'
countries = set(['us','ar','au','at','bh','be','br','ca','cl','cn','co','cz','dk','fi','fr','de','gr','hk','hu','in','id','ie','il','it','jp','kr','kw','lu','my','mx','nl','nz','no','om','pk','pe','ph','pl','pt','qa','ro','ru','sa','sg','za','es','se','ch','tw','tr','ae','gb','ve'])
with open( SearchTerm + '.csv' , 'a' ) as csvfile:
    fieldnames = ['city','company','country','date','expired','formattedLocation','formattedLocationFull','formattedRelativeTime','indeedApply','jobkey','jobtitle','latitude','longitude','onmousedown','snippet','source','sponsored','state','url']
    writer = csv.DictWriter(csvfile, fieldnames = fieldnames, lineterminator = '\n')
    writer.writeheader()
    for SCountry in countries:
        Country = SCountry #this is the variable assigned to the country
        urlfirst = api_url + '&co=' + Country + '&q=' + SearchTerm
        grabforNum = requests.get(urlfirst)
        json_content = json.loads(grabforNum.content)
        print(json_content["totalResults"])
        numresults = (json_content["totalResults"])
        # must match the actual number of job results to the lower of the 25 increment or the last page will repeat over and over
        for number in range(0, numresults, 25):
            url = api_url + '&co=' + Country + '&q=' + SearchTerm + '&latlong=1' + '&start=' + str(number)
            response = requests.get(url)
            grabforclean = json.loads(response.content)
            clean_json = (grabforclean['results'])
            print 'Complete '+ url
            for job in clean_json:
                writer.writerow(job)
I am the original owner of the script. I was using it 3 days ago until I had to reinstall my operating system. Now, for some reason, it fails to store the content it collects in a CSV file. The API key works and there are no error messages. requests, unicodecsv and json are all installed.
Stuff like this really drives me up the wall: how can you diagnose something that previously worked? I had multiple versions of the script searching for different keywords, so I know my modifications are not to blame, but perhaps something outside the script is broken.

The website has probably recently started returning a new field, so you have two choices:
1. Add "stations" to your list of fieldnames.
2. Add extrasaction='ignore' to your csv.DictWriter parameters to keep all your existing fields and ignore any new ones that are added.
Both of these solutions will allow your script to work again.
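The difference between the default behaviour and extrasaction='ignore' can be reproduced with the standard library csv module. This is a Python 3 sketch with a made-up job record; the question itself uses unicodecsv under Python 2, which is not exercised here:

```python
import csv
import io

fieldnames = ["jobkey", "jobtitle"]
# Made-up record; "stations" is a key that is not listed in fieldnames.
job = {"jobkey": "abc123", "jobtitle": "Analyst", "stations": ""}

# Default extrasaction is 'raise': an unknown key triggers ValueError.
strict = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
try:
    strict.writerow(job)
    raised = False
except ValueError:
    raised = True

# With extrasaction='ignore' the unknown key is silently dropped.
buffer = io.StringIO()
lenient = csv.DictWriter(buffer, fieldnames=fieldnames,
                         extrasaction="ignore", lineterminator="\n")
lenient.writerow(job)
print(raised, repr(buffer.getvalue()))
```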

how to add lists to a dictionary then output to .csv

I'm trying to iterate through tables in HTML by a search label, then update a dictionary with the found value, then write those values to a CSV. The output currently works for both the url and the headline, but the name output will either be blank or show "None". If I print the output of blog["name"], however, it is correctly pulling the information I want. I suspect that it's an indentation error, but I can't figure out where to line things up. I've tried moving things around, but nothing seems to get the name assignment to work inside that loop.
import os
from bs4 import BeautifulSoup
import my_csv_writer
def td_finder(tr, searchLabel):
    value = ""
    index = tr.text.find(searchLabel)
    if index>-1:
        tds = tr.findAll('td')
        if len(tds)>1:
            value = tds[1].text
    return value
def main():
    topdir = 'some_directory'
    writer = my_csv_writer.CsvWriter("output.csv")
    writer.writeLine(["url", "headline", "name"])
    """Main Function"""
    blog = []
    for root, dirs, files in os.walk(topdir):
        for f in files:
            url = os.path.join(root, f)
            url = os.path.dirname(url).split('some_file')[1]
            if f.lower().endswith((".html")):
                file_new = open(os.path.join(root, f), "r").read()
                soup = BeautifulSoup(file_new)
                blog = {}
                #Blog Title
                blog["title"] = soup.find('title').text
                for table in soup.findAll("table"):
                    for tr in table.findAll("tr"):
                        #name
                        blog["name"] = td_finder(tr, "name:")
                seq = [url, unicode(blog["title"]), unicode(blog.get("name"))]
                writer.writeLine(seq)
    #return ""
if __name__ == '__main__':
    main()
    print "Finished main"

You're writing unicode strings to a CSV file, and according to the official docs, "The csv module doesn't directly support reading and writing Unicode...".
The docs do offer alternative classes, such as UnicodeWriter, to enable different encodings. An answer from Boud on SO highlights the need to set the desired encoding in the CSV file.
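The limitation quoted above applies to the Python 2 csv module. For comparison, here is a small Python 3 sketch in which the csv module works with text directly and the encoding is chosen when the file is opened. The sample rows and the temporary path are made up for illustration:

```python
import csv
import os
import tempfile

# Made-up rows containing non-ASCII characters.
rows = [["url", "headline", "name"],
        ["/posts/1", "Café opening", "Zoë"]]

path = os.path.join(tempfile.mkdtemp(), "output.csv")

# The encoding is set on the file object; newline="" lets csv control line endings.
with open(path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows(rows)

with open(path, encoding="utf-8", newline="") as handle:
    read_back = list(csv.reader(handle))
print(read_back)
```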

Scraping values from HTML header and saving as a CSV file in Python

All,
I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.
The data I'd like to access is obtainable either in the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.
Using a couple of sources on this website I've managed to cobble together the code below which will pull the data out and save it to a file, but, it contains the \n designators. Try as I might, I can't get a correct CSV file to save out.
I am sure it's something simple but need a bit of help if possible!
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os
from string import replace
bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'
data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))
data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()
file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'
file = open(file_name,"wb")
file.write(data)
file.close()

Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!
Try starting like this:
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
Then you can use:
partition('=')[2] to cut off the "var gs_csv" bit.
strip(' \n"') to remove unwanted characters at each end (space, newline, ")
replace("\\n","\n") to sort out the new lines.
Incidentally, replace is a string method, so you don't have to import it separately, you can just do data.replace(....
Finally, you need to separate it as csv. You could save it and reopen it, then load it into a csv.reader. You could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:
for line in data.splitlines():
    row = line.split(",")
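The StringIO route mentioned above might look like the following sketch, written for Python 3 (where StringIO lives in the io module). The javascriptdata value is a made-up stand-in for the contents of the script tag, not real output from the site:

```python
import csv
import io

# Made-up stand-in for scripttag.contents[0]; "\\n" is a literal backslash + n.
javascriptdata = ' var gs_csv="HDR,PHYSICAL BM DATA\\nPN,T_COTPS-4,1,100.0\\n"\n'

payload = javascriptdata.partition("=")[2]      # drop the "var gs_csv" part
payload = payload.strip(' \n"')                 # trim spaces, newlines and quotes
payload = payload.replace("\\n", "\n").strip()  # literal \n becomes a real newline

# Feed the text straight to csv.reader without saving a file first.
rows = list(csv.reader(io.StringIO(payload)))
for row in rows:
    print(row)
```

Using csv.reader instead of line.split(",") would also cope with quoted fields that contain commas, should the data ever include them.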

SOLUTION
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time
bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"
bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6
data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()
csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))
for line in javascriptdata.splitlines():
    row = line.split(",")
    csvwriter.writerow(row)
del csvwriter
