I have a conversion script that converts PDF and image files to text files, but it takes forever to run. It took almost 48 hours to finish 2000 PDF documents. Right now I have a pool of 12000+ documents that I need to convert, and at my previous rate I can't imagine how long the conversion will take. Is there anything I can change in my code to make it run faster?
Here is the code that I used.
import glob
import os
import shutil
import pytesseract
from PIL import Image
from pdf2image import convert_from_path

def tesseractOCR_pdf(pdf):
    filePath = pdf
    pages = convert_from_path(filePath, 500)
    # Counter to store images of each page of PDF to image
    image_counter = 1
    # Iterate through all the pages stored above
    for page in pages:
        # Declaring filename for each page of PDF as JPG
        # For each page, filename will be:
        # PDF page 1 -> page_1.jpg
        # PDF page 2 -> page_2.jpg
        # PDF page 3 -> page_3.jpg
        # ....
        # PDF page n -> page_n.jpg
        filename = "page_"+str(image_counter)+".jpg"
        # Save the image of the page in system
        page.save(filename, 'JPEG')
        # Increment the counter to update filename
        image_counter = image_counter + 1
    # Variable to get count of total number of pages
    filelimit = image_counter-1
    # Create an empty string for storing purposes
    text = ""
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        # Set filename to recognize text from
        # Again, these files will be:
        # page_1.jpg
        # page_2.jpg
        # ....
        # page_n.jpg
        filename = "page_"+str(i)+".jpg"
        # Recognize the text as string in image using pytesseract
        text += str(((pytesseract.image_to_string(Image.open(filename)))))
        text = text.replace('-\n', '')
    # Delete all the jpg files that were created above
    for i in glob.glob("*.jpg"):
        os.remove(i)
    return text

def tesseractOCR_img(img):
    filePath = img
    text = str(pytesseract.image_to_string(filePath,lang='eng',config='--psm 6'))
    text = text.replace('-\n', '')
    return text

def Tesseract_ALL(docDir, txtDir, troubleDir):
    if docDir == "": docDir = os.getcwd() + "\\" #if no docDir passed in
    for doc in os.listdir(docDir): #iterate through docs in doc directory
        try:
            fileExtension = doc.split(".")[-1]
            if fileExtension == "pdf":
                pdfFilename = docDir + doc
                text = tesseractOCR_pdf(pdfFilename) #get string of text content of pdf
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
            else:
                # elif (fileExtension == "tif") | (fileExtension == "tiff") | (fileExtension == "jpg"):
                imgFilename = docDir + doc
                text = tesseractOCR_img(imgFilename) #get string of text content of img
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
        except:
            print("Error in file: "+ str(doc))
            shutil.move(os.path.join(docDir, doc), troubleDir)
    for filename in os.listdir(txtDir):
        fileExtension = filename.split(".")[-2]
        if fileExtension == "pdf":
            os.rename(txtDir + filename, txtDir + filename.replace('.pdf', ''))
        elif fileExtension == "tif":
            os.rename(txtDir + filename, txtDir + filename.replace('.tif', ''))
        elif fileExtension == "tiff":
            os.rename(txtDir + filename, txtDir + filename.replace('.tiff', ''))
        elif fileExtension == "jpg":
            os.rename(txtDir + filename, txtDir + filename.replace('.jpg', ''))

docDir = "/drive/codingstark/Project/pdf/"
txtDir = "/drive/codingstark/Project/txt/"
troubleDir = "/drive/codingstark/Project/trouble_pdf/"
Tesseract_ALL(docDir, txtDir, troubleDir)
Does anyone know how I can edit my code to make it run faster?
I think a process pool would be perfect for your case.
First you need to figure out which parts of your code can run independently of each other, then wrap them in a function.
Here is an example
from concurrent.futures import ProcessPoolExecutor

def do_some_OCR(filename):
    pass

# file_list is your list of document paths
with ProcessPoolExecutor() as executor:
    for file in file_list:
        _ = executor.submit(do_some_OCR, file)
The code above submits one task per file to a pool of worker processes (by default, as many workers as the machine has CPUs), so the files are processed in parallel.
You can find the official documentation here: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
There is also a really awesome video that shows step-by-step how to use processes for exactly this: https://www.youtube.com/watch?v=fKl2JW_qrso
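A slightly fuller sketch of the same idea, collecting results and errors per file. The helper names (`output_path_for`, `convert_one`, `convert_all`), the naming rule, and the placeholder text are assumptions for illustration only; the real OCR call from the question would go where the comment indicates.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def output_path_for(doc_path, txt_dir):
    """Map e.g. docs/report.pdf to <txt_dir>/report.txt (hypothetical naming rule)."""
    return Path(txt_dir) / (Path(doc_path).stem + ".txt")


def convert_one(doc_path, txt_dir):
    """Convert a single document; this runs inside a worker process."""
    # Placeholder: call tesseractOCR_pdf / tesseractOCR_img from the question here.
    text = "placeholder text for " + Path(doc_path).name
    out_path = output_path_for(doc_path, txt_dir)
    out_path.write_text(text, encoding="utf-8")
    return str(out_path)


def convert_all(doc_paths, txt_dir, max_workers=None):
    """Submit one task per document and collect successes and failures."""
    done, failed = [], []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(convert_one, str(p), txt_dir): p for p in doc_paths}
        for future in as_completed(futures):
            try:
                done.append(future.result())
            except Exception as exc:  # keep going if one document fails
                failed.append((futures[future], exc))
    return done, failed
```

On Windows the call to `convert_all` should sit under an `if __name__ == "__main__":` guard. Also note that the question's `tesseractOCR_pdf` writes `page_N.jpg` files into the working directory and then deletes every `*.jpg`, so several workers running it at once would overwrite each other's pages; passing the page images straight to pytesseract, or giving each document its own temporary directory, avoids that.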
Here is a compact version of the function with the file-writing steps removed. I think this should work based on what I read in the APIs, but I haven't tested it.
Note that I changed from a string to a list because appending to a list is MUCH less costly than appending to a string (see "How slow is Python's string concatenation vs. str.join?"). TL;DR: string concatenation makes a new string every time you concatenate, so with large strings you end up copying the data many times.
Also, calling replace on the accumulated string in every iteration created yet another new string each time, so I moved it to operate on each page's string as it is generated. Note that if the string '-\n' is an artifact of the earlier concatenation, the replace should be moved from where it is to the return statement: return ''.join(pageText).replace('-\n',''). Be aware that this creates one new string with the join and then a whole new string with the replace.
def tesseractOCR_pdf(pdf):
    pages = convert_from_path(pdf, 500)
    # Create an empty list for storing purposes
    pageText = []
    # Iterate through all the pages stored above; each page is a PIL Image
    for page in pages:
        # Recognize the text as string in image using pytesseract
        # Add the text to a list while removing the -\n characters.
        pageText.append(str(pytesseract.image_to_string(page)).replace('-\n',''))
    return ''.join(pageText)
An even more compact one-liner version
def tesseractOCR_pdf(pdf):
    #This takes each page of the pdf, extracts the text, removing -\n and combines the text.
    return ''.join([str(pytesseract.image_to_string(page)).replace('-\n', '') for page in convert_from_path(pdf, 500)])
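To make the caveat about '-\n' concrete, here is a small sketch with plain strings standing in for OCR output (the function names and sample strings are made up for illustration). If a hyphen ever ends one page while the newline begins the next, replacing per page misses it; replacing after the join catches it, at the cost of one extra copy of the full text.

```python
def join_replace_per_page(page_texts):
    """Strip '-\\n' from each page separately, then join once."""
    return "".join(text.replace("-\n", "") for text in page_texts)


def join_then_replace(page_texts):
    """Join once, then strip '-\\n' from the combined text."""
    return "".join(page_texts).replace("-\n", "")


# Hypothetical page texts: the second hyphenation is split across two pages.
pages = ["con-\nvert the docu-", "\nment"]
print(repr(join_replace_per_page(pages)))
print(repr(join_then_replace(pages)))
```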
I'm trying to update a program to pull/read 10-K html and am getting a FileNotFound error. The error throws during the readHTML function. It looks like the FileName parameter is looking for a path to the Form10KName column, when it should be looking to the FileName column. I've no idea why this is happening, any help?
Here is the error code:
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 105, in <module>
main()
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 92, in main
match=readHTML(FileName)
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 18, in readHTML
input_file = open(input_path,'r+')
FileNotFoundError: [Errno 2] No such file or directory: './HTML/a10-k20189292018.htm'
And here is what I'm running.
import os
import re
import csv
from bs4 import BeautifulSoup #<---- Need to install this package manually using pip
from urllib.request import urlopen
os.chdir('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv"
htmlSubPath = "./HTML/" #<===The subfolder with the 10-K files in HTML format
txtSubPath = "./txt/" #<===The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv" #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv" #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms
def readHTML(FileName):
    input_path = htmlSubPath+FileName
    output_path = txtSubPath+FileName.replace(".htm",".txt")
    input_file = open(input_path,'r+')
    page = input_file.read() #<===Read the HTML file into Python
    #Pre-processing the html content by removing extra white space and combining them into one line.
    page = page.strip() #<=== remove white space at the beginning and end
    page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
    page = page.replace('\r', '') #<===remove the \r (carriage returns -if you're on windows)
    page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space.
    page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
    while '  ' in page:
        page = page.replace('  ', ' ') #<===remove extra space
    #Using regular expression to extract texts that match a pattern
    #Define pattern for regular expression.
    #The following patterns find ITEM 1 and ITEM 1A as displayed as subtitles
    #(.+?) represents everything between the two subtitles
    #If you want to extract something else, here is what you should change
    #Define a list of potential patterns to find ITEM 1 and ITEM 1A as subtitles
    regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.', #<===pattern 1: with an attribute bold before the item subtitle
              'b>\s*Item 1\.(.+?)b>\s*Item 1A\.', #<===pattern 2: with a tag <b> before the item subtitle
              'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>', #<===pattern 3: with a tag </b> after the item subtitle
              'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag </b> after the item+description subtitle
    #Now we try to see if a match can be found...
    for regex in regexs:
        match = re.search (regex, page, flags=re.IGNORECASE) #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
        #If a match exists....
        if match:
            #Now we have the extracted content still in an HTML format
            #We now turn it into a beautiful soup object
            #so that we can remove the html tags and only keep the texts
            soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.+?)
            #soup.text removes the html tags and only keeps the texts
            rawText = soup.text.encode('utf8') #<=== you have to change the encoding the unicodes
            #remove space at the beginning and end and the subtitle "business" at the beginning
            #^ matches the beginning of the text
            outText = re.sub("^business\s*","",rawText.strip(),flags=re.IGNORECASE)
            output_file = open(output_path, "w")
            output_file.write(outText)
            output_file.close()
            break #<=== if a match is found, we break the for loop. Otherwise the for loop continues
    input_file.close()
    return match

def main():
    if not os.path.isdir(txtSubPath): ### <=== keep all texts files in this subfolder
        os.makedirs(txtSubPath)
    csvFile = open(DownloadLogFile, "r") #<===A csv file with the list of 10k file names (the file should have no header)
    csvReader = csv.reader(csvFile, delimiter=",")
    csvData = list(csvReader)
    logFile = open(ReadLogFile, "a+") #<===A log file to track which file is successfully extracted
    logWriter = csv.writer(logFile, quoting = csv.QUOTE_NONNUMERIC)
    logWriter.writerow(["filename","extracted"])
    i=1
    for rowData in csvData[1:]:
        if len(rowData):
            FileName = rowData[7]
            if ".htm" in FileName:
                match=readHTML(FileName)
                if match:
                    logWriter.writerow([FileName,"yes"])
                else:
                    logWriter.writerow([FileName,"no"])
        i=i+1
    csvFile.close()
    logFile.close()
    print("done!")

if __name__ == "__main__":
    main()
CSV of file info
Your error message explains it is not looking inside the "HTML" directory for the file.
I would avoid using os.chdir to change the working directory, as it is likely to complicate things. Instead, use pathlib and join paths explicitly so that the file paths are less error-prone.
Try with this:
from pathlib import Path
base_dir = Path('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv
htmlSubPath = base_dir.joinpath("HTML") #<===The subfolder with the 10-K files in HTML format
txtSubPath = base_dir.joinpath("txt") #<===The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv" #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv" #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms
def readHTML(FileName):
    input_path = htmlSubPath.joinpath(FileName)
    output_path = txtSubPath.joinpath(FileName.replace(".htm",".txt"))
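As a rough sketch of how the pathlib approach can also help narrow down a FileNotFoundError, the helpers below build both paths and report which part of the input path is missing. The function names are hypothetical, and the "HTML" and "txt" folder names are taken from the question.

```python
from pathlib import Path


def build_paths(base_dir, file_name):
    """Return (input_path, output_path) for one 10-K file under base_dir."""
    base_dir = Path(base_dir)
    input_path = base_dir / "HTML" / file_name
    # with_suffix swaps the extension, so .htm and .html both become .txt
    output_path = base_dir / "txt" / Path(file_name).with_suffix(".txt").name
    return input_path, output_path


def describe_missing(input_path):
    """Say which part of the path is missing; returns None if the file exists."""
    input_path = Path(input_path)
    if input_path.is_file():
        return None
    if not input_path.parent.is_dir():
        return "folder not found: " + str(input_path.parent)
    return "file not found in existing folder: " + input_path.name
```

Calling `describe_missing` on the input path before `open` distinguishes a wrong folder from a file name in the CSV that simply was never downloaded.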
I am trying to copy the elements of one doc file to another. The text part is easy; the images are where it gets tricky.
Attaching an image to explain the structure of the doc: Just some text and 1 image.
from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')
new_doc = Document()
for elem in doc.element.body:
    new_doc.element.body.append(elem)
new_doc.save('/Users/neha/Desktop/out.docx')
This gets me the whole structure of the doc in new_doc, but the image is still blank.
The good thing is that the blank image is in the right place, so I thought of getting the byte-level data from the original image and inserting it into the new doc. Here is how I extended the above code:
from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')
new_doc = Document()
for elem in doc.element.body:
    new_doc.element.body.append(elem)
im = doc.inline_shapes[0]
blip = im._inline.graphic.graphicData.pic.blipFill.blip
rId = blip.embed
doc_part = doc.part
image_part = doc_part.related_parts[rId]
bytes = image_part._blob #Here I get the byte level data for the image
im2 = new_doc.inline_shapes[0]
blip2 = im2._inline.graphic.graphicData.pic.blipFill.blip
rId2 = blip2.embed
document_part2 = new_doc.part
document_part2.related_parts[rId2]._blob = bytes
new_doc.save('/Users/neha/Desktop/out.docx')
But the image still shows empty in the new_doc. What should I do from here?
I figured out a solution a couple of days back. The text loses its formatting this way, but the images are placed correctly.
The idea is: for each paragraph in the source doc, if there is text, I write it to the dest doc. If an inline image is present, I add a unique identifier at that place in the dest doc (see the docxtpl documentation for how these identifiers and contexts work). These identifiers and docxtpl proved to be particularly useful here. Using the unique identifiers I then create a 'context' (as shown below), which is basically a map from each unique identifier to its InlineImage, and finally I render this context.
Below is my code (apologies for the unnecessary indentation; I copied it directly from my text editor, and shift+tab doesn't work here :P)
from docxtpl import DocxTemplate, InlineImage
from docx import Document
import io
import xml.etree.ElementTree as ET

dest = DocxTemplate()
source = Document(source_path)
context = {}
ims = [im for im in source.inline_shapes]
im_addresses = []
im_streams = []
count = 0
# Save every inline image of the source doc to disk
for im in ims:
    blip = im._inline.graphic.graphicData.pic.blipFill.blip
    rId = blip.embed
    doc_part = source.part
    image_part = doc_part.related_parts[rId]
    byte_data = image_part._blob
    image_stream = io.BytesIO(byte_data)
    im_streams.append(image_stream)
    image_name = self.img_path+"img_"+"_"+str(count)+".jpeg"
    with open(image_name, "wb") as fh:
        fh.write(byte_data)
    im_addresses.append(image_name)
    count += 1
paras = source.paragraphs
im_idx = 0
# Copy the text and leave a placeholder wherever a paragraph holds an image
for para in paras:
    p = dest.add_paragraph()
    r = p.add_run()
    if(para.text):
        r.add_text(para.text)
    root = ET.fromstring(para._p.xml)
    namespace = {'wp':"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
    inlines = root.findall('.//wp:inline',namespace)
    if(len(inlines) > 0):
        uid = "img_"+str(im_idx)
        r.add_text("{{ " + uid + " }}")
        context[uid] = InlineImage(dest,im_addresses[im_idx])
        im_idx += 1
try:
    dest.render(context)
except Exception as e:
    print(e)
dest.save(dest_path)
PS: If a paragraph has two images, this code will not handle it correctly. One would have to change the following:
if(len(inlines) > 0):
    uid = "img_"+str(im_idx)
    r.add_text("{{ " + uid + " }}")
    context[uid] = InlineImage(dest,im_addresses[im_idx])
    im_idx += 1
A for loop would have to be added inside the if statement as well. I didn't need that, since my images were usually big enough that they always ended up in different paragraphs. Just a side note for anyone who may need it.
Cheers!
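For anyone who does need several images per paragraph, the loop mentioned in the PS might look roughly like the sketch below. It only covers the bookkeeping (one identifier and one placeholder per inline image); `placeholders_for` is a hypothetical helper, and the docxtpl calls from the answer are shown as comments rather than executed.

```python
def placeholders_for(inline_count, next_index):
    """Return (uids, tags, new_next_index) for one paragraph's inline images."""
    uids = ["img_" + str(next_index + offset) for offset in range(inline_count)]
    tags = ["{{ " + uid + " }}" for uid in uids]
    return uids, tags, next_index + inline_count


# Inside the paragraph loop of the answer this would replace the if block:
#     uids, tags, im_idx_after = placeholders_for(len(inlines), im_idx)
#     for offset, (uid, tag) in enumerate(zip(uids, tags)):
#         r.add_text(tag)
#         context[uid] = InlineImage(dest, im_addresses[im_idx + offset])
#     im_idx = im_idx_after
```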
You could try:
Extracting the images from the first document by unzipping the .docx file (per How can I search a word in a Word 2007 .docx file?)
Save those images to the file system (as foo.png, for instance)
Generate the new .docx file with Python and add the .png file using document.add_picture('foo.png').
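The first two steps can be sketched with the standard library alone, since a .docx file is a zip archive that keeps its embedded images under word/media/. The function names and the output folder are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def media_entries(names):
    """Keep only archive members stored under word/media/."""
    return [n for n in names if n.startswith("word/media/") and not n.endswith("/")]


def extract_docx_images(docx_path, out_dir):
    """Copy every embedded media file out of a .docx into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in media_entries(archive.namelist()):
            target = out_dir / Path(name).name
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Each returned path can then be passed to document.add_picture(...) when building the new document.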
This problem is solved by this package https://docxtpl.readthedocs.io/en/latest/
I am hoping to extract the change in cost of living between one city and many other cities. I plan to list the cities I would like to compare in a CSV file and use this list to create the web link that takes me to the page with the information I am looking for.
Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
Unfortunately I am running into several challenges. Any assistance with the following challenges is greatly appreciated!
1. The output only shows the percentage, but no indication whether it is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' to add '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
3. (Related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?
CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
PREFERRED CSV FORMAT:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup
#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)
    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th",class_="percent")
        percent_difference_title = percent_difference.span['class']
        if percent_difference_title == "expensiver":
            WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
        else:
            WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1
Answers:
Question 1: the class of the span is a list, you need to check if expensiver is inside this list. In other words, replace:
if percent_difference_title == "expensiver"
with:
if "expensiver" in percent_difference.span['class']
Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you want only one record per city, call writerow() outside of the loop over the trs.
Other issues:
open csv file for writing before the loop
use with context managers while working with files
try to follow PEP8 style guide
Here's the code with modifications:
import requests
import csv
from bs4 import BeautifulSoup
BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'
with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)
            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)
For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
Make sure you understand what changes I have made. Let me know if you need further help.
csv.writer.writerow() takes a sequence and makes each element a column; normally you'd give it a list with columns, but you are passing in strings instead; that'll add individual characters as columns instead.
Just build a list, then write it to the CSV file.
First, open the CSV file once, not for every separate city; you are clearing out the file every time you open it.
import requests
import csv
from bs4 import BeautifulSoup
HomeCity = 'Phoenix'
with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])
    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
        titles = soup.select("table.comparison tr.expandable")
        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
So, first of all, one passes the writerow method an iterable, and each object in that iterable gets written with commas separating them. So if you give it a string, then each character gets separated:
WriteResultsFile.writerow('hello there')
writes
h,e,l,l,o, ,t,h,e,r,e
But
WriteResultsFile.writerow(['hello', 'there'])
writes
hello,there
That's why you are getting results like
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
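The difference is easy to reproduce without touching a file. The sketch below (Python 3, writing into an in-memory buffer; `row_as_csv` is a helper name made up for this example) shows what writerow produces for a string versus a list.

```python
import csv
import io


def row_as_csv(row):
    """Write one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


# A string is iterated character by character, one column each.
print(row_as_csv("nyc+48%"))
# A list gives one column per element.
print(row_as_csv(["nyc", "+48%"]))
```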
The rest of your problems are errors in your webscraping. First of all, when I scrape the site, searching for tables with CSS class "comparison" gives me None. So I had to use
expatistan_table = soup_expatistan.find("table","comparison")
Now, the reason your "if statement is broken" is because
percent_difference.span['class']
returns a list. If we modify that to
percent_difference.span['class'][0]
things will work the way you expect.
Now, your real issue is that inside the innermost loop you are finding the % change in price for the individual items. You want these as items in one row of price differences, not as individual rows. So I declare an empty list items, append each percent_difference.span.string to it, and then write the row outside the innermost loop, like so:
items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th","percent")
    percent_difference_title = percent_difference.span["class"][0]
    print percent_difference_title
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
The final error is that in the while loop you re-open the CSV file and overwrite everything, so you only have the final city in the end. Accounting for all these errors (many of which you should have been able to find without help) leaves us with:
#Prepare CSV writer.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
#Write the header once, before looping over the cities.
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    print url
    soup_expatistan = BeautifulSoup(page)
    expatistan_table = soup_expatistan.find("table","comparison")
    expatistan_titles = expatistan_table.find_all("tr","expandable")
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th","percent")
        percent_difference_title = percent_difference.span["class"][0]
        print percent_difference_title
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i+=1
YAA - Yet Another Answer.
Unlike the other answers, this treats the data as a series of key-value pairs, i.e. a list of dictionaries, which are then written to CSV. A list of wanted fields is provided to the csv writer (DictWriter), which discards additional information (beyond the specified fields) and blanks missing information. Also, should the order of the information on the original page change, this solution is unaffected.
I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for this to happen nicely (see dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see quoting parameter).
import csv
import requests
from bs4 import BeautifulSoup
#Read text file (splitlines drops the trailing newlines, which would otherwise end up in the URL)
with open("City.txt") as cities_h:
    cities = cities_h.read().splitlines()
home_city = "Phoenix"
city_data = []
for city in cities:
    url = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, from_encoding = resp.encoding)
    titles = soup.select("table.comparison tr.expandable")
    if titles:
        data = {}
        for title in titles:
            name = title.find("th", class_ = "clickable")
            diff = title.find("th", class_ = "percent")
            exp = bool(diff.find("span", class_ = "expensiver"))
            data[name.text] = ("+" if exp else "-") + diff.span.text
        data["City"] = soup.find("strong", class_ = "city-2").text
        city_data.append(data)
with open("Expatistan.csv","w") as csv_h:
    fields = \
    [
        "City",
        "Food",
        "Housing",
        "Clothes",
        "Transportation",
        "Personal Care",
        "Entertainment"
    ]
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(city_data)
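The DictWriter behaviour described above (extra keys dropped, missing fields left blank, everything quoted) can be seen in isolation with a small in-memory sketch. The field list is shortened and the sample row is invented for the example; `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io

FIELDS = ["City", "Food", "Housing"]  # shortened field list for the example


def rows_to_csv(rows, fields=FIELDS):
    """Write dictionaries to CSV text, ignoring unknown keys and blanking missing ones."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        quoting=csv.QUOTE_ALL,    # quote every value unconditionally
        extrasaction="ignore",    # silently drop keys that are not in fields
        restval="",               # value used when a field is missing
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# "Unwanted" is dropped and the missing "Housing" column is written as an empty string.
print(rows_to_csv([{"City": "new-york-city", "Food": "+48%", "Unwanted": "x"}]))
```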
All,
I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.
The data I'd like to access is obtainable in either the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.
Using a couple of sources on this website I've managed to cobble together the code below, which pulls the data out and saves it to a file, but it contains the \n designators. Try as I might, I can't get a correct CSV file to save out.
I am sure it's something simple but need a bit of help if possible!
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os
from string import replace
bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'
data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))
data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()
file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'
file = open(file_name,"wb")
file.write(data)
file.close()
Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!
Try starting like this:
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
Then you can use:
partition('=')[2] to cut off the "var gs_csv" bit.
strip(' \n"') to remove unwanted characters at each end (space, newline, ")
replace("\\n","\n") to sort out the new lines.
Incidentally, replace is a string method, so you don't have to import it separately, you can just do data.replace(....
Finally, you need to separate it as csv. You could save it and reopen it, then load it into a csv.reader. You could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:
for line in data.splitlines():
    row = line.split(",")
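Put together, the string steps above can be tried on a made-up sample before pointing them at the real page. The sketch is Python 3, the sample text is invented (it is not the site's actual output), and `js_assignment_to_rows` is a hypothetical name.

```python
def js_assignment_to_rows(javascript_text):
    """Turn text like 'var gs_csv="a,b\\nc,d"' into a list of rows."""
    value = javascript_text.partition("=")[2]   # drop the "var gs_csv" part
    value = value.strip(' \n"')                 # trim spaces, newlines and quotes
    value = value.replace("\\n", "\n")          # literal backslash-n -> real newline
    return [line.split(",") for line in value.splitlines()]


# Invented sample: a JavaScript assignment holding two CSV lines.
sample = 'var gs_csv="HDR,SAMPLE\\nROW,1,2"\n'
for row in js_assignment_to_rows(sample):
    print(row)
```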
SOLUTION
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time
bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"
bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6
data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()
csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))
for line in javascriptdata.splitlines():
    row = line.split(",")
    csvwriter.writerow(row)
del csvwriter