Python scraping and outputting to Excel - python
I am trying to create a web crawler. I am currently just testing it on YouTube, but I intend to expand it to do more later. For now, I am still learning.
Currently I am trying to export the information to a CSV file. The code below is what I have at the moment, and it seemed to be working great when I was running it to pull titles and descriptions. However, when I added code to get the "views" and "likes", it messed up the output file because those values contain commas.
Does anyone know what I can do to get around this?
import urllib2
import __builtin__
from selenium import webdriver
from selenium.common.exceptions import NoSuchAttributeException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from time import sleep
from random import randint
from lxml import etree
browser = webdriver.Firefox()
time.sleep(2)
browser.get("https://www.youtube.com/results?search_query=funny")
time.sleep(2)
browser.find_element_by_xpath("//*[@id='section-list']/li/ol/li[1]/div/div/div[2]/h3/a").click()
time.sleep(2)
url = browser.current_url
title = browser.find_element_by_xpath("//*[@id='eow-title']").text
views = browser.find_element_by_xpath("//*[@id='watch7-views-info']/div[1]").text
likes = browser.find_element_by_xpath("//*[@id='watch-like']/span").text
dislikes = browser.find_element_by_xpath("//*[@id='watch-dislike']/span").text
tf = 'textfile.csv'
f2 = open(tf, 'a+')
f2.write(', '.join([data.encode('utf-8') for data in [url]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [title]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [views]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [likes]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [dislikes]]) + '\n')
f2.close()
First, whether you see those numbers with commas (as thousands separators) rather than with points depends on the language and regional settings that YouTube detects for your browser.
Once you have your views, likes and dislikes as strings, you could perform an operation like the following to get rid of the commas:
likes = "3,141,592"
likes = likes.replace(',', '') # likes is now: "3141592"
likes = int(likes) # likes is now an actual integer, not just a string
This works because those three values are all integers, so you don't have to worry about commas or points that actually matter, i.e. ones that mark the start of a fractional part.
Finally, good examples of how to use the csv module are all over the internet; I would suggest the one from Python Module of the Week. If you understand the examples, you'll be able to change your code to use this module, which quotes fields for you instead of requiring you to assemble comma-separated lines by hand.
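For instance, here is a minimal sketch of that approach, assuming Python 2 like the question's code; the values below are placeholders, not scraped data. The csv module quotes any field that contains a comma, so the columns stay aligned:
import csv

# Placeholder values standing in for the scraped strings from the question
url = "https://www.youtube.com/watch?v=example"
title = "Some funny video"
views = "1,234,567 views"
likes = "12,345"
dislikes = "678"

with open('textfile.csv', 'ab') as f:  # 'ab' suits Python 2's csv module; on Python 3 use 'a' with newline=''
    writer = csv.writer(f)
    # Fields containing commas are quoted automatically, so the row keeps exactly five columns
    writer.writerow([field.encode('utf-8') for field in [url, title, views, likes, dislikes]])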
You needn't write raw CSV format yourself. Use the csv module: https://docs.python.org/2/library/csv.html.
Sample code:
import StringIO
import csv

stringio = StringIO.StringIO()
csv_writer = csv.writer(stringio)
csv_writer.writerow([data.encode('utf-8') for data in [url]])
csv_writer.writerow([data.encode('utf-8') for data in [title]])
csv_writer.writerow([data.encode('utf-8') for data in [views]])
csv_writer.writerow([data.encode('utf-8') for data in [likes]])
csv_writer.writerow([data.encode('utf-8') for data in [dislikes]])
with open('textfile.csv', 'a+') as fp:
    fp.write(stringio.getvalue())
I can't understand the purpose of [data.encode('utf-8') for data in [url]]; perhaps you mean to write all fields as a single row:
csv_writer.writerow([data.encode('utf-8') for data in [url, title, views, likes, dislikes]])
You can also try csv.writer(open('textfile.csv', 'a+')) directly, without writing to a string buffer first.
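A minimal sketch of that simpler variant, assuming the url, title, views, likes and dislikes strings from the question's scraping code and Python 2:
import csv

with open('textfile.csv', 'ab') as f:  # binary mode for Python 2's csv module
    csv_writer = csv.writer(f)
    # One row per video; embedded commas in views/likes are quoted automatically
    csv_writer.writerow([data.encode('utf-8') for data in [url, title, views, likes, dislikes]])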
Related
Iterate through an HTML file and extract data to a CSV file
I have searched high and low for a solution, but non have quite fit what I need to do. I have an html page that is saved, as a file, lets call it sample.html and I need to extract recurring json data from it. An example file is as follows: I need to get the info from these files regulary, so the amount of objects change every time, an object would be considered as "{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}" I need to get each of the values to a CSV file, with column headings being SpecificIdent, SpecificNum, Meter, Power, WPower, Snumber, isI. The associated data would be the rows from each. I apologize if this is a basic question in Python, but I am pretty new to it and cannot fathom the best way to do this. Any assistance would be greatly appreciated. Kind regards A <html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,"WPower":null,"SNumber":"0703-403548","isI":false},{"SpecificIdent":6364,"SpecificNum":27,"Meter":7768,"Power":29969.0,"WPower":null,"SNumber":"467419","isI":false},{"SpecificIdent":6583,"SpecificNum":51,"Meter":7027,"Power":36968.0,"WPower":null,"SNumber":"JE1449-521248","isI":false},{"SpecificIdent":6612,"SpecificNum":57,"Meter":12828,"Power":53918.0,"WPower":null,"SNumber":"JE1509-534327","isI":false},{"SpecificIdent":7139,"SpecificNum":305,"Meter":6264,"Power":33101.0,"WPower":null,"SNumber":"JE1449-521204","isI":false},{"SpecificIdent":7551,"SpecificNum":116,"Meter":0,"Power":21569.0,"WPower":null,"SNumber":"JE1449-521252","isI":false},{"SpecificIdent":7643,"SpecificNum":56,"Meter":7752,"Power":40501.0,"WPower":null,"SNumber":"JE1449-521200","isI":false},{"SpecificIdent":8653,"SpecificNum":49,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":9733,"SpecificNum":142,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":10999,"SpecificNum":20,"Meter":7723,"Power":6987.0,"WPower":null,"SNumber":"JE1608-625534","isI":false},{"SpecificIdent":12086,"SpecificNum":24,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":14590,"SpecificNum":35,"Meter":394,"Power":10941.0,"WPower":null,"SNumber":"BN1905-944799","isI":false},{"SpecificIdent":14954,"SpecificNum":100,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"517163","isI":false},{"SpecificIdent":14995,"SpecificNum":58,"Meter":0,"Power":38789.0,"WPower":null,"SNumber":"JE1444-511511","isI":false},{"SpecificIdent":15245,"SpecificNum":26,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"430149","isI":false},{"SpecificIdent":18824,"SpecificNum":55,"Meter":8236,"Power":31358.0,"WPower":null,"SNumber":"0703-310839","isI":false},{"SpecificIdent":20745,"SpecificNum":41,"Meter":0,"Power":60963.0,"WPower":null,"SNumber":"JE1447-517260","isI":false},{"SpecificIdent":31584,"SpecificNum":11,"Meter":0,"Power":3696.0,"WPower":null,"SNumber":"467154","isI":false},{"SpecificIdent":32051,"SpecificNum":40,"Meter":7870,"Power":13057.0,"WPower":null,"SNumber":"JE1608-625593","isI":false},{"SpecificIdent":32263,"SpecificNum":4,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":33137,"SpecificNum":132,"Meter":5996,"Power":26650.0,"WPower":null,"SNumber":"459051","isI":false},{"SpecificIdent":33481,"SpecificNum":144,"Meter":4228,"Power":16136.0,"WPo
wer":null,"SNumber":"JE1603-617807","isI":false},{"SpecificIdent":33915,"SpecificNum":145,"Meter":5647,"Power":3157.0,"WPower":null,"SNumber":"JE1518-549610","isI":false},{"SpecificIdent":36051,"SpecificNum":119,"Meter":2923,"Power":12249.0,"WPower":null,"SNumber":"135493","isI":false},{"SpecificIdent":37398,"SpecificNum":21,"Meter":58,"Power":5540.0,"WPower":null,"SNumber":"BN1925-982761","isI":false},{"SpecificIdent":39024,"SpecificNum":50,"Meter":7217,"Power":38987.0,"WPower":null,"SNumber":"JE1445-511599","isI":false},{"SpecificIdent":39072,"SpecificNum":59,"Meter":5965,"Power":32942.0,"WPower":null,"SNumber":"JE1449-521199","isI":false},{"SpecificIdent":40601,"SpecificNum":9,"Meter":0,"Power":59655.0,"WPower":null,"SNumber":"JE1447-517150","isI":false},{"SpecificIdent":40712,"SpecificNum":37,"Meter":0,"Power":5715.0,"WPower":null,"SNumber":"JE1502-525840","isI":false},{"SpecificIdent":41596,"SpecificNum":53,"Meter":8803,"Power":60669.0,"WPower":null,"SNumber":"JE1503-527155","isI":false},{"SpecificIdent":50276,"SpecificNum":30,"Meter":2573,"Power":4625.0,"WPower":null,"SNumber":"JE1545-606334","isI":false},{"SpecificIdent":51712,"SpecificNum":69,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":56140,"SpecificNum":10,"Meter":5169,"Power":26659.0,"WPower":null,"SNumber":"JE1547-609024","isI":false},{"SpecificIdent":56362,"SpecificNum":6,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":58892,"SpecificNum":113,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65168,"SpecificNum":5,"Meter":12739,"Power":55833.0,"WPower":null,"SNumber":"JE1449-521284","isI":false},{"SpecificIdent":65255,"SpecificNum":60,"Meter":5121,"Power":27784.0,"WPower":null,"SNumber":"JE1449-521196","isI":false},{"SpecificIdent":65665,"SpecificNum":47,"Meter":11793,"Power":47576.0,"WPower":null,"SNumber":"JE1509-534315","isI":false},{"SpecificIdent":65842,"SpecificNum":8,"Meter":10783,"Power":46428.0,"WPower":null,"SNumber":"JE1509-534401","isI":false},{"SpecificIdent":65901,"SpecificNum":22,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65920,"SpecificNum":17,"Meter":9316,"Power":38242.0,"WPower":null,"SNumber":"JE1509-534360","isI":false},{"SpecificIdent":66119,"SpecificNum":43,"Meter":12072,"Power":52157.0,"WPower":null,"SNumber":"JE1449-521259","isI":false},{"SpecificIdent":70018,"SpecificNum":34,"Meter":11172,"Power":49706.0,"WPower":null,"SNumber":"JE1449-521285","isI":false},{"SpecificIdent":71388,"SpecificNum":54,"Meter":6947,"Power":36000.0,"WPower":null,"SNumber":"JE1445-512406","isI":false},{"SpecificIdent":71892,"SpecificNum":36,"Meter":15398,"Power":63691.0,"WPower":null,"SNumber":"JE1447-517256","isI":false},{"SpecificIdent":72600,"SpecificNum":38,"Meter":14813,"Power":62641.0,"WPower":null,"SNumber":"JE1447-517189","isI":false},{"SpecificIdent":73645,"SpecificNum":2,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77208,"SpecificNum":28,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77892,"SpecificNum":15,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":78513,"SpecificNum":31,"Meter":6711,"Power":36461.0,"WPower":null,"SNumber":"JE1445-511601","isI":false},{"SpecificIdent":79531,"SpecificNum":18,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html> I have tried examples from bs4, jsontoxml, and others, but I am sure there is a simple way to iterate and extract this?
I would harness Python's standard library in the following way:
import csv
import json
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if data.strip():
            self.data = data

parser = MyHTMLParser()
with open("sample.html", "r") as f:
    parser.feed(f.read())

with open('sample.csv', 'w', newline='') as csvfile:
    fieldnames = ['SpecificIdent', 'SpecificNum', 'Meter', 'Power', 'WPower', 'SNumber', 'isI']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(json.loads(parser.data))
which creates a file starting with the following lines:
SpecificIdent,SpecificNum,Meter,Power,WPower,SNumber,isI
2588,29,0,0.0,,,False
3716,39,1835,11240.0,,0703-403548,False
6364,27,7768,29969.0,,467419,False
6583,51,7027,36968.0,,JE1449-521248,False
6612,57,12828,53918.0,,JE1509-534327,False
7139,305,6264,33101.0,,JE1449-521204,False
7551,116,0,21569.0,,JE1449-521252,False
7643,56,7752,40501.0,,JE1449-521200,False
8653,49,0,0.0,,,False
Disclaimer: this assumes the JSON array you want is the last text element which is not empty (i.e. contains at least one non-whitespace character).
There is a Python library called BeautifulSoup that you could use to parse the whole HTML file:
# pip install bs4
from bs4 import BeautifulSoup

html = BeautifulSoup(your_html)
From here on, you can perform any actions upon the HTML. In your case, you just need to find the <pre> element and get its contents. This can be achieved easily:
pre = html.body.find('pre')
text = pre.text
Finally, you need to parse the text, which appears to be JSON. You can do that with Python's built-in json library:
import json

result = json.loads(text)
Now we need to convert this to a CSV file. This can be done using the csv library:
import csv

with open('GFG', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent", "SpecificNum", "Meter", "Power", "WPower", "SNumber", "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
Finally, your code should look something like this:
from bs4 import BeautifulSoup
import json
import csv

with open('raw.html', 'r') as f:
    raw = f.read()

html = BeautifulSoup(raw)
pre = html.body.find('pre')
text = pre.text
result = json.loads(text)

with open('result.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent", "SpecificNum", "Meter", "Power", "WPower", "SNumber", "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
How to scrape data once a day and write it to csv
I'm a total noob; I'm just starting with web scraping as a hobby. I want to scrape data from a forum (the total number of posts, the total number of subjects and the number of all users) from https://www.fly4free.pl/forum/ (photo of the data I want to scrape). Watching some tutorials I've come up with this code:
from bs4 import BeautifulSoup
import requests
import datetime
import csv

source = requests.get('https://www.fly4free.pl/forum/').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('4fly_forum.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Data i godzina', 'Wszytskich postów', 'Wszytskich tematów', 'Wszytskich użytkowników'])

czas = datetime.datetime.now()
czas = czas.strftime("%Y-%m-%d %H:%M:%S")
print(czas)

dane = soup.find('p', class_='genmed')

posty = dane.find_all('strong')[0].text
print(posty)
tematy = dane.find_all('strong')[1].text
print(tematy)
user = dane.find_all('strong')[2].text
print(user)
print()

csv_writer.writerow([czas, posty, tematy, user])
csv_file.close()
I don't know how to make it run once a day and how to append the data to the file once a day. Sorry if my questions are infantile for you pros ;), it's my first training assignment. Also my resulting CSV file doesn't look nice; I would like the data nicely formatted into columns. Any help and insight will be much appreciated. Thanks in advance, Dejvciu
You can use the Schedule library in Python to do this. First install it using pip install schedule. Then you can modify your code to run at intervals of your choice:
import schedule
import time

def scrape():
    # your web scraping code here
    print('web scraping')

schedule.every().day.at("10:30").do(scrape)  # change 10:30 to a time of your choice

while True:
    schedule.run_pending()
    time.sleep(1)
This will run the web scraping script every day at 10:30, and you can easily host it for free to make it run continually. Here's how you would save the results to a CSV in a nicely formatted way, with the field names (czas, posty, tematy and user) as column names:
import csv
from os import path

# This avoids appending the headers (fieldnames, i.e. column names) every time the script runs.
# Headers are written to the csv only once, when the file does not exist yet.
file_status = path.isfile('filename.csv')

with open('filename.csv', 'a+', newline='') as csvfile:
    fieldnames = ['czas', 'posty', 'tematy', 'user']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    if not file_status:
        writer.writeheader()
    writer.writerow({'czas': czas, 'posty': posty, 'tematy': tematy, 'user': user})
I'm also not very experienced, but I think that to run it once a day you can use your computer's task scheduler. That will run your script once every day. Maybe this video helps you with the task scheduler: https://www.youtube.com/watch?v=s_EMsHlDPnE
Writing multiple files from websites in Python
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_csv('env_sequences.csv')
Namedf = df['Name']
Uniprotdf = df['ID']

for row in Uniprotdf:
    theurl = 'https://www.uniprot.org/uniprot/' + row + '.fasta'
    page = requests.get(theurl).content

for row in Namedf:
    fasta = open(row + '.txt', 'w')
    fasta.write(page)
    fasta.close()

#Sample website: https://www.uniprot.org/uniprot/P04578.fasta
I have a .csv file, from which I am using the column 'ID' to generate links to websites from which I want to download the content and save it under the corresponding name from the 'Name' column within the same .csv. The code ceases to work in the second for loop, where I get a TypeError for trying to use the page variable within the fasta.write() function. Yet, if I print(page) I am able to output the text that I'm looking to have in each file. Is this a case of me having to convert HTML into a string? I am unsure how to proceed from here.
For the given url, if you print the content of the page, you'll notice that it has a b'' prefix, which indicates it's in binary format (bytes):
print (page)
b'>sp|P04578|ENV_HV1H2 Envelope glycoprotein gp160 OS=Human immunodeficiency virus type 1 group M subtype B (isolate HXB2) OX=11706 GN=env PE=1 SV=2\nMRVKEKYQHLWRWGWRWGTMLLGMLMICSATEKLWVTVYYGVPVWKEATTTLFCASDAKA\nYDTEVHNVWATHACVPTDPNPQEVVLVNVTENFNMWKNDMVEQMHEDIISLWDQSLKPCV\nKLTPLCVSLKCTDLKNDTNTNSSSGRMIMEKGEIKNCSFNISTSIRGKVQKEYAFFYKLD\nIIPIDNDTTSYKLTSCNTSVITQACPKVSFEPIPIHYCAPAGFAILKCNNKTFNGTGPCT\nNVSTVQCTHGIRPVVSTQLLLNGSLAEEEVVIRSVNFTDNAKTIIVQLNTSVEINCTRPN\nNNTRKRIRIQRGPGRAFVTIGKIGNMRQAHCNISRAKWNNTLKQIASKLREQFGNNKTII\nFKQSSGGDPEIVTHSFNCGGEFFYCNSTQLFNSTWFNSTWSTEGSNNTEGSDTITLPCRI\nKQIINMWQKVGKAMYAPPISGQIRCSSNITGLLLTRDGGNSNNESEIFRPGGGDMRDNWR\nSELYKYKVVKIEPLGVAPTKAKRRVVQREKRAVGIGALFLGFLGAAGSTMGAASMTLTVQ\nARQLLSGIVQQQNNLLRAIEAQQHLLQLTVWGIKQLQARILAVERYLKDQQLLGIWGCSG\nKLICTTAVPWNASWSNKSLEQIWNHTTWMEWDREINNYTSLIHSLIEESQNQQEKNEQEL\nLELDKWASLWNWFNITNWLWYIKLFIMIVGGLVGLRIVFAVLSIVNRVRQGYSPLSFQTH\nLPTPRGPDRPEGIEEEGGERDRDRSIRLVNGSLALIWDDLRSLCLFSYHRLRDLLLIVTR\nIVELLGRRGWEALKYWWNLLQYWSQELKNSAVSLLNATAIAVAEGTDRVIEVVQGACRAI\nRHIPRRIRQGLERILL\n'
Changing the 'w' to 'wb' while opening the file should fix it. Also, using with open() is the more Pythonic way of handling files:
for row in Namedf:
    with open('url.txt', 'wb') as fasta:
        fasta.write(page)
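Putting it together, a rough sketch of the whole loop might look like this, assuming the 'Name' and 'ID' columns of the question's CSV line up row-for-row (zip pairs them; the filename scheme is just the one the question implies):
import pandas as pd
import requests

df = pd.read_csv('env_sequences.csv')

# Pair each output name with its UniProt ID row-by-row
for name, uid in zip(df['Name'], df['ID']):
    theurl = 'https://www.uniprot.org/uniprot/' + uid + '.fasta'
    page = requests.get(theurl).content  # bytes, hence the 'wb' mode below
    with open(name + '.txt', 'wb') as fasta:
        fasta.write(page)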
Read data from api and populate .csv bug
I am trying to write a script (Python 2.7.11, Windows 10) to collect data from an API and append it to a csv file. The API I want to use returns data in JSON. It limits the number of displayed records, though, and pages them. So there is a maximum number of records you can get with a single query, and then you have to run another query, changing the page number. The API tells you how many pages a dataset is divided into. Let's assume that the maximum number of records per page is 100 and the number of pages is 2. My script:
import json
import urllib2
import csv

url = "https://some_api_address?page="
limit = "&limit=100"
myfile = open('C:\Python27\myscripts\somefile.csv', 'ab')

def api_iterate():
    for i in xrange(1, 2, 1):
        parse_url = url,(i),limit
        json_page = urllib2.urlopen(parse_url)
        data = json.load(json_page)
        for item in data['someobject']:
            print item ['some_item1'], ['some_item2'], ['some_item3']
        f = csv.writer(myfile)
        for row in data:
            f.writerow([str(row)])
This does not seem to work, i.e. it creates a csv file, but the file is not populated. There is obviously something wrong with either the part of the script which builds the address for the query, OR the part dealing with reading the JSON, OR the part dealing with writing the query results to csv. Or all of them. I have tried using other resources and tutorials, but at some point I got stuck and I would appreciate your assistance.
The url you have given provides a link to the next page as one of the objects. You can use this to iterate automatically over all of the pages. The script below gets each page, extracts two of the entries from the Dataobject array and writes them to an output.csv file:
import json
import urllib2
import csv

def api_iterate(myfile):
    url = "https://api-v3.mojepanstwo.pl/dane/krs_osoby"
    csv_myfile = csv.writer(myfile)
    cols = ['id', 'url']
    csv_myfile.writerow(cols)   # Write a header

    while True:
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()

        for data_object in data['Dataobject']:
            csv_myfile.writerow([data_object[col] for col in cols])

        try:
            url = data['Links']['next']   # Get the next url
        except KeyError as e:
            break

with open(r'e:\python temp\output.csv', 'wb') as myfile:
    api_iterate(myfile)
This will give you an output file looking something like:
id,url
1347854,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1347854
1296239,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1296239
705217,https://api-v3.mojepanstwo.pl/dane/krs_osoby/705217
802970,https://api-v3.mojepanstwo.pl/dane/krs_osoby/802970
Saving and Resuming on scraperwiki - CPU time
This is my first time doing this, so I'd better apologize in advance for my rookie mistakes. I'm trying to scrape legacy.com for the first page of results from searching for a first and last name within a state. I'm new to programming, and was using scraperwiki to write the code. It worked, but I ran out of CPU time long before the roughly 10,000 queries had time to process. Now I'm trying to save progress, catch when the time is running low, and then resume from where it left off. I can't get the save to work, and any help with the other parts would be appreciated as well. As of now I'm just grabbing links, but if there was a way to save the main content of the linked pages, that would be really helpful as well. Here's my code:
import scraperwiki
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

f = open('/tmp/workfile', 'w')
#read database, find last, start from there

def searchname(fname, lname, id, stateid):
    url = 'http://www.legacy.com/ns/obitfinder/obituary-search.aspx?daterange=Last1Yrs&firstname= %s &lastname= %s &countryid=1&stateid=%s&affiliateid=all' % (fname, lname, stateid)
    obits = urlopen(url)
    soup = BeautifulSoup(obits)
    obits_links = soup.findAll("div", {"class": "obitName"})
    print obits_links
    s = str(obits_links)
    id2 = int(id)
    f.write(s)
    #save the database here
    scraperwiki.sqlite.save(unique_keys=['id2'], data=['id2', 'fname', 'lname', 'state_id', 's'])

# Import Data from CSV
import scraperwiki
data = scraperwiki.scrape("https://dl.dropbox.com/u/14390755/legacy.csv")
import csv
reader = csv.DictReader(data.splitlines())
for row in reader:
    #scraperwiki.sqlite.save(unique_keys=['id'], 'fname', 'lname', 'state_id', data=row)
    FNAME = str(row['fname'])
    LNAME = str(row['lname'])
    ID = str(row['id'])
    STATE = str(row['state_id'])
    print "Person: %s %s" % (FNAME, LNAME)
    searchname(FNAME, LNAME, ID, STATE)

f.close()
f = open('/tmp/workfile', 'r')
data = f.read()
print data
At the bottom of the CSV loop, write each fname+lname+state combination with save_var. Then, right before that loop, add another loop that goes through the rows without processing them until it passes the saved value. You should be able to write entire web pages into the datastore, but I haven't tested that.
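A rough sketch of that checkpoint idea, assuming the classic scraperwiki.sqlite.save_var / get_var helpers (check your scraperwiki version for the exact names) and the searchname function from the question; the key format is just an illustration. Instead of a separate skip loop, the same effect is achieved with a flag that skips rows until the saved combination is passed:
import scraperwiki
import csv

data = scraperwiki.scrape("https://dl.dropbox.com/u/14390755/legacy.csv")
reader = csv.DictReader(data.splitlines())

# Last combination finished in a previous run, or None on the very first run
last_done = scraperwiki.sqlite.get_var('last_done', None)
skipping = last_done is not None

for row in reader:
    key = "%s|%s|%s" % (row['fname'], row['lname'], row['state_id'])
    if skipping:
        if key == last_done:
            skipping = False  # resume with the next row
        continue
    searchname(row['fname'], row['lname'], row['id'], row['state_id'])
    scraperwiki.sqlite.save_var('last_done', key)  # checkpoint after each query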