# my scraper script file
# -*- coding: utf-8 -*-
from selenium import webdriver
import csv
browser = webdriver.Firefox()
browser.get("http://web.com")
f = open("result.csv", 'w')
writer = csv.writer(f)
Then the first approach:
element = browser.find_element_by_xpath("xpath_addr")
temp = [element.get_attribute("innerHTML").encode("utf-8")]
print temp # ['\xec\x84\something\xa8']
writer.writerow(temp)
This produces a correct csv file in my language (e.g. 한글).
But in the second case, which I think is only slightly different:
element = browser.find_element_by_xpath("xpath_addr")
temp = element.get_attribute("innerHTML").encode("utf-8")
print temp # "한글"
writer.writerow(temp)
then the csv file is full of non-character garbage. What makes this difference? print also gives different results, but why? (It must be a problem caused by my limited knowledge of encoding.)
Firstly, the writerow interface expects a list-like object, so the first snippet matches that interface. But in your second snippet, the method assumes the string you passed as an argument is a list, and iterates over it as such, which is probably not what you wanted. You could try writerow([temp]) and see that it matches the output of the first case.
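A minimal sketch of the difference (Python 2; s stands in for your encoded string):

s = u'한글'.encode('utf-8')  # a 6-byte UTF-8 byte string
writer.writerow(s)           # iterates the string: one column per byte, hence the garbage
writer.writerow([s])         # one column containing the whole encoded string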
Secondly, I want to warn you that the Python 2 csv module is notorious for headaches with unicode; basically, unicode is unsupported. Try unicodecsv as a drop-in replacement for the csv module if you need to support unicode. Then you won't need to encode the strings before writing them to file; you just write the unicode objects directly and let the library handle the encoding.
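For example, a minimal sketch (assumes unicodecsv has been installed, e.g. with pip install unicodecsv):

import unicodecsv

with open('result.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerow([u'한글'])  # pass unicode objects; the library encodes on write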
I'm new to Python, Zapier and pretty much everything, so forgive me if this is easy or impossible...
I'm trying to import multiple CSVs into Zapier for an automated workflow, however they contain dot points that aren't encoded as UTF-8, which is all Zapier can read.
It consistently errors with:
'utf-8' codec can't decode byte 0x95 in position 829: invalid start byte
After talking to Zapier support, they've suggested using Python to find and replace these dot points with an asterisk or dash, then import the corrected CSV into my Zapier workflow.
This is what I have written so far as a Python action in Zapier (just trying to read the CSV to start with), with no luck:
import csv
with open(input_data['file'], 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Is this possible?
Thanks!
This is possible, but it's a little tricky. Zapier is confusing when it comes to files. On your computer, files are a series of bytes. But in Zapier, a file is usually a url that points to the actual file. This is great for cross-app compatibility, but tricky to work with in code.
You're trying to open a url as a file in Python, which isn't working. Instead, make a request for that file, then read the response as a series of bytes. Try this:
import csv
import io
import requests

# in Zapier, input_data['file'] is a url pointing at the file, not the file itself
file_data = requests.get(input_data['file'])
reader = csv.reader(file_data.content.decode('utf-8').splitlines(), delimiter=',')

result = io.StringIO()  # an in-memory string buffer to write the cleaned CSV into
writer = csv.writer(result)

for row in reader:
    # some modifications here; row is a list of cells, e.g.:
    # row = [cell.replace(...) for cell in row]
    writer.writerow(row)

return [{'data': result.getvalue()}]
The result buffer is there because you want to end up with a string that you can then re-package as a CSV in your virtual filesystem of choice (gDrive, Dropbox, etc).
You can also test this locally instead of in the Zapier editor (I find that's a bit easier to iterate with). Simply get the file url from the code step (it'll be something like https://zapier.com/engine/...) and make a local python file with:
input_data = {'file': 'https://zapier.com/engine/...'}
...
You'll also need to pip install requests if you don't have it.
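One hedged note on the decode error itself: byte 0x95 is the bullet character in Windows cp1252, so if the files are cp1252 rather than UTF-8, the find-and-replace step might look like this sketch (assumes the whole file is cp1252):

text = file_data.content.decode('cp1252')  # 0x95 decodes to the bullet u'\u2022'
text = text.replace(u'\u2022', '-')        # swap bullets for dashes
reader = csv.reader(text.splitlines(), delimiter=',')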
So I basically just want to have a list of all the pixel colour values that overlap written in a text file so I can then access them later.
The only problem is that the text file ends up with the set([...]) wrapper text written along with the values.
Here's my code:
import cv2
import numpy as np
import time

om = cv2.imread('spectrum1.png')
om = om.reshape(1, -1, 3)
om_list = om.tolist()
om_tuple = {tuple(item) for item in om_list[0]}
om_set = set(om_tuple)

im = cv2.imread('RGB.png')
im = cv2.resize(im, (100, 100))
im = im.reshape(1, -1, 3)
im_list = im.tolist()
im_tuple = {tuple(item) for item in im_list[0]}

ColourCount = om_set & set(im_tuple)

File = open('Weedlist', 'w')
File.write(str(ColourCount))
Also, if I run this program again but with a different picture for comparison, will it append the data or overwrite it? It's kinda hard to tell when just looking at numbers.
If you replace these lines:
im = cv2.imread('RGB.png')
File = open('Weedlist', 'w')
File.write(str(ColourCount))
with:
import sys
im = cv2.imread(sys.argv[1])
open(sys.argv[1] + 'Weedlist', 'w').write(str(list(ColourCount)))
you will get a new file for each input file and also you don't have to overwrite the RGB.png every time you want to try something new.
Files opened with mode 'w' will be overwritten. You can use 'a' to append.
You opened the file with the 'w' mode, write mode, which truncates (empties) the file when you open it. Use the 'a' append mode if you want data to be added to the end each time.
You are writing the str() conversion of a set object to your file:
ColourCount = om_set & set(im_tuple)
File = open('Weedlist', 'w')
File.write(str(ColourCount))
Don't use str to convert the whole object; format your data into a string you find easy to read back again. You probably want to add a newline too if you want each new entry to be added on a new line. Perhaps you want to sort the data as well, since a set lists items in an order determined by implementation details.
If comma-separated works for you, use str.join(); your set contains tuples of integer numbers, and it sounds as if you are fine with the repr() output per tuple, so we can re-use that:
with open('Weedlist', 'a') as outputfile:
    output = ', '.join([str(tup) for tup in sorted(ColourCount)])
    outputfile.write(output + '\n')
I used with there to ensure that the file object is automatically closed again after you are done writing; see Understanding Python's with statement for further information on what this means.
Note that if you plan to read this data again, the above is not going to be all that efficient to parse again. You should pick a machine-readable format. If you need to communicate with an existing program, you'll need to find out what formats that program accepts.
If you are programming that other program as well, pick a format that the other programming language supports. JSON is widely supported, for example: use the json module and convert your set to a list first; json.dump(sorted(ColourCount), fileobj) followed by fileobj.write('\n') to produce newline-separated JSON objects would do.
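A minimal sketch of that JSON approach, including reading the entries back (file name taken from the question):

import json

# write: one JSON array per line (sets aren't JSON-serializable, so sort to a list first)
with open('Weedlist', 'a') as fileobj:
    json.dump(sorted(ColourCount), fileobj)
    fileobj.write('\n')

# read back: JSON yields nested lists, so turn the inner lists into tuples again
with open('Weedlist') as fileobj:
    saved = [set(map(tuple, json.loads(line))) for line in fileobj]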
If that other program is coded in Python, consider using the pickle module, which writes Python objects to a file efficiently in a format the same module can load again:
import pickle

with open('Weedlist', 'ab') as picklefile:
    pickle.dump(ColourCount, picklefile)
and reading is as easy as:
sets = []
with open('Weedlist', 'rb') as picklefile:
    while True:
        try:
            sets.append(pickle.load(picklefile))
        except EOFError:
            break
See Saving and loading multiple objects in pickle file? as to why I use a while True loop there to load multiple entries.
How would you like the data to be written? Replace the final line by
File.write(str(list(ColourCount)))
Maybe you like that more.
If you run the program again, it will overwrite the previous content of the file. If you prefer to append the data, open the file with:
File = open('Weedlist', 'a')
First question here so forgive any lapses in the etiquette.
I'm new to Python. I have a small project I'm trying to accomplish both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately they don't use standard CSV format. They use a strange character to separate data: it's a ‡. I need it in CSV format in order to import it into another system. So what I need to do is take the data, replace the special character with a comma, and clean the data by removing whitespace and other minor things like unrecognized characters, so it's in the shape I need for the CSV import.
I want to learn some Python, so I figured I'd write it in Python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request but I can scale it when I understand how to retrieve and manipulate the data properly.
My code so far just trying to read and write two columns from the data:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
File "csvTest.py", line 14, in
f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident from that that I don't know how to manipulate the data much, nor read it properly. I was able to pull back some JSON data and output it, so I know the structure works at its core with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the HTML page. To get the raw file, use:
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
Then, your data is unicode, not a byte string, so you need to encode it first.
Here is your program, with some changes:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')

# encode to bytes, replace the delimiter bytes with commas, then split into lines
data = r.text.encode('utf-8').replace("\xc2\x87", ",").splitlines()

headers = data.pop(0).split(",")  # pop(0) consumes the header line
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for line in data:  # the header is already gone, so don't skip another row
        words = line.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n')  # split your data into lines
headers = lines[0].split('‡')
player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')

for line in lines[1:]:  # skip the headers line
    words = line.split('‡')  # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.
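A hedged variation: if data is a unicode object (as requests' r.text is), it may be more robust to split on the code point directly rather than on a source-encoded literal; this sketch assumes the delimiter really is the double dagger U+2021:

lines = data.split(u'\n')
headers = lines[0].split(u'\u2021')  # u'\u2021' is the double dagger ‡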
I am using the following code and it works well, except that when I open the output CSV in Excel it skips every other line (each data row is followed by a blank row). I have googled the csv module documentation and other examples on stackoverflow.com, and I found that I need to use DictWriter with the lineterminator set to '\n'. My own attempts to write that into the code have been foiled.
So I am wondering: is there a way for me to apply this (the lineterminator) to the whole file so that I do not have any lines skipped? And if so, how?
Here is the code:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv

page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
f = csv.writer(open("pe_ratio.csv", "w"))
f.writerow(["Name", "PE"])

soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")
f.writerow([all_data[2].getText()])
Thanks for your help in advance.
You need to open your file with the right options for the csv.writer class to work correctly. The module has universal newline support internally, so you need to turn off Python's universal newline support at the file level.
For Python 2, the docs say:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
For Python 3, they say:
If csvfile is a file object, it should be opened with newline=''.
Also, you should probably use a with statement to handle opening and closing your file, like this:
with open("pe_ratio.csv","wb") as f: # or open("pe_ratio.csv", "w", newline="") in Py3
writer = csv.writer(f)
# do other stuff here, staying indented until you're done writing to the file
First, since Yahoo provides an API that returns CSV files, maybe you can solve your problem that way? For example, this URL returns a CSV file containing prices, market cap, P/E and other metrics for all stocks in that industry. There is some more information in this Google Code project.
Your code only produces a two-row CSV because there are only two calls to f.writerow(). If the only piece of data you want from that page is the P/E ratio, this is almost certainly not the best way to do it, but you should pass to f.writerow() a tuple containing the value for each column. To be consistent with your header row, that would be something like:
f.writerow( ('Ford', all_data[2].getText()) )
Of course, that assumes that the P/E ratio will always be second in the list. If instead you wanted all the statistics provided on that page, you could try:
# scrape the html for the name and value of each metric
metrics = soup.findAll('td', 'yfnc_tablehead1')
values = soup.findAll('td', 'yfnc_tabledata1')

# create a list of tuples for the writerows method
def stripTag(tag): return tag.text
data = zip(map(stripTag, metrics), map(stripTag, values))

# write to csv file
f.writerows(data)
I'm wondering if anyone with a better understanding of Python and GAE can help me with this. I am uploading a csv file from a form to the GAE datastore.
class CSVImport(webapp.RequestHandler):
    def post(self):
        csv_file = self.request.get('csv_import')
        fileReader = csv.reader(csv_file)
        for row in fileReader:
            self.response.out.write(row)
I'm running into the same problem that someone else mentions here - http://groups.google.com/group/google-appengine/browse_thread/thread/bb2d0b1a80ca7ac2/861c8241308b9717
That is, csv.reader is iterating over each character and not each line. A Google engineer left this explanation:
The call self.request.get('csv') returns a String. When you iterate over a string, you iterate over the characters, not the lines. You can see the difference here:
class ProcessUpload(webapp.RequestHandler):
    def post(self):
        self.response.out.write(self.request.get('csv'))
        file = open(os.path.join(os.path.dirname(__file__), 'sample.csv'))
        self.response.out.write(file)

        # Iterating over a file
        fileReader = csv.reader(file)
        for row in fileReader:
            self.response.out.write(row)

        # Iterating over a string
        fileReader = csv.reader(self.request.get('csv'))
        for row in fileReader:
            self.response.out.write(row)
I really don't follow the explanation, and was unsuccessful implementing it. Can anyone provide a clearer explanation of this and a proposed fix?
Thanks,
August
Short answer, try this:
fileReader = csv.reader(csv_file.split("\n"))
Long answer, consider the following:
for thing in stuff:
    print thing.strip().split(",")
If stuff is a file pointer, each thing is a line. If stuff is a list, each thing is an item. If stuff is a string, each thing is a character.
Iterating over the object returned by csv.reader is going to give you behavior similar to iterating over the object passed in, only with each item CSV-parsed. If you iterate over a string, you'll get a CSV-parsed version of each character.
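A tiny demonstration of that behavior (Python 2; a rough sketch, the exact empty rows may differ slightly, but the shape of the difference is the point):

import csv

print list(csv.reader("a,b\nc,d"))              # one parsed "row" per character
# -> roughly [['a'], ['', ''], ['b'], [], ['c'], ['', ''], ['d']]

print list(csv.reader("a,b\nc,d".split("\n")))  # one parsed row per line
# -> [['a', 'b'], ['c', 'd']]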
I can't think of a clearer explanation than what the Google engineer you mentioned said. So let's break it down a bit.
The Python csv module operates on file-like objects, that is, a file or something that behaves like a Python file. Hence, csv.reader() expects to get a file object as its only required parameter.
The webapp.RequestHandler request object provides access to the HTTP parameters that are posted in the form. In HTTP, parameters are posted as key-value pairs, e.g., csv=record_one,record_two. When you invoke self.request.get('csv'), this returns the value associated with the key csv as a Python string. A Python string is not a file-like object. Apparently, the csv module falls back when it does not understand the object and simply iterates it (in Python, strings can be iterated over by character, e.g., for c in 'Test String': print c will print each character in the string on a separate line).
Fortunately, Python provides a StringIO class that allows a string to be treated as a file-like object. So (assuming GAE supports StringIO, and there's no reason it shouldn't) you should be able to do this:
import StringIO

class ProcessUpload(webapp.RequestHandler):
    def post(self):
        self.response.out.write(self.request.get('csv'))

        # Iterating over a string as a file
        stringReader = csv.reader(StringIO.StringIO(self.request.get('csv')))
        for row in stringReader:
            self.response.out.write(row)
Which will work as you expect it to.
Edit I'm assuming that you are using something like a <textarea/> to collect the csv file. If you're uploading an attachment, different handling may be necessary (I'm not all that familiar with Python GAE or how it handles attachments).
You need to call csv_file = self.request.POST.get("csv_import") and not csv_file = self.request.get("csv_import").
The second one just gives you a string as you mentioned in your original post. But accessing via self.request.POST.get gives you a cgi.FieldStorage object.
This means that you can call csv_file.filename to get the object’s filename and csv_file.type to get the mimetype.
Furthermore, if you access csv_file.file, it’s a StringO object (a read-only object from the StringIO module), not just a string. As ig0774 mentioned in his answer, the StringIO module allows you to treat a string as a file.
Therefore, your code can simply be:
class CSVImport(webapp.RequestHandler):
    def post(self):
        csv_file = self.request.POST.get('csv_import')
        fileReader = csv.reader(csv_file.file)
        for row in fileReader:
            # row is now a list containing all the column data in that row
            self.response.out.write(row)