Empty array after writing to CSV file in Python

My goal is to convert a 6000-record CSV file into an array, clean and normalize it, and read it into a corpora.Dictionary() for use with doc2bow in Gensim to run a SparseMatrixSimilarity query. At first I was successful in reading in the CSV file, and it printed out an array I call "definitions" with an entry for each of the 6000 sub-category records.
import csv

f = open('test.csv')
csv_f = csv.reader(f)
definitions = []
for row in csv_f:
    definitions.append(row[2])  # third column holds the definition text
print(definitions)
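For context, a rough sketch of the downstream Gensim steps described above (the tokenization and query text here are placeholder assumptions, not from the original post):

from gensim import corpora, similarities

# assumes `definitions` is the list of strings built above
texts = [d.lower().split() for d in definitions]  # naive whitespace tokenization
dictionary = corpora.Dictionary(texts)  # maps each word to an integer id
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))
sims = index[dictionary.doc2bow(u'some query text'.split())]  # similarity scores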
But I then hit a wall with UTF-8 and ASCII errors; Gensim has "strict" UTF-8 settings.
After several hours spent on Stack Overflow, researching, and trying to apply a few UTF-8 encoders per the Python csv documentation, I read that since Python 2.7's csv package doesn't handle Unicode "out of the box", I could use the codecs package instead.
I figured that instead of finding and decoding every line in my original 6000-line "definitions" array, I could take an initial stab at decoding the file right off the bat using codecs. However, the code below fails to write anything to my definitions array. Being a newbie, I imagine that I may be using codecs the wrong way, and/or closing the file the wrong way.
with codecs.open('test.csv', 'rb', encoding='utf-8') as f:
    csv_f = csv.reader(f)
    definitions = []
    for row in csv_f:
        definitions.append(np.array((array.float(i) for i in l)))
    f.close()
print(definitions)
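For reference, a minimal Python 2.7 sketch of that decode-on-read idea (assuming the file really is UTF-8, and keeping the row[2] extraction from the first snippet) would be:

import csv

definitions = []
with open('test.csv', 'rb') as f:  # read raw bytes; Python 2's csv module can't take unicode input
    for row in csv.reader(f):
        definitions.append(row[2].decode('utf-8'))  # decode each cell to unicode
print(definitions)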
I am a total newbie, so apologies for any errors in my description. I'm learning as I go and really appreciate any feedback and help. Perhaps I'm going about this the wrong way; I welcome any education. Thank you again.

Related

Find & replace data in a CSV using Python on Zapier

I'm new to Python, Zapier, and pretty much everything, so forgive me if this is easy or impossible...
I'm trying to import multiple csv's into Zapier for an automated workflow; however, they contain dot points (bullet characters) that aren't encoded in UTF-8, which is all Zapier can read.
It consistently errors with:
'utf-8' codec can't decode byte 0x95 in position 829: invalid start byte
After talking to Zapier support, they suggested using Python to find and replace these dot points with an asterisk or dash, then import the corrected csv into my Zapier workflow.
This is what I have written so far as a Python action in Zapier (just trying to read the csv to start with), with no luck:
import csv

with open(input_data['file'], 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Is this possible?
Thanks!
[Screenshot: Zapier trying to import a CSV with bullet points]
[Screenshot: my current Python code (not working) attempting to find & replace bullet points in the CSVs]
This is possible, but it's a little tricky. Zapier is confusing when it comes to files. On your computer, files are a series of bytes. But in Zapier, a file is usually a url that points to the actual file. This is great for cross-app compatibility, but tricky to work with in code.
You're trying to open a url as a file in Python, which isn't working. Instead, make a request for that file, then read it as a series of bytes. Try this:
import csv
import io
import requests

file_data = requests.get(input_data['file'])
# the file isn't valid UTF-8 (0x95 is a cp1252 bullet), so decode it as cp1252
reader = csv.reader(file_data.content.decode('cp1252').splitlines(), delimiter=',')
result = io.StringIO()  # a string buffer to write the cleaned csv into
writer = csv.writer(result)
for row in reader:
    # some modifications here, e.g.
    # row = [cell.replace(u'\u2022', '-') for cell in row]
    writer.writerow(row)
return [{'data': result.getvalue()}]
The result buffer is there because you want to write out a string that you can then re-package as a CSV in your virtual filesystem of choice (gDrive, Dropbox, etc.).
You can also test this locally instead of in the Zapier editor (I find that's a bit easier to iterate with). Simply get the file url from the code step (it'll be something like https://zapier.com/engine/...) and make a local python file with:
input_data = {'file': 'https://zapier.com/engine/...'}
...
You'll also need to pip install requests if you don't have it.
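A minimal local harness along those lines (the URL is a placeholder to be filled in yourself) might look like:

import csv
import requests

# placeholder; paste the real file url copied from the Zapier code step
input_data = {'file': 'https://zapier.com/engine/...'}

file_data = requests.get(input_data['file'])
for row in csv.reader(file_data.content.decode('cp1252').splitlines()):
    print(row)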

Need a push to start with a function about text files, I can't figure this out on my own

I don't need the entire code, but I'd like a push to help me on my way. I've been searching the internet for clues on how to start writing a function like this, but I haven't gotten any further than just the name of the function.
So I haven't got the slightest clue how to start with this; I don't know how to work with text files. Any tips?
These text files are CSV (Comma-Separated Values) files, a simple file format used to store tabular data.
You may explore Python's built-in csv module.
The following code snippet is an example of loading a .csv file in Python:
import csv

filename = 'us_population.csv'
with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)  # each row is a list of the column values

How to correctly put extended ASCII characters into a CSV file?

I'm trying to write some data in an array that contains extended ASCII characters to a CSV file. Below is a small example of the code I'm using on the real file.
The array text_array represents an array containing only one row.
import csv

text_array = [["Á","Â","Æ","Ç","Ö","×","Ø","Ù","Þ","ß","á","â","ã","ä","å","æ"]]

with open("/Files/out.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(text_array)
The output I'm getting in the CSV file is wrong, showing these characters instead:
à Â Æ Ç Ö × Ø Ù Þ ß á â ã ä å æ
I found that the code below fixes the issue in Python 3.4, but I'm working in Python 2.7.
c = csv.writer(open("Out.csv", 'w', newline='', encoding='utf-8'))
How can I fix this?
UPDATE
I received some links in the comments, but it is difficult for me to understand what needs to be done to fix this issue. Could someone show an example, please?
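One possible Python 2.7 approach, sketched here as a suggestion rather than taken from the thread: keep the cells as unicode objects and encode each one to UTF-8 bytes just before writing, with a BOM at the start so Excel detects the encoding.

# -*- coding: utf-8 -*-
import codecs
import csv

text_array = [[u"Á", u"Â", u"Æ", u"Ç", u"Ö", u"×", u"Ø", u"Ù",
               u"Þ", u"ß", u"á", u"â", u"ã", u"ä", u"å", u"æ"]]

with open("/Files/out.csv", "wb") as f:
    f.write(codecs.BOM_UTF8)  # lets Excel recognize the file as UTF-8
    writer = csv.writer(f)
    for row in text_array:
        # Python 2's csv module expects bytes, so encode each cell
        writer.writerow([cell.encode("utf-8") for cell in row])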

Mixed encoding in csv file

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird characters for non-ASCII characters (mainly Latin), like "Ã©" or "Ã³".
I cannot simply change the file encoding because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which ones weren't, and all I have is the original .csv file, which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and that misinterpreted characters (bad Unicode) can be inferred and fixed. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content shows exactly the same text as before. I believe this has to do with the way ftfy is inferring the content's encoding.
Nevertheless, if I run ftfy.fix_text("PÃºblica que cotiza en MÃ©xico") on its own, it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file and try to fix it with ftfy, and then import the file back into R, but it seems a little bit complicated.
Any suggestions?
In fact, there was mixed encoding for random cells in several places. Probably there was an issue when the data was exported from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formed characters, it assumes that the whole line is encoded the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python's standard library provides functionality to work painlessly with csv (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys

def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
A small suggestion: divide and conquer.
Try using one tool (ftfy?) to align the whole file to the same encoding (and save it as a plain-text file), and only then try parsing it as csv.

Writing unicode data in csv

I know a similar kind of question has been asked many times, but seriously, I have not been able to implement a csv writer that writes properly to csv (it shows garbage).
I am trying to use the UnicodeWriter as mentioned in the official docs.
ff = open('a.csv', 'w')
writer = UnicodeWriter(ff)  # UnicodeWriter recipe from the Python 2 csv docs examples
st = unicode('Displaygrößen', 'utf-8')  # gives u'Displaygr\xf6\xdfen'
writer.writerow([st])
This does not give me any decoding or encoding error, but it writes the word Displaygrößen as DisplaygrÃ¶ÃŸen, which is not good. Can anyone help me with what I am doing wrong here?
You are writing the file in UTF-8 format, but you don't indicate that in the csv file itself.
You should write the UTF-8 BOM at the beginning of the file. Add this:
import codecs

ff = open('a.csv', 'w')
ff.write(codecs.BOM_UTF8)
Your csv file should then open correctly in the program trying to read it.
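For completeness, a minimal self-contained Python 2 sketch along these lines, using a plain csv.writer with explicitly encoded cells instead of the UnicodeWriter recipe:

# -*- coding: utf-8 -*-
import codecs
import csv

with open('a.csv', 'wb') as ff:
    ff.write(codecs.BOM_UTF8)  # BOM so readers detect UTF-8
    writer = csv.writer(ff)
    st = unicode('Displaygrößen', 'utf-8')
    writer.writerow([st.encode('utf-8')])  # csv wants bytes in Python 2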
Opening the file with codecs.open should fix it.
