I was trying to extract a large amount of data from a PostgreSQL database and then write it to a CSV, but I got a MemoryError. To do that, I used the Python StringIO module, and with a limited number of rows from the query it worked just fine; the file was written, and there was no problem importing it as a text file in Excel.
Since that approach raises the error, I've tried to write the file row by row instead, with a little function that takes care of the decoding/encoding stuff:
def convert_and_write(out_stream, input_value):
    output_value = input_value.decode('utf-8').encode('utf-16','ignore')
    out_stream.write(output_value)

with open( filename, "w" ) as out_file:
    for row in rs:
        tmpstr = ""
        for c in range( len( row ) ):
            xx = str( row[c] ).replace("\"", "").replace( chr(9), " " )
            tmpstr += xx + chr(9)
        convert_and_write( out_file,
                           tmpstr.replace( "\n", "-" ).replace( "\r", " " )
                           )
        convert_and_write( out_file, "\r\n" )
Now the file is written, but when I try to import it as text in Excel, there's a problem: the lines that divide the columns are drawn over the text, like this
(I've hidden some data, but you get the idea)
and when I compare the file before/after in Notepad++, the encoding seems to be the same (UCS-2 Little Endian), but I can see strange marks on the first letter of the words in the newer file.
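For what it's worth, a minimal sketch of what those marks may be: each encode('utf-16') call (as in convert_and_write above) prepends its own byte order mark, so writing the file in chunks scatters BOMs through it:

# Sketch: every encode('utf-16') call emits its own byte order mark (BOM),
# so encoding the output chunk by chunk scatters BOMs through the file.
one = u'a'.encode('utf-16')
two = u'b'.encode('utf-16')
print(repr(one))        # '\xff\xfea\x00' on a little-endian machine: BOM first
print(repr(one + two))  # the BOM appears again before 'b'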
What should I do?
Related
I'm writing a custom script whose first task is to extract a CSV's data into a Python dictionary. There's some weird behaviour with a variable, though: when executing the script below, instead of the subsequent input prompts, I get "Squeezed text (77 lines)" as output. If I inspect that, I get an empty white screen, so there seems to be nothing there. I totally don't get what's happening.
My script:
import os
import io

separator = ";"
source_data_folder = os.path.realpath( __file__ ).replace( "extraction.py", "source_data" )
for source_file in os.listdir( source_data_folder ):
    iterated_source_file = io.open( source_data_folder + "/" + source_file, encoding='windows-1252' )
    source_data = {}
    source_data_key_indexes = {}
    line_counter = 0
    for iterated_line in iterated_source_file:
        iterated_lines_data = iterated_line.split( "" + separator + "" )
        column_counter = 0
        if line_counter == 0:
            for iterated_lines_field in iterated_lines_data:
                source_data[iterated_lines_field] = []
                source_data_key_indexes[column_counter] = iterated_lines_field
                column_counter += 1
        else:
            for iterated_lines_field in iterated_lines_data:
                source_data[source_data_key_indexes[column_counter]].append( iterated_lines_field )
                column_counter += 1
        line_counter += 1
    iterated_source_file.close()
    for column_index in source_data_key_indexes:
        input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )
When I put this part:
for column_index in source_data_key_indexes:
    input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )
out of the initial for loop, without any indentation, it works; but I need to call it in the first for loop. I could maybe do this with a callback, but why is this actually happening?
I'm using Python v. 3.7.3 and am executing the script via the Python Shell v. 3.7.3.
Content of a sample CSV file, placed in the source_data folder, which is in the same location as the "extraction.py" file holding the code above:
first;second;third;fourth
this;is;the;1st
this;is;the;2nd
This CSV file was obtained by creating the corresponding table (three rows + four columns) in a new Microsoft Office Excel sheet, then saving it via "Save as..." and selecting the UTF-8 CSV file type.
Note: I noticed that when I add the line
print( iterated_line )
below the line "if line_counter == 0:" of my code, I interestingly get the "Squeezed text (77 lines)" again, followed by the visible content of the first line as a simple string. This is only true for the table header line (only the very first one); for the other lines, only the line content is output. Interestingly, this happens for any CSV file I create in the above-mentioned way, no matter the number of rows or columns, or their content. So is this actually some formatting issue with Python + MS Excel?
import os
import csv

source_data_folder = os.path.realpath(__file__).replace("extraction.py", "source_data")
for filename in os.listdir(source_data_folder):
    # os.listdir() returns bare names, so join each one with the folder path
    with open(os.path.join(source_data_folder, filename), encoding='windows-1252') as fp:
        reader = csv.DictReader(fp, delimiter=';')
        table = list(reader)
        # Convert list of dicts to dict of lists
        table = {key: [item[key] for item in table] for key in table[0]}
        print(table)
I found the problem, weirdly thanks to this. The problem is that os.listdir() contained the .DS_Store file as its first element, which is where the buggy first iteration originates from, so replace:
for source_file in os.listdir( source_data_folder ):
with
# remove the .DS_Store default file which is auto-added in Mac OS
myfiles = os.listdir( source_data_folder )
del myfiles[0]
# iterate through all the source files present in the source folder
for source_file in myfiles:
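A more defensive variant (my sketch, not part of the original fix) skips hidden files by name instead of assuming .DS_Store is the first entry:

# Sketch: filter out hidden files such as .DS_Store by name, since
# os.listdir() makes no ordering guarantee.
myfiles = [ f for f in os.listdir( source_data_folder ) if not f.startswith( "." ) ]
for source_file in myfiles:
    ...  # original loop body unchanged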
And the only problem now is that I have the string
\ufeff
at the very start of the very first line only. To ignore it, I used, according to this, the utf-8-sig encoding instead of utf-8, which indeed worked (the encoding change tells the decoder to "omit the BOM in the result").
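A quick sketch of the difference between the two codecs:

# Sketch: utf-8-sig strips a leading BOM when decoding; plain utf-8 keeps it.
raw = "\ufefffirst;second".encode("utf-8")
print(raw.decode("utf-8"))      # '\ufefffirst;second' -- BOM survives
print(raw.decode("utf-8-sig"))  # 'first;second'       -- BOM removed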
I am trying to split a large CSV file into several smaller files with Python.
As a first try, I read the first 261579 lines from the CSV dataset file using this part of the code:
for c in range(261579):
    line = datasetFile.readline()
    if len(line) == 0: print("empty line detected at : ", c)
    lines.append(line)
print("SAVING LINES ......")
split = open(outputDirectoryName + "spilt" + str(x+1) + ".csv", "w")
split.writelines(lines)
print("SPLIT " + str(x+1) + " END with ", str(len(lines)), "lines .")
OK, for the moment the code works well and shows me
"SPLIT 1 END with 261579 lines."
But the problem is that when I open my file "Split1.csv" with Notepad++, I only find 261575 lines instead of 261579; that's a loss of four lines somewhere in the file.
Given that, I want to know: what exactly happens with the file.writelines(lines) method when we use it to save my data in a split file?
I had the same issue, and then I found out that I should have closed my file. For you:
split.close()
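A minimal sketch of the same idea with a context manager, which closes (and therefore flushes) the file automatically; the data and file name here are just stand-ins:

# Sketch: 'with' guarantees the file is closed, so the write buffer is
# flushed and no trailing lines are lost.
lines = ["row %d\n" % i for i in range(5)]  # stand-in for the real data
with open("split1.csv", "w") as split:      # hypothetical output name
    split.writelines(lines)
# at this point the file is closed and fully written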
I've got a little bit of a headache here. I'm new to working with CSVs in Python, but here's what I've got: a CSV generated by airodump (this is an art project/technology demo in which I'm trying to count the number of BSSIDs in a given area). I'm working in Python 3.4 and trying to read the file in. I had a hiccup with some null values, but I seem to have worked around that (hence the perhaps odd way I open the file).
import csv
print ("nibbler starting")
log_initial = open("test.csv", "rU")
logReader = csv.reader((line.replace('\0',' ') for line in log_initial), delimiter=",")
logData = list(logReader)
row_count1 = sum(1 for row in logData)
row_count2 = sum(1 for row in logReader)
print ('Rows = ' + str(row_count1))
print ('Rows = ' + str(row_count2))
for row in logReader:
    print('Row #' + str(logReader.line_num) + ' ' + str(row))
So here's what happens. The first printed line says there are 52 rows (correct!), the second printed line says there are 0 rows (oh no, that's not what I expected).
No surprise then, the for loop delivers nothing. If I replace logReader with logData in the loop, then I get the whole file listed line by line. I'd stick with the data as a list, but that limits what I can do with it.
So somehow the file isn't getting properly processed as a CSV. I'm not really sure what to do about that. Any ideas?
logData = list(logReader)
reads all the rows in logReader. Reading logReader again will produce no rows as the iterator (the reader object) has already been exhausted.
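This is the normal behaviour of any one-shot iterator, not something specific to csv.reader; a tiny sketch:

# Sketch: a consumed iterator yields nothing on the second pass.
it = iter([1, 2, 3])
print(list(it))  # [1, 2, 3] -- consumes the iterator
print(list(it))  # []        -- already exhausted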
I am a python newbie and here is my code to extract some numbers from lines of text in a file:
i = 0
path = '/home/vahid/git/simmobility/dev/Basic/pathset/'
output = open(path + 'noTTSectionResult.txt', 'w')
for row in open(path + 'heoutput.txt', 'r'):
    if row.find('error: getTravelTimeBySegId') == -1:
        continue
    words = row.split(':')
    word = words[3]
    word = word[1:]
    word = word[:-3]
    output.write(str(i) + ':' + word + '\n')
    i = i + 1
print i
output.close
The final count printed on the console is 999 (I even printed the result on the console to make sure), but the number of lines written into the output file is less: 754 lines! Even the last line is written only partially!
Am I missing something?
Thanks
The
output.close
should be
output.close()
Otherwise it's a no-op and does not close the file. If the file is not closed, the write buffer is not flushed until later, or at all (depending on how your script terminates).
To avoid having to explicitly close the file, you could use the with statement:
with open(path + 'noTTSectionResult.txt', 'w') as output:
    for row in open(path + 'heoutput.txt', 'r'):
        ...
        output.write(...)
        ...
# no need to explicitly close `output`
This has the added advantage of closing the file even if the for loop raises an exception.
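For the record, the with statement above behaves roughly like this try/finally sketch:

# Sketch: rough equivalent of the with-version above.
output = open(path + 'noTTSectionResult.txt', 'w')  # path as defined in the question
try:
    output.write('0:example\n')  # stand-in for the loop body
finally:
    output.close()  # runs even if the body raises an exception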
You need to flush your output's buffer with the flush() method on Python file objects.
Also, if you want to be sure all the content of the buffer is really written to your hard drive (or wherever else), you have to use the system call os.fsync().
The close() method on a file object normally flushes the buffer, but I think it's good to know how this works.
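A short sketch of that layering, reusing the output file name from the question:

# Sketch: flush() empties Python's buffer into the OS; os.fsync() asks the
# OS to commit its own buffers to disk; close() flushes but does not fsync.
import os

output = open('noTTSectionResult.txt', 'w')
output.write('0:example\n')
output.flush()              # push Python's internal buffer to the OS
os.fsync(output.fileno())   # ask the OS to commit its buffers to the disk
output.close()              # close() also flushes, but does not fsync()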
I need to make a program that receives an integer and stores it in a file. When it has 15 (or 20, the exact number doesn't matter) values, it will overwrite the first one that it wrote. The values may be on the same line or each on a new line.
This program reads the temperature from a sensor, and then I will show that on a site with a PHP chart.
I thought about writing a value every half hour maybe, and when it has 15 values and a new one comes, it overwrites the oldest one.
I'm having trouble saving the values: I don't know how to save the list as a string with new lines; it saves double new lines. I'm new at Python and I get really lost.
This doesn't work, but it is a "sample" of what I want to do:
import sys
import os

if not( sys.argv[1:] ):
    print "No parameter"
    exit()

# If file doesn't exist, create it and save the value
if not os.path.isfile("tempsHistory"):
    data = open('tempsHistory', 'w+')
    data.write( ''.join( sys.argv[1:] ) + '\n' )
else:
    data = open('tempsHistory', 'a+')
    temps = []
    for line in data:
        temps += line.split('\n')
    if ( len( temps ) < 15 ):
        data.write( '\n'.join( sys.argv[1:] ) + '\n' )
    else:
        # Maximum amount reached, save new, delete oldest
        del temps[ 0 ]
        temps.append( '\n'.join( sys.argv[1:] ) )
        data.truncate( 0 )
        data.write( '\n'.join(str(e) for e in temps) )
data.close( )
I'm getting lost with ''.join and '\n', etc. I mean, I have to write with join so that the list is saved as a string and not as [ '', '' ]. If I use '\n'.join, it saves double line breaks, I think.
Thank you in advance!
I think what you want is something like this:
import sys

fileTemps = 'temps'
with open(fileTemps, 'r') as fd:  # note: 'rw' is not a valid mode; read first
    temps = fd.readlines()
if len(temps) >= 15:
    temps.pop(0)
temps.append(' '.join(sys.argv[1:]) + '\n')
with open(fileTemps, 'w') as fd:  # then rewrite the whole file
    for l in temps:
        fd.write(l)
First you open the file for reading. The fd.readlines() call gives you the lines in the file. Then you check the size, and if the number of lines is 15 or more, you pop the first value and append the new line. Then you write everything back to the file.
In Python, reading from a file (e.g. using readline()) generally gives you each line with a '\n' at the end; that is why you get double line breaks.
Hope this helps.
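To illustrate the double-line-break point (the values here are made up):

# Sketch: readlines() keeps each trailing '\n', so joining with '\n' again
# doubles the breaks; strip the lines first.
lines = ["20.5\n", "21.0\n"]                 # as readlines() would return them
print("\n".join(lines))                      # double-spaced output
print("\n".join(l.strip() for l in lines))   # single line breaks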
You want something like
import os
import shutil
import tempfile

values = open(target_file, "r").read().split("\n")
# ^ this solves your original problem, as readline() would keep the \n in the returned list items
if len(values) >= 15:
    # keep the values at 15
    values.pop()
values.insert(0, new_value)
# push the new value at the start of the list
tmp_fd, tmp_fn = tempfile.mkstemp()
# ^ this part is important
os.write(tmp_fd, "\n".join(values).encode())  # os.write() needs bytes on Python 3
os.close(tmp_fd)
shutil.move(tmp_fn, target_file)
# ^ this way the actual write to the file your webserver is reading is atomic
# this is e.g. how text editors save files
But anyway, I'd suggest you consider using a database, be it PostgreSQL, Redis, SQLite, or whatever floats your boat.
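If you go that way, a minimal sqlite3 sketch (the table and column names, and the sample reading, are my own invention):

# Sketch: store readings in SQLite and keep only the 15 most recent ones.
import sqlite3

con = sqlite3.connect('temps.db')
con.execute('CREATE TABLE IF NOT EXISTS temps ('
            'ts DATETIME DEFAULT CURRENT_TIMESTAMP, value REAL)')
con.execute('INSERT INTO temps (value) VALUES (?)', (23.5,))  # sample reading
con.execute('DELETE FROM temps WHERE rowid NOT IN '
            '(SELECT rowid FROM temps ORDER BY ts DESC LIMIT 15)')
con.commit()
con.close()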
You should try not to confuse storing data in lists with formatting it in strings. The data does not require the "\n"s,
so just temps.append(sys.argv[1:]) is enough.
In addition, you should not serialize/deserialize the data on your own. Have a look at pickle; it is much simpler to use than reading/writing lists on your own.
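A minimal pickle sketch of that suggestion (the file name and sample reading are hypothetical):

# Sketch: persist the rolling list with pickle instead of hand-rolled text I/O.
import os
import pickle

FILENAME = 'tempsHistory.pickle'  # hypothetical file name

temps = []
if os.path.isfile(FILENAME):
    with open(FILENAME, 'rb') as fd:
        temps = pickle.load(fd)

temps.append(23.5)   # new reading; no '\n' formatting needed
temps = temps[-15:]  # keep only the 15 most recent values

with open(FILENAME, 'wb') as fd:
    pickle.dump(temps, fd)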