When my Flask application starts up, it needs to add a bunch of data to Postgres via SQLAlchemy. I'm looking for a good way to do this. The data is in TSV format, and I already have a SQLAlchemy db.Model schema for it. Right now:
for datafile in datafiles:
    with open(datafile, 'rb') as file:
        # reader = csv.reader(file, delimiter='\t')
        reader = csv.DictReader(file, delimiter='\t', fieldnames=[])
        OCRs = # somehow efficiently convert to list of dicts...
        db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)
Is there a better, more direct way? Otherwise, what is the best way of generating OCRs?
The solution suggested here seems clunky.
import csv
from collections import namedtuple
fh = csv.reader(open(your_file, "rU"), delimiter=',', dialect=csv.excel_tab)
headers = fh.next()
Row = namedtuple('Row', headers)
OCRs = [Row._make(i)._asdict() for i in fh]
db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)
# plus your loop for multiple files and exception handling of course =)
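A more direct sketch (assuming Python 3 and that each TSV's header row matches the OpenChromatinRegion column names, which the question doesn't confirm) lets csv.DictReader build the dicts straight from the header, with no namedtuple detour:

import csv

OCRs = []
for datafile in datafiles:
    with open(datafile, newline='') as f:
        # DictReader takes field names from the first row, so each row comes back as a dict
        OCRs.extend(csv.DictReader(f, delimiter='\t'))

db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)

Note that every value comes back as a string; if the model has non-text columns you may still need a conversion step per row.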
Related
I want to search for a column and delete it from a CSV file using Python. I cannot use dataframes as I need to work with large files and can't load them into RAM. How do I do it?
Example CSV file:
Home,Contact,Address
abc,123,xyz
I need to find and delete the Contact column, for example. I thought about using csv.reader but cannot figure out how to do it.
Check this:
import csv

col = 'Contact'
with open('your_csv.csv', newline='') as f:
    with open('new_csv.csv', 'w', newline='') as g:
        # creating csv reader and writer
        reader = csv.reader(f)
        writer = csv.writer(g)
        # getting the 'col' index in the header, then writing the header without that column
        header = next(reader)
        col_index = header.index(col)
        del header[col_index]
        writer.writerow(header)
        # writing every remaining row to the new csv file, minus the unwanted column
        for line in reader:
            del line[col_index]
            writer.writerow(line)
Explanation for using newline='' is here.
If your application still prefers to work with pandas, I'd suggest using pandas' chunking tactic. See the example below:
import pandas

iterator = pandas.read_csv('/tmp/abc.csv', chunksize=10**5)
df_new = pandas.DataFrame(columns=['your_remaining_columns'])
for df in iterator:
    del df['col_b']
    df_new = pandas.concat([df_new, df])

print(df_new.shape[0])
print(df_new.columns)
I was able to process a 50 GB CSV file with messy data (non-UTF-8 encoding, cells containing commas, deduplication, and filtering out bad rows) with this approach before.
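For a file that large, concatenating every processed chunk back into one DataFrame can itself exhaust RAM. A variation (just a sketch; the output file name and 'col_b' are placeholders for your own) appends each chunk straight to the output file instead of keeping everything in memory:

import pandas

first = True
for chunk in pandas.read_csv('/tmp/abc.csv', chunksize=10**5):
    del chunk['col_b']  # drop the unwanted column from this chunk only
    # write the header only once, then append the remaining chunks
    chunk.to_csv('/tmp/abc_filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False

This keeps at most one chunk in memory at a time, which is what makes the very large file case workable.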
I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv
csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')
csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()
f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)
#save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created however there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you are reading the raw text of the file with f = csvFile.read(); then, on writing, you are feeding that data, which is one single lump of text in a single string, through a CSV writer object. The CSV writer sees the string as an iterable and writes each of its elements (each character) in its own CSV cell. Then there is no data for a second row, and the process ends.
The pickle dumps and loads you perform are just a no-op: nothing happens there, and if there were any issue, it would rather be due to some unpicklable object reference in the object you pass to dumps: you'd get an exception, not differing data when loads is called.
Now, without knowing why you want to do this, and what intermediate steps you have planned for the data, it is hard to advise you: as written, you are performing a round trip of non-operations: reading a file, pickling and unpickling its contents, and writing those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need it, and you are done.
If you want the whole "do nothing" cycle to go through actually having the CSV data separated into different elements in Python, you can do:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

data = list(csv.reader(open(csvFilePath, newline='')))
# ^ consumes all iterations of the reader: each iteration is a row, a list in which each cell value is a list element

pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)

csv.writer(open(pathToSaveTo, "wt", newline='')).writerows(restored_data)
Notice that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list call causes all rows to be read and turned into list items, because the reader is otherwise a lazy iterator (and it would not be picklable, since one of the attributes its state depends on is an open file).
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling are fine. If you compare f with f_unpickled:

if f == f_unpickled:
    print("Same")

This printed "Same" in my case. If you print their types, you'll see both are strings.
The better option is to follow the documented style and write each row one at a time, rather than dumping in the entire string, embedded newlines and all. Something like this:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

# read the csv file row by row
rows = []
with open(csvFilePath, 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)

# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")

# save the unpickled rows to a new csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it, although it would take more finagling with line separators to avoid an empty line at the end.
Hi, I am trying to iterate through a CSV file but I cannot get it to work somehow. I followed the Python docs but I am still not able to iterate through it. I have a gzipped CSV file that I work with in this format:
2015-01-10 00:00:05;32
As you can see it's delimited with a ';'.
Here is my code to run through it (simplified):
gzip_fd = gzip.decompress(gzip_file).decode(encoding='utf8')
csv_data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')
for data in csv_data:
    print(data)
But when I want to work with data, it only contains the first character (like: 2) and not the first field of the CSV row that I need. Has anyone here had the same issue? I also tried csv.DictReader, but with no success.
Even if your snippet were fixed to work, it would buffer all the data in memory, which might not scale well for very large files.
Gzipped data can also be iterated on-the-fly -- the following works for me on CPython 3.8:
import csv
import gzip
with gzip.open('test.csv.gz', 'r') as gzipped:
    reader = csv.reader(gzipped, delimiter=';', lineterminator='\n')
    for line in reader:
        print(line)

['2015-01-10 00:00:05', '32']
<...>
Update: As per comments below, my snippet does not work on older Python versions (reproduced on CPython 3.5).
You can use io.TextIOWrapper to achieve the same effect:
import csv
import io
import gzip
with gzip.open('test.csv.gz', 'rb') as gzipped:
    reader = csv.reader(io.TextIOWrapper(gzipped), delimiter=';',
                        lineterminator='\n')
    for line in reader:
        print(line)
So I fixed my issue. The problem was that I didn't split the string that I get (I can't use gzip.open because I don't have a file, just a bytes string of the gzipped data).
Here is the fix to my problem:
gzip_fd = gzip.decompress(compressed_data).decode(encoding='utf-8').split('\n')
self.data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')
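That fix works, but decompressing and splitting the whole payload still holds everything in memory at once. If that becomes a problem, a streaming variation (a sketch, assuming compressed_data is the in-memory gzipped bytes, as above) wraps the bytes in file-like objects so rows are decoded lazily:

import csv
import gzip
import io

# compressed_data is assumed to be the gzipped bytes already held in memory
with gzip.GzipFile(fileobj=io.BytesIO(compressed_data)) as binary_stream:
    text_stream = io.TextIOWrapper(binary_stream, encoding='utf-8')
    reader = csv.reader(text_stream, delimiter=';', lineterminator='\n')
    for row in reader:
        print(row)  # e.g. ['2015-01-10 00:00:05', '32']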
I am extremely new to Python and need some help with this one. I've tried various bits of code and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 CSV files that each contain multiple columns of data. I need to take the average of the first column, called "agr", and save this value in a different Excel or CSV file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The names of the files are crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd

path = ('C:/test/*.csv')

for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(fname)
        print row['agr']
I know the code above is extremely rudimentary, so any help would be great. Thanks, everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas here either. Just use the built-in sum function.
import glob
import os

path = ('test/*.csv')  # using local dir for test

with open('output.csv', 'w', newline='') as outfile:
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.writelines("{},{}\r\n".format(
                os.path.basename(fname),
                sum(float(line.split(',', 1)[0].strip())
                    for line in csvfile)))
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provide. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
import csv
import glob

averages = []
for fname in glob.glob(path):
    with open(fname, "rb") as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
        avg = sum(values) / len(values)
        averages.append((fname, avg))
The list averages now contains the numbers you want. This is how you write them out to another CSV file:
with open("avegages.csv", "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(["File", "Average agr"])
for row in averages:
writer.writerow(row)
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    data = pd.read_csv(fname)
    averages.append((fname, data["agr"].mean()))

df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see the code is a lot shorter, since file i/o and calculations can be done with one statement.
I'm using Python's csv module to do some reading and writing of csv files.
I've got the reading fine and appending to the csv fine, but I want to be able to overwrite a specific row in the csv.
For reference, here's my reading and then writing code to append:
#reading
b = open("bottles.csv", "rb")
bottles = csv.reader(b)
bottle_list = []
bottle_list.extend(bottles)
b.close()
#appending
b=open('bottles.csv','a')
writer = csv.writer(b)
writer.writerow([bottle,emptyButtonCount,100, img])
b.close()
And I'm using basically the same for the overwrite mode (which isn't correct; it just overwrites the whole csv file):
b=open('bottles.csv','wb')
writer = csv.writer(b)
writer.writerow([bottle,btlnum,100,img])
b.close()
In the second case, how do I tell Python I need a specific row overwritten? I've scoured Google and other Stack Overflow posts to no avail. I assume my limited programming knowledge is to blame rather than Google.
I will add to Steven's answer:
import csv

bottle_list = []

# Read all data from the csv file.
with open('a.csv', 'rb') as b:
    bottles = csv.reader(b)
    bottle_list.extend(bottles)

# data to override in the format {line_num_to_override: data_to_write}.
line_to_override = {1: ['e', 'c', 'd']}

# Write data to the csv file and replace the lines in the line_to_override dict.
with open('a.csv', 'wb') as b:
    writer = csv.writer(b)
    for line, row in enumerate(bottle_list):
        data = line_to_override.get(line, row)
        writer.writerow(data)
You cannot overwrite a single row in the CSV file. You'll have to write all the rows you want to a new file and then rename it back to the original file name.
Your pattern of usage may fit a database better than a CSV file. Look into the sqlite3 module for a lightweight database.
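To make the rewrite-and-rename approach concrete, here is a minimal sketch (the file name, row index, and replacement row are all placeholders): read every row, swap in the new one, write everything to a temporary file, and replace the original with it.

import csv
import os
import tempfile

def overwrite_row(filename, row_index, new_row):
    # write the modified copy to a temp file in the same directory
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(filename)))
    with open(filename, newline='') as src, os.fdopen(fd, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for i, row in enumerate(csv.reader(src)):
            writer.writerow(new_row if i == row_index else row)
    # rename the temp file back over the original
    os.replace(tmp_path, filename)

overwrite_row('bottles.csv', 2, [bottle, btlnum, 100, img])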