I'm trying to read a CSV file in Python, so that I can then find the average of the values in one of the columns using numpy.average.
My script looks like this:
import os
import numpy
import csv

listing = os.listdir('/path/to/directory/of/files/i/need')
os.chdir('/path/to/directory/of/files/i/need')
for file in listing[1:]:
    r = csv.reader(open(file, 'rU'))
    for row in r:
        if len(row) < 2: continue
        if float(row[2]) <= 0.05:
            avg = numpy.average(float(row[2]))
            print avg
but I keep getting the error ValueError: invalid literal for float(). The csv reader seems to be reading the numbers as strings, and won't let me convert them to floats. Any suggestions?
Judging by the comments, your program is running into problems with the headers.
Two solutions to this are to call r.next() (next(r) in Python 3) once before your for loop, which skips the header line, or to use the DictReader class. The advantage of DictReader is that you can treat each row as a dictionary instead of a list, which may make for more readability in some cases. If the file's first row is a header, DictReader uses it for the keys automatically; otherwise you pass the list of field names to its constructor.
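A minimal sketch of both approaches; the file name 'data.csv' and the column name 'p_value' are placeholders, not from the question:

import csv

# Option 1: skip the header once, then index columns by position.
with open('data.csv') as f:
    r = csv.reader(f)
    next(r)  # discard the header row
    for row in r:
        value = float(row[2])

# Option 2: let DictReader map the header row to dictionary keys.
with open('data.csv') as f:
    for row in csv.DictReader(f):
        value = float(row['p_value'])  # 'p_value' is a hypothetical column name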
change:
float(row[2])
to:
float(row[2].strip("'\""))
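Putting the two fixes together, a sketch of the whole loop; it assumes the layout from your script, and it collects the values into a list first so that numpy.average is applied to the whole column rather than to a single float per row:

import os
import csv
import numpy

directory = '/path/to/directory/of/files/i/need'
for name in os.listdir(directory):
    values = []
    with open(os.path.join(directory, name)) as f:
        r = csv.reader(f)
        next(r)  # skip the header row
        for row in r:
            if len(row) < 2:
                continue
            v = float(row[2].strip("'\""))  # drop stray quotes before converting
            if v <= 0.05:
                values.append(v)
    if values:
        print(numpy.average(values))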
I read a large CSV file (millions of records) with this script. How do I detect that the file is at its end?
import csv

f = open("file.csv", newline='')
csv_reader = csv.reader(f)
while True:
    do something with next(csv_reader)[6]
The obvious solution is to loop over csv_reader, as suggested by this answer. If that is not practical, the docs for the next function say:
Retrieve the next item from the iterator by calling its __next__() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
thus giving you two ways of detecting the end.
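A sketch of both ways, where process() stands in for whatever "do something" is in your script:

import csv

def process(value):
    # placeholder for your actual per-row logic
    print(value)

# Way 1: let StopIteration signal the end.
with open("file.csv", newline='') as f:
    csv_reader = csv.reader(f)
    try:
        while True:
            process(next(csv_reader)[6])
    except StopIteration:
        pass  # end of file reached

# Way 2: pass a default to next() and test for it.
with open("file.csv", newline='') as f:
    csv_reader = csv.reader(f)
    sentinel = object()
    while True:
        row = next(csv_reader, sentinel)
        if row is sentinel:
            break  # the reader is exhausted
        process(row[6])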
csv.reader does not read the file into memory up front; it returns an iterator that pulls rows from the file lazily. For reading "line by line", you just loop over it:
for row in csv_reader:
    do something
If you directly want the last line:
with open('file_name.csv', 'r') as file:
    data = file.readlines()
    lastRow = data[-1]
This will be quite slow and memory-consuming, though. An alternative is pandas.
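If all you need is the last row, a lighter-weight route (my suggestion, not part of the answer above) keeps only one row in memory by using collections.deque with maxlen=1:

import csv
from collections import deque

with open('file_name.csv', newline='') as f:
    # a deque with maxlen=1 discards every row except the most recent one
    last_row = deque(csv.reader(f), maxlen=1).pop()
print(last_row)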
I solved it with pandas:
import pandas as pd
import numpy as np
csv_reader = pd.read_csv("file.csv", skiprows=2, usecols=[6])
csv_a = csv_reader.to_numpy()
This skips the first 2 rows, imports only the column at index 6 (the seventh column), and converts it to a NumPy array.
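From there, if what you are after is still the end of the file, the last value is just the last element of the array:

last_value = csv_a[-1]  # csv_a is the array built above
print(last_value)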
I'm new to MapReduce and MRjob, and I am trying to read a CSV file that I want to process using MRjob in Python. But it has about 5 columns containing JSON strings (e.g. {}) or arrays of JSON strings (e.g. [{},{}]), some of them nested.
My mapper so far looks as follows:
from mrjob.job import MRJob
import csv
from io import StringIO

class MRWordCount(MRJob):
    def mapper(self, _, line):
        l = StringIO(line)
        reader = csv.reader(l)  # an iterator over parsed rows
        for cols in reader:
            columns = cols
            yield None, columns
I get the error -
_csv.Error: field larger than field limit (131072)
But that seems to happen because my code separates the JSON strings into separate columns as well (because of the commas inside).
How do I make this, so that the JSON strings are not split? Maybe I'm overlooking something?
Alternatively, is there any other ways I could read this file with MRjob that would make this process simpler or cleaner?
Your JSON strings are not surrounded by quote characters, so every comma inside those fields makes the csv engine think it's a new column.
Take a look at the quotechar parameter in the csv module's documentation. What you are looking for is to change your data so that each JSON field is surrounded by a special character (the default is ") and to adjust your csv reader accordingly.
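A sketch of what that looks like, with made-up data; inside a quoted field, embedded double quotes are doubled ("" stands for "), and csv.reader undoes that on the way back in:

import csv
from io import StringIO

# A line whose third field is JSON, quoted so its commas are not separators.
line = 'id1,42,"{""key"": ""value"", ""n"": [1, 2]}"'

reader = csv.reader(StringIO(line), quotechar='"')
for cols in reader:
    print(cols)  # ['id1', '42', '{"key": "value", "n": [1, 2]}']

# If a field is genuinely longer than the 131072-character default,
# raise the limit as well:
csv.field_size_limit(2**20)  # e.g. allow fields up to 1 MiB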
I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings; at least it was when I tried to run Apoc's script on the example file you linked to. You could solve this by using this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
instead of this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. data starting 21 lines in, the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.
I am getting the KeyError below while running my Python script, which imports data from one csv, modifies it, and writes it to another csv.
Code snippet:
import csv

Ty = 'testy'
Tx = 'testx'
ifile = csv.DictReader(open('test.csv'))
cdata = [x for x in ifile]
for row in cdata:
    row['Test'] = row.pop(Ty)
Error seen while executing:
row['Test'] = row.pop(Ty)
KeyError: 'testy'
Any idea?
Thanks
Probably your csv doesn't have a header row in which the keys are specified. When DictReader is not given the fieldnames parameter, it uses the first row of the file as the keys (the header); if your file has no such row, you have to pass fieldnames yourself so the reader can map the keys to the values.
So you could read your csv file like this:
ifile = csv.DictReader(open('test.csv'), fieldnames=['testx', 'testy'])
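One caveat (my addition): if the file does have a header row and you also pass fieldnames, the header comes back as an ordinary data row, so discard it once:

import csv

with open('test.csv') as f:
    ifile = csv.DictReader(f, fieldnames=['testx', 'testy'])
    next(ifile)  # skip the header row, since fieldnames was given explicitly
    cdata = [row for row in ifile]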
If you don't want to pass the fieldnames parameter, make sure the csv defines its header on its own first line; see the Wikipedia article:
The first record may be a "header", which contains column names in
each of the fields (there is no reliable way to tell whether a file
does this or not; however, it is uncommon to use characters other than
letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
You can put your 'testy' and 'testx' in the first line of your csv and skip passing fieldnames to DictReader.
According to the error message, testy is missing from the first line of test.csv.
Try such content in test.csv
col_name1,col_name2,testy
a,b,c
c,d,e
Note that there should not be any spaces/tabs around testy.
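Putting it together, a sketch of the full read-modify-write flow; output.csv and the write step are assumptions, since your snippet stops at the rename:

import csv

Ty = 'testy'

with open('test.csv') as src:
    cdata = list(csv.DictReader(src))  # the header row supplies the keys

for row in cdata:
    row['Test'] = row.pop(Ty)  # rename the 'testy' column to 'Test'

with open('output.csv', 'w', newline='') as dst:
    writer = csv.DictWriter(dst, fieldnames=cdata[0].keys())
    writer.writeheader()
    writer.writerows(cdata)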
I have a large tab-delimited csv file with the following format:
#mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment gene_alignment mirna_start mirna_end gene_start gene_end genome_coordinates conservation align_score seed_cat energy mirsvr_score
What I would like to be able to do is iterate through rows and select items based on data (strings) in the "gene_id" field, then copy those rows to a new file.
I am a Python noob, and thought it would be a good way to get my feet wet, but it is harder than it looks! I have been trying to use the csv package to manipulate the files, reading and writing basic stuff using DictReader and DictWriter. If anyone can help me out coming up with a template for the iterative searching aspect, I would be greatly indebted. So far I have:
import csv
f = open("C:\Documents and Settings\Administrator\Desktop\miRNA Scripting\mirna_predictions_short.txt", "r")
reader = csv.DictReader(f, delimiter='\t')
writer = open("output.txt",'wb')
writer = csv.writer(writer, delimiter='\t')
Then the iterative bit, bleurgh:
for row in reader:
    if reader.gene_id == str(CG11710):
        writer.writerow
This obviously doesn't work. Any ideas on better ways to structure this?
You're almost there! The code is nearly correct :)
Accessing dicts goes like this:
some_dict['some_key']
Instead of:
some_object.some_attribute
Creating a string isn't done with str(...) but with quotes, like 'CG11710'
In your case:
for row in reader:
    if row['gene_id'] == 'CG11710':
        writer.writerow(row)
Dictionaries in Python are addressed like dictionary['key']. So inside your loop it'd be row['gene_id'], not reader.gene_id. Also, strings are declared with quotes, "text", not like str(text); str(text) will try to cast whatever is stored in the variable text to a string, which is not what I think you want. Also, writer.writerow is a function, and functions take arguments, so you need to do writer.writerow(row).
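One extra wrinkle worth flagging (my addition, not from either answer): rows from DictReader are dicts, and passing a dict to csv.writer.writerow writes its keys rather than its values. Pairing DictReader with DictWriter sidesteps that; a sketch:

import csv

with open('mirna_predictions_short.txt') as src, \
     open('output.txt', 'w', newline='') as dst:
    reader = csv.DictReader(src, delimiter='\t')
    # Reuse the header the reader found, so output columns match the input.
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        if row['gene_id'] == 'CG11710':
            writer.writerow(row)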