I am getting the KeyError below while running my Python script, which imports data from one CSV, modifies it, and writes it to another CSV.
Code snippet:
import csv

Ty = 'testy'
Tx = 'testx'
ifile = csv.DictReader(open('test.csv'))
cdata = [x for x in ifile]
for row in cdata:
    row['Test'] = row.pop(Ty)
Error seen while executing:
row['Test'] = row.pop(Ty)
KeyError: 'testy'
Any idea?
Thanks
Your CSV probably doesn't have a header row, which is where the keys would be defined, since you didn't define the key names yourself. In that case DictReader needs the fieldnames parameter so it can map those names, as keys, to the values in each row.
So you should read your CSV file with something like:
ifile = csv.DictReader(open('test.csv'), fieldnames=['testx', 'testy'])
If you don't want to pass the fieldnames parameter, look at how a CSV defines its header; see the Wikipedia article:
The first record may be a "header", which contains column names in
each of the fields (there is no reliable way to tell whether a file
does this or not; however, it is uncommon to use characters other than
letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
You can put 'testy' and 'testx' in the first line of your CSV and not pass fieldnames to DictReader.
According to the error message, testy is missing from the first line of test.csv.
Try content like this in test.csv:
col_name1,col_name2,testy
a,b,c
c,d,e
Note that there should not be any spaces or tabs around testy.
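With that header in place, the original snippet should run; a minimal sketch of the whole fix (same file name as in the question):

import csv

Ty = 'testy'

# The first line of test.csv now supplies the keys, so no fieldnames argument is needed.
with open('test.csv', newline='') as f:
    cdata = list(csv.DictReader(f))

for row in cdata:
    row['Test'] = row.pop(Ty)  # rename the 'testy' column to 'Test'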
I'm new to MapReduce and MRjob. I am trying to read a CSV file that I want to process using MRjob in Python, but it has about 5 columns containing JSON strings (e.g. {}) or arrays of JSON strings (e.g. [{},{}]), some of them nested.
My mapper so far looks as follows:
from mrjob.job import MRJob
import csv
from io import StringIO

class MRWordCount(MRJob):
    def mapper(self, _, line):
        l = StringIO(line)
        reader = csv.reader(l)  # returns a generator.
        for cols in reader:
            columns = cols
            yield None, columns
I get the error -
_csv.Error: field larger than field limit (131072)
But that seems to happen because my code separates the JSON strings into separate columns as well (because of the commas inside).
How do I make it so that the JSON strings are not split? Maybe I'm overlooking something?
Alternatively, is there any other ways I could read this file with MRjob that would make this process simpler or cleaner?
Your JSON string is not surrounded by quote characters, so every comma in that field makes the csv engine think it's a new column.
Take a look at the csv documentation; what you are looking for is quotechar. Change your data so that your JSON is surrounded by a special character (the default is ") and adjust your csv reader accordingly.
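A minimal sketch of how quoting fixes the split (the sample line is invented for illustration):

import csv
from io import StringIO

# The JSON field is wrapped in the default quotechar (").
# Double quotes inside the field are doubled, per CSV convention.
line = '1,"[{""a"": 1}, {""b"": 2}]",done'

reader = csv.reader(StringIO(line), quotechar='"')
print(next(reader))  # ['1', '[{"a": 1}, {"b": 2}]', 'done'] - the JSON stays in one column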
I am trying to take more than one file containing information such as names, email addresses, and other fields, read these files in CSV format, and remove everything except the emails. Then I want to output a new file with the emails separated by semicolons, all on the same line.
The final format should look like:
someone#hotmail.com; someoneelse#gmail.com; someone3#university.edu
I must check that the emails are in the correct format of alphanumeric#alphanumeric.3letters.
I must remove all duplicates.
I must compare this list to another and remove the emails from one list that occur in the others.
The final format will be such that someone can copy and paste into Outlook the email recipient addresses.
I have looked at some videos, and also here. I found: python csv copy column
But I get an error when trying to write the new file.
I have imported csv and re.
Here is my code below:
def final_emails(email_list):
    with open(email_list) as csv_file:
        read_csv = csv.reader(csv_file, delimiter=',')
        write_csv = csv.writer(out_emails, delimiter=";")
        for row in read_csv:
            email = row[2]  # only take the emails (from column 3)
            if email != '':  # remove empties
                # remove the header, or anything that doesn't have a '#'
                # convert to lowercase and append to list
                emails.append(re.findall('\w*#\w*.\w{3}', email.lower()))
                write_csv.write([email])
    return emails

final_emails(list1)
final_emails(list2)

print(emails)
I have the print at the bottom to check the result.
I added the writer to create the new file, but I get this error:
TypeError: argument 1 must have a "write" method
I'm still learning, and this time I am doing many things I haven't done before, like csv and regular expressions.
Any assistance is appreciated. Thank you.
You need to define out_emails as a file handle opened with write permission before you can use it in csv.writer.
csv.writer needs an object with a .write method, such as a file handle, to be able to write to it. It seems out_emails doesn't have a .write method.
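A minimal sketch of the fix (the output file name out_emails.csv is made up for illustration):

import csv

# Open the output file for writing first; csv.writer needs its .write method.
with open('out_emails.csv', 'w', newline='') as out_emails:
    write_csv = csv.writer(out_emails, delimiter=';')
    # Note that the method is writerow(), not write()
    write_csv.writerow(['someone#hotmail.com', 'someoneelse#gmail.com'])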
I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so performing the action in Excel is not an option.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these converted to:
Sequence,Status,Report,Header,Profile
3433,true,223313,,xxxx
0323,true,43838,The,xxxx
5323,true,6541998,,xxxx
Meaning that I need to create a header from all the names that have the equals "=" symbol following them. All of the other operations on the file are taken care of, and this output will be used to perform a comparative check between files and replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I used the csv library to take care of the commas and quoting.
import csv
Then I made a function that takes a single line from your log file and returns a dictionary with the fields passed in as the header. If the current line doesn't have a particular field from the header, that field stays filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
Since the header and the number of fields can vary between your files, I made a function to extract them. For this I used a set, a collection of unique elements which is also unordered, so I converted it to a list with the sorted function. Don't forget that seek(0) call to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I made the main piece of code, which opens both the log file for reading and the CSV file for writing. It extracts and writes the header, then writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        # 'wb' is the Python 2 mode for csv output; on Python 3 use open('report.csv', 'w', newline='')
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.
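As a design note, the same output could also be produced with csv.DictWriter, which fills in missing fields by itself via restval; a sketch reusing extract_fields from above (Python 3 file modes assumed):

import csv

with open('report.log', 'r') as logfile, open('report.csv', 'w', newline='') as csvfile:
    header = extract_fields(logfile)
    # restval='' supplies the empty string for fields a line doesn't have
    writer = csv.DictWriter(csvfile, fieldnames=header, restval='')
    writer.writeheader()
    for line in logfile:
        writer.writerow(dict(cell.split('=') for cell in line.strip().split(';') if cell))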
Normally the csv.DictReader will use the first line of a .csv file as the column headers, i.e. the keys to the dictionary:
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
However, I am faced with something like this for my first line:
#Format: header1 header2 header3 ...etc.
The #Format: needs to be skipped, as it is not a column header. I could do something like:
column_headers = ['header1', 'header2', 'header3']
reader = csv.DictReader(my_file, delimiter='\t', fieldnames=column_headers)
But I would rather have the DictReader handle this, for two reasons:
There are a lot of columns
The column names may change over time, and this is a quarterly-run process.
Is there some way to have the DictReader still use the first line as the column headers, but skip that first #Format: word? Or, really, skipping any word that starts with a # would probably suffice.
As DictReader wraps an open file, you could read the first line of the file yourself, parse the headers from it (headers = my_file.readline().split(delimiter)[1:], or something like that), and then pass them to DictReader() as the fieldnames argument. The DictReader constructor does not reset the file, so you don't have to worry about it re-reading the header line after you've parsed it.
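A minimal sketch of that approach, assuming a tab-delimited file where "#Format:" is the first tab-separated token (the file name and column name are made up):

import csv

with open('quarterly.tsv', newline='') as my_file:
    # First line looks like: "#Format:\theader1\theader2\t..."
    # Split on the delimiter and drop the leading "#Format:" token.
    headers = my_file.readline().rstrip('\n').split('\t')[1:]
    reader = csv.DictReader(my_file, delimiter='\t', fieldnames=headers)
    for row in reader:
        print(row['header1'])  # hypothetical column name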
I'm trying to read a CSV file in Python, so that I can then find the average of the values in one of the columns using numpy.average.
My script looks like this:
import os
import numpy
import csv

listing = os.listdir('/path/to/directory/of/files/i/need')
os.chdir('/path/to/directory/of/files/i/need')

for file in listing[1:]:
    r = csv.reader(open(file, 'rU'))
    for row in r:
        if len(row) < 2:
            continue
        if float(row[2]) <= 0.05:
            avg = numpy.average(float(row[2]))
print avg
but I keep getting the error ValueError: invalid literal for float(). The csv reader seems to be reading the numbers as strings, and won't let me convert them to floats. Any suggestions?
Judging by the comments, your program is running into problems with the headers.
Two solutions to this are to call r.next(), which skips a line, before your for loop, or to use the DictReader class. The advantage of the DictReader class is that you can treat each row as a dictionary instead of a tuple, which may be more readable in some cases, though you may have to pass the list of headers to its constructor if the first row isn't a usable header.
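A minimal sketch of the first option, using the builtin next(), which works on both Python 2 and 3 (the file name is made up):

import csv

with open('data.csv') as f:  # hypothetical file name
    r = csv.reader(f)
    next(r)  # skip the header row so float() never sees column names
    for row in r:
        if len(row) < 3:
            continue
        value = float(row[2])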
change:
float(row[2])
to:
float(row[2].strip("'\""))
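Note that numpy.average of a single value is just that value; to average the whole column, collect the values first. A sketch combining both answers (the file name is made up; the threshold follows the question):

import csv
import numpy

values = []
with open('data.csv') as f:  # hypothetical file name
    r = csv.reader(f)
    next(r)  # skip the header row
    for row in r:
        if len(row) < 3:
            continue
        v = float(row[2].strip("'\""))  # strip stray quotes before converting
        if v <= 0.05:
            values.append(v)

print(numpy.average(values))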