I have a CSV file with 5 columns and many rows of data.
I need to delete entire rows based on the data in one column, which I can do on my own.
My issue is that I am unable to write the data back out in proper CSV format.
I'm importing the CSV like so.
data = open(datafile)  # this datafile variable has the CSV path
parse = csv.DictReader(data)
newfile = open("validated.csv", "w", newline="")  # I'd like to output my changes in this new file and leave the original CSV as is
output = csv.writer(newfile)
Based on what I've read, DictReader interprets my CSV as a sequence of dictionaries, one per row.
I've tried many different list, dictionary, and for-loop combinations, but I just can't get it right.
def validate_profits(datafile):  # this will remove non-numeric profit rows so we can get a count of our useful dataset
    data = open(datafile)  # Open then parse data.
    parse = csv.DictReader(data)
    newfile = open("validated.csv", "w", newline="")  # New file for output.
    output = csv.writer(newfile)
    outputlist = []
    for rows in parse:  # Looping through the CSV to check each profit column.
        try:
            float(rows["Profit (in millions)"])  # This is the validation for the Profit column.
            outputlist.append(rows)
        except ValueError:
            pass
    counter = 0
    while True:
        try:
            counter += 1
            output.writerows([[outputlist[counter]]])  # Output the numerically valid rows to a new file.
        except IndexError:
            break
    count_rows("validated.csv")

validate_profits("data.csv")
If your sole job is "write only those rows with a valid numeric value in the Profit column", then it's just this:
def validate_profits(datafile):
    # This will remove non-numeric profit rows so we can get a count of our useful dataset.
    data = open(datafile)  # Open then parse data.
    count = 0
    parse = csv.DictReader(data)
    newfile = open("validated.csv", "w", newline="")  # New file for output.
    output = csv.DictWriter(newfile, fieldnames=parse.fieldnames)
    output.writeheader()  # keep the header row in the new file
    for row in parse:
        try:
            _ = float(row["Profit (in millions)"])
            output.writerow(row)
            count += 1
        except ValueError:
            pass
    return count
Many people would use a regular expression to test the contents of that field rather than relying on an exception from float, but this works.
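For instance, a minimal sketch of the regex approach, dropped into the same loop; the pattern here is an assumption and won't accept thousands separators or scientific notation, so adjust it to your data:

import re

numeric = re.compile(r'^[+-]?\d+(\.\d+)?$')  # optional sign, digits, optional decimal part

for row in parse:
    if numeric.match(row["Profit (in millions)"].strip()):
        output.writerow(row)
        count += 1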
I rewrote your function to fix a few issues and to suggest a somewhat more Pythonic way to do what you want (the with blocks close the files automatically, even if an error occurs):
def validate_profits(datafile):
    with open(datafile, 'r', encoding='utf-8', newline='') as f:  # Open then parse data.
        parsed = csv.DictReader(f)
        with open("validated.csv", "w", encoding='utf-8', newline='') as newfile:
            output = csv.DictWriter(newfile, fieldnames=parsed.fieldnames)
            output.writeheader()
            outputlist = []
            for row in parsed:
                try:
                    float(row["Profit (in millions)"])
                    outputlist.append(row)
                except ValueError:
                    pass
            output.writerows(outputlist)
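For completeness, you would call it the same way as in the question, then run your existing row counter on the result:

validate_profits("data.csv")
count_rows("validated.csv")  # count_rows is the question's own helper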
I am new to Python and I am trying to read .csv files sequentially in a for or while loop (I have about 250 of them). The aim is to read each .csv file and keep only the rows where a specific column (let's call it "wanted_column") is non-empty, retaining all columns for those rows. Then, save those rows to a new .csv file.
Hence, at the end, I want 250 new .csv files, each containing all columns for every row that has a non-empty element in the "wanted_column".
Hopefully this is clear. I would appreciate any ideas.
George
I wrote the code below just to give you an idea of how to do it. Beware that it does not check for any errors: its behavior is undefined if one of your CSV files is empty, if a file can't be found, or if the column you specify doesn't exist in one of the files. There could be more cases, so you would want to build error checking around it. Also, your CSV formatting depends heavily on the Python csv package's dialect settings.
Now to the code explanation. The "paths" parameter accepts a string, a tuple, or a list; a string is converted to a one-element tuple. Pass it the file(s) you want to work with.
The "column" parameter should be a string. You would need to add error checking for it if needed.
As for the routine itself: the function reads each CSV file in the paths list. For each file, it first reads the header line and saves it to a variable (rowFields).
It then builds a dict (fields) mapping each column name (key) to its position (value); that dict is used to look up the target column's position by name. Alternatively, you could scan the header once and, when a field matches the column name, save its position; that saved index could then be used directly instead of repeatedly searching the dict by name. That alternative should be the fastest (a sketch of it appears after the code below).
After that, it reads each row of the CSV file until the end. For each row, it checks whether the string in the column named by the "column" parameter has a length greater than zero; if so, it appends that row to contentRows.
Once the function has finished reading a CSV file, it writes the contents of rowFields and contentRows to the CSV file named by the outfile variable. To make it easy for me, outfile is simply the input file name plus ".new"; you can change that.
import csv

def getNoneEmpty(paths, column):
    if isinstance(paths, str):
        paths = (paths, )
    if not isinstance(paths, list) and not isinstance(paths, tuple):
        raise TypeError("paths has to be a string, list, or tuple")

    quotechar = '"'
    delimiter = ","
    lineterminator = "\n"

    for f in paths:
        outfile = f + ".new"  # change this area to how you want to generate the new file
        fields = {}
        rowFields = None
        contentRows = []

        with open(f, newline='') as csvfile:
            csvReader = csv.reader(csvfile, delimiter=delimiter, quotechar=quotechar, lineterminator=lineterminator)
            rowFields = next(csvReader)
            for i in range(0, len(rowFields)):
                fields[rowFields[i]] = i
            for row in csvReader:
                if len(row[fields[column]]) != 0:
                    contentRows.append(row)

        with open(outfile, 'w', newline='') as csvfile:
            csvWriter = csv.writer(csvfile, delimiter=delimiter, quotechar=quotechar, quoting=csv.QUOTE_MINIMAL, lineterminator=lineterminator)
            csvWriter.writerow(rowFields)
            csvWriter.writerows(contentRows)

getNoneEmpty(["test.csv", "test2.csv"], "1958")
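As an aside, here is a minimal sketch of the faster alternative described above: scan the header once, cache the column position, and index rows directly instead of consulting the fields dict every time. It drops into the same with block, reusing rowFields, csvReader, column, and contentRows:

colPos = None
for i, name in enumerate(rowFields):
    if name == column:
        colPos = i  # remember the position once...
        break
for row in csvReader:
    if len(row[colPos]) != 0:  # ...then index directly on each row
        contentRows.append(row)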
test.csv content:
"Month","1958","1959","1960"
"JAN",115,360,417
"FEB",318,342,391
"MAR",362,406,419
"APR",348,396,461
"MAY",11,420,472
"JUN",124,472,535
"JUL",158,548,622
"AUG",505,559,606
"SEP",404,463,508
"OCT",,407,461
"NOV",310,362,390
"DEC",110,405,432
test2.csv content:
"Month","1958","1959","1960"
"JAN",,360,417
"FEB",318,342,391
"MAR",362,406,419
"APR",348,396,461
"MAY",,420,472
"JUN",,472,535
"JUL",,548,622
"AUG",505,559,606
"SEP",404,463,508
"OCT",,407,461
"NOV",310,362,390
"DEC",110,405,432
Hopefully it will work:
import csv

def main():
    temp = []
    with open(r'old_csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=';')
        for row in csv_reader:
            for x in row:
                temp.append(x)
    with open(r'new_csv', mode='w', newline='') as new_file:
        writer = csv.writer(new_file, delimiter=',', lineterminator='\n')
        for col in temp:
            list_ = col.split(',')
            writer.writerow(list_)
I have these huge CSV files that I need to validate; I need to make sure they are all delimited by backticks (`). I have a reader opening each file and printing its content. I'm just wondering about the different ways you all would go about validating that each value is delimited by the backtick character.
for csvfile in self.fullcsvpathfiles:
    # print("file..")
    with open(self.fullcsvpathfiles[0], mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        for row in csv_reader:
            print(row)
Not sure how to go about validating that each value is separated by a backtick and throwing an error otherwise. These tables are huge (not that that's a problem for electricity ;) ).
Method 1
With the pandas library, you could use the pandas.read_csv() function to read the csv file with sep='`' (which specifies the delimiter). If it parses the file into a dataframe in good shape, then you can be fairly confident the file is good.
Also, to automate the validation process, you could check whether the number of NaN values in the dataframe is within an acceptable level. Assuming your csv files do not have many blanks (so only a few NaN values are expected), you could compare the number of NaN values against a threshold you set.
import pandas as pd

nan_threshold = 20

for csvfile in self.fullcsvpathfiles:
    my_df = pd.read_csv(csvfile, sep="`")  # if it fails at this step, then something (probably the delimiter) must be wrong
    nans = my_df.isnull().sum().sum()  # total NaN count across the whole dataframe
    if nans > nan_threshold:
        print(csvfile)  # make some warning here
Refer to this page for more information about pandas.read_csv().
Method 2
As mentioned in the comments, you could also check that the delimiter occurs the same number of times in each line of the file.
num_of_sep = -1  # initial value

# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num  # take the first line's count as the reference
    elif num != num_of_sep:
        print('Some warning here')
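Putting that together with your loop, a minimal sketch (assuming self.fullcsvpathfiles holds the paths, as in your snippet); note that this counts every backtick, including any that might appear inside quoted fields:

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        num_of_sep = -1
        for line in f:
            num = line.count("`")
            if num_of_sep == -1:
                num_of_sep = num  # reference count from the first line
            elif num != num_of_sep:
                raise ValueError("Inconsistent delimiter count in " + csvfile)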
If you don't know how many columns are in a file, you could check that all the rows have the same number of columns; if you expect the header (first row) to always be correct, use it to determine the number of columns.
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        ncols = len(next(csv_reader))
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        ncols = row.count('`')  # really the delimiter count, but it's consistent line to line
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know how many columns are in a file...
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        # figure out how many columns it is supposed to have here
        ncols = special_process()
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    # figure out how many columns it is supposed to have here
    ncols = special_process()
    with open(csvfile, mode='r') as f:
        if not all(row.count('`') == ncols - 1 for row in f):  # n columns -> n - 1 delimiters
            pass  # do something
If you know the number of expected elements, you could inspect each line
with open(filename, 'r') as f:
    for line in f:
        fields = line.split("`")
        if len(fields) != numElements:
            raise Exception("Bad file")
If you know which delimiter is being accidentally inserted, you could also try to recover instead of throwing an exception. Perhaps something like:

fields = "`".join(fields).replace(wrongDelimiter, "`").split("`")

Of course, once you're that far into reading the file yourself, there's no great need for an external library to parse the data; just go ahead and use what you've read.
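A sketch of that recovery in context (wrongDelimiter and numElements are placeholders for whatever stray character and column count you expect):

with open(filename, 'r') as f:
    for line in f:
        fields = line.rstrip("\n").split("`")
        if len(fields) != numElements:
            # rejoin, replace the stray delimiter, and split again
            fields = "`".join(fields).replace(wrongDelimiter, "`").split("`")
        if len(fields) != numElements:
            raise Exception("Bad file")  # recovery didn't help either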
I am new to Python. I have a csv file that is generated in the format below:
Timestamp for usage of CPU
1466707823 1466707828 1466707833
Percent use for CPU# 0
0.590551162 0.588235305 0.59055119
Percent use for CPU# 1
7.874015497 7.843137402 7.67716547
But I need to generate a csv file in this format:
Timestamp for usage of CPU Percent use for CPU# 0 Percent use for CPU# 1
1466707823 0.590551162 7.874015497
1466707828 0.588235305 7.843137402
1466707833 0.59055119 7.67716547
I have no idea how to proceed further. Could anyone please help me out with this?
It seems like the simplest way to do it would be to first read and convert the data in the input file into a list of lists with each sublist corresponding to a column of data in the output csv file. The sublists will start off with the column's header and then be followed by the values associated with it from the next line.
Once that is done, the built-in zip() function can be used to transpose the data matrix created. This operation effectively turns the columns of data it contains into the rows of data needed for writing out to the csv file:
import csv

def is_numeric_string(s):
    """ Determine if argument is a string representing a numeric value. """
    for kind in (int, float, complex):
        try:
            kind(s)
        except (TypeError, ValueError):
            pass
        else:
            return True
    else:
        return False

columns = []

with open('not_a_csv.txt') as f:
    for line in (line.strip() for line in f):
        fields = line.split()
        if fields:  # non-blank line?
            if is_numeric_string(fields[0]):
                columns[-1] += fields  # add fields to the most recent column
            else:  # start a new column with this line as its header
                columns.append([line])

rows = zip(*columns)  # transpose

with open('formatted.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)
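One caveat worth knowing: zip() stops at the shortest sublist, so if one section of the input has fewer values than the others, whole rows are silently dropped. If that can happen in your data, itertools.zip_longest pads the short columns instead:

from itertools import zip_longest

rows = zip_longest(*columns, fillvalue='')  # pad short columns with empty strings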
This may seem like an odd thing to do, but I essentially have a csv file that has some values of '0' in quite a number of cells.
How would I, in Python, convert these numbers to read as something like 0.00 instead of just 0? I have a script in ArcMap that needs to read the values as doubles rather than short integers, and the '0' values really mess that up.
I am new to the CSV module, so I am not sure where to go with this. Any help with making a script that converts my values so that when I open the new CSV it reads "0.00" rather than '0' would be greatly appreciated.
I would have liked to have some code to give you as an example, but I am at a loss.
Here's a short script that will read a CSV file, convert any numbers to floats and then write it back to the same file again.
import csv
import sys

# These indices won't be converted
dont_touch = [0]

def convert(index, value):
    if not index in dont_touch:
        try:
            return float(value)
        except ValueError:
            # Not parseable as a number
            pass
    return value

table = []

with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f, delimiter=","):
        for i in range(len(row)):
            row[i] = convert(i, row[i])
        table.append(row)

with open(sys.argv[1], "w", newline='') as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerows(table)
If you have any columns that should not be converted, specify their indices in the dont_touch array.
If you want the values to have two decimal places, you can play around with format strings instead:
return "{:.02f}".format(float(value))
You can format the 0s and then write them out. You may want to look into the appropriate quoting for your csv (e.g. you may need quoting=csv.QUOTE_NONE in your writer object):
# fr and fw are the already-open input and output file objects
reader = csv.reader(fr)
writer = csv.writer(fw)
for row in reader:
    writer.writerow([f if f != '0' else '{:0.2f}'.format(0) for f in row])