I am new to Python and I am trying to read .csv files sequentially in a for or while loop (I have about 250 .csv files). The aim is to read each .csv file and keep only the rows in which a specific column (let's call it "wanted_column") is non-empty, and then save those rows, with all of their columns, in a new .csv file.
Hence, at the end, I want to have 250 new .csv files, each containing all columns for every row that has a non-empty value in "wanted_column".
Hopefully this is clear. I would appreciate any ideas.
George
I wrote the code below just to give you an idea of how to do it. Beware that it does no error checking: its behavior is undefined if one of your CSV files is empty, if a file cannot be found, or if the column you name does not exist in one of the files. There could be more cases, so you will want to build checks around it. Also, how the output CSV is formatted depends largely on Python's csv package.
Now to the code explanation. The "paths" variable can be a string, a tuple, or a list. If you give it a string, it is converted to a one-element tuple. Use it to pass the file(s) that you want to work with.
The "column" variable should be a string. Add error checking for it if needed.
The function reads every CSV file in the paths list. For each file, it reads the first line and saves its content to a variable (rowFields).
After that, it builds the header dict (fields), mapping key (column name) to value (position). That dict is used to look up the column position by name. Alternatively, you could go through the fields once and, when a field matches the column name, save its index as the column position. That position could then be used directly instead of looking up the dict by name for every row. The latter method should be the fastest.
It then reads each row of the CSV file until the end. For each row, it checks whether the string in the column named by the "column" variable has a length greater than zero. If so, it appends that row to the variable contentRows.
Once the function is done reading the CSV file, it writes the contents of "rowFields" and "contentRows" to the CSV file named by the "outfile" variable. To keep it simple, outfile is just the input file name + ".new". You can change that.
import csv

def getNoneEmpty( paths, column ):
    if isinstance(paths, str):
        paths = (paths, )
    if not isinstance(paths, list) and not isinstance(paths, tuple):
        # raising a bare string is itself an error in Python 3, so raise a real exception
        raise TypeError("paths has to be a string, list, or tuple")
    quotechar='"'
    delimiter=","
    lineterminator="\n"
    for f in paths:
        outfile = f + ".new" # change this area to how you want to generate the new file
        fields = {}
        rowFields = None
        contentRows = []
        with open(f, newline='') as csvfile:
            csvReader = csv.reader(csvfile, delimiter=delimiter, quotechar=quotechar, lineterminator=lineterminator)
            rowFields = next(csvReader)
            # map each column name to its position
            for i in range(0, len(rowFields)):
                fields[rowFields[i]] = i
            # keep only the rows whose wanted column is non-empty
            for row in csvReader:
                if len(row[fields[column]]) != 0:
                    contentRows.append(row)
        # newline='' stops the csv module's line endings from being translated again
        with open(outfile, 'w', newline='') as csvfile:
            csvWriter = csv.writer(csvfile,delimiter=delimiter, quotechar=quotechar,quoting=csv.QUOTE_MINIMAL, lineterminator=lineterminator)
            csvWriter.writerow(rowFields)
            csvWriter.writerows(contentRows)

getNoneEmpty(["test.csv","test2.csv"], "1958")
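The alternative described in the explanation (find the column position once, then reuse it for every row) could be sketched roughly as follows. This is a minimal sketch, not a tested replacement for the function above: the names filter_rows, filter_file and filter_directory, the "filtered_" output prefix and the "*.csv" pattern are all assumptions made for illustration, and unlike the code above it treats whitespace-only cells as empty.

```python
import csv
import glob
import os


def filter_rows(rows, column):
    """Yield the header, then every row whose `column` cell is non-empty."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return  # empty input: nothing to yield
    position = header.index(column)  # looked up once; raises ValueError if the column is missing
    yield header
    for row in rows:
        # Assumption: short rows and whitespace-only cells count as empty.
        if len(row) > position and row[position].strip():
            yield row


def filter_file(in_path, out_path, column):
    """Copy in_path to out_path, keeping only rows with a non-empty `column`."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(filter_rows(csv.reader(src), column))


def filter_directory(directory, column):
    """Hypothetical driver: write filtered_<name> next to each .csv in directory."""
    for in_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        name = os.path.basename(in_path)
        if name.startswith("filtered_"):
            continue  # do not re-filter earlier output
        filter_file(in_path, os.path.join(directory, "filtered_" + name), column)
```

With about 250 files in one folder, a call such as filter_directory("data", "wanted_column") would be the whole loop; "data" here is a made-up folder name.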
test.csv content:
"Month","1958","1959","1960"
"JAN",115,360,417
"FEB",318,342,391
"MAR",362,406,419
"APR",348,396,461
"MAY",11,420,472
"JUN",124,472,535
"JUL",158,548,622
"AUG",505,559,606
"SEP",404,463,508
"OCT",,407,461
"NOV",310,362,390
"DEC",110,405,432
test2.csv content:
"Month","1958","1959","1960"
"JAN",,360,417
"FEB",318,342,391
"MAR",362,406,419
"APR",348,396,461
"MAY",,420,472
"JUN",,472,535
"JUL",,548,622
"AUG",505,559,606
"SEP",404,463,508
"OCT",,407,461
"NOV",310,362,390
"DEC",110,405,432
Hopefully it will work:
import csv

def main():
    temp = []
    # collect every cell of the ';'-delimited input file
    with open(r'old_csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=';')
        for row in csv_reader:
            for x in row:
                temp.append(x)
    # split each collected cell on ',' and write it out as one row
    with open(r'new_csv', mode='w') as new_file:
        writer = csv.writer(new_file, delimiter=',', lineterminator='\n')
        for col in temp:
            list_ = col.split(',')
            writer.writerow(list_)
I have these huge CSV files that I need to validate; I need to make sure they are all delimited by the backtick character (`). I have a reader opening each file and printing its contents. I am just wondering what different ways you would go about validating that each value is delimited by the backtick character.
for csvfile in self.fullcsvpathfiles:
    #print("file..")
    with open(self.fullcsvpathfiles[0], mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter = "`")
        for row in csv_reader:
            print (row)
Not sure how to go about validating that each value is separated by a backtick and throwing an error otherwise. These tables are huge (not that that's a problem for electricity ;) )
Method 1
With the pandas library you could use the pandas.read_csv() function to read the csv file with sep='`' (which specifies the delimiter). If it parses the file into a well-shaped dataframe, then you can be almost sure the file is good.
Also, to automate the validation, you could check whether the number of NaN values in the dataframe is within an acceptable level. Assuming your csv files do not have many blanks (so only a few NaN values are expected), you could compare the number of NaN values with a threshold you set.
import pandas as pd
nan_threshold = 20
for csvfile in self.fullcsvpathfiles:
    my_df = pd.read_csv(csvfile, sep="`") # if it fails at this step, then something (probably the delimiter) must be wrong
    nans = my_df.isnull().sum().sum() # first sum is per column, second gives the total NaN count
    if nans > nan_threshold:
        print(csvfile) # make some warning here
Refer to the pandas documentation for more information about pandas.read_csv().
Method 2
As mentioned in the comments, you could also check whether the number of occurrences of the delimiter is the same in every line of the file.
num_of_sep = -1 # initial value
# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num
    elif num != num_of_sep:
        print('Some warning here')
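To report which lines are off rather than only printing a warning, the same counting idea could be wrapped in a small function. This is a sketch under assumptions: find_inconsistent_lines is a made-up name, the first line is taken as the reference, and raw characters are counted, so a backtick inside a quoted value would be flagged as well.

```python
def find_inconsistent_lines(lines, sep="`"):
    """Return 1-based numbers of lines whose separator count differs from the first line's."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        count = line.count(sep)
        if expected is None:
            expected = count  # the first line sets the expectation
        elif count != expected:
            bad.append(number)
    return bad


# In-memory sample; an open file object can be passed in the same way.
sample = ["a`b`c\n", "1`2`3\n", "4`5\n", "6`7`8\n"]
print(find_inconsistent_lines(sample))  # [3]
```

Because it only iterates over the lines once and keeps just the bad line numbers, it should cope with huge files without loading them into memory.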
If you don't know how many columns are in a file, you could check that all the rows have the same number of columns. If you expect the header (first row) to always be correct, use it to determine the number of columns.
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        # csv.reader yields lists, so len(row) is the number of columns
        csv_reader = csv.reader(csv_file, delimiter = "`")
        ncols = len(next(csv_reader))
        if not all(len(row)==ncols for row in csv_reader):
            pass #do something
# or, counting the delimiters directly:
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        ncols = row.count('`') # delimiter count of the header line
        if not all(row.count('`')==ncols for row in f):
            pass #do something
If you know how many columns are in a file...
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        #figure out how many columns it is supposed to have here?
        ncols = special_process()
        csv_reader = csv.reader(csv_file, delimiter = "`")
        if not all(len(row)==ncols for row in csv_reader):
            pass #do something
# or, counting the delimiters directly:
for csvfile in self.fullcsvpathfiles:
    #figure out how many columns it is supposed to have here?
    ncols = special_process()
    with open(csvfile, mode='r') as f:
        # a row with ncols columns contains ncols - 1 delimiters
        if not all(row.count('`')==ncols-1 for row in f):
            pass #do something
If you know the number of expected elements, you could inspect each line
with open(filename,'r') as f:
    for line in f:
        line=line.split("`")
        # compare the number of fields, not the list itself
        if len(line)!=numElements:
            raise Exception("Bad file")
If you know the delimiter that is being accidentally inserted, you could also try to recover instead of throwing exception. Perhaps something like:
line="`".join(line).replace(wrongDelimiter,"`").split("`")
Of course, once you're that far into reading the file, there's no great need for using an external library to read the data. Just go ahead and use it.
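The recovery idea could be sketched as a small helper that repairs one line and still fails loudly when the field count is wrong afterwards. The name repair_line and its parameters are assumptions for illustration, and the approach is only safe if the wrong delimiter never appears legitimately inside a value.

```python
def repair_line(line, wrong_delimiter, expected_fields, sep="`"):
    """Split line on sep, treating wrong_delimiter as sep; raise if the count is still off."""
    fields = line.rstrip("\r\n").replace(wrong_delimiter, sep).split(sep)
    if len(fields) != expected_fields:
        raise ValueError(
            "expected %d fields, got %d: %r" % (expected_fields, len(fields), line)
        )
    return fields


# One value was separated with a comma instead of a backtick.
print(repair_line("a`b,c\n", ",", 3))  # ['a', 'b', 'c']
```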
I am looking to take a CSV file and process it using Python 2.7 so that each row ends up with an individual value, based on two columns for a block and a lot. My data currently looks like the link below:
Beginning
Based on the lot value, I want to use Python to automatically create extra lines in a new CSV, so that the values look like this when written out to the new CSV:
End Result
So I know that I need to read each row and, based on the cell value in the lot column, if there is a "," then the row is copied to the next row in the other CSV with only the value before the first comma, then again with the second value, the third, etc.
After the commas are separated out, the ranges will be handled in a similar way in a third CSV. If there is a single value, the whole row will be copied as is.
Thank you for the help in advance.
This should work.
On Windows open files in binary mode or else you get double new lines.
I assumed fields are separated by ; because cells contain ,
First split by ,, then check for ranges
print line is for debugging
Error checking is left as an exercise for the reader.
Code:
import csv
file_in = csv.reader(open('input.csv', 'rb'), delimiter=';')
file_out = csv.writer(open('output.csv', 'wb'), delimiter=';')
for i, line in enumerate(file_in):
    if i == 0:
        # write header
        file_out.writerow(line)
        print line
        continue
    for j in line[1].split(','):
        if len(j.split('-')) > 1:
            # lines with -
            start = int(j.split('-')[0])
            end = int(j.split('-')[1])
            for k in xrange(start, end + 1):
                line[1] = k
                file_out.writerow(line)
                print line
        else:
            # lines with ,
            line[1] = j
            file_out.writerow(line)
            print line
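For readers on Python 3, the same split-then-expand logic could be sketched without touching files, which makes it easier to test. This is only a sketch: the names expand_lots and expand_rows are made up, the lot column is assumed to be at index 1 as in the code above, and lot values are assumed to be non-negative integers with ranges written as start-end.

```python
def expand_lots(lot_cell):
    """Expand a lot cell such as '1,3-5' into ['1', '3', '4', '5']."""
    lots = []
    for part in lot_cell.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces such as a trailing comma
        if "-" in part:
            start, end = part.split("-", 1)
            lots.extend(str(n) for n in range(int(start), int(end) + 1))
        else:
            lots.append(part)
    return lots


def expand_rows(rows, lot_index=1):
    """Yield the header unchanged, then one row per expanded lot value."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    yield header
    for row in rows:
        for lot in expand_lots(row[lot_index]):
            # Build a new list each time instead of mutating the input row.
            yield row[:lot_index] + [lot] + row[lot_index + 1:]


print(expand_lots("1,3-5"))  # ['1', '3', '4', '5']
```

The rows yielded by expand_rows(csv.reader(...)) can be passed straight to csv.writer(...).writerows(), with both files opened using newline='' in Python 3.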
I need to filter and do some math on data coming from CSV files.
I've written a simple Python script to isolate the rows I need (they should contain certain keywords like "Kite"), but my script does not work and I can't find out why. Can you tell me what is wrong with it? Another thing: once I get to the chosen row(s), how can I point to each (comma-separated) column?
Thanks in advance.
R.
import csv
with open('sales-2013.csv', 'rb') as csvfile:
    sales = csv.reader(csvfile)
    for row in sales:
        if row == "Kite":
            print ",".join(row)
You are opening the file in binary mode. In Python 3 that yields bytes, so change the call to open('filepathAndName.csv', 'r') or convert your strings, like "Kite".encode('UTF-8'). The second mistake is that csv.reader returns each row as a list of cells, so row == "Kite" compares a whole list with a string and never matches. In this case you have to use if "Kite" in row:, which is true when one of the cells equals "Kite".
with open('sales-2013.csv', 'rb') as csvfile: # <- change 'rb' to 'r'
    sales = csv.reader(csvfile)
    for row in sales:
        if row == "Kite": # <- this would be better: if "Kite" in row:
            print ",".join(row)
Read this:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
To find the rows that contain the word "Kite", you should use
for row in sales: # here you iterate over every row (a *list* of cells)
    if "Kite" in row:
        pass # do stuff
Now that you know how to find the required rows, you can access the desired cells by indexing the rows. For example, if you want to select the second cell of a row, you simply do
cell = row[1] # remember, indexes start with 0
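Putting the two steps together (select rows, then index into them) might look like the sketch below, written for Python 3. Note that "Kite" in row only matches a cell that is exactly "Kite"; the sketch instead matches any cell that contains the keyword as a substring. The sample data, the column layout and the name rows_containing are made up for illustration.

```python
import csv
import io


def rows_containing(rows, keyword):
    """Yield rows in which any cell contains keyword as a substring."""
    for row in rows:
        if any(keyword in cell for cell in row):
            yield row


# Made-up sample standing in for sales-2013.csv (assumed layout: item, units).
sample = io.StringIO("item,units\nKite,3\nBall,5\nStunt Kite,2\n")
kites = list(rows_containing(csv.reader(sample), "Kite"))

# Cells are strings, so convert before doing any math on them.
total_units = sum(int(row[1]) for row in kites)

print(kites)        # [['Kite', '3'], ['Stunt Kite', '2']]
print(total_units)  # 5
```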
I am an absolute programming novice trying to work with some csv files. Though what I am trying to do overall is more complex, I am currently stuck on this problem:
The csv files I have contain a fixed number of 'columns' and a variable number of rows. What I want to do is open each csv file in a directory, store the file's values in a 2D list in memory, and then pull one 'column' of data from that list. By doing this in a loop, I could build up a list with one column of data from each csv file.
When I do this for a single file, it works:
csvFile = 'testdata.csv'
currentFile = csv.reader(open(csvFile), delimiter=';')
errorValues = []
for data in currentFile:
    rows = [r for r in currentFile] #Store current csv file into a 2d list
    errorColumn = [row[34] for row in rows] #Get position 34 of each row in 2D list
    errorColumn = filter(None, errorColumn) #Filter out empty strings
    errorValues.append(errorColumn) #Append one 'column' of data to overall list
When I try to loop it for all files in my directory, I get a 'list index out of range' error:
dirListing = os.listdir(os.getcwd())
errorValues = []
for dataFile in dirListing:
    currentFile = csv.reader(open(dataFile), delimiter=';')
    for data in currentFile:
        rows = [r for r in currentFile] #Store current csv file into a 2d list
        errorColumn = [row[34] for row in rows] #Get position 34 of each row in 2D list
        errorColumn = filter(None, errorColumn) #Filter out empty strings
        errorValues.append(errorColumn) #Append one 'column' of data to overall list
        errorColumn = [] #Clear out errorColumn for next iteration
The error occurs at 'errorColumn = [row[34] for row in rows]'. I have tried all sorts of ways to do this, always failing to an index out of range error. The fault is not with my csv files as I have used the working script to test them one by one. What could be the problem?
Many thanks for any help.
I'm a bit surprised that the error you mention is at the [r for r in currentFile]. At worst, your rows list would be empty...
Are you 100% sure all your lines have at least 35 columns? That you don't have an empty line somewhere? At the very end? It'd be worth checking whether
errorColumn = [row[34] for row in rows if row]
still gives an error, provided that you first get rid of the for data in currentFile line (which you don't use and which, more importantly, consumes your currentFile, leaving you with rows==[]).
The for loop goes through the lines of the CSV file. Each line is converted to a row of elements by the reader, so the data variable in the loop is already a row. The list comprehension inside the loop also iterates through the same open file. This is wrong.
There is a problem with your open(). The file must be opened in binary mode (in Python 2).
Try the following (I did not put everything you wanted inside):
import csv
import os

dirListing = os.listdir(os.getcwd())
errorValues = []
rows = [] # empty array of rows initially
for fname in dirListing:
    f = open(fname, 'rb') # open in binary mode (see the doc)
    reader = csv.reader(f, delimiter=';')
    errorColumn = [] # initialized for the file
    for row in reader:
        rows.append(row) #Store current csv file into a 2d list
        if len(row) > 34:
            errorColumn.append(row[34]) #Get position 34 of each row in 2D list
    errorValues.append(errorColumn)
    f.close() # you should always close your files
Beware! os.listdir() also returns the names of subdirectories. Try adding
if os.path.isfile(fname):
    ...
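Combining the isfile check with the length guard, a Python 3 version of the whole loop might look like the sketch below. The names column_values and collect_column are assumptions, as is the extra filter on the .csv extension (the original loop opens every directory entry, including the script itself).

```python
import csv
import os


def column_values(rows, index):
    """Return the non-empty values at position index, skipping rows that are too short."""
    return [row[index] for row in rows if len(row) > index and row[index]]


def collect_column(directory, index, delimiter=";"):
    """Return one list of values per regular .csv file found in directory."""
    collected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.lower().endswith(".csv")):
            continue  # skip subdirectories and unrelated files
        # In Python 3 the csv module wants text mode with newline="".
        with open(path, newline="") as f:
            collected.append(column_values(csv.reader(f, delimiter=delimiter), index))
    return collected
```

A call such as collect_column(os.getcwd(), 34) would then give one list per file, with empty lines and short rows skipped instead of raising an index error.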
By the way, you should clearly describe what your actual goal is. There may be a better way to solve it. You may be mentally fixed on the solution that first came to your mind. Take advantage of this medium to have more eyes and more heads suggest a solution.