I am trying to determine the type of data contained in each column of a .csv file so that I can make CREATE TABLE statements for MySQL. The program makes a list of all the column headers and then grabs the first row of data and determines each data type and appends it to the column header for proper syntax. For example:
ID Number Decimal Word
0 17 4.8 Joe
That would produce something like CREATE TABLE table_name (ID int, Number int, Decimal float, Word varchar());.
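As a rough illustration of that mapping (a sketch I put together, not the asker's actual code; sql_type and create_table are hypothetical helper names, and varchar(255) stands in for the empty varchar() above):

```python
# sql_type and create_table are hypothetical helpers for illustration only.
def sql_type(value):
    try:
        int(value)
        return 'int'
    except ValueError:
        pass
    try:
        float(value)
        return 'float'
    except ValueError:
        return 'varchar(255)'

def create_table(name, headers, sample_row):
    # Pair each header with the inferred type of the sample value below it
    cols = ', '.join('{} {}'.format(h, sql_type(v))
                     for h, v in zip(headers, sample_row))
    return 'CREATE TABLE {} ({});'.format(name, cols)

print(create_table('table_name',
                   ['ID', 'Number', 'Decimal', 'Word'],
                   ['0', '17', '4.8', 'Joe']))
# CREATE TABLE table_name (ID int, Number int, Decimal float, Word varchar(255));
```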
The problem is that in some of the .csv files the first row contains a NULL value that is read as an empty string and messes up this process. My goal is then to search each row until one is found that contains no NULL values, and use that one when forming the statement. This is what I have done so far, but it sometimes still returns rows that contain empty strings:
def notNull(p): # where p is a .csv file that has been read in another function
    tempCol = next(p)
    tempRow = next(p)
    col = tempCol[:-1]
    row = tempRow[:-1]
    if any('' in row for row in p):
        tempRow = next(p)
        row = tempRow[:-1]
    else:
        rowNN = row
    return rowNN
Note: The .csv file reading is done in a different function, whereas this function simply uses the already read .csv file as input p. Also each row is ended with a , that is treated as an extra empty string so I slice the last value off of each row before checking it for empty strings.
Question: What is wrong with the function that I created that causes it to not always return a row without empty strings? I feel that it is because the loop is not repeating itself as necessary but I am not quite sure how to fix this issue.
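For what it's worth, the intended behaviour can be sketched so that the loop keeps advancing until a clean row turns up (an illustrative rewrite under the question's own assumptions: p is an iterator of already-parsed rows, each ending in an extra empty string):

```python
def notNull(p):
    # p: iterator of csv rows; each row ends with an extra empty string
    next(p)                    # skip the header row
    for tempRow in p:
        row = tempRow[:-1]     # slice off the trailing empty cell
        if '' not in row:      # keep advancing until no empty strings remain
            return row
    return None                # no fully-populated row found
```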
I cannot really decipher your code. This is what I would do to get only the rows without empty strings.
import csv
def g(name):
    with open(name, 'r') as f:
        r = csv.reader(f)
        # Skip the header row
        next(r)
        for row in r:
            if '' not in row:
                yield row
for row in g('file.csv'):
    print('row without empty values: {}'.format(row))
I'm trying to create an array in Python, so I can access the last cell in it without defining how many cells there are in it.
Example:
from csv import reader
a = []
i = -1
with open("ccc.csv", "r") as f:
    csv_reader = reader(f)
    for row in csv_reader:
        a[i] = row
        i = i - 1
Here I'm trying to take the first row in the CSV file and put it in the last cell on the array, in order to put it in reverse order on another file.
In this case, I don't know how many rows are in the CSV file, so I cannot size the array to match the number of rows in the file.
I tried to use a.append(row), but that keeps the rows in file order, with the first row in the first cell of the array, and I want the first row to end up in the last cell of the array.
Read all the rows in the normal order, and then reverse the list:
from csv import reader
with open('ccc.csv') as f:
    a = list(reader(f))
a.reverse()
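If the goal is a reversed copy on disk, the same idea can be sketched end to end (the sample rows and the output name reversed.csv are my assumptions, added so the sketch is self-contained):

```python
import csv

# Create a tiny sample 'ccc.csv' first so the sketch runs on its own
with open('ccc.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['a', '1'], ['b', '2'], ['c', '3']])

# Read all rows in the normal order, then write them back out reversed
with open('ccc.csv', newline='') as f:
    rows = list(csv.reader(f))

with open('reversed.csv', 'w', newline='') as out:
    csv.writer(out).writerows(reversed(rows))
```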
First up, your current code is going to raise an IndexError, because the list starts out empty, so a[-1] points to nothing at all.
The function you're looking for is list.insert which it inherits from the generic sequence types. list.insert takes two arguments, the index to insert a value in and the value to be inserted.
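A quick demonstration of those two arguments:

```python
a = ['b', 'c']
a.insert(0, 'a')            # index to insert at, then the value
assert a == ['a', 'b', 'c']
```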
To rewrite your current code for this, you'd end up with something like
from csv import reader
a = []
with open("ccc.csv", "r") as f:
    csv_reader = reader(f)
    for row in csv_reader:
        a.insert(0, row)
This would reverse the contents of the csv file, which you can then write to a new file or use as you need.
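One design note: list.insert(0, row) shifts every existing element on each call, so it is O(n) per row. For large files, collections.deque prepends in O(1) and may be worth considering; a small sketch:

```python
from collections import deque

# deque.appendleft is O(1); list.insert(0, ...) shifts every element each time
d = deque()
for row in [['a'], ['b'], ['c']]:
    d.appendleft(row)
assert list(d) == [['c'], ['b'], ['a']]   # same reversed order as insert(0, ...)
```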
I'm new to Python but I need help creating a script that will take in three different csv files, combine them together, remove duplicates from the first column as well as remove any rows that are blank, then change a revenue area to a number.
The three CSV files are setup the same.
The first column is a phone number and the second column is a revenue area (city).
The first column will need all duplicates & blank values removed.
The second column will have values like "Macon", "Marceline", "Brookfield", which will need to be changed to a specific value like:
Macon = 1
Marceline = 8
Brookfield = 4
And then if it doesn't match one of those values put a default value of 9.
Welcome to Stack Overflow!
Firstly, you'll want to be using the csv library for the "reader" and "writer" functions, so import the csv module.
Then, you'll want to open the new file to be written to, and use the csv.writer function on it.
After that, you'll want to define a set (I name it seen). This will be used to prevent duplicates from being written.
Write your headers (if you need them) to the new file using the writer.
Open your first old file, using csv module's "reader". Iterate through the rows using a for loop, and add the rows to the "seen" set. If a row has been seen, simply "continue" instead of writing to the file. Repeat this for the next two files.
To assign the values to the cities, you'll want to define a dictionary that holds the old names as the keys, and new values for the names as the values.
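That dictionary lookup pairs naturally with dict.get, whose second argument supplies the default of 9 for unmatched cities ('Kirksville' below is just a made-up example):

```python
# city_codes mirrors the mapping in the question; dict.get's second
# argument is the fallback when the key is missing
city_codes = {'Macon': 1, 'Marceline': 8, 'Brookfield': 4}
assert city_codes.get('Macon', 9) == 1
assert city_codes.get('Kirksville', 9) == 9   # unmatched city falls back to 9
```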
So, your code should look something like this:
import csv
myDict = {'Macon' : 1, 'Marceline' : 8, 'Brookfield' : 4}
seen = set()
newFile = open('newFile.csv', 'w', newline='') #newline='' prevents the writer from writing extra newlines, preventing empty rows.
writer = csv.writer(newFile)
writer.writerow(['Phone Number', 'City']) #This will write a header row for you.
#Open each file, read each row, skip empty rows, skip duplicate phone numbers, change value of "City" (default 9), write to new file.
for name in ('firstFile.csv', 'secondFile.csv', 'thirdFile.csv'):
    with open(name, 'r', newline='') as inFile:
        for row in csv.reader(inFile):
            if not any(row):
                continue            #skip blank rows
            if row[0] in seen:
                continue            #skip duplicates in the first column
            seen.add(row[0])
            row[1] = myDict.get(row[1], 9)
            writer.writerow(row)
#Close the output file
newFile.close()
I have not tested this myself, but it is very similar to two different programs that I wrote, and I have attempted to combine them into one. Let me know if this helps, or if there is something wrong with it!
-JCoder96
I am looking to take a CSV file and sort the file using python 2.7 to get an individual value based on two columns for a block and lot. My data looks like now in the link below:
Beginning
I want to be able on the lot value to create extra lines using Python to automate this into a new CSV where the values will look like this when drawn out on the new CSV
End Result
So I know that I need to read each row, and based on the cell value in the lot column: if there is a "," then the row will be copied into the other csv as several rows, each keeping all the values before the lot column, once for the first lot value, then the second, third, etc.
After the commas are separated out, the ranges will be handled in a similar way in a third CSV. If there is a single value, the whole row will be copied as is.
Thank you for the help in advance.
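If I follow the description, the lot expansion could be sketched like this (expand_lots is a hypothetical helper name, not from the question):

```python
def expand_lots(cell):
    """Expand a lot cell like '1,3-5' into ['1', '3', '4', '5']."""
    lots = []
    for part in cell.split(','):
        if '-' in part:                      # a range such as '3-5'
            start, end = part.split('-')
            lots.extend(str(n) for n in range(int(start), int(end) + 1))
        else:                                # a single lot value
            lots.append(part)
    return lots

assert expand_lots('1,3-5') == ['1', '3', '4', '5']
```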
This should work.
On Windows open files in binary mode or else you get double new lines.
I assumed rows are separated by ; because the cells contain ,
First split by ,, then check for ranges
print line is for debugging
Error checking is left as an exercise for the reader.
Code:
import csv
file_in = csv.reader(open('input.csv', 'rb'), delimiter=';')
file_out = csv.writer(open('output.csv', 'wb'), delimiter=';')
for i, line in enumerate(file_in):
    if i == 0:
        # write header
        file_out.writerow(line)
        print line
        continue
    for j in line[1].split(','):
        if len(j.split('-')) > 1:
            # lines with -
            start = int(j.split('-')[0])
            end = int(j.split('-')[1])
            for k in xrange(start, end + 1):
                line[1] = k
                file_out.writerow(line)
                print line
        else:
            # lines with ,
            line[1] = j
            file_out.writerow(line)
            print line
I am trying to read a csv file, and parse the data and return on row (start_date) only if the date is before September 6, 2010. Then print the corresponding values from row (words) in ascending order. I can accomplish the first half using the following:
import csv
with open('sample_data.csv', 'rb') as f:
    read = csv.reader(f, delimiter=',')
    for row in read:
        if row[13] <= '1283774400':
            print(row[13] + "\t \t" + row[16])
It returns the correct start_date range, and corresponding word column values, but they are not returning in ascending order which would display a message if done correctly.
I have tried to use the sort() and sorted() functions, after creating an empty list to populate then appending it to the rows, but I am just not sure where or how to incorporate that into the existing code, and have been terribly unsuccessful. Any help would be greatly appreciated.
Just read the list, filter it according to the < date criterion, and sort it using column 13 as an integer.
Note that a common mistake would be to sort as ASCII strings (which may appear to work), but integer conversion is really required to avoid sort problems.
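A tiny demonstration of why the ASCII comparison goes wrong:

```python
vals = ['9', '10', '2']
assert sorted(vals) == ['10', '2', '9']            # string compare: '1' < '2' < '9'
assert sorted(vals, key=int) == ['2', '9', '10']   # numeric compare: correct order
```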
import csv
with open('sample_data.csv', 'r') as f:
    read = csv.reader(f, delimiter=',')
    # csv has a title, we have to skip it (comment out if no title)
    title_row = next(read)
    # read csv and filter out to keep only earlier rows
    lines = filter(lambda row: int(row[13]) < 1283774400, read)
    # sort the filtered list according to column 13, as numerical
    slist = sorted(lines, key=lambda row: int(row[13]))
# print the result, including the title line
for row in [title_row] + slist:
    #print(row[13] + "\t \t" + row[16])
    print(row)
I am using Python's csv module to read ".csv" files and parse them out to MySQL insert statements. In order to maintain syntax for the statements I need to determine the type of the values listed under each column header. However, I have run into a problem as some of the rows start with a null value.
How can I use the csv module to return the next value under the same column until the value returned is not null? This does not have to be accomplished with the csv module; I am open to all solutions. After looking through the documentation I am not sure that the csv module is capable of doing what I need. I was thinking something along these lines:
if rowValue == '':
    rowValue = nextRowValue(row)
Obviously the next() method simply returns the next value in the csv "list" rather than returning the next value under the same column like I want, and the nextRowValue() object does not exist. I am just demonstrating the idea.
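One way to phrase "the next value under the same column" is to transpose the rows with zip(*rows) and take each column's first non-empty value; a sketch using rows shaped like the sample table in the edit below:

```python
# Sample rows (empty string = null); zip(*rows) turns rows into columns
rows = [['0', '7/2', '11:15', '', '0', ''],
        ['0', '7/2', '11:15', '380', '1', '380']]
first_non_null = [next((v for v in col if v != ''), '')
                  for col in zip(*rows)]
assert first_non_null == ['0', '7/2', '11:15', '380', '0', '380']
```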
Edit: Just to add some context, here is an example of what I am doing and the problems I am running into.
If the table is as follows:
ID Date Time Voltage Current Watts
0 7/2 11:15 0 0
0 7/2 11:15 0 0
0 7/2 11:15 380 1 380
And here is a very slimmed down version of the code that I am using to read the table, get the column headers and determine the type of the values from the first row. Then put them into separate lists and then use deque to add them to insert statements in a separate function. Not all of the code is featured and I might have left some crucial parts out, but here is an example:
import csv, os
from collections import deque

def findType(rowValue):
    if rowValue == '':
        pass  # this is the gap: want the next non-null value from the same column
    if '.' in rowValue:
        try:
            rowValue = type(float(rowValue))
        except ValueError:
            pass
    else:
        try:
            rowValue = type(int(rowValue))
        except ValueError:
            rowValue = type(str(rowValue))
    return rowValue

def createTable():
    inputPath = 'C:/Users/user/Desktop/test_input/'
    outputPath = 'C:/Users/user/Desktop/test_output/'
    for file in os.listdir(inputPath):
        if file.endswith('.csv'):
            with open(inputPath + file) as inFile:
                with open(outputPath + file[:-4] + '.sql', 'w') as outFile:
                    csvFile = csv.reader(inFile)
                    columnHeader = next(csvFile)
                    firstRow = next(csvFile)
                    cList = deque(columnHeader)
                    rList = deque(firstRow)
                    hList = []
                    for value in firstRow:
                        valueType = findType(value)
                        if valueType == str:
                            try:
                                val = '`' + cList.popleft() + '` varchar(255)'
                                hList.append(val)
                            except IndexError:
                                pass
etc.
And so forth for the rest of the value types returned from the findType function. The problem is that when adding the values to rList using deque it skips over null values so that the number of items in the list for column headers would be 6, for example, and the number of items in the list for rows would be 5 so they would not line up.
A somewhat drawn-out solution would be to scan each row for null values until one was found, using something like this:
for value in firstRow:
    if value == '':
        firstRow = next(csvFile)
And continuing this loop until a row was found with no null values. However, this seems inefficient and would slow the program down, hence why I am looking for a different solution.
Rather than pull the next value from the column as the title suggests, I found it easier to just skip rows that contained any null values. There are two different ways to do this:
Use a loop to scan each row and see if it contains a null value, and jump to the next row until one is found that contains no null values. For example:
tempRow = next(csvFile)
while '' in tempRow:
    tempRow = next(csvFile)
row = tempRow
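The same "first row with no empty cells" idea can also be written as a single generator expression (an alternative sketch with made-up sample rows, not the code above):

```python
rows = [['0', '7/2', '11:15', '', '0', ''],
        ['0', '7/2', '11:15', '380', '1', '380']]
# next() with a default of None returns the first row containing no empty cells
row = next((r for r in rows if '' not in r), None)
assert row == ['0', '7/2', '11:15', '380', '1', '380']
```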