I have a CSV file that has 2 columns. I am simply trying to figure out whether each row[0] value appears in any row[1], and if so, to print it.
Items in the CSV file:
COL1, COL2
1-A, 1-A
1-B, 2-A
2-A, 1-B
2565, 2565
51Bc, 51Bc
5161, 56
811, 65
681, 11
55, 3
3, 55
Code:
import csv
doc = csv.reader(open('file.csv', 'rb'))
for row in doc:
    if row[0] in row[1]:
        print row[0]
The end result should be:
1-A
1-B
2-A
2565
51Bc
55
3
Instead, it is giving me:
1-A
2565
51Bc
It prints those values because they happen to sit right next to each other in the same row. What I need it to do is take each item in COL1, see if it appears anywhere in the entire COL2 column, and print it if it does, not just check whether the two items beside each other match.
When you say for row in doc, it's only getting one pair of elements and putting them in row, so there's no possible way row[1] can hold the entire column at any point in time. You need an initial loop to collect that column as a list, then loop through the csv file again to do the comparison. Actually, you can store both columns in separate lists and only have to open the file once.
import csv

doc = csv.reader(open('file.csv', 'rb'))

# Build the lists.
first_col = []
second_col = set()
for row in doc:
    first_col.append(row[0])
    second_col.add(row[1])

# Now actually do the comparison.
for item in first_col:
    if item in second_col:
        print item
As per abarnert's suggestion, we're using a set() for the second column. Sets are optimized for looking up values, which is all we're doing with the second column. A list is optimized for looping through every element, which is what we do with first_col, so a list makes more sense there.
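For reference, here's a minimal sketch of the same approach in Python 3 syntax (text-mode open with newline='' instead of 'rb', and print as a function), assuming the same file.csv as above:

import csv

# Python 3 sketch of the approach above; 'file.csv' is the sample file.
with open('file.csv', newline='') as f:
    first_col = []
    second_col = set()
    for row in csv.reader(f):
        first_col.append(row[0])  # keep COL1 values in order
        second_col.add(row[1])    # set membership tests are fast

for item in first_col:
    if item in second_col:
        print(item)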
Related
I have two .csv files, one that has info1 and one that has info2. The files look like this:
File1:
20170101,,,d,4,f,SWE
20170102,a,,,d,f,r,RUS <-
File2:
20170102,a,s,w,,,,RUS <-
20170103,d,r,,,,FIN
I want to combine these two lines (marked as "<-") and make a combined line like this:
20170102,a,s,w,d,f,r,RUS
I know that I could write a script similar to this:
for row1 in csv_file1:
    for row2 in csv_file2:
        if row1[0] == row2[0] and row1[1] == row2[1]:
            # do something
Is there any other way to find out which rows have the same items at the beginning, or is this the only way? This is a pretty slow way to find the similarities; it takes several minutes to run on 100,000-row files.
Your implementation is O(n^2): it compares every line in one file with every line in the other, and it's even worse if you re-read the second file for each line in the first file.
You can significantly speed this up by building an index from the content of the first file. The index can be as simple as a dictionary, with the first column of the file as the key and the line as the value. You build that index in one pass over the first file, then make one pass over the second file, checking for each line whether the id is in the index. If it is, print the merged line.
index = {row[0]: row for row in csv_file1}

for row in csv_file2:
    if row[0] in index:
        # do something
Special thanks to @martineau for the dict comprehension version of building the index.
If there can be multiple items with the same id in the first file, then the index could point to a list of those rows:
index = {}
for row in csv_file1:
    key = row[0]
    if key not in index:
        index[key] = []
    index[key].append(row)
This could be simplified a bit using defaultdict:
from collections import defaultdict

index = defaultdict(list)
for row in csv_file1:
    index[row[0]].append(row)
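To fill in the "# do something" step from the question, a minimal merge sketch might look like this, keyed on the first column only as in the index above, and taking the non-empty field from either row (the file names and the merge rule are assumptions based on the sample data):

import csv

with open('file1.csv', newline='') as f1:
    index = {row[0]: row for row in csv.reader(f1)}

with open('file2.csv', newline='') as f2, \
     open('merged.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f2):
        if row[0] in index:
            other = index[row[0]]
            # Prefer the non-empty field from either side.
            merged = [a if a else b for a, b in zip(row, other)]
            writer.writerow(merged)

On the sample rows this produces 20170102,a,s,w,d,f,r,RUS, as requested.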
Hi, I am having some problems looking at specific rows and columns in a CSV file.
My current goal is to look at 3 particular columns out of the several that are there. On top of that, I want to look at the data values (e.g. 0.26) in a specific column and keep only the ones that are between 0.21 and 0.31. My issue is that I don't know how to do both of those at the same time; I keep getting errors telling me I can't use '<=' with float and str.
Here's my code:
import csv
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
with open('C:\\Users\\AdamStoer\\Documents\\practicedata.csv') as f:
    reader = csv.DictReader(f, delimiter=',')  # read rows into a dictionary format
    for row in reader:
        for columns['pitch'] in row:
            for v in columns['pitch']:
                p = float(v)
                if p <= 0.5:
                    columns['pitch'].append(v)

print(columns['pitch'])
This code was working before, for the last part:
for row in reader:               # read a row as {column1: value1, column2: value2, ...}
    for (k, v) in row.items():   # go over each column name and value
        columns[k].append(v)     # append the value into the appropriate list,
                                 # based on column name k
print(columns['pitch'])
Looks like you're confusing a couple of things. If you know the specific column you want (pitch), you don't have to loop over all the columns in each row. You can access it directly, like so:
for row in reader:
    p = float(row['pitch'])
    if p <= 0.5:
        print(p)
It's hard for me to tell what output you want, but here's an example that looks at just the pitch in each row and, if it is a match, appends all the target values for that row to the columns dictionary.
targets = ('pitch', 'roll', 'yaw')
columns = defaultdict(list)
for row in reader:
    p = float(row['pitch'])
    if 0.21 <= p <= 0.31:
        for target in targets:
            columns[target].append(row[target])
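Put together as a complete, self-contained sketch (the short file name and the 'roll'/'yaw' column names are assumptions carried over from the snippets above):

import csv
from collections import defaultdict

targets = ('pitch', 'roll', 'yaw')  # 'roll' and 'yaw' are assumed column names
columns = defaultdict(list)

with open('practicedata.csv', newline='') as f:
    for row in csv.DictReader(f):
        p = float(row['pitch'])
        if 0.21 <= p <= 0.31:  # keep only rows whose pitch is in range
            for target in targets:
                columns[target].append(row[target])

print(columns['pitch'])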
I'm trying to print only the 17th element of every row. My code is something like this:
import csv

csvFile = open('Workbook1.csv')
fileread = csv.reader(csvFile)
dataRead = list(fileread)
x = 1
for items in dataRead[x][17]:
    x += dataRead.__len__()
    print items
with open('Workbook1.csv', 'rt') as finput:
    for row in csv.reader(finput):
        print(row[16])
Provided your CSV is always properly formatted (i.e. the number of columns is consistent), you can simply do this:
csvFile = open('Workbook1.csv')
fileread = csv.reader(csvFile)
for rows in fileread:
    print rows[16]  # assuming you DID mean 17th item, then the array starts from 0
I did something like this:

i = 0
while i < len(dataRead):
    i += 1
    print dataRead[i][0], dataRead[i][17]

Now it gives me the first element of each row along with the element at index 17, and it loops through all the rows. Is this a safe practice? Sorry, I'm new to Python.
This:
for items in dataRead[x][17]:
    x += dataRead.__len__()
    print items
loops through the items in the 18th column of only the second row. You want to loop through every row, and access the 17th column in each:
for items in dataRead:
    print items[16]
Remember that Python is zero-indexed, so you need 16 instead of 17. Also, when you need the length of an item, use len(item) rather than item.__len__().
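As for the follow-up while loop: it increments i before indexing, so it skips the first row and raises an IndexError on the final pass. A safer sketch is to iterate the rows directly, using enumerate when you also need the index:

# Iterate rows directly; enumerate supplies the index without manual bookkeeping.
for i, row in enumerate(dataRead):
    print(row[0], row[17])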
I am trying to determine the type of data contained in each column of a .csv file so that I can make CREATE TABLE statements for MySQL. The program makes a list of all the column headers, then grabs the first row of data, determines each value's data type, and appends it to the column header for proper syntax. For example:
ID Number Decimal Word
0 17 4.8 Joe
That would produce something like CREATE TABLE table_name (ID int, Number int, Decimal float, Word varchar());.
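For illustration, the type-detection step described here might look something like this hypothetical helper (not from the original program):

def infer_sql_type(value):
    # Try the narrowest type first, then fall back to varchar.
    try:
        int(value)
        return 'int'
    except ValueError:
        pass
    try:
        float(value)
        return 'float'
    except ValueError:
        return 'varchar(255)'  # note: an empty string also lands here

An empty string fails both conversions and lands in the varchar branch, which is exactly how a NULL in the sampled row skews the generated statement.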
The problem is that in some of the .csv files the first row contains a NULL value that is read as an empty string, which messes up this process. My goal is then to search each row until one is found that contains no NULL values, and to use that one when forming the statement. This is what I have done so far, except it sometimes still returns rows that contain empty strings:
def notNull(p):  # where p is a .csv file that has been read in another function
    tempCol = next(p)
    tempRow = next(p)
    col = tempCol[:-1]
    row = tempRow[:-1]
    if any('' in row for row in p):
        tempRow = next(p)
        row = tempRow[:-1]
    else:
        rowNN = row
    return rowNN
Note: the .csv file reading is done in a different function; this function simply takes the already-read .csv file as input p. Also, each row ends with a ',' that is treated as an extra empty string, so I slice the last value off each row before checking it for empty strings.
Question: What is wrong with the function I created that causes it to not always return a row without empty strings? I feel it is because the loop is not repeating as necessary, but I am not quite sure how to fix this issue.
I cannot really decipher your code, but note that it never actually loops: the any(...) call consumes rows from the reader, the if branch advances at most one more row, and rowNN is never assigned when the if branch is taken. This is what I would do to get only the rows without empty values:
import csv
def g(name):
    with open(name, 'r') as f:
        r = csv.reader(f)
        next(r)  # skip the header row
        for row in r:
            if '' not in row:
                yield row
for row in g('file.csv'):
    print('row without empty values: {}'.format(row))
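Since the original goal was just the first row with no empty values, you can take a single item from the generator; the second argument to next is the fallback in case every row has a gap:

first_complete = next(g('file.csv'), None)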
I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing. Thank you so much. The final code, which seems to have been deleted from this post, is:
import collections

with open("Class2.txt", mode="r", encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        name, value = line.split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))

out = list(reversed(rev_out))
print(out)
Since this looks like CSV data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with each row so that the output can maintain the same order as the input. Use a bounded deque to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

by_name = defaultdict(lambda: deque(maxlen=3))

with open('my_data.csv') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# Sort the rows for each name by line number, discarding the number.
numbered = (item for value in by_name.values() for item in value)
rows = [row for _, row in sorted(numbered, key=lambda item: item[0])]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)
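With the sample input above, out_data.csv comes out as:

John,5
Sarah,7
Sarah,8
John,4
Sarah,2

which matches the expected output.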