Skip Rows in CSV Containing Specific String - python

I have a list of strings (longer than in this example). If one of the strings exists in a row of data, I want to skip that row. This is what I have so far but I get an index error, which leads me to believe I'm not looping correctly.
stringList = ["ABC", "AAB", "AAA"]
with open('filename.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    next(filereader, None)  # skip header row
    for row in filereader:
        for k in stringList:
            if k not in row:
                data1 = column[1]
The error I get: IndexError: list index out of range. I realize I'm reading by row, but I need to extract the data by column.

The error happens because row is a list, not a single value, so you have to index into it rather than use it directly.
You can access individual columns with the appropriate index into row: row[0] is the entry in the first column of the current row, row[1] the entry in the second column, and so on. As the loop advances to the next row, the same indexes give you that row's entries for those columns.
Here's a simple loop to do it.
for row in filereader:
    for k in stringList:
        for i in range(len(row)):
            if k not in row[i]:
                someVar = row[i]
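Note that this still does not implement the goal stated in the question (skip the whole row if any of the strings appears anywhere in it). A shorter sketch of that, assuming the value you want is in the second column as data1 = column[1] suggests:
import csv

stringList = ["ABC", "AAB", "AAA"]
data1 = []
with open('filename.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    next(filereader, None)  # skip the header row
    for row in filereader:
        # keep the row only if none of the strings appears in any of its cells
        if not any(k in cell for k in stringList for cell in row):
            data1.append(row[1])  # collect the second column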

With pandas you can do it easily with a boolean mask: isin() marks the rows whose column value is in stringList, and ~ negates the mask so those rows are dropped.
import pandas as pd

data = pd.read_csv('filename.csv')
data = data.loc[~data['column_name'].isin(stringList)]

Related

Unique elements of all the columns of CSV file in Python without using Pandas

I am trying to get the unique values of all the columns in the CSV. I am getting the column count, creating a set for each column, and trying to go through the CSV data to collect the unique values of each column. But the inner loop only executes once.
decoded_file = data_file.read().decode('utf-8')
reader = csv.reader(decoded_file.splitlines(), delimiter=',')
list_reader = list(reader)
data = iter(list_reader)
next(data)  # skipping the header
col_number = len(next(data))
col_sets = [set() for i in range(col_number)]
for col in range(col_number):
    for new_row in data:
        col_sets[col].add(new_row[col])
    print(col_sets[col])
I need to get all the unique values for each column and add it to col_sets to access it. What is the best way to do this?
The structure is fine, but data is an iterator, so it is exhausted after the first pass of the outer loop and every later column sees nothing. Just swap the order of the iterations so the rows are walked only once:
for new_row in data:
    for col in range(col_number):
        col_sets[col].add(new_row[col])
print(col_sets)
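Alternatively, since list_reader already holds every row in memory, you can transpose with zip() and build one set per column directly (a sketch assuming all rows have the same number of cells; list_reader[1:] drops the header):
col_sets = [set(column) for column in zip(*list_reader[1:])]
print(col_sets)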

accessing the values of collections.defaultdict

I have a CSV file that I want to read column-wise, and for that I have this code:
from collections import defaultdict
from csv import DictReader

columnwise_table = defaultdict(list)
with open("Weird_stuff.csv", 'rU') as f:
    reader = DictReader(f)
    for row in reader:
        for col, dat in row.items():
            columnwise_table[col].append(dat)
# print(columnwise_table.items())  # this gives me everything
print(type(columnwise_table[2]))  # I'm looking for something like this
My question is: how can I get all the elements of only one specific column? I'm not using conda, and the matrix is big, 2400x980.
UPDATE
I have 980 columns and over 2000 rows. I need to work with the file by columns, say the 1st column: feature1, the 2nd column: j_ss01, the 50th column: Abs2, and so on.
Since I can't access the dict using the column names, I would like to use an index instead. Is this possible?
import csv
import collections

col_values = collections.defaultdict(list)
with open('Weird_stuff.csv', 'rU') as f:
    reader = csv.reader(f)
    next(reader)  # skip field names
    for row in reader:
        for col, value in enumerate(row):
            col_values[col].append(value)

# for each numbered column you want...
col_index = 33  # for example
print(col_values[col_index])
If you know the columns you want in advance, only storing those columns could save you some space...
cols = {1, 5, 6, 234}
...
for col, value in enumerate(row):
    if col in cols:
        col_values[col].append(value)
By iterating over row.items(), you get all the columns.
If you only want one specific column by index number, use csv.reader and index into each row instead.
from csv import reader

col_values = []
# column index number to get values from
col = 1
with open("Weird_stuff.csv", 'rU') as f:
    csv_reader = reader(f)
    for row in csv_reader:
        col_val = row[col]
        col_values.append(col_val)

# contains only values from column index <col>
print(col_values)
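If you would rather address a column by its header name (feature1, j_ss01, Abs2, ...) than by a number, one option (a sketch along the same lines, assuming the names are in the first row of Weird_stuff.csv) is to build a name-to-index map from the header first:
import csv

with open("Weird_stuff.csv", 'rU') as f:
    reader = csv.reader(f)
    header = next(reader)  # e.g. ['feature1', 'j_ss01', ...]
    col_index = {name: i for i, name in enumerate(header)}
    wanted = col_index['j_ss01']  # pick the column by name
    col_values = [row[wanted] for row in reader]
print(col_values)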

How can I merge CSV rows that have the same value in the first cell?

This is the file: https://drive.google.com/file/d/0B5v-nJeoVouHc25wTGdqaDV1WW8/view?usp=sharing
As you can see, there are duplicates in the first column, but if I were to combine the duplicate rows, no data would get overridden in the other columns. Is there any way I can combine the rows with duplicate values in the first column?
For example, turn "1,A,A,," and "1,,,T,T" into "1,A,A,T,T".
Plain Python:
import csv

reader = csv.reader(open('combined.csv'))
result = {}
for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values
How this magic works:
iterate over the rows in the CSV file
for every row, check whether a row with the same index was seen before
if this is the first time we see this index, just copy the row values
if this is a duplicate, fill only the empty cells with the new row's values.
The last step is done via the or trick: an empty cell ('') is falsy, so '' or value returns value, while a non-empty cell is truthy and is kept. So result[idx][i] or v returns the existing value if it is not empty, and the value from the current row otherwise.
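In isolation the fallback looks like this (empty strings are falsy, non-empty ones are truthy):
print('' or 'T')   # prints T
print('A' or 'T')  # prints A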
To output this without losing the original order of the rows, we also need to keep the indices as we read, then iterate over them and write out the corresponding result entries:
indices = []
for row in reader:
    # ...
    indices.append(idx)

writer = csv.writer(open('outfile.csv', 'w'))
for idx in indices:
    writer.writerow([idx] + result[idx])
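Putting the pieces together, a full sketch (writing each merged key once, in first-seen order; file names as above) could look like this:
import csv

result = {}
indices = []  # first-seen order of the keys

with open('combined.csv', newline='') as infile:
    for row in csv.reader(infile):
        idx, values = row[0], row[1:]
        if idx in result:
            # fill only the empty cells of the stored row
            result[idx] = [old or new for old, new in zip(result[idx], values)]
        else:
            result[idx] = values
            indices.append(idx)

with open('outfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for idx in indices:
        writer.writerow([idx] + result[idx])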

Python3.4 - enumeration through columns then rows in csv to obtain Max lengths

I would like to find the Max length for each column in a tab delimited csv file.
I can find the max value of one column by using this:
import csv

oldlen = 0
with open(mfile) as csvfile:
    test = csv.reader(csvfile, dialect='excel-tab')
    for row in test:
        if len(row[0]) > oldlen:
            newlen = len(row[0])
print(newlen)
If I wish to do all columns (and count them), I could just change row[] manually, but I wish to learn so I tried this:
with open(mfile) as csvfile:
    test = csv.reader(csvfile, dialect='excel-tab')
    ncol = len(test[0])
    for column in test:
        for row in test:
            if len(row[column]) > oldlen:
                newlen = len(row[0])
        print(column, newlen)
Which, of course, doesn't make programmatic sense. But it indicates, I hope, what my intention is. I have to go through the columns first so I can get the max length of each column across all rows.
You can use a dict to store a column number->max length lookup and assign to that by looping over each column of each row.
lengths = {}
with open(mfile) as csvfile:
    test = csv.reader(csvfile, dialect='excel-tab')
    for row in test:
        for colno, col in enumerate(row):
            lengths[colno] = max(len(col), lengths.get(colno, 0))
The number of columns will be len(lengths), and the maximum length of each is accessible as lengths[0] for the first column, lengths[1] for the second, and so on.
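For example, continuing from the lengths dict above:
print(len(lengths))  # number of columns
print(lengths[0])    # max length seen in the first column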
You can transpose the rows into columns with the zip() function:
with open(mfile) as csvfile:
    test = csv.reader(csvfile, dialect='excel-tab')
    columns = list(zip(*test))
and then get the maximum length per column:
for col in columns:
    print(max(len(cell) for cell in col))
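Note that zip(*test) reads the whole file into memory, while the dict approach above streams it row by row. If the file fits in memory, the same result can be collected in one expression:
with open(mfile) as csvfile:
    test = csv.reader(csvfile, dialect='excel-tab')
    max_lengths = [max(len(cell) for cell in col) for col in zip(*test)]
print(max_lengths)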

How to find specific row in Python CSV module

I need to find the third row of a CSV file, from column 4 to the end. How would I do that? I know I can find the values from the 4th column onwards with
row[3]
but how do I get specifically the third row?
You could convert the csv reader object into a list of lists: the rows are stored in a list, and each row is itself a list of its columns.
So:
csvr = csv.reader(file)
csvr = list(csvr)
csvr[2]       # the 3rd row
csvr[2][3]    # the 4th column of the 3rd row
csvr[-4][-3]  # the 3rd column from the right of the 4th row from the end
You could keep a counter for counting the number of rows:
counter = 1
for row in reader:
    if counter == 3:
        print('Interested in third row')
    counter += 1
You could use itertools.islice to extract the row of data you wanted, then index into it.
Note that the rows and columns are numbered from zero, not one.
import csv
from itertools import islice

def get_row_col(csv_filename, row, col):
    with open(csv_filename, newline='') as f:
        return next(islice(csv.reader(f), row, row + 1))[col]
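For example, with a hypothetical yourfile.csv, the value the question asks about (third row, fourth column, both counted from zero) would be:
value = get_row_col('yourfile.csv', 2, 3)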
This is very basic code that will do the job, and you can easily turn it into a function.
import csv

target_row = 2  # third row, counted from zero
target_col = 3  # fourth column, counted from zero
with open('yourfile.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    n = 0
    for row in reader:
        if n == target_row:
            data = row[target_col]
            break
        n += 1
print(data)
