Dynamically remove a column from a CSV - python

I want to dynamically remove a column from a CSV file. This is what I have so far, but I have no idea where to go from here:
# Remove column not needed.
column_numbers_to_remove = 3,2,
file = upload.filepath
#I READ THE FILE
file_read = csv.reader(file)
# REMOVE 3 and 2 column from the CSV
# UPDATE SAVE CSV

Use enumerate to get the column index, and create a new row without the columns you don't want... eg:
for row in file_read:
    new_row = [col for idx, col in enumerate(row) if idx not in (3, 2)]
Then write out your rows using csv.writer somewhere...
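For instance, here is a minimal end-to-end sketch of that approach under the question's setup; it reuses upload.filepath from the question, and 'output.csv' is just a hypothetical destination:

import csv

columns_to_remove = (2, 3)

# Read the source file and write the filtered rows to a new file.
with open(upload.filepath, newline='') as src, \
        open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Keep only the columns whose index is not in columns_to_remove.
        writer.writerow([col for idx, col in enumerate(row)
                         if idx not in columns_to_remove])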

Read the CSV and write it into another file after removing the columns.
import csv

creader = csv.reader(open('csv.csv'))
cwriter = csv.writer(open('csv2.csv', 'w'))
for cline in creader:
    new_line = [val for col, val in enumerate(cline) if col not in (2, 3)]
    cwriter.writerow(new_line)

Related

Merge rows in a CSV to a column

I am new to Python. I have one CSV file with more than 1000 rows, and I want to merge particular rows and move them into another column. Can anyone help?
This is the source csv file I have:
I want to move the emails under the members column, separated by commas, like this image:
To read csv files in Python, you can use the csv module. This code does the merging you're looking for.
import csv

output = []  # this will store a list of new rows
with open('test.csv') as f:
    reader = csv.reader(f)
    # read the first line of the input as the headers
    header = next(reader)
    output.append(header)
    # we will build up groups and their emails
    emails = []
    group = []
    for row in reader:
        if len(row) > 1 and row[1]:  # "UserGroup" is given
            if group:
                group[-1] = ','.join(emails)
            group = row
            output.append(group)
            emails = []
        else:  # it isn't, assume this is an email
            emails.append(row[0])
    group[-1] = ','.join(emails)

# now write a new file
with open('new.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(output)

Python CSV Reader

I have a CSV from a system that has a load of rubbish at the top of the file, so the header row is about row 5 or could even be 14 depending on the gibberish the report puts out.
I used to use:
idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
to skip over the rows that had 2 or fewer columns; when it hit the column headers, of which there are 12, it would stop, and I could then use idx with skiprows when reading the CSV file.
The system has had an update, and someone thought it would be good to make the CSV file valid by adding 11 blank commas after their gibberish to align with the header count.
So now I have a CSV like:
sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product
etc..
I tried:
idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)
but I think that's a Pandas thing and it just fails.
Any ideas on how I can get to my header row?
CODE:
lmf = askopenfilename(filetypes=(("CSV Files", ".csv"), ("All Files", "*.*")))
# Section gets row number where headers start
with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)
    input('hold')
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)
# Reopens file parsing the number for the row headers
lmkcsv = pd.read_csv(lmf, skiprows=idx)
lm = lm.append(lmkcsv)
print(lm)
Since your CSV is now a valid file and you just want to filter out the rows that don't have a certain number of columns, you can do that in pandas directly.
import pandas as pd

minimum_cols_required = 3
# lmf is the file path chosen in the question's code
lmkcsv = pd.read_csv(lmf)
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)
If your CSV data also has a lot of empty values that get caught by this threshold, then just slightly modify your code:
idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)
I'm not sure in what case a None would be returned, so set(row) should do. If your headers happen to contain duplicates as well, do this:
from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
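For context, here is a minimal sketch of how that line slots into the question's own flow; lmf is reused from the question's code:

import csv
from collections import Counter

import pandas as pd

with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    # Stop at the first row with more than 2 non-empty cells, i.e. the header row.
    idx = next(i for i, row in enumerate(csvreader)
               if len(row) - Counter(row)[''] > 2)

# Re-read the file, skipping everything above the header row.
lmkcsv = pd.read_csv(lmf, skiprows=idx)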
And how about erasing the starting lines with some logic, like checking how many ',' exist or looking for some word? Something like:
f = open("target.txt", "r+")
d = f.readlines()
f.seek(0)
for i in d:
    if "sadjfhasdkljfhasd" not in i:
        f.write(i)
f.truncate()
f.close()
After that, read the file normally.
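For example, a minimal sketch of that last step, assuming pandas is available and target.txt now starts with the real header row:

import pandas as pd

# The junk lines are gone, so the first remaining line becomes the header.
df = pd.read_csv('target.txt')
print(df.head())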

accessing the values of collections.defaultdict

I have a CSV file that I want to read column-wise; for that I have this code:
from collections import defaultdict
from csv import DictReader

columnwise_table = defaultdict(list)
with open("Weird_stuff.csv", 'rU') as f:
    reader = DictReader(f)
    for row in reader:
        for col, dat in row.items():
            columnwise_table[col].append(dat)
# print(columnwise_table.items())  # this gives me everything
print(type(columnwise_table[2]))  # I'm looking for something like this
My question is: how can I get all the elements of only one specific column? Also, I'm not using conda, and the matrix is big: 2400x980.
UPDATE
I have 980 columns and over 2000 rows. I need to work with the file using the columns, say 1st column[0]: feature1, 2nd column[0]: j_ss01, 50th column: Abs2, and so on.
Since I can't access the dict using the column names, I would like to use an index for that. Is this possible?
import csv
import collections

col_values = collections.defaultdict(list)
with open('Weird_stuff.csv', 'rU') as f:
    reader = csv.reader(f)
    # skip field names
    next(reader)
    for row in reader:
        for col, value in enumerate(row):
            col_values[col].append(value)

# for each numbered column you want...
col_index = 33  # for example
print(col_values[col_index])
If you know the columns you want in advance, only storing those columns could save you some space...
cols = {1, 5, 6, 234}
...
for col, value in enumerate(row):
    if col in cols:
        col_values[col].append(value)
By iterating over row.items(), you get all columns.
If you want only one specific column via index number, use csv.reader and column index instead.
from csv import reader

col_values = []
# Column index number to get values from
col = 1
with open("Weird_stuff.csv", 'rU') as f:
    csv_reader = reader(f)
    for row in csv_reader:
        col_val = row[col]
        col_values.append(col_val)

# contains only values from column index <col>
print(col_values)

Replace value of specific column in all non header rows

Below is some python code that runs on a file similar to this (old_file.csv).
A,B,C,D
1,2,XX,3
11,22,XX,33
111,222,XX,333
How can I iterate through all lines in old_file.csv (if I don't know the length of the file) and replace all values in column C, i.e. index 2, or cells[row][2] (based on cells[row][col])? I'd like to ignore the header row. In new_file.csv, all values containing 'XX' should become 'YY', for example.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
cells[1][2] = 'YY'
cells[2][2] = 'YY'
cells[3][2] = 'YY'
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
Just a small change to Soviut's answer; try this, I think it will help you:
import csv

rows = csv.reader(open('old_file.csv'))
newRows = []
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
    newRows.append(row)

# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(newRows)
You can very easily loop over the array of rows and replace values in the target cell.
# get rows from old CSV file
rows = csv.reader(open('old_file.csv'))
# iterate over each row and replace target cell
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(rows)
csv.reader gives you the rows as lists, so you could just run the replacement on everything after the first row, i.e. cells[1:] in your code.
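Here is a minimal sketch of that idea, reusing the file names and values from the question; it assumes Python 3, so the output file is opened in text mode rather than 'wb':

import csv

with open('old_file.csv', newline='') as f:
    cells = list(csv.reader(f))

# Replace column index 2 in every row except the header (cells[0]).
for row in cells[1:]:
    row[2] = 'YY'

with open('new_file.csv', 'w', newline='') as f:
    csv.writer(f).writerows(cells)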
len(cells) is the number of rows. Iterating from 1 makes it skip the header line. Also the lines should be cells.
import csv

r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
for i in range(1, len(cells)):
    cells[i][2] = 'YY'

w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
read_handle = open('old_file.csv', 'r')
data = read_handle.read().split('\n')
read_handle.close()

new_data = []
new_data.append(data[0])
for line in data[1:]:
    if not line:
        new_data.append(line)
        continue
    line = line.split(',')
    line[2] = 'YY'
    new_data.append(','.join(line))

write_handle = open('new_file.csv', 'w')
write_handle.writelines('\n'.join(new_data))
write_handle.close()

Python read CSV file, and write to another skipping columns

I have a CSV input file with 18 columns.
I need to create a new CSV file with all columns from the input except columns 4 and 5.
My function now looks like:
import csv

def modify_csv_report(input_csv, output_csv):
    begin = 0
    end = 3
    with open(input_csv, "r") as file_in:
        with open(output_csv, "w") as file_out:
            writer = csv.writer(file_out)
            for row in csv.reader(file_in):
                writer.writerow(row[begin:end])
    return output_csv
So it reads and writes columns 0 - 3, but I don't know how to skip columns 4 and 5 and continue from there.
You can add the other part of the row using slicing, like you did with the first part:
writer.writerow(row[:4] + row[6:])
Note that to include column 3, the stop index of the first slice should be 4. Specifying start index 0 is also usually not necessary.
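Put together, the question's function might look like this; a sketch that keeps the original function name and hard-codes the column positions from the question:

import csv

def modify_csv_report(input_csv, output_csv):
    with open(input_csv, 'r', newline='') as file_in, \
            open(output_csv, 'w', newline='') as file_out:
        writer = csv.writer(file_out)
        for row in csv.reader(file_in):
            # Keep columns 0-3 and everything from column 6 onwards;
            # columns 4 and 5 are dropped.
            writer.writerow(row[:4] + row[6:])
    return output_csv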
A more general approach would employ a list comprehension and enumerate:
exclude = (4, 5)
writer.writerow([r for i, r in enumerate(row) if i not in exclude])
If your CSV has meaningful headers, an alternative to slicing your rows by indices is to use the DictReader and DictWriter classes.
#!/usr/bin/env python
from csv import DictReader, DictWriter

data = '''A,B,C
1,2,3
4,5,6
6,7,8'''

reader = DictReader(data.split('\n'))

# You'll need your fieldnames first in a list to ensure order
fieldnames = ['A', 'C']
# We'll also use a set for efficient lookup
fieldnames_set = set(fieldnames)

with open('outfile.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames)
    writer.writeheader()
    for row in reader:
        # Use a dictionary comprehension to iterate over the key, value pairs,
        # discarding those pairs whose key is not in the set
        filtered_row = {
            k: v for k, v in row.items() if k in fieldnames_set
        }
        writer.writerow(filtered_row)
This is what you want:
import csv

def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

remove_csv_columns('in.csv', 'out.csv', (3, 4))
