I have 2 CSV files, each containing a list of unique words. After I compute the intersection of them I get the results, but when I try to write them to a new file it creates a very large file of almost 155 MB, when it should be well below 2 MB.
Code:
import csv

alist, blist = [], []
with open("SetA-unique.csv", "r") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist += row
with open("SetB-unique.csv", "r") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist += row
first_set = set(alist)
second_set = set(blist)
res = first_set.intersection(second_set)
writer = csv.writer(open("SetA-SetB.csv", 'w'))
for row in res:
    writer.writerow(res)
You're writing the entire set res to the file on each iteration. You probably want to write the rows instead:
for row in res:
    writer.writerow([row])
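Equivalently, writerows can take all the rows at once, as long as each element is itself wrapped as a row; a small sketch:

# writerows expects an iterable of rows, so wrap each word in a list
writer.writerows([word] for word in res)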
Apart from writing the whole set on each iteration, you also don't need to create multiple lists and sets; you can use itertools.chain:
import csv
from itertools import chain

with open("SetA-unique.csv") as file_a, open("SetB-unique.csv") as file_b, open("SetA-SetB.csv", 'w') as inter:
    r1 = csv.reader(file_a)
    r2 = csv.reader(file_b)
    for word in set(chain.from_iterable(r1)).intersection(chain.from_iterable(r2)):
        inter.write(word + "\n")
If you are just writing words, there is no need for csv.writer; just use file.write as above.
If you are actually trying to do the comparison row-wise, you should not create a flat iterable of words; you can map the rows to tuples:
import csv

with open("SetA-unique.csv") as file_a, open("SetB-unique.csv") as file_b, open("SetA-SetB.csv", 'w') as inter:
    r1 = csv.reader(file_a)
    r2 = csv.reader(file_b)
    writer = csv.writer(inter)
    for row in set(map(tuple, r1)).intersection(map(tuple, r2)):
        writer.writerow(row)
And if you only have one word per line, you don't need the csv lib at all.
with open("SetA-unique.csv") as file_a, open("SetB-unique.csv") as file_b, open("SetA-SetB.csv", 'w') as inter:
    for word in set(map(str.strip, file_a)).intersection(map(str.strip, file_b)):
        inter.write(word + "\n")
Every time I read a CSV file as lists I use this long method; can it be simplified? I create empty lists, read the file row-wise, and append to the lists:
import csv
import sys

filename = 'mtms_excelExtraction_m_Model_Definition.csv'
Ana_Type = []
Ana_Length = []
Ana_Text = []
Ana_Space = []
with open(filename, 'rt') as f:
    reader = csv.reader(f)
    try:
        for row in reader:
            Ana_Type.append(row[0])
            Ana_Length.append(row[1])
            Ana_Text.append(row[2])
            Ana_Space.append(row[3])
    except csv.Error as e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
This is a good opportunity for you to start using pandas and working with DataFrames.
import pandas as pd
df = pd.read_csv(path_to_csv)
One or two lines of code (depending on whether you count the import) and you're done!
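If you still need the four separate lists, a minimal sketch of pulling them out of the DataFrame (assuming the file has no header row, hence header=None; adjust to your data):

import pandas as pd

# header=None is an assumption: the question's file appears to have no header row
df = pd.read_csv('mtms_excelExtraction_m_Model_Definition.csv', header=None)

# each column becomes a plain Python list
Ana_Type = df[0].tolist()
Ana_Length = df[1].tolist()
Ana_Text = df[2].tolist()
Ana_Space = df[3].tolist()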
This one is essentially the numpy way of processing the csv file, without using numpy. Whether it is better than your original method is close to a matter of taste. It has in common with the numpy or pandas methods that it loads the whole file into memory and then transposes it into lists:
with open(filename, 'rt') as f:
    reader = csv.reader(f)
    tmp = list(reader)

Ana_Type, Ana_Length, Ana_Text, Ana_Space = [[tmp[i][j] for i in range(len(tmp))]
                                             for j in range(len(tmp[0]))]
It uses less code and builds the lists with comprehensions instead of repeated appends, but it uses more memory (as would numpy or pandas). Depending on how you later process the data, numpy or pandas could be a nice option; IMHO using them only to load a csv file into lists is not worth it.
You can use a DictReader
import csv
with open(filename, 'rt') as f:
    data = list(csv.DictReader(f, fieldnames=["Type", "Length", "Text", "Space"]))
print(data)
This will give you a single list of dict objects, one per row.
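If you later want per-column lists rather than per-row dicts, a small follow-up sketch using the same data variable:

# hypothetical follow-up: pull single columns out of the row dicts
Ana_Type = [row["Type"] for row in data]
Ana_Length = [row["Length"] for row in data]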
Try this
import csv
from collections import defaultdict
d = defaultdict(list)
with open(filename, mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        for k, v in row.items():
            d[k].append(v)
Then:
>>> d.keys()
dict_keys(['Ana_Type', 'Ana_Length', 'Ana_Text', 'Ana_Space'])
>>> d.get('Ana_Type')
['bla', 'bla1', 'df', 'ccc']
The repetitive calls to list.append can be avoided by reading the csv and using the zip builtin function to transpose the rows.
import io, csv
# Create an example file
buf = io.StringIO('type1,length1,text1,space1\ntype2,length2,text2,space2\ntype3,length3,text3,space3')
reader = csv.reader(buf)
# Uncomment the next line if there is a header row
# next(reader)
Ana_Types, Ana_Length, Ana_Text, Ana_Space = zip(*reader)
print(Ana_Types)
('type1', 'type2', 'type3')
print(Ana_Length)
('length1', 'length2', 'length3')
...
If you need lists rather than tuples you can use a list comprehension (or generator expression) to convert them:
Ana_Types, Ana_Length, Ana_Text, Ana_Space = [list(x) for x in zip(*reader)]
This could be useful:
import numpy as np
# read the rows with Numpy
rows = np.genfromtxt('data.csv',dtype='str',delimiter=';')
# call numpy.transpose to convert the rows to columns
cols = np.transpose(rows)
# get the stuff as lists
Ana_Type = list(cols[0])
Ana_Length = list(cols[1])
Ana_Text = list(cols[2])
Ana_Space = list(cols[3])
Edit: note that the first element of each list will be the column name (example with test data):
['Date', '2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06']
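If you want to drop that header entry, one option (assuming the first row really is a header) is to slice the rows before transposing:

# skip the header row before transposing
cols = np.transpose(rows[1:])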
I have a large csv file, containing multiple values, in the form
Date,Dslam_Name,Card,Port,Ani,DownStream,UpStream,Status
2020-01-03 07:10:01,aart-m1-m1,204,57,302xxxxxxxxx,0,0,down
I want to extract the Dslam_Name and Ani values, sort them by Dslam_Name, and write them to a new csv in two different columns.
So far my code is as follows:
import csv
import operator

with open('bad_voice_ports.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    for row in sortedlist:
        bad_port = row[1][:4], row[4][2:]
        print(bad_port)
        f = open("bad_voice_portsnew20200103SORTED.csv", "a+")
        f.write(row[1][:4] + " " + row[4][2:] + '\n')
        f.close()
But my Dslam_Name and Ani values are kept in the same column.
As a next step I would like to count how many times the same value appears in the 1st column.
You are forcing them to be a single column. Joining the two into a single string means Python no longer regards them as separate.
But try this instead:
import csv
import operator

with open('bad_voice_ports.csv') as readfile, open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    readCSV = csv.reader(readfile)
    writeCSV = csv.writer(writefile)
    for row in sorted(readCSV, key=operator.itemgetter(1)):
        bad_port = row[1][:4], row[4][2:]
        print(bad_port)
        writeCSV.writerow(bad_port)
If you want to include the number of times each key occurred, you can easily include that in the program, too. I would refactor slightly to separate the reading and the writing.
import csv
from collections import Counter

with open('bad_voice_ports.csv') as readfile:
    readCSV = csv.reader(readfile)
    rows = []
    counts = Counter()
    for row in readCSV:
        rows.append([row[1][:4], row[4][2:]])
        counts[row[1][:4]] += 1

with open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    writeCSV = csv.writer(writefile)
    for row in sorted(rows):
        print(row)
        writeCSV.writerow([counts[row[0]]] + row)
I would recommend removing the header line from the CSV file entirely; throwing away (or separating out and prepending back) the first line should be an easy change if you want to keep it.
(Also, hard-coding input and output file names is problematic; maybe have the program read them from sys.argv[1:] instead.)
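A minimal sketch of that suggestion (the script and file names are just illustrative):

import sys

# e.g. python sort_ports.py bad_voice_ports.csv sorted_output.csv
infile, outfile = sys.argv[1:3]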
So my suggestion is fairly simple. As I stated in a previous comment, there is good documentation on CSV reading and writing in Python here: https://realpython.com/python-csv/
As per an example, to read from a csv the columns you need you can simply do this:
>>> import csv
>>> file = open('some.csv', mode='r')
>>> csv_reader = csv.DictReader(file)
>>> for line in csv_reader:
...     print(line["Dslam_Name"] + " " + line["Ani"])
...
This would return:
aart-m1-m1 302xxxxxxxxx
Now you can just as easily store the column values in variables and later write them to a file, or open a new file while reading lines and write the column values there. I hope this helps you.
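For example, a sketch of that idea (the output file name is hypothetical; the column names are taken from the question's header):

import csv

with open('bad_voice_ports.csv') as infile, open('bad_ports_out.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # write the two columns of interest as separate fields
        writer.writerow([line["Dslam_Name"], line["Ani"]])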
After the help from @tripleee and @marxmacher my final code is:
import csv
import operator
from collections import Counter

with open('bad_voice_ports.csv') as csv_file:
    readCSV = csv.reader(csv_file, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    line_count = 0
    rows = []
    counts = Counter()
    for row in sortedlist:
        Dslam = row[1][:4]
        Ani = row[4][2:]
        if line_count == 0:
            print(row[1], row[4])
            line_count += 1
        else:
            rows.append([Dslam, Ani])
            counts[Dslam] += 1
            print(Dslam, Ani)
            line_count += 1

with open("bad_voice_portsnew202001061917.xls", "a+") as f:
    for row in sorted(rows):
        f.write(row[0] + '\t' + row[1] + '\t' + str(counts[row[0]]) + '\n')

print('Total of Bad ports =', str(line_count - 1))
This way the desired values/columns are extracted from the initial csv file, a new xls file is generated with the values stored in separate columns, and the occurrences per key are counted, along with the total number of entries.
Thanks for all the help; please feel free to suggest any improvements!
You can use sorted:
import csv

_h, *data = csv.reader(open('filename.csv'))
with open('new_csv.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([(_h[1], _h[4]),
                     *sorted([(i[1], i[4]) for i in data], key=lambda x: x[0])])
I have a csv file where I wish to perform a sentiment analysis on this dataset containing survey data.
So far this is what I have tried (thanks to Rupin from a previous question!):
import csv
from collections import Counter

with open('myfile.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    alist = []
    iterreader = iter(reader)
    next(iterreader, None)
    for row in iterreader:
        clean_rows = row[0].replace(",", " ").rsplit()
        alist.append(clean_rows)
        word_count = Counter(clean_rows)
        mostWcommon = word_count.most_common(3)
        print(mostWcommon)
The output is nearly okay; the only problem is that Python is counting each row separately, so I get something like this as my output:
[('experienced', 1)]
[('experienced', 1)]
[('experienced', 1)]
I wish to combine everything into one count so that I can have the real word frequency... Any suggestions?
Thanks!
You are creating a new Counter for each row and printing only that result. If you want a total count, you can create the counter outside the rows loop and update it with data from each row:
import csv
from collections import Counter

with open('myfile.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    alist = []
    iterreader = iter(reader)
    next(iterreader, None)
    c = Counter()
    for row in iterreader:
        clean_rows = row[0].replace(",", " ").rsplit()
        alist.append(clean_rows)
        c.update(clean_rows)

mostWcommon = c.most_common(3)
print(mostWcommon)
I have a CSV input file with 18 columns.
I need to create a new CSV file with all columns from the input except columns 4 and 5.
My function now looks like:
import csv

def modify_csv_report(input_csv, output_csv):
    begin = 0
    end = 3
    with open(input_csv, "r") as file_in:
        with open(output_csv, "w") as file_out:
            writer = csv.writer(file_out)
            for row in csv.reader(file_in):
                writer.writerow(row[begin:end])
    return output_csv
So it reads and writes columns number 0 - 3, but I don't know how to skip columns 4 and 5 and continue from there.
You can add the other part of the row using slicing, like you did with the first part:
writer.writerow(row[:4] + row[6:])
Note that to include column 3, the stop index of the first slice should be 4. Specifying start index 0 is also usually not necessary.
A more general approach would employ a list comprehension and enumerate:
exclude = (4, 5)
writer.writerow([r for i, r in enumerate(row) if i not in exclude])
If your CSV has meaningful headers an alternative solution to slicing your rows by indices, is to use the DictReader and DictWriter classes.
#!/usr/bin/env python
from csv import DictReader, DictWriter
data = '''A,B,C
1,2,3
4,5,6
6,7,8'''
reader = DictReader(data.split('\n'))
# You'll need your fieldnames first in a list to ensure order
fieldnames = ['A', 'C']
# We'll also use a set for efficient lookup
fieldnames_set = set(fieldnames)
with open('outfile.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames)
    writer.writeheader()
    for row in reader:
        # Use a dictionary comprehension to iterate over the key, value pairs,
        # discarding those pairs whose key is not in the set
        filtered_row = {k: v for k, v in row.items() if k in fieldnames_set}
        writer.writerow(filtered_row)
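As a side note, DictWriter's extrasaction parameter can do the filtering for you; a small sketch of that variant:

with open('outfile.csv', 'w') as outfile:
    # extrasaction='ignore' silently drops keys not listed in fieldnames
    writer = DictWriter(outfile, fieldnames, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)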
This is what you want:
import csv
def remove_csv_columns(input_csv, output_csv, exclude_column_indices):
    with open(input_csv) as file_in, open(output_csv, 'w') as file_out:
        reader = csv.reader(file_in)
        writer = csv.writer(file_out)
        writer.writerows(
            [col for idx, col in enumerate(row)
             if idx not in exclude_column_indices]
            for row in reader)

remove_csv_columns('in.csv', 'out.csv', (4, 5))
I am new at handling CSV files with Python and I want to write code that allows me to do the following: I have a pattern such as
pattern = "3-5;7;10-16"
(which may vary) and I want to delete (in that case) rows 3 to 5, 7, and 10 to 16.
Does anyone have an idea how to do that?
You cannot simply delete lines from a csv file. Instead, you have to read it in and then write it back with the accepted values. The following code works:
import csv

pattern = "3-5;7;10-16"

# expand the pattern into a list of 1-based row numbers to drop
off = []
for i in pattern.split(';'):
    if '-' in i:
        start, end = i.split('-')
        off += range(int(start), int(end) + 1)
    else:
        off.append(int(i))

with open('test.txt') as f:
    reader = csv.reader(f)
    rows = [','.join(item) for i, item in enumerate(reader) if i + 1 not in off]

print(rows)

with open('input.txt', 'w') as f2:
    for i in rows:
        f2.write(i + '\n')
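A variant that keeps the csv module for the output as well, so any quoting in the rows survives (a sketch reusing off and the same file names as above):

import csv

with open('test.txt') as f, open('input.txt', 'w', newline='') as f2:
    writer = csv.writer(f2)
    for i, row in enumerate(csv.reader(f), start=1):
        if i not in off:
            writer.writerow(row)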