I am trying to read some numbers from a .csv file and store them into a matrix using Python. The input file looks like this
Input File
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
The input is to be manipulated into a matrix like this -
Output File
1 2 3
A 2 1 1
B 3 2 0
Here, the first column of the input file becomes the row label, the second column becomes the column label, and the value is the count of occurrences. How should I implement this? My input file is huge (1,000,000 rows), so there can be a large number of rows (anywhere between 50 and 10,000) and columns (from 1 to 50).
With pandas, it becomes easy, almost in just 3 lines
import pandas as pd

df = pd.read_csv('example.csv', names=['label', 'value'])
# >>> df
#   label  value
# 0     B      1
# 1     A      1
# 2     A      1
# 3     B      1
# 4     A      3
# 5     A      2
# 6     B      1
# 7     B      2
# 8     B      2

s = df.groupby(['label', 'value']).size()
# >>> s
# label  value
# A      1        2
#        2        1
#        3        1
# B      1        3
#        2        2
# dtype: int64

# ref1: http://stackoverflow.com/questions/15751283/converting-a-pandas-multiindex-dataframe-from-rows-wise-to-column-wise
# ref2: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html
m = s.unstack()
# >>> m
# value  1  2    3
# label
# A      2  1    1
# B      3  2  NaN

# Below are optional: just to make it look more like what you want
m.columns.name = None
m.index.name = None
m = m.fillna(0)
print m
#    1  2  3
# A  2  1  1
# B  3  2  0
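As a side note (not from the original answer), pd.crosstab can build the same count matrix in a single call; a hedged sketch reusing df from the read_csv above:

m2 = pd.crosstab(df['label'], df['value'])  # counts co-occurrences of label and value
m2.index.name = None
m2.columns.name = None
print m2
#    1  2  3
# A  2  1  1
# B  3  2  0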
My solution does not seem very efficient for a huge amount of input data, since I am doing a lot of things manually that could probably be done with some of pandas' DataFrame methods.
However, it does the job:
#!/usr/bin/env python3
# coding: utf-8
import pandas as pd
from collections import Counter
with open('foo.txt') as f:
    l = f.read().splitlines()

numbers_list = []
letters_list = []
for element in l:
    letter = element.split(',')[0]
    number = element.split(',')[1]
    if number not in numbers_list:
        numbers_list.append(number)
    if letter not in letters_list:
        letters_list.append(letter)

c = Counter(l)
d = dict(c)

output = pd.DataFrame(columns=sorted(numbers_list), index=sorted(letters_list))

for col in numbers_list:
    for row in letters_list:
        key = '{},{}'.format(row, col)
        if key in d:
            output[col][row] = d[key]
        else:
            output[col][row] = 0
The output is as desired:
1 2 3
A 2 1 1
B 3 2 0
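As an aside (not part of the original answer), a hedged sketch of how the nested loops above might be collapsed, reusing c and pandas: the Counter keys can be split into (letter, number) tuples, turned into a Series with a MultiIndex, and unstacked.

# split each "letter,number" key into a tuple and keep its count
pairs = {tuple(key.split(',')): count for key, count in c.items()}
# a tuple-keyed dict becomes a MultiIndex Series; unstack pivots the numbers into columns
output = pd.Series(pairs).unstack().fillna(0).astype(int)
print(output)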
The following solution uses just standard Python modules:
import csv, collections, itertools
with open('my.csv', 'r') as f_input:
    counts = collections.Counter()
    for cols in csv.reader(f_input):
        counts[(cols[0], cols[1])] += 1

keys = set(key[0] for key in counts.keys())
# distinct second-column values, used as the output column headings
values = set(key[1] for key in counts.keys())

d = {}
for k in itertools.product(keys, values):
    d[k] = 0
d.update(dict(counts))

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    # Write the header, 'X' is whatever you want the first column called
    csv_output.writerow(['X'] + sorted(values))
    # Write the rows
    for k, g in itertools.groupby(sorted(d.items()), key=lambda x: x[0][0]):
        csv_output.writerow([k] + [col[1] for col in g])
This gives you an output CSV file looking like:
X,1,2,3
A,2,1,1
B,3,2,0
Here is another variation using standard modules:
import csv
import re
from collections import defaultdict
from itertools import chain

d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])

k = sorted(d.keys())
v = sorted(map(int, set(chain.from_iterable(d.values()))))

e = []
for i in k:  # iterate in sorted key order so rows line up with the printed labels
    e.append([0]*len(v))
    for j in d[i]:
        e[-1][int(j)-1] += 1

print ' ', re.sub(r'[\[\],]', '', str(v))
for i, j in enumerate(k):
    print j, re.sub(r'[\[\],]', '', str(e[i]))
Given data.csv has the contents of the input file shown in the question, this script prints the following as output:
1 2 3
A 2 1 1
B 3 2 0
Thanks to @zyxue for a pure pandas solution. It takes a lot less code up front, with the main difficulty being choosing the right pandas methods. However, the extra coding is not necessarily in vain when it comes to run-time performance. Using timeit in IPython to measure the run-time difference between my code and that of @zyxue using pure pandas, I found that my method ran 36 times faster excluding imports and input IO, and 121 times faster when also excluding output IO (print statements). These tests were done with functions to encapsulate code blocks. Here are the functions that were tested using Python 2.7.10 and Pandas 0.16.2:
def p(): # 1st pandas function
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)
    print m

def p1(): # 2nd pandas function - omitting print statement
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)

def q(): # first std mods function
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1
    print ' ', re.sub(r'[\[\],]', '', str(v))
    for i, j in enumerate(k):
        print j, re.sub(r'[\[\],]', '', str(e[i]))

def q1(): # 2nd std mods function - omitting print statements
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1
Prior to testing, the following code was run to import modules, perform input IO, and initialize variables for all functions:
import pandas as pd
df = pd.read_csv('data.csv', names=['label', 'value'])
import csv
from collections import defaultdict
from itertools import chain
import re
d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])
The contents of the data.csv input file were:
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
The test command line for each function was of the form:
%timeit fun()
Here are the test results:
p(): 100 loops, best of 3: 4.47 ms per loop
p1(): 1000 loops, best of 3: 1.88 ms per loop
q(): 10000 loops, best of 3: 123 µs per loop
q1(): 100000 loops, best of 3: 15.5 µs per loop
These results are only suggestive and for one small dataset. In particular I would expect pandas to perform comparatively better for larger datasets up to a point.
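For what it's worth, a hedged sketch (not from the original tests, Python 2 style to match the versions above) of how a larger synthetic input in the shape the question describes could be generated, so both approaches can be re-timed at scale; the label format and distributions are made up:

import csv
import random

with open('big_data.csv', 'wb') as f:
    writer = csv.writer(f)
    for _ in xrange(1000000):
        # roughly the shape described in the question: up to 10,000 row labels, values 1-50
        writer.writerow(['L%d' % random.randint(1, 10000), random.randint(1, 50)])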
Here is a way to do it with MapReduce using Hadoop streaming, where the mapper and reducer scripts both read from stdin.
The mapper script is mostly an input mechanism: it filters the input to remove improper data. Its advantages are that the input can be split over multiple mapper processes, with the combined output automatically sorted and forwarded to a reducer, and that combiners can be run locally on the mapper nodes. Combiners are essentially intermediate reducers, useful for speeding up the reduction through parallelism over a cluster.
# mapper script
import sys
import re

# mapper: pass well-formed "letter,number" tokens through unchanged
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word = line.split()[0]
    if re.match(r'\A[a-zA-Z]+,[0-9]+', word):
        print word
The reducer script receives the sorted output of all the mappers, builds an intermediate dict for each input key such as A or B (called 'prefix' in the code), and outputs the results in CSV format.
# reducer script
from collections import defaultdict
import sys

def output(s, d):
    """
    this function takes a string s and a dictionary d with int keys and values,
    sorts the keys, then creates a string of comma-separated values ordered
    by the keys, with appropriate insertion of comma-separated zeros equal in
    number to the difference between successive keys minus one
    """
    v = sorted(d.keys())
    o = str(s) + ','
    lastk = 0
    for k in v:
        o += '0,'*(k-lastk-1) + str(d[k]) + ','
        lastk = k
    return o

prefix = ''
current_prefix = ''
d = defaultdict(int)
maxkey = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    prefix, value = line.split(',')
    try:
        value = int(value)
    except ValueError:
        continue
    if current_prefix == prefix:
        d[value] += 1
    else:
        if current_prefix:
            if len(d) > 0:
                print output(current_prefix, d)
                t = max(d.keys())
                if t > maxkey:
                    maxkey = t
        d = defaultdict(int)
        current_prefix = prefix
        d[value] += 1

# output info for last prefix if needed
if current_prefix == prefix:
    print output(prefix, d)
    t = max(d.keys())
    if t > maxkey:
        maxkey = t

# output csv list of keys from 1 through maxkey
h = ' ,'
for i in range(1, maxkey+1):
    h += str(i) + ','
print h
To walk through the data streaming process, suppose the mapper gets:
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
It outputs the same content directly, which then all gets sorted (shuffled) and sent to a reducer. In this example, what the reducer gets is:
A,1
A,1
A,2
A,3
B,1
B,1
B,1
B,2
B,2
Finally the output of the reducer is:
A,2,1,1,
B,3,2,
,1,2,3,
For larger data sets, the input file would be split, with the portions containing all the data for particular sets of keys going to separate mappers. Using a combiner on each mapper node would save overall sorting time. A single reducer would still be needed so that the output is totally sorted by key; if that's not a requirement, multiple reducers could be used.
For practical reasons I made a couple of choices. First, each line of output only goes up to the highest integer seen for a key, and trailing zeros are not printed, because there is no way to know how many to write until all the input has been processed; for large input that would mean either storing a large amount of intermediate data in memory or slowing processing down by writing it out to disk and reading it back in to complete the job. Second, and for the same reason, the header line cannot be written until just before the end of the reduce job, so that's when it's written. It may be possible to prepend it to the output file (or to the first one, if the output has been split), and that can be investigated in due course. However, given the great speedup that parallel processing provides for massive input, these are minor issues.
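For completeness, a hedged sketch (not part of the answer's code) of a post-processing step that pads each reducer row with trailing zeros and moves the header, which the reducer prints last, to the top; it assumes a single reducer whose output landed in the usual Hadoop streaming file name part-00000:

# strip whitespace and trailing commas from the reducer output, dropping blank lines
lines = [line.strip().rstrip(',') for line in open('part-00000') if line.strip()]
header = lines.pop()              # the reducer prints the header line last
width = len(header.split(','))    # number of fields per fully padded row
with open('matrix.csv', 'w') as out:
    out.write(header + '\n')
    for line in lines:
        fields = line.split(',')
        fields += ['0'] * (width - len(fields))  # pad short rows with trailing zeros
        out.write(','.join(fields) + '\n')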
This method will work with relatively minor but crucial modifications on a Spark cluster and can be converted to Java or Scala to improve performance if necessary.
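For illustration only, a hedged sketch of how the same counting step might look as a PySpark job; the app name, input path, and output path are assumptions, and building the final matrix from the counted pairs would still be a separate formatting step like the reducer above:

from pyspark import SparkContext

sc = SparkContext(appName='letter-number-counts')
counts = (sc.textFile('hdfs:///data/data.csv')
            .map(lambda line: (tuple(line.strip().split(',')), 1))
            .reduceByKey(lambda a, b: a + b))
# counts holds ((letter, number), occurrences) pairs
counts.saveAsTextFile('hdfs:///data/letter_number_counts')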
Related
I'm having performance issues with a Python function that loads two 5+ GB tab-delimited txt files that have the same format but different values, and uses a third text file as a key to determine which values should be kept for output. I'd like some help with speed gains if possible.
Here is the code:
import csv

def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False
    # key file that will set which file values to use. If cell address is not present or value of cellid = 1 use
    # baseline.csv, otherwise use test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])
    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3): # skip the first 3 lines that are the file header
            b.readline()
            c.readline()
        while True: # loop until end of file, this should loop here 1,025 times
            if end_loop_break == True: break
            for x in range(2): # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()
            for rw in range(1, 477):
                if end_loop_break == True: break
                for cl in range(52):
                    # read both files at the same time to get the different data and split the 10 values in the row
                    b_row = b.readline().split()
                    c_row = c.readline().split()
                    if not b_row:
                        end_loop_break = True
                        break
                    for x in range(1, 11):
                        # search for the cell address in the key file to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
                        if not testval: # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1: # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2: # cell address value == 2
                            out.append(c_row[x - 1])
                        print(cl * 10 + x + tcel) # test output for cell location
                tcel += 520
    print('success')
The key file looks like:
37794, 1
37795, 0
37796, 2
The data files are large ~5GB each and complex from a counting standpoint, but are standard in format and look like:
0 0 0 0 0 0 0 0 0 0
1.5 1.5 0 0 0 0 0 0 0 0
This process is taking a very long time, and I was hoping someone could help speed it up.
I believe your speed problem is coming from this line:
testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
You are iterating over the whole key list for every single value in the HUGE output files. This is not good.
It looks like cl * 10 + x + tcel is the formula you are looking for in lst[n][0].
May I suggest you use a dict instead of a list for storing the data in lst.
lst = {}
for item in reader:
    lst[int(item[0])] = int(item[1])
Now, lst is a mapping, which means you can simply use the in operator to check for the presence of a key. This is a near instant lookup because the dict type is hash based and very efficient for key lookups.
something in lst
# for example
(cl * 10 + x) in lst
And you can grab the value by:
lst[something]
#or
lst[cl * 10 + x]
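For illustration, a hedged sketch (untested against the real files) of how the inner value-selection loop from the question might look once lst is a dict, reusing the question's names:

for x in range(1, 11):
    cell = cl * 10 + x + tcel
    flag = lst.get(cell)            # None when the cell address is not in the key file
    if flag is None or flag == 1:   # missing key or cellid == 1: keep the baseline value
        out.append(b_row[x - 1])
    elif flag == 2:                 # cellid == 2: take the test_p97 value
        out.append(c_row[x - 1])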
A little bit of refactoring and your code should PROFOUNDLY speed up.
I want to update the numbers in the idx field if there is any matching letter between the vals of two consecutive rows.
Input data = '''pos\tidx\tvals
23\t4\tabc
25\t7\tatg
29\t8\tctb
35\t1\txyz
37\t2\tmno
39\t3\tpqr
41\t6\trtu
45\t5\tlfg'''
''' Explanation: Since the letter `a` matches between idx 4 and 7,
the idx at pos 25 will be updated to 4; but since there is also a `t`
matching between the vals at pos 25 and 29, we update the idx at 29 to
4 as well, instead of just 7. '''
#Expected output to a file (tab separated):
pos idx vals
23 4 abc
25 4 atg
29 4 ctb
35 1 xyz
37 2 mno
39 3 pqr
41 3 rtu
45 5 lfg
I have written the code given below so far, and would also like to
write the expected output to a file
optimize the code for the work I am doing.
The answer has to follow my method of reading two consecutive rows as (keys, values) pairs at a time. The reason is that this question is just a trial for another problem I am trying to solve.
Code:
import csv
import itertools
import collections
import io
from itertools import islice

data_As_Dict = csv.DictReader(io.StringIO(data), delimiter='\t')
grouped = itertools.groupby(data_As_Dict, key=lambda x: x['idx'])

write_to = open("updated_idx.txt", "w")
write_to.write('\t'.join(['pos', 'idx', 'vals']))
write_to.close()

# Make a function to read the data as keys,values and also keep the order
def accumulate(data):
    acc = collections.OrderedDict()
    for d in data:
        for k, v in d.items():
            acc.setdefault(k, []).append(v)
    return acc

''' Store data as keys,values '''
grouped_data = collections.OrderedDict()
for k, g in grouped:
    grouped_data[k] = accumulate(g)

# Now, read as keys, values pairs for two consecutive keys
for (k1, v1), (k2, v2) in zip(grouped_data.items(), islice(grouped_data.items(), 1, None)):
    pos1 = v1['pos']
    pos2 = v2['pos']
    v1_vals = ''.join(v1['vals'])
    v2_vals = ''.join(v2['vals'])
    v1_vals = list(v1_vals)
    v2_vals = list(v2_vals)
    # find if there are any matching letters between two vals
    commons = [x for x in v1_vals if x in v2_vals]
    # start updating the idx values if there is a match
    if len(commons) > 0:
        k2_new = k1
        write_to = open("updated_idx.txt", "a")
        write_to.write('\t'.join([pos1, k1, v1['vals']]))
        write_to.write('\t'.join([pos2, k2_new, v2['vals']]))
    # Problem: This (above) method updates the k2 for one consecutive match ..
    # but, I want to keep this value (k1) and update it if..
    # .. elements keep matching.
    # this may also be improved using lambda
    # any other alternatives ??
If I'm not misunderstanding and you just care about consecutive rows, you can probably do it with something like this:
data = '''pos\tidx\tvals
23\t4\tabc
25\t7\tatg
29\t8\tctb
35\t1\txyz
37\t2\tmno
39\t3\tpqr
41\t6\trtu
45\t5\tlfg'''

def is_one_char_in_string(stringa, stringb):
    for char in stringa:
        if char in stringb:
            return True
    return False

prev_idx = ''
prev_val = ''
with open("out.txt", "a") as of:
    for i, line in enumerate(data.split("\n")):
        line = line.strip().split("\t")
        # Header and first row doesn't need to be considered for reindexing
        if i < 2:
            prev_idx = line[1]
            prev_val = line[2]
            of.write("\t".join(line)+"\n")
        else:
            if is_one_char_in_string(line[2], prev_val):
                line[1] = prev_idx
                of.write("\t".join(line)+"\n")
                prev_val = line[2]
            else:
                prev_idx = line[1]
                prev_val = line[2]
                of.write("\t".join(line)+"\n")
edit to follow the same method as the original question - updated
Turned out that when I copied the input data I left the tabulation at the start of the line, which made the csv reader consider it as a column, messing up the keys. So this should be correct.
f = open("out.txt", "a")
f.write("pos\tidx\tvals\n")
for (k1, v1), (k2, v2) in zip(grouped_data.items(), islice(grouped_data.items(), 1, None)):
    # find if there are any matching letters between two vals
    commons = [x for x in v1['vals'][0] if x in v2['vals'][0]]
    # start updating the idx values if there is a match
    if len(commons) > 0:
        # Update the dictionary with the new key
        grouped_data[k2]['idx'] = grouped_data[k1]['idx']
    f.write("{}\t{}\t{}\n".format(v1['pos'][0], v1['idx'][0], v1['vals'][0]))

# write the last row, previously updated
last_row = list(grouped_data.items())[-1][1]
f.write("{}\t{}\t{}\n".format(last_row['pos'][0], last_row['idx'][0], last_row['vals'][0]))
f.close()
answer to OP comment
I corrected the code above. Else is not needed because you want to update (or "carry on") the index only if the next string matches. You can add else: pass if it makes the code more readable for you.
Optimization
For the optimization, using sets, as suggested by Raymond Zheng, could speed up things a bit in case of long strings.
To check for common elements using sets:
commons = list(set(v1['vals'][0]).intersection(set(v2['vals'][0])))
But depending on the length of your strings it could degrade performance (albeit both of them are quite fast).
Just for the record for 100 iterations timed with timeit:
-on strings of length 4
lists: 0.00011 sec.
sets: 0.00022 sec.
-on strings of length 200
lists: 0.00222 sec.
sets: 0.00123 sec.
-on strings of length 2000
lists: 0.02354 sec.
sets: 0.00930 sec.
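For reference, a hedged sketch of how such a comparison could be timed; the random test strings here are made up and are not the benchmark actually used above:

import timeit
import random
import string

# two random lowercase strings of length 200
a = ''.join(random.choice(string.ascii_lowercase) for _ in range(200))
b = ''.join(random.choice(string.ascii_lowercase) for _ in range(200))

list_way = lambda: [x for x in a if x in b]
set_way = lambda: list(set(a).intersection(set(b)))

print(timeit.timeit(list_way, number=100))
print(timeit.timeit(set_way, number=100))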
Writing file:
f = open(filename, 'w')
f.write("Beginning code execution")
for (k1, v1), (k2, v2)...
    # your code here
    ...
    f.write(...)
f.write("End code execution")
Optimizations:
Some optimizations can come from the actual logic of the problem. Just looking at your code, though, change v2_vals = list(v2_vals) to v2_vals = set(v2_vals). This avoids repeated linear scans over long strings, and the set has a maximum size of 26 (or however large the valid character set for the values is).
The problem you specified in your comment:
Unfortunately there's no easy way to "redo" a loop iteration. You can, however, iterate manually:
i = 0
while i < len(...):
    ...
    if len(commons) > 0:
        k2_new = k1
        continue # <-- skips the i += 1. You can additionally save the k1 value so as to not have to recalculate.
    i += 1
Hope this helps!
I am using python 2.7, and I have a text file that looks like this:
id value
--- ----
1 x
2 a
1 z
1 y
2 b
I am trying to get an output that looks like this:
id value
--- ----
1 x,z,y
2 a,b
Much appreciated!
The simplest solution would be to use collections.defaultdict and collections.OrderedDict. If you don't care about order you could also use sets instead of OrderedDict.
from collections import defaultdict, OrderedDict

# Keeps all unique values for each id
dd = defaultdict(OrderedDict)
# Keeps the unique ids in order of appearance
ids = OrderedDict()

with open(yourfilename) as f:
    f = iter(f)
    # skip first two lines
    next(f), next(f)
    for line in f:
        id_, value = list(filter(bool, line.split())) # split at whitespace and remove empty ones
        dd[id_][value] = None # dicts need a value, but here it doesn't matter which one...
        ids[id_] = None

print('id value')
print('--- ----')
for id_ in ids:
    print('{} {}'.format(id_, ','.join(dd[id_])))
Result:
id value
--- ----
1 x,z,y
2 a,b
In case you want to write it to another file just concatenate what I printed with \n and write it to a file.
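For example, a hedged sketch of that file-writing step, reusing dd and ids from the code above (the output filename is an assumption):

with open('grouped_output.txt', 'w') as out:
    out.write('id value\n')
    out.write('--- ----\n')
    for id_ in ids:
        out.write('{} {}\n'.format(id_, ','.join(dd[id_])))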
I think this could also work, although the other answer seems more sophisticated:
input = ['1,x',
         '2,a',
         '1,z',
         '1,y',
         '2,b',
         '2,a', # added extra values to show duplicates won't be added
         '1,z',
         '1,y']

output = {}
for row in input:
    parts = row.split(",")
    id_ = parts[0]
    value = parts[1]
    if id_ not in output:
        output[id_] = value
    else:
        a_List = list(output[id_])
        if value not in a_List:
            output[id_] += "," + value
        else:
            pass
You end up with a dictionary similar to what you requested.
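For example, a hedged sketch (not part of the answer) of printing that dictionary in the requested layout:

print('id value')
print('--- ----')
for id_ in sorted(output):
    print('{} {}'.format(id_, output[id_]))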
# read
fp = open('', 'r')
d = fp.read().split("\n")
fp.close()

x = len(d)
for i in range(len(d)):
    n = d[i].split()
    d.append(n)
d = d[x:]

m = {}
for i in d:
    if i[0] not in m:
        m[i[0]] = [i[1]]
    else:
        if i[1] not in m[i[0]]:
            m[i[0]].append(i[1])

for i in m:
    print i, ",".join(m[i])
I am working on a data analysis using a CSV file that I got from a data warehouse (Cognos). The CSV file has a last row that sums up all the rows above, but I do not need this line for my analysis, so I would like to skip the last row.
I was thinking about adding an "if" statement that checks a column name within my "for" loop, like below.
import csv
with open('COGNOS.csv', "rb") as f, open('New_COGNOS.csv', "wb") as w:
    # Open 2 CSV files. One to read and the other to save.
    CSV_raw = csv.reader(f)
    CSV_new = csv.writer(w)
    for row in CSV_raw:
        item_num = row[3].split(" ")[0]
        row.append(item_num)
        if row[0] == "All Materials (By Collection)": break
        CSV_new.writerow(row)
However, this looks like it wastes a lot of resources. Is there any Pythonic way to skip the last row when iterating through a CSV file?
You can write a generator that'll return everything but the last entry in an input iterator:
def skip_last(iterator):
    prev = next(iterator)
    for item in iterator:
        yield prev
        prev = item
then wrap your CSV_raw reader object in that:
for row in skip_last(CSV_raw):
The generator basically takes the first entry, then starts looping, and on each iteration yields the previous entry. When the input iterator is done, there is still one entry left over, which is never yielded.
A generic version, letting you skip the last n elements, would be:
from collections import deque
from itertools import islice

def skip_last_n(iterator, n=1):
    it = iter(iterator)
    prev = deque(islice(it, n), n)
    for item in it:
        yield prev.popleft()
        prev.append(item)
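For illustration, a hedged usage sketch that wires the generic version into the question's reader/writer setup (filenames taken from the question):

import csv

# copy everything except the trailing summary row
with open('COGNOS.csv', 'rb') as f, open('New_COGNOS.csv', 'wb') as w:
    CSV_raw = csv.reader(f)
    CSV_new = csv.writer(w)
    for row in skip_last_n(CSV_raw, n=1):
        CSV_new.writerow(row)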
A generalized "skip-n" generator
from __future__ import print_function
from StringIO import StringIO
from itertools import tee

s = '''\
1
2
3
4
5
6
7
8
'''

def skip_last_n(iterator, n=1):
    a, b = tee(iterator)
    for x in xrange(n):
        next(a)
    for line in a:
        yield next(b)

i = StringIO(s)
for x in skip_last_n(i, 1):
    print(x, end='')
1
2
3
4
5
6
7
i = StringIO(s)
for x in skip_last_n(i, 3):
    print(x, end='')
1
2
3
4
5
I have multiple files, each containing 8 or 9 columns.
For a single file, I have to read the last column (which contains some value), count the number of occurrences of each value, and then generate an output file.
I have done it like this:
inp = open(filename, 'r').read().strip().split('\n')
out = open(filename, 'w')

from collections import Counter
C = Counter()
for line in inp:
    k = line.split()[-1] # as to read last column
    C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()
Now the problem is that it works fine if I have to generate one output for one input. But I need to scan a directory using the glob.iglob function for all the files to be used as input, run the above program on each file to gather its results, and then write all of the analyzed results into a single OUTPUT file.
NOTE: when generating the single OUTPUT file, if any value turns up repeatedly, then instead of writing the same entry twice it is preferred to sum up the counts. e.g. analysis of the 1st file generates:
123 6
111 5
0 6
45 5
and the 2nd file generates:
121 9
111 7
0 1
22 2
in this case the OUTPUT file must be written in such a way that it contains:
123 6
111 12 #sum up count no. in case of similar value entry
0 7
45 5
22 2
I have written the program for single-file analysis, but I'm stuck on the mass-analysis part.
Please help.
from collections import Counter
import glob

out = open('OUTPUT.txt', 'w') # single output file for the combined results

g_iter = glob.iglob('path_to_dir/*')
C = Counter()
for filename in g_iter:
    f = open(filename, 'r')
    inp = f.read().strip().split('\n')
    f.close()
    for line in inp:
        k = line.split()[-1] # as to read last column
        C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()
After de-uglification:
from collections import Counter
import glob

def main():
    # create Counter
    cnt = Counter()
    # collect data
    for fname in glob.iglob('path_to_dir/*.dat'):
        with open(fname) as inf:
            cnt.update(line.split()[-1] for line in inf)
    # dump results
    with open("summary.dat", "w") as outf:
        outf.writelines("{:5s} {:>5d}\n".format(val, num) for val, num in cnt.iteritems())

if __name__=="__main__":
    main()
Initialise an empty dictionary at the top of the program,
let's say, dic = dict()
and for each Counter, update dic so that the values of matching keys are summed and any new keys are added to dic.
To update dic use this:
dic = dict((n, dic.get(n, 0) + C.get(n, 0)) for n in set(dic) | set(C))
where C is the current Counter. After all files are finished, write dic to the output file.
import glob
from collections import Counter

dic = dict()
g_iter = glob.iglob(r'c:\\python32\fol\*')
for x in g_iter:
    lis = []
    with open(x) as f:
        inp = f.readlines()
    for line in inp:
        num = line.split()[-1]
        lis.append(num)
    C = Counter(lis)
    dic = dict((n, dic.get(n, 0)+C.get(n, 0)) for n in set(dic)|set(C))

for x in dic:
    print(x, '\t', dic[x])
I did it like this:
import glob
from collections import Counter

out = open("write.txt", 'a')
C = Counter()
for file in glob.iglob('temp*.txt'):
    for line in open(file, 'r').read().strip().split('\n'):
        k = line.split()[-1] # as to read last column
        C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()