Parsing specific fields and counting occurrences with Python

I have a file separated by delimiter '|' like this:
age=None|sex=M|DEPT=ID1|YEAR=1995|
age=10|sex=M|DEPT=None|YEAR=1992|
age=None|sex=None|DEPT=ID1|YEAR=1991|
age=20|sex=F|DEPT=ID2|YEAR=1990|
age=20|sex=M|DEPT=ID3|YEAR=1991|
In Python, how do I get a count of how many times each field is repeated?
Is there a built-in function for this? I looked into collections.Counter and its update() method, but my environment uses Python 2.6, where Counter is not available. Unfortunately I can't use that option (and I won't be able to copy new module files into that environment manually either).
Thanks for any help or pointers.
Example output:
1 times Sex=F
3 times Sex=M
1 times age=10
2 times age=None
2 times age=20
2 times YEAR=1991
...
2 times DEPT=ID1
etc

from collections import defaultdict
import csv

with open('path/to/file') as infile:
    answer = defaultdict(int)
    for row in csv.reader(infile, delimiter="|"):
        for field in row:
            if field:  # skip the empty field produced by the trailing '|'
                answer[field] += 1

for k in sorted(answer, key=lambda k: answer[k]):
    print answer[k], "times", k
Or:
from collections import Counter
import csv
import itertools

# note: Counter was added in Python 2.7, so this variant won't run on the
# Python 2.6 environment mentioned in the question
with open('path/to/file') as infile:
    answer = Counter(itertools.chain.from_iterable(csv.reader(infile, delimiter="|")))

for k in sorted(answer, key=lambda k: answer[k]):
    print answer[k], "times", k

Using dict.get() may help:
with open('file.txt') as f:
    counts = {}  # renamed from `dict` to avoid shadowing the built-in
    for line in f:
        fields = line.strip().split('|')
        for item in fields:
            if item:  # skip the empty string left by the trailing '|'
                counts[item] = counts.get(item, 0) + 1

for k in counts:
    print counts[k], 'times', k
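The example output in the question groups the tallies by field name. Since every key has the form name=value, sorting the keys themselves (instead of by count) comes close to that grouping; a small sketch reusing the counts dict built above:

# print tallies grouped by field name (keys sort as DEPT..., YEAR..., age..., sex...)
for k in sorted(counts):
    print counts[k], 'times', k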

Related

Read from a text file, tally the results and print them out (Python)

I am trying to gather a list from a .txt file, tally the results and then print them out like this
Bones found:
Ankylosaurus: 3
Pachycephalosaurus: 1
Tyrannosaurus Rex: 1
Struthiomimus: 2
An example of the .txt file is:
Ankylosaurus
Pachycephalosaurus
Ankylosaurus
Tyrannosaurus Rex
Ankylosaurus
Struthiomimus
Struthiomimus
My current code pulls all the names from the .txt file, but I'm completely stuck from there:
frequency = {}
for line in open('bones.txt'):
    bones = line.split()
    print(bones)
Any help please?
Using collections.defaultdict
Ex:
from collections import defaultdict

d = defaultdict(int)
with open(filename) as infile:  # open the file
    for line in infile:  # iterate over each line
        d[line.strip()] += 1

print("Bones found:")
for k, v in d.items():
    print("{}: {}".format(k, v))
Output:
Bones found:
Struthiomimus: 2
Pachycephalosaurus: 1
Ankylosaurus: 3
Tyrannosaurus Rex: 1
Store each line as a dictionary key (initialized to 0 if line isn't in dict) and increment the value:
frequency = {}
for line in open('bones.txt'):
    line = line.strip()  # drop the trailing newline so keys compare cleanly
    if line not in frequency:
        frequency[line] = 0
    frequency[line] += 1

for k, v in frequency.items():
    print('{}: {}'.format(k, v))
You can use the Counter class from the collections library for this:
from collections import Counter

with open('bones.txt', 'r') as f:
    s = Counter(line.strip() for line in f)  # strip newlines before counting

for k in s.keys():
    print('{}: {}'.format(k, s[k]))
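If you want the bones listed from most to least common, Counter.most_common() returns the (value, count) pairs already sorted by count; a small sketch building on the Counter answer above:

from collections import Counter

with open('bones.txt') as f:
    counts = Counter(line.strip() for line in f)

print("Bones found:")
for bone, n in counts.most_common():
    print('{}: {}'.format(bone, n))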

Counting the number of occurrences of a string in a text file

I have a text file containing:
Rabbit:Grass
Eagle:Rabbit
Grasshopper:Grass
Rabbit:Grasshopper
Snake:Rabbit
Eagle:Snake
I want to count the number of occurrences of a string, say, the number of times each animal occurs in the text file, and print the count. Here's my code:
fileName = input("Enter the name of file:")
foodChain = open(fileName)
table = []
for line in foodChain:
    contents = line.strip().split(':')
    table.append(contents)

def countOccurence(l):
    count = 0
    for i in l:
        #I'm stuck here#
        count += 1
    return count
I'm unsure how to count the occurrences in a text file with Python. The output I want is:
Rabbit: 4
Eagle: 2
Grasshopper: 2
Snake: 2
Grass: 2
I just need some help on the counting part and I will be able to manage the rest of it. Regards.
What you need is a dictionary:
dictionary = {}
for line in table:
    for animal in line:
        if animal in dictionary:
            dictionary[animal] += 1
        else:
            dictionary[animal] = 1

for animal, occurrences in dictionary.items():
    print(animal, ':', occurrences)
A solution using the str.split() and re.sub() functions and the collections.Counter subclass:
import re
import collections

with open(filename, 'r') as fh:
    # setting space as a common delimiter
    contents = re.sub(r':|\n', ' ', fh.read()).split()

counts = collections.Counter(contents)
# iterating through the animal counts
for a in counts:
    print(a, ':', counts[a])
The output:
Snake : 2
Rabbit : 4
Grass : 2
Eagle : 2
Grasshopper : 2
Use in to test membership: in Python you can check whether an animal appears in each row of the table and count the matching rows:
def countOccurence(l):
    count = 0
    # count the rows of `table` that contain the animal `l`
    for row in table:
        if l in row:
            count += 1
    return count
from collections import defaultdict

dd = defaultdict(int)
with open(fpath) as f:
    for line in f:
        words = line.strip().split(':')  # strip the newline so both columns count cleanly
        for word in words:
            dd[word] += 1

for k, v in dd.items():
    print(k + ': ' + str(v))
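As a variant, the splitting and counting can be collapsed into a single Counter call with itertools.chain, the same trick used in the first question above; a sketch assuming the same fpath:

from collections import Counter
from itertools import chain

with open(fpath) as f:
    # one flat stream of animal names, two per line
    counts = Counter(chain.from_iterable(line.strip().split(':') for line in f))

for animal, n in counts.items():
    print('{}: {}'.format(animal, n))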

Find the number and percentage of occurrences in Python [closed]

I have a very big file (gigabytes in size) with 4 columns. From it I have to find the number of occurrences of each pair from the first 2 columns.
Col[1] Col[2] Col[3] Col[4]
So here I have to consider the pairs from Col[1] and Col[2], and find the number of occurrences of each particular pair in the entire file.
E.g.:
Col[1] Col[2]
1234 5678
8901 3456
1234 5678
0987 2345
1234 5678
So we see that 1234 5678 has occurred 3 times so far.
I referred to some code from another post here and tried to adapt it to my data file, but I'm getting errors.
from itertools import combinations
from collections import Counter
import ast

def collect_pairs('FileName.txt'):
    pair_counter = Counter()
    for line in open('FileName.txt'):
        unique_tokens = sorted(set(ast.literal_eval(lines)))
        combos = combination(unique_token, 2)
        pair_counter += Counter(combos)
    return pair_counter

outfile = open('Outputfile.txt', 'w')
p = collect_pairs(outfile)
print p.most_common(10)
I suggest using a defaultdict and reading the file line by line.
from collections import defaultdict

d = defaultdict(int)
# get the number of occurrences of each pair from the first two columns
with open('file', 'r') as f:
    f.readline()  # discard the header line
    for numlines, line in enumerate(f, 1):
        line = line.strip().split()
        c = line[0], line[1]
        d[c] += 1

# compute 100*(occurrences/numlines) for each key in d
d = {k: (v, 100*float(v)/numlines) for k, v in d.iteritems()}
for k in d:
    print k, d[k]
For your sample file, this will print:
('0987', '2345') (1, 20.0)
('8901', '3456') (1, 20.0)
('1234', '5678') (3, 60.0)
where the format is (column1, column2) (occurrences, percentage).
If you just need the occurrences for a single pair, e.g. '1234' and '5678', you can do it like this:
find = '1234', '5678'
counter = 0
with open('file', 'r') as f:
    f.readline()  # discard the header line
    for numlines, line in enumerate(f, 1):
        line = line.strip().split()
        c = line[0], line[1]
        if c == find:
            counter += 1

print counter, 100*float(counter)/numlines
Output for your sample file:
3 60.0
I have assumed that the header line does not count when computing the percentage. If it does count, change enumerate(f, 1) to enumerate(f, 2).
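Since the question originally reached for collections.Counter, here is a hedged sketch of the same pair count written with it; most_common(10) then gives the ten most frequent pairs directly (same assumed file layout, with a header line to discard):

from collections import Counter

with open('file', 'r') as f:
    f.readline()  # discard the header line
    pairs = Counter(tuple(line.split()[:2]) for line in f)

# the ten most frequent (column1, column2) pairs with their counts
print pairs.most_common(10)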

Python file preprocessing (convert a column from a discrete range of values to a contiguous range of values)

I have a dataset of the form:
user_id::item_id1::rating::timestamp
user_id::item_id2::rating::timestamp
user_id::item_id3::rating::timestamp
user_id::item_id4::rating::timestamp
I require the item_ids to be contiguous from 1 to n; they currently range from 1 to k, for k >> n. There are n distinct item ids, and the file is sorted by item id (subsequent rows may repeat an item id, but the column is guaranteed to be sorted).
I have the following code, but it isn't quite correct, and I have been at it for a couple of hours, so I would really appreciate any help with this. If there is a simpler way to do it in Python, I would appreciate guidance on that as well.
I currently have the following code:
def reOrderItemIds(inputFile, outputFile):
    # a set covering the id range 1 to 10681
    itemIdsRange = set(range(1, 10682))
    #currKey = 1
    currKey = itemIdsRange.pop()
    lastContiguousKey = 1
    #currKey+1
    contiguousKey = itemIdsRange.pop()
    f = open(inputFile)
    g = open(outputFile, "w")
    oldKeyToNewKeyMap = dict()
    for line in f:
        if int(line.split(":")[1]) == currKey and int(line.split(":")[1]) == lastContiguousKey:
            g.write(line)
        elif int(line.split(":")[1]) != currKey and int(line.split(":")[1]) != contiguousKey:
            oldKeyToNewKeyMap[line.split(":")[1]] = contiguousKey
            lastContiguousKey = contiguousKey
            # update currKey to the key on the current line
            currKey = int(line.split(":")[1])
            contiguousKey = itemIdsRange.pop()
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
        elif int(line.split(":")[1]) == currKey and int(line.split(":")[1]) != contiguousKey:
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
        elif int(line.split(":")[1]) != currKey and int(line.split(":")[1]) == contiguousKey:
            currKey = int(line.split(":")[1])
            lastContiguousKey = contiguousKey
            oldKeyToNewKeyMap[line.split(":")[1]] = lastContiguousKey
            contiguousKey = itemIdsRange.pop()
            g.write(line.split(":")[0] + ":" + str(lastContiguousKey) + ":" + line.split(":")[2] + ":" + line.split(":")[3])
    f.close()
    g.close()
Example:
1::1::3::100
10::1::5::104
20::2::3::110
1::5::2::104
I require the output to be of the form:
1::1::3::100
10::1::5::104
20::2::3::110
1::3::2::104
so only the item_ids column changes and everything else remains the same.
Any help would be much appreciated!
Because your data is already sorted by item_id, you can use itertools.groupby(), which makes easy work of the solution.
from operator import itemgetter
from itertools import groupby

item_id = itemgetter(1)

def reOrderItemIds(inputFile, outputFile):
    n = 1
    with open(inputFile) as infile, open(outputFile, "w") as outfile:
        dataset = (line.split('::') for line in infile)
        for key, group in groupby(dataset, item_id):
            for line in group:
                line[1] = str(n)
                outfile.write('::'.join(line))
            n += 1
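Called on the sample from the question (hypothetical file names; the four example rows are assumed to be in ratings.dat), this writes the remapped rows to ratings_out.dat:

reOrderItemIds('ratings.dat', 'ratings_out.dat')
# ratings_out.dat then contains:
# 1::1::3::100
# 10::1::5::104
# 20::2::3::110
# 1::3::2::104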
With my apologies for grossly misreading your question the first time, suppose data is a file containing
1::1::3::100
10::1::5::104
20::2::3::110
30::5::3::121
40::9::7::118
50::10::2::104
(If your data cannot all be cast to integers, this could be modified.)
>>> with open('data', 'r') as datafile:
...     dataset = datafile.read().splitlines()
...
>>> ids = {0}
>>> for i, line in enumerate(dataset):
...     data = list(map(int, line.split('::')))
...     if data[1] not in ids:
...         data[1] = max(ids) + 1
...     ids.add(data[1])
...     dataset[i] = '::'.join(str(d) for d in data)
...
>>> print('\n'.join(dataset))
1::1::3::100
10::1::5::104
20::2::3::110
30::3::3::121
40::4::7::118
50::5::2::104
Again, if your dataset is large, faster solutions can be devised.
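For comparison, a hedged sketch of an alternative: a plain dict that maps each old item_id to a new id in first-seen order does the same remapping in one pass and, unlike groupby, does not depend on the rows being sorted:

def remap_item_ids(input_file, output_file):
    # assign new ids 1..n in order of first appearance
    new_ids = {}
    with open(input_file) as infile, open(output_file, 'w') as outfile:
        for line in infile:
            fields = line.rstrip('\n').split('::')
            old_id = fields[1]
            if old_id not in new_ids:
                new_ids[old_id] = len(new_ids) + 1
            fields[1] = str(new_ids[old_id])
            outfile.write('::'.join(fields) + '\n')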

Generating a single outfile after analyzing multiple files in Python

I have multiple files, each containing 8 or 9 columns.
For a single file: I have to read the last column, which contains some value, count the number of occurrences of each value, and then generate an outfile.
I have done it like this:
inp = open(filename, 'r').read().strip().split('\n')
out = open(filename, 'w')
from collections import Counter

C = Counter()
for line in inp:
    k = line.split()[-1]  # read the last column
    C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()
Now the problem: it works fine if I have to generate one output for one input. But I need to scan a directory using the glob.iglob function to gather all the files to use as input, run the above program on each file, and then write the analyzed results for all the files into a single OUTPUT file.
NOTE: While generating the single OUTPUT file, if any value is repeated across files, then instead of writing the same entry twice it is preferred to sum up the counts. E.g., analysis of the 1st file generates:
123 6
111 5
0 6
45 5
and the 2nd file generates:
121 9
111 7
0 1
22 2
in this case the OUTPUT file must be written in such a way that it contains:
123 6
111 12   # counts summed for the value that appears in both files
0 7
45 5
22 2
I have written the program for single-file analysis, but I'm stuck on the multi-file part. Please help.
from collections import Counter
import glob

# a single output file; the original snippet opened the undefined name `filename` here
out = open('OUTPUT.txt', 'w')
g_iter = glob.iglob('path_to_dir/*')
C = Counter()
for filename in g_iter:
    f = open(filename, 'r')
    inp = f.read().strip().split('\n')
    f.close()
    for line in inp:
        k = line.split()[-1]  # read the last column
        C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()
After de-uglification:
from collections import Counter
import glob

def main():
    # create the Counter
    cnt = Counter()
    # collect data
    for fname in glob.iglob('path_to_dir/*.dat'):
        with open(fname) as inf:
            cnt.update(line.split()[-1] for line in inf)
    # dump results (cnt.iteritems() is Python 2; on Python 3 use cnt.items())
    with open("summary.dat", "w") as outf:
        outf.writelines("{:5s} {:>5d}\n".format(val, num) for val, num in cnt.iteritems())

if __name__ == "__main__":
    main()
Initialise an empty dictionary at the top of the program, say dic = dict(), and for each Counter update dic so that the values of matching keys are summed and new keys are added to dic.
To update dic, use this:
dic = dict((n, dic.get(n, 0) + C.get(n, 0)) for n in set(dic) | set(C))
where C is the Counter for the current file. After all files are finished, write dic to the output file.
import glob
from collections import Counter

dic = dict()
g_iter = glob.iglob(r'c:\python32\fol\*')
for x in g_iter:
    lis = []
    with open(x) as f:
        inp = f.readlines()
    for line in inp:
        num = line.split()[-1]
        lis.append(num)
    C = Counter(lis)
    dic = dict((n, dic.get(n, 0) + C.get(n, 0)) for n in set(dic) | set(C))

for x in dic:
    print(x, '\t', dic[x])
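Note that Counter already supports addition, so the dict-merge line above can be replaced by a simple +=; a short sketch using the count values from the question's two example files:

from collections import Counter

total = Counter()
# stand-ins for the per-file Counters built in the loop above
for per_file in (Counter({'111': 5, '0': 6}), Counter({'111': 7, '0': 1})):
    total += per_file  # sums counts for matching keys, inserts new keys
print(total)  # Counter({'111': 12, '0': 7})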
I did it like this:
import glob
from collections import Counter

out = open("write.txt", 'a')
C = Counter()
for file in glob.iglob('temp*.txt'):
    for line in open(file, 'r').read().strip().split('\n'):
        k = line.split()[-1]  # read the last column
        C[k] += 1

for value, count in C.items():
    x = "%s %d" % (value, count)
    out.write(x)
    out.write('\n')
out.close()
