I have a plain CSV file that starts with these two rows:
1.Clubhouse,Fibre Ready,.......
2.Clubhouse,Aircon,.........
3....
I want to use Python to write a program that counts how many times each value appears in the CSV file. I have tried several ways, but it did not work out.
My output should be like this:
Clubhouse: .... times
Fibre Ready: .... times
You can use collections.Counter:
from collections import Counter
import csv

counter = Counter()
with open('furniture.csv') as fobj:
    reader = csv.reader(fobj)
    for row in reader:
        counter.update(row)

for k, v in counter.items():
    print('{}: {} times'.format(k, v))
Output for your two lines:
Clubhouse: 2 times
Fibre Ready: 1 times
Fitness Corner: 2 times
Aircon: 2 times
...
You can also access single items:
>>> counter['Clubhouse']
2
>>> counter['Fibre Ready']
1
collections.Counter is useful for this type of task:
Dict subclass for counting hashable items. Sometimes called a bag
or multiset. Elements are stored as dictionary keys and their counts
are stored as dictionary values.
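Counter can also hand you the counts sorted from most to least frequent via most_common(); a small sketch with made-up data:

>>> Counter(['Clubhouse', 'Aircon', 'Clubhouse']).most_common()
[('Clubhouse', 2), ('Aircon', 1)]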
I need Python to read a .txt file and sum up the hours each student has attended school for the year. I need help understanding how to do this when the same student has multiple lines in the file. The .txt file looks something like this:
John0550
John0550
Sally1007
And the ultimate result I'm looking for in Python is to print out a list like:
John has attended 1100 hours
Sally has attended 1007 hours
I know I can't rely on a dict() because it won't accommodate identical keys. So what is the best way to do this?
Suppose you already have a function named split_line that returns the student's name / hours attended pair for each line. Your algorithm would look like:
hours_attended_per_student = {}  # Create an empty dict
with open("my_file.txt", "r") as file:
    for line in file.readlines():
        name, hours = split_line(line)
        # Check whether you have already counted some hours for this student
        if name not in hours_attended_per_student:
            # Student was not encountered yet, set their hours to 0
            hours_attended_per_student[name] = 0
        # Now that the student's name is in the dict, add this line's hours
        hours_attended_per_student[name] += hours
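split_line is left for you to write; a minimal sketch of it, assuming (as in your sample file) an alphabetic name directly followed by the hours as digits:

import re

def split_line(line):
    # Hypothetical helper: split "John0550" into ("John", 550).
    match = re.match(r'([A-Za-z]+)(\d+)', line.strip())
    return match.group(1), int(match.group(2))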
A defaultdict could be helpful here:
import re
from collections import defaultdict
from io import StringIO

# Simulate File
with StringIO('''John0550
John0550
Sally1007''') as f:
    # Create defaultdict initialized at 0
    d = defaultdict(lambda: 0)
    # For each line in the file
    for line in f.readlines():
        # Split Name from Value
        name, value = re.split(r'(^[^\d]+)', line)[1:]
        # Sum Value into dict
        d[name] += int(value)

# For Display
print(dict(d))
Output:
{'John': 1100, 'Sally': 1007}
Assuming values are already split and parsed:
from collections import defaultdict

entries = [('John', 550), ('John', 550), ('Sally', 1007)]

d = defaultdict(int)
for name, value in entries:
    # Sum Value into dict
    d[name] += value

# For Display
print(dict(d))
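collections.Counter works just as well here, since it supports arbitrary integer increments rather than only counting by one; a sketch under the same assumption of pre-parsed values:

from collections import Counter

entries = [('John', 550), ('John', 550), ('Sally', 1007)]

totals = Counter()
for name, value in entries:
    totals[name] += value

print(dict(totals))  # {'John': 1100, 'Sally': 1007}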
Please, I need some help again.
I have a big database file (let's call it db.csv) containing a lot of information.
Simplified database file to illustrate:
I run usearch61 -cluster_fast on my gene sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as CSV, then wrote code to create a dictionary (let's say dict_1) with my cluster numbers as keys and my gene_ids (VFG...) as values.
Here is an example of what I made and then stored in a file (dict_1):
0 ['VFG003386', 'VFG034084', 'VFG003381']
1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']
2 ['VFG018349', 'VFG018485', 'VFG043567']
...
14471 ['VFG015743', 'VFG002143']
So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA... or CVF... or VF...) are values; illustration (dict_2):
VFG044259 IA027
VFG044258 IA027
VFG011941 CVF397
VFG012016 CVF399
...
What I want in the end is to have for each VF_Accession the numbers of cluster groups, illustration:
IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...
So, since I'm still a beginner in coding, I guess I need to write code that compares the values from dict_1 (VFG...) to the keys from dict_2 (VFG...). If they match, it should put the VF_Accession as a key with all its cluster numbers as values. Since VF_Accessions are keys they can't be duplicated, so I need a dictionary of lists. I guess I can do that, because I made one for dict_1. But my problem is that I can't figure out a way to compare the values from dict_1 to the keys from dict_2 and assign each VF_Accession its cluster numbers. Please help me.
First, let's give your dictionaries better names than dict_1 and dict_2; that makes it easier to work with them and to remember what they contain.
You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
5: ['VFG011941'],
7949: ['VFG003386'],
14471: ['VFG015743', 'VFG002143', 'VFG012016']}
And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:
gene_id_to_vf_accession = {'VFG044259': 'IA027',
'VFG044258': 'IA027',
'VFG011941': 'CVF397',
'VFG012016': 'CVF399',
'VFG000676': 'VF0142',
'VFG002231': 'VF0369',
'VFG003386': 'CVF051'}
And we want to create a dictionary where each VF_Accession key has as value the numbers of cluster groups: vf_accession_to_cluster_groups.
We also note that a VF Accession can correspond to multiple gene IDs (for example, the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).
So we use defaultdict to make a dictionary with a VF Accession as key and a list of gene IDs as value:
from collections import defaultdict

vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)
For the sample data I posted above, vf_accession_to_gene_ids now looks like:
defaultdict(<class 'list'>, {'VF0142': ['VFG000676'],
'CVF051': ['VFG003386'],
'IA027': ['VFG044258', 'VFG044259'],
'CVF399': ['VFG012016'],
'CVF397': ['VFG011941'],
'VF0369': ['VFG002231']})
Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:
vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group
The end result for the above sample data now is:
{'VF0142': [],
'CVF051': [0, 7949],
'IA027': [0],
'CVF399': [2, 14471],
'CVF397': [5],
'VF0369': []}
Caveat: I don't do much Python development, so there's likely a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that to process the second dictionary:
from collections import defaultdict
import sys
import ast

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists
vfg_cluster_map = defaultdict(list)

# map all of the vfg... keys to their cluster numbers first
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)
        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they
# appear in, e.g.
# {'VFG003386': [0],
#  'VFG034084': [0],
#  ...}

# you can look in that dictionary to find the cluster numbers corresponding
# to your vfg... keys in dict_2 and add them to the list for that vf_accession
vf_accession_cluster_map = defaultdict(list)
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')
        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

for vf_accession, cluster_list in vf_accession_cluster_map.items():
    print(vf_accession + ' ' + str(cluster_list))
Then save the above script and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
EDIT: After looking at @BioGeek's answer, I should note that it would make more sense to process this all in one shot than to create dict_1 and dict_2 files, read them in, parse the lines back into numbers and lists, etc. If you don't need to write the dictionaries to a file first, then you can just add your other code to the script and use the dictionaries directly.
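A minimal sketch of that one-shot approach, assuming the in-memory dictionaries cluster_nr_to_gene_ids and gene_id_to_vf_accession from the answer above instead of the intermediate files:

from collections import defaultdict

# invert cluster_nr -> gene_ids into gene_id -> list of cluster numbers
vfg_cluster_map = defaultdict(list)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        vfg_cluster_map[gene_id].append(cluster_nr)

# then map each VF_Accession to the clusters of all its gene IDs
vf_accession_cluster_map = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[gene_id])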
I have two lists structured somehow like this:
A= [[1,27],[2,27],[3,27],[4,28],[5,29]]
B= [[6,30],[7,31],[8,31]]
and I have a file that contains numbers:
1 5
2 3
3 1
4 2
5 5
6....
I want code that reads this file and maps the numbers to the lists, e.g. if the file has 1, it should look it up in list A and output 27; if it has 6, it should look in B and print 30, so that I get:
27 29
27 27
27 27
28 27
29 29
30 31
The problem is that my code gives an index error. I read the file line by line and have an if condition that checks whether the number I read from the file is less than the maximum number in list A; if so, it outputs the second element of the matching pair, and otherwise moves on. The problem is that instead of moving on to list B, my program still reads A and gives an index error.
with open(filename) as myfile:
    for line in myfile.readlines():
        parts = line.split()
        if parts[0] < maxnumforA:
            print A[int(parts[0])-1]
        else:
            print B[int(parts[0]-1)
You should turn those lists into dictionaries. For example:
_A = dict(A)
_B = dict(B)

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in _A:
                print(_A[part])
            elif part in _B:
                print(_B[part])
If the code that uses the value does not need to know whether it came from A or B, both lists can be turned into a single dictionary:
d = dict(A + B)  # Creating the dictionary

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in d:
                print(d[part])
Creating the dictionary can be accomplished in many different ways; I will list some of them (a sketch comparing all three follows this list):
d = dict(A + B): First joins both lists into a single list (without modifying A or B) and then turns the result into a dictionary. It's the clearest way to do it.
d = {**dict(A), **dict(B)}: Turns both lists into two separate dictionaries (without modifying A or B), then unpacks both into a single dictionary. Slightly (and I mean really slightly) faster than the previous method and less clear. Proposed by @Nf4r.
d = dict(A) followed by d.update(B): Turns the first list into a dictionary and updates that dictionary with the content of the second list. The fastest method: one line of code per list and no temporary objects, so it is also more memory-efficient.
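A quick sketch of the three variants with the lists from the question; all three produce the same dictionary:

A = [[1, 27], [2, 27], [3, 27], [4, 28], [5, 29]]
B = [[6, 30], [7, 31], [8, 31]]

d1 = dict(A + B)             # join, then convert

d2 = {**dict(A), **dict(B)}  # convert, then unpack (Python 3.5+)

d3 = dict(A)                 # convert the first list...
d3.update(B)                 # ...then merge in the second

assert d1 == d2 == d3 == {1: 27, 2: 27, 3: 27, 4: 28, 5: 29, 6: 30, 7: 31, 8: 31}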
As everyone stated before, a dict would be much better. I don't know if the left-side values in each list are unique, but if they are, you could just go for:
d = {**dict(A), **dict(B)}  # available in Python 3.5+

with open(filename) as file:
    for line in file.readlines():
        for num in line.split():
            if int(num) in d:
                print(d[int(num)])
I have a text file with many tens of thousands of short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and produces slightly different tagging each time. My goal is to identify the lines where those differences occur. In other words, print all utterances for which the tagging differs across the N result files.
For example, with N=10 I get 10 tagged files. Suppose line 1 is tagged the same way in all 10 tagged files: do not print it. Suppose line 2 is tagged one way once and another way 9 times: print it. And so on.
For N=2 it is easy: I just run diff. But what do I do if I have N=10 results?
If you have the tagged files, just create a counter recording, for each numbered line, how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict

# start counting at 0
counter_dict = defaultdict(lambda: 0)
tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]

# add all lines of each file to dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order
        # produces (LINE_NUMBER, LINE_CONTENT) tuples (hashable)
        for line_with_number in enumerate(f.readlines()):
            counter_dict[line_with_number] += 1

# print all values that do not repeat in all files (in same location)
for key, value in counter_dict.items():
    if value < len(tagged_file_names):
        print("line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        ))
Walkthrough:
First of all, we create a data structure to enable us to count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0 - the count of a newly added line (increased to 1 on its first add).
Then, we create the actual entry as a tuple, which is hashable, so it can be used as a dictionary key, and deeply comparable to other tuples by default. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz") - so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and print the keys (which correspond to numbered lines) that do not appear in all files (that is, their count is less than the number of tagged files provided).
Example:
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut@tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
If you compare all of them to the first text, then you can get a list of all the texts that are different. This might not be the quickest way, but it would work.
import difflib

n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'

l = [n1, n2, n3]
# collect the texts that differ from the first one
m = [x for x in l if x != l[0]]
# show a unified diff of each differing text against the first
for other in m:
    diff = difflib.unified_diff(l[0].splitlines(), other.splitlines())
    print('\n'.join(diff))
I have a script which loops over multiple files.
For each file, I count how often a specific combination in the file occurs.
I do this using the following code:
with open("%s" %files) as f:
freqs = {}
sortedFreqs = []
# read lines of csv file
for l in f.readlines():
# some code here (not added) which fills the mutationList value
# this dict stores how often which mutation occurs.
freqs = Counter(mutationList)
# same list, only sorted.
sortedFreqs = sorted(freqs.iteritems(), key=operator.itemgetter(1), reverse=True)
So the freqs variable contains a long list of entries, for example:
'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1, 'CSMD3p.T1137': 3
I now want to sort them based on the second value; the sorted result is stored in sortedFreqs.
example:
'CSMD3p.T1137': 3, 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1
This is all going fine, however I now want to loop over multiple files, and add all found frequencies together. So if I find the 'CSMD3p.T1137' value 2 more times, I want to store 'CSMD3p.T1137': 5.
wanted output:
totalFreqs = 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1, 'CSMD3p.T1137': 5, 'TRPM1p.R551': 2
totalFreqsSorted = 'CSMD3p.T1137': 5, 'TRPM1p.R551': 2, 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1
how do I "add" the key values of the dictionaries in python ? (how do I correctly file the values of totalFreqs and totalFreqsSorted)
Use one Counter() object for all counts and update it for each file:
freqs = Counter()
for file in files:
    with open(...) as f:
        #
        freqs.update(mutationList)
or you can add up counters simply by summing them:
total_freqs = Counter()
for file in files:
    with open(...) as f:
        #
        freqs = Counter(mutationList)
        total_freqs += freqs
Note that Counter() objects already provide a reverse-sorted list of frequencies; just use the Counter.most_common() method instead of sorting yourself:
sortedFreqs = freqs.most_common()
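For example, with the counts from the question (a hypothetical mutationList is assumed here just to build the Counter):

from collections import Counter

# hypothetical input matching the counts from the question
mutationList = ['CSMD3p.T1137'] * 3 + ['FAM123Ap.Y550', 'SMARCB1p.D192']

freqs = Counter(mutationList)
print(freqs.most_common())
# [('CSMD3p.T1137', 3), ('FAM123Ap.Y550', 1), ('SMARCB1p.D192', 1)]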