I have a script which loops over multiple files.
For each file, I count how often a specific combination occurs in the file.
I do this using the following code:
with open("%s" %files) as f:
freqs = {}
sortedFreqs = []
# read lines of csv file
for l in f.readlines():
# some code here (not added) which fills the mutationList value
# this dict stores how often which mutation occurs.
freqs = Counter(mutationList)
# same list, only sorted.
sortedFreqs = sorted(freqs.iteritems(), key=operator.itemgetter(1), reverse=True)
So the freqs variable contains a long list of entries. Example:
'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1, 'CSMD3p.T1137': 3
I now want to sort them based on the second value; the sorted result is stored in sortedFreqs. Example:
'CSMD3p.T1137': 3, 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1
This is all going fine; however, I now want to loop over multiple files and add all the found frequencies together. So if I find the 'CSMD3p.T1137' value 2 more times, I want to store 'CSMD3p.T1137': 5.
Wanted output:
totalFreqs = 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1, 'CSMD3p.T1137': 5, 'TRPM1p.R551': 2
totalFreqsSorted = 'CSMD3p.T1137': 5, 'TRPM1p.R551': 2, 'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1
How do I "add" the key values of the dictionaries in Python? (How do I correctly fill the values of totalFreqs and totalFreqsSorted?)
Use one Counter() object for all counts and update it for each file:
freqs = Counter()
for file in files:
    with open(...) as f:
        #
        freqs.update(mutationList)
or you can add up counters simply by summing them:
total_freqs = Counter()
for file in files:
    with open(...) as f:
        #
        freqs = Counter(mutationList)
        total_freqs += freqs
Note that Counter() objects already provide a reverse-sorted list of frequencies; just use the Counter.most_common() method instead of sorting yourself:
sortedFreqs = freqs.most_common()
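For example, with the mutation counts from the question (the two Counter literals below are made up to match those numbers), adding the per-file counters gives the combined totals, and most_common() returns them sorted:

from collections import Counter

file1_counts = Counter({'FAM123Ap.Y550': 1, 'SMARCB1p.D192': 1, 'CSMD3p.T1137': 3})
file2_counts = Counter({'CSMD3p.T1137': 2, 'TRPM1p.R551': 2})

total_freqs = file1_counts + file2_counts  # per-key sums: CSMD3p.T1137 becomes 5
print(total_freqs.most_common())
# [('CSMD3p.T1137', 5), ('TRPM1p.R551', 2), ('FAM123Ap.Y550', 1), ('SMARCB1p.D192', 1)]
# (the two keys with count 1 may appear in either order)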
I need some help again, please.
I have a big database file (let's call it db.csv) containing a lot of information.
I run usearch61 -cluster_fast on my gene sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as CSV, then wrote code to create a dictionary (let's say dict_1) with my cluster numbers as keys and my gene_ids (VFG...) as values.
Here is an example of what I made and then stored in a file (dict_1):
0 ['VFG003386', 'VFG034084', 'VFG003381']
1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']
2 ['VFG018349', 'VFG018485', 'VFG043567']
...
14471 ['VFG015743', 'VFG002143']
So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA..., CVF..., or VF...) are values. Illustration of dict_2:
VFG044259 IA027
VFG044258 IA027
VFG011941 CVF397
VFG012016 CVF399
...
What I want in the end is to have, for each VF_Accession, the numbers of its cluster groups. Illustration:
IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...
Since I'm still a beginner at coding, I guess I need code that compares the values of dict_1 (VFG...) to the keys of dict_2 (VFG...). If they match, it should store the VF_Accession as a key with all its cluster numbers as values. Since VF_Accessions are keys, they can't have duplicates, so I need a dictionary of lists; I can do that, because I already made one for dict_1. But my problem is that I can't figure out a way to compare the values of dict_1 to the keys of dict_2 and assign each VF_Accession its cluster numbers. Please help me.
First, let's give your dictionaries some better names than dict_1, dict_2, ...; that makes it easier to work with them and to remember what they contain.
You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}
And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:
gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG002231': 'VF0369',
                           'VFG003386': 'CVF051'}
And we want to create a dictionary where each VF_Accession key has as value the numbers of cluster groups: vf_accession_to_cluster_groups.
We also note that a VF Accession can belong to multiple gene IDs (for example, the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).
So we use a defaultdict to make a dictionary with the VF Accession as key and a list of gene IDs as value:
from collections import defaultdict

vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)
For the sample data I posted above, vf_accession_to_gene_ids now looks like:
defaultdict(<class 'list'>, {'VF0142': ['VFG000676'],
                             'CVF051': ['VFG003386'],
                             'IA027': ['VFG044258', 'VFG044259'],
                             'CVF399': ['VFG012016'],
                             'CVF397': ['VFG011941'],
                             'VF0369': ['VFG002231']})
Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:
vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group
The end result for the above sample data now is:
{'VF0142': [],
 'CVF051': [0, 7949],
 'IA027': [0],
 'CVF399': [2, 14471],
 'CVF397': [5],
 'VF0369': []}
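A possible refinement (my own sketch, not part of the original approach): with 14,000+ clusters, scanning every cluster for every gene ID gets slow, so you can invert cluster_nr_to_gene_ids once and then look each gene ID up directly:

# build gene_id -> list of cluster numbers once, instead of rescanning all clusters
gene_id_to_cluster_nrs = defaultdict(list)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        gene_id_to_cluster_nrs[gene_id].append(cluster_nr)

vf_accession_to_cluster_groups = {}
for vf_accession, gene_ids in vf_accession_to_gene_ids.items():
    cluster_group = []
    for gene_id in gene_ids:
        cluster_group.extend(gene_id_to_cluster_nrs[gene_id])
    vf_accession_to_cluster_groups[vf_accession] = cluster_group

This produces the same result as the nested loops above.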
Caveat: I don't do much Python development, so there's likely a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that to process the second dictionary:
from collections import defaultdict
import sys
import ast

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists
vfg_cluster_map = defaultdict(list)

# map all of the vfg... keys to their cluster numbers first
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)
        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they
# appear in, e.g.
# {'VFG003386': [0],
#  'VFG034084': [0],
#  ...}

# you can look in that dictionary to find the cluster numbers corresponding
# to your vfg... keys in dict_2 and add them to the list for that vf_accession
vf_accession_cluster_map = defaultdict(list)
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')
        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

for vf_accession, cluster_list in vf_accession_cluster_map.items():
    print vf_accession + ' ' + str(cluster_list)
Then save the above script and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
EDIT: After looking at @BioGeek's answer, I should note that it would make more sense to process this all in one shot than to create dict_1 and dict_2 files, read them in, parse the lines back into numbers and lists, etc. If you don't need to write the dictionaries to a file first, then you can just add your other code to the script and use the dictionaries directly.
I have two lists structured like this:
A= [[1,27],[2,27],[3,27],[4,28],[5,29]]
B= [[6,30],[7,31],[8,31]]
And I have a file that has numbers:
1 5
2 3
3 1
4 2
5 5
6....
I want code that reads this file and maps it to the lists; e.g. if the file has 1, it should read list A and output 27, and if it has 6, it should read B and print 30, such that I get:
27 29
27 27
27 27
28 27
29 29
30 31
The problem is that my code gives an index error. I read the file line by line and have an if condition that checks whether the number I read from the file is less than the maximum number in list A; if so, it outputs the second element of that sublist, and otherwise moves on. The problem is that instead of moving on to list B, my program still reads A and gives an index error.
with open(filename) as myfile:
    for line in myfile.readlines():
        parts = line.split()
        if parts[0] < maxnumforA:
            print A[int(parts[0])-1]
        else:
            print B[int(parts[0])-1]
You should turn those lists into dictionaries. For example:
_A = dict(A)
_B = dict(B)

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in _A:
                print _A[part]
            elif part in _B:
                print _B[part]
If the action to be taken does not need to know whether the number comes from A or B, both can be turned into a single dictionary:
d = dict(A + B) # Creating the dictionary

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in d:
                print d[part]
Creating the dictionary can be accomplished in many different ways; I will list some of them:
d = dict(A + B): First joins both lists into a single list (without modifying A or B) and then turns the result into a dictionary. It's the clearest way to do it.
d = {**dict(A), **dict(B)}: Turns both lists into two separate dictionaries (without modifying A or B), unpacks them, and packs both into a single dictionary. Slightly (and I mean really slightly) faster than the previous method, and less clear. Proposed by @Nf4r.
d = dict(A) followed by d.update(B): Turns the first list into a dictionary and updates that dictionary with the content of the second list. The fastest method: one line of code per list, and no temporary objects are generated, so it is more memory-efficient.
As everyone stated before, a dict would be much better. I don't know if the left-side values in each list are unique, but if they are, you could just go for:
d = {**dict(A), **dict(B)} # available in Python 3.x

with open(filename) as file:
    for line in file.readlines():
        for num in line.split():
            if int(num) in d:
                print(d[int(num)])
def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So, this:
x[i]
Should never work. That's what the error is complaining about.
UPDATE
Since you asked for a recommendation on how to simplify your code (in a comment below), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with comprehensions:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
So then to average it all out you could just divide the sum by the number of lines:
average = ith_col_sum / len(file_lines)
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv

def csv_average(filename, column):
    """ Returns the average of the values in
        column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            column_values.append(float(row[column]))
    return sum(column_values) / len(column_values)
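A hypothetical call, assuming a file data.csv whose third column holds numbers (column is a zero-based index into each row):

print csv_average('data.csv', 2)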
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
    while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch that code. You want something like this, using the str.split function (http://docs.python.org/2/library/stdtypes.html#str.split):
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average / float(len(csvlines))
return totalaverage
BUT wait! There's more! Python has a built in csv parser that is safer than splitting by ,. Check it out here: http://docs.python.org/2/library/csv.html
In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict

with open('numcsv.csv') as f:
    reader = csv.reader(f)
    # use a defaultdict so each column starts with a list we can append to
    numbers = defaultdict(list)
    for row in reader:
        for column, value in enumerate(row, start=1):
            # convert the value to a float: 1. the number may be a float and
            # 2. when we calc the average we need to force float division
            numbers[column].append(float(value))

# simple comprehension to print the averages: %d = integer, %f = float. items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i, sum(column)/(len(column))) for i, column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
Here are two methods. The first one just gets the average for a line (which is what your code above looks like it's doing). The second gets the average across "columns" (which is what your question asked).
''' This just gets the avg for a line '''
def averager(filename):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in xrange(len(avg)):
        count += len(avg[i])
    return count/len(avg)
''' This gets the avg for all "columns"
    char is what we split on: , ; | (etc)
'''
def averager2(filename, char):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0 # count of items
    total = 0 # sum of all the lengths
    for i in xrange(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in xrange(len(cols)):
            total += len(cols[j].strip()) # remove line endings
    return total/float(count)
I have a list of hdf5 files which I would like to open, read the appropriate values from into a new dictionary, and eventually write to a text file. I don't necessarily know the values, so the user defines them in an array as an input to the code. The number of files needed is defined by the number of days' worth of data the user wants to look at.
new_data_dic = {}
for j in range(len(values)):
    new_data_dic[values[j]] = rbsp_ephm[values[j]]

for i in (np.arange(len(filenames_a)-1)+1):
    rbsp_ephm = h5py.File(filenames_a[i])
    for j in range(len(values)):
        new_data_dic[values[j]].append(rbsp_ephm[values[j]])
This works fine if I only have one file, but if I have two or more, it seems to close the key? I'm not sure if that is exactly what is happening, but when I inspect new_data_dic, the values show up as {'Bfs_geo_a': <Closed HDF5 dataset>, ...}, which will not write to a text file. I've tried closing each hdf5 file before opening the next (rbsp_ephm.close()), but I get the same error.
Thanks for any and all help!
I don't really understand your problem... are you trying to create a list of hdf5 datasets?
Or did you just forget the [()] to access the values in the dataset itself?
Here is a simple standalone example that works just fine:
import h5py

# File creation
filenames_a = []
values = ['values/toto', 'values/tata', 'values/tutu']
nb_file = 5
tmp = 0
for i in range(nb_file):
    fname = 'file%s.h5' % i
    filenames_a.append(fname)
    file = h5py.File(fname, 'w')
    grp = file.create_group('values')
    for value in values:
        file[value] = tmp
        tmp += 1
    file.close()

# the thing you want
new_data_dict = {value: [] for value in values}
for fname in filenames_a:
    rbsp_ephm = h5py.File(fname, 'r')
    for value in values:
        new_data_dict[value].append(rbsp_ephm[value][()])

print new_data_dict
It returns:
{'values/tutu': [2, 5, 8, 11, 14], 'values/toto': [0, 3, 6, 9, 12], 'values/tata': [1, 4, 7, 10, 13]}
Does it answer your question?
Maybe this is not directly the right solution, but you could try to extract the data as numpy arrays, which are a more flexible format than the h5py dataset one. See below how to do it:
>>> print type(file['Average/u'])
<class 'h5py.highlevel.Dataset'>
>>> print type(file['Average/u'][:])
<type 'numpy.ndarray'>
And just in case, you should try to use a more "pythonic" way for your loop, that is:
for j in values:
    new_data_dic[j] = rbsp_ephm[j]
instead of:
for j in range(len(values)):
    new_data_dic[values[j]] = rbsp_ephm[values[j]]
I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.
My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.
# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
    line = map(int, line.split())
    while line[0] == list[pointer].key:
        list[pointer].value = line[1]
        pointer += 1
    while line[0] > list[pointer].key:
        pointer += 1
        if pointer >= len(list) - 1:
            break # end of list; -1 is due to sentinel
Coded binary search + seek solution (thanks kigurai!):
entries = 24935502 # number of entries
width = 18         # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
    left, right = 0, entries - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) / 2
        fin.seek(mid * width)
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None # for when search key is not found
    search.result = value # store the result of the search
If you only need 200 of 50 million lines, then reading all of them into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file into memory, which I think should speed things up.
Slight optimization of S.Lott's answer:
from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = # some list of keys as strings
for line in fin:
    key, value = line.split()
    if key in targetKeys:
        keyValues[key].append(value)
Since we're using a dictionary rather than a list, the keys don't have to be numbers. This saves the map() operation and a string-to-integer conversion for each line. If you want the keys to be numbers, do the conversion at the end, when you only have to do it once per key, rather than for each of 50 million lines.
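For instance, a one-off conversion at the end might look like this (numericKeyValues is my name for it; keyValues is the defaultdict built above):

numericKeyValues = {int(key): [int(v) for v in values]
                    for key, values in keyValues.items()}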
It's not clear what "list[pointer]" is all about. Consider this, however.
from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = # some list of keys
for line in fin:
    key, value = map(int, line.split())
    if key in targetKeys:
        keyValues[key].append(value)
I would use memory-mapping: http://docs.python.org/library/mmap.html.
This way you can use the file as if it's stored in memory, but the OS decides which pages should actually be read from the file.
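A minimal sketch of the idea ('data.txt' stands in for the real data file; the seek position is arbitrary, the way a binary search would jump around):

import mmap

with open('data.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file read-only
    # the mapping acts like a long string: seek/readline/slicing all work,
    # but the OS only pages in the parts that are actually touched
    mm.seek(len(mm) // 2)  # jump to the middle of the file
    mm.readline()          # discard the (possibly partial) line at that offset
    line = mm.readline()   # the next complete "key value" line
    mm.close()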
Here is a recursive binary search on the text file
import os, stat

class IntegerKeyTextFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = open(self.filename, 'r')
        self.getStatinfo()

    def getStatinfo(self):
        self.statinfo = os.stat(self.filename)
        self.size = self.statinfo[stat.ST_SIZE]

    def parse(self, line):
        key, value = line.split()
        k = int(key)
        v = int(value)
        return (k, v)

    def __getitem__(self, key):
        return self.findKey(key)

    def findKey(self, keyToFind, startpoint=0, endpoint=None):
        "Recursively search a text file"
        if endpoint is None:
            endpoint = self.size
        currentpoint = (startpoint + endpoint) // 2
        while True:
            self.f.seek(currentpoint)
            if currentpoint != 0:
                # may not start at a line break! Discard.
                baddata = self.f.readline()
            linestart = self.f.tell()
            keyatpoint = self.f.readline()
            if not keyatpoint:
                # read returned empty - end of file
                raise KeyError('key %d not found' % (keyToFind,))
            k, v = self.parse(keyatpoint)
            if k == keyToFind:
                print 'key found at ', linestart, ' with value ', v
                return v
            if endpoint == startpoint:
                raise KeyError('key %d not found' % (keyToFind,))
            if k > keyToFind:
                return self.findKey(keyToFind, startpoint, currentpoint)
            else:
                return self.findKey(keyToFind, currentpoint, endpoint)
A sample text file created in jEdit seems to work:
>>> i = integertext.IntegerKeyTextFile('c:\\sampledata.txt')
>>> i[1]
key found at 0 with value 345
345
It could definitely be improved by caching found keys and using the cache to determine future starting seek points.
If you have any control over the format of the file, the "sort and binary search" responses are correct. The detail is that this only works with records of a fixed size and offset (well, I should say it only works easily with fixed length records).
With fixed length records, you can easily seek() around the sorted file to find your keys.
One possible optimization is to do a bit of buffering using the sizehint option of file.readlines(..). This allows you to load multiple lines into memory, totaling approximately sizehint bytes.
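A small sketch of that buffering pattern (the filename and handle() are placeholders, not names from the question):

# read whole lines in ~1 MB batches; sizehint is approximate, not exact
with open('data.txt') as fin:
    while True:
        batch = fin.readlines(1024 * 1024)
        if not batch:
            break
        for line in batch:
            handle(line)  # hypothetical per-line processing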
You need to implement a binary search using seek().