Indicating a population structure in EggLib Python

In Python, I am using EggLib. I am trying to calculate Jost's D value per SNP found in a VCF file.
Data
Data is here in VCF format. The data set is small: there are 2 populations, 100 individuals per population, and 6 SNPs (all on chromosome 1).
Each individual is named Pp.Ii, where p is the population index it belongs to and i is the individual index.
Code
My difficulty concerns the specification of the population structure. Here is my attempt:
import egglib

### Read the VCF file ###
vcf = egglib.io.VcfParser("MyData.vcf")

### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}
# Loop through each population
for popIndex in [0, 1]:
    # Dictionary for a given population. There are two populations.
    dpop = {}
    # Loop through each individual
    for IndIndex in range(popIndex * 100, (popIndex + 1) * 100):
        # A single list to define an individual
        dpop[IndIndex] = [IndIndex * 2, IndIndex * 2 + 1]
    dcluster[popIndex] = dpop
struct = {0: dcluster}

### Define the population structure ###
Structure = egglib.stats.make_structure(struct, None)

### Configure the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj')  # Jost's D

### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)

### Calculate Jost's D ###
cs.process_site(site, struct=Structure)

Running this raises:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
self._frq.process_site(site, struct=struct)
File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)
The documentation indicates here
[The Structure object] is a tuple containing two items, each being a dict. The first one represents the ingroup and the second represents the outgroup.
The ingroup dictionary is itself a dictionary holding more dictionaries, one for each cluster of populations. Each cluster dictionary is a dictionary of populations, populations being themselves represented by a dictionary. A population dictionary is, again, a dictionary of individuals. Finally, individuals are represented by lists.
An individual list contains the index of all samples belonging to this individual. For haploid data, individuals will be one-item lists. In other cases, all individual lists are required to have the same number of items (consistent ploidy). Note that, if the ploidy is more than one, nothing enforces that samples of a given individual are grouped within the original data.
The keys of the ingroup dictionary are the labels identifying each cluster. Within a cluster dictionary, the keys are population labels. Finally, within a population dictionary, the keys are individual labels.
The second dictionary represents the outgroup. Its structure is simpler: it has individual labels as keys, and lists of corresponding sample indexes as values. The outgroup dictionary is similar to any ingroup population dictionary. The ploidy is required to match over all ingroup and outgroup individuals.
but I fail to make sense of it. The example provided is for the fasta format and I don't understand how to extend the logic to VCF format.

There are two errors.
First error
The function make_structure returns the Structure object but does not store it anywhere within egglib.stats. You therefore have to save the returned object and pass it to the function process_site:
Structure = egglib.stats.make_structure(struct, None)
Second error
The Structure object must designate haploids. Therefore, create the dictionary as follows:
dcluster = {}
for popIndex in [0, 1]:
    dpop = {}
    for IndIndex in range(popIndex * 100, (popIndex + 1) * 100):
        dpop[IndIndex] = [IndIndex]
    dcluster[popIndex] = dpop
struct = {0: dcluster}
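Putting both fixes together, here is a minimal end-to-end sketch. It reuses only the EggLib calls already shown above (VcfParser, make_structure, ComputeStats, site_from_vcf, process_site) and assumes that process_site returns a dictionary of the requested statistics; treat it as a sketch rather than the definitive EggLib recipe.

import egglib

# One cluster, two populations, 100 haploid-coded samples each
dcluster = {}
for popIndex in [0, 1]:
    dpop = {}
    for IndIndex in range(popIndex * 100, (popIndex + 1) * 100):
        dpop[IndIndex] = [IndIndex]
    dcluster[popIndex] = dpop
Structure = egglib.stats.make_structure({0: dcluster}, None)

# Configure the statistics computer for Jost's D
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj')

# Loop over the 6 SNPs in the file and compute Jost's D for each
vcf = egglib.io.VcfParser("MyData.vcf")
for _ in range(6):
    vcf.next()
    site = egglib.stats.site_from_vcf(vcf)
    stats = cs.process_site(site, struct=Structure)
    print stats['Dj']  # assumes process_site returns a dict such as {'Dj': value}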

Related

Count occurrences of a specific string within multi-valued elements in a set

I have generated a list of genes
genes = ['geneName1', 'geneName2', ...]
and a set of their interactions:
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3'),...}
I want to find out how many interactions each gene has and put that in a vector (or dictionary) but I struggle to count them. I tried the usual approach:
interactionList = []
for gene in genes:
    interactions = geneInt.count(gene)
    interactionList.append(interactions)
but of course the code fails, because my set contains elements that are made up of two values, while I need to iterate over the single values within them.
I would argue that you are using the wrong data structure to hold interactions. You can represent interactions as a dictionary keyed by gene name, whose values are a set of all the genes it interacts with.
Let's say you currently have a process that does something like this at some point:
geneInt = set()
...
geneInt.add((gene1, gene2))
Change it to
import collections

geneInt = collections.defaultdict(set)
...
geneInt[gene1].add(gene2)
If the interactions are symmetrical, add a line
geneInt[gene2].add(gene1)
Now, to count the number of interactions, you can do something like
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
Counting from your original set is also simple if the interactions are one-way:
intCounts = dict.fromkeys(genes, 0)
for gene, _ in geneInt:
    intCounts[gene] += 1
If each interaction is two-way, there are three possibilities:
Both interactions are represented in the set: the above loop will work.
Only one interaction of a pair is represented: change the loop to
for gene1, gene2 in geneInt:
    intCounts[gene1] += 1
    if gene1 != gene2:
        intCounts[gene2] += 1
Some reverse interactions are represented, some are not. In this case, transform geneInt into a dictionary of sets as shown in the beginning.
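For completeness, here is a small self-contained sketch of the dictionary-of-sets approach (the gene names are made up for illustration):

import collections

genes = ['geneName1', 'geneName2', 'geneName3']
pairs = [('geneName1', 'geneName2'), ('geneName1', 'geneName3')]

# interactions keyed by gene; values are the set of partner genes
geneInt = collections.defaultdict(set)
for gene1, gene2 in pairs:
    geneInt[gene1].add(gene2)
    geneInt[gene2].add(gene1)  # only if interactions are symmetrical

# number of distinct interaction partners per gene
intCounts = {gene: len(geneInt[gene]) for gene in genes}
print(intCounts)  # {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}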
Try something like this:
interactions = {}
for gene in genes:
    interactions_count = 0
    for tup in geneInt:
        interactions_count += tup.count(gene)
    interactions[gene] = interactions_count
Use a dictionary, and keep incrementing the value for every gene you see in each tuple in the set geneInt.
interactions_counter = dict()
for interaction in geneInt:
    for gene in interaction:
        interactions_counter[gene] = interactions_counter.get(gene, 0) + 1
The dict.get(key, default) method returns the value at the given key, or the specified default if the key doesn't exist.
For the set geneInt={('geneName1', 'geneName2'), ('geneName1', 'geneName3')}, we get:
interactions_counter = {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}
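Equivalently, collections.Counter can do the same counting in one expression; a quick sketch with the same sample set:

from collections import Counter

geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3')}
interactions_counter = Counter(gene for interaction in geneInt for gene in interaction)
# Counter({'geneName1': 2, 'geneName2': 1, 'geneName3': 1})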

How to remove a double/nested for-loop? String to float transformation in Python

I have the following polygon of a geographic area that I fetch via a request in CAP/XML format from an API
The raw data looks like this:
<polygon>22.3243,113.8659 22.3333,113.8691 22.4288,113.8691 22.4316,113.8742 22.4724,113.9478 22.5101,113.9951 22.5099,113.9985 22.508,114.0017 22.5046,114.0051 22.5018,114.0085 22.5007,114.0112 22.5007,114.0125 22.502,114.0166 22.5038,114.0204 22.5066,114.0245 22.5067,114.0281 22.5057,114.0371 22.5051,114.0409 22.5041,114.0453 22.5025,114.0494 22.5023,114.0511 22.5035,114.0549 22.5047,114.0564 22.5059,114.057 22.5104,114.0576 22.512,114.0584 22.5144,114.0608 22.5163,114.0637 22.517,114.0657 22.5172,114.0683 22.5181,114.0717 22.5173,114.0739</polygon>
I store the requested items in a dictionary and then work through them to transform them into a GeoJSON list object suitable for ingestion into Elasticsearch, according to the schema I'm working with. I've removed irrelevant code here for ease of reading.
import json
import requests
import xmltodict

# fetch and store data in a dictionary
r = requests.get("https://alerts.weather.gov/cap/ny.php?x=0")
xpars = xmltodict.parse(r.text)
json_entry = json.dumps(xpars['feed']['entry'])
dict_entry = json.loads(json_entry)

# transform items if necessary
for entry in dict_entry:
    if entry['cap:polygon']:
        polygon = entry['cap:polygon']
        polygon = polygon.split(" ")
        coordinates = []
        # take the split list items, swap their positions and enclose them in their own arrays
        for p in polygon:
            p = p.split(",")
            p[0], p[1] = float(p[1]), float(p[0])  # swap lon/lat
            coordinates += [p]
        # more code adding fields to new dict object, not relevant to the question
The output of the for p in polygon loop looks like:
[ [113.8659, 22.3243], [113.8691, 22.3333], [113.8691, 22.4288], [113.8742, 22.4316], [113.9478, 22.4724], [113.9951, 22.5101], [113.9985, 22.5099], [114.0017, 22.508], [114.0051, 22.5046], [114.0085, 22.5018], [114.0112, 22.5007], [114.0125, 22.5007], [114.0166, 22.502], [114.0204, 22.5038], [114.0245, 22.5066], [114.0281, 22.5067], [114.0371, 22.5057], [114.0409, 22.5051], [114.0453, 22.5041], [114.0494, 22.5025], [114.0511, 22.5023], [114.0549, 22.5035], [114.0564, 22.5047], [114.057, 22.5059], [114.0576, 22.5104], [114.0584, 22.512], [114.0608, 22.5144], [114.0637, 22.5163], [114.0657, 22.517], [114.0683, 22.5172], [114.0717, 22.5181], [114.0739, 22.5173] ]
Is there a way to do this that is better than O(N^2)? Thank you for taking the time to read.
O(KxNxM)
This process involves three obvious loops. These are:
Checking each entry (K)
Splitting valid entries into points (MxN) and iterating through those points (N)
Splitting those points into respective coordinates (M)
The number of characters in a polygon string is ~MxN, because there are N points each roughly M characters long, so splitting iterates through MxN characters.
Now that we know all of this, let's pinpoint where each occurs.
ENTRIES (K):
    IF:
        SPLIT (MxN)
        POINTS (N):
            COORDS (M)
So, we can finally conclude that this is O(K(MxN + MxN)), which is just O(KxNxM). That is already linear in the total size of the input (every character has to be read at least once), so there is no asymptotically faster approach.
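As a side note, the inner transformation can be written more compactly; this is a sketch of the same logic as the question's loop (per entry), not a faster algorithm, and it remains O(KxNxM) overall:

coordinates = [
    [float(lon), float(lat)]  # GeoJSON order: [longitude, latitude]
    for lat, lon in (p.split(",") for p in entry['cap:polygon'].split(" "))
]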

Comparing keys from a first dictionary to values from a second dictionary

Please, I need some help again.
I have a big database file (let's call it db.csv) containing a lot of information.
Simplified database file to illustrate:
I ran usearch61 -cluster_fast on my gene sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as CSV, then wrote code to create a dictionary (let's call it dict_1) with cluster numbers as keys and gene_ids (VFG...) as values.
Here is an example of what I made then stored in a file: dict_1
0 ['VFG003386', 'VFG034084', 'VFG003381']
1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']
2 ['VFG018349', 'VFG018485', 'VFG043567']
...
14471 ['VFG015743', 'VFG002143']
So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA... or CVF... or VF...) are values. Illustration of dict_2:
VFG044259 IA027
VFG044258 IA027
VFG011941 CVF397
VFG012016 CVF399
...
What I want in the end is, for each VF_Accession, the numbers of the cluster groups it appears in. Illustration:
IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...
So I guess, since I'm still a beginner in coding, that I need to write code that compares the values of dict_1 (VFG...) to the keys of dict_2 (VFG...). If they match, put the VF_Accession as a key with all matching cluster numbers as values. Since VF_Accessions are keys they can't have duplicates, so I need a dictionary of lists. I guess I can do that, because I already made one for dict_1. But my problem is that I can't figure out a way to compare the values from dict_1 to the keys from dict_2 and assign each VF_Accession its cluster numbers. Please help me.
First, let's give your dictionaries some better names than dict_1, dict_2, ...; that makes it easier to work with them and to remember what they contain.
You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}
And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:
gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG002231': 'VF0369',
                           'VFG003386': 'CVF051'}
And we want to create a dictionary where each VF_Accession key has as value the numbers of cluster groups: vf_accession_to_cluster_groups.
We also note that a VF Accession belongs to multiple gene IDs (for example: the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).
So we use defaultdict to make a dictionary with VF Accession as key and a list of gene IDs as value
from collections import defaultdict

vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)
For the sample data I posted above, vf_accession_to_gene_ids now looks like:
defaultdict(<class 'list'>, {'VF0142': ['VFG000676'],
                             'CVF051': ['VFG003386'],
                             'IA027': ['VFG044258', 'VFG044259'],
                             'CVF399': ['VFG012016'],
                             'CVF397': ['VFG011941'],
                             'VF0369': ['VFG002231']})
Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:
vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group
The end result for the above sample data now is:
{'VF0142': [],
'CVF051': [0, 7949],
'IA027': [0],
'CVF399': [2, 14471],
'CVF397': [5],
'VF0369': []}
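If the real dictionaries are large, the nested membership test above (for every gene ID, scan every cluster's list) can get slow. One way around it, sketched here on the same sample data, is to invert cluster_nr_to_gene_ids once and then do plain lookups; gene_id_to_cluster_nrs is a name introduced just for this sketch:

from collections import defaultdict

# build gene_id -> list of cluster numbers once
gene_id_to_cluster_nrs = defaultdict(list)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        gene_id_to_cluster_nrs[gene_id].append(cluster_nr)

# then each VF accession is a simple lookup
vf_accession_to_cluster_groups = {
    vf_accession: [nr for gene_id in gene_ids for nr in gene_id_to_cluster_nrs[gene_id]]
    for vf_accession, gene_ids in vf_accession_to_gene_ids.items()
}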
Caveat: I don't do much Python development, so there's likely a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that to process the second dictionary:
from collections import defaultdict
import sys
import ast

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists
vfg_cluster_map = defaultdict(list)

# map all of the vfg... keys to their cluster numbers first
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)
        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they
# appear in, e.g.
# {'VFG003386': [0],
#  'VFG034084': [0],
#  ...}

# you can look in that dictionary to find the cluster numbers corresponding
# to your vfg... keys in dict_2 and add them to the list for that vf_accession
vf_accession_cluster_map = defaultdict(list)
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')
        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

for vf_accession, cluster_list in vf_accession_cluster_map.items():
    print vf_accession + ' ' + str(cluster_list)
Then save the above script and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
EDIT: After looking at @BioGeek's answer, I should note that it would make more sense to process this all in one shot than to create dict_1 and dict_2 files, read them in, parse the lines back into numbers and lists, etc. If you don't need to write the dictionaries to a file first, then you can just add your other code to the script and use the dictionaries directly.

Algorithmic / coding help for a PySpark markov model

I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions and they work fine for sequences of a couple thousand, but when we get into the 20,000+ range (and I've got some up to 800k), things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database.
{500:
    {HNLLNH : 0.333},
    {LNHMLH : 0.333},
    {MLHHML : 0.333},
    {LNHHNL : 0.000},
    etc..
}
So in essence, each sequence is combined with the next (HNL, LNH becomes 'HNLLNH'); then, for all possible transitions (combinations of sequences), we count their occurrences and divide by the total number of transitions (3 in this case) to get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH, so for HNLLNH: 1/3 = 0.333.
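In plain Python, the per-row computation described here is essentially this (just an illustration of the desired result, not the Spark code):

from collections import Counter

row = ['HNL', 'LNH', 'MLH', 'HML']                   # one row's sequence of states
transitions = [a + b for a, b in zip(row, row[1:])]  # ['HNLLNH', 'LNHMLH', 'MLHHML']
counts = Counter(transitions)
freqs = {k: v / float(len(transitions)) for k, v in counts.items()}
# {'HNLLNH': 0.333..., 'LNHMLH': 0.333..., 'MLHHML': 0.333...}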
As a side note, and I'm not sure if it's relevant, but the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counting the total occurrences of that key in the full list, and dividing by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block, up above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows. Probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y) - 1):
        s = s + str(y[i] + str(y[i+1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0]) + 1) / 2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i) / (len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.iteritems():
            db_insert(k, v)
You can try something like below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]

    n = len(transactions) - 1

    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),        # Joins strings
        Counter,                          # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )

    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions, I would recommend using a sparse representation and ignoring zeros. Moreover, you can replace the dictionaries with sparse vectors to reduce the memory footprint.
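A rough sketch of that idea, assuming a fixed ordering of the possible transition keys (taken from the broadcast defaults dictionary above); key_index and to_sparse are names introduced only for this sketch:

from pyspark.mllib.linalg import SparseVector

# fix an index for every possible transition key
transition_keys = sorted(defaults.value)
key_index = {k: i for i, k in enumerate(transition_keys)}

def to_sparse(freqs):
    # keep only the non-zero frequencies, addressed by their fixed index
    return SparseVector(len(key_index),
                        {key_index[k]: v for k, v in freqs.items()})

sparse_rdd = rdd.map(process).mapValues(to_sparse)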
You can achieve this result using pure PySpark; I did it that way.
To create the frequencies, let's say you have already reached the point where your input RDDs look like this:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequencies for pairs like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda (k, list): get_frequencies(list)).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)
def get_frequencies(states_list):
    """
    :param states_list: a list of customer states.
    :return: state frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx+1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDDs into DataFrames, or you can insert directly into the DB using the RDD's mapPartitions.
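For the direct-insert route, a minimal sketch (foreachPartition, a close relative of mapPartitions, is the usual choice for side effects such as DB writes; result_rdd stands for the RDD produced above and db_insert is the question's own insert function):

def store_rows(rows):
    # called once per partition, so a DB connection could be opened here once
    for key, value in rows:
        db_insert(key, value)

result_rdd.foreachPartition(store_rows)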

Reading files in parallel in python

I have a bunch of files (almost 100) which contain data of the format:
(number of people) \t (average age)
These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third world country. We will be comparing these values to the average ages of similar sized localities in a developed country.
What I want to do is,
for each i (i ranges from 1 to 100,000):
    read in the first 'i' values of average-age
    perform some statistics on these values
That means, for each run i (where i ranges from 1 to 100,000), read in the first i values of average-age, add them to a list and run a few tests (like Kolmogorov-Smirnov or chi-square)
In order to open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the above operations.
Is my method the best possible one (complexity-wise)?
Is there a better method?
Actually, it would be possible to hold 10,000,000 lines in memory.
Make a dictionary where the keys are the number of people and the values are lists of average ages, where each element of the list comes from a different file. Therefore, if there are 100 files, each of your lists will have 100 elements.
This way, you don't need to store the file objects in a dict.
Hope this helps
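A minimal sketch of that idea; the file names are hypothetical and each line is assumed to be '<number of people>\t<average age>':

from collections import defaultdict

filenames = ['locality_%d.txt' % i for i in range(1, 101)]  # hypothetical names

# population size -> list of average ages, one entry per file
ages_by_population = defaultdict(list)
for filename in filenames:
    with open(filename) as f:
        for line in f:
            population, average_age = line.split('\t')
            ages_by_population[int(population)].append(float(average_age))

# the values needed for run i are then ages_by_population[1] .. ages_by_population[i]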
Why not take a simple approach:
Open each file sequentially and read its lines to fill an in-memory data structure
Perform statistics on the in-memory data structure
Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO for convenience instead of actual files:
#!/usr/bin/env python
# coding: utf-8

from StringIO import StringIO

# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'
files = [f1, f2, f3]

# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []
for i, filename in enumerate(files):
    f = StringIO(filename)
    # f = open(filename, 'r')
    data.append(dict())
    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age

print data

# gather custom statistics on the data
# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum((data[loc][2] for loc in xrange(num_locations))) / num_locations
print 'Average age with population 2 is', pop2_avg, 'years old'
The output is:
[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14 years old
I ... don't know if I like this approach, but it's possible that it could work for you. It has the potential to consume large amounts of memory, but may do what you need it to. I make the assumption that your data files are numbered. If that's not the case this may need adaptation.
# open the files.
handles = [open('file-%d.txt' % i) for i in range(1, 101)]

# loop for the number of lines.
for line in range(100000):
    lines = [fh.readline() for fh in handles]
    # Some sort of processing for the list of lines.
That may get close to what you need, but again, I don't know that I like it. If you have any files that don't have the same number of lines this could run into trouble.
