I'm trying to run a very large permutation using Python. The goal is to combine items into groups of four or fewer, separated by 1) periods, 2) dashes, and 3) no separator at all. The order is important.
# input
food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]
# ideal output would be a list that contains strings like the following:
apple-banana-bread (no dashes before or after!)
apple.banana.bread (using periods)
applebananabread (no spaces)
apple-banana (by combining with the first item in the list, I also get shorter groups but need to delete empty items before joining)
... for all the possible groups of 4, order is important
# Requirements:
# Avoiding a symbol at the beginning or end of a resulting string
# Also creating groups of length 1, 2, and 3
I've used itertools.permutations to create an itertools.chain (perms). But then, this fails with a MemoryError when removing empty elements after converting to a list. Even when using a machine with a large amount of RAM.
import itertools

food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt', ...]
perms_ = itertools.permutations(food, 4)
perms = [list(filter(None, tup)) for tup in perms_]  # remove empty nested elements, to prevent two symbols in a row or a symbol before/after
perms = list(filter(None, perms))  # remove empty lists, to prevent two symbols in a row or a symbol before/after
names = (
    ['-'.join(group) for group in perms] +  # join using dashes
    ['.'.join(group) for group in perms] +  # join using periods
    [''.join(group) for group in perms]     # join without separators
)
names = list(set(names))  # remove all duplicates
How can I make this code more memory efficient so that it doesn't crash for a large list? If I need to, I can run the code separately for each item separator (dashes, periods, directly joined).
I'm not too sure what you would do with a saved list of 6 billion things, but I think you have two strategies if you want to go forward.
First, you could reduce the size of the things in the list by substituting something like a numpy uint8 for each item, which would reduce the size of the resulting list by a LOT, but you would not have the format you want.
In [15]: import sys
In [16]: import numpy as np
In [17]: list_of_strings = ['dog food'] * 1000000
In [18]: list_of_uint8s = np.ones(1000000, dtype=np.uint8)
In [19]: sys.getsizeof(list_of_strings)
Out[19]: 8000056
In [20]: sys.getsizeof(list_of_uint8s)
Out[20]: 1000096
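A rough sketch of how that index-based idea might look in practice (the names here are illustrative, not from the answer): keep only small uint8 indices into food in memory and render the joined strings on demand.
import numpy as np
from itertools import permutations

food = ['', 'apple', 'banana', 'bread', 'tomato', 'yogurt']  # hypothetical input
# one row of uint8 indices per permutation; works as long as len(food) <= 256
index_perms = np.array(list(permutations(range(len(food)), 4)), dtype=np.uint8)

def render(row, sep='-'):
    return sep.join(food[i] for i in row if food[i])  # skip the empty placeholder

print(render(index_perms[0]))  # apple-banana-bread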
Second, if you just want to "save" the items to some kind of massive file, you do NOT need to realize the list in memory. Just use itertools.permutations and write the objects to the file on-the-fly. No need to create the list in memory if you just want to push it to a file...
In [48]: from itertools import permutations
In [49]: stuff = ['dog', 'cat', 'mouse']
In [50]: perms = permutations(stuff, 2)
In [51]: with open('output.csv', 'w') as tgt:
...: for p in perms:
...: line = '-'.join(p)
...: tgt.write(line)
...: tgt.write('\n')
...:
In [52]: %more output.csv
dog-cat
dog-mouse
cat-dog
cat-mouse
mouse-dog
mouse-cat
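If the shorter groups (lengths 1 to 3) are also needed, a variation that skips the empty placeholder item entirely (a sketch, not the OP's exact code) is to vary the permutation length and stream every joined name straight to disk:
from itertools import permutations

food = ['apple', 'banana', 'bread', 'tomato', 'yogurt']  # hypothetical input, no '' placeholder

with open('names.txt', 'w') as tgt:
    for r in range(1, 5):                  # groups of length 1, 2, 3 and 4
        for group in permutations(food, r):
            for sep in ('-', '.', ''):     # the three separators
                tgt.write(sep.join(group))
                tgt.write('\n')
Nothing larger than a single line is ever held in memory; if duplicates are still a concern (the original code de-duplicated with set), they can be removed afterwards with an external tool such as sort -u on the output file rather than in RAM.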
My list of lists looks like this:
list_of_lists = [['England', '90.0%'], ['Scotland', '10.0%']]
I would like to have this output:
England, 90.0%
Scotland, 10.0%
I have tried unpacking the list of lists and printing using the following:
a,b = list_of_lists
print(a,'\n',b)
but I would like to print them dynamically based on the length of the list_of_lists. So if my list_of_lists has len(x) elements, then I want to print(x[0],'\n',...,x[i])
You can join the elements per row using ', ' as separator, then join the rows with a newline and print:
list_of_lists = [['England', '90.0%'], ['Scotland', '10.0%']]
print('\n'.join(map(', '.join, list_of_lists)))
output:
England, 90.0%
Scotland, 10.0%
Simple for loop?
for row in list_of_lists:
    print(', '.join(row))
You could use unpacking of a generator that formats each entry and use a new line as the print separator:
print(*(f'{c}, {p}' for c,p in list_of_lists),sep='\n')
This supports different data types and gives you full control over the order, spacing, alignment and delimiters
If your data is only made up of strings that you want to separate with commas, you can use join instead of a format string:
print(*map(', '.join,list_of_lists),sep='\n') # only works with string data
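As a small illustration of the formatting control mentioned above (the column widths here are arbitrary), the f-string version can also align the columns:
print(*(f'{c:<10}{p:>8}' for c, p in list_of_lists), sep='\n')
This left-aligns the country names in a 10-character column and right-aligns the percentages in an 8-character one.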
Please help, I am using extend to add multiple values to a list.
I need each extend call to go into the list as a new line.
>>> list1 = []
>>> list1 = (['Para','op','qa', 'reason'])
>>> list1.extend(['Power','pass','ok', 'NA'])
>>> print list1
['Para', 'op', 'qa', 'reason', 'Power', 'pass', 'ok', 'NA']
I need to provide this list to csv and It has to print like two lines.
Para, op, qa, reason
Power, pass, ok, NA
If you wanted separate lists, make them separate. Don't use list.extend(), use appending:
list1 = [['Para','op','qa', 'reason']] # brackets, creating a list with a list
list1.append(['Power','pass','ok', 'NA'])
Now list1 is a list with two objects, each itself a list:
>>> list1
[['Para', 'op', 'qa', 'reason'], ['Power', 'pass', 'ok', 'NA']]
If you are using the csv module to write out your CSV file, use the csvwriter.writerows() method to write each row into a separate line:
>>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout)
>>> writer.writerows(list1)
Para,op,qa,reason
Power,pass,ok,NA
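If the target is an actual file rather than sys.stdout (a sketch, assuming Python 3, where newline='' is the recommended way to open a file for the csv module):
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(list1)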
Your desired result, list1, should be a list of two elements, each of which is itself a list.
list1 = ['Para','op','qa', 'reason']
# wrapping list1 with [] creates a new list whose first element is the original list1.
# In your case, this action gives a list of lines with only one single line.
# Only after that can I add a new list of lines that contains another single line.
list1 = [list1] + [['Power','pass','ok', 'NA']]
print (list1)
How can I best convert a list of lists in Python to a (for example) comma-delimited-value, newline-delimited-row, string? Ideally, I'd be able to do something like this:
>import csv
>matrix = [["Frodo","Baggins","Hole-in-the-Ground, Shire"],["Sauron", "I forget", "Mordor"]]
> csv_string = csv.generate_string(matrix)
>print(csv_string)
Frodo,Baggins,"Hole-in-the-Ground, Shire"
Sauron,I forget,Mordor
I know that Python has a csv module, as seen in SO questions like this but all of its functions seem to operate on a File object. The lists are small enough that using a file is overkill.
I'm familiar with the join function, and there are plenty of SO answers about it. But this doesn't handle values that contain a comma, nor does it handle multiple rows unless I nest a join within another join.
Combine the csv-module with StringIO:
import io, csv
result = io.StringIO()
writer = csv.writer(result)
writer.writerow([5,6,7])
print(result.getvalue())
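Applied to the matrix from the question, the same combination gives the desired string in one go (a sketch; lineterminator='\n' is passed only to avoid the writer's default '\r\n' row endings):
import csv, io

matrix = [["Frodo", "Baggins", "Hole-in-the-Ground, Shire"],
          ["Sauron", "I forget", "Mordor"]]
result = io.StringIO()
csv.writer(result, lineterminator='\n').writerows(matrix)
csv_string = result.getvalue()
print(csv_string)
# Frodo,Baggins,"Hole-in-the-Ground, Shire"
# Sauron,I forget,Mordor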
The approach in the question you link to as a reference for join, together with a nested join (what's wrong with that?), works as long as you can convert all of the objects contained in your list of lists to a string:
list_of_lists = [[1, 'a'], [2, 3, 'b'], ['c', 'd']]
joined = '\n'.join(','.join(map(str, row)) for row in list_of_lists)
print(joined)
Output:
1,a
2,3,b
c,d
EDIT:
If the string representation of your objects may contain commas, here are a couple of things you could do to achieve an output that can recover the original list of lists:
escape those commas, or
wrap said string representations in some flavor of quotes (then you have to escape the occurrences of that character inside your values). This is precisely what the combination of io.StringIO and csv does (see Daniel's answer).
To achieve the first, you could do
import re
def escape_commas(obj):
    return re.sub(',', r'\,', str(obj))

joined = '\n'.join(','.join(map(escape_commas, row)) for row in list_of_lists)
For the second,
import re
def wrap_in_quotes(obj):
    return '"' + re.sub('"', r'\"', str(obj)) + '"'

joined = '\n'.join(','.join(map(wrap_in_quotes, row)) for row in list_of_lists)
So I'm making a program that reads a text file, and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bits that look like "A.41,52" are numbered positions in the sequence that I need to save for later use, and everything after that is an amino acid sequence.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
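For the header line from the question, that two-step split could look like this (a short sketch; the '>' is assumed to have been stripped already):
header = '1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
title, rest = header.split(':')
positions = [p.strip() for p in rest.split(';')]
print(title)      # 1EK9
print(positions)  # ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']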
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read().replace('\n', '')  # join the sequence lines into one string

title, components = header_line.strip().lstrip('>').split(':')
pairs = components.split(';')
for pair in pairs:
    pair = pair.strip()
    start, end = pair[2:].split(',')
    sequence_pairs[pair[:1]] = sequence[int(start):int(end) + 1]

for name, data in sequence_pairs.items():
    print('{} - {}'.format(name, data))
The other answer may well tackle the assumed problem in its entirety, but the OP asked for pointers or an example of the typical split-join transform, which is often all that is needed, so here are some ideas and working code to show this (based on the example from the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build list

# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely used paradigm in Python programming (and in programming with powerful data types in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or search SO for excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as '\n' is considered whitespace just like ' ' (a string with a space only) or a tab '\t'. Sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Update:
As requested, here is a further analysis of the "coordinate part" of the headerline from the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have the complete line (as above) in the headerline string:
title, coordinates = headerline.split(':')
# so now title == '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you can always reinforce what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/env python
from __future__ import print_function
het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this as an explicit nested for loop, but the trick is to read it from the right:
for all elements of het_seq,
split on the dot and store the left part in a and the right part in bc,
then further split bc into a sequence of k's, convert each to an integer and put them into a tuple (an ordered pair of integer coordinates),
arriving back on the left, build a pair of a (the dimension, like 'A') and the coordinate tuple from the previous step.
In the end, the dict() function constructs a dictionary from this iterable of (key, value) pairs, which gives {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
you will keep those two coordinates as they are, not extend or append to the pair;
in Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no convoluted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the somewhat obnoxious "one-liner" above (which had already been written across three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the entries such as A.41,52 in a list for easy access. The code is as follows.
import re
def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
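For the sample header from the question, the helper would return the list of position entries:
print(processHeader('>1EK9:A.41,52; B.61,74; C.247,257; D.279,289'))
# ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']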
I am not sure I completely understand the goal (and I think this post is more suitable for a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join the elements of the resulting list just by using + similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual slicing syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you may want to use it to extract the first line, similar to the way spaces are used above to extract the "words".
UPDATE (starting from result):
# getting A indexes
letter_seq = result[4:]   # all the sequence chunks (result[4] is the first one)
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter seq
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line is just a combination of several operations that were also used earlier.
Same for B, C, D, etc. -- so a lot of manual work and calculations...
BE CAREFUL with the indexes of A -- numbering in Python starts from 0, which may not be the case in your numbering system.
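To avoid repeating those manual steps by hand for B, C and D, the same extraction can be wrapped in a loop (a sketch building on the result and long_letter_seq names above; the slicing convention matches the snippet and the 0-based indexing caveat still applies):
subsequences = {}
for entry in result[:4]:
    entry = entry.split(':')[-1].rstrip(';')  # drop the '1EK9:' prefix and any trailing ';'
    letter, coords = entry.split('.')
    start, end = (int(n) for n in coords.split(','))
    subsequences[letter] = long_letter_seq[start:end]
print(subsequences)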
The more elegant solution would be to use re (https://docs.python.org/2/library/re.html) to find the pattern using a mask, but this requires very well-defined rules for how to look up the sequence needed.
UPDATE 2: it is also not clear to me what the role of the spaces is -- so far I removed them, but they may matter when counting the letters in the original string.
So I am given a list and I am supposed to sort it down into two lists, one with the names of the companies and one with the prices in a nested list.
['Acer 481242.74\n', 'Beko 966071.86\n', 'Cemex 187242.16\n', 'Datsun 748502.91\n', 'Equifax 146517.59\n', 'Gerdau 898579.89\n', 'Haribo 265333.85\n']
I used the following code to separate the names properly:
print('\n'.join(data))
namelist = [i.split(' ', 1)[0] for i in data]
print(namelist)
But now it wants me to separate all the prices from the list and put them in a single nested list together, and I don't know how to do that.
To build two separate lists, just use a regular loop:
names = []
prices = []
for entry in data:
name, price = entry.split()
names.append(name)
prices.append(price)
If you needed the entries together in one list, each entry a list containing the name and the price separately, just split in a list comprehension like you did, but don't pick one or the other value from the result:
names_and_prices = [entry.split() for entry in data]
I used str.split() without arguments to split on arbitrary whitespace. This assumes you always have exactly two entries in your strings. You can still limit the split, but then use None as the first argument, and strip the line beforehand to get rid of the \n separately:
names_and_prices = [entry.strip().split(None, 1) for entry in data]
Demo for the 'nested' approach:
>>> data = ['Acer 481242.74\n', 'Beko 966071.86\n', 'Cemex 187242.16\n', 'Datsun 748502.91\n', 'Equifax 146517.59\n', 'Gerdau 898579.89\n', 'Haribo 265333.85\n']
>>> [entry.split() for entry in data]
[['Acer', '481242.74'], ['Beko', '966071.86'], ['Cemex', '187242.16'], ['Datsun', '748502.91'], ['Equifax', '146517.59'], ['Gerdau', '898579.89'], ['Haribo', '265333.85']]
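If the prices are needed as numbers later on (an assumption; the question keeps them as strings), the conversion can happen during the same split:
names_and_prices = [[name, float(price)] for name, price in (entry.split() for entry in data)]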
split() is the right approach, as it will give you everything you need if you don't limit it to just one split (the , 1) in your code). If you provide no arguments to it at all, it'll split on any size of whitespace.
>>> data = ['Acer 481242.74\n', 'Beko 966071.86\n', 'Cemex 187242.16\n', 'Datsun 748502.91\n', 'Equifax 146517.59\n', 'Gerdau 898579.89\n', 'Haribo 265333.85\n']
>>> nested_list = [i.split() for i in data]
>>> nested_list
[['Acer', '481242.74'], ['Beko', '966071.86'], ['Cemex', '187242.16'], ['Datsun', '748502.91'], ['Equifax', '146517.59'], ['Gerdau', '898579.89'], ['Haribo', '265333.85']]
>>> print(*nested_list, sep='\n')
['Acer', '481242.74']
['Beko', '966071.86']
['Cemex', '187242.16']
['Datsun', '748502.91']
['Equifax', '146517.59']
['Gerdau', '898579.89']
['Haribo', '265333.85']
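Since the question also mentions ending up with two separate lists, those can be obtained from the nested list with zip (a small addition, not part of the answer above):
>>> names, prices = (list(t) for t in zip(*nested_list))
>>> names
['Acer', 'Beko', 'Cemex', 'Datsun', 'Equifax', 'Gerdau', 'Haribo']
>>> prices
['481242.74', '966071.86', '187242.16', '748502.91', '146517.59', '898579.89', '265333.85']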