Python - Shuffling a list with constraints - python

I've been working for a couple of months, on and off, on a script to shuffle a list in a textfile. I am a beginner in Python (the only language I sort of understand a bit), and after a while I have managed to come up with a few lines of code which do sort of what I need.
The input file is a tab-separated list. It has 5 words per row, but I'll use numbers here so the example is clearer:
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Now, after a few efforts and a huge amount of work from SO users, I've managed to shuffle these elements so that they don't appear in the same line as their original "partners". This is the code I'm using:
import csv,StringIO
import random
from random import shuffle
datalist = open('lista.txt', 'r')
leyendo = datalist.read()
separando = csv.reader(StringIO.StringIO(leyendo), delimiter = '\t')
macrolist = list(separando)
l = [group[:] for group in macrolist]
random.shuffle(l)
nicendone = []
prev_i = -1
while any(a for a in l):
    new_i = max(((i,a) for i,a in enumerate(l) if i != prev_i), key=lambda x: len(x[1]))[0]
    nicendone.append(l[new_i].pop(random.randint(0, len(l[new_i]) - 1)))
    prev_i = new_i
with open('randolista.txt', 'w') as newdoc:
    for i, m in enumerate(nicendone, 1):
        newdoc.write(m + [', ', '\n'][i % 5 == 0])
datalist.close()
This does the job, but what I actually need is a bit more complicated. I need to shuffle the list with the following restrictions:
The words in the first and second column should be shuffled ONLY within their own column.
The new randomised list should have no two elements appearing in the same line again.
What I'd like to get is something like the following:
01 17 25 19 13
16 22 13 03 20
etc
So that items in the first and second column are only shuffled within their own columns, and no two items are in the same row in the output that were in the same row in the input. I realise in a 5 row example this last constraint is constantly broken, but the real input file has 100 rows.
I really don't know how to even start doing this. My programming abilities are limited, but the problem is that I can't even come up with a pseudocode for it. How can I make Python identify the elements of the first two columns so that it only shuffles them vertically?
Thanks in advance

Shuffling the first two columns so that two values that used to be on the same row no longer appear on the same row can be accomplished by shifting each column down by a different random offset, wrapping around at the bottom. For example, you could push the first column 20 rows down and the second column 10 rows down, where 20 and 10 are distinct random integers less than the number of rows.
Sample code that shifts the first two columns this way:
from random import sample
text = \
"""a b c d e
f g h i j
k l m n o
p q r s t"""
# Translate file to matrix (list of lists)
matrix = map(lambda x: x.split(" "), text.split("\n"))
# Determine height and width of matrix
height = len(matrix)
width = len(matrix[0])
# Choose two (unique) offsets for shifting the first two columns
transpose_list = sample(xrange(0, height), 2)
# Now build a new matrix, shifting only the first two
# columns.
new_matrix = []
for y in range(0, height):
    row = []
    for x in range(0, 2):
        transpose = (y + transpose_list[x]) % height
        row.append(matrix[transpose][x])
    for x in range(2, width):
        row.append(matrix[y][x])
    new_matrix.append(row)
# And create a list again
new_text = "\n".join(map(lambda x: " ".join(x), new_matrix))
print new_text
This results in something like:
a l c d e
f q h i j
k b m n o
p g r s t
If I understand your post correctly, you already have an algorithm for randomizing the rest of the table?
I hope this helps :-).
Wout
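The offset idea above can be extended to the whole table. The following is only a sketch for Python 3, under some assumptions: the name rotate_shuffle and the rng parameter are made up for illustration, the table has at least as many rows as columns, and every item stays in its own column (which is stricter than the question asks for columns 3 to 5). It first permutes the row order and then shifts each column by a distinct offset, so two items from the same original row can never meet again.

```python
import random

def rotate_shuffle(rows, rng=random):
    """Hypothetical helper: permute the rows, then shift each column by a distinct offset."""
    height, width = len(rows), len(rows[0])
    if height < width:
        raise ValueError("need at least as many rows as columns")
    # Random permutation of the original row indices.
    order = list(range(height))
    rng.shuffle(order)
    # One distinct offset per column, so no two columns stay aligned.
    offsets = rng.sample(range(height), width)
    return [
        [rows[order[(y + offsets[x]) % height]][x] for x in range(width)]
        for y in range(height)
    ]

# Same 5x5 layout as in the question: "01" .. "25".
table = [["%02d" % (5 * r + c + 1) for c in range(5)] for r in range(5)]
for line in rotate_shuffle(table):
    print("\t".join(line))
```

Because the offsets are distinct and the row order is a permutation, the items in one output row always come from different original rows, even in the 5-row example.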

Related

Find combinations (without "of size=r") from a set with decreasing sum value using Python

(Revised for clarity 02-08-2021)
This is similar to the question here:
Find combinations of size r from a set with decreasing sum value
This is different from the answer posted in the link above because I am looking for answers without "size r=3".
I have a set (array) of numbers.
I need to have the sums of combinations of the numbers sorted from largest to smallest and show the numbers from the array that were used to get the total for that row.
Any number in the array can only be used once per row but all the numbers don't have to be used in each row as the total decreases.
If a number is not used then zero should be used as a placeholder instead so I can create a CSV file with the columns aligned.
Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]
Desired Output Example #1 format where the last number in each row is the total (sum) of the row:
Beginning of list
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
...(all number combinations in between)
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
...(all number combinations in between)
0,0,0,15,0,0,1,16
0,0,0,15,0,0,0,15
0,0,0,0,10,5,0,15
0,0,0,0,10,0,1,11
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
End of list
Also, duplicate totals are allowed and preferred showing different combinations that have the same total (sum) of the row.
For Example #1:
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
For example this is one row of output based on the Input Example #1 above:
30,25,0,0,0,5,1,61
Last number in the row is the total. The total can also be the first number but the important thing is that the output list is sorted in descending order by the total.
Input Example #2 with 5 numbers in the array: [20,15,10,5,1]
Desired Output Example #2 format where the last number in each row is the total (sum) of the row:
Beginning of list
20,15,10,5,1,51
20,15,10,5,0,50
20,15,10,0,1,46
20,15,10,0,0,45
...(all number combinations in between)
20,0,10,0,0,30
0,15,10,5,0,30
...(all number combinations in between)
0,15,0,0,1,16
0,15,0,0,0,15
0,0,10,5,0,15
0,0,10,0,1,11
0,0,10,0,0,10
0,0,0,5,1,6
0,0,0,5,0,5
0,0,0,0,1,1
End of list
Input Example #1: [30,25,20,15,10,5,1]
Every row of the output should show each number in the array used only once at most per row to get the total for the row.
The rows must be sorted in decreasing order by the sums of the numbers used to get the total.
The first output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106
The second output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105
The third output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101
...The rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1...
The third to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1
I started with the code provided by user Divyanshu modified with different input numbers and the () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
So this is what I need as an Input (in this example):
[30,25,20,15,10,5,1]
size=4 in the above code limits the output to 4 of the numbers in the array.
If I take out size=4 I get an error. I need to use the entire array of numbers.
I can manually change size=4 to size=1 and run it then size=2 then run it and so on.
Entering size=1 through size=7 in the code and running it (7 times in this example) to get a list of all possible combinations gives me 7 different outputs.
I could then manually put the lists together but that won't work for larger sets (arrays) of numbers.
Can I modify the code referenced above or do I need to use a different approach?
I think you could do it as follows.
Import the following:
import pandas as pd
import numpy as np
The beginning of the code is the same as in the question:
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
The following code should help to produce the final layout:
array_len = array.__len__()
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
30 25 20 15 10 5 1 total
0 30 25 20 15 0 0 0 90
1 30 25 20 0 10 0 0 85
2 30 25 0 15 10 0 0 80
3 30 25 20 0 0 5 0 80
4 30 25 20 0 0 0 1 76
Now, generalizing to all sizes:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
30 25 20 15 10 5 1 total
0 30 25 20 15 10 5 1 106
1 30 25 20 15 10 5 0 105
2 30 25 20 15 10 0 1 101
3 30 25 20 15 10 0 0 100
4 30 25 20 15 0 5 1 96
Thanks to #RafaelValero (Rafael Valero) I was able to learn about pandas, numpy, and dataframes. I looked up options for pandas to get the desired output.
Here is the final code with some extra lines left in for reference but commented out:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    # Commented out line below as it was giving extra information
    # print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
# Commented out two lines below as they were from the original code and giving extra information
#for key in order:
#    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0,array_len):
    # Commented out line below as it was giving extra information
    # print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        # Commented out line below as it was giving extra information
        # print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
# Update: removed this line below as I didn't need a header
# aux.columns=array + ['total']
aux = aux.astype(int)
# Tried option below first but it was not necessary when using to_csv
# pd.set_option('display.max_rows', None)
print(aux.to_csv(index=False,header=None))
Searched references:
Similar question:
Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html
Online compiler used:
https://www.programiz.com/python-programming/online-compiler/
Output using Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]:
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
30,25,20,15,0,5,1,96
30,25,20,15,0,5,0,95
30,25,20,0,10,5,1,91
30,25,20,15,0,0,1,91
30,25,20,0,10,5,0,90
30,25,20,15,0,0,0,90
30,25,0,15,10,5,1,86
30,25,20,0,10,0,1,86
30,25,0,15,10,5,0,85
30,25,20,0,10,0,0,85
30,0,20,15,10,5,1,81
30,25,0,15,10,0,1,81
30,25,20,0,0,5,1,81
30,0,20,15,10,5,0,80
30,25,0,15,10,0,0,80
30,25,20,0,0,5,0,80
0,25,20,15,10,5,1,76
30,0,20,15,10,0,1,76
30,25,0,15,0,5,1,76
30,25,20,0,0,0,1,76
0,25,20,15,10,5,0,75
30,0,20,15,10,0,0,75
30,25,0,15,0,5,0,75
30,25,20,0,0,0,0,75
0,25,20,15,10,0,1,71
30,0,20,15,0,5,1,71
30,25,0,0,10,5,1,71
30,25,0,15,0,0,1,71
0,25,20,15,10,0,0,70
30,0,20,15,0,5,0,70
30,25,0,0,10,5,0,70
30,25,0,15,0,0,0,70
0,25,20,15,0,5,1,66
30,0,20,0,10,5,1,66
30,0,20,15,0,0,1,66
30,25,0,0,10,0,1,66
0,25,20,15,0,5,0,65
30,0,20,0,10,5,0,65
30,0,20,15,0,0,0,65
30,25,0,0,10,0,0,65
0,25,20,0,10,5,1,61
30,0,0,15,10,5,1,61
0,25,20,15,0,0,1,61
30,0,20,0,10,0,1,61
30,25,0,0,0,5,1,61
0,25,20,0,10,5,0,60
30,0,0,15,10,5,0,60
0,25,20,15,0,0,0,60
30,0,20,0,10,0,0,60
30,25,0,0,0,5,0,60
0,25,0,15,10,5,1,56
0,25,20,0,10,0,1,56
30,0,0,15,10,0,1,56
30,0,20,0,0,5,1,56
30,25,0,0,0,0,1,56
0,25,0,15,10,5,0,55
0,25,20,0,10,0,0,55
30,0,0,15,10,0,0,55
30,0,20,0,0,5,0,55
30,25,0,0,0,0,0,55
0,0,20,15,10,5,1,51
0,25,0,15,10,0,1,51
0,25,20,0,0,5,1,51
30,0,0,15,0,5,1,51
30,0,20,0,0,0,1,51
0,0,20,15,10,5,0,50
0,25,0,15,10,0,0,50
0,25,20,0,0,5,0,50
30,0,0,15,0,5,0,50
30,0,20,0,0,0,0,50
0,0,20,15,10,0,1,46
0,25,0,15,0,5,1,46
30,0,0,0,10,5,1,46
0,25,20,0,0,0,1,46
30,0,0,15,0,0,1,46
0,0,20,15,10,0,0,45
0,25,0,15,0,5,0,45
30,0,0,0,10,5,0,45
0,25,20,0,0,0,0,45
30,0,0,15,0,0,0,45
0,0,20,15,0,5,1,41
0,25,0,0,10,5,1,41
0,25,0,15,0,0,1,41
30,0,0,0,10,0,1,41
0,0,20,15,0,5,0,40
0,25,0,0,10,5,0,40
0,25,0,15,0,0,0,40
30,0,0,0,10,0,0,40
0,0,20,0,10,5,1,36
0,0,20,15,0,0,1,36
0,25,0,0,10,0,1,36
30,0,0,0,0,5,1,36
0,0,20,0,10,5,0,35
0,0,20,15,0,0,0,35
0,25,0,0,10,0,0,35
30,0,0,0,0,5,0,35
0,0,0,15,10,5,1,31
0,0,20,0,10,0,1,31
0,25,0,0,0,5,1,31
30,0,0,0,0,0,1,31
0,0,0,15,10,5,0,30
0,0,20,0,10,0,0,30
0,25,0,0,0,5,0,30
30,0,0,0,0,0,0,30
0,0,0,15,10,0,1,26
0,0,20,0,0,5,1,26
0,25,0,0,0,0,1,26
0,0,0,15,10,0,0,25
0,0,20,0,0,5,0,25
0,25,0,0,0,0,0,25
0,0,0,15,0,5,1,21
0,0,20,0,0,0,1,21
0,0,0,15,0,5,0,20
0,0,20,0,0,0,0,20
0,0,0,0,10,5,1,16
0,0,0,15,0,0,1,16
0,0,0,0,10,5,0,15
0,0,0,15,0,0,0,15
0,0,0,0,10,0,1,11
0,0,0,0,10,0,0,10
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
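For comparison, the same list can be produced with the standard library only. This is a minimal sketch, not the code from the answers above: the function name subset_rows is made up, and rows with equal totals may come out in a different order than in the pandas version. It walks over every keep/drop pattern with itertools.product, writes 0 for dropped numbers, and sorts by the total.

```python
from itertools import product

def subset_rows(values):
    """Hypothetical helper: one row per non-empty subset, zeros as placeholders, total last."""
    rows = []
    # Each mask holds one keep (1) or drop (0) flag per input number.
    for mask in product((1, 0), repeat=len(values)):
        if not any(mask):
            continue  # skip the empty subset
        row = [v if keep else 0 for v, keep in zip(values, mask)]
        rows.append(row + [sum(row)])
    # Sort by the total in the last column, largest first.
    rows.sort(key=lambda r: r[-1], reverse=True)
    return rows

for row in subset_rows([20, 15, 10, 5, 1]):
    print(",".join(map(str, row)))
```

Since the placement is done by position rather than by value, this variant also copes with repeated numbers in the input, which a value-to-index dict cannot distinguish.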

CSV Python list

Name Gender Physics Maths
A 45 55
X 22 64
C 0 86
I have a CSV file like this. I have made some modifications to get a list containing only the marks, in the form [[45,55],[22,64]].
I want to find the minimum for each subject.
But when I run my code, I only get the minimum for the first subject, and the other values are copied from the row.
The answer I want - [0,55]
The answer I get - [0,86]
def find_min(marks,cols,rows):
    minimum = []
    temp = []
    for list in marks:
        min1 = min([x for x in list])
        minimum.append(min1)
    # for j in range(rows):
    #     for i in range(cols):
    #         temp.append(marks)
    #         x = min(temp)
    #         minimum.append(x)
    return minimum
How do I modify my code?
I can't use any other modules/libraries like csv or pandas.
I tried using zip(*marks), but that just prints my marks list as is.
Is there any way to separate the inner lists from the larger list?
This will calculate the minimum per subject:
In [707]: marks = [[45,55],[22,64]]
In [697]: [min(idx) for idx in zip(*marks)]
Out[697]: [22, 55]
Try transposing the marks array (which is one student per row) so each list entry corresponds to a column ("subject") from your CSV:
def find_min(marks):
    mt = zip(*marks)
    mins = [min(row) for row in mt]
    return mins
example usage:
marks = [[45,55],[22,64],[0,86]]
print(find_min(marks))
which prints:
[0, 55]
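To tie this back to the file itself, here is a rough sketch that goes from raw text to the per-subject minimums without csv or pandas. It rests on assumptions that the question does not confirm: the file is comma-separated, the Gender column is empty, and the marks start at the third column. The name column_minimums and the first_mark_col parameter are made up for illustration.

```python
def column_minimums(text, first_mark_col=2):
    """Hypothetical helper: parse comma-separated text and return the minimum of each mark column."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Skip the header line and keep only the mark columns, converted to int.
    marks = [
        [int(cell) for cell in ln.split(",")[first_mark_col:]]
        for ln in lines[1:]
    ]
    # zip(*marks) turns rows (students) into columns (subjects).
    return [min(col) for col in zip(*marks)]

# Assumed file content; in practice this would come from open(...).read().
sample = "Name,Gender,Physics,Maths\nA,,45,55\nX,,22,64\nC,,0,86\n"
print(column_minimums(sample))  # [0, 55]
```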

Memory efficient way to read an array of integers from single line of input in python2.7

I want to read a single line of input containing integers separated by spaces.
Currently I use the following.
A = map(int, raw_input().split())
But now the N is around 10^5 and I don't need the whole array of integers, I just need to read them 1 at a time, in the same sequence as the input.
Can you suggest an efficient way to do this in Python2.7
Use generators:
numbers = '1 2 5 18 10 12 16 17 22 50'
gen = (int(x) for x in numbers.split())
for g in gen:
    print g
1
2
5
18
10
12
16
17
22
50
The generator converts one item at a time and never builds a list of integers. Note that numbers.split() still creates the full list of substrings first.
You could parse the data a character at a time, this would reduce memory usage:
data = "1 50 30 1000 20 4 1 2"
number = []
numbers = []
for c in data:
    if c == ' ':
        if number:
            numbers.append(int(''.join(number)))
            number = []
    else:
        number.append(c)
if number:
    numbers.append(int(''.join(number)))
print numbers
Giving you:
[1, 50, 30, 1000, 20, 4, 1, 2]
Probably quite a bit slower though.
Alternatively, you could use itertools.groupby() to read groups of digits as follows:
from itertools import groupby
data = "1 50 30 1000 20 4 1 2"
numbers = []
for k, g in groupby(data, lambda c: c.isdigit()):
    if k:
        numbers.append(int(''.join(g)))
print numbers
If you're able to destroy the original string, split accepts a parameter for the maximum number of breaks.
See docs for more details and examples.
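Another option, sketched here under the assumption that the line only contains optionally signed integers separated by whitespace, is to scan the string lazily with re.finditer. The function name iter_ints is made up. Only one match object exists at a time, so neither a list of substrings nor a list of integers is built; the input line itself still has to fit in memory.

```python
import re

def iter_ints(line):
    """Hypothetical helper: yield the integers in a string one at a time."""
    # finditer scans lazily, so no list of tokens is ever created.
    for match in re.finditer(r"-?\d+", line):
        yield int(match.group())

total = 0
for value in iter_ints("1 2 5 18 10 12 16 17 22 50"):
    total += value  # each value is consumed in input order
print(total)
```

The same function works on the result of raw_input() in Python 2.7 or input() in Python 3.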

Query Board challenge on Python, need some pointers

So, I have this challenge on CodeEval, but I don't know where to start, so I need some pointers (and answers if you can) to help me figure it out.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The next operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells on row "i" have been changed to value "x" after this operation.
SetCol j x: it means that all values in the cells on column "j" have been changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great on Python, but this challenge is pretty interesting, although I didn't have any clues on how to solve it. So, I need some help from you guys.
Thanks!
You could use a sparse matrix for this; addressed by (col, row) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x
def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))
def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]
def set_row(i, x):
    global board
    board[i] = [x]*WIDTH
... implement each function, then parse each line of input to decide which function to call,
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
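For reference, here is one way the dense list-of-lists outline above could be filled in. It is only a sketch: the function name run and the way results are collected in a list are my own choices, not part of the answer. It is checked against the sample input from the question.

```python
WIDTH, HEIGHT = 256, 256

def run(lines):
    """Hypothetical driver: apply each operation to a dense board and collect query results."""
    board = [[0] * WIDTH for _ in range(HEIGHT)]
    results = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        op, args = parts[0], [int(p) for p in parts[1:]]
        if op == "SetRow":
            i, x = args
            board[i] = [x] * WIDTH
        elif op == "SetCol":
            j, x = args
            for row in board:
                row[j] = x
        elif op == "QueryRow":
            results.append(sum(board[args[0]]))
        elif op == "QueryCol":
            results.append(sum(row[args[0]] for row in board))
        else:
            raise ValueError("unknown operation: %s" % op)
    return results

SAMPLE = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2"""

print(run(SAMPLE.splitlines()))  # [5118, 34, 1792, 3571]
```

For the actual challenge the lines would come from the file named in sys.argv[1] instead of the SAMPLE string.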
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For sake of comparison, how about
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
# One (timestamp, value) pair per row and per column
rows = [(time, INIT) for _ in range(HEIGHT)]
cols = [(time, INIT) for _ in range(WIDTH)]
def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))
def set_col(i, x):
    global time
    time += 1
    cols[int(i)] = (time, int(x))
def query_row(i):
    rt, rv = rows[int(i)]
    # Columns set after this row override the row's value in one cell each
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)
def query_col(j):
    ct, cv = cols[int(j)]
    # Rows set after this column override the column's value in one cell each
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)
ops = {
"SetRow": set_row,
"SetCol": set_col,
"QueryRow": query_row,
"QueryCol": query_col
}
inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()
for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit2:
using your code from above for matrix, set_row, set_col,
import sys
for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed:)
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
... basically the allocated memory quadruples at each step. If I change the memory measure to include key-tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a big jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k

How to randomly sample from 4 csv files so that no more than 2/3 rows appear in order from each csv file, in Python

Hi I'm very new to python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each csv file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each csv into 4 new files each containing 10 rows so that I have each in a separate folder(1-4). So I now have 4 folders each containing 4 csv files. Now I need to combine these so that rows that came from the original CSV file don't repeat more than 2 or 3 times and the row order will be as random as possible. This is where I'm completely lost, I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect but I'm not sure how to proceed or am I going about this in the completely wrong way. Any help anyone can give me would be greatly appreciated and I can provide any further details that are necessary.
var_start = 1
total_condition_amount_start = 1
while (var_start < 5):
    with open("condition"+`var_start`+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
    random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond"+`var_start`+".csv", "wb") as temp_output:
        temp_output.write("".join(random_selection1))
    var_start = var_start + 1
while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+`total_condition_amount_start`+".csv", 'rb'))
    shutil.move("./temp/output_1.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the CSV built in lib: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
from random import randint, sample, choice
def create_random_list(length):
    return [randint(0, 100) for i in range(length)]
# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]
# Take a randomized sample from the lists
lists = map(lambda x: sample(x, 40), lists)
# Add some variables to the
lists = map(lambda x: {'data': x, 'full_count': 0}, lists)
final = [[] for i in range(4)]
for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row or is already used 10 times. Append that item to the final list
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx/3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
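A simpler alternative, given only as a sketch for Python 3, is to shuffle the source labels of each block and retry until no source appears more than 3 times in a row. This swaps the bookkeeping above for plain rejection sampling; the names longest_run and build_blocks and all parameters are made up, and runs that span the boundary between two blocks are not checked. With 4 sources of 10 rows per block, a valid order is usually found within a few shuffles.

```python
import random

def longest_run(labels):
    """Length of the longest stretch of identical consecutive labels."""
    longest = run = 0
    prev = None
    for label in labels:
        run = run + 1 if label == prev else 1
        longest = max(longest, run)
        prev = label
    return longest

def build_blocks(sources, n_blocks=4, per_source=10, run_limit=3, rng=random):
    """Hypothetical helper: return n_blocks lists of (source_index, row) pairs."""
    # Draw all rows up front so no row is used twice across blocks.
    drawn = [rng.sample(rows, n_blocks * per_source) for rows in sources]
    blocks = []
    for _ in range(n_blocks):
        labels = [s for s in range(len(sources)) for _ in range(per_source)]
        # Reshuffle until no source repeats more than run_limit times in a row.
        while True:
            rng.shuffle(labels)
            if longest_run(labels) <= run_limit:
                break
        blocks.append([(s, drawn[s].pop()) for s in labels])
    return blocks

# Stand-in data with the row counts from the question (72, 72, 60, 60).
sources = [["csv%d_row%d" % (s + 1, r) for r in range(n)]
           for s, n in enumerate([72, 72, 60, 60])]
blocks = build_blocks(sources)
print([s for s, _ in blocks[0]])
```

Setting run_limit=2 enforces the stricter variant of the rule; it just needs more reshuffles on average.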
