I'm trying to build a randomized dataset based on an input dataset.
The input dataset consists of 856471 lines, and in each line there is a pair of values separated by a tab.
NO entry from the randomized dataset can be equal to any of those in the input dataset; this means:
If the pair in line 1 is "Protein1 Protein2", the randomized dataset cannot contain the following pairs:
"Protein1 Protein2"
"Protein2 Protein1"
In order to achieve this I tried the following:
data = infile.readlines()
ltotal = len(data)

for line in data:
    words = string.split(line)
    init = 0
    while init != ltotal:
        p1 = random.choice(words)
        p2 = random.choice(words)
        words.remove(p1)
        words.remove(p2)
        if "%s\t%s\n" % (p1, p2) not in data and "%s\t%s\n" % (p2, p1) not in data:
            outfile.write("%s\t%s\n" % (p1, p2))
However, I'm getting the following error:
Traceback (most recent call last):
  File "C:\Users\eduarte\Desktop\negcreator.py", line 46, in <module>
    convert(indir, outdir)
  File "C:\Users\eduarte\Desktop\negcreator.py", line 27, in convert
    p1 = random.choice(words)
  File "C:\Python27\lib\random.py", line 274, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range
I was pretty sure this would work. What am I doing wrong?
Thanks in advance.
The variable words is overwritten for each line in the loop
for line in data:
    words = string.split(line)
This is most probably not what you want.
Moreover, your while loop is effectively infinite (init never changes inside it), so it keeps consuming words until none are left for random.choice().
Edit: My guess is that you have a file of tab-separated word pairs, a pair in each line, and you are trying to form random pairs from all of the words, writing only those random pairs to the output file that do not occur in the original file. Here is some code doing this:
import itertools
import random

with open("infile") as infile:
    pairs = set(frozenset(line.split()) for line in infile)

words = list(itertools.chain.from_iterable(pairs))
random.shuffle(words)

with open("outfile", "w") as outfile:
    for pair in itertools.izip(*[iter(words)] * 2):
        if frozenset(pair) not in pairs:
            outfile.write("%s\t%s\n" % pair)
Notes:
A pair of words is represented by a frozenset, since order does not matter.
I use a set for all the pairs to be able to test if a pair is in the set in constant time.
Instead of using random.choice() repeatedly, I only shuffle the whole list once, and then iterate over it in pairs. This way, we don't need to remove the already used words from the list, so it's much more efficient. (This change and the previous one bring down the algorithmic complexity of the approach from O(n²) to O(n).)
The expression itertools.izip(*[iter(words)] * 2) is a common Python idiom to iterate over words in pairs, in case you have not encountered it yet; a short demonstration follows these notes.
The code is still untested.
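To make that last idiom concrete, here is a small, self-contained demonstration with a made-up word list (the data is purely illustrative):

import itertools

words = ['a', 'b', 'c', 'd', 'e', 'f']
# [iter(words)] * 2 is two references to the *same* iterator, so izip
# pulls alternating elements from it, yielding consecutive pairs
for pair in itertools.izip(*[iter(words)] * 2):
    print pair
# ('a', 'b')
# ('c', 'd')
# ('e', 'f')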
I am making something like a random sentence generator. I want to make a random sentence from words taken randomly from 10 .csv files, which change size frequently, so I have to count their length before I select a random line. I have already made it, but I'm using way too much code... it currently does something like this:
def file_len(fname):
    f = open(fname)
    try:
        for i, l in enumerate(f):
            pass
    finally:
        f.close()
    return i + 1
then selecting random lines...
for all of them I do this:
file1_size = file_len('items/file1.csv')
file1_line = randrange(file1_size) + 1
file1_output = linecache.getline('items/file1.csv', file1_line)
and when it is done for all of them, I just print the output...
print file1_out + file2_out + file3_out ...
Also, sometimes I only want to use some files and not others, so I'd just print the ones I want... e.g. if I just want files number 3, 4 and 7 then I do:
print file3_out + file4_out + file7_out
Obviously there are 30 lines of code for the line counting, random selection and variable assignment - 3 lines of code for each file. But things are getting more complex, and I thought a dictionary variable might be able to do what I want more quickly and with less code.
I thought it would be good to generate a variable whereby we end up with
random_lines = {'file1.csv': 24, 'file2.csv': 13, 'file3.csv': 3, 'file4.csv': 22, 'file5.csv': 8, 'file6.csv': 97, 'file7.csv': 34, 'file8.csv': 67, 'file9.csv': 31, 'file10.csv': 86}
(The key is the filename and the integer is a random line within the file, re-assigned each time the code is run)
Then, some kind of process that picks the required items (let's say sometimes we only want to use lines from files 1, 6, 8, and 10) and outputs the random line
output = file1.csv random line + file6.csv random line + file8.csv random line + file10.csv random line
then
print output
If anyone can see the obvious quick way to do this (I don't think it's rocket science but I am a beginner at python!) then I'd be very grateful!
Any time you find yourself reusing the same code over and over in an object-oriented language, use a function instead.
def GetOutput(file_name):
    file_size = file_len(file_name)
    file_line = randrange(file_size) + 1
    return linecache.getline(file_name, file_line)
file_1 = GetOutput(file_1)
file_2 = GetOutput(file_2)
...
You can further simplify by storing everything in a dict as you suggest in your original question.
input_files = ['file_1.txt', 'file_2.txt', ...]
outputs = {}
for file_name in input_files:
    outputs[file_name] = GetOutput(file_name)
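Selecting only some of the files, as described in the question, is then just a matter of indexing into that dict. A small sketch (the chosen file names are illustrative and assumed to be in input_files):

wanted = ['file_3.txt', 'file_4.txt', 'file_7.txt']
# each value already ends with a newline, so joining them stacks the lines
print ''.join(outputs[name] for name in wanted)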
I am brand new to Python and looking up examples for what I want to do. I am not sure what is wrong with this loop, what I would like to do is read a csv file line by line and for each line:
Split by comma
Remove the first entry (which is a name) and store it as name
Convert all other entries to floats
Store name and the float entries in my Community class
This is what I am trying at the moment:
class Community:
    num = 0
    def __init__(self, inName, inVertices):
        self.name = inName
        self.vertices = inVertices
        Community.num += 1

allCommunities = []

f = open("communityAreas.csv")

for i, line in enumerate(f):
    entries = line.split(',')
    name = entries.pop(0)
    for j, vertex in entries: entries[j] = float(vertex)
    print name+", "+entries[0]+", "+str(type(entries[0]))
    allCommunities.append(Community(name, entries))

f.close()
The error I am getting is:
>>>>> PYTHON ERROR!!!
Traceback (most recent call last):
  File "alexChicago.py", line 86, in <module>
    for j, vertex in entries: entries[j] = float(vertex)
ValueError: too many values to unpack
It may be worth pointing out that this is running in omegalib, a library for a visual cluster that runs in C and interprets Python.
I think you forgot the enumerate() function on line 86; should be
for j, vertex in enumerate(entries): entries[j] = float(vertex)
If there's always a name and then a variable number of float values, it sounds like you need to split twice: the first time with a maxsplit of 1, and the other as many times as possible. Example:
name, float_values = line.split(',',1)
float_values = [float(x) for x in float_values.split(',')]
I may not be absolutely certain about what you want to achieve here, but wouldn't converting all the elements in entries to float be sufficient? Line 86 would become:
entries = map(float, entries)
I am trying to use Python to parse a text file (whose name is stored in the variable trackList) with times and titles in it. It looks like this:
00:04:45 example text
00:08:53 more example text
12:59:59 the last bit of example text
My regular expression (rem) works, and I am able to split the string (i) into two parts correctly (separating the times from the text), but I am then unable to add the lists that the split returns (using .extend) to the larger list I created earlier (sLines).
f=open(trackList)
count=0
sLines=[[0 for x in range(0)] for y in range(34)]
line=[]
for i in f:
    count+=1
    line.append(i)
    rem=re.match("\A\d\d\:\d\d\:\d\d\W",line[count-1])
    if rem:
        sLines[count-1].extend(line[count-1].split(' ',1))
    else:
        print("error on line: "+count)
That code should go through each line in the file trackList and test whether the line is as expected; if so, it should separate the time from the text and save the result as a list inside a list, at the index one less than the current line number; if not, it should print an error pointing me to the line.
I use [count-1] because Python lists are zero-indexed and file lines are not.
I use .extend() as I want both elements of the smaller array added to the larger array in the same iteration of the parent for loop.
So, you have some pretty confusing code there.
For instance doing:
[0 for x in range(0)]
Is a really fancy way of initializing an empty list:
>>> [] == [0 for x in range(0)]
True
Also, how do you know the matrix should be 34 lines long? You're also confusing yourself by calling your line 'i' in your for loop; usually that name is reserved as shorthand for an index, which you'd expect to be a numerical value. Appending i to line and then re-referencing it as line[count-1] is redundant when you already have the line variable (i).
Your overall code can be simplified to something like this:
import re

# load the file and extract the lines
f = open(trackList)
lines = f.readlines()
f.close()

# compile the expression once (more efficient when matching in a loop)
expr = re.compile(r'^(\d\d:\d\d:\d\d)\s*(.*)$')

sLines = []

# loop over the lines, collecting both the index (i) and the line (line)
for i, line in enumerate(lines):
    result = expr.match(line)

    # validate the line
    if not result:
        print("error on line: " + str(i+1))

        # add an invalid list to the matrix
        sLines.append([])  # or whatever you want as your invalid line
        continue

    # add the list to the matrix
    sLines.append(result.groups())
I have two 3GB text files, each file has around 80 million lines. And they share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).
How can I quickly find the unique lines in the two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I guess it's unlikely I'll find an efficient Pythonic method to load the files and compare them.
Any suggestions are appreciated.
If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).
Notes:
1. I only store each line's hash to save space (and time, if paging might occur).
2. Because of the above, I only print out line numbers; if you need actual lines, you'd just need to read the files in again.
3. I assume that the hash function results in no conflicts. This is nearly, but not perfectly, certain.
4. I import hashlib because the built-in hash() function is too short to avoid conflicts.
import sys
import hashlib

file = []
lines = []
for i in range(2):
    # open the files named in the command line
    file.append(open(sys.argv[1+i], 'r'))
    # stores the hash value and the line number for each line in file i
    lines.append({})
    # assuming you like counting lines starting with 1
    counter = 1
    while 1:
        # assuming default encoding is sufficient to handle the input file
        line = file[i].readline().encode()
        if not line: break
        hashcode = hashlib.sha512(line).hexdigest()
        lines[i][hashcode] = sys.argv[1+i]+': '+str(counter)
        counter += 1
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
With 60,000 or 80,000 unique lines you could just create a dictionary that maps each unique line to a number: mydict["hello world"] => 1, etc. If your average line is around 40-80 characters this will be in the neighborhood of 10 MB of memory.
Then read each file, converting it to an array of numbers via the dictionary. Those will fit easily in memory (2 files of 8 bytes * 3GB / 60k lines is less than 1 MB of memory). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.
EDIT:
In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.
#!/usr/bin/python

class Reader:

    def __init__(self, file):
        self.count = 0
        self.dict = {}
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            return None
        if line in self.dict:
            return self.dict[line]
        else:
            self.count = self.count + 1
            self.dict[line] = self.count
            return self.count

if __name__ == '__main__':
    print "Type Ctrl-D to quit."

    import sys
    r = Reader(sys.stdin)
    result = 'ignore'
    while result:
        result = r.readline()
        print result
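To tie this back to the comparison step described earlier, here is a rough, untested sketch of the full flow. The file names are placeholders, and for brevity it uses a plain shared dictionary (so both files get the same numbering) and a set symmetric difference, which is enough for finding unique lines:

line_ids = {}

def to_numbers(path):
    # map each line of the file to a number; the dict is shared across files
    nums = []
    for line in open(path):
        if line not in line_ids:
            line_ids[line] = len(line_ids) + 1
        nums.append(line_ids[line])
    return nums

nums_a = to_numbers('fileA')
nums_b = to_numbers('fileB')

# numbers that occur in one file but not the other
diff_ids = set(nums_a).symmetric_difference(nums_b)

# invert the dictionary to recover the text of the differing lines
by_id = dict((num, text) for text, num in line_ids.items())
for num in diff_ids:
    print by_id[num].rstrip('\n')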
If I understand correctly, you want the lines of these files without duplicates. This does the job:
uniqA = set(open('fileA', 'r'))
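Presumably the second file is read the same way and the set differences are taken; a short sketch of that completion (the file names are placeholders):

uniqA = set(open('fileA', 'r'))
uniqB = set(open('fileB', 'r'))

only_in_A = uniqA - uniqB  # lines of fileA that never appear in fileB
only_in_B = uniqB - uniqA  # lines of fileB that never appear in fileA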
Python has difflib, which claims to be quite competitive with other diff utilities; see:
http://docs.python.org/library/difflib.html
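A minimal sketch of running it over the two files (the names are placeholders; note that difflib keeps both line lists in memory, which may matter at this file size):

import difflib
import sys

# placeholders for the two input files
a = open('fileA').readlines()
b = open('fileB').readlines()

sys.stdout.writelines(difflib.unified_diff(a, b, fromfile='fileA', tofile='fileB'))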
I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on, for some 10,000 lines in each file.
I have a variable whose value is, say, a = .3344.
From the file I want to get the row number of the row whose first column is closest to this variable... for example, it should give row_num = '3', as .3432 is closest to it.
I have tried loading the first column's elements into a list and then comparing the variable to each element and getting the index number.
Doing it this way is very time consuming and slows my model... I want a very quick method, as this needs to be called some 1,000 times at minimum...
I want a method with the least overhead that is very quick; can anyone please tell me how it can be done very fast?
As the file size is at most 100 KB, can this be done directly without loading it into a list or anything? If yes, how can it be done?
Any method quicker than the one mentioned above is welcome, but I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6

for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save

    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'

    file = open(fin)
    fout = open(fout).readlines()

    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]

    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please suggest better alternatives to this.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() / self._line_length

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    # fw[i] is the first value >= target, fw[i - 1] the last value below it
    i = bisect.bisect_left(fw, target)
    if i == len(fw) or (i > 0 and target - fw[i - 1] < fw[i] - target):
        return i - 1
    else:
        return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
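Setting that caveat aside, a quick illustration of how find_row might be called; the file name is just an example, and the file must have fixed-length lines for SubscriptableFile to work:

with open('numbers.txt') as f:
    print find_row(f, .3344)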
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_value = None
    for row_num, row in enumerate(file('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
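A minimal sketch of that suggestion, assuming the first-column values are sorted (as in the sample data) and reusing the example file name from the answer above; it returns the zero-based row number, like the linear scan:

import bisect

# load the first column into a list once (assumes the values are sorted)
values = [float(line.split()[0]) for line in open('numbers.txt')]

def closest_row_num(target):
    i = bisect.bisect_left(values, target)
    # pick whichever neighbour is closer, guarding both ends of the list
    if i == len(values) or (i > 0 and target - values[i-1] <= values[i] - target):
        return i - 1
    return i

print closest_row_num(.3344)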