Wrong order in iterating over folder and printing filenames - python

Please help with trying to print the filenames of the pictures. I either print the same filename or the same picture with different filename.
I want the output to be FileName then Pic associated with FileName.
Instead I am getting FileName0 and Pic0 then FileName0 then Pic1
or Filename0 then Pic0 then Filename1 then Pic0.
I added more code to the original post to clarify what I was trying to do; hopefully it makes sense. I want to print the name of the image and then display the image. The new code I came up with displays the image, prints the name at the bottom of it along with None, and then the program terminates. Say the list has 4 images: I want to print the name of image[0] and then display image[0], then print the name of image[1] and display image[1], and so on in a loop.
#OLD CODE
with zipfile.ZipFile("file.zip", 'r') as zip_ref:
    zip_ref.extractall("folderName")
    for info in zip_ref.infolist():
        for file in os.listdir("folderName"):
            image=Image.open(file).convert('RGB')
            print(info.filename)
            display(image)
#NEW CODE
#My current list length is 4
file_name = []
actual_image = []
##Extract all the files and put in folder
with zipfile.ZipFile("readonly/small_img.zip", 'r') as zip_ref:
    zip_ref.extractall("pyproject")
#Add name to list/Add image to list. Probably should be one list.
for entry in os.scandir("pyproject"):
    file_name.append(entry.name)
for file in os.listdir("pyproject"):
    image=Image.open(file).convert('RGB')
    actual_image.append(image)
#print(info.filename,display(image))
#Newer line of code directly above.
#When the above for loop becomes nested it displays 4
#pictures with the file number underneath. Expected result is 1pic to 1 filename.
#Its closer to what I want. Will keep trying.
print(len(file_name))
#Returns file names.
def name_of_file(a):
    for names in a:
        return names
#Returns image to be displayed
def image_of_file(b):
    for image in b:
        return (display(image))
##Prints out image name and then displays image
print(name_of_file(file_name),image_of_file(actual_image))
###Dictionary example code:
list_of_pictures = [{image1_of_four :PIL.image,bounding_box,pytesseract_text}]

I think the confusion comes from the double iteration. As far as I can see, this is not necessary, because you just want to iterate over every (image) file in a zipped directory. (If I have understood the question correctly.)
A single iteration is sufficient here:
import zipfile
with zipfile.ZipFile("file.zip", 'r') as zip_ref:
    for file in zip_ref.filelist:
        print(file.filename)
        # ...
So for processing the files inside the zip archive you could do something like this (of course there are several possibilities, depending on the usecase):
import zipfile
from PIL import Image, UnidentifiedImageError
with zipfile.ZipFile("file.zip", 'r') as zip_ref:
    for zipped_file in zip_ref.filelist:
        print(f"This is your fileinfo: {zipped_file.filename}")
        try:
            file = zip_ref.open(zipped_file)
            image = Image.open(file).convert('RGB')
        except UnidentifiedImageError:
            print(f"Error processing {zipped_file.filename}")
If you really need more information from the iterated (zipped) files, then the infolist() method is ok, giving you the information from ZipInfo-object.
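To illustrate what infolist() offers, here is a minimal sketch that builds a tiny archive in memory and lists each regular file together with its size in a single pass. The archive contents and the helper names build_demo_archive and list_members are made up for this example; only the zipfile calls come from the standard library.

```python
import io
import zipfile

def build_demo_archive():
    """Create a small in-memory zip (hypothetical contents, for illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("pics/", "")            # a directory entry
        zf.writestr("pics/a.png", b"aaaa")  # fake 4-byte "image"
        zf.writestr("pics/b.png", b"bb")    # fake 2-byte "image"
    buffer.seek(0)
    return buffer

def list_members(archive):
    """Return (filename, uncompressed size) for every regular file in the archive."""
    members = []
    with zipfile.ZipFile(archive, "r") as zip_ref:
        for info in zip_ref.infolist():     # one ZipInfo object per entry
            if info.is_dir():               # skip directory entries
                continue
            members.append((info.filename, info.file_size))
    return members

print(list_members(build_demo_archive()))
```

Because the name and the metadata come from the same ZipInfo object, there is no second list that could drift out of step.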
Update after question editing:
As far as I can see, there is still a picture to be displayed and the corresponding file name to be printed. If my assumption is correct, then there are several issues with the presented code:
There is no need at all to iterate several times. No matter if you have one nested iteration or several iterations in a row. Limiting the number of iterations also reduces the complexity and probably the whole thing becomes less complicated. To go into detail: You use several iterations to 1. unzip the files (zip_ref.extractall() is already iterating itself), 2. store the filenames in a list, 3. store the image objects in a list, 4. print the stored filenames, 5. display the image objects. All information is already available to you when iterating over the files in the archive, or can be easily computed in the current iteration step. This completely eliminates the need to create multiple data structures for file names, image objects etc. Here you already have the file, thus also the file name and thus also the corresponding image.
I still see no reason to unpack the whole archive first. All this can be done in the iteration itself. If the images themselves are to be saved, then unpacking is of course useful. But then you can also simply unzip the files and then iterate over the unzipped files with Python, e.g. with os.scandir(). This was implemented in the updated code. But this is not necessary if you only want to display the current file of each iteration step.
Unfortunately the function of display() is still not known to me. Probably something similar to Image.show() is done there. After the code update within the question, I can only mention small changes to my example to show how easy it can be to display the file name for the corresponding image:
import os
import zipfile
from PIL import Image, UnidentifiedImageError
with zipfile.ZipFile("file.zip", 'r') as zip_ref:
    for zipped_file in zip_ref.filelist:
        try:
            image = Image.open(zip_ref.open(zipped_file)).convert('RGB')
            print(os.path.basename(zipped_file.filename))
            image.show() # simulating: display(image)
            input("Press a key to show next image...")
        except UnidentifiedImageError:
            pass
I only print the file name, for which there is also a matching picture. No other prints (to keep things as clear as possible). image.show() is used to simulate the unknown display(image)-function. To make it clear that the corresponding file name refers to the currently opened image, I have included a pause, here in the form of a user prompt (input()).
All this under the assumption that simply the appropriate file name for a certain image should be displayed. Using only one iteration should be the appropriate solution here.
Using multiple iterations to store objects in multiple lists (as done in the question) leads to a disadvantage: Higher complexity. In this case, the index positions of the lists have to match each other, and when iterating over one list, you have to access the other list with the same index position like this:
list_a = [1, 2, 3]
list_b = ["a", "b", "c"]
for index, el in enumerate(list_a):
    print(el, list_b[index])
You would have to do this if you want to keep most of your code unchanged. But then you have to make sure that the lists never change (or, better, use tuples), which is simply more complex and more error-prone.
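If two parallel lists are kept anyway, the built-in zip() pairs the elements by position without any manual index bookkeeping. The sketch below uses placeholder strings in place of real PIL images, and the helper name paired is made up for this example.

```python
file_names = ["a.png", "b.png", "c.png"]
images = ["<image a>", "<image b>", "<image c>"]  # placeholders standing in for PIL images

def paired(names, items):
    """Pair each name with the item at the same position."""
    return list(zip(names, items))

for name, image in paired(file_names, images):
    print(name, image)  # the name first, then its matching image
```

Note that zip() stops at the shorter list, so a length mismatch is silently truncated rather than reported.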

Related

Entry of a list item in a file

Good afternoon, I have a multiple list of IP and MAC, list of arbitrary length
A = [['10.0.0.1','00:4C:3S:**:**:**', 0], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
I want to check if this MAC is in the oui file:
E043DB (base 16) Shenzhen
2405f5 (base 16) Integrated
3CD92B (base 16) Hewlett Packard
...
If the MAC from the list is in the file, write the name of the manufacturer as 3 list items. I'm trying to do so and it turns out to check only the first element, the remaining ones are not checked, how can I do this please tell me?
f = open('oui.txt', 'r')
for values in A:
    for line in f.readlines():
        if values[1][0:8].replace(':','') in line:
            values[2]=(line.split('(base 16)')[1].strip())
f.close()
print (A)
And get an answer:
A = [['10.0.0.1','00:4C:3S:**:**:**', 'Firm Name'], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
The Problem
Consider the "shape" of your code:
f = open('a file')
for values in [ 'some list' ]:
    for line in f.readlines():
Your two loops are doing this:
Start with first value in list
Read all lines remaining in file object f
Move to next value in list
Read all lines remaining in file object f
Except that the first time you told it to "read all lines remaining", it did exactly that, leaving nothing for the later passes.
So, unless you have some way to put more lines into f (which can happen with async files like stdin!) you are going to get one "good" pass through the file, and then every subsequent pass the file object will point to the end of the file, so you'll get nothing.
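The effect is easy to reproduce with an in-memory file object. This is a small sketch that uses io.StringIO in place of the real oui.txt; the two sample lines are taken from the question.

```python
import io

# io.StringIO behaves like an open text file for reading purposes.
f = io.StringIO("E043DB (base 16) Shenzhen\n2405f5 (base 16) Integrated\n")

first_pass = f.readlines()   # reads everything up to the end of the file
second_pass = f.readlines()  # the position is already at the end: nothing left

f.seek(0)                    # rewinding makes the contents readable again
third_pass = f.readlines()

print(len(first_pass), len(second_pass), len(third_pass))
```

Rewinding with seek(0) would also "fix" the original loop, but it rereads the whole file once per list item.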
A Solution
When you are dealing with a file, you want to only process it one time. File I/O is expensive compared to other operations. So you can choose to either (a) read the entire file into memory, and do whatever you want since it's not a file any more; or (b) scan through it only one time.
If you choose to scan through it only once, the easy solution is just to invert the two for loops. Instead of doing this:
for item in list:
    for line in file:
Do this instead:
for line in file:
    for item in list:
And presto! You are now only reading the file one time.
Other Considerations
If I look at your code, and your examples, it seems like you are trying for an exact match on a particular key. You trim down the MAC addresses in your list to check them against the manufacturer ids.
This suggests to me that you might well have many, many more list values (source MAC addresses) than you have manufacturers. So perhaps you should consider reading the contents of the file into memory, rather than processing it one line at a time.
Once you have the file in memory, consider building a proper dictionary. You have a key (MAC prefix) and a value (manufacturer). So build something like:
mac_to_mfg = {}  # maps MAC prefix -> manufacturer name
for line in f:
    mac = line.split('(base 16)')[0].strip()
    mfg = line.split('(base 16)')[1].strip()
    mac_to_mfg[mac] = mfg
Then you can make one pass through the source addresses and use the dict's O(1) lookup to your advantage:
for src in A:
    prefix = src[1][:8].replace(':', '')
    if prefix in mac_to_mfg:
        # etc...
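Put together, the dictionary approach might look like the sketch below. The helper names load_oui and annotate, the sample MAC addresses, and the upper-casing of prefixes (the sample file mixes E043DB and 2405f5) are assumptions added for illustration, not part of the original code.

```python
import io

def load_oui(lines):
    """Build a {MAC prefix: manufacturer} dict from '(base 16)' lines."""
    mac_to_mfg = {}
    for line in lines:
        if '(base 16)' not in line:
            continue                                  # skip lines in another format
        prefix, mfg = line.split('(base 16)', 1)
        mac_to_mfg[prefix.strip().upper()] = mfg.strip()  # assumption: normalise case
    return mac_to_mfg

def annotate(addresses, mac_to_mfg):
    """Fill the third item of each [ip, mac, 0] entry when the prefix is known."""
    for entry in addresses:
        prefix = entry[1][:8].replace(':', '').upper()
        if prefix in mac_to_mfg:
            entry[2] = mac_to_mfg[prefix]
    return addresses

# Stand-in for open('oui.txt'); the MAC addresses below are made up.
oui_file = io.StringIO("E043DB (base 16) Shenzhen\n3CD92B (base 16) Hewlett Packard\n")
A = [['10.0.0.1', '3C:D9:2B:00:00:01', 0], ['10.0.0.2', 'AA:BB:CC:00:00:02', 0]]
annotate(A, load_oui(oui_file))
print(A)
```

The file is read exactly once, and each address costs a single dictionary lookup instead of a scan over every line.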
The problem is that you got the order of the loops reversed. Usually this isn't that big of a problem, but when working with objects that are consumed (like the IO file object), the object will produce nothing more once it has been iterated over.
You'll need to iterate over the lines first, and then, for each line, iterate through A to check the values:
with open('oui.txt', 'r') as f:
    for line in f.readlines():
        for values in A:
            if values[1][0:8].replace(':','') in line:
                values[2]=(line.split('(base 16)')[1].strip())
print (A)
Notice I changed your file opening to use the with context manager instead: once your code exits the with block, it will automatically close() the file for you. It is recommended over manually opening the file, as you might forget to close() it afterwards.

Delete a line in multiple text files with the same line beginning but varying line ending using Python v3.5

I have a folder full of .GPS files, e.g. 1.GPS, 2.GPS, etc...
Within each file is the following five lines:
Trace #1 at position 0.004610
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,39.0304,T,39.0304,M,0.029,N,0.054,K,D*32
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27
...followed by the same data structure, with different values, over the next five lines:
Trace #6 at position 0.249839
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,247.2375,T,247.2375,M,0.081,N,0.149,K,D*3D
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F
(I realise the values after the $GNGSA lines don't vary in the above example. This is just a bad example... in the real dataset they do vary!)
I need to remove the lines that begin with "$GNGSA" and "$GNVTG" (i.e. I need to delete lines 2, 3, and 4 from each group of five lines within each .GPS file).
This five-line pattern continues for a varying number of times throughout each file (for some files, there might only be two five-line groups, while other files might have hundreds of the five-line groups). Hence, deleting these lines based on the line number will not work (because the line number would be variable).
The problem I am having (as seen in the above examples) is that the text that follows the "$GNGSA" or "$GNVTG" varies.
I'm currently learning Python (I'm using v3.5), so figured this would make for a good project for me to learn a few new tricks...
What I've tried already:
So far, I've managed to create the code to loop through the entire folder:
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
    if i.endswith('.GPS'): # if the filename of an iteration ends with .GPS, then...
        print(i + ' loaded') # print the filename to CLI, simply for debugging purposes.
        with open(indir + i, 'r') as my_file: # open the iteration file
            file_lines = my_file.readlines() # uses the readlines method to create a list of all lines in the file.
            print(file_lines) # this prints the entire contents of each file to CLI for debugging purposes.
Everything in the above works perfectly.
What I need help with:
How do I detect and delete the lines themselves, and then save the file (to the same location; there is no need to save to a different filename)?
The filenames - which usually end with ".GPS" - sometimes end with ".gps" instead (the only difference being the case). My above code will only work with the uppercase files. Besides completely duplicating the code and changing the endswith argument, how do I make it work with both cases?
In the end, my file needs to look something like this:
Trace #1 at position 0.004610
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27
Trace #6 at position 0.249839
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F
Any suggestions, please? Thanks in advance. :)
You're almost there.
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
    if i.endswith('.GPS'): # if the filename of an iteration ends with .GPS, then...
        print(i + ' loaded') # print the filename to CLI, simply for debugging purposes.
        with open(indir + i, 'r') as my_file: # open the iteration file
            for line in my_file:
                if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'):
                    print(line, end='') # each line already ends with a newline
As per what the others have said, you're on the right track! Where you're going wrong is in the case-sensitive file extension check, and in reading in the entire file contents at once (this isn't per se wrong, but it's probably adding complexity we won't need).
I've commented your code, removing all the debug stuff for simplicity, to illustrate what I mean:
import os
indir = '/path/to/files'
for i in os.listdir(indir):
    if i.endswith('.GPS'): #This CASE SENSITIVELY checks the file extension
        with open(os.path.join(indir, i), 'r') as my_file: # Opens the file (os.path.join adds the path separator that indir lacks)
            file_lines = my_file.readlines() # This reads the ENTIRE file at once into an array of lines
So we need to fix the case sensitivity issue, and instead of reading in all the lines, we'll instead read the file line-by-line, check each line to see if we want to discard it or not, and write the lines we're interested in into an output file.
So, incorporating @tdelaney's case-insensitive fix for the file name, we replace the endswith check (the line marked CASE SENSITIVELY above) with
if i.lower().endswith('.gps'): # Case-insensitively check the file name
and instead of reading in the entire file at once, we'll instead iterate over the file stream and print each desired line out
with open(os.path.join(indir, i)) as in_file, open(os.path.join(indir, i + 'new.gps'), 'w') as out_file: # Open the input file for reading and create + open a new output file for writing - thanks @tdelaney once again!
    for line in in_file: # This reads each line one-by-one from the in file
        if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'): # Check the line has what we want (thanks Avinash)
            out_file.write(line) # Write the line to the new output file (it already ends with its own newline)
Note that you should make certain that you open the output file OUTSIDE of the 'for line in in_file' loop, or else the file will be overwritten on every iteration which will erase what you've already written to it so far (I suspect this is the issue you've had with the previous answers). Open both files at the same time and you can't go wrong.
Alternatively, you can specify the file access mode when you open the file, as per
with open(os.path.join(indir, i + 'new.gps'), 'a') as out_file:
which will open the file in append mode, which is a specialised form of write mode that preserves the original contents of the file and appends new data to it instead of overwriting existing data.
Ok, based on suggestions by Avinash Raj, tdelaney, and Sampson Oliver, here on Stack Overflow, and another friend who helped privately, here is the solution that is now working:
import os
indir = '/Users/dhunter/GRID01/' # input directory
for i in os.listdir(indir): # for each "i" (iteration) within the indir variable directory...
    if i.lower().endswith('.gps'): # if the filename of an iteration ends with .GPS, then...
        if not i.lower().endswith('.gpsnew.gps'): # if the filename does not end with .gpsnew.gps, then...
            print(i + ' loaded') # print the filename to CLI.
            with open (indir + i, 'r') as my_file:
                for line in my_file:
                    if not line.startswith('$GNGSA'):
                        if not line.startswith('$GNVTG'):
                            with open(indir + i + 'new.gps', 'a') as outputfile:
                                outputfile.write(line)
                                outputfile.write('\r\n')
(You'll see I had to add in another layer of if statement to stop it from using the output files from previous uses of the script "if not i.lower().endswith('.gpsnew.gps'):", but this line can easily be deleted for anyone who uses these instructions in future)
We switched the open mode on the third-last line to "a" for append, so that it would save all the right lines to the file, rather than overwriting each time.
We also added in the final line to add a line break at the end of each line.
Thanks everyone for their help, explanations, and suggestions. Hopefully this solution will be useful to someone in future. :)
2. The filenames:
The if accepts any expression returning a truth value, and you can combine expressions with the standard boolean operators: if i.endswith('.GPS') or i.endswith('.gps').
You can also put the ... or ... expression after the if in brackets, to feel more sure, but it's not necessary.
Alternatively, as a less universal solution (but since you wanted to learn a few tricks :)), you can use string manipulation in this case: an object of type string has a lot of methods. '.gps'.upper() gives '.GPS' -- see if you can make use of this! (Even a string literal is a string object, and your variables behave the same way.)
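Both ideas can be tried out in a few lines. The function names below are made up for this sketch; str.lower() and the tuple form of str.endswith() are standard string methods.

```python
def is_gps_file(filename):
    """Case-insensitive check: normalise the name, then compare once."""
    return filename.lower().endswith('.gps')

def is_gps_file_tuple(filename):
    """endswith() also accepts a tuple of suffixes, but only the listed spellings match."""
    return filename.endswith(('.GPS', '.gps'))

for name in ['1.GPS', '2.gps', '3.Gps', 'notes.txt']:
    print(name, is_gps_file(name), is_gps_file_tuple(name))
```

The lower() variant also accepts mixed-case names such as 3.Gps, which the explicit list of two suffixes misses.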
1. Finding the Lines:
As you can see in the other solution, you need not read out all of your lines; you can check whether you want to keep them 'on the fly'. But I will stick to your approach with readlines. It gives you a list, and lists support indexing and slicing. Try:
anylist[startindex:endindex:stride], for any values, so for example try: newlist = list(range(100))[1::5].
It's always helpful to try out the easy basic operations in interactive mode, or at the beginning of your script. Here range(100) is just some sample data. Here you also see how the Python for-syntax works differently than in other languages: you can iterate over any list, and if you just need integers, you create a sequence of integers with range().
So this will work the same with any other list -- e.g. the one you get from readlines()
This selects a slice from the list, beginning with the second element, ending at the end (since the end index is omitted), and taking every 5th element. Now that you have this sub-list, you can just remove it from the original. So for the example with the range:
a = list(range(100))  # in Python 3, range() must be turned into a list before deleting from it
del a[1::5]
print(a)
So you see, that the appropriate items have been removed. Now do the same with your file_lines, and then proceed to remove the other lines you want to remove.
Then, in a new with block, open the file for writing and do writelines(file_lines), so the remaining lines are written back to the file.
Of course you can also take the approach to look for the content of each line with a for loop over your list and startswith(). Or you can combine the approaches, and check, if deleting lines by number leaves the right starts, so you can print an error if something is unexpected...
3. Saving the file
You can close your file after you have the lines saved from readlines(). In fact this is done automatically at the end of the with block. Then just open it in 'w' mode instead of 'r' and call writelines(yourlist) on the file object. You don't need to save explicitly; the data is saved on closing.
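As a rough sketch of the whole read, filter, write-back cycle described above: the function names, the UNWANTED tuple, and the shortened sample data are assumptions for illustration. The demonstration runs in a temporary directory, so no real .GPS files are touched.

```python
import os
import tempfile

UNWANTED = ('$GNGSA', '$GNVTG')  # line prefixes to drop

def strip_unwanted_lines(path, prefixes=UNWANTED):
    """Rewrite the file in place, keeping only lines that do not start with a prefix."""
    with open(path, 'r') as src:
        kept = [line for line in src if not line.startswith(prefixes)]
    with open(path, 'w') as dst:       # 'w' truncates, then we write the kept lines back
        dst.writelines(kept)
    return len(kept)

def clean_folder(indir):
    """Apply the filter to every .gps/.GPS file directly inside indir."""
    for name in os.listdir(indir):
        if name.lower().endswith('.gps'):
            strip_unwanted_lines(os.path.join(indir, name))

# Small demonstration on a throwaway directory (sample data shortened from the question).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, '1.gps')
    with open(demo_path, 'w') as fh:
        fh.write('Trace #1 at position 0.004610\n'
                 '$GNGSA,A,3,02\n'
                 '$GNVTG,39.0304,T\n'
                 '$GNGGA,233701.00\n')
    clean_folder(demo_dir)
    with open(demo_path) as fh:
        demo_result = fh.read()
print(demo_result)
```

Since this overwrites the original files, it is worth trying it on a copy of the folder first.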

Python upper limit for loop

Let me start by saying I'm fairly new to python.
Ok so I'm running code to perform physics calculations/draw graphs etc on data files, and what I need to do is loop over a number of files and sub-files. But the problem is there are a different number of sub-files in each file (e.g. file 0 has 711 sub-files, file 1 has 660 odd). It obviously doesn't like it when I run it across a file that doesn't have a sub-file at index x, so I was wondering is there a way to get it to run (iterate?) up to the final limit in each file automatically?
What I've got is a nested loop like:
for i in range(0,120):
    for j in range(0,715):
        stuff
Cheers in advance for any help, and sorry if my explanation is bad!
Edit:
some more of the code. So what I'm actually doing is calculating/plotting the angular momentum of gas and dark matter particles. These are in halos (j), & there are a number of files (i) containing lots and lots of these halos.
import getfiby
import numpy as np
import pylab as pl
angmom=getfiby.ReadFIBY("B")
for i in range(0,120):
    for j in range(0,715):
        pos_DM = angmom.getParticleField(49,i,j,"DM","Coordinates")
        vel_DM = angmom.getParticleField(49,i,j,"DM","Velocity")
        mass_DM = angmom.getParticleField(49,i,j,"DM","Mass")
        more stuff
getfiby is a code I was given that retrieves all the data from the files (that I can't see). It's not really a massive problem, as I can still get it to run the calculations & make plots even if the upper limit I put on my range for j exceeds the number of halos there are in a particular file (I get: Halo index out of bounds. Goodbye.) But yeah I just wondered if there was a nicer, tidier way of getting it to run.
You may want something like this:
a=range(3)
for i in range(5):
    try:
        b=a[i] #if you iterate over your "subfiles" by index
    except IndexError:
        break #break out when list index out of range
When you build the list of files and subfiles to process, you could have the list in a variable named filelist and the subfile list in a variable called subfilelist You could then run the loops as
for myfile in filelist:
    # Process code
    for subfile in subfilelist:
        # Process code.
If you need to use a range then
filerange = len(filelist)
subfilerange = len(subfilelist)
for i in range(0, filerange):
    for j in range(0, subfilerange):
        pass  # process filelist[i] and subfilelist[j] here
I may be misunderstanding your structure/intent here, but it sounds like you want to perform some calculations on each file in some folder structure, where the folders contain various numbers of the data files. In that case, I'd make use of the os.listdir function rather than trying to manually index each data file.
import os
for file in os.listdir(ROOT_DIRECTORY):
    for subfile in os.listdir(os.path.join(ROOT_DIRECTORY, file)):
        process(os.path.join(ROOT_DIRECTORY, file, subfile)) # or whatever
This can of course be made a little easier to look at in a few ways (personally I have my own listdir wrapper that returns full paths instead of just the basenames), but that would be my basic idea.
If you also need the index of each file, you can probably get it using some combination of enumerate and maybe sorted (e.g. for i, file in enumerate(sorted(os.listdir(...)))), but the specifics obviously depend on your filenames and directory structure.
Assuming fileSet is an iterable full of file objects, and each file object is itself an iterable full of subfile objects, then:
for i,file in enumerate(fileSet):
    for j,subfile in enumerate(file):
        stuff
But think hard about whether you really need the indices i and j, as you already have the words file and subfile to refer to the objects you are dealing with. If you don't need the indices, it's simply:
for file in fileSet:
    for subfile in file:
        stuff
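A tiny self-contained sketch shows how this handles a different number of sub-items per file without any hard-coded upper limit. The nested list halos_per_file and the function visit_all are hypothetical stand-ins; the real data would come from whatever the reader library exposes.

```python
# Hypothetical stand-in data: three "files" with 3, 1 and 2 "halos" each.
halos_per_file = [["h0", "h1", "h2"], ["h0"], ["h0", "h1"]]

def visit_all(files):
    """Visit every (file index, halo index, halo) without knowing the counts in advance."""
    visited = []
    for i, halos in enumerate(files):
        for j, halo in enumerate(halos):   # the inner loop stops at each file's own length
            visited.append((i, j, halo))
    return visited

for i, j, halo in visit_all(halos_per_file):
    print(i, j, halo)
```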
Now, if by "file" you actually meant files in the operating system's filesystem, then I need you to explain to me what a subfile is in that context, as filesystem files usually cannot be nested; only directories can.

Python securely remove file

How can I securely remove a file using python? The function os.remove(path) only removes the directory entry, but I want to securely remove the file, similar to the apple feature called "Secure Empty Trash" that randomly overwrites the file.
What function securely removes a file using this method?
You can use srm to securely remove files. You can use Python's os.system() function to call srm.
You can very easily write a function in Python to overwrite a file with random data, even repeatedly, then delete it. Something like this:
import os
def secure_delete(path, passes=1):
    with open(path, "ba+") as delfile:
        length = delfile.tell()
    with open(path, "br+") as delfile:
        for i in range(passes):
            delfile.seek(0)
            delfile.write(os.urandom(length))
    os.remove(path)
Shelling out to srm is likely to be faster, however.
You can use srm, sure, but you can also easily implement it in Python. Refer to Wikipedia for the data to overwrite the file content with. Observe that, depending on the actual storage technology, the appropriate data patterns may be quite different. Furthermore, if your file is located on a log-structured file system, or even on a file system with copy-on-write optimisation like btrfs, your goal may be unachievable from user space.
After you are done mashing up the disk area that was used to store the file, remove the file's directory entry with os.remove().
If you also want to erase any trace of the file name, you can try to allocate and reallocate a whole bunch of randomly named files in the same directory, though depending on the directory inode structure (linear, btree, hash, etc.) it may be very tough to guarantee you actually overwrote the old file name.
So at least in Python 3, using @kindall's solution, I only got it to append. Meaning the entire contents of the file were still intact and every pass just added to the overall size of the file. So it ended up being [Original Contents][Random Data of that Size][Random Data of that Size][Random Data of that Size], which is obviously not the desired effect.
This trickery worked for me though. I open the file in append to find the length, then reopen in r+ so that I can seek to the beginning (in append mode it seems like what caused the undesired effect is that it was not actually possible to seek to 0)
So check this out:
import os
def secure_delete(path, passes=3):
    with open(path, "ba+", buffering=0) as delfile:
        length = delfile.tell()
    delfile.close()
    with open(path, "br+", buffering=0) as delfile:
        #print("Length of file:%s" % length)
        for i in range(passes):
            delfile.seek(0,0)
            delfile.write(os.urandom(length))
            #wait = input("Pass %s Complete" % i)
        #wait = input("All %s Passes Complete" % passes)
        delfile.seek(0)
        for x in range(length):
            delfile.write(b'\x00')
        #wait = input("Final Zero Pass Complete")
    os.remove(path) #So note here that the TRUE shred actually renames to file to all zeros with the length of the filename considered to thwart metadata filename collection, here I didn't really care to implement
Un-comment the prompts to check the file after each pass, this looked good when I tested it with the caveat that the filename is not shredded like the real shred -zu does
The answers implementing a manual solution did not work for me. My solution is as follows, it seems to work okay.
import os
def secure_delete(path, passes=1):
    length = os.path.getsize(path)
    with open(path, "br+", buffering=-1) as f:
        for i in range(passes):
            f.seek(0)
            f.write(os.urandom(length))
        f.close()
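For comparison, here is a minimal sketch that combines the ideas from the answers above: overwrite in place, push each pass out to the storage device with flush() and os.fsync(), and only then remove the directory entry. The function name overwrite_and_remove is made up for this example, and the caveat raised earlier still applies: on copy-on-write or log-structured file systems, and on many SSDs, an in-place overwrite from user space may never touch the original blocks.

```python
import os
import tempfile

def overwrite_and_remove(path, passes=1):
    """Overwrite the file with random bytes `passes` times, then delete it."""
    length = os.path.getsize(path)
    with open(path, "r+b") as f:          # r+b: read/write without truncating or appending
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(length))
            f.flush()                     # empty Python's buffer ...
            os.fsync(f.fileno())          # ... and ask the OS to write it to the device
    os.remove(path)

# Demonstration on a throwaway file.
fd, tmp_path = tempfile.mkstemp()
os.write(fd, b"secret data")
os.close(fd)
overwrite_and_remove(tmp_path, passes=3)
still_exists = os.path.exists(tmp_path)
print(still_exists)
```

Without the fsync() call, several passes may be merged in the operating system's cache and reach the disk as a single write.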

Confusing loop problem (python)

this is similar to the question in merge sort in python
I'm restating because I don't think I explained the problem very well over there.
basically I have a series of about 1000 files all containing domain names. altogether the data is > 1gig so I'm trying to avoid loading all the data into ram. each individual file has been sorted using .sort(get_tld) which has sorted the data according to its TLD (not according to its domain name. sorted all the .com's together, .orgs together, etc)
a typical file might look like
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviously longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.
What I want to do basically is
open all the files
loop:
read 1 line from each open file
put them all in a list and sort with .sort(get_tld)
write each item from the list to a new file
the problem I'm having is that I can't figure out how to loop over the files
I can't use with open() as because I don't have 1 file open to loop over, I have many. Also they're all of variable length so I have to make sure to get all the way through the longest one.
any advice is much appreciated.
Whether you're able to keep 1000 files open at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only an issue of whether you call that function once, or repeatedly).
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq
def merge(inputfiles, outputfile, key):
    """inputfiles: list of input, sorted files open for reading.
    outputfile: output file open for writing.
    key: callable supplying the "key" to use for each line.
    """
    # prepare the heap: items are lists with [thekey, k, theline, thefile]
    # where k is an arbitrary int guaranteed to be different for all items,
    # theline is the last line read from thefile and not yet written out,
    # (guaranteed to be a non-empty string), thekey is key(theline), and
    # thefile is the open file
    h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
    h = [[key(s), k, s, i] for k, s, i in h if s]
    heapq.heapify(h)
    while h:
        # get and output the lowest available item (==available item w/lowest key)
        item = heapq.heappop(h)
        outputfile.write(item[2])
        # replenish the item with the _next_ line from its file (if any)
        item[2] = item[3].readline()
        if not item[2]: continue # don't reinsert finished files
        # compute the key, and re-insert the item appropriately
        item[0] = key(item[2])
        heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
    return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a "decorate / undecorate" approach similar to the one I use in the more portable code above.
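In Python 3.5 and later, heapq.merge accepts a key argument, which removes the need for manual decoration. The sketch below is only an illustration under that assumption: it uses io.StringIO objects in place of real files, and the function name merge_sorted is made up. Each input must already be sorted by the same key.

```python
import heapq
import io

def tld(domain):
    """Extract the top-level domain from a line such as 'example.org\\n'."""
    return domain.rsplit('.', 1)[-1].strip()

def merge_sorted(inputs, output, key=tld):
    """Lazily merge already-sorted line iterables into output, ordered by key."""
    for line in heapq.merge(*inputs, key=key):   # key= requires Python 3.5+
        output.write(line)

# Stand-ins for open files; each one is already grouped and sorted by TLD.
first = io.StringIO("something.ca\nsomethingnew.com\netc.org\n")
second = io.StringIO("somethingelse.ca\nanother.net\n")
merged = io.StringIO()
merge_sorted([first, second], merged)
print(merged.getvalue())
```

Only one pending line per input is held in memory at a time, so the total data size does not matter, although the limit on simultaneously open files still does.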
You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not you may have to do it in 2 or more passes.
Why don't you divide the domains by first letter, so you would just split the source files into 26 or more files which could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM and sort them and write them out to a common file.
So:
3 input files split into 26+ source files
26+ source files could be loaded individually, sorted in RAM and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
Your algorithm for merging sorted files is incorrect. What you should do instead is read one line from each file, find the lowest-ranked item among all the lines read, and write it to the output file. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.
#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN
Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""
# Ported to Python 3: open() replaces file(), and the dict is sorted directly instead of via iterkeys().
import sys, os
spools = {}
for name in sys.argv[1:]:
    for line in open(name):
        if line == "\n": continue
        tld = line[line.rindex(".")+1:-1]
        spool = spools.get(tld, None)
        if spool is None:
            # first time this TLD is seen: create its temporary spool file
            spool = open(tld + ".spool", "w+")
            spools[tld] = spool
        spool.write(line)
for tld in sorted(spools):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)
