This is similar to my earlier question about merge sort in Python.
I'm restating it because I don't think I explained the problem very well over there.
Basically I have a series of about 1000 files, all containing domain names. Altogether the data is > 1 GB, so I'm trying to avoid loading it all into RAM. Each individual file has been sorted using .sort(get_tld), which has sorted the data according to its TLD, not according to its domain name (all the .com's together, all the .org's together, etc.).
A typical file might look like:
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviously longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.
What I want to do basically is
open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file
The problem I'm having is that I can't figure out how to loop over the files.
I can't use with open() as, because I don't have one file open to loop over; I have many. Also, they're all of variable length, so I have to make sure to get all the way through the longest one.
Any advice is much appreciated.
Whether you're able to keep 1000 files open at once is a separate issue and depends on your OS and its configuration; if you can't, you'll have to proceed in two steps: merge groups of N files into temporary ones, then merge the temporary ones into the final-result file. Two steps should suffice, as they let you merge a total of N squared files, so as long as N is at least 32, merging 1000 files is possible. In any case, this is separate from the "merge N input files into one output file" task itself (it's only a question of whether you call that function once or repeatedly).
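For instance, a rough sketch of that two-step (in general, recursive) driver, built around the merge() function defined below; merge_all, the group size N, and the temp-file naming are my illustrations, not part of the original answer:

import os

def merge_all(filenames, outname, key, N=100):
    # merge any number of sorted files without holding more than N open
    if len(filenames) <= N:
        inputs = [open(name) for name in filenames]
        with open(outname, 'w') as out:
            merge(inputs, out, key)
        for f in inputs:
            f.close()
        return
    # too many files: merge groups of N into temporaries, then recurse
    temps = []
    for start in range(0, len(filenames), N):
        tmp = '%s.tmp%d' % (outname, len(temps))
        merge_all(filenames[start:start + N], tmp, key, N)
        temps.append(tmp)
    merge_all(temps, outname, key, N)
    for tmp in temps:
        os.remove(tmp)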
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq

def merge(inputfiles, outputfile, key):
    """inputfiles: list of input, sorted files open for reading.
    outputfile: output file open for writing.
    key: callable supplying the "key" to use for each line.
    """
    # prepare the heap: items are lists with [thekey, k, theline, thefile]
    # where k is an arbitrary int guaranteed to be different for all items,
    # theline is the last line read from thefile and not yet written out,
    # (guaranteed to be a non-empty string), thekey is key(theline), and
    # thefile is the open file
    h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
    h = [[key(s), k, s, i] for k, s, i in h if s]
    heapq.heapify(h)
    while h:
        # get and output the lowest available item (==available item w/lowest key)
        item = heapq.heappop(h)
        outputfile.write(item[2])
        # replenish the item with the _next_ line from its file (if any)
        item[2] = item[3].readline()
        if not item[2]: continue  # don't reinsert finished files
        # compute the key, and re-insert the item appropriately
        item[0] = key(item[2])
        heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
    return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
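Putting the pieces together might then look like this (my sketch, not part of the original answer; the list of input file names is a placeholder):

input_names = ['domains-0001.txt', 'domains-0002.txt']  # ... and so on
inputs = [open(name) for name in input_names]
with open('merged.txt', 'w') as out:
    merge(inputs, out, tld)
for f in inputs:
    f.close()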
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a decorate/undecorate approach similar to the one I use in the more portable code above.
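A minimal sketch of that alternative, assuming the tld() key function above; the decorating generator keeps file objects out of the compared tuples entirely, and the names are illustrative:

import heapq

def decorated(f):
    # yield (key, line) pairs; tuples of strings always compare safely
    for line in f:
        yield tld(line), line

def merge_with_heapq(inputfiles, outputfile):
    streams = [decorated(f) for f in inputfiles]
    for key, line in heapq.merge(*streams):
        outputfile.write(line)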
You want to use a merge sort, e.g. heapq.merge. I'm not sure whether your OS allows you to open 1000 files simultaneously; if not, you may have to do it in two or more passes.
Why not divide the domains by first letter? You would just split the source files into 26 or more files, which could be named something like domains-a.dat, domains-b.dat, and so on. Then you can load each of these entirely into RAM, sort it, and write it out to a common file.
So:
3 input files are split into 26+ source files.
The 26+ source files are loaded individually, sorted in RAM, and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
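A rough sketch of the two passes under these assumptions (my illustration; all file names and glob patterns are placeholders):

import glob

# pass 1: bucket every domain by its first letter
buckets = {}
for name in glob.glob('input/*.dat'):
    for line in open(name):
        if not line.strip():
            continue
        first = line[0].lower()
        if first not in buckets:
            buckets[first] = open('domains-%s.dat' % first, 'w')
        buckets[first].write(line)
for f in buckets.values():
    f.close()

# pass 2: each bucket now fits in RAM; sort it and append to the result
with open('combined.dat', 'w') as combined:
    for name in sorted(glob.glob('domains-*.dat')):
        combined.writelines(sorted(open(name)))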
Your algorithm for merging sorted files is incorrect. What you should do is: read one line from each file; find the lowest-ranked item among all the lines read; write it to the output file; then read the next line from the file that item came from. Repeat this process (ignoring any files that are at EOF) until all the files have been exhausted.
#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to
standard output grouped by TLD.
"""
import sys, os

spools = {}

for name in sys.argv[1:]:
    for line in file(name):
        if line == "\n":
            continue
        tld = line[line.rindex(".") + 1:-1]
        spool = spools.get(tld)
        if spool is None:
            # one temporary "spool" file per TLD, opened for read+write
            spool = file(tld + ".spool", "w+")
            spools[tld] = spool
        spool.write(line)

for tld in sorted(spools.iterkeys()):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)
Related
Good afternoon. I have a nested list of IPs and MACs, of arbitrary length:
A = [['10.0.0.1','00:4C:3S:**:**:**', 0], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
I want to check whether each MAC from the list is in the oui file:
E043DB (base 16) Shenzhen
2405f5 (base 16) Integrated
3CD92B (base 16) Hewlett Packard
...
If a MAC from the list is in the file, I want to write the manufacturer's name as the third list item. With my attempt below, only the first element gets checked; the remaining ones are never checked. How can I fix this?
f = open('oui.txt', 'r')
for values in A:
    for line in f.readlines():
        if values[1][0:8].replace(':', '') in line:
            values[2] = line.split('(base 16)')[1].strip()
f.close()
print(A)
And I get this result:
A = [['10.0.0.1','00:4C:3S:**:**:**', 'Firm Name'], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
The Problem
Consider the "shape" of your code:
f = open('a file')
for values in ['some list']:
    for line in f.readlines():
Your two loops are doing this:
Start with first value in list
Read all lines remaining in file object f
Move to next value in list
Read all lines remaining in file object f
Except that "read all lines remaining" only works the first time you ask.
So, unless you have some way to put more lines into f (which can happen with async files like stdin!), you get one "good" pass through the file; on every subsequent pass the file object already points at the end of the file, so you'll get nothing.
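A quick way to see this for yourself (a hypothetical demonstration, not from the original question):

f = open('oui.txt')
print(len(f.readlines()))  # all of the lines
print(len(f.readlines()))  # 0 -- f is already at end-of-file
f.close()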
A Solution
When you are dealing with a file, you want to only process it one time. File I/O is expensive compared to other operations. So you can choose to either (a) read the entire file into memory, and do whatever you want since it's not a file any more; or (b) scan through it only one time.
If you choose to scan through it only once, the easy solution is just to invert the two for loops. Instead of doing this:
for item in list:
    for line in file:
Do this instead:
for line in file:
    for item in list:
And presto! You are now only reading the file one time.
Other Considerations
If I look at your code, and your examples, it seems like you are trying for an exact match on a particular key. You trim down the MAC addresses in your list to check them against the manufacturer ids.
This suggests to me that you might well have many, many more list values (source MAC addresses) than you have manufacturers. So perhaps you should consider reading the contents of the file into memory, rather than processing it one line at a time.
Once you have the file in memory, consider building a proper dictionary. You have a key (MAC prefix) and a value (manufacturer). So build something like:
mac_to_mfg = {}
with open('oui.txt') as f:
    for line in f:
        if '(base 16)' not in line:
            continue  # skip lines that don't contain a prefix entry
        mac = line.split('(base 16)')[0].strip()
        mfg = line.split('(base 16)')[1].strip()
        mac_to_mfg[mac] = mfg
Then you can make one pass through the source addresses and use the dict's O(1) lookup to your advantage:
for src in A:
    prefix = src[1][:8].replace(':', '')
    if prefix in mac_to_mfg:
        src[2] = mac_to_mfg[prefix]  # etc...
The problem is you got the order of the loops reversed. Usually this isn't that big of a problem, but when working with objects that are consumed (like the file object's stream) the contents will no longer be produced once they have been iterated over.
You'll need to iterate over the lines first, and then for each line iterate through A to check the values:
with open('oui.txt', 'r') as f:
    for line in f.readlines():
        for values in A:
            if values[1][0:8].replace(':', '') in line:
                values[2] = line.split('(base 16)')[1].strip()

print(A)
Notice I changed your file opening to use the with context manager instead; once your code exits the with block, it automatically close()s the file for you. This is recommended over opening the file manually, as you might forget to close() it afterwards.
I have a problem that I have not been able to solve. I have 4 .txt files, each between 30 and 70 GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempts so far has been simply to save all entries in a dictionary and count them, i.e.
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8 GB of RAM available). The data follows a Zipfian distribution: the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where each entry is saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename)
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow, and the writing speed gets slower and slower as it goes. Unless the writing speed can be increased, this approach is infeasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not make any significant difference in processing speed.
I've been thinking about saving the entries which start with certain letters in separate files, i.e. all the entries which start with 'a' are saved in a.txt, and so on (this should be doable using defaultdict(int)).
However, to do this the files would have to be iterated once for every letter, which is implausible given the file sizes (max = 69 GB).
Perhaps, when iterating over the file, one could open a pickle, save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot, due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries in one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this using Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object to be sorted gets loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
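For a concrete picture, here is a rough sketch of one such approach (sort chunks that fit in RAM, then stream-merge them and count adjacent duplicates); this is my illustration under stated assumptions, and chunk_size as well as all file names are arbitrary:

import heapq
import itertools
import os

def sorted_chunks(path, chunk_size=5000000):
    # write sorted chunks of the input to temporary files; yield their names
    with open(path) as f:
        for n in itertools.count():
            lines = sorted(itertools.islice(f, chunk_size))
            if not lines:
                return
            tmp = '%s.chunk%d' % (path, n)
            with open(tmp, 'w') as out:
                out.writelines(lines)
            yield tmp

def count_entries(path, out_path):
    chunks = list(sorted_chunks(path))
    streams = [open(c) for c in chunks]
    with open(out_path, 'w') as out:
        # identical lines are adjacent in the merged stream, so each
        # group can be counted without a giant in-memory dictionary
        for line, group in itertools.groupby(heapq.merge(*streams)):
            out.write('%s : %d\n' % (line.strip(), sum(1 for _ in group)))
    for s in streams:
        s.close()
    for c in chunks:
        os.remove(c)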
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
Or try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas, which works like Apache Spark (and works on a single machine too). By the way, it scales.
I suggest trying dask.array, which cuts the large array into many small ones and implements the numpy ndarray interface with blocked algorithms, to utilize all of your cores when computing on these larger-than-memory data sets.
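As a sketch of what the Dask route could look like for this particular line-counting job, dask.bag (Dask's collection for text records, a sibling of dask.array) is arguably the closer fit; the file name patterns are placeholders:

import dask.bag as db

# lazily read the n-gram files, strip newlines, and count occurrences
lines = db.read_text('ngrams-*.txt').map(str.strip)
counts = lines.frequencies()  # (ngram, count) pairs, computed in parallel
counts.map(lambda kv: '%s : %d' % kv).to_textfiles('counts-*.txt')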
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdic(int)). However, to do this the file have to iterated once for every letter, which is implausible given the file sizes (max = 69GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ("output/" substr($0,1,1))}' input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
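For reference, a rough Python equivalent of that awk command (my sketch; it keeps one output file open per single-character prefix):

import glob
import os

os.makedirs('output', exist_ok=True)
outputs = {}
for name in glob.glob('input/*'):
    for line in open(name):
        if not line.rstrip('\n'):
            continue  # like awk's /./, skip empty lines
        prefix = line[0]
        if prefix not in outputs:
            outputs[prefix] = open(os.path.join('output', prefix), 'a')
        outputs[prefix].write(line)
for out in outputs.values():
    out.close()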
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
for line in open(filename):
    ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.
Let me start by saying I'm fairly new to python.
OK, so I'm running code to perform physics calculations/draw graphs etc. on data files, and what I need to do is loop over a number of files and sub-files. The problem is that there is a different number of sub-files in each file (e.g. file 0 has 711 sub-files, file 1 has 660-odd). It obviously doesn't like it when I run it across a file that doesn't have a sub-file at index x, so I was wondering: is there a way to get it to iterate up to the final limit of each file automatically?
What I've got is a nested loop like:
for i in range(0,120):
    for j in range(0,715):
        stuff
Cheers in advance for any help, and sorry if my explanation is bad!
Edit:
Some more of the code. What I'm actually doing is calculating/plotting the angular momentum of gas and dark matter particles. These are in halos (j), and there are a number of files (i) containing lots and lots of these halos.
import getfiby
import numpy as np
import pylab as pl

angmom = getfiby.ReadFIBY("B")

for i in range(0,120):
    for j in range(0,715):
        pos_DM = angmom.getParticleField(49,i,j,"DM","Coordinates")
        vel_DM = angmom.getParticleField(49,i,j,"DM","Velocity")
        mass_DM = angmom.getParticleField(49,i,j,"DM","Mass")
        # more stuff
getfiby is a piece of code I was given that retrieves all the data from the files (which I can't see). It's not really a massive problem, as I can still get it to run the calculations and make the plots even if the upper limit I put on the range for j exceeds the number of halos in a particular file (I get: Halo index out of bounds. Goodbye.). But I wondered if there was a nicer, tidier way of getting it to run.
You may want something like this:
a = range(3)
for i in range(5):
    try:
        b = a[i]  # if you iterate over your "subfiles" by index
    except IndexError:
        break  # break out when the list index is out of range
When you build the list of files and sub-files to process, you could keep the file list in a variable named filelist and the sub-file list in a variable called subfilelist. You could then run the loops as:
for myfile in filelist:
    # Process code
    for subfile in subfilelist:
        # Process code.
If you need to use a range, then:
filerange = len(filelist)
subfilerange = len(subfilelist)
for i in range(0, filerange):
    for j in range(0, subfilerange):
I may be misunderstanding your structure/intent here, but it sounds like you want to perform some calculations on each file in some folder structure, where the folders contain various numbers of the data files. In that case, I'd make use of the os.listdir function rather than trying to manually index each data file.
import os

for file in os.listdir(ROOT_DIRECTORY):
    for subfile in os.listdir(os.path.join(ROOT_DIRECTORY, file)):
        process(os.path.join(ROOT_DIRECTORY, file, subfile))  # or whatever
This can of course be made a little easier to look at in a few ways (personally I have my own listdir wrapper that returns full paths instead of just the basenames), but that would be my basic idea.
If you also need the index of each file, you can probably get it using some combination of enumerate and maybe sorted (e.g. for i, file in enumerate(sorted(os.listdir(...)))), but the specifics obviously depend on your filenames and directory structure.
Assuming fileSet is an iterable full of file objects, and each file object is itself an iterable full of subfile objects, then:
for i, file in enumerate(fileSet):
    for j, subfile in enumerate(file):
        stuff
But think hard about whether you really need the indices i and j, as you already have the words file and subfile to refer to the objects you are dealing with. If you don't need the indices, it's simply:
for file in fileSet:
    for subfile in file:
        stuff
Now, if by "file" you actually meant files in the operating system's filesystem, then I need you to explain to me what a subfile is in that context, as filesystem files usually cannot be nested; only directories can.
I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping time ranges, and my goal is to blend them together into one large file, sorted by timestamp. There can be ties in the sorting, in which case I want the next line to come from whatever file is listed first in my input list.
I've seen examples of how to do this using fileinput (see here), but this seems to read all the files into memory. Owing to the large size of my files this will be a problem. Because my files are pre-sorted, it seems there should be a way to merge them using a method that only has to consider the most recent unexplored line from each file.
Why roll your own if there is heapq.merge() in the standard library? Unfortunately it doesn't provide a key argument -- you have to do the decorate - merge - undecorate dance yourself:
from itertools import imap
from operator import itemgetter
import heapq

def extract_timestamp(line):
    """Extract timestamp and convert to a form that gives the
    expected result in a comparison
    """
    return line.split()[1]  # for example

with open("log1.txt") as f1, open("log2.txt") as f2:
    sources = [f1, f2]
    with open("merged.txt", "w") as dest:
        decorated = [
            ((extract_timestamp(line), line) for line in f)
            for f in sources]
        merged = heapq.merge(*decorated)
        undecorated = imap(itemgetter(-1), merged)
        dest.writelines(undecorated)
Every step in the above is "lazy". As I avoid file.readlines() the lines in the files are read as needed. Likewise the decoration process which uses generator expressions rather than list-comps. heapq.merge() is lazy, too -- it needs one item per input iterator simultaneously to do the necessary comparisons. Finally I'm using itertools.imap(), the lazy variant of the map() built-in to undecorate.
(In Python 3 map() has become lazy, so you can use that)
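(And from Python 3.5 on, heapq.merge() accepts a key argument directly, so the whole dance collapses to something like this sketch, reusing extract_timestamp() from above:)

import heapq

with open("log1.txt") as f1, open("log2.txt") as f2, \
        open("merged.txt", "w") as dest:
    dest.writelines(heapq.merge(f1, f2, key=extract_timestamp))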
You want to implement a file-based merge sort. Read a line from both files, output the older line, then read another line from that file. Once one of the files is exhausted, output all the remaining lines from the other file.
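A minimal sketch of that idea, assuming a key function such as extract_timestamp() from the answer above; merge_two and its parameters are illustrative names. Note the <= comparison, which makes ties come from the first file, as the question requires:

def merge_two(path_a, path_b, out_path, key):
    # merge two files pre-sorted by key; ties go to the first file
    with open(path_a) as a, open(path_b) as b, open(out_path, 'w') as out:
        line_a, line_b = a.readline(), b.readline()
        while line_a and line_b:
            if key(line_a) <= key(line_b):
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # one input is exhausted; drain whichever still has lines
        while line_a:
            out.write(line_a)
            line_a = a.readline()
        while line_b:
            out.write(line_b)
            line_b = b.readline()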
I have a file that has over 200 lines in this format:
name old_id new_id
The name is useless for what I'm trying to do currently, but I still want it there because it may become useful for debugging later.
Now I need to go through every file in a folder and find all the instances of old_id and replace them with new_id. The files I'm scanning are code files that could be thousands of lines long. I need to scan every file with each of the 200+ ids that I have, because some may be used in more than one file, and multiple times per file.
What is the best way to go about doing this? So far I've been writing Python scripts to figure out the list of old ids and new ids and which ones match up, but I've been doing it very inefficiently: I basically scanned the first file line by line, got the id on the current line, then scanned the second file line by line until I found a match, and did this over again for each line in the first file, which meant reading the second file a lot. I didn't mind doing it inefficiently, because the files were small.
Now that I'm searching probably somewhere around 30-50 files that can have thousands of lines of code each, I want it to be a little more efficient. This is just a hobbyist project, so it doesn't need to be super good; I just don't want it to take more than 5 minutes to find and replace everything, only to look at the result, see that I made a little mistake, and have to do it all over again. Taking a few minutes is fine (although I'm sure computers nowadays can do this almost instantly), I just don't want it to be ridiculous.
So what's the best way to go about doing this? So far I've been using Python, but it doesn't need to be a Python script. I don't care about elegance in the code or the way I do it; I just want an easy way to replace all of my old ids with my new ids, using whatever tool is easiest to use or implement.
Examples:
Here is a line from the list of ids. The first part is the name and can be ignored, the second part is the old id, and the third part is the new id that needs to replace the old id.
unlock_music_play_grid_thumb_01 0x108043c 0x10804f0
Here is an example line in one of the files to be replaced:
const v1, 0x108043c
I need to be able to replace that id with the new id so it looks like this:
const v1, 0x10804f0
Use something like multiwordReplace (I've edited it for your situation) with mmap.
import os
import os.path
import re
from mmap import mmap
from contextlib import closing

id_filename = 'path/to/id/file'
directory_name = 'directory/to/replace/in'

# read the ids into a dictionary mapping old to new
with open(id_filename) as id_file:
    ids = dict(line.split()[1:] for line in id_file)

# compile a regex that matches any of the old ids
id_regex = re.compile('|'.join(map(re.escape, ids)))

def translate(match):
    return ids[match.group(0)]

def multiwordReplace(text):
    return id_regex.sub(translate, text)

for code_filename in os.listdir(directory_name):
    file_path = os.path.join(directory_name, code_filename)
    # map the file into memory and run the replacement in one pass
    with open(file_path, 'r+') as code_file:
        with closing(mmap(code_file.fileno(), 0)) as code_map:
            new_file = multiwordReplace(code_map[:])
    # the map is closed; now rewrite the file with the new contents
    with open(file_path, 'w') as code_file:
        code_file.write(new_file)
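Two things make this fast compared to nested scanning: the 200+ ids are compiled into a single regex alternation, so each code file is scanned exactly once no matter how many ids there are, and the dictionary lookup inside translate() resolves each match in constant time.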