Image recovery from binary file - python

The objective is to extract an image from a binary file. How do I search a binary file for the filetype's markers, SOI and EOI?
Regular find() functions don't seem to work as I cannot load the binary file as a string.

You want to search for a magic word in a stream (not a string).
Here's the idea:
read one char at a time (use file.read(1)) from the file
keep a queue the length of your magic word, and check the queue after each read
MAGIC_WORD = b'JPEG'  # it's an example... just an example

def find_magic(f):
    """Scan an already-opened binary file ('rb') and return the offset of MAGIC_WORD, or -1."""
    l = list(f.read(len(MAGIC_WORD)))
    offset = 0
    while True:
        if bytes(l) == MAGIC_WORD:
            return offset
        c = f.read(1)
        if not c:        # end of file reached without a match
            return -1
        offset += 1
        l.pop(0)
        l.append(c[0])
If you feel the need... I mean, the need for speed, check the wiki article on string-searching algorithms, use a smarter algorithm, and finally switch to C++.
Sorry, I don't know of an existing Python library that does this. Good luck.
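For this particular case there may also be a simpler route: in Python 3 you can open the file in binary mode and search the raw bytes directly with bytes.find(), so the "can't load the binary file as a string" problem goes away. The JPEG SOI marker is the two bytes FF D8 and EOI is FF D9. A minimal sketch (the file names are placeholders, and it reads the whole file into memory, which may or may not be acceptable for you):
SOI = b'\xff\xd8'  # JPEG start-of-image marker
EOI = b'\xff\xd9'  # JPEG end-of-image marker

with open('yourbinfile', 'rb') as f:   # 'rb' yields bytes, so find() works
    data = f.read()

start = data.find(SOI)
if start != -1:
    end = data.find(EOI, start)        # look for EOI after the SOI position
    if end != -1:
        with open('recovered.jpg', 'wb') as out:
            out.write(data[start:end + len(EOI)])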

Another thought:
if you can use a Unix shell (instead of Python), you can try to use Unix pipes and chain some search tools (like grep and xxd),
like
cat yourbinfile | xxd -p | grep HEXMAGICWORD
where HEXMAGICWORD is the output of
echo -n jpeg | xxd -p
I'm not very familiar with shell, so this isn't an exact answer.

Related

How to extract a user-defined region from a fasta file with a list in another file

I have a multi-fasta sequence file: test.fasta
>Ara_001
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_002
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_003
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
I have another list file with a range: range.txt
Ara_001 3 60
Ara_002 10 80
Ara_003 20 50
I want to extract the defined regions.
My expected output would be:
>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VT
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKP
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG
I tried:
#!/bin/bash
lines=$(awk 'END {print NR}' range.txt)
for ((a=1; a<= $lines ; a++))
do
number=$(awk -v lines=$a 'NR == lines' range.txt)
grep -v ">" test.fasta | awk -v lines=$a 'NR == lines' | cut -c$number
done;
Do not reinvent the wheel. Use standard bioinformatics tools written for this purpose and used widely. In your example, use bedtools getfasta. Reformat your regions file to be in 3-column bed format, then:
bedtools getfasta -fi test.fasta -bed range.bed
Install the bedtools suite, for example using conda (specifically miniconda), like so:
conda create --name bedtools bedtools
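For reference, BED coordinates are 0-based with an exclusive end, so if the start/stop columns in range.txt already follow that convention (the Biopython answer below treats them that way), range.bed is simply the same three columns, tab-separated; if your ranges are 1-based and inclusive, subtract 1 from the start column first. For example:
Ara_001	3	60
Ara_002	10	80
Ara_003	20	50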
Using biopython:
# read ranges as dictionary
with open('range.txt') as f:
    ranges = {ID: (int(start), int(stop)) for ID, start, stop in map(lambda s: s.strip().split(), f)}
# {'Ara_001': (3, 60), 'Ara_002': (10, 80), 'Ara_003': (20, 50)}

# load input fasta and slice
from Bio import SeqIO
with open('test.fasta') as handle:
    out = [r[slice(*ranges[r.id])] for r in SeqIO.parse(handle, 'fasta')]

# export sliced sequences
with open('output.fasta', 'w') as handle:
    SeqIO.write(out, handle, 'fasta')
Output file:
>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGEVTSHLQGMFN
RTIRLLEAGI
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG
NB. With this quick code there must be an entry for each sequence id in range.txt; it's however quite easy to modify it to use a default behavior in case one is absent.
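For instance, a minimal tweak (assuming the sensible default is to keep the whole sequence) is to swap the dictionary lookup for dict.get():
# slice(0, None) keeps the full sequence when the id has no entry in range.txt
out = [r[slice(*ranges.get(r.id, (0, None)))] for r in SeqIO.parse(handle, 'fasta')]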
When I wrote my initial answer I misunderstood the question. Your situation is not so much dependent on any programming languages. You seem to need the utility in Timur Shtatland's answer. Either install that utility or use mozway's python code snippet which handles both of your files.
I read the format of your question as if you had 4 individual files, not two.
With Biotite, a package I am developing, this can be done with:
import biotite.sequence.io.fasta as fasta
input_file = fasta.FastaFile.read("test.fasta")
output_file = fasta.FastaFile()
with open("range.txt") as file:
for line in file.read().splitlines():
seq_id, start, stop = line.split()
start = int(start)
stop = int(stop)
output_file[seq_id] = input_file[seq_id][start : stop]
output_file.write("path/to/output.fasta")

how to pipe stdin directly into python and parse like grep?

I'm trying to perform a sed/awk style regex substitution with python3's re module.
You can see it works fine here with a hardcoded test string:
#!/usr/bin/env python3
import re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
line = ("21:21:54.165651 stat64 this/ 0.000012 THiNG1.12471\n"
"21:21:54.165652 stat64 /that 0.000012 2thIng.12472\n"
"21:21:54.165653 stat64 /and/the other thing.xml 0.000012 With S paces.12473\n"
"21:21:54.165654 stat64 /and/the_other_thing.xml 0.000012 without_em_.12474\n"
"21:59:57.774616 fstat64 F=90 0.000002 tmux.4129\n")
result = re.sub(regex, subst, line, 0, re.MULTILINE)
if result:
    print(result)
But I'm having some trouble getting it to work the same way with the stdin:
#!/usr/bin/env python3
import sys, re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
for line in str(sys.stdin):
    #sys.stdout.write(line)
    result = re.sub(regex, subst, line, 0, re.MULTILINE)
    if result:
        print(result,end='')
I'd like to be able to pipe input straight into it from another utility, like is common with grep and similar CLI utilities.
Any idea what the issue is here?
Addendum
I tried to keep the question simple and generalized in the hope that answers might be useful in similar but different situations, and useful to more people. However, the details might shed some more light on the problem, so here are the exact details of my current scenario:
The desired input to my script is actually the output stream from a utility called fs_usage. It's similar to utilities like ps, but provides a constant stream of system calls and filesystem operations. It tells you which files are being read from, written to, etc. in real time.
From the manual:
NAME
fs_usage -- report system calls and page faults related to filesystem activity in real-time
DESCRIPTION
The fs_usage utility presents an ongoing display of system call usage information pertaining to filesystem activity. It requires root privileges due to the kernel tracing facility it uses to operate.
By default, the activity monitored includes all system processes except for:
fs_usage, Terminal.app, telnetd, telnet, sshd, rlogind, tcsh, csh, sh, zsh. These defaults can be overridden such that output is limited to include or exclude (-e) a list of processes specified by the user.
The output presented by fs_usage is formatted according to the size of your window.
A narrow window will display fewer columns. Use a wide window for maximum data display.
You may override the formatting restrictions by forcing a wide display with the -w option.
In this case, the data displayed will wrap when the window is not wide enough.
I hacked together a crude little bash script to rip the process names from the stream and dump them to a temporary log file. You can think of it as a filter or an extractor. Here it is as a function that will dump straight to stdout (uncomment the redirection on the last line to dump to the file instead).
proc_enum ()
{
    while true; do
        sudo fs_usage -w -e 'grep' 'awk' |
            grep -E -o '(?:\d\.\d{6})\s{3}\S+\.\d+' |
            awk '{print $2}' |
            awk -F '.' '{print $1}' \
            #>/tmp/proc_names.logx
    done
}
Useful Links
Regular Expressions 101
Stack Overflow - How to pipe input to python line by line from linux program?
The problem is str(sys.stdin). What Python will do in the for loop is this:
i = iter(str(sys.stdin))
# then in every iteration
next(i)
Here you are converting the file object to a str; the result on my computer is:
str(sys.stdin) == "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='cp1256'>"
So you are not looping over the lines received from stdin, you are looping over the characters of that string representation of the file object.
And another problem: in the first example you are applying re.sub to the entire text, but here you are applying it to each line separately, so you should either concatenate the result of each line or concatenate the lines into a single text before applying re.sub.
import sys, re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
result = ''
for line in sys.stdin:
    # here you should convert the input but I think it is optional
    line = str(line)
    result += re.sub(regex, subst, line, 0, re.MULTILINE)
if result:
    print(result, end='')
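Alternatively, since the working hard-coded example applies re.sub to the whole multi-line string at once, the closest equivalent is to read all of stdin first and substitute once. A minimal sketch; note that with a live stream such as fs_usage this only produces output once the pipe closes, so the line-by-line loop above is the better fit there:
import sys, re

regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "

result = re.sub(regex, subst, sys.stdin.read(), 0, re.MULTILINE)
if result:
    print(result, end='')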

Read a python variable in a shell script?

my python file has these 2 variables:
week_date = "01/03/16-01/09/16"
cust_id = "12345"
How can I read these into a shell script that takes in these 2 variables?
My current shell script requires manual editing of "dt" and "id". I want to read the python variables into the shell script so I can just edit my python parameter file and not so many files.
shell file:
#!/bin/sh
dt="01/03/16-01/09/16"
cust_id="12345"
In a new python file I could just import the parameter python file.
Consider something akin to the following:
#!/bin/bash
# ^^^^ NOT /bin/sh, which doesn't have process substitution available.
python_script='
import sys
d = {}                                       # create a context for variables
exec(open(sys.argv[1], "r").read()) in d     # execute the Python code in that context
for k in sys.argv[2:]:
    print "%s\0" % str(d[k]).split("\0")[0]  # ...and extract your strings NUL-delimited
'

read_python_vars() {
  local python_file=$1; shift
  local varname
  for varname; do
    IFS= read -r -d '' "${varname#*:}"
  done < <(python -c "$python_script" "$python_file" "${@%%:*}")
}
You might then use this as:
read_python_vars config.py week_date:dt cust_id:id
echo "Customer id is $id; date range is $dt"
...or, if you didn't want to rename the variables as they were read, simply:
read_python_vars config.py week_date cust_id
echo "Customer id is $cust_id; date range is $week_date"
Advantages:
Unlike a naive regex-based solution (which would have trouble with some of the details of Python parsing -- try teaching sed to handle both raw and regular strings, and both single and triple quotes without making it into a hairball!) or a similar approach that used newline-delimited output from the Python subprocess, this will correctly handle any object for which str() gives a representation with no NUL characters that your shell script can use.
Running content through the Python interpreter also means you can determine values programmatically -- for instance, you could have some Python code that asks your version control system for the last-change-date of relevant content.
Think about scenarios such as this one:
start_date = '01/03/16'
end_date = '01/09/16'
week_date = '%s-%s' % (start_date, end_date)
...using a Python interpreter to parse Python means you aren't restricting how people can update/modify your Python config file in the future.
Now, let's talk caveats:
If your Python code has side effects, those side effects will obviously take effect (just as they would if you chose to import the file as a module in Python). Don't use this to extract configuration from a file whose contents you don't trust.
Python strings are Pascal-style: They can contain literal NULs. Strings in shell languages are C-style: They're terminated by the first NUL character. Thus, some variables can exist in Python that cannot be represented in shell without nonliteral escaping. To prevent an object whose str() representation contains NULs from spilling forward into other assignments, this code terminates strings at their first NUL.
Now, let's talk about implementation details.
${@%%:*} is an expansion of $@ which trims all content after and including the first : in each argument, thus passing only the Python variable names to the interpreter. Similarly, ${varname#*:} is an expansion which trims everything up to and including the first : from the variable name passed to read. See the bash-hackers page on parameter expansion.
Using <(python ...) is process substitution syntax: The <(...) expression evaluates to a filename which, when read, will provide output of that command. Using < <(...) redirects output from that file, and thus that command (the first < is a redirection, whereas the second is part of the <( token that starts a process substitution). Using this form to get output into a while read loop avoids the bug mentioned in BashFAQ #24 ("I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?").
The IFS= read -r -d '' construct has a series of components, each of which makes the behavior of read more true to the original content:
Clearing IFS for the duration of the command prevents whitespace from being trimmed from the end of the variable's content.
Using -r prevents literal backslashes from being consumed by read itself rather than represented in the output.
Using -d '' sets the first character of the empty string '' to be the record delimiter. Since C strings are NUL-terminated and the shell uses C strings, that character is a NUL. This ensures that variables' content can contain any non-NUL value, including literal newlines.
See BashFAQ #001 ("How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?") for more on the process of reading record-oriented data from a string in bash.
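One more caveat worth flagging: the embedded script above is Python 2 (print statement, exec ... in). If only a python3 binary is available, a minimal sketch of just the embedded script, keeping the same NUL-delimited output, might look like this (the shell function would then invoke python3 -c instead of python -c):
python_script='
import sys
d = {}                                    # create a context for variables
exec(open(sys.argv[1], "r").read(), d)    # execute the Python code in that context
for k in sys.argv[2:]:
    sys.stdout.write("%s\0" % str(d[k]).split("\0")[0])  # ...NUL-delimited, no trailing newline
'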
Other answers give a way to do exactly what you ask for, but I think the idea is a bit crazy. There's a simpler way to satisfy both scripts - move those variables into a config file. You can even preserve the simple assignment format.
Create the config itself: (ini-style)
dt="01/03/16-01/09/16"
cust_id="12345"
In python:
config_vars = {}
with open('the/file/path', 'r') as f:
    for line in f:
        if '=' in line:
            k, v = line.split('=', 1)
            # strip whitespace/newline and the surrounding quotes from the value
            config_vars[k.strip()] = v.strip().strip('"')
week_date = config_vars['dt']
cust_id = config_vars['cust_id']
In bash:
source "the/file/path"
And you don't need to do crazy source parsing anymore. Alternatively you can just use JSON for the config file and then use the json module in Python and jq in the shell for parsing.
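A minimal sketch of the JSON variant (config.json is just an example name): on the Python side it's plain json.load, and the shell side can read the same file with something like dt=$(jq -r '.week_date' config.json).
# config.json would contain: {"week_date": "01/03/16-01/09/16", "cust_id": "12345"}
import json

with open('config.json') as f:
    config = json.load(f)

week_date = config['week_date']
cust_id = config['cust_id']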
I would do something like this. You may want to modify it a little bit to include/exclude quotes, as I haven't really tested it for your scenario:
#!/bin/sh
exec <$python_filename
while read line
do
    match=`echo $line|grep "week_date ="`
    if [ $? -eq 0 ]; then
        dt=`echo $line|cut -d '"' -f 2`
    fi
    match=`echo $line|grep "cust_id ="`
    if [ $? -eq 0 ]; then
        cust_id=`echo $line|cut -d '"' -f 2`
    fi
done

Python - Checking concordance between two huge text files

So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
@REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
@REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '@'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '@REC1', the next should be '@REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, naively iterating and doing string comparisons would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I would be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
import os

file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, since the bytes location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th lines in both files to make sure they are the same, without going over the whole file?
Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
    BEGIN { \
        n = 0; \
    } \
    { \
        a[n % 4] = $0; \
        if ((n+1) % 4 == 0) { \
            print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
        } \
        n++; \
    }' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1,f2):
        if l1.startswith("@REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
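One caveat with this sketch: zip() stops at the end of the shorter file, so a truncated file would still be reported as "No differences". If that matters, a hedged variant uses itertools.zip_longest, which pads the shorter file with None:
import itertools

with open(file1) as f1, open(file2) as f2:
    for l1, l2 in itertools.zip_longest(f1, f2):
        if l1 is None or l2 is None:
            print("The files have different numbers of lines")
            break
        if l1.startswith("@REC") and l1 != l2:
            print("Difference at record", l1)
            break
    else:
        print("No differences")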
Have you looked into using the rdiff command?
The upsides of rdiff are:
with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
it's not part of a standard Linux/UNIX distribution – you have to install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first one produces a special signature file. In the second step, a delta is created using another rdiff call (all shown below). While the 2-step process may seem annoying, it has the benefit of providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
import sys
import re

""" To find the differing records in two HUGE files. This is expected to
use minimal memory. """

def get_rec_num(fd):
    """ Look for the record number. If not found return -1 """
    while True:
        line = fd.readline()
        if len(line) == 0:
            break
        match = re.search(r'^@REC(\d+)', line)
        if match:
            num = int(match.group(1))
            return num
    return -1

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')
hf1 = dict()
hf2 = dict()
while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is found in f2 hash, no need to store in f1 hash
            if r not in hf2:
                hf1[r] = 1
            else:
                del hf2[r]
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is found in f1 hash, no need to store in f2 hash
            if r not in hf1:
                hf2[r] = 1
            else:
                del hf1[r]
print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))
print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))
Both answers from @AlexReynolds and @TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You might also want to speed up your hardware:
Replace the HDD with an SSD.
Take n SSDs and create a RAID 0. In a perfect world you will get an n-times speedup for your disk IO.
Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (This applies to a single SSD; for RAID 0 optimization one has to take a look at the RAID controller options and capabilities.) See the sketch below.
The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization, but try to read as much as needed to keep your disk reading fast. For instance, parallel reads of single rows from two files can actually slow reading down – imagine an HDD where the two rows of the two files are always on the same side of the same magnetic disk(s).
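A minimal sketch of the chunked-read idea in Python (the 16 MB figure is just the example size from above; the best value is worth benchmarking on your own hardware):
CHUNK_SIZE = 16 * 1024 * 1024   # 16 MB per read, as in the example above

with open('hugefile1', 'rb') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:           # end of file
            break
        # ... process the chunk here, e.g. count b'\n' or scan for record headers ...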

Python or Bash - Iterate all words in a text file over itself

I have a text file that contains thousands of words, e.g:
laban
labrador
labradors
lacey
lachesis
lacy
ladoga
ladonna
lafayette
lafitte
lagos
lagrange
lagrangian
lahore
laius
lajos
lakeisha
lakewood
I want to iterate every word over itself so I get:
labanlaban
labanlabrador
labanlabradors
labanlacey
labanlachesis
etc...
In bash I can do the following, but it is extremely slow:
#!/bin/bash
( cat words.txt | while read word1; do
cat words.txt | while read word2; do
echo "$word1$word2" >> doublewords.txt
done; done )
Is there a faster and more efficient way to do this?
Also, how would I iterate two different text files in this manner?
If you can fit the list into memory:
import itertools
with open(words_filename, 'r') as words_file:
    words = [word.strip() for word in words_file]

for words in itertools.product(words, repeat=2):
    print(''.join(words))
(You can also do a double-for loop, but I was feeling itertools tonight.)
I suspect the win here is that we can avoid re-reading the file over and over again; the inner loop in your bash example will cat the file once for each iteration of the outer loop. Also, I think Python just tends to execute faster than bash, IIRC.
You could certainly pull this trick with bash (read the file into an array, write a double-for loop), it's just more painful.
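For the second part of the question (two different files), a minimal sketch along the same lines (the file names are placeholders):
import itertools

with open('words1.txt') as f1:
    words1 = [w.strip() for w in f1]
with open('words2.txt') as f2:
    words2 = [w.strip() for w in f2]

for pair in itertools.product(words1, words2):
    print(''.join(pair))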
It looks like sed is pretty efficient to append a text to each line.
I propose:
#!/bin/bash
for word in $(< words.txt)
do
sed "s/$/$word/" words.txt;
done > doublewords.txt
(Don't confuse $, which means end of line for sed, with $word, which is a bash variable.)
For a 2000 line file, this runs in about 20 s on my computer, compared to ~2 min for your solution.
Remark: it also looks like you are slightly better off redirecting the standard output of the whole program instead of forcing writes at each loop.
(Warning, this is a bit off topic and personal opinion!)
If you are really going for speed, you should consider using a compiled language such as C++. For example:
vector<string> words;
ifstream infile("words.dat");
for(string line ; std::getline(infile,line) ; )
words.push_back(line);
infile.close();
ofstream outfile("doublewords.dat");
for(auto word1 : data)
for(auto word2 : data)
outfile << word1 << word2 << "\n";
outfile.close();
You need to understand that both bash and python are bad at double for loops: that's why you use tricks (@Thanatos) or predefined commands (sed). Recently, I came across a double for loop problem (given a set of 10000 points in 3d, compute all the distances between pairs) and I successfully solved it using C++ instead of python or Matlab.
If you have GHC available, Cartesian products are a cinch!
Q1: One file
-- words.hs
import Control.Applicative
main = interact f
  where f = unlines . g . words
        g x = map (++) x <*> x
This splits the file into a list of words, and then appends each word to each other word with the applicative <*>.
Compile with GHC,
ghc words.hs
and then run with IO redirection:
./words <words.txt >out
Q2: Two files
-- words2.hs
import Control.Applicative
import Control.Monad
import System.Environment
main = do
    ws <- mapM ((liftM words) . readFile) =<< getArgs
    putStrLn $ unlines $ g ws
  where g (x:y:_) = map (++) x <*> y
Compile as before and run with the two files as arguments:
./words2 words1.txt words2.txt > out
Bleh, compiling?
Want the convenience of a shell script and the performance of a compiled executable? Why not do both?
Simply wrap the Haskell program you want in a wrapper script which compiles it in /var/tmp, and then replaces itself with the resulting executable:
#!/bin/bash
# wrapper.sh
cd /var/tmp
cat > c.hs <<CODE
# replace this comment with haskell code
CODE
ghc c.hs >/dev/null
cd - >/dev/null
exec /var/tmp/c "$#"
This handles arguments and IO redirection as though the wrapper didn't exist.
Results
Timing against some of the other answers with two 2000 word files:
$ time ./words2 words1.txt words2.txt >out
3.75s user 0.20s system 98% cpu 4.026 total
$ time ./wrapper.sh words1.txt words2.txt > words2
4.12s user 0.26s system 97% cpu 4.485 total
$ time ./thanatos.py > out
4.93s user 0.11s system 98% cpu 5.124 total
$ time ./styko.sh
7.91s user 0.96s system 74% cpu 11.883 total
$ time ./user3552978.sh
57.16s user 29.17s system 93% cpu 1:31.97 total
You can do this in a Pythonic way by creating a tempfile, writing data to it while reading the existing file, and finally removing the original file and moving the new file to the original path.
import sys
from os import remove
from shutil import move
from tempfile import mkstemp

def data_redundent(source_file_path):
    fh, target_file_path = mkstemp()
    with open(target_file_path, 'w') as target_file:
        with open(source_file_path, 'r') as source_file:
            for line in source_file:
                target_file.write(line.replace('\n', '') + line)
    remove(source_file_path)
    move(target_file_path, source_file_path)

data_redundent('test_data.txt')
I'm not sure how efficient this is, but a very simple way, using the Unix tool specifically designed for this sort of thing, would be
paste -d'\0' <file> <file>
The -d option specifies the delimiter to be used between the concatenated parts, and '\0' indicates an empty delimiter (i.e. no delimiter at all).
