I have many large (~30 MB apiece) tab-delimited text files with variable-width lines. I want to extract the 2nd field from the nth line (here, n=4) and from the next-to-last line (the last line is empty). I can get them separately using awk:
awk 'NR==4{print $2}' filename.dat
and (though I don't entirely understand how this works):
awk '{y=x "\n" $2};END{print y}' filename.dat
but is there a way to get them together in one call? My broader intention is to wrap it in a Python script to harvest these values from a large number of files (many thousands) in separate directories and I want to reduce the number of system calls. Thanks a bunch -
Edit: I know I can read through the whole file with Python to extract those values, but I thought awk might be more appropriate for the task, since one of the two values is located near the end of a large file.
awk 'NR==4{print $2};{y=x "\n" $2};END{print y}' filename.dat
You can pass the number of lines into awk:
awk -v lines=$( wc -l < filename.dat ) -v n=4 '
NR == n || NR == lines-1 {print $2}
' filename.dat
Note, in the wc command, use the < redirection to avoid the filename being printed.
Here's how to implement this in Python without reading the whole file:
To get the nth line, you have no choice but to read the file up to the nth line, since the lines are variable width.
To get the second-to-last line, guess how long the line might be (be generous) and seek to that many bytes before the end of the file.
read() from the point you seeked to and count the number of newline characters: you need at least two. If there are fewer than two newlines, double your guess and try again.
Split the data you read at newlines; the line you want will be the second-to-last item in the split.
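A minimal sketch of those steps (the function name and default block size are placeholders, not from the answer; the file is opened in binary mode so seeking from the end works):

def field_near_end(filename, guess=4096):
    # a sketch of the steps above: read a block from the end of the file,
    # growing the guess until it contains at least two newlines
    with open(filename, 'rb') as f:
        f.seek(0, 2)                       # jump to the end to learn the file size
        filesize = f.tell()
        while True:
            size = min(guess, filesize)
            f.seek(-size, 2)               # 'size' bytes before the end
            data = f.read(size)
            if data.count(b'\n') >= 2 or size == filesize:
                break
            guess *= 2                     # not enough newlines yet: read a bigger block
    # per the description above, the wanted line is the second-to-last item of the split;
    # adjust the index if your files end with an extra blank line
    return data.split(b'\n')[-2].split(b'\t')[1]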
This is my solution in Python. Inspired by this other code:
def readfields(filename, nfromtop=3, nfrombottom=-2, fieldnum=1, blocksize=4096):
    f = open(filename, 'r')
    out = ''
    for i, line in enumerate(f):
        if i == nfromtop:                          # 0-based, so this is the 4th line
            out += line.split('\t')[fieldnum] + '\t'
            break
    f.seek(-blocksize, 2)                          # jump to blocksize bytes before the end
    out += f.read(blocksize).split('\n')[nfrombottom].split('\t')[fieldnum]
    f.close()
    return out
When I profiled it, it was 0.09 seconds quicker than a solution calling awk (awk 'NR==4{print $2};{y=x $2};END{print y}' filename.dat) with the subprocess module. Not a dealbreaker, but since the rest of the script is in Python, there appears to be a payoff in staying there (especially since I have a lot of these files).
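For reference, calling awk from Python with the subprocess module looks roughly like this (a sketch of the kind of call being compared, not the exact code that was timed):

import subprocess

# run the combined awk program and capture its stdout as a string
cmd = ['awk', 'NR==4{print $2};{y=x $2};END{print y}', 'filename.dat']
output = subprocess.check_output(cmd)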
I have a multifasta file that looks like this:
(all sequences are >100 bp, may span more than one line, and are the same length)
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage3_samplenameD
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I need to remove the duplicates BUT keep at least one sequence per lineage. So in the simple example above (notice samplenameA, C, and D are identical) I would want to remove only samplenameD or samplenameC, but not both. In the end I want to keep the same header information as in the original file.
Example output:
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I found out a way that works to remove just the duplicates. Thanks to Pierre Lindenbaum.
sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t $'\t' -f -k 2,2 |\
sed -e 's/^/>/' -e 's/\t/\n/'
Running this on my example above would result in:
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
-> so losing the Lineage3 sequence.
Now I’m just looking for a quick solution to remove duplicates but keep at least one sequence per lineage based on the fasta header.
I’m new to scripting... any ideas in bash/python/R are welcome.
Thanks!!!
In this case I can see two relatively good alternatives: A) look into existing tools, such as the Biopython library or the FASTX toolkit; I think both of them have good commands to do most of the work here, so it may be worthwhile to learn them. Or B) write your own. In that case you may want to try (I'll stick to Python):
Loop over the file, line by line, and add the lineage/sequence data to a dictionary. I suggest using the sequence as the key; this way, you can easily tell whether you have already encountered it:
myfasta = {}                      # sequence -> list of lineage ids

# inside the loop, once 'sequence' and 'lineage_id' have been parsed:
if sequence in myfasta:
    myfasta[sequence].append(lineage_id)
else:
    myfasta[sequence] = [lineage_id]
This way your key (sequence) will hold the list of lineage_ids that have the same sequence. Note that the annoying bits of this solution will be to loop over the file, separate lineage-id from sequence, account for sequences that may extend to multiple lines, etc.
After that, you can loop over the dictionary and write the sequences to a file, using only the first lineage_id from the list for each sequence.
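A minimal sketch of that approach (file names are placeholders; it joins multi-line sequences before using them as dictionary keys):

myfasta = {}                      # sequence -> list of headers (lineage ids)
header, chunks = None, []

with open('file.fasta') as fh:
    for line in fh:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if header is not None:
                myfasta.setdefault(''.join(chunks), []).append(header)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:                     # don't forget the last record
        myfasta.setdefault(''.join(chunks), []).append(header)

with open('deduplicated.fasta', 'w') as out:
    for sequence, headers in myfasta.items():
        # write one entry per distinct sequence, keeping the first header seen;
        # to keep one entry per lineage instead, write one header per distinct
        # lineage prefix (the part before the underscore) found in 'headers'
        out.write(headers[0] + '\n' + sequence + '\n')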
I have a gigantic JSON file that was accidentally output without a newline character between the JSON entries, so it is being treated as one giant single line. What I did was try a find-and-replace with sed to insert a newline:
sed 's/{"seq_id"/\n{"seq_id"/g' my_giant_json.json
It doesn't output anything
However, I know my sed expression works, because if I operate on just a small part of the file it works fine:
head -c 1000000 my_giant_json.json | sed 's/{"seq_id"/\n{"seq_id"/g'
I have also tried using Python with this gnarly one-liner:
'\n{"seq_id'.join(open(json_file,'r').readlines()[0].split('{"seq_id')).lstrip()
But this loads the whole file into memory because of the readlines() call, and I don't know how to iterate through a giant single line of characters in chunks and do a find and replace.
Any thoughts?
Perl will let you change the input separator ($/) from newline to another character. You could take advantage of this to get some convenient chunking.
perl -pe'BEGIN{$/="}"}s/^({"seq_id")/\n$1/' my_giant_json.json
That sets the input separator to be "}". Then it looks for chunks that start with {"seq_id" and prefixes them with a newline.
Note that it puts an unnecessary empty line at the beginning. You could complicate the program to eliminate that, or just delete it manually afterwards.
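If you would rather stay in Python, a rough equivalent (not part of the answer above; file names are placeholders) is to stream the file in fixed-size chunks and hold back a short tail so a marker that straddles two reads is still caught:

MARKER = '{"seq_id"'

with open('my_giant_json.json') as src, open('fixed.json', 'w') as dst:
    tail = ''
    while True:
        chunk = src.read(1 << 20)                  # 1 MiB at a time
        if not chunk:
            dst.write(tail)                        # flush whatever is left
            break
        buf = (tail + chunk).replace(MARKER, '\n' + MARKER)
        # hold back the last few characters in case a marker straddles two reads
        keep = len(MARKER) - 1
        tail = buf[-keep:]
        dst.write(buf[:-keep])

Like the Perl one-liner, this still leaves a blank first line if the file starts with the marker.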
I have a space separated file (file1.csv) on which I perform 3 UNIX operations manually, namely:
step1. removing all double quotes(") from each line.
sed 's/"//g' file1.csv > file_tmp1.csv
step2. removing all white spaces at the beginning of any line.
sed 's/^ *//' file_tmp1.csv > file_tmp2.csv
step3. removing all additional white spaces in between texts of each line.
cat file_tmp2.csv | tr -s " " > file1_processed.csv
So I wanted to know if there's any better approach to this, ideally a Pythonic one, that doesn't take much computation time. These 3 steps take about ~5 minutes (max) when done using the UNIX commands.
Please note the file file1.csv is a space-separated file and I want it to stay space-separated.
Also if your solution suggests loading entire file1.csv into memory then I would request you to suggest a way where this is done in chunks because the file is way too big (~20 GB or so) to load into memory every time.
thanks in advance.
An obvious improvement would be to convert the tr step to sed and combine all parts to one job. First the test data:
$ cat file
"this" "that"
The job:
$ sed 's/"//g;s/^ *//;s/ \+/ /g' file
this that
Here's all of those steps in one awk:
$ awk '{gsub(/\"|^ +/,""); gsub(/ +/," ")}1' file
this that
If you test it, let me know how long it took.
Here's a process which reads one line at a time and performs the substitutions you specified in Python.
with open('file1.csv') as source:
    for line in source:
        print(' '.join(line.replace('"', '').split()))
The default behavior of split() includes trimming any leading (and trailing) whitespace, so we don't specify that explicitly. If you need to keep trailing whitespace, perhaps you need to update your requirements.
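Given the ~20 GB input, you would normally redirect the output rather than watch it scroll by; a variant of the same loop that streams straight into the file1_processed.csv named in the question could look like this:

with open('file1.csv') as source, open('file1_processed.csv', 'w') as dest:
    for line in source:
        # same substitutions, written line by line so memory use stays constant
        dest.write(' '.join(line.replace('"', '').split()) + '\n')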
Your shell script attempt, with multiple temporary files and multiple invocations of sed, is not a good example of how to do this in the shell, either.
I am trying to do the following with a sed script, but it's taking too much time. It looks like I'm doing something wrong.
Scenario:
I have student records (> 1 million) in students.txt.
In this file, the first 10 characters of each line are the student ID, the next 10 characters are the contact number, and so on.
students.txt
10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...
I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and the corresponding encrypted phone numbers, as below:
encrypted_contact_numbers.txt
Phone_Number, Encrypted_Phone_Number
9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747
I wanted to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.
Expected Output:
10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...
I am using the below sed script to do this operation. It works, but it is too slow.
Approach 1:
while read -r pattern replacement; do
sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt
Approach 2:
sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt
Is there any way to process this huge file quickly?
Update: 9-Feb-2018
The solutions given in awk and Perl work fine if the phone number is in the specified position (columns 11–20). If I try to do a global replacement, it takes too much time to process. Is there a better way to achieve this?
students.txt : Updated version
10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...
awk to the rescue!
if you have enough memory to keep the phone_map file in memory
awk -F', *' 'NR==FNR{a[$1]=$2; next}
             {key=substr($0,11,10)}
             key in a {$0=substr($0,1,10) a[key] substr($0,21)}1' phone_map data_file
Not tested, since the data file isn't posted. It should be much faster, since both files are scanned only once.
The following awk may help you with the same.
awk '
FNR==NR{
    sub(/ +$/,"");
    a[$1]=$2;
    next
}
(substr($0,11,10) in a){
    print substr($0,1,10) a[substr($0,11,10)] substr($0,21)
}
' FS=", " encrypted_contact_numbers.txt students.txt
Output will be as follows. I will add an explanation shortly.
10000000011122334455XXX...
10000000024466742178YYY...
What question would be complete without a Perl answer? :) Adapted from various answers in the Perl Monks' discussion of this topic.
Edited source
Edited per @Borodin's comment. With some inline comments for explanation, in hopes that they are helpful.
#!/usr/bin/env perl
use strict; # keep out of trouble
use warnings; # ditto
my %numbers; # map from real phone number to encrypted phone number
open(my $enc, '<', 'encrypted_contact_numbers.txt') or die("Can't open map file");
while(<$enc>) {
    s{\s+}{}g;                                # remove all whitespace
    my ($regular, $encrypted) = split ',';
    $numbers{$regular} = $encrypted;
}

# Make a regex that will match any of the numbers of interest
my $number_pattern = join '|', map quotemeta, keys %numbers;
# Compile the regex - we no longer need the string representation
$number_pattern = qr{$number_pattern}o;

while(<>) {                                   # process each line of the input
    next unless length > 1;                   # Skip empty lines (don't need this line if there aren't any in your input file)
    # substr: replace only in columns 11--20
    # Replacement (s{}{}e): the 'e' means the replacement text is perl code.
    substr($_, 10, 10) =~ s{($number_pattern)}{$numbers{$1}}e;
    print;                                    # output the modified line
}
Test
Tested on Perl v5.22.4.
encrypted_contact_numbers.txt:
9234567890, 1122334455
9325788532, 4466742178
students.txt:
aaaaaaaaaa9234567890XXX...
bbbbbbbbbb9325788532YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number
(modified for ease of reading)
Output of ./process.pl students.txt:
aaaaaaaaaa1122334455XXX...
bbbbbbbbbb4466742178YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number
The change has been made on the first two lines, but not on the last two, which is correct for this input.
I have a script in python to process a log file - it parses the values and joins them simply with a tab.
p = re.compile(
    "([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"+
    "worker\\(([0-9]+)\\)(?:#([^]]*))?.*\\[([0-9]+)\\] "+
    "=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "+
    "JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0: continue
    result = p.match(line)
    if result != None:
        print "\t".join([x if x is not None else "." for x in result.groups()])
However, the script behaves quite slowly and it takes a long time to process the data.
How can I achieve the same behaviour in a faster way? Perl/sed/PHP/Bash/...?
Thanks
It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into an array. i.e.
line.split() #Split based on whitespace
or
line.split(' ') #Split based on a single space character
After that, use a few small regexes or even simple string operations to extract the data from the fields that you want.
It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.
I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
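As a rough illustration of that idea in Python (not the answerer's code; which tokens to pull out is an assumption based on the PS:/DW:/RT: fields in the question's regex, so adjust it to the real log layout):

import sys

for line in sys.stdin:
    fields = line.split()                      # cheap whitespace split first
    if not fields:
        continue
    picked = [fields[0]]                       # e.g. the leading date field
    for f in fields:
        # small targeted checks instead of one big backtracking regex
        if f.startswith(('PS:', 'DW:', 'RT:', 'PRT:', 'IP:', 'CS:', 'CONV:')):
            picked.append(f.split(':', 1)[1])
    print('\t'.join(picked))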
I write Perl, not Python, but recently I used this technique to parse very big logs:
Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
Adjust the start and end of every chunk to a \n so each worker gets full lines.
fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
Merge the output files if needed.
Sure, you should work to optimize the regexp too; for example, use .* less, since it creates a lot of backtracking, which is slow. But anyway, 99% of the time you will be CPU-bound on this regexp, so spreading the work across 8 CPUs should help.
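A rough Python sketch of that chunk-and-fork idea (the original description is for Perl; the worker count, command-line argument, and the stand-in regex below are placeholders — the full regex from the question would go in PATTERN):

import os
import re
import sys
from multiprocessing import Pool

PATTERN = re.compile(r'(\S+) (\S+)')           # stand-in; use the full regex from the question

def chunk_ranges(path, nworkers):
    # divide the file into byte ranges whose ends are snapped to newlines
    size = os.path.getsize(path)
    step = max(size // nworkers, 1)
    ranges, start = [], 0
    with open(path, 'rb') as f:
        while start < size:
            f.seek(min(start + step, size))
            f.readline()                       # advance to the end of the current line
            end = min(f.tell(), size)
            ranges.append((path, start, end))
            start = end
    return ranges

def process_chunk(job):
    # each worker reads only its own byte range and returns the parsed lines
    path, start, end = job
    out = []
    with open(path, 'rb') as f:
        f.seek(start)
        for raw in f.read(end - start).splitlines():
            m = PATTERN.match(raw.decode('utf-8', 'replace'))
            if m:
                out.append('\t'.join(x if x is not None else '.' for x in m.groups()))
    return '\n'.join(out)

if __name__ == '__main__':
    pool = Pool(4)                             # e.g. NumProcessors = 4
    for block in pool.map(process_chunk, chunk_ranges(sys.argv[1], 4)):
        if block:
            print(block)
    pool.close()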
In Perl it is possible to use precompiled regexps which are much faster if you are using them many times.
http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions
"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."
If the data is large, it is worth processing it in parallel by splitting the data into pieces. There are several modules on CPAN which make this easier.