How to randomly remove 100 blocks from a text file - python

Suppose I have a huge text file like below:
19990231
blabla
sssssssssssss
hhhhhhhhhhhhhh
ggggggggggggggg
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
20010221
fgghgg
sssssssssssss
hhhhhhhhhhhhhhh
ggggggggggggggg
<etc>
How can I randomly remove 100 blocks that each start with a numeric line and end with a blank line? E.g.:
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
<blank line>

This is not that difficult. The trick is to define the records first, and this can be done with the record separator:
RS: The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So the number of records is given by:
$ NR=$(awk 'BEGIN{RS=""}END{print NR}' <file>)
You can then use shuf to get a hundred random numbers between 1 and NR:
$ shuf -i 1-$NR -n 100
You then feed this output back into awk to select the records:
$ awk -v n=100 '(NR==n){RS=""; ORS="\n\n"}  # reset RS and ORS before reading <file>
       (NR==FNR){a[$1]; next}               # load the 100 record numbers in memory
       !(FNR in a){print}                   # print all records not selected
' <(shuf -i 1-$NR -n 100) <file>
We can also do this in one go, using the Knuth shuffle and a double pass of the file:
awk -v n=100 '
# Pick n random record numbers between 1 and m (Knuth shuffle: keep the first n)
function shuffle(m, n,    b, i, j, t) {
    for (i = m; i > 0; i--) b[i] = i
    for (i = m; i > 1; i--) {
        # j = random integer from 1 to i
        j = int(i * rand()) + 1
        # swap b[i], b[j]
        t = b[i]; b[i] = b[j]; b[j] = t
    }
    for (i = n; i > 0; i--) a[b[i]]
}
BEGIN       { RS=""; ORS="\n\n"; srand() }
(NR==FNR)   { next }                # first pass: only count the records
(FNR==1)    { shuffle(NR-1, n) }    # second pass starts: NR-1 is the record count
!(FNR in a) { print }' <file> <file>

Using awk and shuf to delete 4 blocks out of 6 blocks where each block is 3 lines long:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
NR==FNR { next }
FNR==1 {
    cmd = sprintf("shuf -i 1-%d -n %d", NR-FNR, numToDel)
    oRS=RS; RS="\n"
    while ( (cmd | getline line) > 0 ) {
        badNrs[line]
    }
    RS=oRS
    close(cmd)
}
!(FNR in badNrs)
$ awk -v numToDel=4 -f tst.awk file file
1
2
3

10
11
12
Just change numToDel=4 to numToDel=100 for your real input.
The input file used for the test above was generated by:
$ seq 18 | awk '1; !(NR%3){print ""}' > file
which produced:
$ cat file
1
2
3

4
5
6

7
8
9

10
11
12

13
14
15

16
17
18

Here is a solution without shuf:
$ awk -v RS= -v ORS='\n\n' -v n=100 '
    BEGIN  {srand()}
    NR==FNR{next}
    FNR==1 {r[0]                         # seed with dummy index 0
            # at this point NR = record count + 1, so indices fall in 0..count
            while(length(r)<=n) r[int(rand()*NR)]}
    !(FNR in r)' file{,}
This is a double-pass algorithm: the first pass counts the number of records, then a random list of index numbers of the required size is created, and the records not in the list are printed. Note that if the number to delete is close to the number of records, performance will degrade (the probability of drawing a new, unused number becomes low). For your case of 100 out of 600 this will not be a problem. In the opposite case, it would be easier to pick the records to print instead of the records to delete.
Since shuf is very fast I don't think this will buy you performance gains, but it is perhaps simpler this way.
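For completeness, the same task is compact in Python: read the file as blank-line-separated blocks (the analogue of awk's RS="" paragraph mode) and drop a random sample of block indices. A minimal sketch, assuming blocks are separated by exactly one blank line; the file names are placeholders:
import random

# Read the file as blank-line-separated blocks (awk's RS="" paragraph mode).
with open('file') as fh:
    blocks = fh.read().rstrip('\n').split('\n\n')

# Choose 100 distinct block indices to drop; keep the rest in the original order.
drop = set(random.sample(range(len(blocks)), 100))
kept = (b for i, b in enumerate(blocks) if i not in drop)

with open('out', 'w') as fh:
    fh.write('\n\n'.join(kept) + '\n')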


Awk commands for changing letters in a file with multiple outputs [closed]

I have an input file which looks like this:
input.txt
THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I have another file with the positions of the letters I want to change and the letter to change each one to, such as this:
textpos.txt
Position Text_Change
1 A
2 B
3 X
(Actually there will be about 10,000 alphabet changes)
And I would like one separate output file for each text change, which should look like this:
output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I would like to learn how to do this both as an awk command and in a pythonic way, and I am wondering what the best and quickest approach would be.
Thanks in advance.
Could you please try the following (assuming your actual Input_files have the same kind of data in them). This solution should take care of the "Too many open files" error while running the awk command, since I am closing the output files in the awk code.
awk '
FNR==NR{
    a[++count]=$0
    next
}
FNR>1{
    close(file)
    file="output"(FNR-1)".txt"
    for(i=1;i<=count;i++){
        if($1==1){
            print $2 substr(a[i],2) > file
        }
        else{
            print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file
        }
    }
}' input.txt textpos.txt
This creates 3 output files named output1.txt, output2.txt and output3.txt, whose contents will be as follows.
cat output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Explanation: here is the same code with commentary.
awk '
FNR==NR{                       ##Condition FNR==NR is TRUE while the first file, input.txt, is being read.
    a[++count]=$0              ##Create an array named a whose index is the increasing value of count and whose value is the current line.
    next                       ##next skips all further statements for this line.
}
FNR>1{                         ##Executed while the 2nd Input_file, textpos.txt, is being read (excluding its header).
    close(file)                ##Close the previous output file, named in the variable file, to avoid "too many open files".
    file="output"(FNR-1)".txt" ##Build the output file name from "output", FNR-1 (line number - 1) and ".txt".
    for(i=1;i<=count;i++){     ##Loop from 1 to the value of count.
        if($1==1){             ##If the value of the 1st field is 1, then do the following.
            print $2 substr(a[i],2) > file   ##Print $2 followed by the substring of a[i] from the 2nd position to the end of the line.
        }
        else{
            print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file   ##Print the substring from 1 to $1-1, then $2, then the substring from $1+1 to the end.
        }
    }
}' input.txt textpos.txt       ##Mention the Input_file names here.
Using gawk:
$ awk 'NR > 1 && FNR == NR { r[$1] = $2; next } {
    for (i in r) {
        print substr($0, 1, i - 1) r[i] substr($0, i + 1) > "output" i ".txt"
    }
}' textpos.txt input.txt
Using awk, abusing FS="" for the second file so that each letter becomes a column of its own:
$ awk '
NR==FNR {
    a[$1]=$2; next }    # hash positions and letters to a
{
    for(i in a)         # for all positions
        $i=a[i]         # replace the letters in them
}1' textpos FS="" OFS="" file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Another version, using for and substr to build a variable char by char from a[] and $0:
$ awk '
NR==FNR {
    a[$1]=$2; next }                          # hash textpos to a
{
    for(i=1;i<=length($0);i++)                # for each position in $0
        b=b ((i in a)?a[i]:substr($0,i,1))    # take the char from a[] or $0, in that order
    print b; b=""                             # output and reset b for the next round
}' textpos file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
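Since the question also asks for a pythonic way, here is a rough Python counterpart of the first answer above: read the text once, then write one outputN.txt per position/letter pair. A sketch only; the file names come from the question, everything else is an assumption:
# Read the text to edit (a single long line, per the question).
with open('input.txt') as fh:
    text = fh.read().rstrip('\n')

# Read "Position Text_Change" pairs, skipping the header row.
with open('textpos.txt') as fh:
    next(fh)  # skip the header line
    changes = [(int(pos), letter)
               for pos, letter in (line.split() for line in fh if line.strip())]

# One output file per change; positions in textpos.txt are 1-based.
for n, (pos, letter) in enumerate(changes, start=1):
    edited = text[:pos - 1] + letter + text[pos:]
    with open('output{}.txt'.format(n), 'w') as out:
        out.write(edited + '\n')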

Append values into key pair from two lines

My text file output looks like this, on two lines:
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
My desired output is to look like this:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
I'm having no luck trying this:
sed '$!N;s/|/\n/' foo
Any advice would be welcomed, thank you.
As you have just two lines, this can be a way:
$ paste -d' ' <(head -1 file | sed 's/|/\n/g') <(tail -1 file | sed 's/|/\n/g')
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Piece by piece. Let's take the first line and replace every pipe | with a newline:
$ head -1 file | sed 's/|/\n/g'
DelayTimeThreshold
MaxDelayPerMinute
Name
And do the same with the last line:
$ tail -1 file | sed 's/|/\n/g'
10000
5
rca
Then it is just a matter of pasting both results with a space as delimiter:
paste -d' ' output1 output2
This awk one-liner would work for your requirement:
awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}' file
output:
kent$ echo "DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca"|awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}'
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Using the Array::Transpose module:
perl -MArray::Transpose -F'\|' -lane '
push @a, [@F]
} END {print for map {join " ", @$_} transpose(\@a)
' <<END
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
END
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
As a perl script:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $file1 = ('DelayTimeThreshold|MaxDelayPerMinute|Name');
my $file2 = ('10000|5|rca');
my @file1 = split('\|', $file1);
my @file2 = split('\|', $file2);
my %hash;
@hash{@file1} = @file2;
print Dumper \%hash;
Output:
$VAR1 = {
          'Name' => 'rca',
          'DelayTimeThreshold' => '10000',
          'MaxDelayPerMinute' => '5'
        };
OR:
for (my $i = 0; $i <= $#file1; $i++) {
    print "$file1[$i] $file2[$i]\n";
}
Output:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Suppose you have a file that contains a single header row with column names, followed by multiple detail rows with column values, for example,
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|abc
20001|6|def
30002|7|ghk
40003|8|jkl
50004|9|mnp
The following code would print that file, using the names from the first row, paired with values from each subsequent (detail) row,
#!/bin/perl -w
use strict;
my ($fn,$fh)=("header.csv"); #whatever the file is named...
open($fh,"< $fn") || die "cannot open $fn";
my ($count,$line,@names,@vals)=(0);
while(<$fh>)
{
    chomp $_;
    @vals=split(/\|/,$_);
    if($count++<1) { @names=@vals; next; } #first line is names
    for (my $ndx=0; $ndx<=$#names; $ndx++) { #print each
        print "$names[$ndx] $vals[$ndx]\n";
    }
}
Suppose you want to keep around each row, annotated with names, in an array,
my @records;
while(<$fh>)
{
    chomp $_;
    @vals=split(/\|/,$_);
    if($count++<1) { @names=@vals; next; }
    my %row;                  # declared inside the loop so each record gets its own hash
    @row{@names} = @vals;
    push(@records,\%row);
}
Maybe you want to refer to the rows by some key column,
my %records;
while(<$fh>)
{
    chomp $_;
    @vals=split(/\|/,$_);
    if($count++<1) { @names=@vals; next; }
    my %row;                  # again, a fresh hash per record
    @row{@names} = @vals;
    $records{$vals[0]}=\%row;
}

How to sort output data into columns and rows

I have an output that looks like this, where the first number corresponds to the count of the type below (e.g. 72 for Type 4, etc)
72
Type
4
51
Type
5
66
Type
6
78
Type
7
..etc
Is there a way to organize this data to look something like this:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
etc..
Essentially, the question is how to take a single column of data and sort/organize it into something more readable using bash, awk, python, etc. (Ideally in bash, but I am also interested in how to do it in Python.)
Thank you.
Use paste to join 3 consecutive lines from stdin, then just rearrange the fields.
paste - - - < file | awk '{print $2, $3, "=", $1, "times"}'
It's simple enough with Python to read three lines of data at a time:
def perthree(iterable):
    return zip(*[iter(iterable)] * 3)

with open(inputfile) as infile:
    for count, type_, type_num in perthree(infile):
        print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
The .strip() calls remove any extra whitespace, including the newline at the end of each line of input text.
Demo:
>>> with open(inputfile) as infile:
...     for count, type_, type_num in perthree(infile):
...         print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
...
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
In Bash:
#!/bin/bash
A=() I=0
while read -r LINE; do
    if (( (M = ++I % 3) )); then
        A[M]=$LINE
    else
        printf "%s %s = %s times\n" "${A[2]}" "$LINE" "${A[1]}"
    fi
done
Running bash script.sh < file creates:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
Note: With a default IFS ($' \t\n'), read would remove leading and trailing spaces by default.
Try this awk one-liner:
$ awk 'NR%3==1{n=$1}NR%3==2{t=$1}NR%3==0{print t,$1,"=",n,"times"}' file
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
How does it work?
awk '
NR%3==1{    # if we are on lines 1,4,7, etc. (NR is the record number, i.e. the line number)
    n=$1    # set the variable n to the first (and only) word
}
NR%3==2{    # if we are on lines 2,5,8, etc.
    t=$1    # set the variable t to the first (and only) word
}
NR%3==0{    # if we are on lines 3,6,9, etc.
    print t,$1,"=",n,"times"    # print the desired output
}' file

intersperse the lines of two different files

I have to do a simple task, but I don't know how to do it and I'm stuck. I need to intersperse the lines of two different files every 4 lines:
File 1:
1
2
3
4
5
6
7
8
9
10
11
12
FILE 2:
A
B
C
D
E
F
G
H
I
J
K
L
Desired result:
1
2
3
4
A
B
C
D
5
6
7
8
E
F
G
H
9
10
11
12
I
J
K
L
I'm looking for a sed, awk or python script, or any other bash command.
Thanks for your time!!
I tried to do it using specific Python libraries that recognize the 4-line modules of each file, but it doesn't work, and now I am trying to do it without these libraries but don't know how.
import sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def main(forward,reverse):
    for F, R in zip( SeqIO.parse(forward, "fastq"), SeqIO.parse(reverse, "fastq") ):
        fastq_out_F = SeqRecord( F.seq, id = F.id, description = "" )
        fastq_out_F.letter_annotations["phred_quality"] = F.letter_annotations["phred_quality"]
        fastq_out_R = SeqRecord( R.seq, id = R.id, description = "" )
        fastq_out_R.letter_annotations["phred_quality"] = R.letter_annotations["phred_quality"]
        print fastq_out_F.format("fastq"),
        print fastq_out_R.format("fastq"),

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
This might work for you (using GNU sed):
sed -e 'n;n;n;R file2' -e 'R file2' -e 'R file2' -e 'R file2' file1
or using paste/bash:
paste -d' ' <(paste -sd' \n' file1) <(paste -sd' \n' file2) | tr ' ' '\n'
or:
parallel -N4 --xapply 'printf "%s\n%s\n" {1} {2}' :::: file1 :::: file2
It can be done in pure bash:
f1=""; f2=""
while test -z "$f1" -o -z "$f2"; do
{ read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE" && \
read LINE && echo "$LINE"; } || f1=end;
{ read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE" && \
read -u 3 LINE && echo "$LINE"; } || f2=end;
done < f1 3< f2
The idea is to use a new file descriptor (3 in this case) and read from stdin and this file descriptor at the same time.
A mix of paste and sed can also be used if you do not have GNU sed:
paste -d '\n' f1 f2 | sed -e 'x;N;x;N;x;N;x;N;x;N;x;N;x;N;s/^\n//;H;s/.*//;x'
If you are not familiar with sed, there is a 2nd buffer called the hold space where you can save data. The x command exchanges the current buffer with the hold space, the N command appends one line to the current buffer, and the H command appends the current buffer to the hold space.
So the first x;N saves the current line (from f1, because of paste) in the hold space and reads the next line (from f2, because of paste). Then each x;N;x;N reads a new line from f1 and f2, and the script finishes by removing the leading newline from the 4 lines of f2, appending the f2 lines after the f1 lines, cleaning the hold space for the next run, and printing the 8 lines.
Try this, changing the appropriate filename values for f1 and f2.
awk 'BEGIN{
    sectionSize=4; maxSectionCnt=sectionSize; maxSectionCnt++
    notEof1=notEof2=1
    f1="file1" ; f2="file2"
    while (notEof1 && notEof2) {
        if (notEof1) {
            for (i=1;i<maxSectionCnt;i++) {
                if ( (getline < f1) > 0 ) { print "F1:" i ":" $0 } else { notEof1=0 }
            }
        }
        if (notEof2) {
            for (i=1;i<maxSectionCnt;i++) {
                if ( (getline < f2) > 0 ) { print "F2:" i ":" $0 } else { notEof2=0 }
            }
        }
    }
}'
You can also remove the "F1:" i ":" record-header prefix; I added that to help debug.
As Pastafarianist rightly points out, you may need to modify this if you have expectations about what will happen if the files are not the same size, etc.
I hope this helps.
The code you posted looks extremely complicated. There is a rule of thumb with programming: there is always a simpler solution. In your case, way simpler.
First thing you should do is determine the limitations of the input. Are you going to process really big files? Or are they going to be only of one-or-two-kilobyte size? It matters.
Second thing: look at the tools you have. With Python, you've got file objects, lists, generators and so on. Try to combine these tools to produce the desired result.
In your particular case, there are some unclear points. What should the script do if the input files have different size? Or one of them is empty? Or the number of lines is not a factor of four? You should decide how to handle corner cases like these.
Take a look at the file object, xrange, list slicing and list comprehensions. If you prefer doing it the cool way, you can also take a look at the itertools module.
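Following that advice, a plain-Python sketch of the 4-line interleave needs only itertools.islice; the file names are assumed, and any leftover lines from the longer file are silently dropped because zip stops at the shorter input:
from itertools import islice

def chunks(fh, size=4):
    # Yield successive lists of `size` lines until the file is exhausted.
    while True:
        chunk = list(islice(fh, size))
        if not chunk:
            return
        yield chunk

with open('file1') as f1, open('file2') as f2:
    for a, b in zip(chunks(f1), chunks(f2)):
        # Lines keep their trailing newlines, so join and print them as-is.
        print(''.join(a) + ''.join(b), end='')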

How to trim a file - for rows with the same value in two columns, keep only the row with the max in a third column

I am now facing a file-trimming problem. I would like to trim rows in a tab-delimited file.
The rule is: for rows with the same value in two columns, preserve only the row with the largest value in the third column. There may be different numbers of such redundant rows as defined by the two columns. If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).
(1) My file looks like this (tab-delimited, with several million rows):
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
(2) The output I want:
1 100 25 T
1 101 30 A
1 102 40 T
This problem comes from my real research, not homework. I would appreciate your help, because I have limited programming skills. I would prefer a computationally efficient approach, because there are so many rows in my data file. Your help will be very valuable to me.
Here's a solution that relies on the input file already being sorted appropriately. It scans line by line for lines with the same start (i.e. the first two columns identical), checks the third-column value, and preserves the line with the highest value - or the line that came first in the file. When a new start is found, it prints the old line and begins checking again.
At the end of the input file, the max line in memory is printed out.
use warnings;
use strict;

my ($max_line, $start, $max) = parse_line(scalar <DATA>);

while (<DATA>) {
    my ($line, $nl_start, $nl_max) = parse_line($_);
    if ($nl_start eq $start) {
        if ($nl_max > $max) {
            $max_line = $line;
            $max = $nl_max;
        }
    } else {
        print $max_line;
        $start = $nl_start;
        $max = $nl_max;
        $max_line = $line;
    }
}
print $max_line;

sub parse_line {
    my $line = shift;
    my ($start, $max) = $line =~ /^([^\t]+\t[^\t]+\t)(\d+)/;
    return ($line, $start, $max);
}
__DATA__
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
The output is:
1 100 25 T
1 101 30 A
1 102 40 A
You stated:
"If there is a tie for the largest value in the third column, preserve the first one (after ordering the file)."
which is rather cryptic. Then you asked for output that seemed to contradict this, where the last value was printed instead of the first.
I am assuming that what you meant is "preserve the first value". If you indeed meant "preserve the last value", then simply change the > sign in if ($nl_max > $max) to >=. This will effectively preserve the last equal value instead of the first.
If you however implied some kind of sort, which "after ordering the file" seems to imply, then I do not have enough information to know what you meant.
In python you can try the following code:
res = {}
for line in (line.split() for line in open('c:\\inpt.txt','r') if line):
    line = tuple(line)
    if not line[:2] in res:
        res[line[:2]] = line[2:]
        continue
    elif res[line[:2]][0] <= line[2]:   # compare against the third column (was line[3], a bug)
        res[line[:2]] = line[2:]

f = open('c:\\tst.txt','w')
[f.write(line) for line in ('\t'.join(k+v)+'\n' for k,v in res.iteritems())]
f.close()
Here's one approach
use strict;
use warnings;
use constant
    { LINENO => 0
    , LINE   => 1
    , SCORE  => 2
    };
use English qw<$INPUT_LINE_NUMBER>;

my %hash;
while ( <> ) {
    # split the line to get the fields
    my @fields = split /\t/;
    # Assemble a key from everything except the "score"
    my $key = join( '-', @fields[0,1] );
    # locally cache the score
    my $score = $fields[SCORE];
    # if we already have a score for this key and the current one is not greater, skip
    next if $hash{ $key } and $score <= $hash{ $key }[SCORE];
    # store the line number, line text, and score
    $hash{ $key } = [ $INPUT_LINE_NUMBER, $_, $score ];
}
# sort by line number and print out the text of the line stored.
foreach my $struct ( sort { $a->[LINENO] <=> $b->[LINENO] } values %hash ) {
    print $struct->[LINE];
}
In Python too, but cleaner imo
import csv
spamReader = csv.reader(open('eggs'), delimiter='\t')
select = {}
for row in spamReader:
    first_two, three = (row[0], row[1]), row[2]
    if first_two in select:
        if select[first_two][2] > three:
            continue
    select[first_two] = row
spamWriter = csv.writer(open('ham', 'w'), delimiter='\t')
for line in select:
    spamWriter.writerow(select[line])
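For reference, the same scan reads a little more safely in modern Python 3 if the third column is compared numerically rather than as a string. A sketch along the lines of the csv answer above, keeping its file names:
import csv

best = {}  # (col1, col2) -> best row seen so far
with open('eggs', newline='') as fh:
    for row in csv.reader(fh, delimiter='\t'):
        key = (row[0], row[1])
        # Keep the row with the largest numeric third column;
        # ">=" keeps the last row on ties, matching the sample output.
        if key not in best or int(row[2]) >= int(best[key][2]):
            best[key] = row

# Dicts preserve insertion order in Python 3.7+, so output follows input order.
with open('ham', 'w', newline='') as fh:
    csv.writer(fh, delimiter='\t').writerows(best.values())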
