Remove duplicates in multifasta, where entries are paired


Hi my input looks like:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
The entries in the fasta file are paired so that the ref is paired with the sample# below it.
I want to identify where the nt sequence for sample# and ref are identical, and remove them from the fasta (or put them into another fasta file of their own). The output would hopefully be a fasta file where the nt sequences for refs and sample# are different.
So far I have tried the seqkit rmdup command; however, this doesn't treat the entries as if they are paired. How can I accomplish this, ideally with a bash command or other program?

Here is one way to do it:
Make a list from every two records (4 lines), each line an element of the list. Then compare DNA sequences in each list. If they are not equal, print the ref and sample.
Note: This script assumes that each FASTA record occupies exactly two lines (as in your example) and is not wrapped across multiple lines. If your records are wrapped, convert them to two-line records first, or parse the file with the Biopython module.
counter = 0
my_records = []
with open("input.fasta") as f:
    for line in f:
        counter += 1
        my_records.append(line.strip())
        if counter % 4 == 0:
            # my_records holds: ref header, ref sequence, sample header, sample sequence
            if my_records[1] != my_records[3]:
                for item in my_records:
                    print(item)
            my_records = []
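For wrapped (multi-line) FASTA, a minimal sketch without Biopython might look like the following. It assumes the records strictly alternate ref, sample, ref, sample as in the question; the function names `read_fasta` and `differing_pairs` are made up for illustration and are not part of any library.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples; lines of a wrapped record are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def differing_pairs(handle):
    """Yield (ref, sample) record pairs whose sequences differ.

    Assumption: records alternate ref, sample, ref, sample, ...
    """
    records = read_fasta(handle)
    # zip over the same generator twice consumes the records two at a time
    for ref, sample in zip(records, records):
        if ref[1] != sample[1]:
            yield ref, sample

# Small in-memory example; the first ref record is wrapped over two lines.
example = io.StringIO(
    ">ref\nGGTG\nCCCA\n>sample1\nGGTGCCCA\n"
    ">ref\nGGTTAG\n>sample2\nGGTTAC\n"
)
for ref, sample in differing_pairs(example):
    for header, seq in (ref, sample):
        print(header)
        print(seq)
```

To run it on a real file, the `io.StringIO` object can be replaced by an open file handle.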

This awk script seems to solve your problem. If it does, please mark this as correct. If it does not, please post a comment.
Mac_3.2.57$cat fastaScrubber-v0.awk
{
    if(NR%4==2){
        ref=$0
    }else if(NR%4==3){
        samN=$0
    }else if(NR%4==0){
        sam=$0
        if(sam!=ref){
            printf(">ref\n%s\n%s\n%s\n",ref ,samN ,sam)
        }
    }
}
Mac_3.2.57$cat fasta0 | awk '{if(NR%4==2||NR%4==0){print}}' | uniq -c
2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$awk -f fastaScrubber-v0.awk fasta0
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$cat fasta0
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>sample3
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$

Alternate awk solution via bash script:
Pass the fasta file to be processed as a parameter to the script. For lines that start with >ref, capture that line into the variable rtag and the following line into the variable r. For lines that start with >sample, capture that line into the variable stag and the following line into the variable s. If r does not equal s, print all four captured lines, one per line.
#!/bin/bash
fasta="${1:-input.fasta}"
awk '
/^>ref/{rtag=$0; getline r}
/^>sample/{
    stag=$0; getline s
    if(r!=s){
        printf "%s\n%s\n%s\n%s\n", rtag, r, stag, s
    }
}
' "$fasta"
Output:
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
input.fasta contents:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT

Related

Match Multiple Columns In Two Files - Output Only Those That Match Fully

File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11Gb and 3Gb respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column and then use Join (I'm very new to this).

grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanations :
First command :
sed 's,|,|[^|]*|,' file2
Transforms file2 into a list of patterns to search in file 1 and returns :
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
Second command :
grep -f <(command1) file1
Searches file1 for those patterns.

Could you please try the following?
awk -F'|' '
FNR==NR{
    a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
    next
}
(($1,$2,$3) in a){
    print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
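The column matching described in the question can also be sketched in Python. This is only an illustration under stated assumptions: it keeps the three-column keys of File 2 (the smaller file) in a set and streams File 1 line by line; whether that set fits in memory for a 3 GB file depends on the machine. The function name `matching_lines` is hypothetical.

```python
def matching_lines(file1_lines, file2_lines, sep="|"):
    """Yield lines of file 1 whose columns 2, 4 and 5 equal columns 1, 2 and 3
    of some line in file 2."""
    keys = set()
    for line in file2_lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 3:
            keys.add(tuple(fields[:3]))
    for line in file1_lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 5 and (fields[1], fields[3], fields[4]) in keys:
            yield line.rstrip("\n")

# A few lines taken from the question's example data.
file1 = [
    "175972|212946872418|meghanj4#lycos.com|Meghan|Judge",
    "175972|212946872418|quenchia#gmail.com|Anna|Balint",
    "176046|255875|keion#netscape.net|Charlene|Johnson",
]
file2 = [
    "212946872418|Anna|Balint",
    "255875|Charlene|Johnson",
]
for line in matching_lines(file1, file2):
    print(line)
```

Both arguments can be open file handles instead of lists, since the function only iterates over them once.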

Awk commands for changing letters in a file with multiple outputs [closed]

I have an input file which looks like this:
input.txt
THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I have another file with the positions of the letters I want to change and the letter I want to change each one to, such as this:
textpos.txt
Position Text_Change
1 A
2 B
3 X
(Actually there will be about 10,000 alphabet changes)
And I would like one separate output file for each text change, which should look like this:
output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I would like to learn how to do this in an awk command and a pythonic way as well, and was wondering what would be the best and quickest way to do this?
Thanks in advance.

Could you please try the following (assuming your actual input files contain the same kind of data). This solution should avoid the "Too many open files" error when running the awk command, since I am closing the output files in the awk code.
awk '
FNR==NR{
    a[++count]=$0
    next
}
FNR>1{
    close(file)
    file="output"(FNR-1)".txt"
    for(i=1;i<=count;i++){
        if($1==1){
            print $2 substr(a[i],2) > file
        }
        else{
            print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file
        }
    }
}' input.txt textpos.txt
This creates 3 output files named output1.txt, output2.txt and output3.txt, whose contents are as follows.
cat output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Explanation: Adding explanation for above code here.
awk '
FNR==NR{ ##Condition FNR==NR will be TRUE when first file named input.txt is being read.
a[++count]=$0 ##Creating an array named a whose index is increasing value of count and value is current line.
next ##next will skip all further statements from here.
}
FNR>1{ ##This condition will be executed when 2nd Input_file textpos.txt is being read(excluding its header).
close(file) ##Closing file named file whose value will be output file names, getting created further.
file="output"(FNR-1)".txt" ##Creating output file named output FNR-1(line number -1) and .txt in it.
for(i=1;i<=count;i++){ ##Starting a for loop from 1 to till count value.
if($1==1){ ##Checking condition if value of 1st field is 1 then do following.
print $2 substr(a[i],2) > file ##Printing $2 substring of value of a[i] which starts from 2nd position till end of line to output file.
}
else{
print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file ##Printing substrings 1st 1 to till value of $1-1 $2 and then substring from $1+1 till end of line.
}
}
}' input.txt textpos.txt ##Mentioning Input_file names here.
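Since the question also asks for a Python approach, here is a minimal sketch of the same idea: one changed copy of the text per row of textpos.txt. The helper names `apply_change` and `read_changes` are assumptions made for this example, and the sketch prints the results instead of writing the output files.

```python
def apply_change(text, position, letter):
    """Return text with the character at the 1-based position replaced by letter."""
    return text[:position - 1] + letter + text[position:]

def read_changes(lines):
    """Parse 'position letter' rows from a list of lines, skipping the header line."""
    changes = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == 2:
            changes.append((int(parts[0]), parts[1]))
    return changes

text = "THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT"
rows = ["Position Text_Change", "1 A", "2 B", "3 X"]
for n, (pos, letter) in enumerate(read_changes(rows), start=1):
    # Writing to f"output{n}.txt" would go here; printing keeps the sketch free of side effects.
    print(f"output{n}.txt", apply_change(text, pos, letter))
```

With about 10,000 changes, opening and closing each output file inside the loop (for example with a `with open(...)` block) keeps only one file open at a time.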

Using gawk:
$ awk 'NR > 1 && FNR == NR { r[$1] = $2; next } {
for (i in r) {
print substr($0, 1, i - 1) r[i] substr($0, i + 1) > "output" i ".txt"
}
}' textpos.txt input.txt

Using awk, abusing FS="" for the second file making each letter a column of its own:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash positions and letters to a
{
for(i in a) # for all positions
$i=a[i] # replace the letters in them
}1' textpos FS="" OFS="" file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Another using for and substr to build a variable char by char from a[] and $0:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash textpos to a
{
for(i=1;i<=length($0);i++) # for each position in $0
b=b ((i in a)?a[i]:substr($0,i,1)) # get char from a[] or $0, in that order
print b; b="" # output and reset b for next round
}' textpos file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT

How to use awk to select/remove fields from the end of a column after splitting?

In a file that has a particular column of information, I want to remove exactly 5 fields (i.e. :PG:PB:PI:PW:PC; the separator is ':') from the end of the lines, not from the beginning.
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
Assuming that the above data is from column #3 of the file, I wrote the following code:
awk 'BEGIN{FS=OFS="\t"} { split($3, a,":")} {print ($1, $2, a[1]":"a[2]":"a[3]":"a[4]":"a[5])}' awk_test.vcf
This code splits and selects the first 5 fields, but I want to remove the last 5 fields. Selecting from the first fields won't work since certain fields like PGT , PID are inserted in certain lines. Only, removing from the end works.
Expected output:
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PGT:PID:PL
GT:AD:DP:GQ:PGT:PID:PL
Thanks for helping me with the code for first part of my question.
But the script isn't working for another file of mine, which has the following data. Here I want to update the 9th column for the same purpose. The columns are tab separated, but what I want to do remains basically the same.
2 1463 . T TG 433.67 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=247;ExcessHet=2.9800;FS=0.000;MQ=21.25;MQRankSum=0.00;QD=33.36;ReadPosRankSum=-6.740e-01;SOR=0.784;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:76,0:76:0,0,0:./.:.:.:./.:. ./.:55,0:55:0,0,0:.:.:.:.:. ./.:68,0:68:0,0,0:.:.:.:.:. ./.:48,0:48:0,0,0:.:.:.:.:.
2 1466 . TG T 395.82 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=1.01;ClippingRankSum=0.00;DP=287;ExcessHet=5.1188;FS=7.707;MQ=18.00;MQRankSum=0.00;QD=17.21;ReadPosRankSum=1.28;SOR=0.074;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1467 . G T 1334.42 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=287;ExcessHet=4.8226;FS=1.328;MQ=23.36;MQRankSum=0.00;QD=28.65;ReadPosRankSum=-4.310e-01;SOR=0.566;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1516 . C T 5902.93 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.287;ClippingRankSum=0.00;DP=411;ExcessHet=0.5065;FS=1.489;InbreedingCoeff=0.3492;MQ=59.77;MQRankSum=0.00;QD=28.38;ReadPosRankSum=-7.100e-02;SOR=0.553;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 1/1:1,108:109:99:3935,286,0:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:78,0:78:99:0,120,1800:.:.:.:.:.
2 1584 . CT C 164.08 PASS AC=0;AF=0.00;AN=8;DP=717;ExcessHet=0.0812;FS=0.000;InbreedingCoeff=0.9386;MQ=60.00;QD=32.82;SOR=3.611;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 0/0:172,0:172:99:0,120,1800:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:321,0:321:99:0,120,1800:.:.:.:.:.
2 1609 . C A 604.68 PASS AC=0;AF=0.00;AN=0;DP=386;ExcessHet=0.1158;FS=0.000;InbreedingCoeff=0.8938;MQ=12.32;QD=31.09;SOR=1.061;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:0,0:0:0,0,0:./.:.:.:./.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:386,0:386:0,0,0:.:.:.:.:.
2 1612 . TGTGAGCTATTTCTTTTACATTTTTCTTTAGATTCTAGGTTAAATTGTGAAGCTGATTATCTTTTTTGTTTACAG T 1298.76 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.1047;FS=0.000;InbreedingCoeff=0.8896;MQ=60.02;QD=29.54;SOR=1.179;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0:0:.:0,0,0:./.:.:.:./.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. 1/1:0,3:3:99:1355,582,0:.:.:.:.:.
2 1657 . T A,* 3118.91 PASS AC=0,2;AF=0.00,1.00;AN=2;BaseQRankSum=0.578;ClippingRankSum=0.00;DP=4;ExcessHet=1.9114;FS=3.474;InbreedingCoeff=0.0821;MQ=26.68;MQRankSum=0.841;QD=28.10;ReadPosRankSum=-5.960e-01;SOR=0.821;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0,0:0:.:0,0,0,0,0,0:./.:.:.:./.:. ./.:1,0,0:1:.:0,0,0,0,0,0:.:.:.:.:. ./.:0,0,0:0:.:0,0,0,0,0,0:.:.:.:.:. 2/2:0,0,3:3:99:1355,1360,1393,582,615,0:.:.:.:.:.
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=1595;ExcessHet=0.0577;FS=0.621;InbreedingCoeff=0.6496;MQ=60.00;MQRankSum=0.00;QD=5.46;ReadPosRankSum=0.307;SOR=0.773;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:389,92:481:99:1748,0,12243:0|1:.,.,.,.,.:935:|:0.5 0/0:318,0:318:99:0,120,1800:.:.:.:.:. 0/1:270,53:323:99:990,0,9096:.:.:.:.:. 0/0:473,0:473:99:0,120,1800:.:.:.:.:.
2 2781 . T G 435.07 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.624;ClippingRankSum=0.00;DP=2146;ExcessHet=3.4523;FS=8.450;InbreedingCoeff=-0.0856;MQ=60.06;MQRankSum=-4.630e-01;QD=1.27;ReadPosRankSum=-5.980e+00;SOR=1.436;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,34:343:99:0|1:2781_T_G:469,0,12941:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2786 . C G 39.69 PASS AC=0;AF=0.00;AN=8;BaseQRankSum=0.881;ClippingRankSum=0.00;DP=2145;ExcessHet=4.3933;FS=0.000;InbreedingCoeff=-0.1367;MQ=52.41;MQRankSum=-1.356e+00;QD=1.13;ReadPosRankSum=0.577;SOR=0.527;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:0,120,1800:0/0:.:.:0/0:. 0/0:342,0:342:99:0,120,1800:.:.:.:.:. 0/0:492,0:492:99:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:0,120,1800:.:.:.:.:.
2 2787 . T C 993.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-2.967e+00;ClippingRankSum=0.00;DP=2153;ExcessHet=3.8663;FS=4.941;InbreedingCoeff=-0.1076;MQ=60.06;MQRankSum=-5.100e-01;QD=2.84;ReadPosRankSum=-3.689e+00;SOR=0.875;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,41:350:99:0|1:2781_T_G:1027,0,13619:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2792 . A G 745.21 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.271;ClippingRankSum=0.00;DP=2176;ExcessHet=5.9256;FS=5.964;InbreedingCoeff=-0.2087;MQ=59.48;MQRankSum=-4.920e-01;QD=1.83;ReadPosRankSum=-3.100e-02;SOR=1.389;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:332,41:373:99:0|1:2781_T_G:705,0,13295:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
I also tried adding FS/OFS parameters but it isn't working.

After some clarification what the file looks like, here is my updated answer:
You can simply use (gensub requires GNU awk):
awk 'BEGIN{FS=OFS="\t"} {$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile
Here we use the standard awk field splitting, since your file is tab-separated.
We further do a regular expression replacement scoped to $9, which is the colon-separated string you want to change.
The regular expression works the same as in the old answer, in which I had the impression that the line consists only of the colon-separated string.
Old Answer
Since you wrote "pipe to python" in your comment, maybe you are open to an sed solution?
sed -r "s/(:[^:]+){5}$//" yourfile
Here we replace (s/...// replace the ... with nothing), the ... means:
from the end of line ($)
five ({5})
occurrences of colon (:)
followed by something (+)
not a colon ([^:])
And this can again be "translated" to awk:
awk '{$0 = gensub(/(:[^:]+){5}$/,"","1")} 1' yourfile
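The same regular expression can be tried out in Python with the standard `re` module. This is just a sketch to illustrate the pattern; the function name `drop_last_five` is made up for the example.

```python
import re

# Five trailing groups, each a ":" followed by one or more non-colon characters.
LAST_FIVE = re.compile(r"(?::[^:]+){5}$")

def drop_last_five(format_field):
    """Remove the last five colon-separated fields, if there are at least five to remove."""
    return LAST_FIVE.sub("", format_field)

for field in ("GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC",
              "GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC"):
    print(drop_last_five(field))
```

Because the pattern is anchored with `$`, only the fields at the end of the string are removed, no matter how many fields precede them.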

Maybe not the best awk solution but works:
awk -F: '{printf($1); for (i=2;i<=NF-5;i++) printf(":%s",$i); printf("\n"); }' file.txt
split the fields naturally according to colon
print first field, and then other fields minus the 5 last ones (using NF: number of fields preset variable), with leading colon.
print a linefeed to end the line.
EDIT: I knew there was a better way to do this with awk. As Lars commented, this is way simpler and cleaner:
awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}' file.txt
use separator value instead of hardcoded colon
compose string instead of printing all fields
print string in the end
If you want to use it within a python script, I'd suggest that you write that in python, simpler & faster:
import csv
with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter=":")
    cw = csv.writer(fw,delimiter=":")
    for row in cr:
        cw.writerow(row[:-5]) # write the row without its last 5 fields
You can omit the with part if you already have open handles.
EDIT: since you heavily edited your question after my answer, now you want to remove the 5 last "fields" from one particular field (tab-separated). Lars has answered properly awk-wise, let me propose my python solution:
import csv
with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter="\t")
    cw = csv.writer(fw,delimiter="\t")
    for row in cr:
        row[8]=":".join(row[8].split(":")[:-5]) # remove the last 5 "fields" from the 9th column (index 8)
        cw.writerow(row) # write the modified row

Extracting data from file with differing amounts of columns

I have a tab separated file that appears as so:
NM_000014 chr12 - 36 9220303 9220778 9221335 9222340 9223083 9224954 9225248 9227155 9229351 9229941 9230296 9231839 9232234 9232689 9241795 9242497 9242951 9243796 9246060 9247568 9248134 9251202 9251976 9253739 9254042 9256834 9258831 9259086 9260119 9261916 9262462 9262909 9264754 9264972 9265955 9268359 9220435 9220820 9221438 9222409 9223174 9225082 9225467 9227379 9229532 9230016 9230453 9231927 9232411 9232773 9241847 9242619 9243078 9244025 9246175 9247680 9248296 9251352 9252119 9253803 9254270 9256996 9258941 9259201 9260240 9262001 9262631 9262930 9264807 9265132 9266139 9268558 A2M 1
NM_000016 chr1 + 12 76190031 76194085 76198328 76198537 76199212 76200475 76205664 76211490 76215103 76216135 76226806 76228376 76190502 76194173 76198426 76198607 76199313 76200556 76205795 76211599 76215244 76216231 76227055 76229363 ACADM 1
As you can tell if you scroll to end of the lines, there are differing amounts of columns corresponding to the numbers listed. What I want to do is output the very last number before the gene name (A2M and ACADM in this case) to a file. Is there any way to do this? I've been trying to figure out a way using unix's awk, however I don't believe this will work due to the differing amounts of columns.
Any help is appreciated

Use $(NF-1) like so, where NF is the number of fields on that line:
awk '{print $(NF-1)}' /tmp/genes.txt
A2M
ACADM
Your posted example has spaces for delimiters. You may need to change the field separator to tabs if your file is truly tab delimited. Then it would be:
awk -F $'\t' '{print $(NF-1)}' file_name
If you want the number before that name:
$ awk '{print $(NF-2)}' /tmp/genes.txt
9268558
76229363

Try:
awk '{ print $(NF-1) }' FILE
NF always holds the number of fields on the current line, so you can use it inside $(...) to select a field counted from the end of the line.

Are all of your lines structured in the same manner? If so, it is pretty straightforward:
for line in myLines:
    data = line.split()[-3]

How do I quickly match the fields of two files that are sorted but one is a subset of the other

I have two sorted files and want to merge them to make a third, but I need the output to be sorted. One column in the second file is a subset of the first, and any place where the second file doesn't match the first should be filled in with NA. The files are large (~20,000,000 records each), so loading things into memory is tough and speed is an issue.
File 1 looks like this:
1 a
2 b
3 c
4 d
5 e
File 2 looks like this:
1 aa
2 bb
4 dd
5 ee
And the output should look like this:
1 a aa
2 b bb
3 c NA
4 d dd
5 e ee

join is your friend here.
join -a 1 file1 file2
should do the trick. The only difference from your example output is that the unpairable lines are printed directly from file1, i.e. without the NA.
Edit: Here is a version that also handles the NAs:
join -a 1 -e NA -o 1.1 1.2 2.2 file1 file2

If I understand you correctly:
File #1 and file #2 will have the same lines
However, some lines will be missing from file #2 that are in file #1.
AND, most importantly, the lines will be sorted in each file.
That means if I get a line from file #2 and then keep reading through file #1, I'll find a matching line sooner or later. Therefore, we want to read a line from file #2, keep looking through file #1 until we find the matching line, and when we do find one, we want to print out both values.
I would imagine some sort of algorithm like this:
Read first line from file #2
While read line from file #1
    if line from file #2 > line from file #1
        write line from file #1 and "NA"
    else
        write line from file #1 and file #2
        Read another line from file #2
    fi
done
There should be some form of error checking (what if you find the line from file #1 to be greater than the line from file #2? That means file #1 is missing the line.) And there should be some boundary checking (what if you run out of lines from file #2 before you finish file #1?)
This sounds like a school assignment, so I really don't want to give an actual answer. However, the algorithm is there. All you need to do is implement it in your favorite language.
If it isn't a school assignment, and you need more help, just post a comment on this answer, and I'll do what I can.
To the DNA Biologist
#! /usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use constant {
    TEXT1 => "foo1.txt",
    TEXT2 => "foo2.txt",
};
open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq(for reading\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq(for reading\n);
my $line2 = <FILE2>;
chomp $line2;
my ($lineNum2, $value2) = split(/\s+/, $line2, 2);
while (my $line1 = <FILE1>) {
    chomp $line1;
    my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
    if (not defined $line2) {
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 lt $lineNum2) { #Use "<" if numeric match and not string match
        say "$lineNum1 - $value1 - NA";
    }
    elsif ($lineNum1 eq $lineNum2) {
        say "$lineNum1 - $value1 - $value2";
        $line2 = <FILE2>;
        if (defined $line2) {
            chomp $line2;
            ($lineNum2, $value2) = split(/\s+/, $line2, 2);
        }
    }
    else {
        die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
    }
}
It wasn't thoroughly tested, but it worked on some short sample files.

You can do it all in shell:
sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -a 1 -e NA -o 1.1 1.2 2.2 file.1.sorted file.2.sorted > file.joined

Here's a Python solution:
"""merge two files based on matching first columns"""
def merge_files(file1, file2, merge_file):
    # Open all three files in one with statement so they are closed on exit.
    with open(file1) as f1, open(file2) as f2, open(merge_file, 'w') as merge:
        for line2 in f2:
            index2, value2 = line2.split(' ', 1)
            for line1 in f1:
                index1, value1 = line1.split(' ', 1)
                if index1 != index2:
                    merge.write(line1)  # no match: copy the file1 line unchanged
                    continue
                merge.write("%s %s %s" % (index1, value1[:-1], value2))
                break
        for line1 in f1:  # grab any remaining lines in file1
            merge.write(line1)
if __name__ == '__main__':
    merge_files('test1.txt','test2.txt','test3.txt')
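For output in the format the question asks for, with NA filled in where File 2 has no matching key, a streaming merge could be sketched as follows. It is a sketch under stated assumptions: both inputs are sorted the same way, every line has a key and a value, keys are unique, and every key in File 2 also occurs in File 1. The function name `merge_sorted` is hypothetical.

```python
def merge_sorted(lines1, lines2):
    """Yield 'key value1 value2' for every line of lines1, using NA when
    lines2 has no line with the same key.

    Assumptions: same sort order in both inputs, unique keys, and the keys
    of lines2 are a subset of the keys of lines1.
    """
    it2 = iter(lines2)
    current = next(it2, None)  # the pending line from file 2, or None when exhausted
    for line1 in lines1:
        key1, value1 = line1.split(None, 1)
        value1 = value1.strip()
        if current is not None and current.split(None, 1)[0] == key1:
            value2 = current.split(None, 1)[1].strip()
            current = next(it2, None)
        else:
            value2 = "NA"
        yield f"{key1} {value1} {value2}"

file1 = ["1 a", "2 b", "3 c", "4 d", "5 e"]
file2 = ["1 aa", "2 bb", "4 dd", "5 ee"]
for merged in merge_sorted(file1, file2):
    print(merged)
```

Only one line from each input is held at a time, so open file handles can be passed in place of the lists without loading either file into memory.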
