Extracting data from a file with differing numbers of columns - python

I have a tab separated file that appears as so:
NM_000014 chr12 - 36 9220303 9220778 9221335 9222340 9223083 9224954 9225248 9227155 9229351 9229941 9230296 9231839 9232234 9232689 9241795 9242497 9242951 9243796 9246060 9247568 9248134 9251202 9251976 9253739 9254042 9256834 9258831 9259086 9260119 9261916 9262462 9262909 9264754 9264972 9265955 9268359 9220435 9220820 9221438 9222409 9223174 9225082 9225467 9227379 9229532 9230016 9230453 9231927 9232411 9232773 9241847 9242619 9243078 9244025 9246175 9247680 9248296 9251352 9252119 9253803 9254270 9256996 9258941 9259201 9260240 9262001 9262631 9262930 9264807 9265132 9266139 9268558 A2M 1
NM_000016 chr1 + 12 76190031 76194085 76198328 76198537 76199212 76200475 76205664 76211490 76215103 76216135 76226806 76228376 76190502 76194173 76198426 76198607 76199313 76200556 76205795 76211599 76215244 76216231 76227055 76229363 ACADM 1
As you can tell if you scroll to the end of the lines, there are differing numbers of columns corresponding to the numbers listed. What I want to do is output the very last number before the gene name (A2M and ACADM in this case) to a file. Is there any way to do this? I've been trying to figure out a way using Unix's awk, but I don't believe it will work due to the differing numbers of columns.
Any help is appreciated

Use $(NF-1) like so, where NF is the number of fields for that line:
awk '{print $(NF-1)}' /tmp/genes.txt
A2M
ACADM
Your posted example has spaces for delimiters. You may need to change the field separator to tabs if your file is truly tab-delimited. Then it would be:
awk -F $'\t' '{print $(NF-1)}' file_name
If you want the number before that name:
$ awk '{print $(NF-2)}' /tmp/genes.txt
9268558
76229363

Try:
awk '{ print $(NF-1) }' FILE
NF always holds the number of fields for the current record, so you can use it to dynamically select a field relative to the end of the line.

Are all of your lines structured in the same manner? If so, it is pretty straightforward:
for line in myLines:
    data = line.split()[-3]
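A fuller sketch along those lines that reads the file and writes the extracted numbers out (the file names here are placeholders, not from the question):
with open("genes.txt") as fin, open("last_numbers.txt", "w") as fout:
    for line in fin:
        fields = line.split()          # splits on any whitespace, tabs included
        fout.write(fields[-3] + "\n")  # last number before the gene name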

Related

Python or Bash - How do you add words listed in a file in the middle of two sentences and put the output into another file?

I don't know how to take the first word ('set') from each line of 'file2', add a space and a name, then another space and the last word ('username' or 'admin'), doing this for each name listed in 'file1'.
file1 = [/home/smith/file1.txt]
file2 = [/home/smith/file2.txt]
file3 = file1 + file2
Example:
[file1 - Names]
smith
jerry
summer
aaron
[file2 - Sentences]
set username
set admin
[file3 - Output]
set smith username
set smith admin
set jerry username
set jerry admin
set summer username
set summer admin
set aaron username
set aaron admin
Can you be more specific about your problem? And have you already tried something? If that is the case, please share it.
The way I see it, you can open file2, read every line, and split the two words on the space (adding them to a list, for example). Then you can create a new string for every pair of words in that list: for every line in file1, take the first word from file2, add a space, add the actual line from file1, and finally add another space and the second word from file2.
You now have a new string which you can append to a new file, for example. You should probably append that string to the new file in the same loop where you create the string.
But then again, I'm not sure if this is an answer to your question.
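A minimal sketch of those steps in Python (paths taken from the question; a starting point, not a polished solution):
with open("/home/smith/file2.txt") as f2:
    pairs = [line.split() for line in f2]  # e.g. ["set", "username"]
with open("/home/smith/file1.txt") as f1, open("/home/smith/file3.txt", "w") as out:
    for line in f1:
        name = line.strip()
        for first, second in pairs:
            out.write(first + " " + name + " " + second + "\n")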
Try this one in Bash, it answers your question
#!/bin/bash
file1=".../f1.txt"
file2=".../f2.txt"
file3=".../f3.txt"
while read -r p1; do
    while read -r p2; do
        word1=$(echo "$p2" | cut -f1 -d' ')
        word2=$(echo "$p2" | cut -f2 -d' ')
        echo "$word1 $p1 $word2" >> "$file3"
    done < "$file2"
done < "$file1"
Something like this, perhaps..
names = [line.strip() for line in open("/home/smith/file1.txt")]
commands = [line.split() for line in open("/home/smith/file2.txt")]
res = []
for name in names:
    for command in commands:
        res.append(" ".join([command[0], name, command[1]]))
open("/home/smith/file3.txt", "w").write("\n".join(res) + "\n")
I'm sure this is not the prettiest way, but should work. But why do you want to do something like this...?
Yet another solution using utilities only:
join -1 2 -2 3 file1 file2 | awk '{printf "%s %s %s\n", $2, $1, $3}' > file3
(Joining on field numbers that neither file has makes every join key empty, so join emits the cross product of the two files; awk then reorders the columns into "set name username".)

Match Multiple Columns In Two Files - Output Only Those That Match Fully

File 1:
1075908|2178412|brown_eyeshorty#att.net|Claude|Desmangles
175908|2178412|naim.kazi#webtv.net|Naim|Kazi
175972|212946872418|gil_maynard#hotmail.com|Munster|Herman
175972|212946872418|meghanj4#lycos.com|Meghan|Judge
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
176086|2480881|lourdsneil#gmail.com|Lourds|Herman
File 2:
89129090|Sadiq|Islam
212946872418|Anna|Balint
255875|Charlene|Johnson
89234902|Bob|Brown
09123789|Fabio|Vanetti
I would like to extract lines where ALL values match on the following basis:
Column 2 in File 1 matches with Column 1 in File 2.
Column 4 in File 1 matches with Column 2 in File 2.
Column 5 in File 1 matches with Column 3 in File 2.
The expected output for the example is:
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
The two inputs I'm working with are both very large (11Gb and 3Gb respectively).
The only potential (messy) workaround I can think of is to combine the values to be joined into a single additional column and then use join (I'm very new to this).
grep -f <(sed 's,|,|[^|]*|,' file2) file1
Returns
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
Explanations:
First command:
sed 's,|,|[^|]*|,' file2
Transforms file2 into a list of patterns to search for in file1 and returns:
89129090|[^|]*|Sadiq|Islam
212946872418|[^|]*|Anna|Balint
255875|[^|]*|Charlene|Johnson
89234902|[^|]*|Bob|Brown
09123789|[^|]*|Fabio|Vanetti
Second command:
grep -f <(command1) file1
Searches for the patterns in file1.
Could you please try the following.
awk -F'|' '
FNR==NR{
a[$2,$4,$5]=(a[$2,$4,$5]?a[$2,$4,$5] ORS:"")$0
next
}
(($1,$2,$3) in a){
print a[$1,$2,$3]
}' Input_file1 Input_file2
Output will be as follows.
175972|212946872418|quenchia#gmail.com|Anna|Balint
176046|255875|keion#netscape.net|Charlene|Johnson
176046|255875|keion112#netscape.net|Charlene|Johnson
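With inputs of 11Gb and 3Gb, holding File 1 in memory (as the awk answer above does by reading Input_file1 first) may not be practical. A hedged Python sketch that indexes only the smaller File 2 and then streams File 1 (file names are placeholders):
# Build a lookup set from the smaller file, then stream the large one.
wanted = set()
with open("file2") as f2:
    for line in f2:
        c1, c2, c3 = line.rstrip("\n").split("|")
        wanted.add((c1, c2, c3))
with open("file1") as f1, open("matches.txt", "w") as out:
    for line in f1:
        f = line.rstrip("\n").split("|")
        if (f[1], f[3], f[4]) in wanted:  # columns 2, 4, 5 of File 1
            out.write(line)
Even this assumes the keys of the 3Gb file fit in memory; if they don't, sorting both files on a combined key and using join is the usual fallback.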

Add the first entry onto the next line and shift data after a comma & underscore from the previous line to the next line

Each value after an underscore or comma should go onto its own line, with the data from the beginning of the line up to the pipe added before it.
sample data:
1.2.4.0/24|24151_24409_24406
37.99.128.0/19|47794_47795,48695
37.142.128.0/17|21450,65555
expected result should be:
1.2.4.0/24|24151
1.2.4.0/24|24409
1.2.4.0/24|24406
37.99.128.0/19|47794
37.99.128.0/19|47795
37.99.128.0/19|48695
37.142.128.0/17|21450
37.142.128.0/17|65555
Is there a way to do it?
With awk:
awk -F '[|,_]' '{for (i=2; i<=NF; i++) print $1 "|" $i}' file
Output:
1.2.4.0/24|24151
1.2.4.0/24|24409
1.2.4.0/24|24406
37.99.128.0/19|47794
37.99.128.0/19|47795
37.99.128.0/19|48695
37.142.128.0/17|21450
37.142.128.0/17|65555
The variable NF is set to the total number of fields in the input record.
gawk solution:
awk -F'|' '{ n=split($2,a,/[_,]/); for(i=1;i<=n;i++) print $1,a[i] }' OFS='|' file
The output:
1.2.4.0/24|24151
1.2.4.0/24|24409
1.2.4.0/24|24406
37.99.128.0/19|47794
37.99.128.0/19|47795
37.99.128.0/19|48695
37.142.128.0/17|21450
37.142.128.0/17|65555
split($2,a,/[_,]/) - divides the 2nd field value into pieces separated by the pattern /[_,]/ and returns the number of elements created, n
This might work for you (GNU sed):
sed -r 's/((.*)\|[^_,]*)[_,]/\1\n\2|/;P;D' file
Replace the first _ or , with a newline followed by a copy of the prefix and a pipe. Print the first line, then repeat the process until all values in the pattern space are on their own lines.
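If you'd rather do the same expansion in Python, a small sketch (the file name is assumed):
import re
with open("file") as fh:
    for line in fh:
        prefix, rest = line.rstrip("\n").split("|", 1)
        for value in re.split(r"[_,]", rest):  # split on _ or ,
            print(prefix + "|" + value)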

How to use awk to select/remove fields from the end of a column after splitting?

In a file that has a particular column of information, I want to remove exactly 5 fields (i.e. :PG:PB:PI:PW:PC, separator is ':') from the end of the lines, not from the beginning.
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC
Assuming that the above data is from column #3 of the file, I wrote the following code:
awk 'BEGIN{FS=OFS="\t"} { split($3, a,":")} {print ($1, $2, a[1]":"a[2]":"a[3]":"a[4]":"a[5])}' awk_test.vcf
This code splits and selects the first 5 fields, but I want to remove the last 5 fields. Selecting from the first fields won't work, since certain fields like PGT and PID are inserted in certain lines. Only removing from the end works.
Expected output:
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PL
GT:AD:DP:GQ:PGT:PID:PL
GT:AD:DP:GQ:PGT:PID:PL
Thanks for helping me with the code for the first part of my question.
But the script isn't working for another file of mine, which has the following data. Here I want to update the 9th column for the same purpose. The columns are tab-separated, but what I want to do remains basically the same.
2 1463 . T TG 433.67 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=247;ExcessHet=2.9800;FS=0.000;MQ=21.25;MQRankSum=0.00;QD=33.36;ReadPosRankSum=-6.740e-01;SOR=0.784;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:76,0:76:0,0,0:./.:.:.:./.:. ./.:55,0:55:0,0,0:.:.:.:.:. ./.:68,0:68:0,0,0:.:.:.:.:. ./.:48,0:48:0,0,0:.:.:.:.:.
2 1466 . TG T 395.82 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=1.01;ClippingRankSum=0.00;DP=287;ExcessHet=5.1188;FS=7.707;MQ=18.00;MQRankSum=0.00;QD=17.21;ReadPosRankSum=1.28;SOR=0.074;set=InDels GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1467 . G T 1334.42 PASS AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=287;ExcessHet=4.8226;FS=1.328;MQ=23.36;MQRankSum=0.00;QD=28.65;ReadPosRankSum=-4.310e-01;SOR=0.566;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:95,0:95:0,0,0:./.:.:.:./.:. ./.:64,0:64:0,0,0:.:.:.:.:. ./.:75,0:75:0,0,0:.:.:.:.:. ./.:53,0:53:0,0,0:.:.:.:.:.
2 1516 . C T 5902.93 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.287;ClippingRankSum=0.00;DP=411;ExcessHet=0.5065;FS=1.489;InbreedingCoeff=0.3492;MQ=59.77;MQRankSum=0.00;QD=28.38;ReadPosRankSum=-7.100e-02;SOR=0.553;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 1/1:1,108:109:99:3935,286,0:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:78,0:78:99:0,120,1800:.:.:.:.:.
2 1584 . CT C 164.08 PASS AC=0;AF=0.00;AN=8;DP=717;ExcessHet=0.0812;FS=0.000;InbreedingCoeff=0.9386;MQ=60.00;QD=32.82;SOR=3.611;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:122,0:122:99:0,120,1800:0/0:.:.:0/0:. 0/0:172,0:172:99:0,120,1800:.:.:.:.:. 0/0:102,0:102:99:0,120,1800:.:.:.:.:. 0/0:321,0:321:99:0,120,1800:.:.:.:.:.
2 1609 . C A 604.68 PASS AC=0;AF=0.00;AN=0;DP=386;ExcessHet=0.1158;FS=0.000;InbreedingCoeff=0.8938;MQ=12.32;QD=31.09;SOR=1.061;set=SNPs GT:AD:DP:PL:PG:PB:PI:PW:PC ./.:0,0:0:0,0,0:./.:.:.:./.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:0,0:0:0,0,0:.:.:.:.:. ./.:386,0:386:0,0,0:.:.:.:.:.
2 1612 . TGTGAGCTATTTCTTTTACATTTTTCTTTAGATTCTAGGTTAAATTGTGAAGCTGATTATCTTTTTTGTTTACAG T 1298.76 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=0.1047;FS=0.000;InbreedingCoeff=0.8896;MQ=60.02;QD=29.54;SOR=1.179;set=InDels GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0:0:.:0,0,0:./.:.:.:./.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. ./.:0,0:0:.:0,0,0:.:.:.:.:. 1/1:0,3:3:99:1355,582,0:.:.:.:.:.
2 1657 . T A,* 3118.91 PASS AC=0,2;AF=0.00,1.00;AN=2;BaseQRankSum=0.578;ClippingRankSum=0.00;DP=4;ExcessHet=1.9114;FS=3.474;InbreedingCoeff=0.0821;MQ=26.68;MQRankSum=0.841;QD=28.10;ReadPosRankSum=-5.960e-01;SOR=0.821;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC ./.:0,0,0:0:.:0,0,0,0,0,0:./.:.:.:./.:. ./.:1,0,0:1:.:0,0,0,0,0,0:.:.:.:.:. ./.:0,0,0:0:.:0,0,0,0,0,0:.:.:.:.:. 2/2:0,0,3:3:99:1355,1360,1393,582,615,0:.:.:.:.:.
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=1595;ExcessHet=0.0577;FS=0.621;InbreedingCoeff=0.6496;MQ=60.00;MQRankSum=0.00;QD=5.46;ReadPosRankSum=0.307;SOR=0.773;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:389,92:481:99:1748,0,12243:0|1:.,.,.,.,.:935:|:0.5 0/0:318,0:318:99:0,120,1800:.:.:.:.:. 0/1:270,53:323:99:990,0,9096:.:.:.:.:. 0/0:473,0:473:99:0,120,1800:.:.:.:.:.
2 2781 . T G 435.07 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.624;ClippingRankSum=0.00;DP=2146;ExcessHet=3.4523;FS=8.450;InbreedingCoeff=-0.0856;MQ=60.06;MQRankSum=-4.630e-01;QD=1.27;ReadPosRankSum=-5.980e+00;SOR=1.436;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,34:343:99:0|1:2781_T_G:469,0,12941:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2786 . C G 39.69 PASS AC=0;AF=0.00;AN=8;BaseQRankSum=0.881;ClippingRankSum=0.00;DP=2145;ExcessHet=4.3933;FS=0.000;InbreedingCoeff=-0.1367;MQ=52.41;MQRankSum=-1.356e+00;QD=1.13;ReadPosRankSum=0.577;SOR=0.527;set=SNPs GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:0,120,1800:0/0:.:.:0/0:. 0/0:342,0:342:99:0,120,1800:.:.:.:.:. 0/0:492,0:492:99:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:0,120,1800:.:.:.:.:.
2 2787 . T C 993.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-2.967e+00;ClippingRankSum=0.00;DP=2153;ExcessHet=3.8663;FS=4.941;InbreedingCoeff=-0.1076;MQ=60.06;MQRankSum=-5.100e-01;QD=2.84;ReadPosRankSum=-3.689e+00;SOR=0.875;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:309,41:350:99:0|1:2781_T_G:1027,0,13619:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
2 2792 . A G 745.21 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.271;ClippingRankSum=0.00;DP=2176;ExcessHet=5.9256;FS=5.964;InbreedingCoeff=-0.2087;MQ=59.48;MQRankSum=-4.920e-01;QD=1.83;ReadPosRankSum=-3.100e-02;SOR=1.389;set=SNPs GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC 0/0:620,0:620:99:.:.:0,120,1800:0/0:.:.:0/0:. 0/1:332,41:373:99:0|1:2781_T_G:705,0,13295:.:.:.:.:. 0/0:492,0:492:99:.:.:0,120,1800:.:.:.:.:. 0/0:691,0:691:99:.:.:0,120,1800:.:.:.:.:.
I also tried adding FS/OFS parameters but it isn't working.
After some clarification of what the file looks like, here is my updated answer:
You can simply use
awk 'BEGIN{FS=OFS="\t"} {$9 = gensub(/(:[^:]+){5}$/,"","1",$9)} 1' yourfile
Here we use the standard awk field splitting, since your file is tab-separated.
We further do a regular expression replacement scoped to $9, which is the colon-separated string you want to change.
The regular expression works the same as in the old answer, in which I had the impression that the line consists only of the colon-separated string.
Old Answer
Since you wrote "pipe to python" in your comment, maybe you are open to a sed solution?
sed -r "s/(:[^:]+){5}$//" yourfile
Here we replace (s/...// replaces the ... with nothing), where the ... means:
from the end of line ($)
five ({5})
occurrences of a colon (:)
followed by one or more (+)
characters that are not a colon ([^:])
And this can again be "translated" to awk:
awk '{ $0 = gensub(/(:[^:]+){5}$/,"","1") } 1' yourfile
Maybe not the best awk solution, but it works:
awk -F: '{printf($1); for (i=2;i<=NF-5;i++) printf(":%s",$i); printf("\n"); }' file.txt
split the fields naturally according to colon
print the first field, then the other fields minus the 5 last ones (using NF, the preset number-of-fields variable), each with a leading colon.
print a linefeed to end the line.
EDIT: I knew there was a better way to do it with awk. As Lars commented, this is way simpler and cleaner:
awk -F: '{s= $1; for(i = 2; i<= NF-5;i++) s= s FS $i; print s}'
use separator value instead of hardcoded colon
compose string instead of printing all fields
print string in the end
If you want to use it within a Python script, I'd suggest that you write it in Python directly; simpler & faster:
import csv
with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter=":")
    cw = csv.writer(fw,delimiter=":")
    for row in cr:
        cw.writerow(row[:-5]) # write the row but the 5 last fields
you can omit the with part if you already have open handles.
EDIT: since you heavily edited your question after my answer, now you want to remove the 5 last "fields" from one particular column of a tab-separated file. Lars has answered properly awk-wise; let me propose my Python solution:
import csv
with open("file.txt") as fr, open("out.txt","w",newline="") as fw:
    cr = csv.reader(fr,delimiter="\t")
    cw = csv.writer(fw,delimiter="\t")
    for row in cr:
        row[8] = ":".join(row[8].split(":")[:-5]) # remove the 5 last ":"-separated fields from the 9th column
        cw.writerow(row) # write the modified row

Comparing field#1 in two adjacent lines and printing fields#2 from all the lines where field#1 is identical

Can be considered a variant of this bash question that I looked up before asking this question here.
Following is a sample file:
C=b933cda8ce0/0 p=880080
C=b933cdd6580/0 p=880080
C=b933d02a240/0 p=880080
C=b933d059610/0 p=880080
C=b933d1c8690/0 p=880080
C=b933d2c1b60/0 p=1560315
C=b933d2c1b60/0 p=880080
C=b933d32f240/0 p=1229793
C=b933d32f240/0 p=123412
Output here should be:
C=b933d2c1b60/0 p1=1560315 p2=880080
C=b933d32f240/0 p1=1229793 p2=123412
I need to print out all the values of field#2 against field#1 from all the lines where field#1 matches.
Although I got the job done using the following long one-liner, it doesn't really seem elegant/efficient to me:
d=0; q=0; cat file |while read -r c p; do if [[ $c = $d ]]; then printf "$c\t$p\t$q\n"; fi; d=$c; q=$p; done
Code could be in any of the langs/tools tagged.
awk to the rescue
awk '{
c[$1]++;
sub("p","p"c[$1],$2);
sep=(c[$1]>1)?FS:"";
a[$1]=a[$1] sep $2
}
END {
for(i in a) if (c[i]>1) print i, a[i]
}' file
Concatenate the second fields based on the key (first field). Suffix p with the index of the parameter (counted in c). There is a formatting trick (sep) to avoid double-spacing the first and second fields, and the c[i]>1 test in END prints only the keys that occur on more than one line, as in the expected output.
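For a Python take, a sketch that groups the values by key and prints only the keys that occur more than once (the file name is a placeholder):
from collections import defaultdict

groups = defaultdict(list)
with open("file") as fh:
    for line in fh:
        key, p = line.split()                 # e.g. "C=.../0", "p=880080"
        groups[key].append(p.split("=", 1)[1])
for key, values in groups.items():
    if len(values) > 1:                       # only keys whose field#1 repeats
        print(key, " ".join("p%d=%s" % (i, v) for i, v in enumerate(values, 1)))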
