I'm trying to run a Python script that pulls sequences from a separate file (merged.fas) according to a list (gene_fams_eggnog.txt) that I have as output from another program.
The code is as follows:
from Bio import SeqIO
import os, sys, re
from collections import defaultdict
sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)
gene_fams_file = open("gene_fams_eggnog.txt")
for line in gene_fams_file:
    fields = re.split("\t", line.rstrip())
    gene_fams[fields[0]].append[fields[1]]
for fam in gene_fams.keys():
    output_filename = str(fam) + ".fasta"
    outh = open(output_filename, "w")
    for id in gene_fams[fam]:
        if id in all_seqs:
            outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
        else:
            print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
            quit()
    outh.close()
The list looks like this:
1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1
Field 0 denotes a grouping based on similarity between the sequences. The script is meant to take all the sequences from merged.fas that correspond to the IDs in field 1 and write them into a file named after field 0.
So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would be written into a file called 1.fasta. This should continue with 2.fasta and so on, for however many groups there are.
This has worked, except that it doesn't include all the sequences in the group; it only includes the last one listed for that group. Using my example above, I'd only have a file (1.fasta) with one sequence (Mag_XP_003717339.1) instead of all four.
Any and all help is appreciated,
Thanks,
JT
Although I didn't spot the cause of the issue you describe, I'm surprised your code runs at all with this error:
    gene_fams[fields[0]].append[fields[1]]
i.e. append[...] instead of append(...). But perhaps that's also "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id, which shadows a Python builtin. You'll see I go to an extreme to avoid such errors:
from Bio import SeqIO
from collections import defaultdict

SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_fams_eggnog.txt"

all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
# Map each family ID (field 0) to the list of gene IDs (field 1) in it.
gene_families = defaultdict(list)
with open(FAMILY_FILE_NAME) as gene_families_file:
    for line in gene_families_file:
        family_id, gene_id = line.rstrip().split()
        gene_families[family_id].append(gene_id)
# Write one FASTA file per family, containing every sequence of that family.
for family_id, gene_ids in gene_families.items():
    output_filename = family_id + ".fasta"
    with open(output_filename, "w") as output:
        for gene_id in gene_ids:
            assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)
            output.write(all_sequences[gene_id].format("fasta"))
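To see the grouping step in isolation, here is a minimal sketch that runs on the sample list from the question and leaves Biopython out entirely. The helper name group_by_family is made up for this illustration, and it assumes each non-blank line holds exactly two whitespace-separated fields.

```python
from collections import defaultdict


def group_by_family(lines):
    """Collect gene IDs per family ID from 'family<whitespace>gene' lines."""
    families = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines instead of failing on unpacking
        family_id, gene_id = line.split()
        families[family_id].append(gene_id)
    return families


# The sample list from the question, one string per line of the file.
sample = [
    "1\tSaccharomycescerevisiae_DAA09367.1",
    "1\tbieneu_EED42827.1",
    "1\tAsp_XP_749186.1",
    "1\tMag_XP_003717339.1",
    "2\tMag_XP_003716586.1",
    "2\tMag_XP_003709453.1",
    "3\tAsp_XP_749329.1",
]
families = group_by_family(sample)
for family_id, gene_ids in families.items():
    print(family_id, len(gene_ids))
```

All four IDs of family 1 end up in the same list, so 1.fasta would receive four records rather than one.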
I have a file called 'DNASeq.txt' that contains lines of DNA. I need code that reads each line and splits it at random places (by inserting spaces). Each line needs to be split at different places.
Example: I have:
AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ
And I need something like this:
AAA ADSF DFAFDDSAF ADF ADSF AFD AFAD
I have tried (I'm very new to Python!):
import random
for x in range(10):
    print(random.randint(50,250))
but that just prints random numbers. Is there some way to get a generated random number as a variable?
You can read the file line by line, write each line character by character to a new file, and insert spaces randomly:
Create a demo file without spaces:
with open("t.txt","w") as f:
    f.write("""ASDFSFDGHJEQWRJIJG
ASDFJSDGFIJ
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGO""")
Read and rewrite demo file:
import random
max_no_space = 9 # force a space once 9 characters in a row were written without one
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
    for line in f:
        for c in line:
            w.write(c)
            if random.randint(1,6) == 1 or no_space >= max_no_space:
                w.write(" ")
                no_space = 0
            else:
                no_space += 1
with open("n.txt") as k:
    print(k.read())
Output:
ASDF SFD GHJEQWRJIJG
A SDFJ SDG FIJ
SADFJSD FJ JDSFJIDFJG I JSRGJSDJ FIDJFG
The pattern of spaces is random. You can influence it by setting max_no_space, or remove the randomness to split after max_no_space characters every time.
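For the non-random variant, a slicing-based sketch may be simpler than counting characters. The function name split_fixed and the width of 9 are only illustrative choices, not part of the code above.

```python
def split_fixed(sequence, width):
    """Insert a space after every `width` characters (no randomness)."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return " ".join(chunks)


# First line of the demo file, cut into blocks of 9 characters.
result = split_fixed("ASDFSFDGHJEQWRJIJG", 9)
print(result)
```

Because the chunks are joined afterwards, no trailing space is written at the end of the line.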
Edit:
Writing one character at a time is not very economical if you need to read 200+ characters en bloc. If you want longer chunks, you can add a minimum chunk length to the same code like so:
with open("t.txt","w") as f:
    f.write("""ASDFSFDGHJEQWRJIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGG
ASDFJSDGFIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJK
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJF
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG""")
import random
min_no_space = 10
max_no_space = 20 # if max sequence length without space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
    for line in f:
        for c in line:
            w.write(c)
            if no_space > min_no_space:
                if random.randint(1,6) == 1 or no_space >= max_no_space:
                    w.write(" ")
                    no_space = 0
            else:
                no_space += 1
with open("n.txt") as k:
    print(k.read())
Output:
ASDFSFDGHJEQ WRJIJSADFJSDF JJDSFJIDFJGIJ SRGJSDJFIDJFGG
ASDFJSDGFIJSA DFJSDFJJDSFJIDF JGIJSRGJSDJFIDJ FGSADFJSDFJJ DSFJIDFJGIJK
SADFJ SDFJJDSFJIDFJG IJSRGJSDJFIDJ FGSADFJSDFJJDS FJIDFJGIJSRG JSDJFIDJF
SDFJG IKDSFGOROHPTLPASDMKFGD OKRAMGSADFJSDF JJDSFJIDFJGI JSRGJSDJFIDJFG
If you want to split your DNA a fixed number of times (10 in my example), here's what you could try:
import random
DNA = 'AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ'
splitted_DNA = ''
# Pick 10 distinct cut positions and walk through them in ascending order.
for split_idx in sorted(random.sample(range(len(DNA)), 10)):
    splitted_DNA += DNA[len(splitted_DNA)-splitted_DNA.count(' ') :split_idx] + ' '
splitted_DNA += DNA[split_idx:]
print(splitted_DNA) # e.g. -> AAACCCHT HTH D AF HD SA F JANFAJDSNFA DK FAFJ
import random
with open('source', 'r') as in_file:
    with open('dest', 'w') as out_file:
        for line in in_file:
            newLine = ''.join(map(lambda x:x+' '*random.randint(0,1), line)).strip() + '\n'
            out_file.write(newLine)
Since you mentioned being new, I'll try to explain:
I'm writing the new sequences to another file as a precaution. It's not safe to write to the file you are reading from.
The with statement is there so that you don't need to explicitly close the files you opened.
Files can be read line by line using a for loop.
''.join() concatenates an iterable of strings into a single string.
map() applies a function to every element of an iterable (here, every character of the line) and returns the results; in Python 3 that is a lazy iterator rather than a list.
lambda is how you define a function without naming it. lambda x: 2*x doubles the number you feed it.
x + ' ' * 3 adds 3 spaces after x. random.randint(0, 1) returns either 0 or 1, so I'm randomly choosing whether to add a space after each character. If random.randint() returns 0, no space is added.
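The building blocks described above can be tried one at a time in a small sketch like this. The sample strings and numbers are arbitrary, and list() is only there so the result of map() can be printed in Python 3.

```python
import random

# ''.join() glues an iterable of strings together.
joined = "".join(["A", "C", "G"])
print(joined)

# map() applies a function to every element; list() materialises the result.
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
print(doubled)

# Multiplying a string by 0 gives the empty string, by 1 the string itself.
no_space = "A" + " " * 0
one_space = "A" + " " * 1
print(repr(no_space), repr(one_space))

# Putting it together: the letters survive, only spaces are added.
line = "AAACCCHTHT"
spaced = "".join(map(lambda x: x + " " * random.randint(0, 1), line)).strip()
print(spaced)
```

Whatever the random draws are, removing the spaces from spaced gives back the original line.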
You can toss a coin after each character to decide whether to add a space there or not.
This function takes a string as input and returns it with spaces inserted at random places.
from random import randint

def insert_random_spaces(text):
    # After each character, append one space with probability 1/2.
    return "".join(char + randint(0, 1) * " " for char in text)
We have a .dat file that is completely unreadable in a text editor. This .dat file holds information about materials and is used on a machine that can display the data.
Now we are going to use the files for other purposes and need to convert them to a readable format.
When I open the .dat file in Notepad++, this is what I see:
2ñg½x¹¾T?B½ÛÁ#½^fÓ¼":°êȽ¸ô»YY‚½g *½$S)¼¤“è¼F„J¼c1¼$ ¼*‡Ç»½Ú7¼F]S¼Ê(Ï<(‚¤½Y´½å½#ø;N‡;o¸¼¨*S:ΣC¼ÎÀR½žO<š_å¼T÷½p4>½8«»«=ýÆZ<¿[=²”<æt¼pc»q³<×R<ï4¼}Ž‚8pýw<~ï»z†:Qš¼^Kp;XI=<Ѷ ¼
j½…é-=*Ý=;-X7½ßÓ:<ÐZ<Ás!=²LÀ;æã=võu<„4½§V9„燺ý$D<"Š|»å€,<E{=+»¥;2wN¼¸rF=h®<ç[=²=é\=Îý<…À¦¼Î,è¼u…<#_.¼¾Ã¨9æ3½Å°“<ª×½°ÇD¼JÝþ»ph{=Ÿv8;Ne¼’Q; ´{»(ì¿<6Þï»éõ¼*p½©m¼ÝM–<ròä¼½™™¼Õö=j|½±‰Í;2¥C¼¯ 輓?½>¼:„3» ù¼¦k
¼wÞ¹¼Öm‚»=T¼êy¦¼k[…»ÎÉO¼Žc¼$ï½ÖN;H¼4Ø:8¸ž¼dLý¼ø9ø»cI(;4뼈Q¼ž7½,h?¼À ɼy½Å’œ¼¶Åº¼å"±¼bžu¼ Z;½¨½øáY¼ZÖ»2
½ð^š<Þ„§<»ƒ<#±c<f<ŸPÝ;‹œlºÐöï»ö²ñ;ÜŠb=¦';f´<ò=¬3B<\mÛ¼¹©»åB<»Xô;€ºp»¸ ±¼‰Øâ¼7Ug¼€÷ø¼lËû»j}»²‘ô;wu½®ö²¼Ÿ„¼ŠÉ¼ÖV8 Š¼‹÷¯¼ål¼é°ª¼‹o4½ðî$<4Q:.A<
<Ž¬ë<^·G<n
œ<¶l<: è;’MÜ9êÁa<’¢T;~&¼gY®»"P¼¤µº;$H=½…o<6ëæ»ûÒ¼Ê,<‚p½¯À¼#êw»Ír¥¼¸wغA:«<TDI»Nºµ<€ŠMºwnܸ·6:CÕj<àÆ:Dr<7ëo9STÏ<G¼R?M<:)N;.3 <†L<ºZ=I,Y<ñF;iÙ.» pºå0<;:=Tʪ;—ÄË;?'й0Ž:J’J<jR¯»´/½Ô”ؼ•¥˜¼hμd™<9¼iˆ‘<(Šd<ɇÖ#·³È‚»#O><Úo<Ó¸ <ëî;ÒQ<õöî<#Nm¼öw4¼’O¼v <:3<
We know the data in the .dat file has the following format:
MaterialBase ThicknessBase ThicknessIterated Pixel Value
Plastic 0 0 1 -5.662651e-02
Plastic 0 0 2 -1.501216e-01
Plastic 0 0 3 -4.742368e-02
After searching through lots of code, including here of course, I came up with the following code:
import time
import binascii
import csv
import serial
import numpy as np
with open('validationData.dat.201805271617', 'rb') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)
with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
The error I get now is:
  File "c:\Users\joost.bazelmans\Documents\python\dat2csv.py", line 15, in <module>
    newLine = line.strip('|').split()
TypeError: a bytes-like object is required, not 'str'
It looks like the script is reading the file, but then it cannot split it by the | character. I am lost now. Any ideas on how to continue?
Edit 2018-07-23 13:00
Based on guidot's reply, we tried working with struct. We are now able to get a list of floating-point values out of the .dat file. This is also what the R script does, but the R script then goes on to translate the floats into readable data.
import struct
datavector = []
with open('validationData.dat.201805271617', "rb") as f:
    n = 1000000
    count = 0
    byte = f.read(4) # same as size = 4 in R
    while count < n and byte != b"":
        datavector.append(struct.unpack('f',byte))
        count += 1
        byte = f.read(4)
print(datavector)
The result we get is something like this:
[(-0.05662650614976883,), (-0.1501215696334839,), (-0.047423675656318665,), (-0.04705987498164177,), (-0.025805648416280746,), (0.0006194132147356868,), (-0.09810388088226318,), (-0.007468236610293388,), (-0.06364697962999344,), (-0.04153480753302574,), (-0.010334763675928116,), (-0.028390713036060333,), (-0.01236063800752163,), (-0.010809036903083324,), (-0.0195484422147274,), (-0.006089110858738422,), (-0.011221584863960743,), (-0.012900656089186668,), (0.02528800442814827,), (-0.0803263783454895,), (-0.03630480542778969,), (-0.03244496509432793,), (0.007571130990982056,), (0.004120028577744961,), (-0.022513896226882935,), (0.0008055367507040501,), (-0.011940909549593925,), (-0.05145340412855148,), (0.008258728310465813,), (-0.02799968793988228,), (-0.035880401730537415,), (-0.04643672704696655,), (-0.005221989005804062,), (0.03542486950755119,), (0.013353106565773487,), (0.035976167768239975,), (0.008336232975125313,), (-0.01492307148873806,), (-0.003470425494015217,), (0.02190450392663479,), (0.012822589837014675,), (-0.008801682852208614,), (6.225423567229882e-05,), (0.015136107802391052,), (-0.007297097705304623,), (0.0010259768459945917,), (-0.018891485407948494,), (0.0036666016094386578,), (0.01155313104391098,), (-0.009809211827814579,), (-0.03696637228131294,), (0.04245902970433235,), (0.002897093538194895,), (-0.04476182535290718,), (0.011403053067624569,), (0.01330728828907013,), (0.03941703215241432,), (0.005868517793715,), (0.031955622136592865,), (0.015012135729193687,), (-0.0439620167016983,), (0.00014146660396363586,), (-0.0010368679650127888,), (0.011971709318459034,), (-0.003853448200970888,), (0.010528777725994587,), (0.06129004433751106,), (0.00505771255120635,), (-0.012601660564541817,), (0.01481446623802185,), (0.019019771367311478,), (0.004633020609617233,), (-0.021741455420851707,), (-0.033449672162532806,), (-0.021316081285476685,), (0.00593474181368947,), (0.0030296281911432743,), (0.023055575788021088,), (0.0256675872951746,), 
(0.03663543984293938,), (0.044298700988292694,), (0.01264342200011015,), (0.032493121922016144,), (-0.06546197831630707,), (0.031123168766498566,), (0.005013703368604183,), (-0.006611336953938007,), (-0.041526272892951965,), (0.0007577596697956324,), (0.030475322157144547,), (0.034476157277822495,), (-0.015037396922707558,), (0.07587681710720062,)]
Now the question is how to convert these floating-point numbers into readable content.
Since you opened your file in binary mode (mode='rb'), I think you probably just need to pass a bytes object to strip():
newLine = line.strip(b'|').split()
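If the file really is nothing but a run of 4-byte floats, which the struct experiment in the question suggests (its first three values match the Value column of the known format), a sketch along these lines could produce readable rows. The little-endian float32 layout, the 1-based pixel numbering, and the helper name floats_to_rows are assumptions. The material and thickness columns do not appear to be stored in the bytes shown, so they are left out here.

```python
import csv
import io
import struct


def floats_to_rows(raw):
    """Decode little-endian float32 values and number them from 1."""
    usable = len(raw) - len(raw) % 4  # ignore a trailing partial value
    values = struct.iter_unpack("<f", raw[:usable])
    return [(pixel, value[0]) for pixel, value in enumerate(values, start=1)]


# Stand-in bytes so the sketch runs without the real .dat file.
raw = struct.pack("<3f", -5.662651e-02, -1.501216e-01, -4.742368e-02)
rows = floats_to_rows(raw)

# Write the rows as CSV text; a real script would open an output file instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Pixel", "Value"])
for pixel, value in rows:
    writer.writerow([pixel, "{:.6e}".format(value)])
print(out.getvalue())
```

For the real data, raw would come from open('validationData.dat.201805271617', 'rb').read(). Whether the remaining columns can be reconstructed depends on information that is not visible in the file excerpt.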
I'm trying to edit a CSV file using information from another one. It doesn't seem simple to me, as I have to filter on multiple things. Let me explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. The output CSV file should have the same pattern as origin.csv, but with corrected values.
I want to replace the trip_headsign column fields in origin.csv using the forward_line_name column of patch.csv if the direction_id field in the origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between the two ":" symbols is the same as the part of the route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not just some parts, especially since I sometimes have to look at only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundreds of lines that I need to parse and edit this way.
The separator in my CSV files is a comma.
Based on mhopeng's answer to a previous question, I arrived at this code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys
# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
    line1 = f.readline().split(',')
    route_id = line1[0].split(':')[1] # '210210109'
    route_forward = line1[3]
    route_backward = line1[5]
    line_code = line1[1]
    # process origin.csv and replace lines in-place
    for line in fileinput.input(sys.argv[2], inplace=1):
        line2 = d.readline().split(',')
        num_route = line2[0].split(':')[0]
        # prevent lines with same route_id but different line_code to be considered as the same line
        if line.startswith(route_id) and (num_route == line_code):
            if line.startswith(route_id):
                newline = line.split(',')
                if newline[4] == 0:
                    newline[3] = route_backward
                else:
                    newline[3] = route_forward
                print('\t'.join(newline),end="")
        else:
            print(line,end="")
Unfortunately, that doesn't put the right forward_line_name or backward_line_name in trip_headsign (forward is always used), the condition comparing the patch.csv line_code to the end of the origin.csv route_id (after the ":") doesn't work, and the script finally triggers this error before it finishes parsing the file:
Traceback (most recent call last):
  File "./GTFS_enhancer_headsigns.py", line 28, in <module>
    if newline[4] == 0:
IndexError: list index out of range
Could you please help me fix these three problems?
Thanks for your help :)
You really should consider using the Python csv module instead of split().
In my experience, everything is much easier when working with CSV files through the csv module.
This way you can iterate through the dataset in a structured way, without the risk of index-out-of-range errors.
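As a rough sketch of what that could look like for this task (written for Python 3, not the python2 of the question): the function names build_patch_index and patch_headsigns are invented here, the in-memory CSV text stands in for patch.csv and origin.csv, and the two line names are placeholders. Matching on both the route number and the line_code is an assumption based on the conditions described in the question.

```python
import csv
import io


def build_patch_index(patch_rows):
    """Map (route number, line_code) to (forward name, backward name)."""
    index = {}
    for row in patch_rows:
        # "OIF:210210109:001OIF30" -> "210210109"
        route_number = row["line_id"].split(":")[1]
        index[(route_number, row["line_code"])] = (
            row["forward_line_name"],
            row["backward_line_name"],
        )
    return index


def patch_headsigns(origin_rows, index):
    """Yield origin rows, replacing trip_headsign where a patch row matches."""
    for row in origin_rows:
        # "210210109:001" -> ("210210109", "001")
        route_number, _, line_code = row["route_id"].partition(":")
        names = index.get((route_number, line_code))
        if names is not None:
            forward, backward = names
            # csv yields strings, so compare direction_id with "0", not 0.
            row["trip_headsign"] = forward if row["direction_id"] == "0" else backward
        yield row


# Trimmed-down stand-ins for patch.csv and origin.csv (placeholder names).
patch_csv = io.StringIO(
    "line_id,line_code,forward_line_name,backward_line_name\n"
    "OIF:210210109:001OIF30,001,FORWARD NAME,BACKWARD NAME\n"
)
origin_csv = io.StringIO(
    "route_id,service_id,trip_id,trip_headsign,direction_id,block_id\n"
    "210210109:001,2913,70405957139549,70405957,0,\n"
    "210210109:001,2916,70405961139553,70405961,1,\n"
    "999999999:002,2916,70405961139554,70405962,1,\n"
)

index = build_patch_index(csv.DictReader(patch_csv))
reader = csv.DictReader(origin_csv)
patched = list(patch_headsigns(reader, index))

# Write the result with the same columns, in the same order, as the input.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(patched)
print(out.getvalue())
```

Reading by column name avoids the positional indexing that raised the IndexError, and comparing direction_id as a string addresses the "forward is always used" symptom, since a string read from a CSV file never equals the integer 0. Rows without a matching patch entry pass through unchanged.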