Python CSV module: add column to the side, not the bottom

I am new to Python and I need some help. I made a Python script that takes two columns from a file and copies them into a new file. However, every now and then I need to add columns to that new file, and they need to go at the side, not the bottom; my script adds them at the bottom. Someone suggested using the csv module, and I read about it, but I can't get it to add the new column at the side of the previous columns. Any help is highly appreciated.
Here is the code that I wrote:
import sys
import re

filetoread = sys.argv[1]
filetowrite = sys.argv[2]
newfile = str(filetowrite) + ".txt"
openold = open(filetoread, "r")
opennew = open(newfile, "a")
rline = openold.readlines()
number = int(len(rline))
start = 0
for i in range(len(rline)):
    if "2theta" in rline[i]:
        start = i
for line in rline[start + 1 : number]:
    words = line.split()
    word1 = words[1]
    word2 = words[2]
    opennew.write(word1 + " " + word2 + "\n")
openold.close()
opennew.close()
Here is the second version I wrote, using the csv module:
import sys
import re
import csv

filetoread = sys.argv[1]
filetowrite = sys.argv[2]
newfile = str(filetowrite) + ".txt"
openold = open(filetoread, "r")
rline = openold.readlines()
number = int(len(rline))
start = 0
for i in range(len(rline)):
    if "2theta" in rline[i]:
        start = i
words1 = []
words2 = []
for line in rline[start + 1 : number]:
    words = line.split()
    word1 = words[1]
    word2 = words[2]
    words1.append([word1])
    words2.append([word2])
with open(newfile, 'wb') as file:
    writer = csv.writer(file, delimiter="\n")
    writer.writerow(words1)
    writer.writerow(words2)
These are some samples of input files:
https://dl.dropbox.com/u/63216126/file5.txt
https://dl.dropbox.com/u/63216126/file6.txt
My first script works almost great, except that it writes the new columns at the bottom, and I need them at the side of the previous columns.

The proper way to use writerow is to give it a single list that contains the data for all the columns.
words.append(word1)
words.append(word2)
writer.writerow(words)
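If the goal is to add a column beside columns already in the file, one approach — a minimal sketch in Python 3, where "existing.txt" and the new_column values are placeholders, the existing file is assumed whitespace-separated, and new_column is assumed to have one entry per row — is to read the existing rows, zip them with the new column, and rewrite the file:
import csv

# read the current rows, one list of fields per line
with open("existing.txt") as f:
    rows = [line.split() for line in f]

new_column = ["val1", "val2", "val3"]  # placeholder values, one per row

with open("existing.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for row, value in zip(rows, new_column):
        writer.writerow(row + [value])  # the new value goes at the side, not the bottom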

Related

Lines missing in python

I am writing code in Python to remove all the text after a specific word, but lines are missing from the output. I have a text file in Unicode which has 3 lines:
my name is test1
my name is
my name is test 2
What I want is to remove the text after the word "test", so that I get the output below:
my name is test
my name is
my name is test
I have written code that does the task, but it also removes the second line, "my name is".
My code is below:
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index > 0:
txt += line[:index + len(splitStr)] + "\n"
with open(r"test.txt", "w") as fp:
fp.write(txt)
It looks like when the keyword is not found, the index becomes -1, so you are skipping the lines without the keyword.
I would modify your if by adding a condition as follows:
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index > 0:
txt += line[:index + len(splitStr)] + "\n"
elif index < 0:
txt += line
with open(r"test.txt", "w") as fp:
fp.write(txt)
No need to add \n because the line already contains it.
Your code does not append the line if splitStr is not found.
txt = ""
with open(r"test.txt", 'r') as fp:
for line in fp.readlines():
splitStr = "test"
index = line.find(splitStr)
if index != -1:
txt += line[:index + len(splitStr)] + "\n"
else:
txt += line
with open(r"test.txt", "w") as fp:
fp.write(txt)
In my solution I simulate the input file via io.StringIO. Compared to your code, my solution removes the else branch and uses only one += operator. Also, splitStr is set only once instead of on each iteration. This makes the code clearer and reduces possible error sources.
import io

# simulates a file for this example
the_file = io.StringIO("""my name is test1
my name is
my name is test 2""")

txt = ""
splitStr = "test"

with the_file as fp:
    # each line
    for line in fp.readlines():
        # cut something?
        if splitStr in line:
            # find index
            index = line.find(splitStr)
            # cut after 'splitStr' and add newline
            line = line[:index + len(splitStr)] + "\n"
        # append line to output
        txt += line

print(txt)
When working with files in Python 3, it is recommended to use pathlib, like this:
import pathlib

file_path = pathlib.Path("test.txt")
# read from the file
with file_path.open('r') as fp:
    content = fp.read()  # do something with the content
# write back to the file
with file_path.open('w') as fp:
    fp.write(content)  # write the (possibly modified) text back
Suggestion:
for line in fp.readlines():
    i = line.find('test')
    if i != -1:
        line = line[:i + len('test')] + '\n'
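Putting that suggestion together into a complete, minimal sketch under the same assumptions as the question (test.txt is read and rewritten in place):
splitStr = "test"
out_lines = []
with open("test.txt") as fp:
    for line in fp:
        i = line.find(splitStr)
        if i != -1:
            line = line[:i + len(splitStr)] + "\n"  # keep the keyword, drop the rest
        out_lines.append(line)
with open("test.txt", "w") as fp:
    fp.writelines(out_lines)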

Find length of a contig in one fasta, using the header of another fasta as query in python

I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like to save the output of this script as a variable to then use later on in the same pipeline.
Below is an updated version of the script using code provided by Lucía Balestrazzi.
Additional information: The following with-statement is nested inside a larger for-loop that cycles through subsamples of an original genome. The first subsample fasta in my directory has a single sequence ">chr1:0-40129801" with a length of 40129801. I'm trying to write out a text file "OUTPUT" that has some basic information about each subsample fasta. This text file will be used as an input for another program downstream.
Header names in the original fasta file are chr1, chr2, etc... while the header names in the subsample fastas are something along the lines of:
batch1.fa >chr1:0-40k
batch2.fa >chr1:40k-80k
...etc...
import re
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta"))  # not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN:
    for LINE in FIN:
        if LINE.startswith('>'):
            # Example of "LINE"... >chr1:0-40129801
            HEADER = re.sub('>', '', LINE)
            # HEADER = chr1:0-40129801
            HEADER2 = re.sub('\n', '', HEADER)
            # HEADER2 = chr1:0-40129801 (no return character on the end)
            CONTIG = HEADER2.split(":")[0]
            # CONTIG = chr1
            PART2_HEADER = HEADER2.split(":")[1]
            # PART2_HEADER = 0-40129801
            START = int(PART2_HEADER.split("-")[0])
            # START = 0
            END = int(PART2_HEADER.split("-")[1])
            # END = 40129801
            LENGTH = END - START
            # LENGTH = 40129801 minus 0 = 40129801
            # This is where I'm stuck...
            ORIGINAL_CONTIG_LENGTH = len(record_dict[CONTIG])  # This returns "KeyError: 1"
            # ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample).
            OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
            # OUTPUT = 0 chr1:0-40129801 40129801 chr1 223705999
OUTPUT.close()
I'm relatively new to bioinformatics. I know I'm messing up on how I'm using the dictionary, but I'm not quite sure how to fix it.
Any advice would be greatly appreciated. Thanks!
You can do it this way:
import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
print(len(record_dict["chr1"]))
or
import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
seq = record_dict["chr1"]
print(len(seq))
EDIT: Alternative code
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
names = record_dict.keys()
for HEADER in names:
    # HEADER = chr1:0-40129801
    ORIGINAL_CONTIG_LENGTH = len(record_dict[HEADER])
    CONTIG = HEADER.split(":")[0]
    # CONTIG = chr1
    PART2_HEADER = HEADER.split(":")[1]
    # PART2_HEADER = 0-40129801
    START = int(PART2_HEADER.split("-")[0])
    END = int(PART2_HEADER.split("-")[1])
    LENGTH = END - START
The idea is that you define the dict once, get its keys (all the contig headers) and store them in a variable, and then loop through the headers extracting the info you need. No need to loop through the file.
Cheers
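As a side note, a small sketch — assuming Biopython and a genome.fa with plain chr1-style headers — using dict.get avoids the KeyError and makes it easier to see which key is actually missing:
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
record = record_dict.get("chr1")  # returns None instead of raising KeyError
if record is not None:
    print(len(record))
else:
    print("chr1 not found; some available keys:", list(record_dict.keys())[:5])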
This works; I just changed the "CONTIG" variable to a string. Thanks Lucía for all your help over the last couple of days!
import re
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta"))  # not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN:
    for LINE in FIN:
        if LINE.startswith('>'):
            # Example of "LINE"... >chr1:0-40129801
            HEADER = re.sub('>', '', LINE)
            # HEADER = chr1:0-40129801
            HEADER2 = re.sub('\n', '', HEADER)
            # HEADER2 = chr1:0-40129801 (no return character on the end)
            CONTIG = HEADER2.split(":")[0]
            # CONTIG = chr1
            PART2_HEADER = HEADER2.split(":")[1]
            # PART2_HEADER = 0-40129801
            START = int(PART2_HEADER.split("-")[0])
            # START = 0
            END = int(PART2_HEADER.split("-")[1])
            # END = 40129801
            LENGTH = END - START
            # LENGTH = 40129801 minus 0 = 40129801
            ORIGINAL_CONTIG_LENGTH = len(record_dict[str(CONTIG)])
            # ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample).
            OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
            # OUTPUT = 0 chr1:0-40129801 40129801 chr1 223705999
OUTPUT.close()

How to check if two different strings from two different files are present in a third file?

I want to check whether two strings, one from each of two different files, are present in a third file, and if they are, write that line to a fourth file. The strings are IPv4 addresses. I am getting an empty file even though the strings are present in both files. I would also like to use multi-threading/multiprocessing to speed up the process, if possible. Thank you so much for any suggestion/help in advance.
slave_list text file has entry as below:
hostA 192.168.15.32
hostB 192.168.15.33
hostC 192.168.15.37
static_ip_list text file has entry as below:
192.168.100.10
192.168.100.12
192.168.100.14
slave_logfile has entry as below:
1536043051.176 59320 192.168.100.10 TCP_MISS/200 21830 CONNECT www.google.com:443 - 192.168.15.32
Code:
from datetime import datetime, timedelta
import os
import string
import sys

slave_list = sys.argv[1]
static_ip_list = sys.argv[2]
append_log = open('/home/top10_domain_accessed/logs/append_logs.txt', 'a')

def file_path(slave_list):
    count = 1
    while count <= 30:
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave_name = string[0]
                slave_ip = string[1]
                log_path = "/LOGS/%s/%s" % (slave_name, yr_month)
                slave_logfile = os.path.join(log_path, file_name)
                if os.path.exists(slave_logfile):
                    log_read = open(slave_logfile, 'r')
                    for line in log_read:
                        if slave_ip in line:
                            with open(static_ip_list) as ip_list:
                                for static_ip in ip_list:
                                    static_ip = static_ip.rstrip()
                                    if static_ip in line:
                                        append_log.write(line + '\n')
                                    else:
                                        pass
        count = count + 1

if __name__ == '__main__':
    file_path(slave_list)
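For what it's worth, a minimal sketch of just the matching logic using sets — the 'slave.log' and 'append_logs.txt' paths here are placeholders, and the file formats are assumed to match the samples above. Loading static_ip_list into a set once avoids re-reading it for every log line, and set membership checks are fast:
import sys

slave_list = sys.argv[1]
static_ip_list = sys.argv[2]

# Read the static IPs once into a set for O(1) membership checks.
with open(static_ip_list) as f:
    static_ips = {line.strip() for line in f if line.strip()}

# The second column of each slave_list line is the slave IP.
with open(slave_list) as f:
    slave_ips = {line.split()[1] for line in f if line.strip()}

with open('slave.log') as log, open('append_logs.txt', 'a') as out:
    for line in log:
        fields = set(line.split())
        # Keep the line only if it mentions both a slave IP and a static IP.
        if fields & slave_ips and fields & static_ips:
            out.write(line)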

Pick parts from a txt file and copy to another file with python

I'm in trouble here. I need to read a .txt file that contains a sequence of records, check the records, and copy the ones I want to a new file.
The file content is like this (this is just an example, the original file has more than 30 000 lines):
AAAAA|12|120 #begin file
00000|46|150 #begin register
03000|TO|460
99999|35|436 #end register
00000|46|316 #begin register
03000|SP|467
99999|33|130 #end register
00000|46|778 #begin register
03000|TO|478
99999|33|457 #end register
ZZZZZ|15|111 #end file
The records that begin with 03000 and have the characters 'TO' must be written to a new file. Based on the example, the file should look like this:
AAAAA|12|120 #begin file
00000|46|150 #begin register
03000|TO|460
99999|35|436 #end register
00000|46|778 #begin register
03000|TO|478
99999|33|457 #end register
ZZZZZ|15|111 #end file
Code:
file = open("file.txt",'r')
newFile = open("newFile.txt","w")
content = file.read()
file.close()
# here I need to check if the record exists 03000 characters 'TO', if it exists, copy the recordset 00000-99999 for the new file.
I did multiple searches and found nothing to help me.
Thank you!
with open("file.txt",'r') as inFile, open("newFile.txt","w") as outFile:
outFile.writelines(line for line in inFile
if line.startswith("03000") and "TO" in line)
If you need the previous and the next line, then you have to iterate inFile in triads. First define:
def gen_triad(lines, prev=None):
    after = current = next(lines)
    for after in lines:
        yield prev, current, after
        prev, current = current, after
Then use it as before:
outFile.writelines(''.join(triad) for triad in gen_triad(inFile)
                   if triad[1].startswith("03000") and "TO" in triad[1])
import re

pat = ('^00000\|\d+\|\d+.*\n'
       '^03000\|TO\|\d+.*\n'
       '^99999\|\d+\|\d+.*\n'
       '|'
       '^AAAAA\|\d+\|\d+.*\n'
       '|'
       '^ZZZZZ\|\d+\|\d+.*')
rag = re.compile(pat, re.MULTILINE)

with open('fifi.txt', 'r') as f,\
     open('newfifi.txt', 'w') as g:
    g.write(''.join(rag.findall(f.read())))
For files with additional lines between lines beginning with 00000, 03000 and 99999, I didn't find simpler code than this one:
import re

pat = ('(^00000\|\d+\|\d+.*\n'
       '(?:.*\n)+?'
       '^99999\|\d+\|\d+.*\n)'
       '|'
       '(^AAAAA\|\d+\|\d+.*\n'
       '|'
       '^ZZZZZ\|\d+\|\d+.*)')
rag = re.compile(pat, re.MULTILINE)

pit = '^00000\|.+?^03000\|TO\|\d+.+?^99999\|'
rig = re.compile(pit, re.DOTALL | re.MULTILINE)

def yi(text):
    for g1, g2 in rag.findall(text):
        if g2:
            yield g2
        elif rig.match(g1):
            yield g1

with open('fifi.txt', 'r') as f,\
     open('newfifi.txt', 'w') as g:
    g.write(''.join(yi(f.read())))
file = open("file.txt",'r')
newFile = open("newFile.txt","w")
content = file.readlines()
file.close()
newFile.writelines(filter(lambda x:x.startswith("03000") and "TO" in x,content))
This seems to work. The other answers seem to only write out the records that contain '03000|TO|', but you have to write out the record before and after that as well.
import sys
# ---------------------------------------------------------------
# ---------------------------------------------------------------
# import file
file_name = sys.argv[1]
file_path = 'C:\\DATA_SAVE\\pick_parts\\' + file_name
file = open(file_path, "r")
# ---------------------------------------------------------------
# create output files
output_file_path = 'C:\\DATA_SAVE\\pick_parts\\' + file_name + '.out'
output_file = open(output_file_path, "w")
# create output files
# ---------------------------------------------------------------
# process file
temp = ''
temp_out = ''
good_write = False
bad_write = False
for line in file:
    if line[:5] == 'AAAAA':
        temp_out += line
    elif line[:5] == 'ZZZZZ':
        temp_out += line
    elif good_write:
        temp += line
        temp_out += temp
        temp = ''
        good_write = False
    elif bad_write:
        bad_write = False
        temp = ''
    elif line[:5] == '03000':
        if line[6:8] != 'TO':
            temp = ''
            bad_write = True
        else:
            good_write = True
            temp += line
            temp_out += temp
            temp = ''
    else:
        temp += line
output_file.write(temp_out)
output_file.close()
file.close()
Output:
AAAAA|12|120 #begin file
00000|46|150 #begin register
03000|TO|460
99999|35|436 #end register
00000|46|778 #begin register
03000|TO|478
99999|33|457 #end register
ZZZZZ|15|111 #end file
Does it have to be Python? These shell commands would do the same thing in a pinch (note that grep -C inserts a -- separator line between non-adjacent match groups, which you may need to strip):
head -1 inputfile.txt > outputfile.txt
grep -C 1 "03000|TO" inputfile.txt >> outputfile.txt
tail -1 inputfile.txt >> outputfile.txt
# Whenever I have to parse text files I prefer to use regular expressions
# You can also customize the matching criteria if you want to
import re
what_is_being_searched = re.compile("^03000.*TO")
# don't use "file" as a variable name since it is (was?) a builtin
# function
with open("file.txt", "r") as source_file, open("newFile.txt", "w") as destination_file:
for this_line in source_file:
if what_is_being_searched.match(this_line):
destination_file.write(this_line)
and for those who prefer a more compact representation:
import re
with open("file.txt", "r") as source_file, open("newFile.txt", "w") as destination_file:
destination_file.writelines(this_line for this_line in source_file
if re.match("^03000.*TO", this_line))
Code:
fileName = '1'
fil = open(fileName, 'r')
import string

## step 1: parse the file.
parsedFile = []
for i in fil:
    ## tuple1 = (1,2,3)
    firstPipe = i.find('|')
    secondPipe = i.find('|', firstPipe + 1)
    tuple1 = (i[:firstPipe],
              i[firstPipe + 1:secondPipe],
              i[secondPipe + 1:i.find('\n')])
    parsedFile.append(tuple1)
fil.close()

## search criteria:
searchFirst = '03000'
searchString = 'TO'  ## can be changed if and when required

## step 2: use the parsed contents to write the new file
filout = open('newFile', 'w')
stringToWrite = parsedFile[0][0] + '|' + parsedFile[0][1] + '|' + parsedFile[0][2] + '\n'
filout.write(stringToWrite)  ## to write the first entry
for i in range(1, len(parsedFile)):
    if parsedFile[i][1] == searchString and parsedFile[i][0] == searchFirst:
        for j in range(-1, 2, 1):
            stringToWrite = parsedFile[i + j][0] + '|' + parsedFile[i + j][1] + '|' + parsedFile[i + j][2] + '\n'
            filout.write(stringToWrite)
stringToWrite = parsedFile[-1][0] + '|' + parsedFile[-1][1] + '|' + parsedFile[-1][2] + '\n'
filout.write(stringToWrite)  ## to write the last entry
filout.close()
I know this solution may be a bit long, but it is quite easy to understand and seems an intuitive way to do it. I have already checked it with the data you provided and it works perfectly.
Please tell me if you need some more explanation of the code; I will definitely add it.
The tips (from Beasley and Joran elyase) are very interesting, but they only get the contents of the 03000 line. I would like to get the contents from the 00000 line through the 99999 line.
I even managed to do it here, but I am not satisfied; I wanted to make it cleaner.
See how I did it:
file = open(url, 'r')
newFile = open("newFile.txt", 'w')
lines = file.readlines()
file.close()
i = 0
lineTemp = []
for line in lines:
    lineTemp.append(line)
    if line[0:5] == '03000':
        state = line[21:23]
    if line[0:5] == '99999':
        if state == 'TO':
            newFile.writelines(lineTemp)
        lineTemp = []
        i = i + 1
newFile.close()
Suggestions...
Thanks to all!
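For reference, a cleaner variant of the same buffering idea — a sketch that assumes every record starts with 00000 and ends with 99999, with AAAAA/ZZZZZ as the file header and footer, as in the sample (the function name is made up for illustration):
def filter_records(inpath, outpath="newFile.txt"):
    # Buffer each 00000..99999 record; keep it only if it contains a 03000|TO line.
    with open(inpath) as src, open(outpath, "w") as dst:
        record = []
        for line in src:
            if line.startswith(("AAAAA", "ZZZZZ")):
                dst.write(line)  # file header/footer pass straight through
            else:
                record.append(line)
                if line.startswith("99999"):  # record is complete
                    if any(l.startswith("03000") and "TO" in l for l in record):
                        dst.writelines(record)
                    record = []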

Combine two Python scripts

I am very new to Python, but I have been able to write a few Python scripts that are useful (at least for my work). I would like to combine two of them, but I am having a hard time making it work; I think I am completely lost on how the code should look.
The first code basically takes a file, reads it, extracts two columns from it, and then writes the columns to a new file. I repeat this with several files:
import sys
import re

filetowrite = sys.argv[1]
filetoread = sys.argv[2]
newfile = str(filetowrite) + ".txt"
openold = open(filetoread, "r")
opennew = open(newfile, "w")
rline = openold.readlines()
number = int(len(rline))
start = 0
for i in range(len(rline)):
    if "2theta" in rline[i]:
        start = i
opennew.write("q" + "\t" + "I" + "\n")
opennew.write("1/A" + "\t" + "1/cm" + "\n")
opennew.write(str(filetowrite) + "\t" + str(filetowrite) + "\n")
for line in rline[start + 1 : number]:
    words = line.split()
    word1 = (words[1])
    word2 = (words[2])
    opennew.write(word1 + "\t" + word2 + "\n")
openold.close()
opennew.close()
The second code takes the newly created files and combines them so that their columns end up next to each other in the final file.
import sys
from itertools import izip

filenames = sys.argv[2:]
filetowrite = sys.argv[1]
newfile = str(filetowrite) + ".txt"
opennew = open(newfile, "w")
files = map(open, filenames)
for lines in izip(*files):
    opennew.write(('\t'.join(i.strip() for i in lines)) + "\n")
Any help in how to proceed to make a single code out of these two codes is highly appreciated.
All the best
Make each file into a function in one larger file, then call the functions as necessary. Make use of __main__ to do that.
import sys
import re
from itertools import izip

def func_for_file1():
    # All of the code from File 1 goes here.
    pass

def func_for_file2():
    # All of the code from File 2 goes here.
    pass

def main():
    # Decide what order you want to call these methods.
    func_for_file1()
    func_for_file2()

if __name__ == '__main__':
    main()
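A slightly more concrete sketch of that skeleton, with the two scripts turned into parameterized functions. The function names and the command-line layout are made up for illustration, and it is written for Python 2 to match the izip import above (in Python 3, drop the import and use the builtin zip):
import sys
from itertools import izip  # Python 2; use the builtin zip in Python 3

def extract_columns(filetoread, label):
    # First script as a function: write columns 1 and 2 (after the "2theta"
    # line) of filetoread into "<label>.txt", and return that file name.
    newfile = label + ".txt"
    openold = open(filetoread, "r")
    opennew = open(newfile, "w")
    rline = openold.readlines()
    start = 0
    for i in range(len(rline)):
        if "2theta" in rline[i]:
            start = i
    opennew.write("q\tI\n1/A\t1/cm\n")
    opennew.write(label + "\t" + label + "\n")
    for line in rline[start + 1:]:
        words = line.split()
        opennew.write(words[1] + "\t" + words[2] + "\n")
    openold.close()
    opennew.close()
    return newfile

def combine(filetowrite, filenames):
    # Second script as a function: paste the files side by side, tab-separated.
    opennew = open(filetowrite + ".txt", "w")
    files = map(open, filenames)
    for lines in izip(*files):
        opennew.write('\t'.join(i.strip() for i in lines) + "\n")
    for f in files:
        f.close()
    opennew.close()

def main():
    # Hypothetical usage: python combined.py final input1 input2 ...
    filetowrite = sys.argv[1]
    intermediates = [extract_columns(name, name + "_cols") for name in sys.argv[2:]]
    combine(filetowrite, intermediates)

if __name__ == '__main__':
    main()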
