I want to go through each line of a CSV file and check whether the first field of line 1 is the same as the first field of the next line, and so on. If it finds a match, I would like to ignore both lines that contain the same field and keep only the lines where there is no match.
Here is an example dataset (no_dup.txt):
Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Basically I want to exclude lines 1 and 2, since they contain the same field (ENSGMOG00000015632), and keep lines 3 and 4.
Here is the code I have tried but couldn't finish:
prev = None
with open("no_dup.txt", 'r') as fh_in:
    for line in fh_in:
        line = line.strip()
        if line.startswith("E"):
            line1 = line.split()
            print "initial gene =", line1[0]
            if prev is not None or prev != line1[0]:
                prev = line1[0]
I think a clean way of doing this would be to make a map of each entry -> list of lines.
entries = {}
with open('no_dup.txt', 'r') as fh_in:
    for line in fh_in:
        entry = line.split()[0]
        if entry in entries:
            entries[entry].append(line)
        else:
            entries[entry] = [line]

for matches in entries.itervalues():
    if len(matches) == 1:
        print matches[0]
You should note that this will NOT preserve the order of entries.
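If order matters, one possible variation (a sketch, not part of the answer above) is to build the same map with collections.OrderedDict, which remembers insertion order:

from collections import OrderedDict

entries = OrderedDict()
with open('no_dup.txt', 'r') as fh_in:
    for line in fh_in:
        entry = line.split()[0]
        # setdefault creates the list the first time a key is seen
        entries.setdefault(entry, []).append(line)

for matches in entries.itervalues():
    if len(matches) == 1:
        print matches[0]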
Your start looks good:
def filter_dups(iterable):
    prev = None
    for line in iterable:
        if line.startswith("E"):
            if prev is not None and prev.split(None, 1)[0] == line.split(None, 1)[0]:
                prev = None  # drop both lines of the duplicate pair
            else:
                if prev is not None:
                    yield prev
                prev = line
        else:
            yield line  # pass non-data lines (e.g. the header) straight through
            prev = None
    if prev is not None:
        yield prev
with open("no_dup.txt", 'r') as fh_in:
with open("no_dup_out.txt", 'r') as fh_out:
fh_out.writelines(filter_dups(fh_in))
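With the sample no_dup.txt above, no_dup_out.txt should then contain the header (which does not start with "E" and so passes straight through) plus the two unique lines:

Ac_Gene_ID M_Gene_ID
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636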
You can use this:
with open('a.txt', 'r') as inputFile:
    lines = inputFile.readlines()

prev = lines[0]
dup = False  # True while prev is one half of a duplicate pair
for i in range(1, len(lines)):
    cur = lines[i]
    if prev.split()[0] == cur.split()[0]:
        dup = True   # prev and cur share a first field: drop both
    else:
        if not dup:
            print prev.strip()
        dup = False
    prev = cur
if not dup:
    print lines[-1].strip()
Input:
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Output:
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
I have a file, file = '/user/home/file.txt', which looks like the following:
[SKY]
/user/home/repo/study
[EARTH]
/user/home/learn/objects
[LOCAL]
/user/home/teach/files
[SAMP]
VKP
RNP
SAS
[TYPE]
HGH
[SAMP_ID]
VKP_TP_MA
RNP_TP_NA
SAS_SAS
[ENV]
....
Now I need to transfer the items under [SAMP] and [SAMP_ID] into lists.
This is what I am doing, and it gives what I need, but any better or more elegant solution would be great.
My lists are samp and samp_id, and here is the solution I am currently using:
samp = []
samp_id = []
sampSection = False
samp_idection = False

for line in open(file, 'r'):
    if len(line.strip()) == 0:
        sampSection = False
        continue
    if line.strip() == '[SAMP]':
        sampSection = True
        continue
    elif line.startswith('['):
        sampSection = False
        continue
    if sampSection:
        samp.append(line.strip())
        continue

for line in open(file, 'r'):
    if len(line.strip()) == 0:
        samp_idection = False
        continue
    if line.strip() == '[SAMP_ID]':
        samp_idection = True
        continue
    elif line.startswith('['):
        samp_idection = False
        continue
    if samp_idection:
        samp_id.append(line.strip())
        continue
And samp and samp_id look as follows:
samp =['VKP','RNP', 'SAS']
samp_id=['VKP_TP_MA','RNP_TP_NA', 'SAS_SAS']
It would be great if there is a simpler solution in this case.
I would parse the entire file with a dict, without opening and iterating the file twice:
result = {}
current = None
with open("my_file.txt") as fd:  # to close the file automatically
    for line in fd:
        line = line.strip()
        if line.startswith('['):
            current = line.strip('[]')
            result[current] = []
            continue
        if current is None: continue
        if line: result[current].append(line)

# Or just use the dictionary
samp = result['SAMP']
samp_id = result['SAMP_ID']
If you really do not want to keep any other tag:
fields = set(('SAMP', 'SAMP_ID'))
result = {}
current = None
with open("my_file.txt") as fd:
    for line in fd:
        line = line.strip()
        if line.startswith('['):
            current = line.strip('[]')
            if current not in fields: current = None
            else: result[current] = []
            continue
        if current is None: continue
        if line: result[current].append(line)
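With the sample file shown in the question, result would then hold only the two requested sections:

print result['SAMP']     # ['VKP', 'RNP', 'SAS']
print result['SAMP_ID']  # ['VKP_TP_MA', 'RNP_TP_NA', 'SAS_SAS']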
>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
This is the text file which I am trying to read.
I want to read each gene into a separate string and then add it to a list.
There are header lines starting with a '>' character that mark where one gene ends and the next begins.
with open('sequences1.txt') as input_data:
    for line in input_data:
        while line != ">":
            list.append(line)
print(list)
When printed, the list should be:
list = ["ATGATGATGGCG", "GGCATATCCGGATACC", "TAGCTAGCCCGC"]
with open('sequences1.txt') as input_data:
    sequences = []
    gene = []
    for line in input_data:
        if line.startswith('>gene'):
            if gene:
                sequences.append(''.join(gene))
            gene = []
        else:
            gene.append(line.strip())
    sequences.append(''.join(gene))  # append last gene
    print(sequences)
output:
['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
You have multiple mistakes in your code, look here:
with open('sequences1.txt', 'r') as file:
    list = []
    for line in file.read().split('\n'):
        if not line.startswith(">") and len(line) > 0:
            list.append(line)
    print(list)
Try this:
$ cat genes.txt
>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
$ python
>>> genes = []
>>> with open('genes.txt') as file_:
... for line in file_:
... if not line.startswith('>'):
... genes.append(line.strip())
...
>>> print(genes)
['ATGATGATGGCG', 'GGCATATC', 'CGGATACC', 'TAGCTAGCCCGC']
sequences1.txt:
>gene1
ATGATGATGGCG
>gene2
GGCATATC
CGGATACC
>gene3
TAGCTAGCCCGC
and then:
desired_text = []
with open('sequences1.txt') as input_data:
    content = input_data.readlines()
content = [l.strip() for l in content if l.strip()]
for line in content:
    if not line.startswith('>'):
        desired_text.append(line)
print(desired_text)
OUTPUT:
['ATGATGATGGCG', 'GGCATATC', 'CGGATACC', 'TAGCTAGCCCGC']
EDIT:
I speed-read the question at first; here is a fixed version that produces the desired output:
with open('sequences1.txt') as input_data:
    content = input_data.readlines()

# you may also want to remove empty lines
content = [l.strip() for l in content if l.strip()]

# flag
nextLine = False
# list to save the lines
textList = []

concatenated = ''
for line in content:
    find_TC = line.find('gene')
    if find_TC > 0:
        nextLine = not nextLine
    else:
        if nextLine:
            textList.append(line)
        else:
            if find_TC < 0:
                if concatenated != '':
                    concatenated = concatenated + line
                    textList.append(concatenated)
                else:
                    concatenated = line
print(textList)
OUTPUT:
['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
I have many lines like the following:
>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG
>ENSG00000003137|ENST00000412253|CYP26B1|72370133;72362405|72370213;72362548|4025
AGCCTTTTTCTTCGACGATTTCCG
In this example, ENSG00000003137 is the name and 4732 (the last field) is the length. As you see, some names are repeated but with different lengths.
I want to make a new file in which I only keep, for each name, the record with the longest length, meaning the result would be like this:
>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG
I have written this code to split the header lines, but I don't know how to produce the file I want:
file = open("file.txt", "r")
for line in file:
    if line.startswith(">"):
        line = line.split("|")
You'll need to read the file twice; the first time round, track the largest size per entry:
largest = {}
with open(inputfile) as f:
    for line in f:
        if line.startswith('>'):
            parts = line.split('|')
            name, length = parts[0][1:], int(parts[-1])
            largest[name] = max(length, largest.get(name, -1))
then write out the copy in a second pass, but only those sections whose name and length match the extracted largest length from the first pass:
with open(inputfile) as f, open(outputfile, 'w') as out:
    copying = False
    for line in f:
        if line.startswith('>'):
            parts = line.split('|')
            name, length = parts[0][1:], int(parts[-1])
            copying = largest[name] == length
        if copying:
            out.write(line)
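For the sample records shown in the question, the first pass produces

largest = {'ENSG00000003137': 4732, 'ENSG00000001630': 3210}

so the second pass copies exactly the two records shown in the desired output.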
You have to do two kinds of handling in the loop: one that compares the lengths, and one that stores the sequence when it's needed. I wrote an example for you that reads those into dicts:
file = open("file.txt", "r")
myDict = {}
myValueDict = {}
action = 'remember'
geneDict = {}
for line in file:
if line.startswith(">"):
line = line.rstrip().split("|")
line_name = line[0]
line_number = int(line[-1])
if line_name in myValueDict:
if myValueDict[line_name] < line_number:
action = 'remember'
myValueDict[line_name] = line_number
myDict[line_name] = line
else:
action = 'forget'
else:
myDict[line_name] = line
myValueDict[line_name] = line_number
else:
if action == 'remember':
geneDict[line_name] = line.rstrip()
for key in myDict:
print(myDict[key])
for key in geneDict:
print(geneDict[key])
This ignores the lower-length items. You can now store those dicts any way you want.
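To produce the new file the question asks for, you could then write the two dicts back out, re-joining the header fields that were split on '|' (a sketch; the output filename is just an example):

with open("longest.txt", "w") as out:
    for name in myDict:
        out.write("|".join(myDict[name]) + "\n")  # rebuild the full header line
        out.write(geneDict.get(name, "") + "\n")  # its longest sequence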
I would like to format the values of a dictionary in Python. Here is the script that I have used to generate the output:
entries = {}
entries1 = {}

with open('no_dup.txt', 'r') as fh_in:
    for line in fh_in:
        if line.startswith('E'):
            line = line.strip()
            line = line.split()
            entry = line[0]
            if entry in entries:
                entries[entry].append(line)
            else:
                entries[entry] = [line]

with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            fh_out.write("{} \n".format(val))

with open('no_dup_out.txt', 'r') as fh_in2:
    for line in fh_in2:
        line = line.strip()
        line = line.split()
        entry = line[1]
        if entry in entries1:
            entries1[entry].append(line)
        else:
            entries1[entry] = [line]

with open('no_dup_out_final.txt', 'w') as fh_out2:
    for kee, val in entries1.iteritems():
        if len(val) == 1:
            fh_out2.write("{} \n".format(val))
For example, running the above script generates the following output:
[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]]
[["[['ENSGMOG00000003752',", "'ENSORLG00000005385']]"]]
[["[['ENSGMOG00000003760',", "'ENSORLG00000005379']]"]]
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]
[["[['ENSGMOG00000003761',", "'ENSORLG00000005382']]"]]
And I would like to format it in such a way that all the brackets, quotes and commas are removed (leaving e.g. ENSGMOG00000003747 ENSORLG00000006947) and the rest is output as-is, tab-delimited. How can I do that?
If your list of lists is full_list, then you could have the following code give your desired output:
desired_list = ['\t'.join([element.split('\'')[1] for element in list_item[0]]) for list_item in full_list]
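That said, the nesting arises because the script writes the string form of a whole list rather than its fields. A simpler upstream fix (a sketch against the question's own script) is to join the split fields with a tab before writing, so no post-processing is needed:

with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            # val[0] is the list of fields produced by line.split()
            fh_out.write("\t".join(val[0]) + "\n")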
I am doing text processing using the readline() function, as follows:

ifd = open(...)
for line in ifd:
    while (condition):
        do something...
        line = ifd.readline()
        condition = ....
        # Here, when the condition becomes false, I need to rewind the pointer
        # so that the 'for' loop reads the same line again.

ifd.seek() followed by readline() is giving me a '\n' character. How do I rewind the pointer so that the whole line is read again?
>>> ifd.seek(-1,1)
>>> line = ifd.readline()
>>> line
'\n'
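seek(-1, 1) only steps back one byte, and that byte is the '\n' that terminated the previous line, which is why readline() returns '\n'. To re-read a whole line, save its start offset with tell() before reading it. Note that tell()/seek() cannot reliably be mixed with "for line in ifd" iteration, which reads ahead into a buffer, so use a readline()-style loop instead (a minimal sketch; needs_rereading and process are hypothetical placeholders):

ifd = open("data.txt", "r")    # hypothetical file name
while True:
    pos = ifd.tell()           # remember where this line starts
    line = ifd.readline()
    if not line:               # end of file
        break
    if needs_rereading(line):  # hypothetical condition
        ifd.seek(pos)          # rewind to the start of the same line
        line = ifd.readline()  # ...and read it again
    process(line)              # hypothetical handler
ifd.close()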
Here is my code:
labtestnames = sorted(tmp)

# Now read each line in the inFile and write into outFile
ifd = open(inFile, "r")
ofd = open(outFile, "w")

# Read the header
header = ifd.readline()  # Do nothing with this line. Skip

# Write header into the output file
nl = "mrn\tspecimen_id\tlab_number\tlogin_dt\tfluid"
offset = len(nl.split("\t"))
nl = nl + "\t" + "\t".join(labtestnames)
ofd.write(nl + "\n")
lenFields = len(nl.split("\t"))
print "Reading the input file and converting into modified file for further processing (correlation analysis etc..)"
prevTup = (0, 0, 0)
rowComplete = 0
k = 0
for line in ifd:
    k = k + 1
    if (k == 200): break
    items = line.rstrip("\n").split("\t")
    if (items[0] == ''):
        continue
    newline = list('' for i in range(lenFields))
    newline[0], newline[1], newline[3], newline[2], newline[4] = items[0], items[1], items[3], items[2], items[4]
    ltests = []
    ltvals = []
    while (cmp(prevTup, (items[0], items[1], items[3])) == 0):  # If the same mrn, lab_number and specimen_id then fill the same row, else create a new row.
        ltests.append(items[6])
        ltvals.append(items[7])
        pos = ifd.tell()
        line = ifd.readline()
        prevTup = (items[0], items[1], items[3])
        items = line.rstrip("\n").split("\t")
        rowComplete = 1
    if (rowComplete == 1):  # If the row is completed, prepare newline and write into outfile
        indices = [labtestnames.index(x) for x in ltests]
        j = 0
        ifd.seek(pos)
        for i in indices:
            newline[i + offset] = ltvals[j]
            j = j + 1
    if (rowComplete == 0):
        currTup = (items[0], items[1], items[3])
        ltests = items[6]
        ltvals = items[7]
        pos = ifd.tell()
        line = ifd.readline()
        items = line.rstrip("\n").split("\t")
        newTup = (items[0], items[1], items[3])
        if (cmp(currTup, newTup) == 0):
            prevTup = currTup
            ifd.seek(pos)
            continue
        else:
            indices = labtestnames.index(ltests)
            newline[indices + offset] = ltvals
            ofd.write(newline + "\n")
The problem can be handled more simply using itertools.groupby. groupby can cluster all the contiguous lines that deal with the same mrn, specimen_id, and lab_num.
The code that does this is
for key, group in IT.groupby(reader, key = mykey):
where reader iterates over the lines of the input file, and mykey is defined by
def mykey(row):
    return (row['mrn'], row['specimen_id'], row['lab_num'])
Each row from reader is passed to mykey, and all rows with the same key are clustered together in the same group.
While we're at it, we might as well use the csv module to read each line into a dict (which I call row). This frees us from having to deal with low-level string manipulation like line.rstrip("\n").split("\t") and instead of referring to columns by index numbers (e.g. row[3]) we can write code that speaks in higher-level terms such as row['lab_num'].
import itertools as IT
import csv

inFile = 'curious.dat'
outFile = 'curious.out'

def mykey(row):
    return (row['mrn'], row['specimen_id'], row['lab_num'])

fieldnames = 'mrn specimen_id date lab_num Bilirubin Lipase Calcium Magnesium Phosphate'.split()

with open(inFile, 'rb') as ifd:
    reader = csv.DictReader(ifd, delimiter='\t')
    with open(outFile, 'wb') as ofd:
        writer = csv.DictWriter(ofd, fieldnames, delimiter='\t', lineterminator='\n')
        writer.writeheader()
        for key, group in IT.groupby(reader, key=mykey):
            new = {}
            row = next(group)
            for field in ('mrn', 'specimen_id', 'date', 'lab_num'):
                new[field] = row[field]
            new[row['labtest']] = row['result_val']
            for row in group:
                new[row['labtest']] = row['result_val']
            writer.writerow(new)
yields
mrn specimen_id date lab_num Bilirubin Lipase Calcium Magnesium Phosphate
4419529 1614487 26.2675 5802791G 0.1
3319529 1614487 26.2675 5802791G 0.3 153 8.1 2.1 4
5713871 682571 56.0779 9732266E 4.1
This seems to be a perfect use case for yield expressions. Consider the following example that prints lines from a file, repeating some of them at random:
def buflines(fp):
    r = None
    while True:
        r = yield r or next(fp)
        if r:
            yield None
from random import randint

with open('filename') as fp:
    buf = buflines(fp)
    for line in buf:
        print line
        if randint(1, 100) > 80:
            print 'ONCE AGAIN::'
            buf.send(line)
Basically, if you want to process an item once again, you send it back to the generator. On the next iteration you will be reading the same item once again.
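If the coroutine feels too clever, an alternative (a sketch, not from the answer above) is an explicit pushback wrapper around the file iterator:

class PushbackIterator(object):
    # Wrap an iterator so the caller can push items back for re-reading.
    def __init__(self, it):
        self.it = iter(it)
        self.buf = []
    def __iter__(self):
        return self
    def next(self):  # Python 2 iterator protocol
        if self.buf:
            return self.buf.pop()
        return next(self.it)
    def push(self, item):
        self.buf.append(item)

with open('filename') as fp:
    lines = PushbackIterator(fp)
    for line in lines:
        print line
        # to see a line again on the next iteration (guard it with a
        # condition, or the loop will never advance):
        # lines.push(line)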