Formatting files to be written - python

Hi, how can I loop over a text file, identify lines whose last field is 0, and delete those lines while keeping the ones not deleted? I would also like to format the kept lines as tuples.
input.txt:
1 2 0
1 3 0
11 4 0.058529
...
...
...
97 7 0.0789
Desired output should look like this
[(11,4,{'volume': 0.058529})]
Thank you

Pass inplace=True to fileinput.input() to modify the file in place; everything printed inside the loop is written back to the file:
import fileinput

results = []
for line in fileinput.input('input.txt', inplace=True):
    data = line.split()
    if data[-1].strip() == '0':
        # while inplace=True, printed lines are written back to input.txt
        print(line.strip())
    else:
        # collect non-zero lines as (int, int, {'volume': float}) tuples
        results.append(tuple(map(int, data[:-1])) + ({'volume': float(data[-1])},))
print(results)
If the input.txt contains:
1 2 0
1 3 0
11 4 0.058529
97 7 0.0789
the code will print:
[(11, 4, {'volume': 0.058529}),
(97, 7, {'volume': 0.0789})]
And the contents of input.txt become:
1 2 0
1 3 0

Related

Change format and combine info from 2 columns

First time posting here so please be patient!
I have a file that looks like this:
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
...
I want to combine the information from the second and third column of each line in the following format: "number[A],number[C],number[G],number[T]", so that the example above would look like this:
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
...
Any idea on how I could do that would be much appreciated!
Here's a method that works:
# read all lines; the first line is the header
lines = open('test.txt', 'r').read().splitlines()
place = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
# one [A, C, G, T] count row per data line
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i, row in enumerate(lines[1:]):
    for ct in row.split()[1:]:
        a, b = ct.split(':')
        counts[i][place[a]] = int(b)
out_str = '\n'.join([lines[0]] + ['{:<4}{},{},{},{}'.format(i + 1, *ct)
                                  for i, ct in enumerate(counts)])
with open('output.txt', 'w') as f:
    f.write(out_str)
The resulting file reads
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
I assume that your file is a regular text file (not a CSV or otherwise delimited file) and that G:27 A:11 is part of a line of this text file.
For each line you can do as follows (taking the first line as an example):
remove useless spaces using strip:
'G:27 A:11'.strip() gives 'G:27 A:11', then split on whitespace to obtain ['G:27', 'A:11']. Then, for each element of this list, split on ':' to get the allele type and its count. Altogether it would look like:
resulting_table = []
for line in file:  # you can easily find how to read a file line by line
    split_line = line.strip().split()
    A, T, G, C = 0, 0, 0, 0
    for pair in split_line:
        element = pair.split(':')
        if element[0] == 'A':
            A = int(element[1])
        elif element[0] == 'T':
            T = int(element[1])
        elif element[0] == 'G':
            G = int(element[1])
        elif element[0] == 'C':
            C = int(element[1])
    resulting_table.append([A, T, G, C])
And here you go! You can then easily transform it into a DataFrame or a NumPy array.
This is absolutely not the most efficient nor the most elegant way to get your desired output, but it is clear and understandable for a Python beginner.
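For instance, a minimal sketch of the DataFrame step mentioned above, assuming pandas is installed (the column labels follow the [A, T, G, C] order used in the loop; the sample row is just the first line of the example):
import pandas as pd

# resulting_table as built by the loop above, e.g. for "1 G:27 A:11"
resulting_table = [[11, 0, 27, 0]]  # [A, T, G, C] counts
df = pd.DataFrame(resulting_table, columns=['A', 'T', 'G', 'C'])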
source.txt
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
import re

template = ('A', 'C', 'G', 'T')

def proc_line(line: str):
    # first whitespace-separated token is the position,
    # the rest are ALLELE:COUNT pairs
    index, *elements = re.findall(r'\S+', line)
    data = dict([*map(lambda x: x.split(':'), elements)])
    # alleles missing from the line default to '0'
    return f'{index}\t' + ','.join([data.get(item, '0') for item in template]) + '\n'

with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()
with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header] + list(map(proc_line, lines))
    )
output.txt
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
An awk one-liner gives the same result: -F'[ :]+' splits each line on runs of spaces and colons, so the allele letters land in $2 and $4 and their counts in $3 and $5, and %d formats a missing (empty) count as 0:
$ awk -F'[ :]+' '
NR>1 {
    delete f
    f[$2] = $3
    f[$4] = $5
    $0 = sprintf("%s %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
}
1' file
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0

How to add a string to rows having some keywords using python

I have several files (named mod0.msh, mod1.msh, and so on) and want to add a string (lower_dimensional_block) at the end of some rows of these files using Python. At the moment I am giving the line numbers and adding the string at the end of those lines, but I want to select lines by some words they contain rather than by number. These are the first lines of my files:
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1"
1 11 "W_2"
2 8 "fault2"
...
I also have a list of the line numbers at whose end I want to add the string:
adding_str = [6, 7]
This is also my code:
from fileinput import FileInput
for idx in range(2):  # it means I have two files
    with FileInput(f'mod{idx}.msh', inplace=True, backup='.bak') as in_file:
        for i, line in enumerate(in_file, start=1):
            for j in keywords:
                print(
                    line.rstrip(),
                    end=' lower_dimensional_block\n' if j in line else '\n'
                )
But I have a list of keywords and want to add the string at the end of each line that contains one of them:
keywords=['W_1', 'W_2']
I would appreciate any help doing this in Python. This is my expected output:
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1" lower_dimensional_block
1 11 "W_2" lower_dimensional_block
2 8 "fault2"
...
Is this what you expect?
import fileinput
import re

keywords = ['W_1', 'W_2']
# build one pattern that matches any of the keywords between quotes
KWDS = re.compile(fr'''\d+ \d+ "({'|'.join(keywords)})"''')
files = [f'mod{idx}.msh' for idx in range(2)]
with fileinput.input(files, inplace=True, backup='.bak') as in_file:
    for line in in_file:
        print(f'{line.rstrip()} lower_dimensional_block'
              if KWDS.match(line) else line.rstrip())
>>> %cat mod0.msh
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1" lower_dimensional_block
1 11 "W_2" lower_dimensional_block
2 8 "fault2"
>>> %cat mod0.msh.bak
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1"
1 11 "W_2"
2 8 "fault2"
>>> KWDS
re.compile(r'\d+ \d+ "(W_1|W_2)"', re.UNICODE)

extract specific portions of a text file

I have a text file as follows:
A B C D E
1 1 2 1 1e8
2 1 2 3 1e5
3 2 3 2 2000
50 2 3 2 2000
80 2 3 2 2000
...
1 2 5 6 1000
4 2 4 3 1e4
50 3 6 4 5000
120 3 5 2 2000
...
2 3 2 3 5000
3 3 4 5 1e9
4 3 2 3 1e6
7 3 2 3 43
...
I need code that goes through this text file, extracts the lines sharing the same number in the first column [A], and saves each group to a different file.
For example, for first column = 1:
1 1 2 1 1e8
1 2 5 6 1000
I wrote code with a while loop, but this file is very big, and the while loop also iterates over numbers that do not exist in the text, so it takes very long to finish.
Thanks for your help
Warning
Both of the examples below will overwrite files called input_<number>.txt in the path they are run in.
Using awk
rm input_[0-9]*.txt; awk '/^[0-9]+[ \t]+/{ print >> "input_"$1".txt" }' input.txt
The front part /^[0-9]+[ \t]+/ is a regex match that selects only lines starting with an integer; the action { print >> "input_"$1".txt" } appends those lines to a file named input_<number>.txt, one file for every number found in the first column.
Using Python
import sys
import os

fn = sys.argv[1]
name, ext = os.path.splitext(fn)
with open(fn, 'r') as f:
    d = {}  # maps first-column number -> open output file handle
    for line in f:
        ind = line.split()[0]
        try:
            ind = int(ind)
        except ValueError:
            continue  # skip lines that do not start with an integer
        try:
            d[ind].write(line)
        except KeyError:
            d[ind] = open(name + "_{}".format(ind) + ext, "w")
            d[ind].write(line)
for dd in d.values():
    dd.close()
Using Python (avoiding too many open file handles)
In this case you have to manually remove any old output files before you run the code, using rm input_[0-9]*.txt
import sys
import os

fn = sys.argv[1]
name, ext = os.path.splitext(fn)
with open(fn, 'r') as f:
    for line in f:
        ind = line.split()[0]
        try:
            ind = int(ind)
        except ValueError:
            continue  # skip lines that do not start with an integer
        # open in append mode, hence the need to remove old output files
        with open(name + "_{}".format(ind) + ext, "a") as d:
            d.write(line)
Raising the limit of the number of open file handles
If you are a sudoer on your machine, you can increase the limit of open file handles for a process by using ulimit -n <number>, as per this answer.
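Alternatively, the soft limit can be raised from inside the Python script itself with the standard resource module (Unix only; a minimal sketch, not part of the code above):
import resource

# raise the soft limit on open file descriptors up to the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))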

Read text file into list based on specific criterion

I have a text file with the following content:
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
I want to read the above text file into a list such that the program starts reading from the first row whose first column has a value greater than zero.
How do I do it in Python 3.5?
I tried genfromtxt(), but it can only skip a fixed number of lines from the top. Since I will be reading different files, I need something else.
This is one way with the csv module.
import csv
from io import StringIO
mystr = StringIO("""\
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
2 50 45 4
""")
res = []
# replace mystr with open('file.csv', 'r')
with mystr as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    next(reader)  # skip header
    for line in reader:
        row = list(map(int, filter(None, line)))  # convert to integers
        if row[0] > 0:  # apply condition
            res.append(row)
print(res)
[[1, 66, 78, 9], [2, 50, 45, 4]]
Another way, using a flag that flips on at the first row whose first column is greater than zero:
lst = []
flag = 0
with open('a.txt') as f:
    for line in f:
        try:
            if float(line.split()[0].strip('.')) > 0:
                flag = 1
            if flag == 1:
                lst += [float(i.strip('.')) for i in line.split()]
        except (ValueError, IndexError):
            pass  # skip the header and other non-numeric lines

Losing lines from two text files over iteration

I have two text files (A and B), like this:
A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...
B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...
What I have to do is read the two files, then produce a new text file like this one:
1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...
Using for loops, I created this function (in Python):
def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            ...  # write the line to the output file
        else:
            break
Then I call the function like this:
for i in range(1, 3):
    find(o, i)
    find(r, i)
What happens is that I lose some data, because the first line that contains a different number is read but does not end up in the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.
Is there any way to avoid this?
Thanks in advance.
If the files fit in memory:
with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
L.sort(key=lambda line: int(line.partition(' ')[0]))  # sort by 1st column
print("\n".join(L))  # print result
It is an efficient method if the total number of lines is under a million or so. Otherwise, and especially if you have many sorted files, you could use heapq.merge() to combine them.
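A minimal sketch of that heapq.merge() route, assuming both input files are already sorted by their integer first column as in the example (the merged.txt name is just for illustration):
import heapq

key = lambda line: int(line.partition(' ')[0])
with open('A') as f1, open('B') as f2, open('merged.txt', 'w') as out:
    # merge lazily; only one pending line per file is held in memory
    out.writelines(heapq.merge(f1, f2, key=key))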
In your loop, when the line does not start with the same value as i, you break, but you have already consumed one line, so when the function is called a second time with i+1, it starts at the second valid line.
Either read the whole files into memory beforehand (see J.F.Sebastian's answer), or, if that is not an option, replace your function with something like:
def find(arch, i):
    l = arch
    while True:
        line = l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]):  # need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)  # need to 'unread' the last read line
            break
This version 'rewinds' the cursor so that the next call to readline() reads the correct line again. Note that mixing the implicit for line in l with the seek call is discouraged, hence the while True.
Example:
$ cat t.py
o = open("t1")
r = open("t2")
print o
print r
def find(arch, i):
    l = arch
    while True:
        line = l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break

for i in range(1, 3):
    find(o, i)
    find(r, i)
$ cat t1
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$
There may be a less complicated way to accomplish this. The following also keeps the lines in the order they appear in the files, as it appears you want to do.
lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))
The first three lines read the input files into a single list of lines.
The fourth line ensures that each line has exactly one newline at the end. If you're sure both files will end in a newline, you can omit this line.
The fifth line defines the key for sorting as the integer version of the first word of the string.
The sixth line sorts the lines and writes the result to the output file.
