First time posting here so please be patient!
I have a file that looks like this:
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
...
I want to combine the information from the second and third columns of each line in the following format: "number[A],number[C],number[G],number[T]", so that the example above would look like this:
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
...
Any idea on how I could do that would be much appreciated!
Here's a method that works:
lines = open('test.txt','r').read().splitlines()
place = {'A':0,'C':1,'G':2,'T':3}  # output column index for each allele
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i,row in enumerate(lines[1:]):   # skip the header line
    for ct in row.split()[1:]:       # skip the POS column
        a,b = ct.split(':')
        counts[i][place[a]] = int(b)
out_str = '\n'.join([lines[0]] + ['{:<4}{},{},{},{}'.format(i+1,*ct)
                                  for i,ct in enumerate(counts)])
with open('output.txt','w') as f:
    f.write(out_str)
The resulting file reads
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
I assume that your file is a regular text file (not a CSV or delimited file) and that G:27 A:11 is a line of this text file.
For each line you can do as follows (we will take the first line as an example):
Remove extraneous spaces using strip.
'G:27 A:11'.strip() gives 'G:27 A:11'; then split on whitespace to obtain ['G:27','A:11']. Then, for each element of this list, split on ':' to get the allele type and its count. Altogether it would look like:
resulting_table=[]
for line in file: #You can easily find how to read a file line by line
    split_line=line.strip().split(' ')
    A,C,G,T=0,0,0,0   # A,C,G,T order, to match the requested output format
    for pair in split_line:
        element=pair.split(':')
        if element[0]=='A':
            A=element[1]
        elif element[0]=='C':
            C=element[1]
        elif element[0]=='G':
            G=element[1]
        elif element[0]=='T':
            T=element[1]
    resulting_table.append([A,C,G,T])
And there you go! You can then easily transform it into a DataFrame or a NumPy array (see the sketch below).
This is absolutely not the most efficient or elegant way to get your desired output, but it is clear and understandable for a Python beginner.
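For instance, a minimal sketch of the NumPy conversion mentioned above (an addition, assuming resulting_table was built as in the snippet):
import numpy as np

# counts were stored as strings (or the int 0 default), so convert each cell
arr = np.array([[int(x) for x in row] for row in resulting_table])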
source.txt
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
import re

template = ('A', 'C', 'G', 'T')

def proc_line(line: str):
    # split the line into the position index and the ALLELE:COUNT pairs
    index, *elements = re.findall(r'\S+', line)
    # map each allele letter to its count, e.g. {'G': '27', 'A': '11'}
    data = dict([*map(lambda x: x.split(':'), elements)])
    # emit counts in A,C,G,T order, defaulting to '0' for absent alleles
    return f'{index}\t' + ','.join([data.get(item, '0') for item in template]) + '\n'

with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()

with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header] + list(map(proc_line, lines))
    )
output.txt
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
$ awk -F'[ :]+' '
NR>1 {
    delete f                      # reset allele counts for each line
    f[$2] = $3                    # first ALLELE:COUNT pair
    f[$4] = $5                    # second ALLELE:COUNT pair
    $0 = sprintf("%s %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
}
1' file
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
Related
I have several files (named mod0.msh, mod1.msh and so on) and want to add a string (lower_dimensional_block) at the end of some rows of these files using Python. At the moment I am giving the line numbers and adding the string at the end, but I want to match on words in the lines rather than line numbers. These are the first few lines of my files:
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1"
1 11 "W_2"
2 8 "fault2"
...
I also have a list with the numbers of the lines to which I want to add the string:
adding_str = [6,7]
This is also my code:
from fileinput import FileInput

for idx in range(2):  # it means I have two files
    with FileInput(f'mod{idx}.msh', inplace=True, backup='.bak') as in_file:
        for i, line in enumerate(in_file, start=1):
            for j in keywords:
                print(
                    line.rstrip(),
                    end=' lower_dimensional_block\n' if j in line else '\n'
                )
But I have a list of keywords and want to add the string at the end of each line that contains one of these keywords:
keywords=['W_1', 'W_2']
I would appreciate any help doing this in Python. This is my expected output:
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1" lower_dimensional_block
1 11 "W_2" lower_dimensional_block
2 8 "fault2"
...
Is this what you expect?
import fileinput
import re

keywords = ['W_1', 'W_2']
KWDS = re.compile(fr'''\d+ \d+ "({'|'.join(keywords)})"''')
files = [f'mod{idx}.msh' for idx in range(2)]

with fileinput.input(files, inplace=True, backup='.bak') as in_file:
    for line in in_file:
        print(f'{line.rstrip()} lower_dimensional_block'
              if KWDS.match(line) else line.rstrip())
>>> %cat mod0.msh
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1" lower_dimensional_block
1 11 "W_2" lower_dimensional_block
2 8 "fault2"
>>> %cat mod0.msh.bak
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
13
1 10 "W_1"
1 11 "W_2"
2 8 "fault2"
>>> KWDS
re.compile(r'\d+ \d+ "(W_1|W_2)"', re.UNICODE)
While reading a file in Python, I was wondering how to get the next n lines when we encounter a line that meets my condition.
Say there is a file like this
mangoes:
1 2 3 4
5 6 7 8
8 9 0 7
7 6 8 0
apples:
1 2 3 4
8 9 0 9
Now whenever we find a line starting with mangoes, I want to read the next 4 lines.
I was able to find out how to get the next immediate line, but not the next n lines:
if (line.startswith("mangoes:")):
    print(next(ifile)) #where ifile is the input file being iterated over
Just repeat what you did:
if (line.startswith("mangoes:")):
    for i in range(n):
        print(next(ifile))
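One caveat worth noting (an aside, not from the original answer): if fewer than n lines remain, next() will raise StopIteration; a hedged variant that stops quietly at end of file, reusing ifile and n from above:
if (line.startswith("mangoes:")):
    for i in range(n):
        try:
            print(next(ifile))
        except StopIteration:  # fewer than n lines left in the file
            break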
Unless it's a huge file and you don't want to read all lines into memory at once, you could do something like this:
n = 4
with open(fn) as f:
    lines = f.readlines()
for idx, ln in enumerate(lines):
    if ln.startswith("mangoes"):
        break
mangoes = lines[idx:idx+n]
This would give you a list of n lines, including the mangoes line itself. If you advanced idx by one first, you'd skip the title too, as shown below.
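For example, the title-skipping variant (a trivial addition, reusing idx, n and lines from above):
mangoes = lines[idx + 1 : idx + 1 + n]  # start one line past the "mangoes:" title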
With the itertools.islice feature:
from itertools import islice

with open('yourfile') as ifile:
    n = 4
    for line in ifile:
        if line.startswith('mangoes:'):
            mango_lines = list(islice(ifile, n))
From your input sample the resulting mango_lines list would be:
['1 2 3 4 \n', '5 6 7 8\n', '8 9 0 7\n', '7 6 8 0\n']
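If numeric rows are needed, a follow-up step (an addition, not part of the original answer) could parse them:
rows = [[int(tok) for tok in ln.split()] for ln in mango_lines]
# rows == [[1, 2, 3, 4], [5, 6, 7, 8], [8, 9, 0, 7], [7, 6, 8, 0]]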
I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G H I J
=====================
1 3 4
5 6 7
---------------------
=====================
K L M N O
=====================
1 2 3
4 5 6
7 8 9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with a few comment lines, which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools
df = pd.read_csv("xxx.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of elements in the table header does not match the number of elements in the table rows for the second and third tables. I am struggling to get around this, since the number of whitespace characters signifying a missing value differs for each table.
My question is: is there a way to account for missing values in some of the columns, so that I can get a DataFrame as output where there are either null or NaN for missing values as appropriate?
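For visually aligned columns, pandas.read_fwf can infer fixed-width fields and emit NaN for the gaps; a minimal sketch of that approach (an aside, assuming one table's header and body lines have been isolated into a string):
import io
import pandas as pd

# hypothetical: one table's header and rows, already cut out of the log file
table_text = ('G     H     I     J\n'
              '1           3     4\n'
              '5           6     7\n')

# read_fwf infers the column boundaries from the alignment; gaps become NaN
t = pd.read_fwf(io.StringIO(table_text))
print(t)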
Building on Victor Ruiz's method, I added if conditions to handle different header sizes.
=^..^=
Description in code:
import re
import pandas as pd
import itertools
df = pd.read_csv("stack.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # get header string
        head = df.iloc[i+1].to_string()
        # get space distance in header
        space_range = 0
        for result in re.findall(r'([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip line
            line = df.iloc[i+x].to_string()[5::]
            # collect items based on element distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
A B C D E F
0 1 2 3 4 5 6
1 7 8 9 1 2 3
2 4 5 6 7 8 9
3 1 2 3 4 5 6
G H I J
0 1 NaN 3 4
1 5 NaN 6 7
K L M N O
0 1 NaN NaN 2 3
1 4 5 NaN NaN 6
2 7 8 NaN 9 None
Maybe this can help you.
Suppose we have the following line of text:
1           3     4
The problem is to identify how many spaces delimit two consecutive items versus how many indicate a missing value between them.
Let's consider that 5 spaces is a delimiter, and more than 5 is a missing value.
You can use regex to parse the items:
from math import nan  # nan must be imported; numpy's nan would work equally
from re import finditer

line = '1           3     4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation arises if two or more consecutive missing values can appear (the code above will only inject a single nan); see the sketch below.
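A hedged sketch of one way to extend it, assuming each column occupies a fixed width of roughly 6 characters, so every additional full width in a gap signals one more missing value:
from math import nan
from re import finditer

line = '1                 4'  # hypothetical row with two consecutive missing values
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    # assumed column width of 6: each extra full width is one missing value
    items.extend([nan] * (len(delimiter) // 6))
print(items)  # ['1', nan, nan, '4'] under the width assumption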
I have a BIG data text file, for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
I want to extract the data between empty lines and write it to new files. It is hard to know how many empty lines are in the file, which means you also don't know how many new files you will be writing; that seems to make writing the new files very hard. Can anyone guide me? Thank you. I hope my question is clear.
Unless your file is very large, split it all into individual sections using re, splitting on 2 or more whitespace characters:
import re
with open("in.txt") as f:
    lines = re.split(r"\s{2,}", f.read())
print lines
['#01textline1\n1 2 3 4 5 6\n2 3 5 6 7 3\n3 5 6 7 6 4\n4 6 7 8 9 9', '1 2 3 6 4 7\n3 5 7 7 8 4\n4 6 6 7 8 5', '3 4 5 6 7 8\n4 6 7 8 8 9']
Just iterate over lines and write a new file on each iteration, as sketched below.
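A minimal sketch of that writing loop (an addition, assuming the lines list from the snippet above; the out%d.txt names are hypothetical):
# write each whitespace-separated section to its own numbered file
for i, section in enumerate(lines):
    with open("out%d.txt" % i, "w") as out:
        out.write(section + "\n")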
Reading files is not data-mining. Please choose more appropriate tags...
Splitting a file on empty lines is trivial:
num = 0
out = open("file-0", "w")
for line in open("file"):
    if line == "\n":
        num = num + 1
        out.close()
        out = open("file-" + str(num), "w")  # str() needed to build the name
        continue
    out.write(line)
out.close()
As this approach is reading just one line at a time, file size does not matter. It should process data as fast as your disk can handle it, with near-constant memory usage.
Perl would have had a neat trick, because you can set the input record separator to two newlines via $/="\n\n"; and then process the data one record at a time as usual... I could not find something similar in Python, but the hack of splitting on empty lines is not bad either (see the sketch below).
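For what it's worth, a sketch of one way to emulate Perl's paragraph mode lazily in Python, using itertools.groupby to yield one blank-line-separated record at a time (an aside, not from the original answer):
from itertools import groupby

def records(f):
    # group consecutive lines by whether they are blank; keep the non-blank runs
    for is_data, chunk in groupby(f, key=lambda ln: ln.strip() != ""):
        if is_data:
            yield "".join(chunk)

with open("file") as f:
    for num, rec in enumerate(records(f)):
        with open("file-%d" % num, "w") as out:
            out.write(rec)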
Here is a start:
with open('in_file') as input_file:
    processing = False
    i = 0
    for line in input_file:
        if line.strip() and not processing:
            out_file = open('output - {}'.format(i), 'w')
            out_file.write(line)
            processing = True
            i += 1
        elif line.strip():
            out_file.write(line)
        else:
            if processing:       # guard against blank lines before any section
                out_file.close()
            processing = False
This code keeps track of whether a file is currently being written to with the processing flag, and resets the flag when it sees a blank line. A new output file is created upon seeing the first non-blank line after a gap.
Hope it helps.
I have 2 files of the following form:
file1:
work1
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
file2:
work2
2 3 4 5 5
2 4 7 8 9
work1
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
work3
1 7 8 9 10
Now I want to compare the two files and, wherever the header (e.g. work1) is equal, compare the subsequent sections and print the line at which a difference is found. E.g.
work1 (file1)
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
work1 (file2)
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
Now I want to print the line where the difference occurs, i.e. "1 2 4 4 5".
For doing so I have written the following code:
with open("file1",) as r, open("file2") as w:
for line in r:
if "work1" in line:
for line1 in w:
if "work1" in line1:
print "work1"
However, from here on I am confused as to how I can read both files in parallel. Can someone please help me with this? After matching the "work1" headers, how should I read the two files in parallel?
You would probably want to try out the itertools module in Python.
It contains a function called izip that can do what you need, along with a function called islice. You can iterate through the second file until you hit the header you were looking for, and slice from that point on.
Here's a bit of the code.
from itertools import *

w = open('file2')
for (i, line) in enumerate(w):
    if "work1" in line:
        iter2 = islice(open('file2'), i, None, 1)  # starts at the correct line
        f = open('file1')
        for (line1, line2) in izip(f, iter2):
            print line1, line2  # place your comparisons of the two lines here
You're now guaranteed that on the first run through the loop you'll get "work1" on both lines. After that you can compare. Since f is shorter than w, the iterator will exhaust itself and stop once you hit the end of f.
Hopefully I explained that well.
EDIT: Added import statement.
EDIT: We need to reopen file2. This is because iterating through iterables in Python consumes the iterable. So, we need to pass a brand new one to islice so it works!
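As an aside, one way to avoid reopening file2 is to rewind it with seek(0) once the header index is known; a sketch restructuring the loop above, under the same assumptions:
from itertools import islice, izip

w = open('file2')
for (i, line) in enumerate(w):
    if "work1" in line:
        break                   # remember i; stop consuming w here

w.seek(0)                       # rewind file2 rather than opening it again
iter2 = islice(w, i, None)      # skip ahead to the matching header line
f = open('file1')
for (line1, line2) in izip(f, iter2):
    print line1, line2          # place your comparisons here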
with open('f1.csv') as f1, open('f2.csv') as f2:
    i = 0
    break_needed = False
    while True:
        r1, r2 = f1.readline(), f2.readline()
        if len(r1) == 0:
            print "eof found for f1"
            break_needed = True
        if len(r2) == 0:
            print "eof found for f2"
            break_needed = True
        if break_needed:
            break
        i += 1
        if r1 != r2:
            print " line %i" % i
            print "file 1 : " + r1
            print "file 2 : " + r2