I have a text file as follows:
A B C D E
1 1 2 1 1e8
2 1 2 3 1e5
3 2 3 2 2000
50 2 3 2 2000
80 2 3 2 2000
...
1 2 5 6 1000
4 2 4 3 1e4
50 3 6 4 5000
120 3 5 2 2000
...
2 3 2 3 5000
3 3 4 5 1e9
4 3 2 3 1e6
7 3 2 3 43
...
I need code that goes through this text file, extracts the lines sharing the same number in the first column [A], and saves each group to a different file.
For example, for first column = 1 the output file would contain:
1 1 2 1 1e8
1 2 5 6 1000
I wrote code with a while loop, but this file is very big, and the while loop also iterates over numbers that do not exist in the text, so it takes very long to finish.
Thanks for your help
Warning
Both of the examples below will overwrite files called input_<number>.txt in the path they are run in.
Using awk
rm input_[0-9]*.txt; awk '/^[0-9]+[ \t]+/{ print >> "input_"$1".txt" }' input.txt
The first part /^[0-9]+[ \t]+/ is a regex match that selects only lines starting with an integer; the second part { print >> "input_"$1".txt" } appends each selected line to a file named input_<number>.txt, one file per distinct number found in the first column.
Using Python
import sys
import os

fn = sys.argv[1]
name, ext = os.path.splitext(fn)

with open(fn, 'r') as f:
    d = {}  # maps first-column value -> open output file handle
    for line in f:
        ind = line.split()[0]
        try:
            ind = int(ind)
        except ValueError:
            continue  # skip lines that do not start with an integer
        try:
            d[ind].write(line)
        except KeyError:
            d[ind] = open(name + "_{}".format(ind) + ext, "w")
            d[ind].write(line)

for dd in d.values():
    dd.close()
Using Python (avoiding too many open file handles)
In this case you have to manually remove any old output files before you run the code, using rm input_[0-9]*.txt, because the output files are opened in append mode.
import sys
import os

fn = sys.argv[1]
name, ext = os.path.splitext(fn)

with open(fn, 'r') as f:
    for line in f:
        ind = line.split()[0]
        try:
            ind = int(ind)
        except ValueError:
            continue  # skip lines that do not start with an integer
        # Re-open the output file in append mode for every line, so at
        # most two file handles are open at any time.
        with open(name + "_{}".format(ind) + ext, "a") as d:
            d.write(line)
Raising the limit of the number of open file handles
If you have sudo rights on your machine, you can increase the limit of open file handles for a process with ulimit -n <number>, as per this answer.
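Alternatively, the limit can often be raised from inside the script itself. This is a POSIX-only sketch using the standard resource module; the target of 4096 is an arbitrary example value:

```python
import resource

# Current soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root.
target = 4096
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

This only helps the first Python version above (the one that keeps many handles open); the append-mode version never needs it.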
Related
I have two txt files.
The first one contains a number on each line, like this:
22
15
32
53
.
.
and the other file contains 20 numbers on each line, like this:
0.1 2.3 4.5 .... 5.4
3.2 77.4 2.1 .... 8.1
....
.
.
According to the numbers given in the first txt file I want to split up the other file. For example, the first line of the first file is 22, which means I take the 20 columns of the second file's first line plus the first 2 columns of its second line, and remove the other columns of that second line. Then I look at the second line of the first file (it is 15), so I take 15 columns from the third line of the other file and remove the remaining columns of that line, and so on. How can I do this?
with open('numbers.txt', 'r') as f:
    with open('contiuousNumbers.txt', 'r') as f2:
        with open('results.txt', 'w') as fOut:
            for line in f:
                ...
Thanks.
For each line of the first file, treat its number as a running total of how many values still need to be read. Then a while loop can keep calling next on the second file object: each iteration reads a line of numbers, extends the output, and subtracts the line's length from the total, until the total reaches 0. Slicing with the smaller of the remaining total and the line's length outputs exactly the requested number of values:
for line in f:
    output = []
    total = int(line)
    while total > 0:
        try:
            items = next(f2).split()
            output.extend(items[:min(total, len(items))])
            total -= len(items)
        except StopIteration:
            break
    fOut.write(' '.join(output) + '\n')
so that given the first file with:
3
6
1
5
and the second file with:
2 5
3 7
2 1
3 6
7 3
2 2
9 1
3 4
8 7
1 2
3 8
the output file will have:
2 5 3
2 1 3 6 7 3
2
9 1 3 4 8
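Put together as a self-contained sketch (with io.StringIO objects standing in for the two files), the loop reproduces exactly that output:

```python
import io

# Stand-ins for numbers.txt and the file of rows; with real files you
# would use open(...) handles instead.
first = io.StringIO("3\n6\n1\n5\n")
second = io.StringIO(
    "2 5\n3 7\n2 1\n3 6\n7 3\n2 2\n9 1\n3 4\n8 7\n1 2\n3 8\n"
)
out = io.StringIO()

for line in first:
    output = []
    total = int(line)
    while total > 0:
        try:
            items = next(second).split()
            output.extend(items[:min(total, len(items))])
            total -= len(items)
        except StopIteration:
            break  # second file exhausted
    out.write(' '.join(output) + '\n')

print(out.getvalue())
# 2 5 3
# 2 1 3 6 7 3
# 2
# 9 1 3 4 8
```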
I have a text file as follows:
1 1 2 1 1e8
2 1 2 3 1e5
3 2 3 2 2000
4 2 5 6 1000
5 2 4 3 1e4
6 3 6 4 5000
7 3 5 2 2000
8 3 2 3 5000
9 3 4 5 1e9
10 3 2 3 1e6
My question is: how can I change one column (for example, divide all the data in the last column by 900) and save the whole data in a new file?
Thanks in advance
You can use the numpy library. The code below should do it.
import numpy as np
# assume the data is in.txt file
dat = np.loadtxt('in.txt')
# -1 means the last column
dat[:, -1] = dat[:, -1]/900
# write the result to the out.txt file
# fmt is the format while writing to the file
# %.2f means that it will save it to the 2 digits precision
np.savetxt('out.txt', dat, fmt='%.2f')
Open the output file once, then for each line of your file write the converted value ('filename' and 'yourfile' are placeholders):
output = open('filename', 'w')
for line in open('yourfile'):
    vals = line.split()
    my_val = float(vals[4]) / 900
    output.write(str(my_val) + '\n')
output.close()
Using a simple iteration.
res = []
with open(filename, "r") as infile:
    for line in infile:                        # Iterate over each line
        val = line.split()                     # Split line by whitespace
        res.append(str(float(val[-1]) / 900))  # Negative index gets the last element; divide it
with open(filename, "w") as outfile:           # Open file to write
    for line in res:
        outfile.write(line + "\n")             # Write data
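Note that the loop above writes out only the transformed column. To keep the whole row and rewrite just the last column, as the question asks, a variant might look like this (the sample rows and io.StringIO stand-in are only for illustration; with real files you would use open(...)):

```python
import io

# Sample rows in the question's format.
src = io.StringIO("1 1 2 1 1e8\n2 1 2 3 1e5\n3 2 3 2 2000\n")

rows = []
for line in src:
    vals = line.split()
    # Replace only the last column with the value divided by 900,
    # keeping the other columns untouched.
    vals[-1] = "{:.2f}".format(float(vals[-1]) / 900)
    rows.append(" ".join(vals))

print("\n".join(rows))
# 1 1 2 1 111111.11
# 2 1 2 3 111.11
# 3 2 3 2 2.22
```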
You can try the following:
import numpy as np

# Suppose your text data name is "CH3CH2OH_g.txt"
file = np.loadtxt('CH3CH2OH_g.txt')
file[:, -1] = file[:, -1] / 900  # overwrite the last column with the result
np.savetxt('LogCH3CH2OH.txt', file, fmt='%.2f')
I have two text files (A and B), like this:
A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...
B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...
What I have to do is read the two files and then produce a new text file like this one:
1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...
Using for loops, I created this function (in Python):
def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            ...  # write to the text file
        else:
            break
Then I call the function like this:
for i in range(1, 3):
    find(o, i)
    find(r, i)
What happens is that I lose some data: the first line that contains a different number is read, but it does not end up in the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.
Is there any way to avoid this?
Thanks in advance.
If the files fit in memory:
with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
    L.sort(key=lambda line: int(line.partition(' ')[0]))  # sort by 1st column
    print("\n".join(L))  # print result
This is an efficient method if the total number of lines is under a million or so. Otherwise, and especially if you have many individually sorted files, you could use heapq.merge() to combine them.
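For reference, a minimal sketch of the heapq.merge() approach, with small lists standing in for pre-sorted files:

```python
import heapq

# Two inputs, each already sorted by its integer first column
# (any iterable of lines works, including open file handles).
a = ["1 x 5", "1 x 3", "2 x 4"]
b = ["1 y 4", "2 y 1", "2 y 2"]

key = lambda line: int(line.partition(' ')[0])

# heapq.merge lazily merges any number of sorted iterables without
# loading them all into memory; ties keep the order of the inputs.
merged = list(heapq.merge(a, b, key=key))
print("\n".join(merged))
```

The key= parameter of heapq.merge requires Python 3.5 or newer.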
In your loop, when the line does not start with the same value as i you break, but you have already consumed that line; so when the function is called a second time with i+1, it starts at the second valid line.
Either read the whole files into memory beforehand (see @J.F.Sebastian's answer), or, if that is not an option, replace your function with something like:
def find(arch, i):
    l = arch
    while True:
        line = l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]):  # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)  # Need to 'unread' the last read line
            break
This version 'rewinds' the cursor so that the next call to readline reads the correct line again. Note that mixing the implicit for line in l with the seek call is discouraged, hence the while True.
Example:
$ cat t.py
o = open("t1")
r = open("t2")
print o
print r
def find(arch, i):
    l = arch
    while True:
        line = l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break
for i in range(1, 3):
    find(o, i)
    find(r, i)
$ cat t1
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$
There may be a less complicated way to accomplish this. The following also keeps the lines in the order in which they appear in the files, as it appears you want to do.
lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))
The first three lines read the input files into a single array of lines.
The fourth line ensures that each line has exactly one newline at the end. If you're sure both files will end in a newline, you can omit this line.
The fifth line defines the key for sorting as the integer version of the first word of the string.
The sixth line sorts the lines and writes the result to the output file.
I'm trying to read 3 text files and combine them into a single output file. So far so good; the only problem is that I need to create a column for every file I read. Right now all the extracted data from the files ends up in a single column.
#!/usr/bin/env python
import sys

usage = 'Usage: %s infile' % sys.argv[0]

i = 3  # start position
outfile = open('outfil.txt', 'w')
while i < len(sys.argv):
    try:
        infilename = sys.argv[i]
        ifile = open(infilename, 'r')
        outfile.write(infilename + '\n')
        for line in ifile:
            outfile.write(line)
            print line
    except:
        print usage
        sys.exit(1)
    i += 1
right now my output file looks like this:
test1.txt
a
b
c
d
test2.txt
e
f
g
h
test3.txt
i
j
k
l
Open the input files one after another and collect the data into a list of lists. Then zip() the data and write it via a csv writer with a space as the delimiter:
#!/usr/bin/env python
import csv
import sys

usage = 'Usage: %s infile' % sys.argv[0]

data = []
for filename in sys.argv[1:]:
    with open(filename) as f:
        data.append([line.strip() for line in f])

data = zip(*data)
with open('outfil.txt', 'w') as f:
    writer = csv.writer(f, delimiter=" ")
    writer.writerows(data)
Assuming you have:
1.txt with the following contents:
1
2
3
4
5
2.txt with the following contents:
6
7
8
9
10
Then, if you save the code to test.py and run it as python test.py 1.txt 2.txt, in outfil.txt you will get:
1 6
2 7
3 8
4 9
5 10
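One caveat: zip() (izip in Python 2) stops at the shortest input, so if the files can have different numbers of lines, trailing rows are silently dropped. A Python 3 sketch using itertools.zip_longest keeps every row by padding the missing cells:

```python
import itertools

# Columns read from two hypothetical input files; col2 is shorter.
col1 = ["1", "2", "3", "4", "5"]
col2 = ["6", "7", "8"]

# zip() would stop after three rows, dropping "4" and "5";
# zip_longest pads the missing cells (here with an empty string).
rows = list(itertools.zip_longest(col1, col2, fillvalue=""))
print(rows)
# [('1', '6'), ('2', '7'), ('3', '8'), ('4', ''), ('5', '')]
```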
$ cat a
1
2
3
4
5
6
7
8
9
10
$ cat b
51
52
53
54
55
56
57
58
59
60
>>> import itertools
>>> for (i,j) in itertools.izip(open('a'), open('b')):
... print i.strip(), '---', j.strip()
...
1 --- 51
2 --- 52
3 --- 53
4 --- 54
5 --- 55
6 --- 56
7 --- 57
8 --- 58
9 --- 59
10 --- 60
>>>
I have 2 files of the following form:
file1:
work1
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
file2:
work2
2 3 4 5 5
2 4 7 8 9
work1
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
work3
1 7 8 9 10
Now I want to compare the two files and, wherever the header (say work1) matches, compare the subsequent sections and print the line at which a difference is found. E.g.
work1 (file1)
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
work1 (file2)
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
Now I want to print the line where the difference occurs, i.e. "1 2 4 4 5".
For doing so I have written the following code:
with open("file1") as r, open("file2") as w:
    for line in r:
        if "work1" in line:
            for line1 in w:
                if "work1" in line1:
                    print "work1"
However, from here on I am confused about how to read both files in parallel. After matching the "work1" headers, how should I read the two files side by side? Can someone please help me with this?
You would probably want to try out the itertools module in Python.
It contains a function called izip that can do what you need, along with a function called islice. You can iterate through the second file until you hit the header you are looking for, and then slice the file from that point on.
Here's a bit of the code.
from itertools import *

w = open('file2')
for (i, line) in enumerate(w):
    if "work1" in line:
        iter2 = islice(open('file2'), i, None, 1)  # Starts at the correct line
        f = open('file1')
        for (line1, line2) in izip(f, iter2):
            print line1, line2  # Place your comparisons of the two lines here.
You're guaranteed now that on the first run through of the loop you'll get "work1" on both lines. After that you can compare. Since f is shorter than w, the iterator will exhaust itself and stop once you hit the end of f.
Hopefully I explained that well.
EDIT: Added import statement.
EDIT: We need to reopen file2. This is because iterating through iterables in Python consumes the iterable. So, we need to pass a brand new one to islice so it works!
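To illustrate that point, here is a small Python 3 sketch (io.StringIO stands in for file2, with content mirroring the question) that first consumes the stream to locate the header and then rewinds before slicing; with a real file you could equally re-open it, as above:

```python
import io
from itertools import islice

file2 = io.StringIO("work2\n2 3 4 5 5\n2 4 7 8 9\nwork1\n7 8 9 10 11\n")

# First pass: find the header's line number. This consumes the stream.
for i, line in enumerate(file2):
    if "work1" in line:
        break

# Second pass: the stream is (partially) consumed, so rewind it
# before slicing from the header onwards.
file2.seek(0)
section = list(islice(file2, i, None))
print(section)
# ['work1\n', '7 8 9 10 11\n']
```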
with open('f1.csv') as f1, open('f2.csv') as f2:
    i = 0
    break_needed = False
    while True:
        r1, r2 = f1.readline(), f2.readline()
        if len(r1) == 0:
            print "eof found for f1"
            break_needed = True
        if len(r2) == 0:
            print "eof found for f2"
            break_needed = True
        if break_needed:
            break
        i += 1
        if r1 != r2:
            print " line %i" % i
            print "file 1 : " + r1
            print "file 2 : " + r2