Losing lines from two text files over iteration - python

I have two text files (A and B), like this:
A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...
B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...
What I have to do is read the two files, then create a new text file like this one:
1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...
Using for loops, I created the function (using Python):
def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            write on the text file
        else:
            break
Then I call the function like this:
for i in range(1,3):
    find(o, i)
    find(r, i)
What happens is that I lose some data, because the first line that contains a different number is read, but it's not written to the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.
Is there any way to avoid this?
Thanks in advance.

If the files fit in memory:
with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
L.sort(key=lambda line: int(line.partition(' ')[0])) # sort by 1st column
print("\n".join(L)) # print result
It is an efficient method if the total number of lines is under a million. Otherwise, and especially if you have many sorted files, you could use heapq.merge() to combine them.
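As a rough illustration of the heapq.merge() suggestion, here is a minimal Python 3 sketch. It assumes each input is already sorted by its first column and that the key argument is available (Python 3.5 or later). The helper names first_column and merge_by_first_column are made up for this example, and io.StringIO objects stand in for the real files.

```python
import heapq
import io


def first_column(line):
    """Sort key: the integer in the first whitespace-separated column."""
    return int(line.split(None, 1)[0])


def merge_by_first_column(sources, out):
    """Lazily merge already-sorted line iterables and write them to `out`."""
    for line in heapq.merge(*sources, key=first_column):
        out.write(line)


# In-memory stand-ins for files A and B, using the sample lines from the question.
file_a = io.StringIO("1 stringhere 5\n1 stringhere 3\n2 stringhere 4\n2 stringhere 4\n")
file_b = io.StringIO("1 stringhere 4\n1 stringhere 5\n2 stringhere 1\n2 stringhere 2\n")

merged = io.StringIO()
merge_by_first_column([file_a, file_b], merged)
print(merged.getvalue(), end="")
```

Because the merge is lazy, only one pending line per input is held in memory at a time. In CPython, lines with equal keys come out in the order the inputs were passed, so A's lines precede B's within each group.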

In your loop, when a line does not start with the same value as i, you break, but you have already consumed that line. So when the function is called again with i+1, it starts at the second matching line.
Either read the whole files into memory beforehand (see @J.F.Sebastian's answer), or, if that is not an option, replace your function with something like:
def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]): # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1) # Need to 'unread' the last read line
            break
This version 'rewinds' the cursor so that the next call to readline reads the correct line again. Note that mixing the implicit for line in l with the seek call is discouraged, hence the while True.
Example:
$ cat t.py
o = open("t1")
r = open("t2")
print o
print r
def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break
for i in range(1, 3):
    find(o, i)
    find(r, i)
$ cat t1
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$
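The 'unread' trick above relies on Python 2 file objects. In Python 3, text-mode files reject a non-zero seek relative to the current position, so a possible adaptation is to remember the position with tell() before each readline() and seek back to it. The sketch below is only an illustration under that assumption: copy_group is a made-up name, io.StringIO objects stand in for the real files, and blank lines are assumed not to occur.

```python
import io


def copy_group(src, key, out):
    """Copy consecutive lines whose first field equals `key` from src to out."""
    while True:
        pos = src.tell()            # remember where this line starts
        line = src.readline()
        if line and int(line.split()[0]) == key:
            out.write(line)
        else:
            src.seek(pos)           # 'unread' the line that did not match (or EOF)
            break


# In-memory stand-ins for the two input files from the question.
a = io.StringIO("1 stringhere 5\n1 stringhere 3\n2 stringhere 4\n2 stringhere 4\n")
b = io.StringIO("1 stringhere 4\n1 stringhere 5\n2 stringhere 1\n2 stringhere 2\n")

merged = io.StringIO()
for key in range(1, 3):
    copy_group(a, key, merged)
    copy_group(b, key, merged)
print(merged.getvalue(), end="")
```

As in the original answer, readline() is used instead of a for loop over the file, since tell() is not usable while a text file is being iterated with next().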

There may be a less complicated way to accomplish this. The following also keeps the lines in the order they appear in the files, as it appears you want to do.
lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))
The first three lines read the input files into a single array of lines.
The fourth line ensures that each line has exactly one newline at the end. If you're sure both files will end in a newline, you can omit this line.
The fifth line defines the key for sorting as the integer version of the first word of the string.
The sixth line sorts the lines and writes the result to the output file.

Related

Taking the specific column for each line in a txt file python

I have two txt files.
The first one contains a number on each line, like this:
22
15
32
53
.
.
and the other file contains 20 continuous numbers on each line, like this:
0.1 2.3 4.5 .... 5.4
3.2 77.4 2.1 .... 8.1
....
.
.
I want to split the second file according to the numbers in the first. For example, the first line of the first file is 22, so I take all 20 columns of the first line and the first two columns of the second line, and discard the rest of the second line. The second line of the first file is 15, so I take 15 columns from the third line of the other file and discard the rest of that line, and so on. How can I do this?
with open ('numbers.txt', 'r') as f:
    with open ('contiuousNumbers.txt', 'r') as f2:
        with open ('results.txt', 'w') as fOut:
            for line in f:
                ...
Thanks.
For each number in the first file, treat it as the target count of values to read. Use a while loop that keeps calling next on the second file object, decrementing the count by the number of values on each line until it reaches 0. Slice each line's values to the smaller of the remaining count and the line's length, so that only the requested number of values is output:
for line in f:
    output = []
    total = int(line)
    while total > 0:
        try:
            items = next(f2).split()
            output.extend(items[:min(total, len(items))])
            total -= len(items)
        except StopIteration:
            break
    fOut.write(' '.join(output) + '\n')
so that given the first file with:
3
6
1
5
and the second file with:
2 5
3 7
2 1
3 6
7 3
2 2
9 1
3 4
8 7
1 2
3 8
the output file will have:
2 5 3
2 1 3 6 7 3
2
9 1 3 4 8
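The same logic can be tried without touching the file system. The following Python 3 sketch wraps it in a function (take_counts is a hypothetical name) that accepts any iterables of lines, and runs it on the sample data above. It assumes the first input has no blank lines.

```python
def take_counts(counts, rows):
    """For each count, take that many values from successive rows.

    Leftover values on the last row used for a count are discarded.
    """
    rows = iter(rows)
    result = []
    for count_line in counts:
        remaining = int(count_line)
        taken = []
        while remaining > 0:
            try:
                items = next(rows).split()
            except StopIteration:
                break                       # second input is exhausted
            taken.extend(items[:remaining])  # slicing clamps to len(items)
            remaining -= len(items)
        result.append(" ".join(taken))
    return result


counts = ["3", "6", "1", "5"]
rows = ["2 5", "3 7", "2 1", "3 6", "7 3", "2 2",
        "9 1", "3 4", "8 7", "1 2", "3 8"]
for line in take_counts(counts, rows):
    print(line)
```

Open file objects can be passed directly in place of the lists, since the function only iterates over its arguments.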

extract specific portions of a text file

I have a text file as follows:
A B C D E
1 1 2 1 1e8
2 1 2 3 1e5
3 2 3 2 2000
50 2 3 2 2000
80 2 3 2 2000
...
1 2 5 6 1000
4 2 4 3 1e4
50 3 6 4 5000
120 3 5 2 2000
...
2 3 2 3 5000
3 3 4 5 1e9
4 3 2 3 1e6
7 3 2 3 43
...
I need code that goes through this text file, extracts the lines that share the same number in the first column [A], and saves them in separate files,
for example for first column = 1 and ...
1 1 2 1 1e8
1 2 5 6 1000
I wrote code with a while loop, but the problem is that the file is very big, and the loop also runs for numbers that do not exist in the file, so it takes a very long time to finish.
Thanks for your help
Warning
Both of the examples below will overwrite files called input_<number>.txt in the path they are run in.
Using awk
rm input_[0-9]*.txt; awk '/^[0-9]+[ \t]+/{ print >> "input_"$1".txt" }' input.txt
The front part /^[0-9]+[ \t]+/ does a regex match to select only lines which start with an integer number, the second part { print >> "input_"$1".txt" } prints those lines into a file named input_<number>.txt, with the corresponding lines for every number found in the first column of the file.
Using Python
import sys
import os
fn = sys.argv[1]
name, ext = os.path.splitext(fn)
with open(fn, 'r') as f:
    d = {}  # maps each first-column number to its open output file
    for line in f:
        try:
            # skip blank lines and lines that do not start with an integer
            ind = int(line.split()[0])
        except (IndexError, ValueError):
            continue
        try:
            d[ind].write(line)
        except KeyError:
            d[ind] = open(name + "_{}".format(ind) + ext, "w")
            d[ind].write(line)
    for dd in d.values():
        dd.close()
Using Python (avoiding too many open file handles)
In this case the output files are opened in append mode, so you have to manually remove any old output files before you run the code, using rm input_[0-9]*.txt
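import sys
import os
fn = sys.argv[1]
name, ext = os.path.splitext(fn)
with open(fn, 'r') as f:
    for line in f:
        try:
            # skip blank lines and lines that do not start with an integer
            ind = int(line.split()[0])
        except (IndexError, ValueError):
            continue
        # reopen in append mode so only one output file is open at a time
        with open(name + "_{}".format(ind) + ext, "a") as d:
            d.write(line)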
Raising the limit of the number of open file handles
If you are a sudoer on your machine, you can increase the limit of open file handles for a process by using ulimit -n <number>, as per this answer.

Iterate File Saving Blocks and Skipping Lines

I have data in blocks with non-data lines between the blocks. This code has been working but is not robust. How do I extract blocks and skip non-data blocks without consuming a line in the index test? I'm looking for a straight python solution without loading packages.
I've searched for a relevant example and I'm happy to delete this question if the answer exists.
from __future__ import print_function
BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2
block = 0
with open('array1.dat', 'rb') as f:
    for i in range (2):
        block += 1
        for index, line in enumerate(f):
            if index == BLOCK_DATA_ROWS:
                break
            print(block, 'index', index, 'line', line.rstrip('\r\n'))
        for index, line in enumerate(f):
            if index == SKIP_ROWS:
                break
            print(' skip index', index, 'line', line.rstrip('\r\n'))
Input
1
2
3
4
5
6
7
8
9
Output
1 index 0 line 1
1 index 1 line 2
1 index 2 line 3
skip index 0 line 5
skip index 1 line 6
2 index 0 line 8
2 index 1 line 9
Edit
I also want to use a similar iteration approach with an excel sheet:
for row in ws.iter_rows()
In the posted code, line 4 is read and the condition index == BLOCK_DATA_ROWS is met, so the first loop exits and the second one begins. Because f is an iterator, the second loop continues with the next element: line 4 has already been consumed by the first loop (it is not printed, but it was read).
This has to be taken into account in the code. One option is to combine both conditions in the same loop:
from __future__ import print_function
BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2
block = 1
with open('array1.dat', 'r') as f:
    index = 0
    for line in f:
        if index < BLOCK_DATA_ROWS:
            print(block, 'index', index, 'line', line.rstrip('\r\n'))
        elif index < BLOCK_DATA_ROWS+SKIP_ROWS:
            print(' skip index', index, 'line', line.rstrip('\r\n'))
        index += 1
        if index == BLOCK_DATA_ROWS+SKIP_ROWS: # IF!!, not elif
            index = 0
            block += 1
The for i in range(2) has also been removed, and now the code will work for any number of blocks, not just 2.
Which returns:
1 index 0 line 1
1 index 1 line 2
1 index 2 line 3
skip index 3 line 4
skip index 4 line 5
2 index 0 line 6
2 index 1 line 7
2 index 2 line 8
skip index 3 line 9
skip index 4 line 10
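Another way to avoid losing a line at each boundary is itertools.islice from the standard library, which takes at most a fixed number of items from an iterator and does not read past them. The sketch below is one possible arrangement, not the answer's method: read_blocks is a made-up name, and io.StringIO stands in for array1.dat.

```python
import io
from itertools import islice

BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2


def read_blocks(lines, data_rows=BLOCK_DATA_ROWS, skip_rows=SKIP_ROWS):
    """Return a list of data blocks, discarding `skip_rows` lines after each."""
    lines = iter(lines)
    blocks = []
    while True:
        # islice pulls at most `data_rows` items and stops without reading more
        block = [line.rstrip("\r\n") for line in islice(lines, data_rows)]
        if not block:
            break                      # input exhausted
        blocks.append(block)
        for _ in islice(lines, skip_rows):
            pass                       # consume and ignore the non-data lines
    return blocks


sample = io.StringIO("1\n2\n3\n4\n5\n6\n7\n8\n9\n")
for number, block in enumerate(read_blocks(sample), start=1):
    print(number, block)
```

Since islice works on any iterator, the same pattern should carry over to other row sources, provided they are consumed through a single shared iterator.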

Missing whitespace when printing in a loop

I have this strange problem when following a reference, this code:
for r in range(10):
    for c in range(r):
        print "",
    for c in range(10-r):
        print c,
    print ""
should print out something like this:
0 1 2 3 4 5 6 7 8 9
  0 1 2 3 4 5 6 7 8
    0 1 2 3 4 5 6 7
      0 1 2 3 4 5 6
        0 1 2 3 4 5
          0 1 2 3 4
            0 1 2 3
              0 1 2
                0 1
                  0
but instead I am getting this:
0 1 2 3 4 5 6 7 8 9
 0 1 2 3 4 5 6 7 8
  0 1 2 3 4 5 6 7
   0 1 2 3 4 5 6
    0 1 2 3 4 5
     0 1 2 3 4
      0 1 2 3
       0 1 2
        0 1
         0
Can anyone explain to me what is causing the indent on the right side? It seems so simple, but I have no clue what I can do to fix this.
You were printing the leading spaces incorrectly. You were printing empty quotes (""), which with the trailing comma produces only a single space. When you do print c, there is also a space printed after c. You should print " " instead to get the correct spacing. This is a very subtle thing to notice.
for r in range(10):
    for c in range(r):
        print " ", #print it here
    for c in range(10-r):
        print c,
    print ""
Test
If you want to format it just so, it might be better to just let Python do it for you instead of counting explicit and the hidden implicit spaces. See the string formatting docs for what {:^19} means and more.
for i in range(10):
    nums = ' '.join(str(x) for x in range(10 - i))
    #print '{:^19}'.format(nums) # reproduces your "broken" code
    print '{:>19}'.format(nums) # your desired output
Using the print function is a good alternative sometimes, as you can eliminate hidden spaces by setting the keyword argument end to an empty string:
from __future__ import print_function # must be at the top of the file.
# ...
print(x, end='')
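To make the print-function suggestion concrete, here is a small Python 3 sketch in which every space is written explicitly, so nothing depends on the implicit space that the Python 2 print statement adds after a trailing comma. The function name print_staircase and the out parameter are inventions for this example.

```python
import io
import sys


def print_staircase(n=10, out=sys.stdout):
    """Print rows 0..n-r-1, each indented two spaces more than the previous."""
    for r in range(n):
        for _ in range(r):
            # two characters per step: one for the digit above, one for its separator
            print("  ", end="", file=out)
        # print() separates the numbers with a single space and ends the line
        print(*range(n - r), file=out)


print_staircase()

# Capture a small case to inspect the exact characters written.
buffer = io.StringIO()
print_staircase(3, out=buffer)
```

With n=3 the captured text is "0 1 2", "  0 1" and "    0" on successive lines, with no trailing spaces.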
You are simply not creating enough indent on the left side (there is no such thing as right side indent while printing).
For every new line you want to increase the indent by two spaces, because you are adding a number plus a whitespace on the line above. "", automatically adds one whitespace (this is why there are whitespaces between the numbers). Since you need to add two, simply add a whitespace within the quotes, like this: " ",.
The extra whitespace is filling the space of the number in the line above. The comma in "", is only filling the space between the numbers. To clarify: " ", uses the same space as c,, two characters, while "", only uses one character.
Here is your code with the small fix:
for r in range(10):
    for c in range(r):
        print " ", # added a whitespace here for correct indent
    for c in range(10-r):
        print c,
    print ""

Formatting files to be written

Hi, how can I loop over a text file, identify the lines whose last field is 0, and delete those lines while retrieving the ones that are not deleted? I also want to format the output as tuples.
input.txt = 1 2 0
1 3 0
11 4 0.058529
...
...
...
97 7 0.0789
Desired output should look like this
[(11,4,{'volume': 0.058529})]
Thank you
Pass inplace=1 to the fileinput.input() to modify the file in place. Everything that is printed inside the loop is written to the file:
import fileinput
results = []
for line in fileinput.input('input.txt', inplace=1):
    data = line.split()
    if data[-1].strip() == '0':
        print line.strip()
    else:
        results.append(tuple(map(int, data[:-1])) + ({'volume': float(data[-1])}, ))
print results
If the input.txt contains:
1 2 0
1 3 0
11 4 0.058529
97 7 0.0789
the code will print:
[(11, 4, {'volume': 0.058529}),
(97, 7, {'volume': 0.0789})]
And the contents of input.txt become:
1 2 0
1 3 0
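If the file itself should be left untouched, the tuples can also be built by reading only. The following Python 3 sketch is an alternative under stated assumptions, not the answer's method: parse_edges is a hypothetical name, every data line is assumed to have exactly three whitespace-separated fields, and anything else (such as a line of dots) is skipped.

```python
def parse_edges(lines):
    """Return (int, int, {'volume': float}) tuples for lines whose last field is not 0."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                   # not a data line
        volume = float(fields[2])
        if volume == 0:
            continue                   # drop lines ending in 0
        edges.append((int(fields[0]), int(fields[1]), {"volume": volume}))
    return edges


sample = ["1 2 0\n", "1 3 0\n", "11 4 0.058529\n", "...\n", "97 7 0.0789\n"]
print(parse_edges(sample))
```

An open file can be passed in place of the list, for example with open('input.txt') as f: parse_edges(f).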
