Find all string positions in file - python

If I want to find the position of a string in a file, I can do
f = open('file.txt', 'r')
lines = f.read()
posn = lines.find('string')
What if the string occurs several times in the file and I want to find all the positions where it occurs? I have a list of strings, so right now my code is
for string in list:
    f = open('file.txt', 'r')
    lines = f.read()
    posn = lines.find(string)
My code is incomplete: it only finds the first position of each string in the list.

You can use the following
import re
a = open("file", "r")
g = a.read()
ma = re.finditer('test', g)
for t in ma:
    print t.start(), t.end()
Possible output
8 12
16 20
For example:
g = 'hahahatesthahatesthahahatest'
ma = re.finditer('test', g)
for t in ma:
    print t.start(), t.end()
Output
6 10
14 18
24 28
print g[t.start():t.end()] gives you test as expected

You can just use enumerate. Note that this compares one character at a time, so it only works when sub is a single character:
>>> s='this is a string'
>>> def find_pos(s,sub):
...     return [i for i,j in enumerate(s) if j==sub]
...
>>> find_pos(s,'s')
[3, 6, 10]
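The enumerate version compares single characters, so here is a minimal sketch of the same idea for substrings of any length, built on repeated calls to str.find. It is written for Python 3, and the name find_all is made up for illustration.

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub in text (overlaps included)."""
    if not sub:
        return []  # an empty substring would match everywhere; treat it as no match
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Step forward by one character so overlapping occurrences are also found
        start = text.find(sub, start + 1)
    return positions


print(find_all('hahahatesthahatesthahahatest', 'test'))  # [6, 14, 24]
print(find_all('this is a string', 'is'))                # [2, 5]
```

Stepping by one character instead of len(sub) is a design choice: it reports overlapping hits, which re.finditer would skip.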

This returns the (start, end) span of every occurrence of your pattern in the file, using re.finditer.
import re
with open('your.file') as f:
    text = f.read()
positions = [m.span() for m in re.finditer('pattern', text)]
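Since the question mentions a whole list of strings, a possible extension is to read the file once and collect the positions of each string in a dict. This is only a sketch for Python 3: find_positions is a hypothetical helper name, and re.escape is used on the assumption that the strings should be matched literally rather than as regular expressions.

```python
import re


def find_positions(text, needles):
    """Map each literal string in needles to the list of its start offsets in text."""
    return {
        needle: [m.start() for m in re.finditer(re.escape(needle), text)]
        for needle in needles
    }


text = 'hahahatesthahatesthahahatest'  # stand-in for the result of f.read()
result = find_positions(text, ['test', 'haha', 'a.b'])
print(result['test'])  # [6, 14, 24]
print(result['a.b'])   # [] because the dot is matched literally
```

Note that re.finditer reports non-overlapping matches only, so 'haha' is found three times here, not at every possible offset.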

Related

How can I merge each two lines of a large text file into a Python list?

I have a .txt file that is split into multiple lines, and I would like to merge every two of these lines into a single item of a list. How do I do that?
Thanks a lot!
What I have is organized like this:
[1 2 3 4
5 6]
[1 2 3 4
5 6 ]
while what I need would be:
[1 2 3 4 5 6]
[1 2 3 4 5 6]
data = []
with open(r'<add file path here >', 'r') as file:
    x = file.readlines()
for i in range(0, len(x), 2):
    data.append(x[i:i+2])
new = [' '.join(i) for i in data]
for i in range(len(new)):
    new[i] = new[i].replace('\n', '')
new_file_name = r''  # give new file path here
with open(new_file_name, 'w+') as file:
    for i in new:
        file.write(i + '\n')
Try this:
final_data = []
with open('file.txt') as a:
    fdata = a.readlines()
for ln in range(0, len(fdata), 2):
    final_data.append(" ".join([fdata[ln].strip('\n'), fdata[ln+1].strip('\n')]))
print (final_data)
You can use a regex for this scenario:
#! /usr/bin/env python2.7
import re
with open("textfilename.txt") as r:
    text_data = r.read()
# search the text that was read, not the file object
independent_lists = re.findall(r"\[(.+?)\]", text_data, re.DOTALL)
# now that we have each independent list as a string, we can
# turn it into a list object
final_list_of_objects = [each_string.replace("\n", " ").split() for each_string in independent_lists]
print final_list_of_objects
However, if you do not want list objects and just want the text without the newline characters inside each list, then:
#! /usr/bin/env python2.7
import re
with open("textfilename.txt") as r:
    text_data = r.read()
new_txt = ""
bool_char = False  # True while we are between "[" and "]"
for each_char in text_data:
    if each_char == "[":
        bool_char = True
    elif each_char == "]":
        bool_char = False
    elif each_char == "\n" and bool_char:
        each_char = " "
    new_txt += each_char
# collapse runs of spaces between numbers; newlines between the lists are kept
new_txt = re.sub(r" +", " ", new_txt)
You can do two things here:
1) If the text file was created by writing using numpy's savetxt function, you can simply use numpy.loadtxt function with appropriate delimiter.
2) Read file in a string and use a combination of replace and split functions.
file = open(filename,'r')
dataset = file.read()
dataset = dataset.replace('\n',' ').replace('] ',']\n').split('\n')
dataset = [x.replace('[','').replace(']','').split(' ') for x in dataset]
with open('test.txt') as file:
    new_data = (" ".join(line.strip() for line in file).replace('] ',']\n').split('\n')) # ['[1 2 3 4 5 6]', ' [1 2 3 4 5 6 ]']
with open('test.txt','w+') as file:
    for data in new_data:
        file.write(data+'\n')
line.strip() removes the leading and trailing whitespace, including the trailing newline ('\n'), from the line.
You need to pass all of the read and stripped lines to ' '.join(), not
each line by itself. Strings in Python are sequences too, so the string
contained in line is interpreted as separate characters when passed on
its own to ' '.join().
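Another way to pair up lines, sketched here for Python 3, is to hand the same iterator to zip_longest twice so that it consumes two lines per step. The name merge_pairs and the io.StringIO stand-in for the real file are assumptions made for the example.

```python
import io
from itertools import zip_longest


def merge_pairs(lines):
    """Join every two consecutive lines into one string."""
    stripped = (line.strip() for line in lines)
    # Passing the same iterator twice makes zip_longest take the lines two at a time;
    # fillvalue="" covers a file with an odd number of lines
    return [" ".join(pair).strip() for pair in zip_longest(stripped, stripped, fillvalue="")]


sample = io.StringIO("[1 2 3 4\n5 6]\n[1 2 3 4\n5 6 ]\n")  # stand-in for open('file.txt')
print(merge_pairs(sample))  # ['[1 2 3 4 5 6]', '[1 2 3 4 5 6 ]']
```

This never indexes line i+1 directly, so an unpaired last line is kept on its own instead of raising an IndexError.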

Find sum of numbers in line

This is what I have to do:
Read content of a text file, where two numbers separated by comma are on each line (like 10, 5\n, 12, 8\n, …)
Make a sum of those two numbers
Write into new text file two original numbers and the result of summation = like 10 + 5 = 15\n, 12 + 8 = 20\n, …
So far, I've got this:
import os
import sys
relative_path = "Homework 2.txt"
if not os.path.exists(relative_path):
    print "not found"
    sys.exit()
read_file = open(relative_path, "r")
lines = read_file.readlines()
read_file.close()
print lines
path_output = "data_result4.txt"
write_file = open(path_output, "w")
for line in lines:
    line_array = line.split()
    print line_array
You need to have a good understanding of Python to understand this.
First, read the file and split it into lines.
For each expression, calculate the answer and write it. Remember, you need to cast the numbers to integers so that they can be added together.
with open('Original.txt') as f:
    lines = f.read().splitlines()
with open('answers.txt', 'w+') as f:
    for expression in lines:  # expression should be in format '12, 8'
        if not expression.strip():
            continue  # skip blank lines, which int() cannot convert
        nums = [int(i) for i in expression.split(',')]  # int() ignores surrounding spaces
        f.write('{} + {} = {}\n'.format(nums[0], nums[1], nums[0] + nums[1]))
        # That should write '12 + 8 = 20\n'
Make your last for loop look like this:
for line in lines:
    splitline = [part.strip() for part in line.split(",")]  # drop spaces and the newline
    summation = sum(map(int, splitline))
    write_file.write(" + ".join(splitline) + " = " + str(summation) + "\n")
One nice thing about this approach is that you can have as many numbers as you want on a line, and it will still display correctly.
It seems like the input file is CSV, so just use the csv reader module in Python.
Input file Homework 2.txt
1, 2
1,3
1,5
10,6
The script
import csv
f = open('Homework 2.txt', 'rb')
reader = csv.reader(f)
result = []
for line in list(reader):
    nums = [int(i) for i in line]
    result.append(["%(a)s + %(b)s = %(c)s" % {'a' : nums[0], 'b' : nums[1], 'c' : nums[0] + nums[1] }])
f = open('Homework 2 Output.txt', 'wb')
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for line in result:
    writer.writerow(line)
The output file is then Homework 2 Output.txt
1 + 2 = 3
1 + 3 = 4
1 + 5 = 6
10 + 6 = 16
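To make the per-line step easier to check on its own, here is a small Python 3 sketch that keeps the formatting in one function and the blank-line handling in another. The names format_sum and sum_lines are hypothetical, and the input is assumed to contain only comma-separated integers.

```python
def format_sum(line):
    """Turn '10, 5' into '10 + 5 = 15'; works for any count of comma-separated integers."""
    numbers = [int(part) for part in line.split(",")]  # int() tolerates spaces and '\n'
    return "{} = {}".format(" + ".join(str(n) for n in numbers), sum(numbers))


def sum_lines(lines):
    """Format every non-blank line; blank lines are skipped."""
    return [format_sum(line) for line in lines if line.strip()]


print(sum_lines(["10, 5\n", "12, 8\n", "\n"]))  # ['10 + 5 = 15', '12 + 8 = 20']
```

Writing the result out is then a matter of joining the returned strings with '\n' and passing them to the output file's write().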

Python: use a for loop to read specific multiple lines from txt files

I want to use Python to read specific groups of lines from txt files, for example lines 7 to 10, 17 to 20, 27 to 30, etc.
Here is the code I wrote, but it only prints the numbers from the first 3 lines. Why? I am very new to Python.
with open('OpenDR Data.txt', 'r') as f:
    for poseNum in range(0, 4):
        Data = f.readlines()[7+10*poseNum:10+10*poseNum]
        for line in Data:
            matAll = line.split()
            MatList = map(float, matAll)
            MatArray1D = np.array(MatList)
            print MatArray1D
This simplifies the math a little to choose the relevant lines. You don't need to use readlines().
with open('OpenDR Data.txt', 'r') as fp:
    for idx, line in enumerate(fp, 1):
        if idx % 10 in [7,8,9,0]:
            matAll = line.split()
            MatList = map(float, matAll)
            MatArray1D = np.array(MatList)
            print MatArray1D
with open('OpenDR Data.txt') as f:
    lines = f.readlines()
    for poseNum in range(0, 4):
        Data = lines[7+10*poseNum:10+10*poseNum]
You should only call readlines() once, so you should do it outside the loop. The first call reads the file to the end, so any later call on the same file object returns an empty list:
with open('OpenDR Data.txt', 'r') as f:
    lines = f.readlines()
    for poseNum in range(0, 4):
        Data = lines[7+10*poseNum:10+10*poseNum]
        for line in Data:
            matAll = line.split()
            MatList = map(float, matAll)
            MatArray1D = np.array(MatList)
            print MatArray1D
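The reason only the first group is printed can be reproduced in a few lines. This Python 3 sketch uses io.StringIO as a stand-in for the real file (an assumption made so the example runs without any data file).

```python
import io

f = io.StringIO("a\nb\nc\n")  # stand-in for an opened text file
first = f.readlines()         # reads everything up to the end of the file
second = f.readlines()        # the file position is already at the end
print(first)                  # ['a\n', 'b\n', 'c\n']
print(second)                 # []

f.seek(0)                     # rewinding makes the contents readable again
third = f.readlines()
print(third == first)         # True
```

Reading once and slicing the resulting list, as in the answer above, avoids both the empty result and the cost of rereading the file.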
You can use a combination of list slicing and a comprehension.
start = 7
end = 10
interval = 10
groups = 3
with open('data.txt') as f:
    lines = f.readlines()
mult_lines = [lines[start-1 + interval*i:end + interval*i] for i in range(groups)]
This will return a list of lists containing each group of lines (e.g. lines 7 through 10, then 17 through 20).
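If the file is large, the same selection can be made while streaming, without holding every line in memory. The sketch below is for Python 3; pick_lines and its parameter names are hypothetical, and line numbers are assumed to be 1-based as in the question.

```python
def pick_lines(lines, start=7, end=10, interval=10, groups=3):
    """Return lines start..end (1-based, inclusive) from each of the first `groups` blocks."""
    wanted = set()
    for g in range(groups):
        wanted.update(range(start + interval * g, end + interval * g + 1))
    # enumerate(..., 1) numbers the lines from 1 so they match the wanted set
    return [line for number, line in enumerate(lines, 1) if number in wanted]


sample = ["line %d\n" % n for n in range(1, 41)]  # stand-in for an open file
picked = pick_lines(sample)
print(picked[0].strip(), picked[-1].strip())  # line 7 line 30
```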

How to add or replace some string at a particular column position in a text file

How can I add or replace a string at a particular column position in a text file?
For example, I have the following sentence on every line of a file:
Roxila almost lost
Roxila almost lost
Roxila almost lost
Roxila almost lost
Roxila almost lost
enumerate() gives something like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
R o x i l a   a l m o  s  t     l  o  s  t
Now I want to replace index 6, which is a space, with "*" on every row, like this:
Roxila*almost lost
How can I do this with Python? Please help.
You can use slicing to get a new string and the fileinput module to update the existing file.
Slicing demo:
>>> s = "Roxila almost lost"
>>> s
'Roxila almost lost'
>>> s[:6] + '*' + s[7:]
'Roxila*almost lost'
Updating the file:
import fileinput
for line in fileinput.input('foo.txt', inplace=True):
    print line[:6] + '*' + line[7:],
If the length of the first word changes, slicing at a fixed index won't work. In that case it is better to split on the space:
>>> s.split(' ')
['Roxila', 'almost', 'lost']
>>> p = s.split(' ')
>>> p[0]+'*'+' '.join(p[1:])
'Roxila*almost lost'
for line in f:
    line = line.rstrip()
    newline = line[:6] + '*' + line[7:]
    print newline
Another approach, using replace:
with open("yourfile.txt", "r") as file:
    lines = file.read().split("\n")
newlines = []
for line in lines:
    newline = line.replace(" ", "*", 1)
    newlines.append(newline)
with open("newfile.txt", "w") as newfile:
    newfile.write("\n".join(newlines))
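For the fixed-column case, the slicing idea can be wrapped in a small function that also leaves short lines alone. This is a Python 3 sketch; replace_at is a made-up name, and the index is assumed to be non-negative.

```python
def replace_at(line, index, replacement="*"):
    """Return line with the character at index replaced; shorter lines come back unchanged."""
    if index >= len(line):
        return line
    return line[:index] + replacement + line[index + 1:]


print(replace_at("Roxila almost lost", 6))  # Roxila*almost lost
print(replace_at("short", 6))               # short
```

Strings are immutable in Python, which is why the function builds a new string from two slices instead of assigning to line[index].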

How to parse data in a variable length delimited file?

I have a text file which does not conform to standards. So I know the (end,start) positions of each column value.
Sample text file :
#     #   #   #
Techy Inn Val NJ
Found the position of # using this code :
f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    if line.find('#') > 0:
        print line
        for each in line:
            i += 1
            if each == '#':
                positions.append(i)
1 7 11 15 => Positions
So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:
Here's a way to read fixed width fields using regexp
>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>
Off the top of my head:
f = open(.......)
header = f.next() # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]
Update with tested, fixed code:
import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields
Input file:
#      #  #
Foo    BarBaz
123456789abcd
Debug output:
'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']
Robustification notes:
This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.
import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields
Adapted from John Machin's answer
>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']
You can also write the last line like this
>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]
For the other example you give in the comments, you just need to have the correct header
>>> header = "#       #     #     #"
>>> row = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
OK, to be a little different and to give the generalized solution asked for in the comments, I use the header line with a generator function instead of slices. Additionally, the first columns can be treated as a comment by leaving the field name out of the first column, and multi-character field names can be used instead of only '#'.
The minus point is that one-character fields cannot have header names and can only be marked with '#' in the header line ('#' is always treated, as in the previous solutions, as the beginning of a field, even when it follows letters in the header).
sample="""
HOTEL CAT ST DEP ##
Test line Techy Inn Val NJ FT FT
"""
data=sample.splitlines()[1:]
def fields(header,line):
    previndex=0
    prevchar=''
    for index,char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex=index
        prevchar = char
    yield line[previndex:]
header,dataline = data
print list(fields(header,dataline))
Output
['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']
One practical use of this is parsing fixed-width data without knowing the field lengths: build the header from a copy of a data line that has every field present and no comment, with the spaces inside fields replaced by something else such as '_' and single-character field values replaced by '#'.
Header from sample line:
' Techy_Inn Val NJ FT ##'
def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue
        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))
    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed
import re
f = open('sample.txt', 'r')
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
pos[-1] = (pos[-1][0], None)
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()
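How about this?
with open('somefile','r') as source:
    line = source.next().rstrip('\n')
    # each field is as wide as the '#' that starts it plus the gap that follows
    sizes = [len(gap) + 1 for gap in line.split("#")[1:]]
    positions = [ (sum(sizes[:x]),sum(sizes[:x+1])) for x in xrange(len(sizes)) ]
    positions[-1] = (positions[-1][0], None)  # the last field runs to the end of the line
    for line in source:
        fields = [ line[start:end] for start,end in positions ]
Is this what you're looking for?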
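The answers above are written for Python 2 (f.next(), xrange, print statements). As a rough Python 3 sketch of the same slice-based idea, assuming the header marks every column start with '#', something like the following could work. The names column_slices and parse_fixed_width are invented for this example.

```python
def column_slices(header, marker="#"):
    """Build one slice per column from the positions of marker in the header line."""
    starts = [i for i, ch in enumerate(header) if ch == marker]
    # Each column ends where the next one begins; the last runs to the end of the line
    return [slice(lo, hi) for lo, hi in zip(starts, starts[1:] + [None])]


def parse_fixed_width(lines):
    """Yield the stripped fields of every line that follows the header line."""
    lines = iter(lines)
    slices = column_slices(next(lines, ""))  # empty input simply yields nothing
    for line in lines:
        yield [line[sl].strip() for sl in slices]


sample = ["#     #   #   #\n", "Techy Inn Val NJ\n"]  # stand-in for an open file
print(list(parse_fixed_width(sample)))  # [['Techy', 'Inn', 'Val', 'NJ']]
```

Because any iterable of lines is accepted, the same function can be fed an open file object directly, for example parse_fixed_width(open('sample.txt')).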
