How can I search backwards from a given offset for N lines containing a specific string? With Python on Unix.
Given a file:
a
babc1
c
abc1
abc2
d
e
f
Given the offset: 20 (that's "d"), the string: "abc", N: 2, the output should be:
strings:
# the babc1 will not count since we only need 2
abc1
abc2
offset: (we need to return the offset where the search ends)
10 (the "a" in "abc1")
The above example is just a demo; the real file is a 33G log, which is why I need to take an offset as input and return one as output.
I think the core problem is: how do I read lines from a file in reverse, starting at a given offset? The offset is near the tail.
I tried to do it with bash, and it was agony. Is there an elegant, efficient way to do it in Python 2? Besides, we will run the script with suitable (an encapsulation of Ansible), so the dependencies should be as simple as possible.
You can use the following function:
from file_read_backwards import FileReadBackwards

def search(filename, file_size, offset, substring, n):
    off = 0
    with FileReadBackwards(filename) as f:
        while off < (file_size - offset):
            line = f.readline()
            off += len(line)
        found = 0
        for line in f:
            off += len(line)
            if substring in line:
                yield line
                found += 1
                if found >= n:
                    yield file_size - off - 1
                    return
Use it like this:
s = "s.txt"
file_size = 25
offset = 20
string = "abc"
n = 2
*found, offset = search(s, file_size, offset, string, n)
print(found, offset)
Prints:
['abc2', 'abc1'] 10
You can use seek to go to the offset in the file as follows:
def reverse_find(string, offset, count):
    with open("FILENAME") as f:
        f.seek(offset)
        results = []
        while offset > 1 and count > 0:
            line = ""
            char = ""
            # Compare with != (value), not "is not" (identity),
            # and stop at the start of the file.
            while char != "\n" and offset > 0:
                offset -= 1
                f.seek(offset)
                char = f.read(1)
                line = char + line
            if string in line:
                results = [line.strip()] + results
                count -= 1
        return results, offset + 1

print(reverse_find("abc", 20, 2))
This will return:
(['abc1', 'abc2'], 10)
Thanks, rassar. But I found the answer here: https://stackoverflow.com/a/23646049/9782619. It is more efficient than Mackerel's and requires fewer dependencies than rassar's.
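For readers who want a stdlib-only variant, here is a minimal sketch of the same idea using mmap and rfind. It is not the code from the linked answer. The function name reverse_find and the temporary demo file are assumptions made for illustration, and it assumes a 64-bit system where a large file can be memory-mapped, with lines ending in "\n".

```python
import mmap
import os
import tempfile


def reverse_find(path, offset, needle, count):
    """Scan backwards from `offset` for up to `count` lines containing `needle`.

    Returns (matching lines in file order, offset of the earliest match).
    If nothing matches, the returned offset is the input offset unchanged.
    """
    matches = []
    match_start = offset
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Newline that terminates the last complete line before `offset`.
            end = mm.rfind(b"\n", 0, offset)
            while end != -1 and len(matches) < count:
                # A line starts one past the previous newline (0 at file start).
                start = mm.rfind(b"\n", 0, end) + 1
                line = mm[start:end]
                if needle in line:
                    matches.append(line)
                    match_start = start
                end = start - 1  # becomes -1 once the first line is consumed
        finally:
            mm.close()
    matches.reverse()
    return matches, match_start


# Demo on the sample file from the question (written to a temporary file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\nbabc1\nc\nabc1\nabc2\nd\ne\nf\n")
lines, new_offset = reverse_find(tmp.name, 20, b"abc", 2)
os.remove(tmp.name)
print(lines, new_offset)
```

Only the pages that are actually touched get read, so starting near the tail of a large log should stay cheap. The needle is passed as bytes because the file is opened in binary mode.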
Related
After reading a text, I need to add 1 to a sum if I find a ( character, and subtract 1 if I find a ) character in the text. I can't figure out what I'm doing wrong.
This is what I tried at first:
file = open("day12015.txt")
sum = 0
up = "("
for item in file:
    if item is up:
        sum += 1
    else:
        sum -= 1
print(sum)
I have this long text like the following example (((())))((((( .... If I find a ), I need to subtract 1, if I find a (, I need to add 1. How can I solve it? I'm always getting 0 as output even if I change my file manually.
Your for loop iterates over the lines of the file, so each item is a whole line rather than a single character. You have to loop over the characters of each line to get the desired output.
Example .txt
(((())))(((((
Full Code
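file = open("Data.txt")
sum = 0
up = "("
for string in file:
    for item in string:
        # Compare characters with == (value), not "is" (identity).
        if item == up:
            sum += 1
        # Only ")" subtracts; a trailing newline must not count.
        elif item == ")":
            sum -= 1
print(sum)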
Output
5
Hope this helps. Happy coding :)
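To see why the original loop never matched a single "(", it may help to look at what iterating over a file actually yields. The sketch below uses io.StringIO as a stand-in for the opened file (an assumption for illustration; a real text file iterates the same way, line by line).

```python
import io

# io.StringIO stands in for the opened text file.
fake_file = io.StringIO("(((())))(((((\n")

# Iterating over a file object yields whole lines, not characters.
lines = [item for item in fake_file]
print(lines)

# Looping over each line gives the individual characters.
total = 0
for line in lines:
    for ch in line:
        if ch == "(":
            total += 1
        elif ch == ")":
            total -= 1
print(total)
```

The list holds a single item, the full line including its newline, which is why comparing that item with "(" can never succeed.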
So you need to sum +1 for "(" character and -1 for ")".
Specify directly what should happen for each of the two characters. You also need to loop over the characters of each line you read from the file. In your code, you are subtracting one for every item that is not "(".
file = open("day12015.txt")
total = 0
for line in file:
    for character in line:
        if character == "(":
            total += 1
        elif character == ")":
            total -= 1
print(total)
That's simply a matter of counting each character in the text. The sum is the difference between those counts. Look:
from pathlib import Path
file = Path('day12015.txt')
text = file.read_text()
total = text.count('(') - text.count(')')
For the string you posted, for example, we have this:
>>> p = '(((())))((((('
>>> p.count('(') - p.count(')')
5
>>>
Just for comparison and out of curiosity, I timed the str.count() approach and a loop approach, 1,000 times each, using a string of 1,000,000 random ( and ) characters. Here is what I found:
import random
from timeit import timeit

random.seed(0)
p = ''.join(random.choice('()') for _ in range(1_000_000))

def f():
    return p.count('(') - p.count(')')

def g():
    a, b = 0, 0
    for c in p:
        if c == '(':
            a = a + 1
        else:
            b = b + 1
    return a - b

print('f: %5.2f s' % timeit(f, number=1_000))
print('g: %5.2f s' % timeit(g, number=1_000))
f: 8.19 s
g: 49.34 s
This means the loop approach is about 6 times slower, even though the str.count() version iterates over p twice to compute the result.
I want to write all base-26 numbers (with letters of the alphabet as digits) of a certain length into an ASCII-file.
For length = 4 this would look like
aaaa
aaab
aaac
...
zzzx
zzzy
zzzz
I achieved this with the following recursive code:
def fuz(data, ll_str):
    ll_str += 1

    def for_once(data_once, ll_str_once):
        tmp_str = ll_str_once
        tmp_str -= 1
        new_data = []
        for m in data_once:
            for i1 in range(97, 123):
                new_data.append(m + chr(i1))
        if tmp_str != 0:
            return for_once(new_data, tmp_str)
        else:
            return data_once

    return for_once(data, ll_str)

if __name__ == '__main__':
    ll = 4
    test = ['']
    file_output = open("out.txt", 'a')
    out_data = fuz(test, ll)
    for out in out_data:
        file_output.write(out + '\n')
    file_output.close()
However, for any length > 4, this solution runs out of memory on my machine.
I am therefore looking for an alternative without recursion. Can anybody give me a hint on how to do this?
This loop writes all base-26 numbers of length 4 (with letters as digits) in a file named out.txt.
base and length can be arbitrarily chosen - but prepare to be patient for higher values...
import itertools as it

base = 26
lngth = 4
with open('out.txt', 'w') as f:
    for t in it.product(range(97, 97+base), repeat=lngth):
        s = ''.join(map(chr, t))
        f.write(s + '\n')  # newline; chr(13) would be a bare carriage return
At least it doesn't consume too much memory, as requested by the OP.
However, with base 26 the length-5 file was already 70MB, and I stopped writing the length-6 file at 1.4GB, at which point Notepad++ could no longer open it. So judge for yourself how useful this code is.
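Since the question speaks of base-26 numbers, another non-recursive way to look at it is to count from 0 to 26**length - 1 and convert each integer to its letter digits. The following is only a sketch under that reading; base26_words is a hypothetical helper name, not something from the answers above.

```python
import string


def base26_words(length, alphabet=string.ascii_lowercase):
    """Yield every fixed-length word over `alphabet`, in counting order."""
    base = len(alphabet)
    for number in range(base ** length):
        rest = number
        digits = []
        # Peel off the least significant digit `length` times.
        for _ in range(length):
            rest, remainder = divmod(rest, base)
            digits.append(alphabet[remainder])
        # Digits were collected least significant first, so reverse them.
        yield "".join(reversed(digits))


words = list(base26_words(2))
print(words[0], words[1], words[-1], len(words))
```

Because it is a generator, the words can be written to a file one at a time instead of being collected in a list as the small demo does.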
so -----2-----3----5----2----3----- would become -----4-----5----7----4----5-----
if the constant were 2, and so on for every individual line in the text file.
This would involve recognising the numbers between the dashes and adding a constant to each of them as a whole, e.g. ---15--- becomes ---17--- not ---35---.
(basically getting a guitar tab and adding a constant to every fret number)
Thanks. Realised this started out vague and confusing so sorry about that.
Let's say the file is:
-2--3--5---7--1\n-6---3--5-1---5
and I'm adding 2, it should become:
-4--5--7---9--3\n-8---5--7-3---7
Change the filename to something relevant and this code will work. Anything below new_string needs to be changed to suit what you need, e.g. writing to a file.
def addXToAllNum(delta: int, line: str):
    # Keep only the numeric fields between the dashes.
    values = [x for x in line.strip().split('-') if x.isdigit()]
    values = [str(int(x) + delta) for x in values]
    # Note: fields are re-joined with '--', so the original spacing is lost.
    return '--'.join(values)

delta = 2  # amount to add to every number
new_string = '' # change this section to save to new file
for line in open('tabfile.txt', 'r'):
    new_string += addXToAllNum(delta, line) + '\n'

## general principle
s = '-4--5--7---9--3 -8---5--7-3---7'
addXToAllNum(2, s) #6--7--9--11--10--7--9--5--9
This takes all numbers and increments by the shift regardless of the type of separating characters.
import re
shift = 2
numStr = "---1----9---15---"
print("Input: " + numStr)
resStr = ""
m = re.search("[0-9]+", numStr)
while (m):
    resStr += numStr[:m.start(0)]
    resStr += str(int(m.group(0)) + shift)
    numStr = numStr[m.end(0):]
    m = re.search("[0-9]+", numStr)
resStr += numStr
print("Result:" + resStr)
Hi, you can use this to add a - between the lines of a text file:
rt = ''
f = open('a.txt','r')
app = f.readlines()
for i in app:
    rt+=str(i)+'-'
print(" ".join(rt.split()))
import re

c = 2 # in this example, the increment constant value is 2
with open('<your file path here>', 'r+') as file:
    new_content = re.sub(r'\d+', lambda m: str(int(m.group(0)) + c), file.read())
    file.seek(0)
    file.write(new_content)
    file.truncate()  # drop leftover bytes if the new content is shorter (e.g. negative c)
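As a quick check of the regex approach, here is a small sketch applied to the sample strings from the question, without touching any file. The helper name shift_frets is made up for this illustration.

```python
import re


def shift_frets(line, shift):
    """Add `shift` to every whole number in `line`, leaving other characters alone."""
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) + shift), line)


# The two-line sample from the question.
tab = "-2--3--5---7--1\n-6---3--5-1---5"
shifted = "\n".join(shift_frets(line, 2) for line in tab.split("\n"))
print(shifted)

# Multi-digit numbers are shifted as a whole, not digit by digit.
print(shift_frets("---15---", 2))
```

One thing to keep in mind: when a fret number gains a digit (for example 9 becoming 11), the line gets one character longer, so the columns of a tab may no longer line up.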
I have a file that has a sequence on line 2 and a variable called tokenizer, which gives me an old position value. I am trying to find the new position. For example, tokenizer for this line gives me position 12, which is E when counting letters only up to 12. So I need to figure out the new position by also counting the dashes.
---------------LL---NE--HVKTHTEEK---PF-ICTVCR-KS----------
This is what I have so far; it still doesn't work.
with open(filename) as f:
    countletter = 0
    countdash = 0
    for line, line2 in itertools.izip_longest(f, f, fillvalue=''):
        tokenizer=line.split()[4]
        print tokenizer
        for i,character in enumerate(line2):
            for countletter <= tokenizer:
                if character != '-':
                    countletter += 1
                if character == '-':
                    countdash +=1
My new position should be 32 for this example.
First answer, edited by Chad D to make it 1-indexed (but incorrect):
def get_new_index(string, char_index):
    chars = 0
    for i, char in enumerate(string):
        if char != '-':
            chars += 1
        if char_index == chars:
            return i+1
Rewritten version:
import re

def get(st, char_index):
    chars = -1
    for i, char in enumerate(st):
        if char != '-':
            chars += 1
            if char_index == chars:
                return i

def test():
    st = '---------------LL---NE--HVKTHTEEK---PF-ICTVCR-KS----------'
    initial = re.sub('-', '', st)
    for i, char in enumerate(initial):
        # 0-indexed check, so use get() rather than get_1_indexed()
        print i, char, st[get(st, i)]

def get_1_indexed(st, char_index):
    return 1 + get(st, char_index - 1)

def test_1_indexed():
    st = '---------------LL---NE--HVKTHTEEK---PF-ICTVCR-KS----------'
    initial = re.sub('-', '', st)
    for i, char in enumerate(initial):
        print i+1, char, st[get_1_indexed(st, i + 1) - 1]
my original text looks like this and the position i was interested in was 12 which is 'E'
Actually, it's K, assuming you're using zero indexed strings. Python uses zero indexing so unless you're jumping through hoops to 1-index things (and you're not) it will give you K. If you were running into issues, try addressing this.
Here's some code for you that does what you need it to (albeit with 0-indexing, not 1-indexing). This can be found online here:
def get_new_index(oldindex, s):
    newindex = 0
    for c in s:
        if c != '-':
            if oldindex == 0:
                return newindex
            oldindex -= 1
        newindex += 1
    raise IndexError("oldindex is past the last letter")  # index not found
This is a roundabout way to get the second line; it would be clearer to use islice or next(f):
for line, line2 in itertools.izip_longest(f, f, fillvalue=''):
Here countletter seems to be an int while tokenizer is a str. That is probably not what you expect.
for countletter <= tokenizer:
It's also a syntax error, so I think this isn't the code you are running.
Perhaps you should have
tokenizer = int(line.split()[4])
to make tokenizer into an int.
print tokenizer can be misleading because an int and a str look identical when printed, so you see what you expect to see. Try print repr(tokenizer) instead when you are debugging.
Once you have made sure tokenizer is an int, you can change this line to:
for i,character in enumerate(line2[:tokenizer]):
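Another way to think about the mapping, sketched here under the assumption that both the old and the new position are 1-indexed as in the question: collect the aligned position of every letter once, then look the old position up in that list. The name letter_positions is hypothetical.

```python
def letter_positions(aligned):
    """Return the 1-based position in `aligned` of every non-dash character."""
    return [i + 1 for i, ch in enumerate(aligned) if ch != "-"]


seq = "---------------LL---NE--HVKTHTEEK---PF-ICTVCR-KS----------"
positions = letter_positions(seq)

old_position = 12                            # 12th letter, counting letters only
new_position = positions[old_position - 1]   # its position once dashes are counted
print(new_position, seq[new_position - 1])
```

Building the list once is convenient when several tokenizer values have to be converted for the same sequence line.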
How can you get the nth line of a string in Python 3?
For example
getline("line1\nline2\nline3",3)
Is there any way to do this using stdlib/builtin functions?
I prefer a solution in Python 3, but Python 2 is also fine.
Try the following:
s = "line1\nline2\nline3"
print(s.splitlines()[2])
A functional approach:
>>> import io
>>> from itertools import islice
>>> s = "line1\nline2\nline3"
>>> gen = io.StringIO(s)
>>> print(next(islice(gen, 2, 3)))
line3
`my_string.strip().split("\n")[-1]`
Use a string buffer:
import io
def getLine(data, line_no):
    buffer = io.StringIO(data)
    for i in range(line_no - 1):
        try:
            next(buffer)
        except StopIteration:
            return '' #Reached EOF
    try:
        return next(buffer)
    except StopIteration:
        return '' #Reached EOF
A more efficient solution than splitting the string would be to iterate over its characters, finding the positions of the Nth and the (N - 1)th occurrence of '\n' (taking into account the edge case at the start of the string). The Nth line is the substring between those positions.
Here's a messy piece of code to demonstrate it (line number is 1 indexed):
def getLine(data, line_no):
    n = 0
    lastPos = -1
    # Scan every character, including the last one.
    for i in range(len(data)):
        if data[i] == "\n":
            n = n + 1
            if n == line_no:
                return data[lastPos + 1:i]
            else:
                lastPos = i
    if n == line_no - 1:
        return data[lastPos + 1:]
    return "" # end of string
This is also more efficient than the solution which builds up the string one character at a time.
From the comments it seems as if this string is very large.
If there is too much data to comfortably fit into memory one approach is to process the data from the file line-by-line with this:
N = ...
with open('data.txt') as inf:
    for count, line in enumerate(inf, 1):
        if count == N: #search for the N'th line
            print(line)
            break  # no need to read the rest of the file
Using enumerate() gives you the index and the value of the object you are iterating over, and you can specify a starting value, so I used 1 (instead of the default value of 0).
The advantage of using with is that it automatically closes the file for you when you are done or if you encounter an exception.
Since you brought up the point of memory efficiency, is this any better:
s = "line1\nline2\nline3"
# number of the line you want
line_number = 2
i = 0
line = ''
for c in s:
    if i >= line_number:  # already past the wanted line, stop scanning
        break
    else:
        if i == line_number-1 and c != '\n':
            line += c
        elif c == '\n':
            i += 1
Written as two functions for readability:
string = "foo\nbar\nbaz\nfubar\nsnafu\n"

def iterlines(string):
    word = ""
    for letter in string:
        if letter == '\n':
            yield word
            word = ""
            continue
        word += letter
    if word:  # last line without a trailing newline
        yield word

def getline(string, line_number):
    for index, word in enumerate(iterlines(string),1):
        if index == line_number:
            #print(word)
            return word

print(getline(string, 4))
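My solution (efficient and compact):
def getLine(data, line_no):
    index = -1
    # line_no is 1-indexed: skip the newlines that end the preceding lines.
    for _ in range(line_no - 1):
        index = data.index('\n', index + 1)
    end = data.find('\n', index + 1)  # -1 if the last line has no trailing newline
    return data[index + 1:] if end == -1 else data[index + 1:end]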
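For comparison, a short sketch that stays with built-in string methods but limits the work with the maxsplit argument of str.split, so splitting stops once the requested line has been isolated. The name get_line and the choice to return an empty string for out-of-range line numbers are assumptions for this example; line numbers are 1-indexed as in the question.

```python
def get_line(data, line_no):
    """Return line `line_no` (1-indexed) of `data`, or '' if it does not exist."""
    # At most `line_no` splits: the wanted line becomes its own piece,
    # and everything after it stays together in one final piece.
    pieces = data.split("\n", line_no)
    if 1 <= line_no <= len(pieces):
        return pieces[line_no - 1]
    return ""


text = "line1\nline2\nline3"
print(get_line(text, 3))
```

The final piece still copies the remainder of the string, so for very large inputs a find-based approach avoids that extra copy.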