Count the number of lines after each pattern - python

I have a file in which some lines contain a particular pattern. The number of lines after each pattern differs, and I want to count the lines that follow each pattern.
<pattern>
line 1
line 2
line 3
<pattern>
line 1
line 2
etc
my code:
for line in fp:
    c = 0
    if line.startswith("<"):
        header = line.split(" ")
    else:
        c = c+1
The code I have captures the pattern as well as the lines, but I don't know how to stop before the next pattern and start another count after the pattern.

Just save c into a list and reset it to 0 each time you hit a pattern.
a is the list of counts and l is its length:
a = []
l = 0
c = 0
for line in fp:
    if line.startswith("<"):
        header = line.split(" ")
        if l > 0:
            a.append(c)   # save the count gathered since the previous pattern
        c = 0
        l = l + 1
    else:
        c = c + 1
a.append(c)               # save the count for the last pattern
To read the values you can read the list from 0 to l:
for i in range(0, l):
    print "c%d is %d" % (i, a[i])

Python: How to read space delimited data with different length in text file and parse it

I have space delimited data in a text file that looks like the following:
0 1 2 3
1 2 3
3 4 5 6
1 3 5
1
2 3 5
3 5
Each line has a different length.
I need to read it starting from line 2 ('1 2 3'),
parse it, and get the following information:
Number of unique data = (1,2,3,4,5,6)=6
Count of each data:
count data (1)=3
count data (2)=2
count data (3)=5
count data (4)=1
count data (5)=4
count data (6)=1
Number of lines=6
Sort the data in descending order:
data (3)
data (5)
data (1)
data (2)
data (4)
data (6)
I did this:
import csv

file = open('data.txt')
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
After this step, what should I do to get the expected results?
I would do something like this:
from collections import Counter

with open('data.txt', 'r') as file:
    lines = file.readlines()
lines = lines[1:]  # skip first line
data = []
for line in lines:
    data += line.strip().split(" ")
counter = Counter(data)
print(f'unique data: {list(counter.keys())}')
print(f'count data: {list(sorted(counter.most_common(), key=lambda x: x[0]))}')
print(f'number of lines: {len(lines)}')
print(f'sort data: {[x[0] for x in counter.most_common()]}')
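If you prefer numeric keys (so comparing and sorting work on the numbers rather than on their string form), here is a small variation on the above, still assuming the same data.txt:
from collections import Counter

with open('data.txt') as file:
    lines = file.readlines()[1:]          # skip the first line, as above

# convert each token to int so the keys are numbers, not strings
counter = Counter(int(tok) for line in lines for tok in line.split())
print(f'unique data: {sorted(counter)}')
print(f'count data: {sorted(counter.items())}')
print(f'number of lines: {len(lines)}')
print(f'sorted by count: {[k for k, _ in counter.most_common()]}')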
A simple brute force approach:
nums = []
counts = {}
for row in open('data.txt'):
    if row[0] == '0':   # the first line starts with 0, so skip it
        continue
    nums.extend([int(k) for k in row.rstrip().split()])
print(nums)
for n in nums:
    if n not in counts:
        counts[n] = 1
    else:
        counts[n] += 1
print(counts)
ordering = list(sorted(counts.items(), key=lambda k: -k[1]))
print(ordering)
Here is another approach
def getData(infile):
    """ Read file lines and return lines 1 thru end """
    lnes = []
    with open(infile, 'r') as data:
        lnes = data.readlines()
    return lnes[1:]

def parseData(ld):
    """ Parse data and print desired results """
    unique_symbols = set()
    all_symbols = dict()
    for l in ld:
        symbols = l.strip().split()
        for s in symbols:
            unique_symbols.add(s)
            cnt = all_symbols.pop(s, 0)
            cnt += 1
            all_symbols[s] = cnt
    print(f'Number of Unique Symbols = {len(unique_symbols)}')
    print(f'Number of Lines Processed = {len(ld)}')
    for symb in unique_symbols:
        print(f'Number of {symb} = {all_symbols[symb]}')
    print(f"Descending Sort of Symbols = {', '.join(sorted(list(unique_symbols), reverse=True))}")
On executing:
infile = r'spaced_text.txt'
parseData(getData(infile))
Produces:
Number of Unique Symbols = 6
Number of Lines Processed = 6
Number of 2 = 2
Number of 5 = 4
Number of 3 = 5
Number of 1 = 3
Number of 6 = 1
Number of 4 = 1
Descending Sort of Symbols = 6, 5, 4, 3, 2, 1
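Note that the last line sorts the symbol names themselves in descending order; if, as in the question, the data should instead be ordered by how often it occurs, the same all_symbols dictionary can be sorted by its counts, e.g. with this hypothetical add-on inside parseData:
    # order symbols by frequency, most common first
    by_count = sorted(all_symbols.items(), key=lambda kv: kv[1], reverse=True)
    print(f"Descending sort by count = {', '.join(s for s, _ in by_count)}")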

index() function doesn't work, neither does the longer one to find the number of the line containing the max/min number

My code doesn't do what it's supposed to do - finding max/min and printing which line contains each of those values.
It does find the max/min, but it doesn't print the expected line. Here is my code:
eqlCounter = 0
octals = []
with open("D:\matura\Matura2017\Dane_PR2\liczby.txt", "r") as f:
    for x in f:
        lines = f.readline()
        splited = lines.split()
        toInt = int(splited[1], 8)  # oct to int(dec)
        octals.append(toInt)
        if int(splited[0]) == toInt:
            eqlCounter += 1
low = min(octals)
maxx = max(octals)
print("same nmbrs: ", eqlCounter)  # a
print("min: ", min(octals), "at: ", octals.index(low))
print("max: ", max(octals), "at: ", octals.index(maxx))
Each line contains a decimal number (1st column) and an octal number (2nd column). My code finds the smallest and the biggest octal numbers and then prints them out as decimals. It works fine until it comes to displaying the lines that contain those values.
40829 134773
28592 31652
15105 123071
18227 36440
51074 122407
23893 117256
30785 100453
39396 11072
50492 105177
36134 32555
OUTPUT:
same nmbrs: 0
min: 4666 at: 3
max: 40622 at: 2
The values themselves look right, but the minimum is not in the 3rd line: 8 is supposed to be the correct output, since that is the line that contains that exact value.
Here is the correct version of your code. The issue is with the way you iterate over lines of the file. Also you need +1 if you want to see row 8 instead of row 7.
eqlCounter = 0
octals = []
with open("D:\liczby.txt", "r") as f:
    for line in f.readlines():
        splited = line.split()
        toInt = int(splited[1], 8)  # oct to int(dec)
        octals.append(toInt)
        if int(splited[0]) == toInt:
            eqlCounter += 1
        # print(splited[0], splited[1], toInt)
low = min(octals)
maxx = max(octals)
print("same nmbrs: ", eqlCounter)  # a
print("min: ", min(octals), "at: ", octals.index(low)+1)
print("max: ", max(octals), "at: ", octals.index(maxx)+1)
result:
same nmbrs: 0
min: 4666 at: 8
max: 47611 at: 1
When executing your code under Python 2 you get a runtime error:
Traceback (most recent call last):
  File "app.py", line 5, in <module>
    lines = f.readline()
ValueError: Mixing iteration and read methods would lose data
This is because you are iterating over the lines of the file with a for loop while also calling readline() inside it. Python 2 refuses to mix the two and raises the error above; in Python 3 the readline() call simply consumes the next line, which means you skip one line in each iteration.
Here is your code fixed:
eqlCounter = 0
octals = []
with open("D:\matura\Matura2017\Dane_PR2\liczby.txt", "r") as f:
    lines = f.readlines()
    for line in lines:
        splited = line.split()
        toInt = int(splited[1], 8)  # oct to int(dec)
        octals.append(toInt)
        if int(splited[0]) == toInt:
            eqlCounter += 1
low = min(octals)
maxx = max(octals)
print("same nmbrs: ", eqlCounter)  # a
print("min: ", min(octals), "at: ", octals.index(low))
print("max: ", max(octals), "at: ", octals.index(maxx))

How to find N lines containing specific string with offset and reversely?

How do I find N lines containing a specific string, searching backwards from a given offset? With Python on Unix.
Given a file:
a
babc1
c
abc1
abc2
d
e
f
Given the offset: 20 (that's "d"), the string: "abc", N: 2, the output should be:
strings:
# the babc1 will not count since we only need 2
abc1
abc2
offset: (we need to return the offset where the search ends)
10 (the "a" in "abc1")
The above example is just a demo; the real file is a 33 GB log, which is why I need to take an offset as input and return a new offset as output.
I think the core problem is: how do I read lines of a file in reverse starting from a given offset? The offset is near the tail.
I tried to do it with bash and it was agony. Is there an elegant, efficient way to do it in python2? Besides, we will run the script through Ansible, so the dependencies should be as simple as possible.
You can use the following generator function (FileReadBackwards comes from the file_read_backwards package):
from file_read_backwards import FileReadBackwards

def search(filename, file_size, offset, substring, n):
    off = 0
    with FileReadBackwards(filename) as f:
        while off < (file_size - offset):
            line = f.readline()
            off += len(line)
        found = 0
        for line in f:
            off += len(line)
            if substring in line:
                yield line
                found += 1
                if found >= n:
                    yield file_size - off - 1
                    return
Use it like this:
s = "s.txt"
file_size = 25
offset = 20
string = "abc"
n = 2
*found, offset = search(s, file_size, offset, string, n)
print(found, offset)
Prints:
['abc2', 'abc1'] 10
You can use seek to go to the offset in the file as follows:
def reverse_find(string, offset, count):
    with open("FILENAME") as f:
        f.seek(offset)
        results = []
        while offset > 1 and count > 0:
            line = ""
            char = ""
            while char != "\n":      # walk backwards one character at a time until a newline
                offset -= 1
                f.seek(offset)
                char = f.read(1)
                line = char + line
            if string in line:
                results = [line.strip()] + results
                count -= 1
        return results, offset + 1

print(reverse_find("abc", 20, 2))
This will return:
(['abc1', 'abc2'], 10)
Thanks, rassar. But I found the answer here: https://stackoverflow.com/a/23646049/9782619. It is more efficient than Mackerel's and requires fewer dependencies than rassar's.
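For completeness, here is a minimal dependency-free sketch of that kind of approach, reading the file backwards in blocks with seek() (my own outline of the general idea, not the code from the linked answer):
import os

def reverse_lines(path, block_size=4096):
    """Yield the lines of a file from last to first, reading block by block."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""                      # partial line carried between blocks
        while pos > 0:
            size = min(block_size, pos)
            pos -= size
            f.seek(pos)
            block = f.read(size) + tail
            lines = block.split(b"\n")
            tail = lines[0]             # may still be incomplete
            for line in reversed(lines[1:]):
                yield line
        if tail:
            yield tail
Lines come back as bytes, so decode as needed; seeking to a known offset instead of the end of the file before the loop gives the "start from an offset" behaviour asked for above.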

Python: How to increment the count when a variable repeats

I have a txt file which has following entries:
Rx = 34 // Counter gets incremented = 1, since the Rx was found for the first time
Rx = 2
Rx = 10
Tx = 2
Tx = 1
Rx = 3 // Counter gets incremented = 2, since the Rx was found for the first time after Tx
Rx = 41
Rx = 3
Rx = 19
I want to increment the count only for the 'Rx' that gets repeated for the first time, and not for every Rx in the text file. My code is as follows:
import re
f = open("test.txt","r")
count = 0
for lines in f:
    m = re.search("Rx = \d{1,2}", lines)
    if m:
        count += 1
print count
But this is giving me the count of all the Rx's in the txt file. I want the output as 2 and not 7.
Please help me out !
import re
f = open("test.txt","r")
count = 0
for lines in f:
    m = re.search("Rx = \d{1,2}", lines)
    if m:
        count += 1
        if count >= 2:
            break
print(m.group(0))
Break the loop, since you only need to find the repeats.
import re
f = open("test.txt","r")
count = 0
for lines in f:
    m = re.search("Rx = \d{1,2}", lines)
    if m:
        count += 1
        if count >= 2:
            break
print count
With a bare if m:, your code keeps incrementing count for every matching line. If you'd like to only get the first 2, you need to introduce some additional logic.
If you want to count the Rx lines that are repeated (appear more than once):
import re
rx_count = {}
with open("test.txt","r") as f:
    count = 0
    for lines in f:
        if lines.startswith('Rx'):
            rx_count[lines] = rx_count.get(lines, 0) + 1
Now you have a counter dictionary in rx_count. Filter it down to the entries whose value is greater than 1, sum those values together, and print out the count:
rx_count = {k: v for k, v in rx_count.iteritems() if v > 1}
count = sum(rx_count.values())
print count
To do exactly what you want, you're going to need to keep track of which strings you've already seen.
You can do this by using a set to record what you have seen until there is a duplicate, and then only counting occurrences of that string.
This example would do that:
import re
count = 0
matches = set()
with open("test.txt", "r") as f:
    for line in f:
        m = re.search(r"Rx = \d{1,2}", line)
        if not m:
            # Skip the rest if no match
            continue
        if m.group(0) not in matches:
            matches.add(m.group(0))
        else:
            # This is the first string we have seen repeated
            first = m.group(0)
            count = 2   # it has now been seen twice
            break
    for line in f:
        m = re.search(r"Rx = \d{1,2}", line)
        ## This or whatever check you want to do
        if m and m.group(0) == first:
            count += 1
print(count)
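Reading the question as "count the runs of consecutive Rx lines", here is a minimal sketch that prints 2 for the sample data (the file name test.txt is taken from the question):
import re

count = 0
in_rx_run = False
with open("test.txt") as f:
    for line in f:
        if re.match(r"Rx = \d+", line):
            if not in_rx_run:
                count += 1          # first Rx of a new run
            in_rx_run = True
        else:
            in_rx_run = False
print(count)                        # 2 for the sample data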

Python: Using readline() in "for line in file:" Loop

Let's say I have a text file that looks like:
a
b
start_flag
c
d
e
end_flag
f
g
I wish to iterate over this data line by line, but when I encounter a 'start_flag', I want to iterate until I reach an 'end_flag' and count the number of lines in between:
newline = ''
for line in f:
    count = 0
    if 'start_flag' in line:
        while 'end_flag' not in newline:
            count += 1
            newline = f.readline()
        print(str(count))
What is the expected behavior of this code? Will it iterate like:
a
b
start_flag
c
d
e
end_flag
f
g
Or:
a
b
start_flag
c
d
e
end_flag
c
d
e
end_flag
f
g
There shouldn't be any need to use readline(). Try it like this:
with open(path, 'r') as f:
    count = 0
    counting = False
    for line in f:
        if 'start_flag' in line:
            counting = True
        elif 'end_flag' in line:
            counting = False
            # do something with your count result
            count = 0  # reset it for the next start_flag
        elif counting:
            count += 1  # only lines strictly between the flags are counted
This handles it all with the if/elif chain in the correct order, allowing you to just run sequentially through the file in one go. You could obviously add more operations into this and do things with the results, for example appending them to a list if you expect to run into multiple start and end flags (see the sketch below).
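For example, a minimal sketch of that multi-flag case, collecting one count per start_flag/end_flag pair into a list (path as in the code above):
counts = []
count = 0
counting = False
with open(path, 'r') as f:
    for line in f:
        if 'start_flag' in line:
            counting = True
            count = 0
        elif 'end_flag' in line:
            counting = False
            counts.append(count)    # save this block's count
        elif counting:
            count += 1
print(counts)                       # [3] for the sample file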
Use this:
enter = False
count = 0
for line in f:
    if 'start_flag' in line:
        enter = True
    elif 'end_flag' in line:
        print count
        count = 0
        enter = False
    elif enter:
        count += 1  # count only the lines between the flags
