Iterate File Saving Blocks and Skipping Lines - python

I have data in blocks with non-data lines between the blocks. This code has been working but is not robust. How do I extract blocks and skip non-data blocks without consuming a line in the index test? I'm looking for a straight python solution without loading packages.
I've searched for a relevant example and I'm happy to delete this question if the answer exists.
from __future__ import print_function
BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2
block = 0
with open('array1.dat', 'rb') as f:
    for i in range (2):
        block += 1
        for index, line in enumerate(f):
            if index == BLOCK_DATA_ROWS:
                break
            print(block, 'index', index, 'line', line.rstrip('\r\n'))
        for index, line in enumerate(f):
            if index == SKIP_ROWS:
                break
            print(' skip index', index, 'line', line.rstrip('\r\n'))
Input
1
2
3
4
5
6
7
8
9
Output
1 index 0 line 1
1 index 1 line 2
1 index 2 line 3
skip index 0 line 5
skip index 1 line 6
2 index 0 line 8
2 index 1 line 9
Edit
I also want to use a similar iteration approach with an excel sheet:
for row in ws.iter_rows()

In the posted code, line 4 is read by the first inner loop before the test index == BLOCK_DATA_ROWS succeeds and the loop breaks. Because f is an iterator, the second loop resumes with the next unread line, so line 4 has already been consumed by loop 1 (it is never printed, but it was fetched). The same thing happens to line 7 at the end of the skip loop.
This has to be taken into account in the code. One option is to combine both conditions in a single loop:
from __future__ import print_function
BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2
block = 1
with open('array1.dat', 'r') as f:
    index = 0
    for line in f:
        if index < BLOCK_DATA_ROWS:
            print(block, 'index', index, 'line', line.rstrip('\r\n'))
        elif index < BLOCK_DATA_ROWS+SKIP_ROWS:
            print(' skip index', index, 'line', line.rstrip('\r\n'))
        index += 1
        if index == BLOCK_DATA_ROWS+SKIP_ROWS: # IF!!, not elif
            index = 0
            block += 1
The for i in range(2) loop has also been removed, so the code now works for any number of blocks, not just 2.
For an input file with ten lines (1 through 10), it prints:
1 index 0 line 1
1 index 1 line 2
1 index 2 line 3
skip index 3 line 4
skip index 4 line 5
2 index 0 line 6
2 index 1 line 7
2 index 2 line 8
skip index 3 line 9
skip index 4 line 10
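Another way to avoid losing a line at each block boundary is to ask the iterator for exactly the number of lines wanted, instead of breaking after one line too many has already been fetched. The sketch below does this with itertools.islice from the standard library. The function name read_blocks and the in-memory sample standing in for array1.dat are assumptions made for illustration, not part of the original code.

```python
from itertools import islice
import io

BLOCK_DATA_ROWS = 3
SKIP_ROWS = 2

def read_blocks(lines, data_rows=BLOCK_DATA_ROWS, skip_rows=SKIP_ROWS):
    """Yield lists of data lines, discarding skip_rows lines after each block."""
    lines = iter(lines)
    while True:
        # islice pulls at most data_rows items and never fetches an extra one
        block = [line.rstrip('\r\n') for line in islice(lines, data_rows)]
        if not block:
            return
        yield block
        # consume (and ignore) the non-data lines that follow the block
        for _ in islice(lines, skip_rows):
            pass

# in-memory stand-in for the nine-line input file shown in the question
sample = io.StringIO("1\n2\n3\n4\n5\n6\n7\n8\n9\n")
blocks = list(read_blocks(sample))
for number, block in enumerate(blocks, 1):
    print(number, block)
```

With a real file, the object returned by open() can be passed in place of sample.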

Related

Replacing values in pandas dataframe using nested loop based on conditions

I want to iterate through the dataframe df and, whenever the current row value df.iloc[i,0] is 0, replace the next 3 values (which are 1) with 0. After replacing the values, the dataframe iteration should skip the newly written values and continue from the next index, which in the following example is index 7.
If only two values remain after a 0 at the end of the dataframe and they are 1, they should be replaced by 0 as well. Replacing just two values happens only when they are the last values; in the example this is the case for the values at index 9 and 10.
original DataFrame:
index column 1
0 1
1 1
2 1
3 0
4 1
5 1
6 1
7 1
8 0
9 1
10 1
the new DataFrame what I want to have should look as follows:
index column 1
0 1
1 1
2 1
3 0
4 **0** --> new value
5 **0** --> new value
6 **0** --> new value
7 1
8 0
9 **0** --> new value
10 **0** --> new value
I wrote this code, but it does not work.
for i in range(len(df)):
    print(df.iloc[i,0])
    if df.iloc[i,0]== 0 :
        j= i + 1
        while j <= i + 3:
            df.iloc[j,1]= 0
            j= j+ 1
        i = i + 4 #this is used to skip the new values and start at the next first index
    if (len(df)- i < 2) and (df.iloc[i,0]== 0): #replacing the two last values by 0 if the previous value is 0.
        j= i + 1
        while j <= len(df)
            df.iloc[j,1]= 0
There are many issues you could improve and change in your code.
First, a for i in range(len(df)): loop is usually not the idiomatic way to walk over a pandas column. (Note that df.size is not a substitute for len(df): df.size is the number of rows times the number of columns, so the two agree only for a single-column DataFrame; len(df) or df.shape[0] gives the row count.) In Python you loop like this:
for i, colmn_value in enumerate(df[colmn_name]):
if you definitely need the index (in most cases, including the one in your question, you don't), or with
for colmn_value in df[colmn_name]:
At the bottom I have provided your corrected code, which now works.
The issues I have fixed to make your code run are explained in the code so check them out. These issues were only usual 'traps' a beginner runs into learning how to code. The main idea was the right one.
You seem to already have programming experience in another language like C or C++, but don't expect a Python for i in range(N): loop to behave like a C loop, where the index is a variable that is incremented on each iteration and that you can change inside the loop to skip indices. You can't do that in a Python for loop that gets its values from range(), enumerate() or another iterable: whatever you assign to i is overwritten by the next value from the iterable. If you want to change the index within the loop, use a Python while loop.
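To illustrate the while-loop advice, here is a minimal sketch in which the index really does jump past the replaced values. To stay free of dependencies it works on a plain list standing in for the DataFrame column; the function name zero_after_zero and its run parameter are hypothetical names chosen for this example.

```python
def zero_after_zero(values, run=3):
    """Return a copy in which up to `run` items after each 0 become 0."""
    result = list(values)
    i = 0
    while i < len(result):
        if values[i] == 0:
            # overwrite the next `run` items, stopping at the end of the list
            stop = min(i + 1 + run, len(result))
            for j in range(i + 1, stop):
                result[j] = 0
            i = stop  # jump past the items just replaced
        else:
            i += 1
    return result

column = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(zero_after_zero(column))
# prints [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

Because the index is advanced to stop, the freshly written zeros are never inspected, so they cannot trigger a further run of replacements.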
The code below performs the task in two versions (a longer one with a plain Python loop, and another using pandas' .apply()). Both use the 'trick' of counting down from 3 to 0 once a zero value has been detected in the column, and replace a value only while the countdown is non-zero (if countdown:).
Change VERBOSE to False to switch off the printed lines that show how the code works under the hood. As it is Python, the code mostly explains itself.
VERBOSE = True
if VERBOSE: new_colmn_value = "**0**"
else: new_colmn_value = 0
new_colmn = []
countdown = 0
for df_colmn_val in df.iloc[:,0]: # i.e. "column 1"
    new_colmn.append(new_colmn_value if countdown else df_colmn_val)
    if VERBOSE:
        print(f'{df_colmn_val=}, {countdown=}, new_colmn={new_colmn_value if countdown else df_colmn_val}')
    if df_colmn_val == 0 and not countdown:
        countdown = 4
    if countdown: countdown -= 1
df.iloc[:,[0]] = new_colmn # same as df['column 1'] = new_colmn
print(df)
gives:
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=0, countdown=0, new_colmn=0
df_colmn_val=1, countdown=3, new_colmn=**0**
df_colmn_val=1, countdown=2, new_colmn=**0**
df_colmn_val=1, countdown=1, new_colmn=**0**
df_colmn_val=1, countdown=0, new_colmn=1
df_colmn_val=0, countdown=0, new_colmn=0
df_colmn_val=1, countdown=3, new_colmn=**0**
df_colmn_val=1, countdown=2, new_colmn=**0**
column 1
index
0 1
1 1
2 1
3 0
4 **0**
5 **0**
6 **0**
7 1
8 0
9 **0**
10 **0**
And now the same thing using pandas' .apply():
ct = 0; nv ='*0*'
def ctF(row):
    global ct # the countdown counter
    r0 = row.iloc[0] # column 0 value in the row of the dataframe
    new = nv if ct else r0 # pick new or old value depending on counter
    if ct: ct -= 1 # decrease the counter if not yet zero
    elif r0 == 0: ct = 3 # start the countdown if there is a zero in the row
    return new
df[df.columns[0]] = df.apply(ctF, axis=1) # axis=1: work on rows (and not on columns)
print(df)
The code above uses the pandas .apply() method, which passes each row of the DataFrame to the ctF function; ctF returns the value that should end up in that row. With axis=1 the function is still called once per row from Python, so do not expect it to be faster than the explicit loop on large DataFrames. A global variable in ctF makes sure that the next call knows the countdown value set in the previous call. .apply() returns a Series of the returned values, which is assigned back to the first column here (it could equally be added as a new column). The function returns the value rather than writing into row, because pandas does not support functions that mutate the object passed to them by .apply().
Below is your own code, which I have fixed so that it runs and does what it was written for:
for i in range(len(df)):
    print(df.iloc[i,0])
    if df.iloc[i,0]== 0 :
        j= i + 1
        while ( j <= i + 3 ) and j < len(df): # handles table end !!!
            print(f'{i=} {j=}')
            df.iloc[j, 0] = '**0**' # first column has index 0 !!!
            j= j+ 1
        # i = i + 4 # this is used to skip the new values and start at the next first index
        # !!!### changing i in the loop will NOT do what you expect it to do !!!
        # the next i will be just i+1 getting its value from range() and NOT i+4
this_is_not_necessary_as_it_is_handled_already_above = """
    if (len(df)- i < 2) and (df.iloc[i,0]== 0): #replacing the two last values by 0 if the previous value is 0.
        j= i + 1
        while j <= len(df):
            df.iloc[j,1]= 0
"""
printing:
1
1
1
0
i=3 j=4
i=3 j=5
i=3 j=6
**0**
**0**
**0**
1
0
i=8 j=9
i=8 j=10
**0**
**0**
column 1
index
0 1
1 1
2 1
3 0
4 **0**
5 **0**
6 **0**
7 1
8 0
9 **0**
10 **0**

Group by a range of numbers Python

I have a list of numbers in a Python data frame and want to group these numbers by a specific range and count them. The numbers range from 0 to 20, but let's say there might not be any number 6; in that case I want the count to show 0.
dataframe column looks like
|points|
5
1
7
3
2
2
1
18
15
4
5
I want it to look like the below
range | count
1 2
2 2
3 1
4 1
5 2
6 0
7 ...
8
9...
I would iterate through the input lines and fill up a dict with the values.
All you have to do then is count...
import collections
#read your input and store the numbers in a list
lines = []
with open('input.txt') as f:
    lines = [int(line.rstrip()) for line in f]
#pre fill the dictionary with 0s from 0 to the highest occurring number in your input.
values = {}
for i in range(max(lines)+1):
    values[i] = 0
# increment the occurrence by 1 for any found value
for val in lines:
    values[val] += 1
# Order the dict:
values = collections.OrderedDict(sorted(values.items()))
print("range\t|\tcount")
for k in values:
    print(str(k) + "\t\t\t" + str(values[k]))
repl: https://repl.it/repls/DesertedDeafeningCgibin
Edit:
a slightly more elegant version using dict comprehension:
# read input as in the first example
values = {i : 0 for i in range(max(lines)+1)}
for val in lines:
    values[val] += 1
# order and print as in the first example
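The counting step can also be sketched with collections.Counter from the standard library, which reports 0 for values that never occur, so no pre-filling is needed. The list points below copies the sample column from the question, and the fixed range 0 to 20 follows the question's description; both are assumptions in the sense that real data would come from the file or data frame.

```python
from collections import Counter

points = [5, 1, 7, 3, 2, 2, 1, 18, 15, 4, 5]  # the sample column from the question
counts = Counter(points)

# A Counter returns 0 for keys it has never seen, so gaps such as 6 show up as 0.
table = [(value, counts[value]) for value in range(0, 21)]

print("range\t|\tcount")
for value, count in table:
    print(f"{value}\t|\t{count}")
```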

Taking the specific column for each line in a txt file python

I have two txt files.
The first one contains one number per line, like this:
22
15
32
53
.
.
and the other file contains 20 consecutive numbers on each line, like this:
0.1 2.3 4.5 .... 5.4
3.2 77.4 2.1 .... 8.1
....
.
.
I want to split the second file according to the numbers in the first one. For example, the first line of the first file is 22, which means I take all 20 columns of the first line plus the first two columns of the second line, and discard the remaining columns of the second line. Then I look at the second line of the first file (15), which means I take 15 columns from the third line of the other file and discard the rest of that line, and so on. How can I do this?
with open ('numbers.txt', 'r') as f:
    with open ('contiuousNumbers.txt', 'r') as f2:
        with open ('results.txt', 'w') as fOut:
            for line in f:
                ...
Thanks.
Iterate through the first file and treat the number on each line as a target total of numbers to read. Then use a while loop that keeps calling next on the second file object to read a line of numbers, subtracting the count of numbers on that line from the total until the total reaches 0 or below. Slice each line's numbers to the smaller of the remaining total and the line's length, so that only the requested amount is output:
for line in f:
    output = []
    total = int(line)
    while total > 0:
        try:
            items = next(f2).split()
            output.extend(items[:min(total, len(items))])
            total -= len(items)
        except StopIteration:
            break
    fOut.write(' '.join(output) + '\n')
so that given the first file with:
3
6
1
5
and the second file with:
2 5
3 7
2 1
3 6
7 3
2 2
9 1
3 4
8 7
1 2
3 8
the output file will have:
2 5 3
2 1 3 6 7 3
2
9 1 3 4 8
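The example above can be reproduced without touching the file system. The following sketch wraps the same loop in a function and feeds it in-memory files; the name take_columns and the use of io.StringIO are assumptions made for illustration. It also relies on two small simplifications: next() with a default value replaces the try/except, and a plain slice replaces the min() call, since slicing past the end of a list is harmless.

```python
import io

def take_columns(counts, rows):
    """For each count, gather that many numbers from successive rows."""
    rows = iter(rows)
    results = []
    for line in counts:
        total = int(line)
        output = []
        while total > 0:
            items = next(rows, None)      # None signals that rows ran out
            if items is None:
                break
            items = items.split()
            output.extend(items[:total])  # leftovers on this row are dropped
            total -= len(items)
        results.append(' '.join(output))
    return results

first = io.StringIO("3\n6\n1\n5\n")
second = io.StringIO("2 5\n3 7\n2 1\n3 6\n7 3\n2 2\n9 1\n3 4\n8 7\n1 2\n3 8\n")
result = take_columns(first, second)
print(result)
# prints ['2 5 3', '2 1 3 6 7 3', '2', '9 1 3 4 8']
```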

Enumeration function Argument

Python's enumerate function takes an optional argument, start.
What is the use of this argument?
If I write some code using this argument, it only shifts the index, e.g.
>>> a=[2,3,4,5,6,7]
>>> for index,value in enumerate(a,start=2):
...     print index,value
...
**2 2**
3 3
4 4
5 5
6 6
7 7
So the index starts at 2, but the values do not: they still start from the first element. Why is this so? Instead of this behaviour, it might be better if the values started from that index rather than from the first element.
What was the thinking behind this implementation?
enumerate() associates a sequence of integers with an iterable, i.e. it enumerates the items of the sequence. Its argument start is not meant to affect the starting position within the iterable, just the initial value from which to start counting.
One use is to start the enumeration from 1 instead of 0, e.g. if you wanted to number the lines in a file:
with open('file.txt') as f:
    for line_number, line in enumerate(f, 1):
        print(line_number, line)
This outputs the line numbers starting from 1, which is where most users would expect line numbering to begin.
If you want to skip the first n items in a sequence you can just slice it:
a = [2,3,4,5,6,7]
n = 2
for index, value in enumerate(a[n:]):
    print(index, value)
outputs
0 4
1 5
2 6
3 7
but you might like to start the enumeration from 3 as well:
a = [2,3,4,5,6,7]
n = 2
for index, value in enumerate(a[n:], n+1):
    print(index, value)
which would output
3 4
4 5
5 6
6 7
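Slicing with a[n:] only works for sequences such as lists. For an arbitrary iterable (a file, a generator), a similar effect can be sketched with itertools.islice, as below. The helper name enumerate_from is hypothetical, and here the count starts at n so that each index matches the item's zero-based position in the original iterable.

```python
from itertools import islice

def enumerate_from(iterable, n):
    """Skip the first n items, then number the rest starting at n."""
    return enumerate(islice(iterable, n, None), start=n)

a = [2, 3, 4, 5, 6, 7]
pairs = list(enumerate_from(iter(a), 2))  # iter(a) mimics a non-sliceable source
print(pairs)
# prints [(2, 4), (3, 5), (4, 6), (5, 7)]
```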
Some people - most I suppose - aren't used to an enumerated list starting at zero. At the very least, this makes it easy to format the output so the enumeration starts at one, e.g.
a = ['Fred', 'Ted', 'Bob', 'Alice']
for index, value in enumerate(a, start=1):
    print(index, value)
will print out:
1 Fred
2 Ted
3 Bob
4 Alice
Without the start=1 parameter, enumerate(a) would print
0 Fred
1 Ted
2 Bob
3 Alice
https://docs.python.org/2/library/functions.html#enumerate

Read specific number of lines in python

I have a big data text file, for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
You do not need a loop to accomplish your purpose. Just use the index function on the list to get the index of the two lines and take all the lines between them.
Note that I changed your file.readlines() to strip trailing newlines.
(Using file.read().splitlines() can fail, if read() ends in the middle of a line of data.)
file1 = open("data.txt","r")
file2=open("newdata.txt","w")
lines = [ line.rstrip() for line in file1.readlines() ]
firstIndex = lines.index("#02textline2")
secondIndex = lines.index("#03textline3")
print(firstIndex, secondIndex)
file2.write("\n".join(lines[firstIndex + 1 : secondIndex]))
file1.close()
file2.close()
There is a newline character at the end of every line, so this:
if line == "#03textline3":
will never be true, as the line is actually "#03textline3\n". Why didn't you use the same syntax as the one you used for "#02textline2"? It would have worked:
if "#03textline3" in line: # Or ' line == "#03textline3\n" '
    break
Besides, you have to correct your indentation for the always_print = True line.
Here's what I would suggest doing:
firstKey = "#02textline2"
secondKey = "#03textline3"
with open("data.txt","r") as fread:
    for line in fread:
        if line.rstrip() == firstKey:
            break
    with open("newdata.txt","w") as fwrite:
        for line in fread:
            if line.rstrip() == secondKey:
                break
            else:
                fwrite.write(line)
This approach takes advantage of the fact that Python file objects are iterators. The first for loop iterates through the file iterator fread until the first key is found. The loop breaks, but the iterator stays at its current position. When it is picked back up, the second loop starts where the first left off. We then write the wanted lines directly to a new file and discard the rest.
Advantages:
This does not load the entire file into memory: lines are read one at a time, only the lines between firstKey and secondKey are written out, and nothing after secondKey is ever read by the script
No line is examined or processed more than once
The with context manager is a safer way to work with files, since it closes them even if an error occurs
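The two-loop idea can be tried out without any files. The sketch below packages it as a generator and runs it on an in-memory sample; the function name lines_between and the sample contents are made up for illustration and are not taken from the original data.

```python
import io

def lines_between(lines, first_key, second_key):
    """Yield the lines found after first_key and before second_key."""
    lines = iter(lines)
    for line in lines:            # first pass: advance to the opening marker
        if line.rstrip() == first_key:
            break
    for line in lines:            # second pass resumes right after the marker
        if line.rstrip() == second_key:
            return
        yield line

# made-up sample in the same shape as the data file in the question
sample = io.StringIO(
    "#01textline1\n1 2 3\n#02textline2\n4 5 6\n7 8 9\n#03textline3\n1 1 1\n"
)
wanted = list(lines_between(sample, "#02textline2", "#03textline3"))
print(wanted)
# prints ['4 5 6\n', '7 8 9\n']
```

Both loops share one iterator, which is what lets the second loop continue where the first stopped.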
