Python: Reading lines from .txt file and calculating with them

I hope you are having pleasant holidays so far!
I am trying to read a .txt file in which values are stored one per line, separated by line breaks, and then do calculations with those values.
I am trying to figure out how to do this with a Python script.
Let's say this is the content of my text file:
0.1 #line(0)
1.0
2.0
0.2 #line(3)
1.1
2.1
0.3 #line(6)
1.2
2.2
...
Basically I would like to implement an operation that calculates:
line(0)*line(1)*line(2) in the first step, writes it into another .txt file, and then continues with line(3)*line(4)*line(5) and so on:
with open('/filename.txt') as file_:
    for line in file_:
        for i in range(0,999,1):
            file = open('/anotherfile.txt')
            file.write(str(line(i)*line(i+1)*line(i+2) + '\n')
            i += 3
Does anyone have an idea how to get this working?
Any tips would be appreciated!
Thanks,
Steve

This would multiply three numbers at a time and write the product of the three into another file:
with open('numbers_in.txt') as fobj_in, open('numbers_out.txt', 'w') as fobj_out:
    while True:
        try:
            numbers = [float(next(fobj_in)) for _ in range(3)]
            product = numbers[0] * numbers[1] * numbers[2]
            fobj_out.write('{}\n'.format(product))
        except StopIteration:
            break
Here next(fobj_in) always tries to read the next line.
If there is no more line a StopIteration exception is raised.
The except StopIteration: catches this exception and terminates the loop.
The list comprehension [float(next(fobj_in)) for _ in range(3)] reads three lines and converts them into floating point numbers.
Now, multiplying the three numbers is a matter of indexing into the list numbers.
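An equivalent grouping trick, shown here only as a supplementary sketch and not part of the answer above: zipping the same file iterator with itself three times yields consecutive triples of lines, so the loop body stays a one-liner.
with open('numbers_in.txt') as fobj_in, open('numbers_out.txt', 'w') as fobj_out:
    # zip of the same iterator pulls three consecutive lines per tuple;
    # a trailing partial group of 1-2 leftover lines is silently dropped
    for a, b, c in zip(fobj_in, fobj_in, fobj_in):
        fobj_out.write('{}\n'.format(float(a) * float(b) * float(c)))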

You can do this:
file = open('/anotherfile.txt', 'w')
i = 0
temp = 1
with open('/filename.txt') as file_:
    for line in file_:
        temp = temp * float(line)   # the sample values are floats (0.1, 1.0, ...), so use float(), not int()
        if i % 3 == 2:              # after every third line, write the running product and reset
            file.write(str(temp) + '\n')
            temp = 1
        i += 1
file.close()

Related

S3 get_object..iter_lines() skipping lines with islice/zip

response = s3.get_object(Bucket=bucket, Key=file)

def generate_files(resp, N):
    while True:
        line = list(islice(resp["Body"].iter_lines(), 0, 10))
        if not line:
            break
        yield line
    return
However, when I call my generator, the result is not as expected.
for line in generate_files(response, 10):
    print('--')
    print([l.decode('utf-8') for l in line])
Instead of going from line 0 to 9, then 10 to 19, then 20 to 29 etc.. It skips ahead an arbitrary number of lines between generator calls. So it is returning lines 0 to 9. Then lines 17 to 26. Then lines 33 to 40 etc.. It's also moving forward. So it seems to be reading the stream even after islice's line call. I've also tried with zip and get the same result.
What am I missing here?
I think the problem is in iter_lines, because iter_lines is not reentrant safe. Calling this method multiple times causes some of the received data to be lost. If you need to call it from multiple places, you should in my opinion use the resulting iterator object instead, e.g.:
import requests

r = requests.get('https://httpbin.org/stream/20', stream=True)
lines = r.iter_lines()
for line in lines:
    print(line)
Sorry, I don't have a setup for this code, so I can't check it myself.
More information is at: https://2.python-requests.org/projects/3/user/advanced/
Please give feedback if you try it.
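Applying the same idea to the generator from the question, a minimal untested sketch would create the line iterator once, outside the loop, and slice it repeatedly (this assumes the StreamingBody returned by get_object yields lines like any other iterator):
from itertools import islice

def generate_files(resp, n):
    lines = resp["Body"].iter_lines()   # create the line iterator exactly once
    while True:
        chunk = list(islice(lines, n))  # consume the next n lines from that same iterator
        if not chunk:
            break
        yield chunk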

How to get approximate line number of large files

I have CSV files with up to 10M+ rows. I am attempting to get the total line count of a file so I can split the processing of each file across multiple processes. To do this, I set a start and end line for each sub-process to handle. This cuts my processing time from 180s to 110s for a 2GB file. However, it requires knowing the line count, and getting the exact count takes ~30 seconds. That time feels wasted: an approximation, where the final worker might have to read an extra hundred thousand lines or so, would only add a couple of seconds, as opposed to the 30 seconds the exact count takes.
How would I go about getting an approximate line count for a file? I would like the estimate to be within 1 million lines (preferably within a couple hundred thousand). Would something like this be possible?
This will be horribly inaccurate, but it takes the size of the first row and divides the file size by it.
import csv
import os

with open("example.csv", newline="") as f:
    reader = csv.reader(f)
    row1 = next(reader)

# approximate bytes per line: the first row rejoined with its delimiters, plus a newline
_Size = len(",".join(row1)) + 1
print("Size of Line 1 >", _Size)
print("Size of File >", os.path.getsize("example.csv"))
print("Approx Lines >", os.path.getsize("example.csv") / _Size)
(Edit) If you change the last line to math.floor(os.path.getsize("example.csv") / _Size) (with import math), it's actually quite accurate.
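A slightly more robust variant of the same idea, offered only as a sketch: average the byte length of the first few thousand lines instead of a single row. The estimate still assumes line lengths are roughly uniform across the file.
import os

def approx_line_count(path, sample_lines=1000):
    # estimate the line count from the average length of the first sample_lines lines
    file_size = os.path.getsize(path)
    bytes_sampled = 0
    lines_sampled = 0
    with open(path, "rb") as f:
        for line in f:
            bytes_sampled += len(line)
            lines_sampled += 1
            if lines_sampled >= sample_lines:
                break
    if bytes_sampled == 0:
        return 0
    return int(file_size * lines_sampled / bytes_sampled)

print(approx_line_count("example.csv"))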
I'd suggest you split the file into chunks of similar size, before even parsing.
The example code below will split data.csv into 4 chunks of approximately equal size, by seeking and searching for the next line break. It'll then call launch_worker() for each chunk, indicating the start offset and length of the data that worker should handle.
Ideally you'd use a subprocess for each worker.
import os

n_workers = 4

# open the log file, and find out how long it is
f = open('data.csv', 'rb')
length_total = f.seek(0, os.SEEK_END)

# split the file evenly among n workers
length_worker = int(length_total / n_workers)

prev_worker_end = 0
for i in range(n_workers):
    # seek to the next worker's approximate start
    file_pos = f.seek(prev_worker_end + length_worker, os.SEEK_SET)
    # see if we tried to seek past the end of the file... the last worker probably will
    if file_pos >= length_total:  # <-- (3)
        # ... if so, this worker's chunk extends to the end of the file
        this_worker_end = length_total
    else:
        # ... otherwise, look for the next line break
        buf = f.read(256)  # <-- (1)
        next_line_end = buf.index(b'\n')  # <-- (2)
        this_worker_end = file_pos + next_line_end
    # calculate how long this worker's chunk is
    this_worker_length = this_worker_end - prev_worker_end
    if this_worker_length > 0:
        # if there is any data in the chunk, then try to launch a worker
        launch_worker(prev_worker_end, this_worker_length)
    # remember where the last worker got to in the file
    prev_worker_end = this_worker_end + 1
Some expansion on the markers in the code:
(1) You'll need to make sure that the read() consumes at least an entire line. Alternatively, you could loop and perform multiple read()s if you don't know up front how long a line could be.
(2) This assumes \n line endings... you may need to modify this for your data.
(3) The last worker will get slightly less data to handle than the others... this is because we always search forwards for the next line break. The more workers you have, the less data the final worker gets. It's not very significant (~200-500 bytes in my testing).
Make sure you always use binary mode, as text mode can give you wonky seek()s / read()s.
An example launch_worker() would look like this:
def launch_worker(offset, length):
    print('Starting a worker... using chunk %d - %d (%d bytes)...'
          % (offset, offset + length, length))
    # read from the same file the splitter above indexed
    with open('data.csv', 'rb') as f:
        f.seek(offset, os.SEEK_SET)
        worker_buf = f.read(length)
        lines = worker_buf.split(b'\n')
        print('First Line:')
        print('\t' + str(lines[0]))
        print('Last Line:')
        print('\t' + str(lines[-1]))

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
    content = f.read()
hexString = str(binascii.hexlify(content))

flippedBytes = ""
inc = 0
while inc < len(hexString):
    flippedBytes += hexString[inc + 6:inc + 8]
    flippedBytes += hexString[inc + 4:inc + 6]
    flippedBytes += hexString[inc + 2:inc + 4]
    flippedBytes += hexString[inc:inc + 2]
    inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        d = infile.read(4)
        if not d:
            break
        le = struct.unpack('<I', d)
        be = struct.pack('>I', *le)
        of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack
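For example, a sketch along those lines, reusing the myfile/myoutput names from the answer above (it assumes the file length is a multiple of 4, so every chunk unpacks cleanly):
import struct

CHUNK = 4 * 1024 * 1024  # read 4 MiB at a time; must remain a multiple of 4

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        data = infile.read(CHUNK)
        if not data:
            break
        # unpack the chunk as little-endian 32-bit ints, then repack big-endian
        values = [v for (v,) in struct.iter_unpack('<I', data)]
        of.write(struct.pack('>%dI' % len(values), *values))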

converting a file of numbers in to a list

To put it simply, I am trying to read a file that will eventually contain nothing but numbers, separated by spaces, commas, or new lines. I have read through a lot of these posts and fixed some things, and I learned the values are imported as strings first. However, I am running into an issue where the numbers are imported as lists, so now I have a list of lists. This would be fine except I can't compare them against ints or add numbers to them. The idea is to have each user assigned a number which is then saved. I'm not worrying about saving right now; I'm just worried about importing the numbers and being able to use them as individual numbers.
my code thus far:
fo1 = open('mach_uID_3.txt', 'a+')
t1 = fo1.read()
t2 = []
print t1
for x in t1.split():
    print x
    z = [int(n) for n in x.split()]
    t2.append(z)
print t2
print t2[3]
fo1.close()
and the file it's reading is:
0 1 2 25
34
23
My results are pretty ugly, but here you go:
0 1 2 25
34
23
0
1
2
25
34
23
[[0], [1], [2], [25], [34], [23]]
[25]
Process finished with exit code 0
Use extend instead of append:
t2.extend(int(n) for n in x.split())
To have all the numbers in a single, flattened list, do this:
fo1 = open('mach_uID_3.txt', 'a+')
number_list = list(map(int, fo1.read().split()))
fo1.close()
But it's better to open the file like this:
with open('mach_uID_3.txt', 'a+') as fo1:
    number_list = list(map(int, fo1.read().split()))
so you don't have to explicitly close it.
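Since the question says the numbers may also be comma-separated, which plain str.split() will not handle, here is a small additional sketch (not part of the answer above) that splits on any run of commas or whitespace:
import re

with open('mach_uID_3.txt') as fo1:
    # split on runs of commas and/or whitespace, dropping empty tokens
    tokens = re.split(r'[\s,]+', fo1.read().strip())
    number_list = [int(tok) for tok in tokens if tok]
print(number_list)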

How to calculate the average of several .dat files using python?

So I have 50 - 60 .dat files all containing m rows and n columns of numbers. I need to take the average of all of the files and create a new file in the same format. I have to do this in python. Can anyone help me with this?
I've written some code..
I realize I have some incompatible types here, but I can't think of an alternative, so I haven't changed anything yet.
#! /usr/bin/python
import os

CC = 1.96
average = []
total = []
count = 0

os.chdir("./")
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines()
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        total += cur
        count += 1
average = [x/count for x in total]

#calculate uncertainty
uncert = []
for files in os.listdir("."):
    if files.endswith(".dat"):
        infile = open(files)
        cur = []
        cur = infile.readlines
        for i in xrange(0, len(cur)):
            cur[i] = cur[i].split()
        uncert += (cur - average)**2
uncert = uncert**.5
uncert = uncert*CC
Here's a fairly time- and resource-efficient approach which reads in the values and calculates their averages for all the files in parallel, yet only reads in one line per file at a time -- however it does temporarily read the entire first .dat file into memory in order to determine how many rows and columns of numbers are going to be in each file.
You didn't say if your "numbers" were integer or float or what, so this reads them in as floating point (which will work even if they're not). Regardless, the averages are calculated and output as floating point numbers.
Update
I've modified my original answer to also calculate a standard deviation (sigma) of the values in each row and column, as per your comment. It does this right after it computes their mean value, so a second pass to re-read all the data isn't necessary. In addition, in response to a suggestion made in the comments, a context manager has been added to ensure that all the input files get closed.
Note that the standard deviations are only printed and are not written to the output file, but doing that to the same or a separate file should be easy enough to add.
from contextlib import contextmanager
from itertools import izip
from glob import iglob
from math import sqrt
from sys import exit

@contextmanager
def multi_file_manager(files, mode='rt'):
    files = [open(file, mode) for file in files]
    yield files
    for file in files:
        file.close()

# generator function to read, convert, and yield each value from a text file
def read_values(file, datatype=float):
    for line in file:
        for value in (datatype(word) for word in line.split()):
            yield value

# enumerate multiple equal length iterables simultaneously as (i, n0, n1, ...)
def multi_enumerate(*iterables, **kwds):
    start = kwds.get('start', 0)
    return ((n,)+t for n, t in enumerate(izip(*iterables), start))

DATA_FILE_PATTERN = 'data*.dat'
MIN_DATA_FILES = 2

with multi_file_manager(iglob(DATA_FILE_PATTERN)) as datfiles:
    num_files = len(datfiles)
    if num_files < MIN_DATA_FILES:
        print('Less than {} .dat files were found to process, '
              'terminating.'.format(MIN_DATA_FILES))
        exit(1)

    # determine number of rows and cols from first file
    temp = [line.split() for line in datfiles[0]]
    num_rows = len(temp)
    num_cols = len(temp[0])
    datfiles[0].seek(0)  # rewind first file
    del temp  # no longer needed
    print '{} .dat files found, each must have {} rows x {} cols\n'.format(
        num_files, num_rows, num_cols)

    means = []
    std_devs = []
    divisor = float(num_files-1)  # Bessel's correction for sample standard dev
    generators = [read_values(file) for file in datfiles]
    for _ in xrange(num_rows):  # main processing loop
        for _ in xrange(num_cols):
            # create a sequence of next cell values from each file
            values = tuple(next(g) for g in generators)
            mean = float(sum(values)) / num_files
            means.append(mean)
            means_diff_sq = ((value-mean)**2 for value in values)
            std_dev = sqrt(sum(means_diff_sq) / divisor)
            std_devs.append(std_dev)

print 'Average and (standard deviation) of values:'
with open('means.txt', 'wt') as averages:
    for i, mean, std_dev in multi_enumerate(means, std_devs):
        print '{:.2f} ({:.2f})'.format(mean, std_dev),
        averages.write('{:.2f}'.format(mean))  # note std dev not written
        if i % num_cols != num_cols-1:  # not last column?
            averages.write(' ')  # delimiter between values on line
        else:
            print  # newline
            averages.write('\n')
I am not sure which aspect of the process is giving you the problem, but I will just answer specifically about getting the averages of all the dat files.
Assuming a data structure like this:
72 12 94 79 76 5 30 98 97 48
79 95 63 74 70 18 92 20 32 50
77 88 60 98 19 17 14 66 80 24
...
Getting averages of the files:
import glob
import itertools
avgs = []
for datpath in glob.iglob("*.dat"):
with open(datpath, 'r') as f:
str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
nums = map(int, str_nums)
avg = sum(nums) / len(nums)
avgs.append(avg)
print avgs
It loops over each .dat file, reads and joins the lines, converts the values to int (could be float if you want), and appends the average.
If these files are enormous and you are concerned about the amount of memory when reading them in, you could more explicitly loop over each line and only keep running counters, the way your original example was doing:
for datpath in glob.iglob("*.dat"):
    with open(datpath, 'r') as f:
        count = 0
        total = 0
        for line in f:
            nums = [int(i) for i in line.strip().split()]
            count += len(nums)
            total += sum(nums)
        avgs.append(total / count)
Note: I am not handling exceptional cases, such as the file being empty and producing a Divide By Zero situation.
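For completeness: the question asks for an element-wise average across files, written out in the same m rows by n columns layout, which neither snippet above produces directly. A minimal sketch using NumPy (assuming NumPy is available and every .dat file has the same shape):
import glob
import numpy as np

paths = sorted(glob.glob("*.dat"))
# shape (num_files, m, n): one 2-D array per file, stacked along a new axis
stack = np.array([np.loadtxt(p) for p in paths])
# element-wise mean across files, written back in the same rows-by-columns layout
np.savetxt("average.dat", stack.mean(axis=0), fmt="%.6f")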
