How can I read an int from a file? I have a large (512 MB) txt file which contains integer data like:
0 0 0 10 5 0 0 140
0 20 6 0 9 5 0 0
Now if I use c = file.read(1), I get only one character at a time, but I need one integer at a time. Like:
c = 0
c = 10
c = 5
c = 140 and so on...
Any great heart please help. Thanks in advance.
Here's one way:
with open('in.txt', 'r') as f:
    for line in f:
        for s in line.split(' '):
            num = int(s)
            print num
By doing for line in f you read the file line by line instead of pulling it all into memory with read() or readlines(). That matters because your file is large.
Then you split each line on spaces, and read each number as you go.
You can do more error checking than that simple example, which will barf if the file contains corrupted data.
As the comments say, this should be enough for you. Otherwise, if your file can have extremely long lines, you can do something trickier, like reading blocks at a time.
512 MB is really not that large. If you're going to create a list of the data anyway, I don't see a problem with doing the reading step in one go:
my_int_list = [int(v) for v in open('myfile.txt').read().split()]
If you can structure your code so you don't need the entire list in memory, it would be better to use a generator:
def my_ints(fname):
    for line in open(fname):
        for val in line.split():
            yield int(val)
and then use it:
for c in my_ints('myfile.txt'):
    # do something with c (which is the next int)
I would do it this way:
buffer = file.read(8192)
contents += buffer
split contents by spaces
remove the last element from the resulting list (it might not be a full number)
replace contents with that last element
repeat until read() returns an empty string
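A rough, runnable sketch of that chunked idea (assuming whitespace-separated integers and the in.txt file name from the earlier example):

contents = ''
with open('in.txt', 'r') as f:
    while True:
        chunk = f.read(8192)
        if not chunk:              # read() returns '' at end of file
            break
        contents += chunk
        parts = contents.split()
        if not contents[-1].isspace():
            contents = parts.pop() if parts else ''  # chunk ended mid-number, carry it over
        else:
            contents = ''
        for tok in parts:
            num = int(tok)
            # do something with num
if contents:
    num = int(contents)            # leftover token after the last chunk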
Related
I have a file which has integers in the first two columns.
File Name : file.txt
col_a,col_b
1001021,1010045
2001021,2010045
3001021,3010045
4001021,4010045 and so on
Now, using Python, I get a variable var_a = 2002000.
How do I find the range in "file.txt" within which this var_a lies?
Expected Output : 2001021,2010045
I have tried the below:
with open("file.txt", "r") as a:
    a_line = a.readlines()
    for line in a_line:
        line_sp = line.split(',')
        if int(line_sp[0]) < var_a < int(line_sp[1]):
            print('%r, %r' % (line_sp[0], line_sp[1]))
Since the file has more than a million records, this is time consuming. Is there a better way to do this without a for loop?
Since the file has more than a million records, this is time consuming.
Is there a better way to do this without a for loop?
Unfortunately you have to iterate over all records in the file, and the only way to achieve that is some kind of loop, so the complexity of this task will always be at least O(n).
It is better to read your file line by line (not all into memory) and store its contents as ranges so you can look up multiple numbers against them. Ranges are stored quite efficiently, and you only have to read the file once to check more than one number.
Since Python 3.7, dictionaries are insertion-ordered, so if your file is sorted you only iterate the dictionary until the first range that contains the number; for numbers that fall in no range you iterate the whole dictionary.
Create file:
fn = "n.txt"
with open(fn, "w") as f:
f.write("""1001021,1010045
2001021,2010045
3001021,3010045
garbage
4001021,4010045""")
Process file:
fn = "n.txt"
# read in
data = {}
with open(fn) as f:
for nr,line in enumerate(f):
line = line.strip()
if line:
try:
start,stop = map(int, line.split(","))
data[nr] = range(start,stop+1)
except ValueError as e:
pass # print(f"Bad data ({e}) in line {nr}")
look_for_nums = [800, 1001021, 3001039, 4010043, 9999999]
for look_for in look_for_nums:
items_checked = 0
for nr,rng in data.items():
items_checked += 1
if look_for in rng:
print(f"Found {look_for} it in line {nr} in range: {rng.start},{rng.stop-1}", end=" ")
break
else:
print(f"{look_for} not found")
print(f"after {items_checked } checks")
Output:
800 not found after 4 checks
Found 1001021 in line 0 in range: 1001021,1010045 after 1 checks
Found 3001039 in line 2 in range: 3001021,3010045 after 3 checks
Found 4010043 in line 5 in range: 4001021,4010045 after 4 checks
9999999 not found after 4 checks
There are better ways to store such a ranges file, e.g. in a tree-like data structure; research k-d trees to get even faster results if you need them. They partition the ranges in a smarter way, so you do not need a linear search to find the right bucket.
This answer to Data Structure to store Integer Range, Query the ranges and modify the ranges provides more things to research.
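As a simpler stepping stone, if the ranges in the file are already sorted and non-overlapping (as in the sample data), a bisect lookup on the range starts already avoids the linear scan. This is just a sketch of that idea, not of the tree structures mentioned above:

import bisect

# built once from the file; assumed sorted and non-overlapping
ranges = [(1001021, 1010045), (2001021, 2010045),
          (3001021, 3010045), (4001021, 4010045)]
starts = [start for start, stop in ranges]

def find_range(num):
    i = bisect.bisect_right(starts, num) - 1   # last range whose start is <= num
    if i >= 0 and num <= ranges[i][1]:
        return ranges[i]
    return None

print(find_range(2002000))   # (2001021, 2010045)
print(find_range(800))       # None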
Assuming each line in the file has the correct format, you can do something like the following.
var_a = 2002000

with open("file.txt") as file:
    for l in file:
        a, b = map(int, l.split(',', 1))  # each line must have only two comma-separated numbers
        if a < var_a < b:
            print(l)  # use the line as you want
            break     # if you need only the first occurrence, break the loop now
Note that you'll have to do additional verifications/workarounds if the file format is not guaranteed.
Obviously you have to iterate through all the lines in the worst case. But we don't load all the lines into memory at once, so as soon as the answer is found the rest of the file is ignored without reading (assuming you are looking only for the first match).
Windows 10 Pro 64-bit, 64-bit Python installed.
The file weighs 1.80 GB.
How do I fix this error and print all the strings?
def count():
    reg = open('link_genrator.txt', 'r')
    s = reg.readline().split()
    print(s)
reg.read().split('\n') will give a list of all lines.
Why don't you just do s = reg.read(65536).splitlines()? This will give you a hint on the structure of the content and you can then play with the size you read in a chunk.
Once you know a bit more, you can try to loop over it and sum up the number of lines.
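For example, a minimal sketch of counting the lines chunk by chunk (re-using the file name from the question):

def count_lines(fname, chunk_size=65536):
    count = 0
    with open(fname, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:                 # end of file
                break
            count += chunk.count('\n')
    return count                          # add 1 if the last line has no trailing newline

print(count_lines('link_genrator.txt'))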
After looking at the answers and trying to understand what the initial question could be, I came to a more complete answer than my previous one.
Looking at the question and the code in the sample function, I now assume the following:
it seems he wants to separate the contents of a file into words and print them
from the function name I suppose he would like to count all these words
the whole file is quite big and thus Python stops with a memory error
Handling such large files obviously asks for a different treatment than usual. For example, I do not see any use in printing all the separated words of such a file to the console. Of course it might make sense to count these words or search for patterns in them.
To show how one might treat such big files, I wrote the following example. It is meant as a starting point for further refinements and changes according to your own requirements.
MAXSTR = 65536
MAXLP = 999999999
WORDSEP = ';'

lineCnt = 0
wordCnt = 0
lpCnt = 0

fn = 'link_genrator.txt'
fin = open(fn, 'r')
try:
    while lpCnt < MAXLP:
        pos = fin.tell()
        s = fin.read(MAXSTR)
        lines = s.splitlines(True)
        if len(lines) == 0:
            break
        # count words of line
        k = 0
        for l in lines:
            lineWords = l.split(WORDSEP)  # semi-colon separates each word
            k += len(lineWords)           # sum up words of each line
        wordCnt += k - 1                  # last word most probably not complete: subtract one
        # count lines
        lineCnt += len(lines) - 1
        # correction when line ends with \n
        if lines[len(lines)-1][-1] == '\n':
            lineCnt += 1
            wordCnt += 1
        lpCnt += 1
        print('{0} {4} - {5} act Pos: {1}, act lines: {2}, act words: {3}'.format(lpCnt, pos, lineCnt, wordCnt, lines[0][0:10], lines[len(lines)-1][-10:]))
finally:
    fin.close()

lineCnt += 1
print('Total line count: {}'.format(lineCnt))
That code works for files up to 2 GB (tested with 2.1 GB). The two constants at the beginning let you play with the size of the chunks read in and limit the amount of text processed. During testing you can then process just a subset of the whole data, which is much faster.
I've only programmed in Python for the past 8 months, so please excuse my probably noob approach to python.
My problem is the following, which I hope someone can help me solve.
I have lots of data in a file, for instance something like this (just a snip):
SWITCH MGMT IP;SWITCH HOSTNAME;SWITCH MODEL;SWITCH SERIAL;SWITCH UPTIME;PORTS NOT IN USE
10.255.240.1;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 33 minutes;1
10.255.240.7;641_HX_LEFT_2960x;WS-C2960X-24PS-L;FOC1750S2E5;12 weeks, 4 days, 7 minutes;21
10.255.240.8;641_UX_BASEMENT_2960x;WS-C2960X-24PS-L;FOC1750S2AG;12 weeks, 4 days, 7 minutes;12
10.255.240.9;641_UX_SPECIAL_2960x;WS-C2960X-24PS-L;FOC1750S27M;12 weeks, 4 days, 8 minutes;25
10.255.240.2;641_UX_OFFICE_3560;WS-C3560-8PC-S;FOC1202U24E;2 years, 30 weeks, 3 days, 16 hours, 43 minutes;2
10.255.240.3;641_UX_SFO_2960x;WS-C2960X-24PS-L;FOC1750S2BR;12 weeks, 4 days, 7 minutes;14
10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1
10.255.240.5;641_HX_RIGHT_2960s;WS-C2960S-24PS-L;FOC1627X1BF;12 weeks, 4 days, 12 minutes;16
10.255.240.6;641_HX_LEFT_2960x-02;WS-C2960X-24PS-L;FOC1750S2C4;12 weeks, 4 days, 7 minutes;15
10.255.240.4;641_UX_BASEMENT_2960s;WS-C2960S-24PS-L;FOC1607Z27T;12 weeks, 4 days, 8 minutes;3
10.255.240.62;641_UX_OFFICE_3560CG;WS-C3560CG-8PC-S;FOC1646Y0U2;15 weeks, 5 days, 12 hours, 15 minutes;6
I want to run through all the data in the file and check if a serial number occurs more than once. If it does, I want to remove the duplicate found. The reason why the result might contain the same switch or router multiple times is that it might have several layer 3 interfaces where it can be managed.
So in the above example, after I've run through the data it should remove the line:
10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1
Since the second line in the file already contains the same switch and serial number.
I've spent several days trying to figure out how to achieve this, and it is starting to give me a headache.
My base code looks like this:
if os.stat("output.txt").st_size != 0:
    with open('output.txt','r') as file:
        header_line = next(file)  # Start from line 2 in the file.
        data = []  # Contains the data from the file.
        sn = []    # Contains the serial numbers to check up against.
        ok = []    # Will contain the clean data with no duplicates.
        data.append(header_line.split(";"))  # Write the head to data.
        for line in file:  # Run through the file data line for line.
            serialchk = line.split(";")  # Split the data into a list
            data.append(serialchk)       # Write the data to data list.
            sn.append(serialchk[3])      # Write the serial number to sn list.
        end = len(data)  # Save the length of the data list, so i can run through the data
        i = 0            # For my while loop, so i know when to stop.
        while i != end:  # from here on out i am pretty lost on how to achieve my goal.
            found = 0
            for x in range(len(data)):
                if sn[i] == data[x][3]:
                    found += 1
                    print data[x]
                    ok.append(data[x])
                elif found > 1:
                    print "Removing:\r\n"
                    print data[x-1]
                    del ok[-1]
                    found = 0
            i += 1
Is there a more pythonic way to do this? I am pretty sure that, with all the talented people here, someone can give me clues on how to make this happen.
Thank you very much in advance.
You're making it way more complicated than it has to be, and not memory-friendly (you don't have to load the whole file into memory to filter duplicates).
The simple way is to read your file line by line, and for each line check if the serial number has already been seen. If yes, skip the line, else store the serial number and write the line to your output file:
seen = set()

with open('output.txt', 'r') as source, open("cleaned.txt", "w") as dest:
    dest.write(next(source))  # copy the header, then start from line 2
    for line in source:
        sn = line.split(";")[3]
        if sn not in seen:
            seen.add(sn)
            dest.write(line)
        # else, well we just ignore the line ;)
NB: I assume you want to write the deduplicated lines back to a file. If you want to keep them in memory the algorithm is mostly the same, just append your deduplicated lines to a list instead, but beware of memory usage if you have huge files.
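For completeness, the in-memory variant could look roughly like this (cleaned is just a list name made up for the example):

seen = set()
cleaned = []

with open('output.txt', 'r') as source:
    cleaned.append(next(source))   # keep the header line
    for line in source:
        sn = line.split(";")[3]
        if sn not in seen:
            seen.add(sn)
            cleaned.append(line)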
My suggestion:
if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as file:
        header_line = next(file)  # Start from line 2 in the file.
        srn = set()  # create a set where the seen serial numbers will be stored
        ok = []      # Will contain the clean data with no duplicates.
        ok.append(header_line.split(";"))  # Write the header to ok.
        for line in file:  # Run through the file data line by line.
            serialchk = line.split(";")  # Split the data into a list
            if serialchk[3] not in srn:  # if the serial number hasn't been seen
                ok.append(serialchk)     # add the row to ok
                srn.add(serialchk[3])    # add the serial number to the seen set
            else:  # if the serial number has already been seen
                print "Removing: " + ";".join(serialchk)  # notify the user it has been skipped
You'll end up with ok containing only rows with unique serial numbers, and the removed rows printed.
Hopefully this helps.
I will walk you through the changes I would make.
The first thing I would do would be to use the csv module to parse the input. Since you can iterate over the DictReader, I also opt for that for brevity. The list data will contain the final (cleaned) results.
from csv import DictReader
import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as f:
        reader = DictReader(f, delimiter=';')  # create the reader instance
        serial_numbers = set()
        data = []
        for row in reader:
            if row["SWITCH SERIAL"] in serial_numbers:
                pass
            else:
                data.append(row)
                serial_numbers.add(row["SWITCH SERIAL"])
The format of the data will have changed by my approach, from a list of lists to a list of dicts, but if you want to save the cleaned data into a new csv file, the DictWriter class should be an easy way to do that.
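For instance, a short sketch of that step (the cleaned.txt output name is made up; reader and data come from the snippet above, and reader's fieldnames are already populated after the rows have been read):

from csv import DictWriter

with open('cleaned.txt', 'w', newline='') as out:
    writer = DictWriter(out, fieldnames=reader.fieldnames, delimiter=';')
    writer.writeheader()
    writer.writerows(data)   # data holds the deduplicated rows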
I'm trying to loop over every 2 characters in a file, do some tasks on them and write the resulting characters into another file.
So I tried to open the file and read the first two characters. Then I set the pointer to the 3rd character in the file, but it gives me the following error:
'bytes' object has no attribute 'seek'
This is my code:
the_file = open('E:\\test.txt',"rb").read()
result = open('E:\\result.txt',"w+")
n = 0
s = 2
m = len(the_file)
while n < m :
    chars = the_file.seek(n)
    chars.read(s)
    #do something with chars
    result.write(chars)
    n =+ 1
    m =+ 2
I have to mention that test.txt contains only integers (numbers).
The content of test.txt is a series of binary data (0's and 1's) like this:
01001010101000001000100010001100010110100110001001011100011010000001010001001
Although it's not the point here, I just want to replace every 2 characters with something else and write it into result.txt.
Use the file object with seek, not its contents.
Use an if statement to break out of the loop, as you do not have the length.
Use n += not n =+.
Finally, we seek +2 and read 2.
Hopefully this will get you close to what you want.
Note: I changed the file names for the example
the_file = open('test.txt',"rb")
result = open('result.txt',"w+")
n = 0
s = 2
while True:
    the_file.seek(n)
    chars = the_file.read(2)
    if not chars:
        break
    #do something with chars
    print chars
    result.write(chars)
    n += 2
the_file.close()
Note that because, in this case, you are reading the file sequentially in chunks, i.e. read(2) rather than read(), the seek is superfluous.
The seek() would only be required if you wished to alter the position pointer within the file, say for example you wanted to start reading at the 100th byte (seek(99))
The above could be written as:
the_file = open('test.txt',"rb")
result = open('result.txt',"w+")
while True:
    chars = the_file.read(2)
    if not chars:
        break
    #do something with chars
    print chars
    result.write(chars)
the_file.close()
You were trying to use the .seek() method on a string, because you thought it was a file object, but the .read() method of a file returns a string.
Here's a general approach I might take to what you were going for:
# open the file and load its contents as a string file_contents
with open('E:\\test.txt', "r") as f:
    file_contents = f.read()

# do the stuff you were doing
n = 0
s = 2
m = len(file_contents)

# initialize a result string
result = ""

# iterate over the file_contents, incrementing by 2, adding to results
for i in xrange(0, m, 2):
    result += file_contents[i]

# write to results.txt
with open('E:\\result.txt', 'wb') as f:
    f.write(result)
Edit: It seems like there was a change to the question. If you want to change every second character, you'll need to make some adjustments.
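If the goal is instead to replace each 2-character pair (as the question describes), a rough sketch; the transform itself is only a placeholder:

with open('E:\\test.txt', 'r') as f:
    data = f.read()

pieces = []
for i in range(0, len(data), 2):
    pair = data[i:i+2]
    pieces.append(pair[::-1])   # placeholder: swap the two characters; put your real replacement here

with open('E:\\result.txt', 'w') as f:
    f.write(''.join(pieces))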
The text file is like
101 # an integer
abcd # a string
2 # a number that indicates how many 3-line structures will there be below
1.4 # some float number
2 # a number indicating how many numbers will there be in the next line
1 5 # 2 numbers
2.7 # another float number
3 # another number
4 2 7 # three numbers
and the output should be like
[101,'abcd',[1.4,[1,5]],[2.7,[4,2,7]]]
I can do it line by line, with readlines(), strip(), int(), and a for loop, but I'm not sure how to do it like a pro.
P.S. There can be spaces and tabs and maybe empty lines randomly inserted in the text file. The input was originally intended for a C program where it doesn't matter :(
My code:
with open('data','r') as f:
    lines = [line.strip('\n') for line in f.readlines()]

i = 0
while i < len(lines):
    course_id = int(lines[i])
    i += 1
    course_name = lines[i]
    i += 1
    class_no = int(lines[i])
    i += 1
    for j in range(class_no):
        fav = float(lines[i])
        i += 2
        class_sched = lines[i].split(" ")
        i += 1  # move on to the next 3-line block
    # the variables read from the file will be handled afterwards
All those i += 1's look absolutely hideous! And it seems like a long Python program for this sort of task.
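One way to drop the index bookkeeping is to flatten the file into a stream of whitespace-separated tokens and consume it with an iterator. A sketch of that idea, assuming the string field never contains spaces:

with open('data', 'r') as f:
    tokens = iter(f.read().split())   # tabs, extra spaces and empty lines all collapse

course_id = int(next(tokens))
course_name = next(tokens)            # assumes the string has no spaces in it
class_no = int(next(tokens))

result = [course_id, course_name]
for _ in range(class_no):
    fav = float(next(tokens))
    count = int(next(tokens))
    result.append([fav, [int(next(tokens)) for _ in range(count)]])

print(result)   # [101, 'abcd', [1.4, [1, 5]], [2.7, [4, 2, 7]]]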