I've created a tab-delimited bed file using the following code:
def raw_data_file(sample_name, chrom):
    data = []
    with open('{}_{}_raw_data2.bed'.format(sample_name, chrom), 'w') as text_file:
        for (i, zone) in enumerate(zones):
            select = final_data[i]
            for x in select:
                row = [chrom, int(zone[1][0]), int(zone[1][1]), zone[0], x]
                text_file.write("\t".join(map(str, row)) + "\n")
I then open it using
with open('HG00148_1_raw_data2.bed', 'rb') as f:
    rawdata = [x.decode('utf-8').split('\t') for x in f.read().splitlines()]
The data show lines with chromosome number, start point, end point, zone name and a list of associated data (read position, p-value, reads).
When trying to get the read position of line zero using:
rawdata[0][4][1]
my code returns 7 instead of 755255 (it treats each character as a separate element). What should I change, either in my encoding or decoding of the bed file, for the read position to be returned correctly?
Thanks
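For illustration, here is a minimal sketch of what is likely going on, assuming the fifth field was written as the str() of a tuple (the sample value below is hypothetical): the field reads back as a single string, so indexing it returns individual characters. Parsing the field back into Python objects recovers the numbers.
# the fifth column was written with str(x), so it reads back as one string
field = "(755255, 0.001, 12)"    # hypothetical str() of a tuple, as written above
print(field[1])                  # '7' - indexing a string yields one character

# one possible fix: parse the field back into Python objects after reading
from ast import literal_eval
read_pos, p_value, reads = literal_eval(field)
print(read_pos)                  # 755255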
Related
I would like to insert a string at a specific column of a specific line in a file.
Suppose I have a file file.txt
How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?
I would like to change the last line to say How was the History test? by adding the string History at line 4 column 13.
Currently I read in every line of the file and add the string to the specified position.
with open("file.txt", "r+") as f:
# Read entire file
lines = f.readlines()
# Update line
lino = 4 - 1
colno = 13 -1
lines[lino] = lines[lino][:colno] + "History " + lines[lino][colno:]
# Rewrite file
f.seek(0)
for line in lines:
f.write(line)
f.truncate()
f.close()
But I feel like I should be able to simply add the string to the file without having to read and rewrite the entire file.
This is possibly a duplicate of the SO thread below:
Fastest Way to Delete a Line from Large File in Python
That thread is about deleting a line, which is one kind of manipulation; yours is a modification. So the code would be updated as below:
def update(filename, lineno, column, text):
    fro = open(filename, "rb")
    current_line = 0
    while current_line < lineno - 1:
        fro.readline()
        current_line += 1
    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to update
    line = fro.readline()
    chars = line[0:column-1] + text + line[column-1:]
    while chars:
        frw.write(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()

if __name__ == "__main__":
    # note: pass bytes, since the file is opened in binary mode
    update("file.txt", 4, 13, b"History ")
With a large file it makes sense not to touch anything before the line where the update needs to happen. Imagine you have a file with 10K lines and the update needs to happen at line 9K: your code would load all 9K preceding lines into memory unnecessarily. The code you have would still work, but it is not the optimal way of doing it.
The function readlines() reads the entire file. But it doesn't have to: it actually reads from the current file cursor position to the end, which happens to be 0 right after opening. (To confirm this, try f.tell() right after the with statement.) What if we started closer to the end of the file?
The way your code is written implies some prior knowledge of your file's contents and layout. Can you place any constraints on each line? For example, given your sample data, we might say that lines are guaranteed to be 27 bytes or less. Let's round that up to 32 for "power of 2-ness" and try seeking backwards from the end of the file.
# note the "rb+"; need to open in binary mode, else seeking is strictly
# a "forward from 0" operation. We need to be able to seek backwards
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
last = f.readlines()[-1].decode()
At which point the code has only read the last 32 bytes of the file.[1] readlines() (at the byte level) will look for the line-end byte (in Unix, \n, 0x0a, byte value 10), and return the pieces before and after it. Spelled out:
>>> last = f.readlines()
>>> print( last )
[b'hemistry test?\n', b'How was the test?']
>>> last = last[-1]
>>> print( last )
b'How was the test?'
Crucially, this works robustly under UTF-8 encoding by exploiting a UTF-8 property: byte values under 128 (the ASCII range) never occur inside the multi-byte encoding of a non-ASCII character. In other words, the exact byte \n (0x0a) only ever occurs as a newline and never as part of another character's encoding. If you are using a non-UTF-8 encoding, you will need to check whether the code's assumptions still hold.
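As a quick demonstration of that property (the sample strings here are arbitrary):
# every byte of a multi-byte UTF-8 character is >= 0x80, so the byte
# 0x0a can only ever be an actual newline
s = "naïve façade 日本語\n"
encoded = s.encode("utf-8")
print(encoded.count(b"\n"))                          # 1: only the real newline
print(all(b >= 0x80 for b in "日".encode("utf-8")))  # True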
Another note: 32 bytes is arbitrary given the example data. A more realistic and typical value might be 512, 1024, or 4096. Finally, to put it back to a working example for you:
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
# does *not* read while file, unless file is exactly 32 bytes.
last = f.readlines()[-1]
last_decoded = last.decode()
# Update line
colno = 13 -1
last_decoded = last_decoded[:colno] + "History " + last_decoded[colno:]
last_line_bytes = len( last )
f.seek(-last_line_bytes, 2)
f.write( last_decoded.encode() )
f.truncate()
Note that there is no need for f.close(). The with statement handles that automatically.
[1] The pedantic will correctly note that the computer and OS will likely have read at least 512 bytes, if not 4096, relating to the on-disk or in-memory page size.
You can use this piece of code:
with open("test.txt", 'r+') as f:
    # Read the file
    lines = f.readlines()
    # Get the column
    column = int(input("Column:")) - 1
    # Get the line
    line = int(input("Line:")) - 1
    # Get the word
    word = input("Word:")
    lines[line] = lines[line][0:column] + word + lines[line][column:]
    # Go back to the start of the file and rewrite the lines
    f.seek(0)
    for i in lines:
        f.write(i)
This answer only loops through the file once and only rewrites what comes after the insert. In cases where the insert is at the end there is almost no overhead, and where the insert is at the beginning it is no worse than a full read and write.
def insert(file, line, column, text):
    ln, cn = line - 1, column - 1   # offset from human index to Python index
    count = 0                       # initial count of characters
    with open(file, 'r+') as f:     # open file for reading and writing
        for idx, line in enumerate(f):     # for each line in the file
            if idx < ln:                   # before the given line
                count += len(line)         # read and count characters
            elif idx == ln:                # once at the line
                f.seek(count + cn)         # place cursor at the correct character location
                remainder = f.read()       # store all characters afterwards
                f.seek(count + cn)         # move cursor back to the correct character location
                f.write(text + remainder)  # insert text and rewrite the remainder
                return                     # you're finished!
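A call mirroring the question's example would be:
insert("file.txt", 4, 13, "History ")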
I'm not sure whether you were having problems changing your file to contain the word "History", or whether you wanted to know how to only rewrite certain parts of a file, without having to rewrite the whole thing.
If you were having problems in general, here is some simple code which should work, so long as you know the line within the file that you want to change. Just change the first and last lines of the program to read and write statements accordingly.
fileData="""How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?""" # So that I don't have to create the file, I'm writing the text directly into a variable.
fileData=fileData.split("\n")
fileData[3]=fileData[3][:11]+" History"+fileData[3][11:] # The 3 referes to the line to add "History" to. (The first line is line 0)
storeData=""
for i in fileData:storeData+=i+"\n"
storeData=storeData[:-1]
print(storeData) # You can change this to a write command.
If you wanted to know how to change specific "parts" to a file, without rewriting the whole thing, then (to my knowledge) that is not possible.
Say you had a file which said Ths is a TEST file., and you wanted to correct it to say This is a TEST file.; you would technically be changing 17 characters and adding one on the end. You are changing the "s" to an "i", the first space to an "s", the "i" (from "is") to a space, etc... as you shift the text forward.
A computer can't actually insert bytes between other bytes. It can only move the data, to make room.
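As a toy illustration of "moving data to make room" (in memory, with arbitrary sample text):
buf = bytearray(b"Ths is a TEST file.")
buf[2:2] = b"i"      # every byte after position 2 shifts one place right
print(buf.decode())  # This is a TEST file.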
I am currently working on a project and I need to test if, on the last row (line) of the input, I have this byte: '\x1a'. If the last row has this marker I want to delete the entire row.
I have this code so far, but I don't know how to make it test for that byte on the last row and delete it.
Thank you!
readFile1 = open("sdn.csv")
lines1 = readFile1.readlines()
readFile1.close()
w1 = open("sdn.csv", 'w')
w1.writelines([item for item in lines1[:-1]])
w1.close()
readFile2 = open("add.csv")
lines2 = readFile2.readlines()
readFile2.close()
w2 = open("add.csv",'w')
w2.writelines([item for item in lines2[:-1]])
w2.close()
readFile3 = open("alt.csv")
lines3 = readFile3.readlines()
readFile3.close()
w = open("alt.csv",'w')
w.writelines([item for item in lines3[:-1]])
w.close()
In each of your code blocks, you have read the file's contents into a variable with a line like:
lines1 = readFile1.readlines()
If you want to see if the \x1a byte exists anywhere in the last line of the text, then you can do this:
if '\x1a' in lines1[-1]:
    # whatever you need to do
If you want to find the byte and then actually delete the row from the list altogether:
if '\x1a' in lines1[-1]:
    # \x1a byte was found, remove the last item from the list
    del lines1[-1]
And if I may offer a suggestion, all your code blocks repeat. You could create a function which captures all the functionality and then pass file names to it.
def process_csv(file_name):
    # Open the file for both reading and writing.
    # This will also automatically close the file handle after
    # you're done with it.
    with open(file_name, 'r+') as csv_file:
        data = csv_file.readlines()
        if '\x1a' in data[-1]:
            # erase the file, then write the data without the last row
            csv_file.seek(0)
            csv_file.truncate()
            csv_file.writelines(data[:-1])
        else:
            # Just making this explicit:
            # don't do anything to the file if the \x1a byte wasn't found
            pass

for f in ('sdn.csv', 'add.csv', 'alt.csv'):
    process_csv(f)
I have a number of txt files that represent spatial data in a grid form, essentially arrays of the same dimensions in which each value signifies a trait about the corresponding parcel of land. I have been trying to script a sequence that imports each file, adds "-9999" on the border of the entire grid, and saves out to an otherwise identical txt file.
The first 6 rows of each txt file are header rows, and shouldn't be changed.
My progress is as follows:
for datfile in spatialfiles:
    results = []
    borderrow = []
    with open('{}.txt'.format(datfile)) as inputfile:
        #header = inputfile.readlines()
        for line in inputfile:
            row = ['-9999'] + line.strip().split(' ') + ['-9999']
            results.append(row)
        for cell in range(len(row)):
            borderrow.append('-9999')
    results = [borderrow] + results[6:] + [borderrow]
    with file("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header[:6]:
            outputFile.write(row)
        for row in results:
            outputFile.write(row)
"header = inputfile.readlines()" has been commented out because it seems to cause a NameError in which "row" is no longer recognized. At the same time, I haven't found another way to retain the 6 header rows for exporting later.
Why does readlines() seem to alter the ability to iterate through the lines of the inputfile when it is only being used to write to a variable? What am I missing? (Any other pointers on my undoubtedly bloated code always welcome!)
readlines() reads the whole file into memory, parses it into a list, and leaves the file pointer at the end of the file. When you then iterate over the same file object, reading resumes from that pointer, which is already at the end, so the loop body never runs and row is never assigned (hence the NameError). Call readlines() once and loop through the resulting list, changing the loop's behavior after the first 6 lines.
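A minimal sketch of that fix, reusing the question's variable names; the output formatting (joining each padded row with spaces) is an assumption, since the original code writes the row lists directly:
for datfile in spatialfiles:
    with open('{}.txt'.format(datfile)) as inputfile:
        all_lines = inputfile.readlines()      # read the file exactly once

    header = all_lines[:6]                     # first 6 lines are the header
    results = []
    for line in all_lines[6:]:                 # pad only the data rows
        results.append(['-9999'] + line.strip().split(' ') + ['-9999'])
    borderrow = ['-9999'] * len(results[0])    # border row matches the padded width
    results = [borderrow] + results + [borderrow]

    with open("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header:
            outputFile.write(row)              # header lines keep their newlines
        for row in results:
            outputFile.write(' '.join(row) + '\n')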
I'm facing a problem in reading random rows from a large csv file and moving them to another CSV file, using pandas 0.18.1 and Python 2.7.10 on Windows.
I want to load only the randomly selected rows into memory and move them to another CSV. I don't want to load the entire content of the first CSV into memory.
This is the code I used:
import random

file_size = 100
f = open("customers.csv", 'r')
o = open("train_select.csv", 'w')
for i in range(0, 50):
    offset = random.randrange(file_size)
    f.seek(offset)
    f.readline()
    random_line = f.readline()
    o.write(random_line)
The current output looks something like this:
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
My problems are twofold:
1. I want to see the header in the second csv as well, not just the rows.
2. A row should be selected by the random function only once.
The output should be something like this:
id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
You can do this more simply than that:
- first, read the customers file fully; the title line is a special case, so keep it separate.
- shuffle the list of lines (that's what you were looking for)
- write back the title + shuffled lines
code:
import random

with open("customers.csv", 'r') as f:
    title = f.readline()
    lines = f.readlines()

random.shuffle(lines)

with open("train_select.csv", 'w') as f:
    f.write(title)
    f.writelines(lines)
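As an aside, if only a fixed number of distinct rows is wanted (the question's loop draws 50), random.sample picks without repetition and could replace the shuffle; a small sketch:
import random

with open("customers.csv", 'r') as f:
    title = f.readline()
    lines = f.readlines()

selected = random.sample(lines, 50)   # 50 distinct lines, no repeats

with open("train_select.csv", 'w') as f:
    f.write(title)
    f.writelines(selected)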
EDIT: if you don't want to hold the whole file in memory, here's an alternative. The only drawback is that you have to read the file once (but without storing it in memory) to compute the line offsets:
import random

input_file = "customers.csv"
line_offsets = list()

with open(input_file, 'r') as f:
    # just read the title
    title = f.readline()
    while True:
        # store the offset of the next line start
        offset = f.tell()
        line = f.readline()
        if line == "":
            break
        line_offsets.append(offset)

    # now shuffle the offsets
    random.shuffle(line_offsets)

    # and write the output file (f must still be open for the seeks below)
    with open("train_select.csv", 'w') as fw:
        fw.write(title)
        for offset in line_offsets:
            # seek to a line start
            f.seek(offset)
            fw.write(f.readline())
At the OP's request, and since my 2 previous implementations had to read the input file, here's a more complex implementation where the file is not read in advance.
It uses bisect to store the pairs of line offsets, and a minimum line length (to be configured) so that the random offset list is not longer than necessary.
Basically, the program generates randomly ordered offsets ranging from the offset of the second line (the title line is skipped) to the end of the file, in steps of the minimum line length.
For each offset, it checks whether the line has already been read (using bisect, which is fast, though the testing is complex because of the corner cases):
- if not read: skip back to find the previous linefeed (that means reading the file, there's no way around it), write the line to the output file, and store its start/end offsets in the pair list
- if already read: skip
the code:
import random, os, bisect

input_file = "csv2.csv"
input_size = os.path.getsize(input_file)
smallest_line_len = 4
line_offsets = []

with open(input_file, 'r') as f, open("train_select.csv", 'w') as fw:
    # read title and write it back
    title = f.readline()
    fw.write(title)
    # generate offset list, starting from current pos to the end of file
    # with a step of min line len to avoid generating too many numbers
    # (this can be 1 but that will take a while)
    offset_list = list(range(f.tell(), input_size, smallest_line_len))
    # shuffle the list at random
    random.shuffle(offset_list)
    # now loop through the offsets
    for offset in offset_list:
        # look if the offset is already contained in the list of sorted tuples
        insertion_point = bisect.bisect(line_offsets, (offset, 0))
        if len(line_offsets) > 0 and insertion_point == len(line_offsets) and line_offsets[-1][1] > offset:
            # bisect says insert at the end: check if within the last pair's
            # boundary; if so, the line has already been processed
            continue
        elif insertion_point < len(line_offsets) and (offset == line_offsets[insertion_point][0] or
                (0 < insertion_point and line_offsets[insertion_point-1][0] <= offset <= line_offsets[insertion_point-1][1])):
            # offset is already known, line has already been processed: skip
            continue
        else:
            # offset is not known: rewind until we meet an end of line
            f.seek(offset)
            while True:
                c = f.read(1)
                if c == "\n":
                    # we found the line terminator of the previous line: OK
                    break
                offset -= 1
                f.seek(offset)
            # now store the current position: start of the current line
            line_start = offset + 1
            # now read the line fully
            line = f.readline()
            # now compute line end (approx..)
            line_end = f.tell() - 1
            # and insert the "line" in the sorted list
            line_offsets.insert(insertion_point, (line_start, line_end))
            fw.write(line)
So if I use
tail 'path'
to view the last few lines of a text file, I get 9 lines of data in this format:
20-3-2015 16:7:13 6
I use
splitted = file_open(name).rstrip().split(" ")
where the file_open function is
def file_open(name):
    f_name = prefix + name
    offs = -10
    with open(f_name, 'r') as f:  # Open file to read
        while True:
            f.seek(offs, 2)  # Jump to final line and go to point in line to begin
            lines = f.readlines()
            if len(lines) >= 2:
                return lines[-1]
            offs *= 2
This should open the file, go to the last line, return the full last line, and then split it into the three columns.
Instead the value of splitted is
['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
whereas it should obviously be the final line. This code had been working perfectly fine, but all of a sudden I am getting this issue.
You should not use seek() on files opened in text mode; seeking (especially relative to the end of the file) is only well defined for files opened in binary mode. Otherwise the behavior is not defined, and readlines() can use internal buffers. Some systems also behave awkwardly if you seek below 0 or beyond the end of the file.
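A minimal sketch of that fix, assuming the same prefix variable and file layout as in the question: open the file in binary mode so end-relative seeks are well defined, then decode the final line back to text.
def file_open(name):
    f_name = prefix + name         # `prefix` as defined in the question
    offs = -10
    with open(f_name, 'rb') as f:  # binary mode: end-relative seek is defined
        while True:
            f.seek(offs, 2)        # jump back `offs` bytes from the end
            lines = f.readlines()
            if len(lines) >= 2:
                return lines[-1].decode()  # bytes back to text
            offs *= 2              # caveat: if -offs ever exceeds the file
                                   # size, seek() raises OSError; clamp if needed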