eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',
I have a huge .csv file in which the column names reappear after a certain number of rows. Is there a way in Python to split such data into multiple files as soon as it encounters the repeated column names?
I would like the above data to be in 3 separate .csv files, since the same column names appear 3 times.
Challenging! Here's my solution. There is likely a more straightforward way to do this though.
with open("./file.csv", "r") as readfile:
file_number = 0
current_line_no = 0
tmpline = None
for line in readfile:
# count which file you're on. Also use write mode "W" if the first line. Else append.
with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:
# check if the "headers" are appearing and if the current file has more than 1 line.
# Not sure if the header check is the best for your use case. Maybe regex is best here.
if current_line_no != 0 and ("eno" in line and "ename" in line):
file_number += 1 # increment to next file
current_line_no = 0 # reset file number
tmpline = line # remember the "current line". This needs to be added to next file.
continue # continue to next line in readfile
# if there is a templine from previous, add it to this as header.
if tmpline is not None:
writefile.write(tmpline)
tmpline = None
# write the line and increment to new line
writefile.write(line)
current_line_no += 1
I've tried to comment as best as possible. The code basically opens the files one by one as it loops through the lines of the readfile. When it reads the contents it checks if the current line is a "header". Here I simply checked if "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, then you need to close the current file and open a new one. Hopefully this helps!
I know you asked for Python, but there are some questions that just cry out for the power of AWK :)
awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv
One way of doing it is to save the headers to a variable, and then when reading the file check if the current row matches the header. If it does, increment a counter that can be used to determine which file to write to.
import csv

# Read the header row once so every later row can be compared against it.
with open('data.csv') as f:
    HEADERS = next(csv.reader(f))
print(HEADERS)

with open('data.csv') as f:
    reader = csv.reader(f)
    file_name_counter = 0
    for row in reader:
        if row == HEADERS:
            file_name_counter += 1
        # Open in write mode on a header row (start of a new file), otherwise append.
        with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else 'a'), newline='') as out:
            writer = csv.writer(out)
            writer.writerow(row)
NOTE: I believe the newline="" argument is necessary on Windows, as otherwise csv.writer() will add an extra blank line between rows.
I have a stock file which looks like this:
12334232:seat belt:2.30:12:10:30
14312332:toy card:3.40:52:10:30
12512312:xbox one:5.30:23:10:30
12543243:laptop:1.34:14:10:30
65478263:banana:1.23:23:10:30
27364729:apple:4.23:42:10:30
28912382:orange:1.12:16:10:30
12892829:elephant:6.45:14:10:30
I want to replace the values in the fourth column with the numbers in the sixth column if, after a certain transaction, they are below the numbers in the fifth column. How would I replace the items in the fourth column?
Every time I use the lines of code below, the whole file is overwritten with nothing (everything is deleted):
for line in stockfile:
    c = line.split(":")
    print("pass")
    if stock_order[i] == User_list[i][0]:
        stockfile.write(line.replace(current_stocklevel_list[i], reorder_order[i]))
    else:
        i = i + 1
I want the stockfile to look like this after it has replaced the necessary items in the column:
12334232:seat belt:2.30:30:10:30
14312332:toy card:3.40:30:10:30
12512312:xbox one:5.30:30:10:30
12543243:laptop:1.34:30:10:30
65478263:banana:1.23:30:10:30
27364729:apple:4.23:30:10:30
28912382:orange:1.12:30:10:30
12892829:elephant:6.45:30:10:30
If you are reopening the file later, you should use "a" (append) mode so that the file doesn't get truncated.
The write pointer will automatically be at the end of the file.
So:
f = open("filename", "a")
f.seek(0) # To start from beginning
But if you want to both read and write, add "+" to the mode and the file won't be truncated either.
f = open("filename", "r+")
Both the read and write pointers will be at the beginning of the file; you'll need to seek to the position where you wish to start writing/reading.
But you are doing it wrong.
See, a file's content is overwritten in place, not inserted automatically. Content is only added when you are in a writable mode and positioned at the end of the file.
So you either need to load the whole file, make the changes you need, and write everything back, or you have to write the changes at some point and shift the remaining content, truncating the file if the new content is shorter than before.
The mmap module can help you treat the file as a string. With it you can efficiently shift data and resize the file.
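For instance, here is a minimal sketch (the file name, offset, and payload are hypothetical) of inserting bytes into the middle of a file with mmap, by growing the map and shifting the tail forward:
import mmap

insert_at = 10           # hypothetical byte offset of the insertion point
payload = b"NEW DATA"    # hypothetical bytes to insert

with open("data.txt", "r+b") as f:  # hypothetical file name
    mm = mmap.mmap(f.fileno(), 0)
    old_size = mm.size()
    mm.resize(old_size + len(payload))  # grow the file to make room
    mm.move(insert_at + len(payload),   # destination: just past the gap
            insert_at,                  # source: the original position
            old_size - insert_at)       # length of the tail to shift
    mm[insert_at:insert_at + len(payload)] = payload
    mm.close()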
But if you really want to change the file in place, you should give the file fixed-length columns. Then, when you want to change a value, you don't need to shift anything back and forth: just find the right row and column, seek there, and write the new value over the old one (making sure to overwrite all of the old value).
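As a sketch of that fixed-length idea (the record layout below is invented for illustration), overwriting a field in place then looks like this:
RECORD_LEN = 40      # hypothetical bytes per row, newline included
FIELD_OFFSET = 25    # hypothetical byte offset of the field within a row
FIELD_LEN = 5        # hypothetical fixed width of the field

def overwrite_field(path, row_index, new_value):
    # Pad or trim the new value to exactly the field width.
    padded = ('%-*s' % (FIELD_LEN, new_value))[:FIELD_LEN]
    with open(path, 'r+b') as f:
        f.seek(row_index * RECORD_LEN + FIELD_OFFSET)
        f.write(padded.encode('ascii'))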
You should try to read in the data first:
with open('inputfile', 'r') as infile:
    data = infile.readlines()
Then you can loop over the data and edit as needed and write it out:
import random

with open('outputfile', 'w') as outfile:
    for line in data:
        c = line.rstrip('\n').split(":")
        if random.randint(1, 3) == 1:
            # update the fourth column based on some good reason;
            # the fields are strings, so convert before adding
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')
Or you could do it on the fly with something like:
import os
import random

with open('inputfile', 'r') as infile, open('outputfile', 'w') as outfile:
    for line in infile:
        c = line.rstrip('\n').split(":")
        if random.randint(1, 3) == 1:
            # update the fourth column based on some good reason
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')
os.rename('outputfile', 'inputfile')
Is there a way to do this? Say I have a file that's a list of names that goes like this:
Alfred
Bill
Donald
How could I insert the third name, "Charlie", at line x (in this case 3), and automatically send all others down one line? I've seen other questions like this, but they didn't get helpful answers. Can it be done, preferably with either a method or a loop?
This is a way of doing the trick.
with open("path_to_file", "r") as f:
contents = f.readlines()
contents.insert(index, value)
with open("path_to_file", "w") as f:
contents = "".join(contents)
f.write(contents)
index and value are the line and value of your choice, lines starting from 0.
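For the example in the question, a hypothetical usage would be:
index = 2             # 0-based: insert before the current third line
value = "Charlie\n"   # include the newline so the lines don't merge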
If you want to search a file for a substring and add a new text to the next line, one of the elegant ways to do it is the following:
import os
import fileinput

old = "A"
new = "B"

for line in fileinput.FileInput(file_path, inplace=True):
    if old in line:
        line += new + os.linesep
    print(line, end="")
There is a combination of techniques which I found useful in solving this issue:
with open(file, 'r+') as fd:
    contents = fd.readlines()
    contents.insert(index, new_string)  # new_string should end in a newline
    fd.seek(0)  # readlines consumes the iterator, so we need to start over
    fd.writelines(contents)  # no need to truncate as we are increasing the file size
In our particular application, we wanted to add it after a certain string:
with open(file, 'r+') as fd:
    contents = fd.readlines()
    if match_string in contents[-1]:  # handle the last line to prevent an IndexError
        contents.append(insert_string)
    else:
        for index, line in enumerate(contents):
            if match_string in line and insert_string not in contents[index + 1]:
                contents.insert(index + 1, insert_string)
                break
    fd.seek(0)
    fd.writelines(contents)
If you want it to insert the string after every instance of the match, instead of just the first, remove the else: (and properly unindent) and the break.
Note also that the and insert_string not in contents[index + 1]: prevents it from adding more than one copy after the match_string, so it's safe to run repeatedly.
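Here is a sketch of that every-match variant, iterating over a snapshot of the list so the insertions don't disturb the loop (the offset bookkeeping is my addition, not part of the original answer):
with open(file, 'r+') as fd:
    contents = fd.readlines()
    offset = 0  # how many lines have been inserted so far
    for index, line in enumerate(list(contents)):  # iterate over a snapshot
        if match_string in line:
            target = index + 1 + offset
            # keep the guard from above so repeated runs stay safe
            if target >= len(contents) or insert_string not in contents[target]:
                contents.insert(target, insert_string)
                offset += 1
    fd.seek(0)
    fd.writelines(contents)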
You can just read the data into a list and insert the new record where you want.
names = []
with open('names.txt', 'r+') as fd:
    for line in fd:
        names.append(line.split(' ')[-1].strip())
    names.insert(2, "Charlie")  # element 2 will be "3." in your list
    fd.seek(0)
    fd.truncate()
    for i in range(len(names)):
        fd.write("%d. %s\n" % (i + 1, names[i]))
The accepted answer has to load the whole file into memory, which doesn't work nicely for large files. The following solution writes the file contents with the new data inserted into the right line to a temporary file in the same directory (so on the same file system), only reading small chunks from the source file at a time. It then overwrites the source file with the contents of the temporary file in an efficient way (Python 3.8+).
from pathlib import Path
from shutil import copyfile
from tempfile import NamedTemporaryFile

sourcefile = Path("/path/to/source").resolve()
insert_lineno = 152  # The line to insert the new data into.
insert_data = "..."  # Some string to insert.

with sourcefile.open(mode="r") as source:
    destination = NamedTemporaryFile(mode="w", dir=str(sourcefile.parent))
    lineno = 1
    while lineno < insert_lineno:
        destination.file.write(source.readline())
        lineno += 1
    # Insert the new data.
    destination.file.write(insert_data)
    # Write the rest in chunks.
    while True:
        data = source.read(1024)
        if not data:
            break
        destination.file.write(data)

# Finish writing the data.
destination.flush()
# Overwrite the original file's contents with that of the temporary file.
# This uses a memory-optimised copy operation starting from Python 3.8.
copyfile(destination.name, str(sourcefile))
# Delete the temporary file.
destination.close()
EDIT 2020-09-08: I just found an answer on Code Review that does something similar to above with more explanation - it might be useful to some.
You don't show us what the output should look like, so one possible interpretation is that you want this as the output:
Alfred
Bill
Charlie
Donald
(Insert Charlie, then add 1 to all subsequent lines.) Here's one possible solution:
def insert_line(input_stream, pos, new_name, output_stream):
    inserted = False
    for line in input_stream:
        number, name = parse_line(line)
        if number == pos:
            print(format_line(number, new_name), file=output_stream)
            inserted = True
        print(format_line(number if not inserted else (number + 1), name), file=output_stream)

def parse_line(line):
    number_str, name = line.strip().split()
    return (get_number(number_str), name)

def get_number(number_str):
    return int(number_str.split('.')[0])

def format_line(number, name):
    return add_dot(number) + ' ' + name

def add_dot(number):
    return str(number) + '.'

input_stream = open('input.txt', 'r')
output_stream = open('output.txt', 'w')
insert_line(input_stream, 3, 'Charlie', output_stream)
input_stream.close()
output_stream.close()
Parse the file into a Python list using file.readlines() or file.read().split('\n')
Identify the position where you have to insert a new line, according to your criteria.
Insert a new list element there using list.insert().
Write the result to the file.
location_of_line = 0
with open(filename, 'r') as file_you_want_to_read:
    # read the file's lines into a list
    contents = file_you_want_to_read.readlines()

# find the location of the line you want to insert after
for index, line in enumerate(contents):
    if line.startswith('whatever you are looking for'):
        location_of_line = index

# now you have a list of every line in that file;
# insert *after* the matched line, hence the + 1
contents.insert(location_of_line + 1, "whatever you want to insert into the middle of the file\n")

with open(filename, 'w') as file_to_write_to:
    file_to_write_to.writelines(contents)
That is how I ended up inserting whatever data I want into the middle of a file.
This is close to pseudocode, as I was having a hard time finding a clear explanation of what is going on.
Essentially, you read the file in its entirety into a list, insert the lines you want into that list, and then rewrite the same file.
I am sure there are better ways to do this, and it may not be efficient, but it makes more sense to me at least; I hope it makes sense to someone else.
A simple but not efficient way is to read the whole content, change it and then rewrite it:
line_index = 3
lines = None

with open('file.txt', 'r') as file_handler:
    lines = file_handler.readlines()

lines.insert(line_index, 'Charlie\n')  # include the newline so lines don't merge

with open('file.txt', 'w') as file_handler:
    file_handler.writelines(lines)
I wrote this to reuse/correct martincho's answer (the accepted one).
! IMPORTANT: This code loads the whole file into RAM and rewrites the content to the file.
The variables index and value may be whatever you desire, but make sure value is a string ending with '\n' if you don't want it to mess with the existing data.
with open("path_to_file", "r+") as f:
# Read the content into a variable
contents = f.readlines()
contents.insert(index, value)
# Reset the reader's location (in bytes)
f.seek(0)
# Rewrite the content to the file
f.writelines(contents)
See the Python docs on the file.seek() method.
Below is a slightly awkward solution for the special case in which you are creating the original file yourself and happen to know the insertion location (e.g. you know ahead of time that you will need to insert a line with an additional name before the third line, but won't know the name until after you've fetched and written the rest of the names). Reading, storing and then re-writing the entire contents of the file as described in other answers is, I think, more elegant than this option, but may be undesirable for large files.
You can leave a buffer of invisible null characters ('\0') at the insertion location to be overwritten later:
num_names = 1_000_000  # Enough data to make storing in a list unideal
max_len = 20           # The maximum allowed length of the inserted line
line_to_insert = 2     # The third line is at index 2 (0-based indexing)

with open(filename, 'w+') as file:
    for i in range(line_to_insert):
        name = get_name(i)  # Returns 'Alfred' for i = 0, etc.
        file.write(f'{i + 1}. {name}\n')
    insert_position = file.tell()  # Position to jump back to for insertion
    file.write('\0' * max_len + '\n')  # Buffer will show up as a blank line
    for i in range(line_to_insert, num_names):
        name = get_name(i)
        file.write(f'{i + 2}. {name}\n')  # Line numbering now bumped up by 1.

# Later, once you have the name to insert...
with open(filename, 'r+') as file:  # Must use 'r+' to write to the middle of a file
    file.seek(insert_position)  # Move the stream to the insertion line
    name = get_bonus_name()  # This lucky winner jumps up to 3rd place
    new_line = f'{line_to_insert + 1}. {name}'
    file.write(new_line[:max_len])  # Slice so you don't overwrite the next line
Unfortunately there is no way to delete-without-replacement any excess null characters that did not get overwritten (or in general any characters anywhere in the middle of a file), unless you then re-write everything that follows. But the null characters will not affect how your file looks to a human (they have zero width).
I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple paths that address one of the issues only to realize that it doesn't handle another aspect of the malformed data. I would appreciate if anyone could please help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv

inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')

counter = 1
for line in inputfile:
    # Skip the header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)

inputfile.close()
outputfile_1.close()
with open('output.csv', 'r') as f:
    text = f.read()
    comma_count = text.count(',')  # comma_count/6 = total number of rows
    # need to insert a newline after the field contents after every 6th comma
    # unfortunately the last field of the row and the first field of the next row
    # are now rammed up together because of the newline replaces above...
    # then process as normal CSV

    # one path I started to go down... but this isn't even functional
    groups = text.split(',')
    counter2 = 1
    while counter2 <= comma_count / 6:
        line = ','.join(groups[:(6 * counter2)]), ','.join(groups[(6 * counter2):])
        print line
        counter2 = counter2 + 1
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

comma_indices = [m.start() for m in re.finditer(',', text)]  # find all the commas - the fields are between them

cursor = 0
field_counter = 1
row_count = 0
csv_row = []
for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields: the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').strip()
        csv_row.extend([next_field_trimmed])  # add the last field of this row
        # Reset the cursor to the middle of the chunk (after the last field and before the next),
        # and set a flag that we need to start the next csv_row before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').strip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').strip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1

# Write the last row
wr.writerow(csv_row)
outputfile_1.close()

# Process output.csv as a normal CSV file...
This is a comment about how I would tackle this.
For each line, I can easily identify the start and end of certain groups:
Hostname - there is only one
Usernames - read these until you meet something that does not look like a username (comma delimited)
IP addresses - read these until you meet a timestamp, identified with a pattern match; be aware these are separated by spaces rather than commas, and the end of the group is marked by the trailing comma
Timestamp - easy to identify with a pattern match
test1, test2, test3 - certain to be there as comma-delimited fields
Notes: I would use the 'pattern' matches to verify that I have the correct thing in the correct place; it enables spotting errors sooner. A sketch of such pattern checks follows.
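As a rough sketch of those pattern matches (the regexes below are assumptions derived from the sample data in the question, not a tested implementation):
import re

# Hypothetical field classifiers based on the sample rows above.
TIMESTAMP_RE = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}$')
IP_RE = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')

def looks_like_timestamp(field):
    return bool(TIMESTAMP_RE.match(field.strip()))

def looks_like_ip_list(field):
    # IP addresses in this data are space-separated within a single field
    parts = field.strip().split()
    return bool(parts) and all(IP_RE.match(p) for p in parts)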
From your data excerpt it seems like any line that starts with a comma needs to be joined to the preceding line and any line starting with anything other than a comma marks a new row.
If that's the case, then you could use something like the following code to clean up the CSV file so that the standard library csv parser can handle it.
#!/usr/bin/python
raw_data = 'somefilename.raw'
csv_data = 'somefilename.csv'

with open(raw_data, 'Ur') as inp, open(csv_data, 'wb') as out:
    row = list()
    for line in inp:
        line = line.rstrip('\n')  # assign the result; rstrip() does not modify in place
        if line.startswith(','):
            row.append(line)
        else:
            if row:  # don't emit a blank line before the very first row
                out.write(''.join(row) + '\n')
            row = list()
            row.append(line)
    # Don't forget to write the last row!
    out.write(''.join(row) + '\n')
This is a miniature state machine ... accumulating lines into each row until we find a line that doesn't start with a comma, writing the previous row and so on.
I've got a little script which is not working nicely for me; I hope you can help me find the problem.
I have two starting files:
traveltimes: contains the lines I need; it's a column file (every row has just one number). The groups of lines I need are separated by a line which starts with 11 spaces.
header lines: contains three header lines
output_file: I want to get 29 files (STA%s). What's inside? Every file will contain the same header lines, after which I want to append the group of lines contained in the traveltimes file (a different group of lines for every file). Every group of lines is made of 74307 rows (1 column).
So far this script creates 29 files with the same header lines, but then it mixes everything up; it writes something, but it's not what I want.
Any ideas?
def make_station_files(traveltimes, header_lines):
    """Gives the STAxx.tgrid files required by loc3d"""
    sta_counter = 1
    with open(header_lines, 'r') as file_in:
        data = file_in.readlines()
        for i in range(29):
            with open('STA%s' % (sta_counter), 'w') as output_files:
                sta_counter += 1
                for i in data[0:3]:
                    values = i.strip()
                    output_files.write("%s\n\t1\n" % (values))
                with open(traveltimes, 'r') as times_file:
                    #collector = []
                    for line in times_file:
                        if line.startswith(" "):
                            break
                        output_files.write("%s" % (line))
Suggestion:
Read the header rows first. Make sure this works before proceeding. None of the rest of the code needs to be indented under this.
Consider writing a separate function to group the traveltimes file into a list of lists.
Once you have a working traveltimes reader and grouper, only then create a new STA file, print the headers to it, and then write the timegroups to it.
Build your program up step-by-step, making sure it does what you expect at each step. Don't try to do it all at once because then you won't easily be able to track down where the issue lies.
My quick edit of your script uses itertools.groupby() as a grouper. It is a little advanced because the grouping function is stateful and tracks its state in a mutable list:
from itertools import groupby

def make_station_files(traveltimes, header_lines):
    'Gives the STAxx.tgrid files required by loc3d'
    with open(header_lines, 'r') as f:
        headers = f.readlines()

    def station_counter(line, cnt=[1]):
        'Stateful station counter -- keeps the count in a mutable list'
        if line.strip() == '':
            cnt[0] += 1
        return cnt[0]

    with open(traveltimes, 'r') as times_file:
        for station, group in groupby(times_file, station_counter):
            with open('STA%s' % (station), 'w') as output_file:
                for header in headers[:3]:
                    output_file.write('%s\n\t1\n' % (header.strip()))
                for line in group:
                    if not line.startswith(' '):
                        output_file.write('%s' % (line))
This code is untested because I don't have sample data. Hopefully, you'll get the gist of it.