Reformatting a txt file with characters at index positions using python - python

Very newbie programmer asking a question here. I have searched all over the forums but can't find something to solve this issue I thought there would be a simple function for. Is there a way to do this?
I am trying to reformat a text file so I can use it with the pandas function but this requires my data to be in a specific format.
Currently my data is in the following format of a txt file with over 1000 lines of data:
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
]["01/09/21","01:28",6.9,74,2.6,4.1,4.1,248,0.0,0.0,1025.8,81.9,17.0,44,4.1,4.1,6.9,0,0,0.00,0.00,2.5,0,0.0,248,0.0,0.0
I need it as
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0]
This requires adding a [" at the start and adding a " at the end of the date before the comma, then adding another " after the comma and another " at the end of the time section. At the end of the line, I also need to add a ], at the end.
I thought something like this would work, but the closing bracket appears after the line break (\n). Is there any way to avoid this?
infile=open(infile)
outfile=open(outfile, 'w')
def format_line(line):
    elements = line.split(',') # break the comma-separated data up
    for k in range(2):
        elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
        print(elements[k])
    new_line = ','.join(elements) # put them back together
    return '[' + new_line + ']' # add the brackets
for line in infile:
    outfile.write(format_line(line))
outfile.close()

You are referring to a function before it is defined.
Move the definition of format_line before it is called in the for loop.
When I rearranged your code it seems to work.
New code:
infile=open("inputfile") # placeholder name, like "outputfile"
outfile=open("outputfile","w")
def format_line(line):
    elements = line.split(',') # break the comma-separated data up
    for k in range(2):
        elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
    new_line = ','.join(elements) # put them back together
    return '[' + new_line + ']' # add the brackets
for line in infile:
    outfile.write(format_line(line)) # write the result instead of discarding it
outfile.close()
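The bracket landing after the line break is most likely because each line read from a file still ends with its "\n", so the last element of the split carries the newline and the "]" is appended after it. A minimal sketch of one way around that, assuming the raw lines look like 01/09/21,00:28,7.1,75 with no quotes or brackets yet (reformat_file, in_path and out_path are hypothetical names used only for illustration):

```python
def format_line(line):
    # Drop the trailing newline first so the closing bracket stays on the same line.
    elements = line.rstrip("\n").split(",")
    for k in range(2):
        # Put quotes around the date and the time (the first two fields).
        elements[k] = '"' + elements[k] + '"'
    return "[" + ",".join(elements) + "]"


def reformat_file(in_path, out_path):
    # in_path and out_path are placeholder names, not taken from the question.
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            if line.strip():  # skip blank lines
                outfile.write(format_line(line) + "\n")
```

The newline is added back explicitly in write(), after the bracket. If a trailing comma is wanted after each bracket, as the question text suggests, the return value could end in "]," instead.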

Related

Python script to remove certain things from a string

I have a file with many lines like this:
>6_KA-RFNB-1505/2021-EPI_ISL_8285588-2021-12-02
I need to convert it to
>6_KA_2021-1202
All of the lines that require this change start with a >
The 6_KA and the 2021-12-02 are different for all lines.
I also need to add an empty line before every line that I change in this manner.
UPDATE: You changed the requirements after I originally answered your post, but the code below does what you are looking for. The principle remains the same: use a regex to identify the parts of the string you want to keep, then, as you go through each line of the file, build a new string from the values the regex parsed out.
import re
# raw string so the backslash reaches the regex engine unchanged
regex = re.compile(r'>(?P<first>[0-9a-zA-Z]{1,3}_[0-9a-zA-Z]{1,3}).*(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})\n')
def convert_file(inputFile):
    with open(inputFile, 'r') as input, open('Output.txt', 'w') as output:
        for line in input:
            text = regex.match(line)
            if text:
                # empty line first, then the shortened header (keeping the leading >)
                output.write("\n>" + text.group("first") + '_' + text.group("year") + "-" + text.group("month") + text.group("day") + "\n")
            else:
                output.write(line)
convert_file('data.txt')
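The same idea can be tried on single strings before touching any file. This is only a sketch: shorten_header and the HEADER pattern are hypothetical, and the pattern assumes the part to keep runs from the > up to the first hyphen or slash after the underscore, and that the date is the last thing on the line.

```python
import re

# Hypothetical pattern: prefix such as "6_KA", anything in between, then YYYY-MM-DD at the end.
HEADER = re.compile(r"^>([^_]+_[^-/]+).*(\d{4})-(\d{2})-(\d{2})\s*$")


def shorten_header(line):
    match = HEADER.match(line)
    if match is None:
        return line  # not a header line, leave it untouched
    prefix, year, month, day = match.groups()
    # Leading "\n" gives the empty line before every changed header.
    return f"\n>{prefix}_{year}-{month}{day}\n"
```

Because the pattern ends in \s*$ rather than a literal newline, a final line without a trailing newline is still converted.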

loading a text file with complex numbers using i instead of j

I'm a new python programmer so excuse me if it was a silly problem.
I'm loading a txt file containing complex numbers. This is a 2x4 sample from the actual, much larger txt file (i is used as the imaginary unit instead of j):
0.633399474768199 - 0.0175109522504542i 0.337208501994460 + 0.00414157519417569i 0.462845433000816 + 0.0311199272434047i 0.248496359856117 + 0.000929998413548307i
0.633719938420320 - 0.0168830372084714i 0.364374358580293 + 0.0247026480558120i 0.460808199213633 + 0.0346904985858835i 0.251160695519198 - 0.00257247233248499i
I tried to load the file using:
data = np.loadtxt(path, dtype=np.complex_)
Apparently the error is only solved when I delete all the spaces before and after the + and - between the real part and imaginary part of every value, and I also need to replace i with j:
0.633399474768199-0.0175109522504542j 0.337208501994460+0.00414157519417569j
I can do this manually (not an option for large data), but is there an easier way to load it? I'm not sure how to delete the spaces around + and - without affecting the spaces between separate values, which are not consistent: some values have more spaces between them than others. Here is an example of three values with different spacing between them:
0.633830049713846 - 0.0164809219396847i 0.375552117859690 + 0.00970977484227810i 0.473980903316675 + 0.0360707252275126i
The simplest thing to do is probably to write a new file in the desired format. For instance, you could do:
with open(yourinputfile, 'rt') as f, open('output.txt', 'wt') as g:
    for line in f:
        pairs = [k.replace(' ', '') for k in line.split('i')][:-1]
        g.write('j '.join(pairs) + 'j\n')
The expression line.split('i') divides your input string at each i character. For instance, if the line is
'0.633399474768199 - 0.0175109522504542i 0.337208501994460 + 0.00414157519417569i 0.462845433000816 + 0.0311199272434047i 0.248496359856117 + 0.000929998413548307i'
line.split('i') would produce the list of strings
['0.633399474768199 - 0.0175109522504542', ' 0.337208501994460 + 0.00414157519417569', ' 0.462845433000816 + 0.0311199272434047', ' 0.248496359856117 + 0.000929998413548307', '']
Note the empty string at the end of that list (for a line read from a file, that last element holds just the newline). The [:-1] slice picks up all the strings except that last one, and k.replace(' ', '') removes all the spaces in each pair.
Then 'j '.join(pairs) + 'j\n' uses the join method of the string class to tack a 'j ' (note the space) onto the end of each of the pairs except for the last. We tack a j and a newline onto the last one, and then write the result to the new file.
After doing this, you can use np.loadtxt as you originally intended.
An alternative solution without creating a new file:
import numpy as np
with open(path, "r") as f:
    lines = f.readlines()
data = []
for line in lines:
    # remove spaces before and after "+" and split the line around the "i" character
    line2 = [elem.replace(" + ", "+") for elem in line.split("i")[:-1]]
    # same for "-"
    line2 = [elem.replace(" - ", "-") for elem in line2]
    # strip leftover surrounding spaces and add a "j" at the end of each element
    line2 = [elem.strip() + "j" for elem in line2]
    data.append(line2)
# convert to a complex numpy ndarray
data = np.array(data, dtype=np.complex128)
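Since the spacing between values varies, another option is to let a regular expression pick out each value and build the complex numbers directly. The sketch below uses only the standard library; parse_complex_line and the VALUE pattern are hypothetical names, and the pattern assumes every value has the form "real sign imag" followed by i, with plain decimal or exponent notation.

```python
import re

# One value such as "0.63 - 0.017i": real part, sign, imaginary part, then "i".
VALUE = re.compile(
    r"([+-]?[\d.]+(?:[eE][+-]?\d+)?)\s*([+-])\s*([\d.]+(?:[eE][+-]?\d+)?)i"
)


def parse_complex_line(line):
    # findall returns one (real, sign, imag) tuple per value, whatever the spacing.
    return [
        complex(float(real), float(sign + imag))
        for real, sign, imag in VALUE.findall(line)
    ]
```

Applying parse_complex_line to every line gives a list of rows of Python complex numbers, which can then be handed to np.array if a NumPy array is needed.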

Minor bug in code written to format a text file (incorrect spacing) (Python 3)

New to coding so sorry if this is a silly question.
I have some text that I'm attempting to format to make it more pleasant to read, so I tried my hand at writing a short program in Python to do it for me. I initially removed extra paragraph breaks in MS-Word using the find-and-replace option. The input text looks something like this:
This is a sentence. So is this one. And this.
(empty line)
This is the next line
(empty line)
and some lines are like this.
I want to eliminate all empty lines, so that there is no spacing between lines, and ensure no sentences are left hanging mid-way like in the bit above. All new lines should begin with 2 (two) empty spaces, represented by the $ symbol below. So after formatting it should look something like this:
$$This is a sentence. So is this one. And this.
$$This is the next line and some lines are like this.
I wrote the following script:
import os
directory = "C:/Users/DELL/Desktop/"
filename = "test.txt"
path = os.path.join(directory, filename)
with open(path,"r") as f_in, open(directory+"output.txt","w+") as f_out:
    temp = "  "
    for line in f_in:
        curr_line = line.strip()
        temp += curr_line
        #print("Current line:\n%s\n\ntemp line: %s" % (curr_line, temp))
        if curr_line:
            if temp[-1]==".": #check if sentence is complete
                f_out.write(temp)
                temp = "\n  " #two blank spaces here
It eliminates all blank lines, indents new lines by two spaces, and conjoins hanging sentences, but doesn't insert the necessary blank space - so the output currently looks like (missing space between the words line and and).
$$This is a sentence. So is this one. And this.
$$This is the next lineand some lines are like this.
I tried to fix this by changing the following lines of code to read as follows:
temp += " " + curr_line
temp = "\n " #one space instead of two
and that doesn't work, and I'm not entirely sure why. It might be an issue with the text itself but I'll check on that.
Any advice would be appreciated, and if there is a better way to do what I want than this convoluted mess that I wrote, then I would like to know that as well.
EDIT: I seem to have fixed it. In my text (very long so I didn't notice it at first) there were two lines separated by 2 (two) empty lines, and so my attempt at fixing it didn't work. I moved one line a bit further below to give the following loop, which seems to have fixed it:
    for line in f_in:
        curr_line = line.strip()
        #print("Current line:\n%s\n\ntemp line: %s" % (curr_line, temp))
        if curr_line:
            temp += " " + curr_line
            if temp[-1]==".": #check if sentence is complete
                f_out.write(temp)
                temp = "\n "
I also saw that an answer below initially had a bit of Regex in it, I'll have to learn that at some point in the future I suppose.
Thanks for the help everyone.
This should work. It's effectively the same as yours but a bit more efficient. It doesn't use repeated string concatenation with + and += (which can be slow) but instead saves incomplete lines in a list. It then writes 2 spaces, the incomplete pieces joined by spaces, and then a newline. This simplifies things by only writing when a line is complete.
temp = []
with open(path_in, "r") as f_in, open(path_out, "w") as f_out:
    for line in f_in:
        curr_line = line.strip()
        if curr_line:
            temp.append(curr_line)
            if curr_line.endswith('.'): # write our line
                f_out.write('  ') # two spaces
                f_out.write(' '.join(temp))
                f_out.write('\n')
                temp.clear() # reset temp
outputs
  This is a sentence. So is this one. And this.
  This is the next line and some lines are like this.
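The same logic can be checked without any files by working on a list of lines. The sketch below is a variation, not the code from the answer: reflow is a hypothetical helper, and it additionally flushes any trailing text that never ends in a period, which the loop above would silently drop.

```python
def reflow(lines, indent="  "):
    """Join hanging lines into complete sentences, one per output line."""
    out, pending = [], []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip empty lines entirely
        pending.append(text)
        if text.endswith("."):  # sentence complete: emit it
            out.append(indent + " ".join(pending))
            pending = []
    if pending:  # leftover text with no final period
        out.append(indent + " ".join(pending))
    return out
```

Writing the result is then a single f_out.write("\n".join(reflow(f_in)) + "\n").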

How to remove extra space from end of the line before newline in python?

I'm quite new to python. I have a program which reads an input file with different characters and then writes all unique characters from that file into an output file with a single space between each of them. The problem is that after the last character there is one extra space (before the newline). How can I remove it?
My code:
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            if(c == ' '):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')
With the line outfile.write(' '), you write a space after each character (unless the character is a space). So you'll have to avoid writing the last space. Now, you can't tell whether any given character is the last one until you're done reading, so it's not like you can just put in an if statement to test that, but there are a few ways to get around that:
Write the space before the character c instead of after it. That way the space you have to skip is the one before the first character, and that you definitely can identify with an if statement and a boolean variable. If you do this, make sure to check that you get the right result if the first or second c is itself a space.
Alternatively, you can avoid writing anything until the very end. Just save up all the characters you see - you already do this in the list result - and write them all in one go. You can use
' '.join(strings)
to join together a list of strings (in this case, your characters) with spaces between them, and this will automatically omit a trailing space.
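A minimal sketch of that second approach, assuming whitespace characters (spaces, newlines) should not be written at all, which is what c.strip() effectively did in the original loop. unique_chars is a hypothetical helper name:

```python
def unique_chars(text):
    seen = []
    for c in text:
        # Skip whitespace; keep each other character the first time it appears.
        if not c.isspace() and c not in seen:
            seen.append(c)
    # join puts a space between items only, so there is no trailing space.
    return " ".join(seen)
```

With this, the script reduces to reading the whole input with infile.read() and writing unique_chars(...) followed by a newline.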
Why are you adding that if block on the end?
Your program is adding the extra space on the end.
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    charno = 0
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            charno += 1
            if (c == ' '):
                pass
            elif charno >= len(line):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')

.split() creating a blank line in python3

I am trying to convert a 'fastq' file in to a tab-delimited file using python3.
Here is the input: (line 1-4 is one record that i require to print as tab separated format). Here, I am trying to read in each record in to a list object:
#SEQ_ID
GATTTGGGGTT
+
!''*((((***
#SEQ_ID
GATTTGGGGTT
+
!''*((((***
using this:
data = open('sample3.fq')
fq_record = data.read().replace('#', ',#').split(',')
for item in fq_record:
    print(item.replace('\n', '\t').split('\t'))
Output is:
['']
['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '']
['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '', '']
I am getting a blank line at the beginning of the output, and I do not understand why.
I am aware that this can be done in many other ways, but I need to figure out the reason as I am learning Python.
Thanks
When you replace # with ,#, you put a comma at the beginning of the string (since it starts with #). Then when you split on commas, there is nothing before the first comma, so this gives you an empty string in the split. What happens is basically like this:
>>> print(',x'.split(','))
['', 'x']
If you know your data always begins with #, you can just skip the empty record in your loop. Just do for item in fq_record[1:].
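The effect is easy to reproduce on a short string. In the sketch below the sample text is made up for illustration; it also shows a filter that drops empty pieces, which works whether or not the data starts with #.

```python
text = "#SEQ_ID\nGATT\n+\n!''*\n#SEQ_ID\nGATT\n+\n!''*\n"

# Every '#' becomes ',#', so the text now starts with the separator itself.
pieces = text.replace("#", ",#").split(",")

# pieces[0] is the empty string that sits before the first comma.
first_piece = pieces[0]

# Keep only the non-empty pieces instead of relying on the [1:] slice.
records = [p for p in pieces if p]
```

Here first_piece is the empty string and records holds the two real records.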
You can also go line-by-line without all the replacing:
import io
fobj = io.StringIO("""#SEQ_ID
GATTTGGGGTT
+
!''*((((***
#SEQ_ID
GATTTGGGGTT
+
!''*((((***""")
data = []
entry = []
for raw_line in fobj:
    line = raw_line.strip()
    if line.startswith('#'):
        if entry:
            data.append(entry)
        entry = []
    entry.append(line)
data.append(entry)
data looks like this:
[['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"]]
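If every record is known to be exactly four lines long, the tab-delimited rows the question asks for can also be built by taking the lines four at a time. This is only a sketch under that assumption; fastq_to_rows is a hypothetical helper name.

```python
def fastq_to_rows(lines):
    # Keep only non-blank lines, without their line endings.
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    # Assumes four lines per record: id, sequence, '+', quality.
    return ["\t".join(cleaned[i:i + 4]) for i in range(0, len(cleaned), 4)]
```

Unlike splitting on the marker character, this does not break when a quality string happens to contain that character.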
Thank you all for your answers. As a beginner, my main problem was the occurrence of a blank line upon .split(',') which I have now understood conceptually. So my first useful program in python is here:
# this script converts a .fastq file in to .fasta format
import sys
# Usage statement:
print('\nUsage: fq2fasta.py input-file output-file\n=========================================\n\n')
# define a function for fasta formatting
def format_fasta(name, sequence):
    fasta_string = '>' + name + "\n" + sequence + '\n'
    return fasta_string
# open the file for reading
data = open(sys.argv[1])
# open the file for writing
fasta = open(sys.argv[2], 'wt')
# feed all fastq records in to a list
fq_records = data.read().replace('#', ',#').split(',')
# iterate through list objects
for item in fq_records[1:]: # this is to avoid the first line which is created as blank by .split() function
    line = item.replace('\n', '\t').split('\t')
    name = line[0]
    sequence = line[1]
    fasta.write(format_fasta(name, sequence))
fasta.close()
Other things suggested in the answers would be more clear to me as I learn more.
Thanks again.
