I'm a new python programmer so excuse me if it was a silly problem.
I'm loading a txt file containing complex numbers. This is a 2x4 sample from the actual large file txt file (i is used as the imaginary number instead of j):
0.633399474768199 - 0.0175109522504542i 0.337208501994460 + 0.00414157519417569i 0.462845433000816 + 0.0311199272434047i 0.248496359856117 + 0.000929998413548307i
0.633719938420320 - 0.0168830372084714i 0.364374358580293 + 0.0247026480558120i 0.460808199213633 + 0.0346904985858835i 0.251160695519198 - 0.00257247233248499i
tried to load the file using:
data = np.loadtxt(path, dtype=np.complex_)
appearently the error is only solved when I delete all the spaces before and after + and - between the real part and imaginary part for all values, and I also need to replace i by j.
0.633399474768199-0.0175109522504542j 0.337208501994460+0.00414157519417569j
I can do this manually (not an option for large data), is there any easier way to load it? Becase I'm not sure how to delete the spaces before and after + and -, without affecting the spaces between separate values, which is not consistance between all values, some values got more spaces between them than other values, example of three values with different spaces between them:
0.633830049713846 - 0.0164809219396847i 0.375552117859690 + 0.00970977484227810i 0.473980903316675 + 0.0360707252275126i
The simplest thing to do is probably write a new file with the desired format. For instance, you could do:
with open(yourinputfile, 'rt') as f, open('output.txt', 'wt') as g:
for line in f:
pairs = [k.replace(' ', '') for k in line.split('i')][:-2]
g.write('j '.join(pairs) + 'j\n')
The expression line.split('i') divides your input string at each i character. For instance, if the line is
'0.633399474768199 - 0.0175109522504542i 0.337208501994460 + 0.00414157519417569i 0.462845433000816 + 0.0311199272434047i 0.248496359856117 + 0.000929998413548307i'
split(i) would produce the list of strings
['0.633399474768199 - 0.0175109522504542', ' 0.337208501994460 + 0.00414157519417569', ' 0.462845433000816 + 0.0311199272434047', ' 0.248496359856117 + 0.000929998413548307', '']
Note the empty string at the end of that list. The [:-2] is a way to pick up all the strings except that last one. And then k.replace(' ', '') removes all the spaces in each pair.
Then 'j '.join(pairs) + 'j\n' uses the join method of the string class to tack a j (note the space) onto the end of each of the pairs except for the last. We tack on a j and a newline to the last one, and then write that to a new file.
After doing this, you can use np.loadtxt as you originally intended.
An alternative solution without creating a new file :
import numpy as np
with open(path, "r") as f:
lines = f.readlines()
data = []
for line in lines:
#remove spaces before and after "+" and create a list around "i" character
line2 = [elem.replace(" + ", "+") for elem in line.split("i")[:-1:]]
#id with "-"
line2 = [elem.replace(" - ", "-") for elem in line2]
# add a "j" character at the end of each element
line2 = [elem+"j" for elem in line2]
data.append(line2)
#convert to a complex numpy ndarray
data = np.array(data, dtype=np.complex128)
Related
i have a file with many lines like this,
>6_KA-RFNB-1505/2021-EPI_ISL_8285588-2021-12-02
i need to convert it to
>6_KA_2021-1202
all of the lines that require this change start in a >
The 6_KA and the 2021-12-02 are different for all lines.
I also need to add an empty line before every line that i change in thsi manner.
UPDATE: You changed the requirements from when I originally answered yourpost, but the below does what you are looking for. The principle remains the same: use regex to identify the parts of the string you are looking to replace. And then as you are going thru each line of the file create a new string based on the values you parsed out from the regex
import re
regex = re.compile('>(?P<first>[0-9a-zA-Z]{1,3}_[0-9a-zA-Z]{1,3}).*(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})\n')
def convert_file(inputFile):
with open(inputFile, 'r') as input, open('Output.txt', 'w') as output:
for line in input:
text = regex.match(line)
if text:
output.write("\n" + text.group("first") + '_' + text.group("year") + "-" + text.group("month") + text.group("day") + "\n")
else:
output.write(line)
convert_file('data.txt')
Very newbie programmer asking a question here. I have searched all over the forums but can't find something to solve this issue I thought there would be a simple function for. Is there a way to do this?
I am trying to reformat a text file so I can use it with the pandas function but this requires my data to be in a specific format.
Currently my data is in the following format of a txt file with over 1000 lines of data:
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
]["01/09/21","01:28",6.9,74,2.6,4.1,4.1,248,0.0,0.0,1025.8,81.9,17.0,44,4.1,4.1,6.9,0,0,0.00,0.00,2.5,0,0.0,248,0.0,0.0
I need it as
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0]
This requires adding a [" at the start and adding a " at the end of the date before the comma, then adding another " after the comma and another " at the end of the time section. At the end of the line, I also need to add a ], at the end.
I thought something like this would work but the second bracket appears after the line break (\n) is there any way to avoid this?
infile=open(infile)
outfile=open(outfile, 'w')
def format_line(line):
elements = line.split(',') # break the comma-separated data up
for k in range(2):
elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
print(elements[k])
new_line = ','.join(elements) # put them back together
return '[' + new_line + ']' # add the brackets
for line in infile:
outfile.write(format_line(line))
outfile.close()
You are referring to a function before it is defined.
Move the definition of format_line before it is called in the for loop.
When I rearranged your code it seems to work.
New code:
outfile=open("outputfile","w")
def format_line(line):
elements = line.split(',') # break the comma-separated data up
for k in range(2):
elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
new_line = ','.join(elements) # put them back together
return '[' + new_line + ']' # add the brackets
for line in infile:
format_line(line)
I'm quite new to python. I have a program which reads an input file with different characters and then writes all unique characters from that file into an output file with a single space between each of them. The problem is that after the last character there is one extra space (before the newline). How can I remove it?
My code:
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
for c in line:
if c not in result:
result.append(c)
outfile.write(c.strip())
if(c == ' '):
pass
else:
outfile.write(' ')
outfile.write('\n')
With the line outfile.write(' '), you write a space after each character (unless the character is a space). So you'll have to avoid writing the last space. Now, you can't tell whether any given character is the last one until you're done reading, so it's not like you can just put in an if statement to test that, but there are a few ways to get around that:
Write the space before the character c instead of after it. That way the space you have to skip is the one before the first character, and that you definitely can identify with an if statement and a boolean variable. If you do this, make sure to check that you get the right result if the first or second c is itself a space.
Alternatively, you can avoid writing anything until the very end. Just save up all the characters you see - you already do this in the list result - and write them all in one go. You can use
' '.join(strings)
to join together a list of strings (in this case, your characters) with spaces between them, and this will automatically omit a trailing space.
Why are you adding that if block on the end?
Your program is adding the extra space on the end.
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
charno = 0
for c in line:
if c not in result:
result.append(c)
outfile.write(c.strip())
charno += 1
if (c == ' '):
pass
elif charno => len(line):
pass
else:
outfile.write(' ')
outfile.write('\n')
I have to compress a file into a list of words and list of positions to recreate the original file. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation, using the map function my program can't convert my list of positions into floats because of the '[' as it is a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
line = text.readline()
if not line:
break
TwoList = line.split()
for word in TwoList:
if word not in CharactersUnique:
CharactersUnique.append(word)
ListOfPositions.append(CharactersUnique.index(word))
if not DownLine:
CharactersUnique.append("\n")
DownLine = True
ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
NewWordsUnique = f.readline()
f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats, since they list indices, and those must be integer? I suspected this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed the PEP 8 - Style Guide for Python Code. In particular, with how many (although not all) of the variable names are CamelCased, which according to the guidelines, should should be reserved for the class names.
In addition some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role: For example: CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they all would be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds one space before and after every "\n" character encountered that aren't there in the original file. To workaround that, I ended up writing out the for loop that recreates the file instead of using a list comprehension, because doing so allows the newline "words" could be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"] # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
for line in input_file:
for word in line.split():
if word not in unique_words:
unique_words.append(word)
word_positions.append(unique_words.index(word))
word_positions.append(unique_words.index("\n")) # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
words_repr = " ".join(repr(word) for word in unique_words)
compr_file.write(words_repr + "\n")
positions_repr = " ".join(repr(posn) for posn in word_positions)
compr_file.write(positions_repr + "\n")
def strip_quotes(word):
"""Strip the first and last characters from the string (assumed to be quotes)."""
tmp = word[1:-1]
return tmp if tmp != "\\n" else "\n" # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
line = compr_file.readline()
new_unique_words = list(map(strip_quotes, line.split()))
line = compr_file.readline()
new_word_positions = map(int, line.split()) # using int, not float here
words = []
lines = []
for posn in new_word_positions:
word = new_unique_words[posn]
if word != "\n":
words.append(word)
else:
lines.append(" ".join(words))
words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}
with open("speech.text") as f:
data = f.read()
# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
words = line.split(" ")
for j, word in enumerate(words):
if word in all_words:
all_words[word].append((i, j)) # line and pos
else:
all_words[word] = [(i, j)]
Note that this does not yield maximum compression as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully now you can use a similar approach to do so if desired.
I am trying to convert a 'fastq' file in to a tab-delimited file using python3.
Here is the input: (line 1-4 is one record that i require to print as tab separated format). Here, I am trying to read in each record in to a list object:
#SEQ_ID
GATTTGGGGTT
+
!''*((((***
#SEQ_ID
GATTTGGGGTT
+
!''*((((***
using this:
data = open('sample3.fq')
fq_record = data.read().replace('#', ',#').split(',')
for item in fq_record:
print(item.replace('\n', '\t').split('\t'))
Output is:
['']
['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '']
['#SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '', '']
I am geting a blank line at the begining of the output, which I do not understand why ??
I am aware that this can be done in so many other ways but I need to figure out the reason as I am learning python.
Thanks
When you replace # with ,#, you put a comma at the beginning of the string (since it starts with #). Then when you split on commas, there is nothing before the first comma, so this gives you an empty string in the split. What happens is basically like this:
>>> print ',x'.split(',')
['', 'x']
If you know your data always begins with #, you can just skip the empty record in your loop. Just do for item in fq_record[1:].
You can also go line-by-line without all the replacing:
fobj = io.StringIO("""#SEQ_ID
GATTTGGGGTT
+
!''*((((***
#SEQ_ID
GATTTGGGGTT
+
!''*((((***""")
data = []
entry = []
for raw_line in fobj:
line = raw_line.strip()
if line.startswith('#'):
if entry:
data.append(entry)
entry = []
entry.append(line)
data.append(entry)
data looks like this:
[['#SEQ_ID', 'GATTTGGGGTTy', '+', "!''*((((***"],
['#SEQ_ID', 'GATTTGGGGTTx', '+', "!''*((((***"]]
Thank you all for your answers. As a beginner, my main problem was the occurrence of a blank line upon .split(',') which I have now understood conceptually. So my first useful program in python is here:
# this script converts a .fastq file in to .fasta format
import sys
# Usage statement:
print('\nUsage: fq2fasta.py input-file output-file\n=========================================\n\n')
# define a function for fasta formating
def format_fasta(name, sequence):
fasta_string = '>' + name + "\n" + sequence + '\n'
return fasta_string
# open the file for reading
data = open(sys.argv[1])
# open the file for writing
fasta = open(sys.argv[2], 'wt')
# feed all fastq records in to a list
fq_records = data.read().replace('#', ',#').split(',')
# iterate through list objects
for item in fq_records[1:]: # this is to avoid the first line which is created as blank by .split() function
line = item.replace('\n', '\t').split('\t')
name = line[0]
sequence = line[1]
fasta.write(format_fasta(name, sequence))
fasta.close()
Other things suggested in the answers would be more clear to me as I learn more.
Thanks again.