Need to split list into a nested list - python

I'm trying to build sublists based on line endings and the # separator.
For example, the file contains:
#
2.1,-3.1
-0.7,4.1
#
3.8,1.5
-1.2,1.1
and the output needs to be:
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
but with this code:
results = []
fileToProcess = open("numerical.txt", "r")
for line in fileToProcess:
    results.append(line.strip().split(' '))
print(results)
I get:
[['#'], ['2.1', '-3.1'], ['-0.7', '4.1'], ['#'], ['3.8', '1.5'], ['-1.2', '1.1']]

Assuming Python as a programming language, and assuming you want exactly the output to be like this:
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
Here is how to do it:
I commented the code for better understanding. Please tell me if something isn't clear.
fileToProcess = open("numerical.txt", "r")
results = []
hashtag_results = []
# For each line, we have two cases: either the line starts with a hashtag or it contains numbers.
for line in fileToProcess:
    # If the line doesn't start with a hashtag, then we want to:
    # 1. Separate the text by "," and not spaces.
    # 2. Parse the text as floats using a list comprehension.
    # 3. Append the parsed line to hashtag_results, which collects
    #    all lists between two hashtags.
    if not line.startswith("#"):
        line_results = [float(x) for x in line.strip().split(',')]
        hashtag_results.append(line_results)
    # If the line starts with a hashtag AND hashtag_results ISN'T EMPTY,
    # then we want to append the whole hashtag_results list to the final results list.
    if line.startswith("#") and hashtag_results:
        results.append(hashtag_results)
        hashtag_results = []
# After the final line, we append the last hashtag_results to the final results too.
results.append(hashtag_results)
print(results)
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]

The general idea in your OP looks fine, but you will need to split on "," (instead of " ") and append to results a list of the numerical values rather than a list of strings.
Another issue is that you don't close the file once you're finished with it. I suggest using the built-in context manager construct (https://book.pythontips.com/en/latest/context_managers.html), which open() supports; it automatically closes the file once you leave the context manager scope.
Parsing data from a file is a common data processing task in Python, so it can be done in a more "pythonic" way with a list comprehension.
# use a context manager, so once you leave the `with` block,
# the file is closed(!)
with open("numerical.txt", "r") as fileToProcess:
    results = [
        # split the line on "," and interpret each element as a float
        [float(val) for val in line.strip().split(",")]
        # iterate through each line in the file
        for line in fileToProcess
        # ignore lines that just have '#'
        if line.strip() != "#"
    ]
# here, the file would be closed, and `results` will contain the parsed data
# note: the '#' lines are skipped, so this is one flat list of pairs:
# results = [[2.1, -3.1], [-0.7, 4.1], [3.8, 1.5], [-1.2, 1.1]]
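Because the comprehension above skips the '#' lines, it does not by itself group the pairs into sublists. A minimal sketch of one way to keep the grouping follows; the helper name parse_groups and the in-memory sample list are assumptions made for illustration, not part of the original answers.

```python
def parse_groups(lines):
    """Group comma-separated float rows into sublists delimited by '#' lines."""
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == "#":
            groups.append([])  # a '#' line starts a new group
        else:
            if not groups:
                groups.append([])  # rows before the first '#' get their own group
            groups[-1].append([float(val) for val in line.split(",")])
    return groups


# Sample lines standing in for the contents of numerical.txt.
sample = ["#", "2.1,-3.1", "-0.7,4.1", "#", "3.8,1.5", "-1.2,1.1"]
print(parse_groups(sample))
# [[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
```

Since a file object yields its lines when iterated, the same function should also accept the file directly, for example inside `with open("numerical.txt") as f:`. Note that in this sketch two consecutive '#' lines produce an empty group.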

Related

Question about String operations in Python

I am new to Python, and was practising File Operations. I have written this program:
myfile = open('test3.txt', 'w+')
myfile.writelines(['Doctor', 'Subramanian', 'Swamy', 'Virat', 'Hindustan', 'Sangam'])
which outputs the following:
DoctorSubramanianSwamyViratHindustanSangam.
How do I add spaces in between items of the list in the final output such that the final output is Doctor Subramanian Swamy Virat Hindustan Sangam?
Based on what I understood from your question, you wish to add spaces between elements of the list in the final output. One possible solution is:
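myfile = open('test3.txt', 'w+')
words = ['Doctor', 'Subramanian', 'Swamy', 'Virat', 'Hindustan', 'Sangam']
for word in words:
    myfile.write(word + ' ')
myfile.close()
In particular, the line myfile.write(word + ' ') adds a space after every element it writes (including the last one). The list is named words here so that it does not shadow the built-in list.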
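As an alternative to writing each item in a loop, the items can be joined into one string first. This is only a sketch; it reuses the file name test3.txt from the question, and the variable names words and line are assumptions.

```python
words = ['Doctor', 'Subramanian', 'Swamy', 'Virat', 'Hindustan', 'Sangam']

# str.join puts the separator only between items, so there is no trailing space.
line = ' '.join(words)

# The context manager closes the file once the block is left.
with open('test3.txt', 'w') as myfile:
    myfile.write(line)

print(line)
# Doctor Subramanian Swamy Virat Hindustan Sangam
```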
You could try stripping it using the same .strip() method?
value = "'"
list1 = []
for item in list2:
    list1.append(item.strip("{0}".format(value)))
Try it and let me know.

How can I categorize numbers that inside a text file?

I have a text file that has 5000 lines. Its format is like this:
1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build
I want to make different lists for example:
for second column:
second_build=[3,1]
second_demolish=[6,6,4]
I've tried something like this:
with open('cons.data') as file:
    second_build=[line.split(',')[1] for line in file if line.split(',')[7]=='build']
But it did not work.
You can get values for each column/action as follows:
lines = """1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build""".split(
"\n"
)
build_cols = [list() for _ in range(7)]
demolish_cols = [list() for _ in range(7)]
data = {"build": build_cols, "demolish": demolish_cols}
for line in lines:
    tokens = line.split(",")
    for bc, tok in zip(data[tokens[-1]], tokens):
        bc.append(tok)
# to access second column build values:
print(build_cols[1])
# ['3', '1']
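Here, build_cols stores a list of lists, where each entry represents a column. For each build line, the items are appended to the corresponding column's list in build_cols. Because zip() stops at the shorter input, only the seven numeric fields are stored, not the label, and the values remain strings.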
First read the lines into a variable with readlines(), then add an rstrip() in the list comprehension. Your version fails because every line (except possibly the last) ends with '\n', so the last field is 'build\n' rather than 'build'. Strip that out, and convert the values to integers:
with open('cons.data') as file:
    f=file.readlines()
second_build=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='build']
second_demolish=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='demolish']
And now:
print(second_build)
print(second_demolish)
prints:
[3, 1]
[6, 6, 4]
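The same grouping can be sketched in a single pass with the standard csv module, which also takes care of the trailing newline. The helper name column_by_label and the in-memory sample are assumptions for illustration; with the real file, the reader would wrap the object returned by open('cons.data', newline='').

```python
import csv
import io
from collections import defaultdict


def column_by_label(rows, column):
    """Collect the integer values of one column, keyed by the label in the last field."""
    grouped = defaultdict(list)
    for row in rows:
        if not row:
            continue  # csv.reader yields [] for blank lines
        grouped[row[-1]].append(int(row[column]))
    return dict(grouped)


# Sample data copied from the question, standing in for cons.data.
sample = io.StringIO(
    "1,3,4,1,2,3,5,build\n"
    "2,6,4,6,7,3,4,demolish\n"
    "3,6,10,2,3,1,3,demolish\n"
    "4,4,1,2,3,4,5,demolish\n"
    "5,1,1,1,1,6,8,build\n"
)
second = column_by_label(csv.reader(sample), 1)
print(second["build"])     # [3, 1]
print(second["demolish"])  # [6, 6, 4]
```

Keying the result by label avoids writing one comprehension per category, which may help if more labels than build and demolish appear in the file.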

How to make a list a string and then an array

import csv
with open('data1.txt', 'r') as f:
    fread = csv.reader(f, delimiter='\t')
    output = []
    for line in fread:
        output.append(line)
data_values = str(output[1:17]) #skip the first line and grab relevant data from lines 1 to 16
print(data_values)
usable_data_values = float(data_values)
Right now I'm trying to convert a .txt file with two columns of data into an array with two columns of data. At this point I've extracted the data and have this:
[['0.0000', '1.06E+05'], ['0.0831', '93240'], ['0.1465', '1.67E+05'],
['0.2587', '1.54E+05'], ['0.4828', '1.19E+05'], ['0.7448', '1.17E+05'],
['0.9817', '1.10E+05'], ['1.2563', '1.11E+05'], ['1.4926', '74388'], ['1.7299', '83291'],
['1.9915', '66435'], ['3.0011', '35407'], ['4.0109', '21125'], ['5.0090', '20450'],
['5.9943', '15798'], ['7.0028', '4785.2']]
I'm trying to get that data into something usable (I think I need to get rid of the commas, but I'm new to Python and don't know how to do this). Any help getting these numbers into a usable form for operations (multiply, add, divide, etc.) would be appreciated!
I believe this is the best you can do. The commas are fine; they just separate the elements of a Python list. Notice that the quotation marks disappear in the output, which indicates you are no longer dealing with text strings.
lst = [['0.0000', '1.06E+05'], ['0.0831', '93240'], ['0.1465', '1.67E+05'],
['0.2587', '1.54E+05'], ['0.4828', '1.19E+05'], ['0.7448', '1.17E+05'],
['0.9817', '1.10E+05'], ['1.2563', '1.11E+05'], ['1.4926', '74388'], ['1.7299', '83291'],
['1.9915', '66435'], ['3.0011', '35407'], ['4.0109', '21125'], ['5.0090', '20450'],
['5.9943', '15798'], ['7.0028', '4785.2']]
[list(map(float, i)) for i in lst]
# [[0.0, 106000.0],
# [0.0831, 93240.0],
# [0.1465, 167000.0],
# [0.2587, 154000.0],
# [0.4828, 119000.0],
# [0.7448, 117000.0],
# [0.9817, 110000.0],
# [1.2563, 111000.0],
# [1.4926, 74388.0],
# [1.7299, 83291.0],
# [1.9915, 66435.0],
# [3.0011, 35407.0],
# [4.0109, 21125.0],
# [5.009, 20450.0],
# [5.9943, 15798.0],
# [7.0028, 4785.2]]
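The conversion can also happen while reading, so the intermediate str() call from the question is not needed. The following is a sketch under assumptions: the header line "x\ty" and the three sample rows stand in for data1.txt, whose exact layout is not shown in the question.

```python
import csv
import io

# Stand-in for data1.txt: one header line (hypothetical) and tab-separated values.
raw = "x\ty\n0.0000\t1.06E+05\n0.0831\t93240\n0.1465\t1.67E+05\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
next(reader)  # skip the first (header) line

# float() understands both plain decimals and scientific notation like '1.06E+05'.
rows = [[float(value) for value in row] for row in reader if row]

# Transpose the rows into two column lists for element-wise arithmetic.
xs, ys = (list(col) for col in zip(*rows))

print(rows[1])  # [0.0831, 93240.0]
print([x * y for x, y in zip(xs, ys)])
```

With the real file, io.StringIO(raw) would be replaced by the file object from open('data1.txt', newline='').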

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m.number to the smallest m.number. Then when I loop through PATH in database (shown in code) I am looping through in decreasing m.
Here is the relevant part of the code
base = open('base8.dat', 'r')
database= base.read().splitlines()
base.close()
counter=0
mu_list=np.array([])
delta_list=np.array([])
ofsset = 0.00136
beta=0
for PATH in database:
    if os.path.exists(str(PATH)+'/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.n.dat')
        n7_array= numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta=round(float(5.0+ofsset-(n1_array[0]*2.+4.*n7_array[0])),6)
        par = open(str(PATH)+"/params10", "r")
        for line in par:
            counter= counter+1
            if re.match("mu", line):
                mioMU= re.findall('\d+', line.translate(None, ';'))
                mioMU2=line.split()[2][:-1]
                mu=mioMU2
                print mu, delta, PATH
        mu_list=np.append(mu_list, mu)
        delta_list=np.append(delta_list,delta)
        optimal_counter=0
print delta_list, mu_list
I have checked the possible flagged repeat but I can't seem to get it to work for mine because my file doesn't technically contain strings and numbers. The 'number' I need to sort by is contained in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m(somenumber) part
Assuming that the number part of your line has the form of a float you can use a regular expression to match that part and convert it from string to float.
After that you can use this information to sort all the lines read from your file. I added an invalid line to show how invalid data is handled.
As a quick example I would suggest something like this:
import re
# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']
regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)
criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)
tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)
# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines have been read from the file and stored in a list (called l here). The regex group criterion captures the float part contained in **m12.50**, as you can see on regex101. Iterating through all the lines therefore gives you a new list containing the matched groups as floats. If the regex does not match a given string, or converting the group to a float fails, crit is set to zero so that those invalid lines end up at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples by each tuple's first element and write the corresponding strings to a new list, output.
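The question asks for largest to smallest, and its sample paths contain no ** markers, so a variant may be closer to what is needed. This is a sketch under assumptions: the pattern m followed by a number, the helper name m_value, and the choice to send unmatched lines to the end are all illustrative.

```python
import re

# 'm' followed by digits with an optional decimal part, e.g. m12.40 (assumed format).
M_NUMBER = re.compile(r"m(\d+(?:\.\d+)?)")


def m_value(path):
    """Return the number after 'm' as a float, or -inf if none is found."""
    match = M_NUMBER.search(path)
    return float(match.group(1)) if match else float("-inf")


paths = [
    "Volumes/hard_disc/u14_cut/u14m12.40_all.beta/beta8",
    "Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8",
    "Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8",
]

# reverse=True sorts from the largest m value to the smallest;
# lines without a match get -inf and therefore end up last.
ordered = sorted(paths, key=m_value, reverse=True)
for path in ordered:
    print(path)
```

Passing the key function directly to sorted() avoids building the intermediate list of tuples; writing ordered back to the .dat file is left out here.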

Scan through txt, append certain data to an empty list in Python

I have a text file that I am reading in Python. I'm trying to extract certain elements that follow keywords and append them to empty lists. The file looks like this:
So I want to make two empty lists:
The 1st list will hold the sequence names.
The 2nd list will be a list of lists in the format [Bacteria,Phylum,Class,Order, Family, Genus, Species].
Most of the organisms will be Uncultured bacterium. I am trying to add the Uncultured bacterium with the following IDs that are separated by ;
Is there any way to scan for a certain word and, when the word is found, take the word that comes after it [separated by a '\t']?
I need it to create a dictionary that translates the sequence name to the taxonomic data.
I know I will need an empty list to append the names to:
seq_names=[ ]
a second list to put the taxonomy lists into
taxonomy=[ ]
and a 3rd list that will be reset after every iteration
temp = [ ]
I'm sure it can be done in Biopython, but I'm working on my Python skills.
Yes, there is a way.
You can split the string you get from reading the file into a list using the built-in split method. From this you can find the index of the word you are looking for, and then use this index plus one to get the word after it. For example, using a text file called test.txt that looks like this (the formatting is a bit weird because SO doesn't seem to like hard tabs):
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will return desired as 'eight'
Edit: To return every following word in the list
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This is using list comprehensions. It enumerates the list of words and if the word is what you are looking for includes the word at the next index in the list.
Edit2: To split it on both new lines and tabs you can use regular expressions
import re
f = open('testtest.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
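One caveat with words[ind+1] is that it raises an IndexError when the keyword happens to be the very last token. A sketch that avoids this by pairing each token with its successor follows; the helper name words_after and the sample text are assumptions for illustration.

```python
import re


def words_after(text, keyword):
    """Return every token that directly follows `keyword` in tab/newline-separated text."""
    tokens = re.split(r"[\t\n]", text)
    # Pair each token with its successor; a keyword in last position has no successor.
    return [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == keyword]


sample = "one\ttwo\tseven\teight\nseven\tnine\tseven"
print(words_after(sample, "seven"))
# ['eight', 'nine']
```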
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
    'some_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species],
    'some_other_sequence': [Bacteria,Phylum,Class,Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the files, .split('\t') to break them into "columns" and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...] to load the row into the dictionary.
So,
for row in inp_file.readlines():
    row = row.split('\t')
    my_data[row[0]] = [row[10], row[11], row[13], ...]
