Read several lists from a text file properly in Python

I have a text file which has 541 lists, each containing 280 numbers, such as below:
[301.82779832839964, 301.84247725804647, 301.85718673070272, ..., 324.4056396484375, 324.20379638671875, 324.00198364257812]
.
.
[310.6907599572782, 310.68334604280966, 310.67756809346469,..., 324.23541883368551, 324.18277040240207, 324.09177971086382]
To read this text file, I used numpy.genfromtxt, writing code to read the first list as a test:
pt1 = np.genfromtxt(filn1,dtype=np.float64,delimiter=",")
print pt1[0].shape
print list(pt1[0])
I expected to see the full contents of the first list, but the result showed 'nan' in the first and last places, as below:
[nan, 301.84247725804647, 301.85718673070272, ..., 324.4056396484375, 324.20379638671875, nan]
I have tried other options of numpy.genfromtxt, but I couldn't find why it produced 'nan' in the first and last places. This happened not only for the first list, but for all lists.
Any idea or help would be really appreciated.
Thank you,
Isaac

import numpy as np
from ast import literal_eval
# literal_eval parses each bracketed line as a Python list; a list
# comprehension is used so this also works on Python 3, where map
# returns an iterator rather than a list.
pt1 = np.array([literal_eval(line) for line in open("in.txt")])
For:
[301.82779832839964, 301.84247725804647, 301.85718673070272, 324.4056396484375, 324.20379638671875, 324.00198364257812]
[310.6907599572782, 310.68334604280966, 310.67756809346469, 324.23541883368551, 324.18277040240207, 324.09177971086382]
You will get:
[[ 301.82779833 301.84247726 301.85718673 324.40563965 324.20379639
324.00198364]
[ 310.69075996 310.68334604 310.67756809 324.23541883 324.1827704
324.09177971]]

It looks like the problem is caused by the square brackets in your text file; the simplest solution would be to remove these characters, either using find-and-replace in a text editor or, if your file is too large, with a command-line tool like sed.
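The brackets can also be stripped on the fly in Python, since genfromtxt accepts any iterable of lines, not just a filename — a minimal sketch (the two short sample lists here are made up, standing in for the 541 lists of 280 numbers):

```python
import io
import numpy as np

# Two short bracketed lists in the same format as the question's file.
text = "[301.8, 301.9, 324.4]\n[310.7, 310.6, 324.2]\n"

# genfromtxt accepts any iterable of strings, so the brackets can be
# removed line by line without writing a temporary file.
cleaned = (line.replace("[", "").replace("]", "") for line in io.StringIO(text))
pt1 = np.genfromtxt(cleaned, dtype=np.float64, delimiter=",")

print(pt1.shape)  # (2, 3), with no nan at the edges
```

With the real file, `io.StringIO(text)` would be replaced by `open(filn1)`.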

genfromtxt is turning the [ and ] in your file into 'nan'. As a last resort you could do something like this:
data = []
with open('filn') as f:
    d = f.read().split('\n')
for line in d:
    if line:
        data.append(eval(line))
data = np.asarray(data)
Alternatively you can replace the [ and ] throughout the whole file, and then use np.genfromtxt(filn1,dtype=np.float64,delimiter=",") as you were before, without getting any nan elements.


In Python, how to remove items in a list based on the specific string format?

I have a Python list as below:
merged_cells_lst = [
'P19:Q19
'P20:Q20
'P21:Q21
'P22:Q22
'P23:Q23
'P14:Q14
'P15:Q15
'P16:Q16
'P17:Q17
'P18:Q18
'AU9:AV9
'P10:Q10
'P11:Q11
'P12:Q12
'P13:Q13
'A6:P6
'A7:P7
'D9:AJ9
'AK9:AQ9
'AR9:AT9'
'A1:P1'
]
I only want to unmerge the cells in the P and Q columns. Therefore, I seek to remove any strings/items in merged_cells_lst that do not have the format "P##:Q##".
I think that regex is the best and most simple way to go about this. So far I have the following:
for item in merge_cell_lst:
    if re.match(r'P*:Q*'):
        pass
    else:
        merged_cell_lst.pop(item)
print(merge_cell_lst)
The code however is not working. I could use any additional tips/help. Thank you!
Modifying a list while looping over it causes trouble. You can use a list comprehension instead to create a new list.
Also, you need a different regex expression. The current pattern P*:Q* matches PP:QQQ, :Q, or even :, but not P19:Q19.
import re
merged_cells_lst = ['P19:Q19', 'P20:Q20', 'P21:Q21', 'P22:Q22', 'P23:Q23', 'P14:Q14', 'P15:Q15', 'P16:Q16', 'P17:Q17', 'P18:Q18', 'AU9:AV9', 'P10:Q10', 'P11:Q11', 'P12:Q12', 'P13:Q13', 'A6:P6', 'A7:P7', 'D9:AJ9', 'AK9:AQ9', 'AR9:AT9', 'A1:P1']
p = re.compile(r"P\d+:Q\d+")
output = [x for x in merged_cells_lst if p.match(x)]
print(output)
# ['P19:Q19', 'P20:Q20', 'P21:Q21', 'P22:Q22', 'P23:Q23', 'P14:Q14', 'P15:Q15',
# 'P16:Q16', 'P17:Q17', 'P18:Q18', 'P10:Q10', 'P11:Q11', 'P12:Q12', 'P13:Q13']
Your list has some typos; it should look something like this:
merged_cells_lst = [
'P19:Q19',
'P20:Q20',
'P21:Q21', ...]
Then something as simple as:
x = [k for k in merged_cells_lst if k[0] == 'P']
would work. This assumes you know a priori that the pattern you want to keep follows the Pxx:Qxx format. If you want a dynamic solution, you can replace the condition in the list comprehension with a regex match.
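For the dynamic variant, re.fullmatch anchors the pattern to the whole string (plain re.match would also accept something like 'P19:Q19x') — a small sketch on a shortened version of the list:

```python
import re

# Shortened version of the list from the question.
merged_cells_lst = ['P19:Q19', 'AU9:AV9', 'A6:P6', 'P10:Q10']
pattern = re.compile(r"P\d+:Q\d+")

# fullmatch only succeeds if the entire string fits the pattern.
kept = [cell for cell in merged_cells_lst if pattern.fullmatch(cell)]
print(kept)  # ['P19:Q19', 'P10:Q10']
```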

How to print selected numbers from a list in python?

I want to print all atoms except H (hydrogen) from the pdb file. Here is the file
https://github.com/mahesh27dx/molecular_phys.git
The following code prints the contents of the file:
import numpy as np
import mdtraj as md
coord = md.load('alanine-dipeptide-nowater.pdb')
atoms, bonds = coord.topology.to_dataframe()
atoms
The result looks like this
From this table I want to print all the elements except H. I think it can be done in Python using list slicing. Does anyone have an idea how it can be done?
Thanks!
You should probably clarify whether you want help with mdtraj or pandas specifically.
Anyway, it's as simple as atoms.loc[atoms.element != 'H'].
You can do the following:
print(*list(df[df.element!='H']['element']))
If you want the unique elements, you can do this:
print(*set(df[df.element!='H']['element']))
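Since the pdb file itself isn't reproduced here, the filtering can be illustrated on a small hand-made DataFrame shaped like the atoms table (the column values below are made up; only the 'element' column matters):

```python
import pandas as pd

# Tiny stand-in for the atoms table returned by coord.topology.to_dataframe().
atoms = pd.DataFrame({
    "name": ["CH3", "H1", "C", "O", "H2"],
    "element": ["C", "H", "C", "O", "H"],
})

# Boolean mask: keep every row whose element is not hydrogen.
heavy = atoms.loc[atoms.element != "H"]
print(heavy)
```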

How to turn items from extracted data to numbers for plotting in python?

So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but the elements are text strings, not numbers I can use for anything. I want to use the numbers to plot them in a graph; how would I turn them into numbers and remove unnecessary characters like commas and the n= prefix?
Here is code, and under is my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on the mentioned post), wrapped in a try/except block so failed conversions are simply skipped.
To keep the lists consistent, any conversion error should discard the whole affected line, so you might want to use intermediate variables and a check for each field.
Btw, I noticed the argument to the extract function; it would seem more logical to make the argument a string containing the file name from which to extract the data.
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using regular expression:
import re
match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')
for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
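As a self-contained check of that pattern (the sample line below is made up, since the real file format isn't shown in the question):

```python
import re

# Same pattern as above: optional sign, digits, optional decimal part,
# optional exponent.
match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

line = "delta_x = 0.125 abs_error = 3.1e-05 n=7"  # made-up sample line
print(match_number.findall(line))  # ['0.125', '3.1e-05', '7']
```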
But it seems like your data is already in CSV format, so try using pandas.
Read the data into a DataFrame without a header (the column names will then be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, if you would like to remove the 'n=' prefix and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)

Get datatype for a particular string part in python3

I have input in a csv file like -0.02872239612042904, -0.19755002856254578.. with 128 values, and when I read that array from the csv file it gets read as the string '-0.02872239612042904, -0.19755002856254578..'. I have figured out a way to map all strings to a specific datatype. Right now I am doing it like this:
result=list(map(float, re.findall(r'\d+', en))) #en=string read from csv file
But since these are face encodings, the distance check returns False all the time, which I believe is because after the conversion the array contains values like 1906684972345829.0 and so on.
I can't find a datatype to represent numbers like -0.02872239612042904; that's why when mapping I am converting to float, which gives the wrong result. Can anyone please tell me what the correct datatype for numbers like -0.02872239612042904 is in Python 3? Many thanks, this is giving me a headache.
EDIT:-
This is how I am reading data from the csv file:-
def get_encodings():
    df=pd.read_csv('Encodings/encodings.csv') #getting file
    with tqdm(total=len(list(df.iterrows()))) as prbar:
        encodings=[]
        images=[]
        for index, row in df.iterrows():
            r=[]
            en=df.loc[index,'Encoding']
            print(en) #prints correctly
            print(type(en)) #prints string and I want exact same data in its original form which looks like I have shown below
"[-0.19053705 0.06230173 0.04058716 -0.08283613 -0.07159504 -0.10155849
0.06008045 -0.06842063 0.1317966 -0.10250588 0.203399 -0.01436609
-0.21249449 -0.09238856 0.0279788 0.08926097 -0.09177385 -0.1628615
-0.03505187 -0.12979373 0.05772705 0.00208503 -0.06933809 0.00741822
-0.17499965 -0.25000119 -0.0205064 -0.03139503 0.01130889 -0.1057417
0.13554846 0.06285821 -0.18908061 -0.02082938 0.04383367 0.23148835
-0.05068404 -0.00925579 0.1900605 -0.05617992 -0.12842563 -0.06219928
0.07317995 0.26369438 0.10394366 0.05749369 0.02448226 -0.07668396
0.1266536 -0.23425353 0.04819498 0.07290804 0.111645 0.08294459
0.10209186 -0.21581331 0.07399686 0.07748453 -0.22381224 0.01746997
0.0188249 -0.06403829 -0.07789861 -0.0249712 0.21001905 0.03979192
-0.12171203 -0.06864078 0.21658717 -0.17392246 -0.06753681 0.09808435
-0.0076007 -0.18134885 -0.23990698 0.07026891 0.3552466 0.17010394
-0.16684352 0.03726491 0.02757547 0.01445537 0.10094975 0.04033324
-0.10441576 0.0377433 -0.09693146 0.04404883 0.16759454 0.0402087
-0.05915016 0.1369293 0.05408669 0.05787617 0.03509152 0.01340439
-0.06379045 0.04323686 -0.09738267 -0.02683797 0.14505677 -0.10747927
0.03247242 0.11747092 -0.18656668 0.22448684 -0.00474619 -0.00586929
-0.05853979 0.06613642 -0.065335 0.02921261 0.08723848 -0.30918318
0.23265852 0.20364268 -0.07978678 0.19747412 0.08048097 0.04772019
0.06427031 -0.03703914 -0.14493702 -0.12132056 -0.01301065 -0.02351468
0.10600268 0.06480799]"
One row of my data looks like this^ and i just want all of it without quotes in this type dtype('
If you have a csv, use the csv module to read it (or read up on pandas, which will auto-convert your values to suitable types):
Create demo file:
data = """-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822
-1.02872239612042904, -1.19755002856254578, 1.31345692434, -1.0009348573822
-2.02872239612042904, -2.19755002856254578, 2.31345692434, -2.0009348573822
-3.02872239612042904, -3.19755002856254578, 3.31345692434, -3.0009348573822
apple, prank, 0.23, nothing
"""
with open("datafile.csv","w") as f:
f.write(data)
Read the demo file back in:
def safeFloat(text):
    try:
        return float(text)
    except ValueError:  # maybe even catch-all here
        return float("nan")

data = []
import csv
with open("datafile.csv","r") as r:
    reader = csv.reader(r, delimiter=',')  # renamed so it doesn't shadow the csv module
    for l in reader:
        data.append(list(map(safeFloat,l)))  # safeFloat to capture errors
print(data)
If you have non-floats in your data, note the def safeFloat(text) used instead of a bare float inside map; it guards against parsing errors when some text is not convertible to float.
Output:
[[-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822],
[-1.028722396120429, -1.1975500285625458, 1.31345692434, -1.0009348573822],
[-2.028722396120429, -2.1975500285625458, 2.31345692434, -2.0009348573822],
[-3.028722396120429, -3.1975500285625458, 3.31345692434, -3.0009348573822],
[nan, nan, 0.23, nan]]
You could also use regex, but then your pattern needs to allow the optional sign as well as a dot and numbers before/after it:
r'[+-]?\d+\.\d+' # would allow for 123.1245 - but not for 123 or .1234
# would allow an optional +- before numbers
You can check patterns f.e. at http://regex101.com - this pattern with demo data can be found here: https://regex101.com/r/xSiyO1/1
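A quick sanity check of that stricter pattern (the sample strings below are made up):

```python
import re

# Requires at least one digit on each side of the decimal point.
pattern = re.compile(r'[+-]?\d+\.\d+')

print(bool(pattern.fullmatch("-0.02872239612042904")))  # True
print(bool(pattern.fullmatch("123")))                   # False: no decimal part
print(bool(pattern.fullmatch(".1234")))                 # False: nothing before the dot
```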
pandas solution (only valid data):
data = """-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822
-1.02872239612042904, -1.19755002856254578, 1.31345692434, -1.0009348573822
-2.02872239612042904, -2.19755002856254578, 2.31345692434, -2.0009348573822
-3.02872239612042904, -3.19755002856254578, 3.31345692434, -3.0009348573822
"""
with open("datafile.csv","w") as f:
f.write(data)
import pandas as pd
import numpy as np
df = pd.read_csv("datafile.csv", dtype={"a":np.float64,"b":np.float64,"c":np.float64,"d":np.float64},names=["a","b","c","d"] )
print(df)
Output:
a b c d
0 -0.028722 -0.19755 0.313457 -0.000935
1 -1.028722 -1.19755 1.313457 -1.000935
2 -2.028722 -2.19755 2.313457 -2.000935
3 -3.028722 -3.19755 3.313457 -3.000935

limit a float list into 10 digits

I have a list import from a data file.
lines=['1628.246', '100.0000', '0.4563232E-01', '0.4898217E-01', '0.3017656E-02', '0.2271272', '0.2437533', '0.1500232E-01', '0.4102987', '0.4117742', '0.5461504E-02', '2.080838', '0.5527303E-03', '-0.4542367E-03', '-0.2238781E-01', '-0.8196812E-03', '-0.3796306E-01', '-0.7906407E-03', '-0.6738000E-03', '0.000000']
I want to generate a new list with every element formatted to the same 10 decimal digits, and write it back to a file.
Here is what I did:
newline=map(float,lines)
newline=map("{:.10f}".format,newline)
newline=map(str,newline)
jitterfile.write(join(newline)+'\n')
It works, but doesn't look elegant. Any idea how to make it better looking?
You can do it in a single line like so:
newline=["{:.10f}".format(float(i)) for i in lines]
jitterfile.write(join(newline)+'\n')
Of note, your third instruction newline=map(str,newline) is redundant: the entries in the list are already strings, so casting them is unnecessary.
The map function also accepts a lambda. Also, since the result of format is already a string, you don't need to apply str to your list, and you need to use join with a delimiter like ',':
>>> newline=map(lambda x:"{:.10f}".format(float(x)),newline)
>>> newline
['1628.2460000000', '100.0000000000', '0.0456323200', '0.0489821700', '0.0030176560', '0.2271272000', '0.2437533000', '0.0150023200', '0.4102987000', '0.4117742000', '0.0054615040', '2.0808380000', '0.0005527303', '-0.0004542367', '-0.0223878100', '-0.0008196812', '-0.0379630600', '-0.0007906407', '-0.0006738000', '0.0000000000']
jitterfile.write(','.join(newline)+'\n')
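On Python 3, map returns an iterator rather than a list, so a comprehension with an f-string is a common modern spelling of the same idea — a sketch on a shortened version of the list:

```python
# Shortened version of the list from the question.
lines = ['1628.246', '0.4563232E-01', '-0.4542367E-03']

# float() understands the E notation; :.10f formats to 10 decimal places.
newline = [f"{float(x):.10f}" for x in lines]
print(','.join(newline))  # 1628.2460000000,0.0456323200,-0.0004542367
```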
