Remove lines that contain a specific element in a txt file using Python

I have a txt file with elements:
705.95 117.81 1242.00 252.43 5.02
1036.12 183.52 1242.00 375.00 1.96
124.11 143.43 296.91 230.32 10.70
0.00 0.00 0.00 0.00 4.84
0.00 6.60 112.99 375.00 17.50
0.00 186.66 14.82 375.00 8.23
695.36 162.75 820.66 263.08 12.84
167.61 134.45 417.75 222.10 27.61
0.00 0.00 0.00 0.00 6.86
0.00 0.00 0.00 0.00 11.76
I want to delete the lines that contain 0.00 0.00 0.00 0.00 as the first four elements of the line. How can I do that using Python? Your help is highly appreciated.

with open('file.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            if not line.startswith('0.00 0.00 0.00 0.00'):
                outfile.write(line)
Here we open file.txt (the file with your lines) for reading and output.txt for writing the result. Then we iterate over each line of the input file and write the line to the output file if it doesn't start with '0.00 0.00 0.00 0.00'.
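If you prefer to compare the parsed fields rather than the raw text prefix (for example, in case the spacing between the numbers is not always a single space), here is a sketch of the same filter based on splitting each line; the file names are the same as above:

with open('file.txt', 'r') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        # Keep the line unless its first four fields are exactly '0.00'
        if line.split()[:4] != ['0.00', '0.00', '0.00', '0.00']:
            outfile.write(line)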

If you want to overwrite the files in place without creating new output files, you can try this. The following code also iterates through all the text files in the current directory.
import glob

for i in glob.glob("*.txt"):
    with open(i, "r+") as f:
        content = f.readlines()
        f.truncate(0)
        f.seek(0)
        for line in content:
            if not line.startswith("0.00 0.00 0.00 0.00"):
                f.write(line)

Related

NameError: name 'load' is not defined after importing spyrmsd

I'm running a Python script that imports the module sPyRMSD. I get an error when it reaches the line containing io.loadmol(), which loads the molecules.
from spyrmsd import io,rmsd
ref = io.loadmol("tempref.pdb")
I'm getting the following error:
Reference Molecule:lig_ref.pdb
PDBqt File:ligand_vina_out.pdbqt
Traceback (most recent call last):
  File "rmsd.py", line 34, in <module>
    ref = io.loadmol("tempref.pdb")
  File "/home/aathiranair/.local/lib/python3.8/site-packages/spyrmsd/io.py", line 66, in loadmol
    mol = load(fname)
NameError: name 'load' is not defined
I tried uninstalling and reinstalling the spyrmsd module, but I still face the same issue.
I also tried creating a virtual environment and running the script but faced the same issue.
(ihub_proj) aathiranair@aathiranair-Inspiron-5406-2n1:~/Desktop/Ihub$ python3 rmsd.py lig_ref.pdb ligand_vina_out.pdbqt
Reference Molecule:lig_ref.pdb
PDBqt File:ligand_vina_out.pdbqt
Traceback (most recent call last):
  File "rmsd.py", line 34, in <module>
    ref = io.loadmol("tempref.pdb")
  File "/home/aathiranair/Desktop/Ihub/ihub_proj/lib/python3.8/site-packages/spyrmsd/io.py", line 66, in loadmol
    mol = load(fname)
NameError: name 'load' is not defined
The tempref.pdb file looks like this:
ATOM 1 O6 LIG 359 2.349 1.014 7.089 0.00 0.00
ATOM 9 H LIG 359 1.306 1.691 9.381 0.00 0.00
ATOM 2 C2 LIG 359 0.029 4.120 8.082 0.00 0.00
ATOM 3 O9 LIG 359 -1.106 2.491 9.345 0.00 0.00
ATOM 4 C1 LIG 359 -0.204 3.890 0.337 0.00 0.00
ATOM 5 S5 LIG 359 -0.355 4.108 4.075 0.00 0.00
ATOM 8 C4 LIG 359 -3.545 1.329 7.893 0.00 0.00
ATOM 7 C7 LIG 359 -1.133 5.150 9.406 0.00 0.00
ATOM 6 C3 LIG 359 -0.064 1.805 8.234 0.00 0.00
It seems that to use the io module, one of OpenBabel or RDKit is required.
Also, make sure you have NumPy installed.
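As a quick sanity check, you can test which backend is importable from the environment you actually run the script in; a small sketch (how you install a backend depends on your setup, e.g. via conda-forge):

# Check whether a backend for spyrmsd's io module is available.
for backend in ("rdkit", "openbabel"):
    try:
        __import__(backend)
        print(backend, "is available")
    except ImportError:
        print(backend, "is NOT available in this environment")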

sklearn.metrics.classification_report support field is showing the wrong numbers for the labels

I am using sklearn.metrics.classification_report to evaluate the result of my classification.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print(classification_report(y_true, y_pred, target_names=list(le.classes_)))
And here is my result:
               precision    recall  f1-score   support
   Technology       0.00      0.00      0.00         1
       Travel       0.00      0.00      0.00         5
      Fashion       0.00      0.00      0.00        25
Entertainment       0.72      1.00      0.84       130
          Art       0.00      0.00      0.00         7
      Politic       0.00      0.00      0.00        12
  avg / total       0.52      0.72      0.61       180
The problem is that I actually have 7 labels. The order is Technology, Travel, Fashion, Entertainment, Art, Politic, Sports. I don't have any Art label in my y_true results, but the report lists the labels in order, so it shows Art but skips Sports. It writes the results for Politic in the Art row, and the results for Sports end up in the Politic row.
Why doesn't it skip Art? I have no idea how to solve this.
The row labels in the classification report come from the "target_names" argument, so make sure you are giving the right values to that argument. By default the report only covers the classes that actually appear in y_true or y_pred and pairs them with target_names positionally, so a name list that still contains Art shifts every later name by one row.
According to your required output, you should have
target_names = ["Technology", "Travel", "Fashion", "Entertainment", "Politic", "Sports"]
I would also suggest checking the output of your le.classes_; I am not sure which transformer 'le' refers to.
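If you want the report to always list all 7 classes in the encoder's order, including classes with zero support, you can pass the labels argument explicitly so that each entry of target_names is paired with the right class id. A sketch, assuming le is the fitted label encoder from your code and your scikit-learn version supports zero_division (0.22+):

from sklearn.metrics import classification_report

# Pin the label ids so every class of the encoder gets its own row,
# even if it never occurs in y_true or y_pred.
labels = list(range(len(le.classes_)))
print(classification_report(y_true, y_pred,
                            labels=labels,
                            target_names=list(le.classes_),
                            zero_division=0))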

Writing lines from input file to output file based on the order in a list

I have an input data file input.dat that looks like this:
0.00 0.00
0.00 0.00
0.00 0.00
-0.28 1.39
-0.49 1.24
-0.57 1.65
-0.61 2.11
-0.90 1.73
-0.87 2.29
I have a list denoting line numbers as follows:
linenum = [7, 2, 6]
I need to write the rows of input.dat that correspond to the values in linenum, in that same order, to a file output_veloc_max.dat.
The result should look like this:
-0.61 2.11
0.00 0.00
-0.57 1.65
I have written the following code:
linenum = [7, 2, 6]
i = 1
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    for line1 in f5:
        if i in linenum:
            print(line1, end=' ', file=out)
            print(i, line1)
        i += 1
But it gives me output that looks like this:
2 0.00 0.00
6 -0.57 1.65
7 -0.61 2.11
What am I doing wrong?
Store the values as you encounter them in a dictionary d with the keys denoting the line number and the value holding the line contents. Write them to the file with writelines according to the order of linenum. Use enumerate(fileobj, 1) to get a line number for each line instead of an explicit counter like i:
linenum = [7, 2, 6]
d = {}
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    for num, line1 in enumerate(f5, 1):
        if num in linenum:
            d[num] = line1
    out.writelines([d[i] for i in linenum])
Of course, you can further trim this down with a dictionary comprehension:
linenum = [7, 2, 6]
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    d = {k: v for k, v in enumerate(f5, 1) if k in linenum}
    out.writelines([d[i] for i in linenum])
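One caveat with both versions: if an entry of linenum is larger than the number of lines in the file, the d[i] lookup raises a KeyError. If you need to tolerate that, you could guard the final line, for example:

# Skip requested line numbers that were not present in the file.
out.writelines([d[i] for i in linenum if i in d])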

Python matplotlib plotfile: explicitly use floating point numbers

I have a simple data file to plot.
Here are the contents of the data file, which I named "ttry":
0.27 0
0.28 0
0.29 0
0.3 0
0.31 0
0.32 0
0.33 0
0.34 0
0.35 0
0.36 0
0.37 0
0.38 0.00728737997257
0.39 0.0600137174211
0.4 0.11488340192
0.41 0.157321673525
0.42 0.193158436214
0.43 0.233882030178
0.44 0.273319615912
0.45 0.311556927298
0.46 0.349879972565
0.47 0.387602880658
0.48 0.424211248285
0.49 0.460390946502
0.5 0.494855967078
0.51 0.529406721536
0.52 0.561814128944
0.53 0.594307270233
0.54 0.624228395062
0.55 0.654492455418
0.56 0.683984910837
0.57 0.711762688615
0.58 0.739368998628
0.59 0.765775034294
0.6 0.790895061728
0.61 0.815586419753
0.62 0.840192043896
0.63 0.863082990398
0.64 0.886231138546
0.65 0.906292866941
0.66 0.915809327846
0.67 0.911436899863
0.68 0.908179012346
0.69 0.904749657064
0.7 0.899519890261
0.71 0.895147462277
0.72 0.891632373114
0.73 0.888803155007
0.74 0.884687928669
0.75 0.879029492455
0.76 0.876114540466
0.77 0.872170781893
0.78 0.867541152263
0.79 0.86274005487
0.8 0.858367626886
0.81 0.854080932785
0.82 0.850994513032
0.83 0.997170781893
0.84 1.13477366255
0.85 1.24296982167
0.86 1.32690329218
0.87 1.40397805213
0.88 1.46836419753
0.89 1.52306241427
0.9 1.53232167353
0.91 1.52906378601
0.92 1.52211934156
0.93 1.516718107
0.94 1.51543209877
0.95 1.50660150892
0.96 1.50137174211
0.97 1.49408436214
0.98 1.48816872428
0.99 1.48088134431
1 1.4723079561
And then I use matplotlib.pyplot.plotfile to plot it. Here is my Python script:
from matplotlib import pyplot
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ")
pyplot.show()
However, the following error appears:
C:\WINDOWS\system32\cmd.exe /c ttry.py
Traceback (most recent call last):
  File "E:\research\ttry.py", line 2, in <module>
    pyplot.plotfile("ttry",col=(0,1),delimiter=" ")
  File "C:\Python33\lib\site-packages\matplotlib\pyplot.py", line 2311, in plotfile
    checkrows=checkrows, delimiter=delimiter, names=names)
  File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in csv2rec
    rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
  File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in <listcomp>
    rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
  File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2031, in newfunc
    return func(val)
ValueError: invalid literal for int() with base 10: '0.00728737997257'
shell returned 1
Hit any key to close this window...
Obviously, Python just treats the y-axis data as int. So how do I tell Python that the y-axis data is float?
plotfile infers an int type for your second column based on the first few values, which are all ints. To make it check all rows before deciding on a type, add checkrows=0 to the arguments, that is:
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ", checkrows=0)
The checkrows argument comes from matplotlib.mlab.csv2rec; see its documentation for more info.
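Note that plotfile was deprecated and later removed in newer matplotlib releases. If you are on a recent version, here is a sketch of the same plot that reads both columns as floats explicitly with NumPy:

import numpy as np
from matplotlib import pyplot

# loadtxt parses the whitespace-separated columns as float,
# so there is no int type inference to worry about.
x, y = np.loadtxt("ttry", unpack=True)
pyplot.plot(x, y)
pyplot.show()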

Double conditional filtering of many CSV files in Python

I am stuck on the following task: combine all files matching a specific mask and remove duplicates based on two criteria: if the name and TEXT are identical, keep the row with the largest 4th column.
I currently have this not-well-tested code (based on my previous question), but because it uses a dictionary, it overwrites previously stored data that has the same name but a different TEXT. I was trying to use only lists.
How can I make the filtering on the two conditions simultaneously?
Your help is greatly appreciated.
import glob, csv

files = glob.glob("*.txt")
fo = open("combined.csv", "a")
writer = csv.writer(fo, delimiter=' ')
datum = []
nyt = set()
for f in files:
    with open(f) as fi:
        for row in csv.reader(fi, delimiter=' '):
            crow = row[0], row[4]
            nyt.add(crow)
            if crow in nyt:
                dupl = [element for element in datum if element[0] == row[0]]
                if dupl[0][3] < row[3]:
                    # replace row in datum with row
                if dupl[0][3] > row[3]:
                    continue
            else:
                datum.append(row)
Examples:
file1
name1 0.06 0.91 0.17 TEXT1 smthing smthing
name2 0.46 0.42 0.02 TEXT1 smthing smthing
name3 0.15 0.80 0.61 TEXT1 smthing smthing
file2
name1 0.68 0.38 0.61 TEXT2 smthing smthing
name2 0.73 0.62 0.03 TEXT2 smthing smthing
name3 0.84 0.81 0.60 TEXT2 smthing smthing
file3
name1 0.86 0.18 0.03 TEXT1 smthing smthing
name2 0.04 0.12 0.75 TEXT1 smthing smthing
name3 0.59 0.70 0.71 TEXT1 smthing smthing
I was overthinking it; the quick solution is to combine the two values into a unique key for the dict:
import glob, csv

files = glob.glob("*.txt")
fo = open("combined.csv", "a")
writer = csv.writer(fo, delimiter=' ')
datum = {}
for f in files:
    with open(f) as fi:
        for row in csv.reader(fi, delimiter=' '):
            crow = row[0], row[4]
            if crow in datum:
                if float(datum[crow][-4]) < float(row[3]):
                    datum[crow] = row[0:]
            else:
                datum[crow] = row[0:]
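Note that this loop only builds the filtered rows in datum; to finish, write them out with the writer created above and close the file, for example:

# Write the surviving rows to combined.csv and close the output file.
writer.writerows(datum.values())
fo.close()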
