I am opening a binary file and writing it out to a txt file. I need to remove some lines before writing them to the text file, so I do this with the following code:
with open(rawfile) as f, open(outfile, 'w') as f2:
    for x in f:
        if ':' not in x and 'Station' not in x and '--' not in x and 'hPa' not in x:
            f2.write(x.strip() + '\n')
This produces several columns of data like:
Header Header Header Header
Data Data Data Data
Data Data Data Data
As you can see, there is a varying number of spaces between the strings in each column/row. While keeping the columns and rows intact, all I need to do is replace each varying run of spaces with exactly two spaces between all data strings. I tried adding another for loop, but could not figure out how to keep it all in one file and have it overwrite the file every time.
EDIT WITH DATA SNIPPET:
PRES HGHT TEMP DWPT FRPT RELH RELI MIXR DRCT SKNT
1002.0 53 32.4 23.4 23.4 59 59 18.47 345 10
1000.0 65 31.0 19.0 19.0 49 49 14.03 340 10
999.0 74 30.4 19.4 19.4 52 52 14.41 339 10
996.0 101 30.1 19.5 19.5 53 53 14.54 335 10
972.0 318 28.1 20.3 20.3 63 63 15.64 355 10
959.0 437 26.9 20.7 20.7 69 69 16.29 340 11
932.0 692 24.5 21.6 21.6 84 84 17.74 0 5
931.0 1 5
925.0 10 5
888.0 12.88 95 13
Additionally, I cannot have the data become bunched up across those large blank spaces. Data needs to remain under its associated header (this is a text file, not a dataframe).
You could do:
header_len = 0
for x in f:
    if ':' not in x and 'Station' not in x and '--' not in x and 'hPa' not in x:
        line = x.strip() + '\n'
        line_parts = line.split(' ')
        if header_len == 0:
            # remember the width of the first header field
            header_len = len(line_parts[0])
        elif len(line_parts[0]) < header_len:
            # pad short fields out to the header width
            line_parts = [lp + ' ' * (header_len - len(lp)) for lp in line_parts]
        while '' in line_parts:
            line_parts.remove('')
        f2.write(' '.join(line_parts))
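If that still lets short rows drift out from under their headers, another option is to treat the cleaned lines as fixed-width. Here is a minimal two-pass sketch, assuming each column begins where its header token begins (rawfile and outfile as in the question):
import re

# First pass: keep only the wanted lines.
with open(rawfile) as f:
    keep = [x.rstrip('\n') for x in f
            if ':' not in x and 'Station' not in x
            and '--' not in x and 'hPa' not in x]

# Slice every row at the header's column positions so empty cells survive.
starts = [m.start() for m in re.finditer(r'\S+', keep[0])]
cols = [slice(a, b) for a, b in zip(starts, starts[1:] + [None])]
rows = [[line[c].strip() for c in cols] for line in keep]

# Pad each column to its widest entry and join with exactly two spaces,
# so data stays under its header and gaps from missing values survive.
widths = [max(len(r[i]) for r in rows) for i in range(len(cols))]
with open(outfile, 'w') as f2:
    for r in rows:
        f2.write('  '.join(v.rjust(w) for v, w in zip(r, widths)) + '\n')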
I am parsing specific columns from a text file with data that looks like this:
n Elapsed time TimeUTC HeightMSL GpsHeightMSL P Temp RH Dewp Dir Speed Ecomp Ncomp Lat Lon
s hh:mm:ss m m hPa °C % °C ° m/s m/s m/s ° °
1 0 23:15:43 198 198 978.5 33.70 47 20.87 168.0 7.7 -1.6 7.6 32.835222 -97.297940
2 1 23:15:44 202 201 978.1 33.03 48 20.62 162.8 7.3 -2.2 7.0 32.835428 -97.298000
3 2 23:15:45 206 206 977.6 32.89 48 20.58 160.8 7.5 -2.4 7.0 32.835560 -97.298077
4 3 23:15:46 211 211 977.1 32.81 49 20.58 160.3 7.8 -2.6 7.4 32.835660 -97.298160
5 4 23:15:47 217 217 976.5 32.74 49 20.51 160.5 8.3 -2.7 7.8 32.835751 -97.298242
6 5 23:15:48 223 223 975.8 32.66 48 20.43 160.9 8.7 -2.8 8.2 32.835850 -97.298317
I perform one calculation on the first m/s column (converting m/s to kt) and write all data where hpa > 99.9 to an output file. That output looks like this:
978.5,198,33.7,20.87,168.0,14.967568
978.1,201,33.03,20.62,162.8,14.190032
977.6,206,32.89,20.58,160.8,14.5788
977.1,211,32.81,20.58,160.3,15.161952
976.5,217,32.74,20.51,160.5,16.133872
975.8,223,32.66,20.43,160.9,16.911407999999998
The code executes fine and the output file works for what I'm using it for, but is there a way to format each column's output to a specific number of decimal places? As you can see in my code, I've tried df.round, but it doesn't affect the output. I've also looked at the float_format parameter, but that seems like it would apply the same format to all columns. My intended output should look like this:
978.5, 198, 33.7, 20.9, 168, 15
978.1, 201, 33.0, 20.6, 163, 14
977.6, 206, 32.9, 20.6, 161, 15
977.1, 211, 32.8, 20.6, 160, 15
976.5, 217, 32.7, 20.5, 161, 16
975.8, 223, 32.7, 20.4, 161, 17
My code is below:
import pandas as pd
headers = ['n', 's', 'time', 'm1', 'm2', 'hpa', 't', 'rh', 'td', 'dir', 'spd', 'u', 'v', 'lat', 'lon']
df = pd.read_csv ('edt_20220520_2315.txt', encoding_errors = 'ignore', skiprows = 2, sep = '\s+', names = headers)
df['spdkt'] = df['spd'] * 1.94384
df['hpa'].round(decimals = 1)
df['spdkt'].round(decimals = 0)
df['t'].round(decimals = 1)
df['td'].round(decimals = 1)
df['dir'].round(decimals = 0)
extract = ['hpa', 'm2', 't', 'td', 'dir', 'spdkt']
with open('test_output.txt', 'w') as fh:
    df_to_write = df[df['hpa'] > 99.9]
    df_to_write.to_csv(fh, header = None, index = None, columns = extract, sep = ',')
You can pass a dictionary to round, and then cast the columns rounded to 0 decimals to integers:
d = {'hpa':1, 'spdkt':0, 't':1, 'td':1, 'dir':0}
df = df.round(d).astype({k:'int' for k, v in d.items() if v == 0})
print (df)
n s time m1 m2 hpa t rh td dir spd u v \
0 1 0 23:15:43 198 198 978.5 33.7 47 20.9 168 7.7 -1.6 7.6
1 2 1 23:15:44 202 201 978.1 33.0 48 20.6 163 7.3 -2.2 7.0
2 3 2 23:15:45 206 206 977.6 32.9 48 20.6 161 7.5 -2.4 7.0
3 4 3 23:15:46 211 211 977.1 32.8 49 20.6 160 7.8 -2.6 7.4
4 5 4 23:15:47 217 217 976.5 32.7 49 20.5 160 8.3 -2.7 7.8
5 6 5 23:15:48 223 223 975.8 32.7 48 20.4 161 8.7 -2.8 8.2
lat lon spdkt
0 32.835222 -97.297940 15
1 32.835428 -97.298000 14
2 32.835560 -97.298077 15
3 32.835660 -97.298160 15
4 32.835751 -97.298242 16
5 32.835850 -97.298317 17
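The reason the original round calls had no effect is that Series.round returns a new Series rather than modifying the dataframe in place. With the dictionary approach assigned back to df, the question's own output step then produces the rounded file; a sketch using the same names:
d = {'hpa': 1, 'spdkt': 0, 't': 1, 'td': 1, 'dir': 0}
df = df.round(d).astype({k: 'int' for k, v in d.items() if v == 0})

extract = ['hpa', 'm2', 't', 'td', 'dir', 'spdkt']
df[df['hpa'] > 99.9].to_csv('test_output.txt', header=None, index=None,
                            columns=extract, sep=',')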
While trying to read a .txt file with pandas, the imported data comes out as only one row with far too many columns.
This is one row from the data
1 182154.6-025557 18:21:54.63 -02:55:57.2 0.0 8.25e-03 1.5e-02 0.20 1.02e-01 -1.95e-01 1.5e-02 55 37 189 0.0 1.53e-01 3.3e-02 0.16 6.32e-01 7.24e-01 6.5e-02 46 29 59 6.2 2.91e-01 5.8e-02 0.17 4.62e-01 6.83e-01 7.0e-02 37 20 54 6.3 3.27e-01 6.2e-02 0.19 3.92e-01 5.51e-01 6.6e-02 37 26 47 0.0 2.28e-01 9.8e-02 0.12 2.50e-01 9.8e-02 46 36 43 7.6 1.1 0.24 0.5 4.6 40 22 36 2 0 starless
I'm using the following code to import the data:
data = pd.read_csv("data.txt", header=None, sep='\t', lineterminator='\r')
And this outputs:
0 1 ... 26254 26255
0 1 182154.6-025557 18:21:54.63 ... NaN CO high-V_LSR\n
[1 rows x 26256 columns]
Any advice on how to import this data correctly would be very helpful
Perhaps your .txt file isn't quite tab-separated.
This code should work for reading in multiple lines from your file; it splits each line on spaces and drops the empty strings:
with open('data.txt', 'r') as f:
    raw_data = f.readlines()

data = []
for line in raw_data:
    data.append([l for l in line.strip().split(' ') if l != ''])

pd.DataFrame(data)
I get the following output (dataframe with 63 columns)
0 1 2 3 4 5 6 \
0 182154.6-025557 18:21:54.63 -02:55:57.2 0.0 8.25e-03 1.5e-02 0.20
7 8 9 ... 53 54 55 56 57 58 59 60 61 \
0 1.02e-01 -1.95e-01 1.5e-02 ... 1.1 0.24 0.5 4.6 40 22 36 2 0
62
0 starless
[1 rows x 63 columns]
Either that or you want to try…
data = pd.read_csv("data.txt", header=None, sep='\t', lineterminator='\n')
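If the file really is separated by runs of spaces rather than tabs (as the manual split above suggests), pandas can also do that splitting itself; a minimal sketch:
import pandas as pd

# sep=r'\s+' treats any run of whitespace as a single delimiter
data = pd.read_csv("data.txt", header=None, sep=r'\s+')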
I am trying to find a good way to calculate mean values from values in a dataframe. It contains measured data from an experiment and is imported from an excel sheet. The columns contain the time passed by, electric current and the corresponding voltage.
The current is changed in steps and then held for some time (the current values vary a little, so they are not exactly the same for each step). Now I want to calculate the mean voltage for each current step. Since it takes some time for the voltage to stabilize after a step, I also want to leave out the first few voltage values after each step.
Currently I am doing this with loops, but I was wondering whether there is a nicer way using the groupby function (or others maybe).
Just say if you need more details or clarification.
Example of data:
s [A] [V]
0 6.0 -0.001420 0.780122
1 12.0 -0.002484 0.783297
2 18.0 -0.001478 0.785870
3 24.0 -0.001256 0.793559
4 30.0 -0.001167 0.806086
5 36.0 -0.000982 0.815364
6 42.0 -0.003038 0.825018
7 48.0 -0.001174 0.831739
8 54.0 0.000478 0.838861
9 60.0 -0.001330 0.846086
10 66.0 -0.001456 0.851556
11 72.0 0.000764 0.855950
12 78.0 -0.000916 0.859778
13 84.0 -0.000916 0.859778
14 90.0 -0.001445 0.863569
15 96.0 -0.000287 0.864303
16 102.0 0.000056 0.865080
17 108.0 -0.001119 0.865642
18 114.0 -0.000843 0.866434
19 120.0 -0.000997 0.866809
20 126.0 -0.001243 0.866964
21 132.0 -0.002238 0.867180
22 138.0 -0.001015 0.867177
23 144.0 -0.000604 0.867505
24 150.0 0.000507 0.867571
25 156.0 -0.001569 0.867525
26 162.0 -0.001569 0.867525
27 168.0 -0.001131 0.866756
28 174.0 -0.001567 0.866884
29 180.0 -0.002645 0.867240
.. ... ... ...
242 1708.0 24.703866 0.288902
243 1714.0 26.469208 0.219226
244 1720.0 26.468838 0.250437
245 1726.0 26.468681 0.254972
246 1732.0 26.468173 0.271525
247 1738.0 26.468260 0.247282
248 1744.0 26.467666 0.296894
249 1750.0 26.468085 0.247300
250 1756.0 26.468085 0.247300
251 1762.0 26.467808 0.261096
252 1768.0 26.467958 0.259615
253 1774.0 26.467828 0.260871
254 1780.0 28.232325 0.185291
255 1786.0 28.231697 0.197642
256 1792.0 28.231170 0.172802
257 1798.0 28.231103 0.170685
258 1804.0 28.229453 0.184009
259 1810.0 28.230816 0.181833
260 1816.0 28.230913 0.188348
261 1822.0 28.230609 0.178440
262 1828.0 28.231144 0.168507
263 1834.0 28.231144 0.168507
264 1840.0 8.813723 0.641954
265 1846.0 8.814301 0.652373
266 1852.0 8.818517 0.651234
267 1858.0 8.820255 0.637536
268 1864.0 8.821443 0.628136
269 1870.0 8.823643 0.636616
270 1876.0 8.823297 0.635422
271 1882.0 8.823575 0.622253
Output:
s [A] [V]
0 303.000000 -0.000982 0.857416
1 636.000000 0.879220 0.792504
2 699.000000 1.759356 0.752446
3 759.000000 3.519479 0.707161
4 816.000000 5.278372 0.669020
5 876.000000 7.064800 0.637848
6 939.000000 8.828799 0.611196
7 999.000000 10.593054 0.584402
8 1115.333333 12.357359 0.556127
9 1352.000000 14.117167 0.528826
10 1382.000000 15.882287 0.498577
11 1439.000000 17.646748 0.468379
12 1502.000000 19.410817 0.437342
13 1562.666667 21.175572 0.402381
14 1621.000000 22.939826 0.365724
15 1681.000000 24.704600 0.317134
16 1744.000000 26.468235 0.256047
17 1807.000000 28.231037 0.179606
18 1861.000000 8.819844 0.638190
The current approach:
import math
import pandas as pd

df = df[['s','[A]','[V]']]

#Looping over the rows to separate current points
b = df['[A]'].iloc[0]
start = 0
steps = []  # renamed from 'list', which shadows the builtin
for index, row in df.iterrows():
    if not math.isclose(row['[A]'], b, abs_tol=1e-02):
        b = row['[A]']
        steps.append(df.iloc[start:index])
        start = index
steps.append(df.iloc[start:])

#Deleting first few points after each current change
steps_b = []
for l in steps:
    steps_b.append(l.iloc[3:])

#Calculating mean values for each current point
steps_c = []
for l in steps_b:
    steps_c.append(l.mean())
result = pd.DataFrame(steps_c)
Does this help?
df.groupby(['Columnname', 'Columnname2']).mean()
You may need to create intermediate dataframes for each step. Can you provide an example of the output you want?
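For comparison, here is a sketch of a loop-free groupby version. It assumes, like the question's code, that a jump of more than 0.01 A marks a new current step, and that the first three rows of each step should be dropped while the voltage settles:
import pandas as pd

# Label each row with a step number: a new step starts whenever the
# current jumps by more than the tolerance.
step = (df['[A]'].diff().abs() > 0.01).cumsum()

# Drop the first 3 rows of every step, then average what remains.
settled = df[df.groupby(step).cumcount() >= 3]
result = settled.groupby(step[settled.index]).mean().reset_index(drop=True)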
This is the first time I have used pandas and I do not really know how to approach my problem.
I have two dataframes:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs.
What I need to do first is split all my clusters (in R it would be: liste=split(x = data$V2, f = data$V1)).
Then, create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column gives the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues, it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' with sequences from two different clusters. In other words: in this solution the cluster ID of a pairing is looked up via only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substrings in their seqid, the most readable way is to add columns holding those substrings:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec values can be filtered out the same way as was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if there are several exactly equal maximum values, only one of them is taken into account.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly:
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
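Putting those steps together, a compact sketch (same column names as above; .str.split is just a vectorized form of the list comprehensions):
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
data = data[data['qseqid'] != data['sseqid']]         # drop self-pairings
data['qspec'] = data['qseqid'].str.split('_').str[1]
data['sspec'] = data['sseqid'].str.split('_').str[1]
data = data[data['qspec'] != data['sspec']]           # drop same-spec pairings
result = data.loc[data.groupby('cluster_name')['pident'].idxmax()]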
This filters out any rows with a 100% match, merges the dataframes first on sseqid and then on qseqid, and returns results_df. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names', right_on='sseqid')
results_df = pandas.concat([results_df, cluster.merge(blast, left_on='seq_names', right_on='qseqid')])
I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). The second column (sseqid) repeats with different values in the 11th and 12th columns, which are evalue and bitscore, respectively.
The rows I would like to get are the ones having the lowest evalue and the highest bitscore (when evalues are the same; the rest of the columns can be ignored, and the data is down below).
So, I have made a short piece of code which uses the second column as the key for a dictionary. I can get five different items from the second column, with lists of qseqid+evalue and qseqid+bitscore.
Here is the code:
#!/usr/bin/python
filename = "data.txt"
readfile = open(filename, "r")
d = dict()
for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0], i[10]])
    d.setdefault(i[1], []).append([i[0], i[11]])

for x in d:
    print(x, d[x])

readfile.close()
But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid.
Is there any good logic to solve the problem?
The data.txt file (including the header row, with » representing tab characters):
qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732
And here's a little more readable version of the table's contents:
0 1 2 3 4 5 6 7 8 9 10 11
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ACLA_022040 TBB 32.71 431 258 8 39 468 24 423 2.00E-76 240
ACLA_024600 TBB 80 435 87 0 1 435 1 435 0 729
ACLA_031860 TBB 39.74 453 251 3 1 447 1 437 1.00E-121 357
ACLA_046030 TBB 75.81 434 105 0 1 434 1 434 0 704
ACLA_072490 TBB 41.7 446 245 3 4 447 3 435 2.00E-120 353
ACLA_010400 EF1A 27.31 249 127 8 69 286 9 234 3.00E-13 61.6
ACLA_015630 EF1A 22 491 255 17 186 602 3 439 8.00E-19 78.2
ACLA_016510 EF1A 26.23 122 61 4 21 127 9 116 2.00E-08 46.2
ACLA_023300 EF1A 29.31 447 249 12 48 437 3 439 2.00E-45 155
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0 801
ACLA_074730 CALM 23.13 147 101 4 6 143 2 145 7.00E-08 41.2
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1.00E-13 55.1
ACLA_016630 CALM 23.9 159 106 5 58 216 4 147 5.00E-12 51.2
ACLA_031930 RPB2 36.87 1226 633 24 121 1237 26 1219 0 734
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0 1691
ACLA_082370 RPB2 27.69 1228 667 37 31 1132 35 1167 7.00E-110 365
ACLA_061960 ACT 28.57 147 95 5 146 284 69 213 3.00E-12 57.4
ACLA_068200 ACT 28.73 463 231 13 16 471 4 374 1.00E-53 176
ACLA_069960 ACT 24.11 141 97 4 581 718 242 375 9.00E-09 46.2
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0 732
Since you're a Python newbie I'm glad that there are several examples of how to do this manually, but for comparison I'll show how it can be done using the pandas library, which makes working with tabular data much simpler.
Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.
import pandas as pd

df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort_values(["evalue", "bitscore"], ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()
which produces
>>> df_new
sseqid qseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
0 ACT ACLA_095800 91.73 375 31 0 1 375 1 375 0.000000e+00 732.0
1 CALM ACLA_096170 29.33 150 96 4 34 179 2 145 1.000000e-13 55.1
2 EF1A ACLA_028450 85.55 443 63 1 1 443 1 442 0.000000e+00 801.0
3 RPB2 ACLA_065630 65.79 1257 386 14 1 1252 4 1221 0.000000e+00 1691.0
4 TBB ACLA_024600 80.00 435 87 0 1 435 1 435 0.000000e+00 729.0
Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.
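An equivalent way to take the best row per sseqid from the sorted frame, for what it's worth, is drop_duplicates, which keeps the first occurrence of each sseqid:
df_new = df.sort_values(["evalue", "bitscore"],
                        ascending=[True, False]).drop_duplicates("sseqid")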
#!/usr/bin/python
import csv

DATA = "data.txt"

class Sequence:
    def __init__(self, row):
        self.qseqid = row[0]
        self.sseqid = row[1]
        self.pident = float(row[2])
        self.length = int(row[3])
        self.mismatch = int(row[4])
        self.gapopen = int(row[5])
        self.qstart = int(row[6])
        self.qend = int(row[7])
        self.sstart = int(row[8])
        self.send = int(row[9])
        self.evalue = float(row[10])
        self.bitscore = float(row[11])

    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)

def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)
        # skip header rows
        for i in range(header_rows):
            next(incsv)
        for row in incsv:
            yield dtype(row)

def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)
        if (
            prev is None
            or seq.evalue < prev.evalue
            or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
        ):
            bestseq[seq.sseqid] = seq

    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])

if __name__ == "__main__":
    main()
which results in
ACLA_095800 ACT 91.73 375 31 0 1 375 1 375 0.0 732.0
ACLA_096170 CALM 29.33 150 96 4 34 179 2 145 1e-13 55.1
ACLA_028450 EF1A 85.55 443 63 1 1 443 1 442 0.0 801.0
ACLA_065630 RPB2 65.79 1257 386 14 1 1252 4 1221 0.0 1691.0
ACLA_024600 TBB 80.0 435 87 0 1 435 1 435 0.0 729.0
While not nearly as elegant and concise as using the pandas library, it's quite possible to do what you want without resorting to third-party modules. The following uses the collections.defaultdict class to facilitate creation of dictionaries of variable-length lists of records. The use of the AttrDict class is optional, but it makes accessing the fields of each dictionary-based record easier and less awkward-looking than the usual dict['fieldname'] syntax otherwise required.
import csv
from collections import defaultdict, namedtuple
from itertools import imap
from operator import itemgetter

data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)

# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()), 1)) for col in columns]

    def comparer(left, right):
        for fn, mult in comparers:
            result = cmp(fn(left), fn(right))
            if result:
                return mult * result
        else:
            return 0

    return sorted(items, cmp=comparer)

# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

with open(data_file_name, 'rb') as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)

for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with same ssqeid. The first record after sorting
    # will be the one sought that has the lowest evalue and highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print format_spec.format(**selected)
Output (» represents tabs):
ACLA_095800» ACT» 91.73» 375» 31» 0» 1» 375» 1» 375» 0.0» 732.0
ACLA_096170» CALM» 29.33» 150» 96» 4» 34» 179» 2» 145» 1e-13» 55.1
ACLA_028450» EF1A» 85.55» 443» 63» 1» 1» 443» 1» 442» 0.0» 801.0
ACLA_065630» RPB2» 65.79» 1257» 386» 14» 1» 1252» 4» 1221» 0.0» 1691.0
ACLA_024600» TBB» 80» 435» 87» 0» 1» 435» 1» 435» 0.0» 729.0
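Note that cmp and the cmp= argument to sorted exist only in Python 2. Under Python 3, the multikeysort helper above could be replaced by a plain key function, since evalue and bitscore were already converted to floats:
# sort ascending by evalue, descending by bitscore; take the best record
selected = sorted(ssqeid_dict[ssqeid],
                  key=lambda r: (r['evalue'], -r['bitscore']))[0]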
filename = 'data.txt'
readfile = open(filename, 'r')
sseqid = []
lines = []
for i in readfile.readlines()[1:]:   # [1:] skips the header row
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())
sorted_sseqid = sorted(set(sseqid))

sdqDict = {}
for sorted_ssqd in sorted_sseqid:
    key = sorted_ssqd
    evalue = []
    bitscore = []
    qseid = []
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key] = [qseid, evalue, bitscore]

print(sdqDict)

# compare the evalues numerically, since they are stored as strings
print('TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1], key=float))
##I think you can do the list manipulation below to find out the qseqid

readfile.close()
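Following that hint through, one sketch for recovering the qseqid is to pick the index of the smallest evalue (again comparing as floats) and use it to look up the matching sequence ID:
evalues = sdqDict['TBB'][1]
best = min(range(len(evalues)), key=lambda i: float(evalues[i]))
print('TBB BEST QSEQID' + '---->' + sdqDict['TBB'][0][best])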