DataFrame max() not returning max - python

Real beginner question here, but it is so simple, I'm genuinely stumped. Python/DataFrame newbie.
I've loaded a DataFrame from a Google Sheet; however, any graphing or attempted calculations generate bogus results. Loading code:
# Setup
!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
This seems to work fine, and the data looks to be correctly loaded when I print out the DataFrame, but running max() returns obviously false results. For example:
print(df[0])
print(df[0].max())
Will output:
0 3.385
1 0.48
2 1.35
3 465
4 36.33
5 27.66
6 14.83
7 1.04
8 4.19
9 0.425
10 0.101
11 0.92
12 1
13 0.005
14 0.06
15 3.5
16 2
17 1.7
18 2547
19 0.023
20 187.1
21 521
22 0.785
23 10
24 3.3
25 0.2
26 1.41
27 529
28 207
29 85
...
32 6654
33 3.5
34 6.8
35 35
36 4.05
37 0.12
38 0.023
39 0.01
40 1.4
41 250
42 2.5
43 55.5
44 100
45 52.16
46 10.55
47 0.55
48 60
49 3.6
50 4.288
51 0.28
52 0.075
53 0.122
54 0.048
55 192
56 3
57 160
58 0.9
59 1.62
60 0.104
61 4.235
Name: 0, Length: 62, dtype: object
Max: 85
Obviously, the maximum value is way out -- it should be 6654, not 85.
What on earth am I doing wrong?
First StackOverflow post, so thanks in advance.

If you check it, you'll see at the end of your print() that dtype=object. Also, you'll notice your pandas Series has "int" values mixed with "float" values (e.g. you have 6654 and 3.5 in the same Series).
These are good hints that you have a Series of strings, and that max() here compares based on string comparison. You want, however, a Series of numbers (specifically floats) compared numerically.
Check the following reproducible example:
>>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
>>> df.col.max()
'9'
You can check that because
>>> '9' > '85'
True
You want these values to be considered floats instead. Use pd.to_numeric
>>> df['col'] = pd.to_numeric(df.col)
>>> df.col.max()
85
For more on str and int comparison, check this question
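Applied to the loading code in the question, a minimal sketch of the fix (assuming every column in the sheet is meant to be numeric; errors='coerce' simply turns anything non-numeric into NaN) would be:
import pandas as pd
# gspread's get_all_values() returns every cell as a string,
# so convert each column after building the DataFrame.
df = pd.DataFrame.from_records(rows)
df = df.apply(pd.to_numeric, errors='coerce')
print(df[0].max())  # now compared as floats: 6654.0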

Related

Read tab separated txt file in pandas (python)

While trying to read a .txt file in pandas, I'm getting an error where the imported file is only one row but has far too many columns.
This is one row from the data
1 182154.6-025557 18:21:54.63 -02:55:57.2 0.0 8.25e-03 1.5e-02 0.20 1.02e-01 -1.95e-01 1.5e-02 55 37 189 0.0 1.53e-01 3.3e-02 0.16 6.32e-01 7.24e-01 6.5e-02 46 29 59 6.2 2.91e-01 5.8e-02 0.17 4.62e-01 6.83e-01 7.0e-02 37 20 54 6.3 3.27e-01 6.2e-02 0.19 3.92e-01 5.51e-01 6.6e-02 37 26 47 0.0 2.28e-01 9.8e-02 0.12 2.50e-01 9.8e-02 46 36 43 7.6 1.1 0.24 0.5 4.6 40 22 36 2 0 starless
I'm using the following code to import the data:
data = pd.read_csv("data.txt", header=None, sep='\t', lineterminator='\r')
And this outputs:
0 1 ... 26254 26255
0 1 182154.6-025557 18:21:54.63 ... NaN CO high-V_LSR\n
[1 rows x 26256 columns]
Any advice on how to import this data correctly would be very helpful
Perhaps your .txt file isn't quite tab separated.
This code should work for reading in multiple lines from your file. It just splits items if there is whitespace between them.
with open('data.txt', 'r') as f:
    raw_data = f.readlines()

data = []
for line in raw_data:
    data.append([l for l in line.strip().split(' ') if l != ''])

pd.DataFrame(data)
I get the following output (dataframe with 63 columns)
0 1 2 3 4 5 6 \
0 182154.6-025557 18:21:54.63 -02:55:57.2 0.0 8.25e-03 1.5e-02 0.20
7 8 9 ... 53 54 55 56 57 58 59 60 61 \
0 1.02e-01 -1.95e-01 1.5e-02 ... 1.1 0.24 0.5 4.6 40 22 36 2 0
62
0 starless
[1 rows x 63 columns]
Either that or you want to try…
data = pd.read_csv("data.txt", header=None, sep='\t', lineterminator='\n')
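If the columns are actually separated by runs of spaces rather than literal tabs, pandas can also do the splitting itself. A hedged one-liner, roughly equivalent to the manual split above (like the manual split, it will break a multi-word field such as "CO high-V_LSR" into separate columns):
import pandas as pd
# r'\s+' treats any run of whitespace (spaces or tabs) as one separator
data = pd.read_csv("data.txt", header=None, sep=r'\s+')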

compare 2 dataframes with pandas

It is the first time I am using pandas and I do not really know how to deal with my problem.
In fact I have 2 data frames:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs.
What I need to do first is to split all my clusters (in R it would be something like: liste=split(x = data$V2, f = data$V1) ).
And then create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the third column gives the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' with sequences from two different clusters. In other words: in this solution the cluster ID of a pairing is determined from only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which share the same substring in their seqids, the most readable way is to add columns holding these substrings:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec values can be filtered out the same way as was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption above, is that if there are several exactly equal max values, only one of them is taken into account.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly:
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
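Putting the steps above together, a minimal end-to-end sketch (assuming the column names cluster_name, seq_names, qseqid, sseqid and pident shown in the question) could look like this:
import pandas

blast = pandas.read_table("blast")
cluster = pandas.read_table("cluster")

# attach the cluster ID to every pairing via the query sequence
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')

# drop self-pairings and pairings whose IDs share the same suffix
data = data[data['qseqid'] != data['sseqid']]
data = data[data['qseqid'].str.split('_').str[1] != data['sseqid'].str.split('_').str[1]]

# for each cluster, keep the row with the highest percent identity
best = data.loc[data.groupby('cluster_name')['pident'].idxmax()]
print(best[['cluster_name', 'qseqid', 'sseqid', 'pident']])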
This first filters out any hits with a 100% match, then merges the dataframes on sseqid and on qseqid and combines the results into results_df. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = pandas.concat([
    cluster.merge(blast, left_on='seq_names', right_on='sseqid'),
    cluster.merge(blast, left_on='seq_names', right_on='qseqid'),
])

how to slice a csv file using python's pandas

I have this csv file:
DATE RELEASE 10GB 100GB 200GB 400GB 600GB 800GB 1000GB
5/5/16 2.67 0.36 4.18 8.54 18 27 38 46
5/5/16 2.68 0.5 4.24 9.24 18 27 37 46
5/6/16 2.69 0.32 4.3 9.18 19 29 37 46
5/6/16 2.7 0.35 4.3 9.3 19 28 37 46
5/6/16 2.71 0.3 4.18 8.18 16 24 33 41
I need to calculate the difference of each column (10GB ~ 1000GB) between release 2.71 and release 2.70, i.e. the last row minus the row above.
My code to access each desired row is: row1 = df[df.RELEASE == 2.70] and row2 = df[df.RELEASE == 2.71]
My issue is: I do not know how to access each element. I tried to put row1 and row2 into lists, list(row1) and list(row2), but that only prints the column names rather than the value of each cell.
My question is: how do I access each element of the desired rows, so I can calculate "0.3 - 0.35"? Thanks for helping!
If it's really the last two rows you're chasing, try out negative indexing:
df.iloc[-1] - df.iloc[-2]
I'm on a phone so haven't run the code but it should get you closer.
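If you would rather select the rows by release number (as in your row1/row2 attempt), a small sketch along the same lines (the file name and whitespace separator are assumptions about your csv) would be:
import pandas as pd

# read the whitespace-separated file; column names come from the header row
df = pd.read_csv("data.csv", sep=r'\s+')

# pick the two releases and drop the non-numeric columns
row_270 = df.loc[df.RELEASE == 2.70].drop(columns=['DATE', 'RELEASE']).iloc[0]
row_271 = df.loc[df.RELEASE == 2.71].drop(columns=['DATE', 'RELEASE']).iloc[0]

# column-wise difference, e.g. 0.3 - 0.35 for the 10GB column
print(row_271 - row_270)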

How to write code in a vectorized way instead of using loops?

I would like to write the following code in a vectorized way as the current code is pretty slow (and would like to learn Python best practices). Basically, the code is saying that if today's value is within 10% of yesterday's value, then today's value (in a new column) is the same as yesterday's value. Otherwise, today's value is unchanged:
def test(df):
    df['OldCol'] = (100, 115, 101, 100, 99, 70, 72, 75, 78, 80, 110)
    df['NewCol'] = df['OldCol']
    for i in range(1, len(df)-1):
        if df['OldCol'][i]/df['OldCol'][i-1] > 0.9 and df['OldCol'][i]/df['OldCol'][i-1] < 1.1:
            df['NewCol'][i] = df['NewCol'][i-1]
        else:
            df['NewCol'][i] = df['OldCol'][i]
    return df['NewCol']
The output should be the following:
OldCol NewCol
0 100 100
1 115 115
2 101 101
3 100 101
4 99 101
5 70 70
6 72 70
7 75 70
8 78 70
9 80 70
10 110 110
Can you please help?
I would like to use something like this but I did not manage to solve my issue:
def test(df):
    df['NewCol'] = df['OldCol']
    cond = np.where((df['OldCol'].shift(1)/df['OldCol'] > 0.9) & (df['OldCol'].shift(1)/df['OldCol'] < 1.1))
    df['NewCol'][cond[0]] = df['NewCol'][cond[0]-1]
    return df
A solution in three steps:
df['variation']=(df.OldCol/df.OldCol.shift())
df['gap']=~df.variation.between(0.9,1.1)
df['NewCol']=df.OldCol.where(df.gap).fillna(method='ffill')
Which gives:
OldCol variation gap NewCol
0 100 nan True 100
1 115 1.15 True 115
2 101 0.88 True 101
3 100 0.99 False 101
4 99 0.99 False 101
5 70 0.71 True 70
6 72 1.03 False 70
7 75 1.04 False 70
8 78 1.04 False 70
9 80 1.03 False 70
10 110 1.38 True 110
It seems to be 30x faster than loops on this example.
In one line:
x=df.OldCol;df['NewCol']=x.where(~(x/x.shift()).between(0.9,1.1)).fillna(method='ffill')
You should boolean mask your original dataframe:
df[(0.9 <= df['NewCol']/df['OldCol']) & (df['NewCol']/df['OldCol'] <= 1.1)]
will give you all rows where NewCol is within 10% of OldCol.
So to set the NewCol field in these rows:
within_10 = df[(0.9 <= df['NewCol']/df['OldCol']) & (df['NewCol']/df['OldCol'] <= 1.1)]
within_10['NewCol'] = within_10['OldCol']
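Note that assigning to the result of a boolean filter like within_10 works on a copy, so the change may not make it back into df (pandas will usually warn with a SettingWithCopyWarning). A hedged variant that writes in place uses .loc with the same mask:
# boolean mask: rows where NewCol is within 10% of OldCol
mask = (0.9 <= df['NewCol']/df['OldCol']) & (df['NewCol']/df['OldCol'] <= 1.1)
df.loc[mask, 'NewCol'] = df.loc[mask, 'OldCol']  # write back on the original frame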
Since you seem to be well on the way to finding the "jump" days yourself, I'll only show the trickier bit. So let's assume you have a numpy array old of length N and a boolean numpy array jumps of the same size. As a matter of convention, the zeroth element of jumps is set to True. Then you can first calculate the numbers of repeats between jumps:
jump_indices = np.where(jumps)[0]
repeats = np.diff(np.r_[jump_indices, [N]])
once you have these you can use np.repeat:
new = np.repeat(old[jump_indices], repeats)
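Putting that together with the 10% rule, a minimal self-contained sketch (using the sample values from the question) might be:
import numpy as np

old = np.array([100, 115, 101, 100, 99, 70, 72, 75, 78, 80, 110], dtype=float)
N = len(old)

# a "jump" day moved by more than 10% relative to the previous day
ratio = old[1:] / old[:-1]
jumps = np.r_[[True], (ratio <= 0.9) | (ratio >= 1.1)]

# repeat each jump-day value until the next jump
jump_indices = np.where(jumps)[0]
repeats = np.diff(np.r_[jump_indices, [N]])
new = np.repeat(old[jump_indices], repeats)

print(new)  # [100. 115. 101. 101. 101. 70. 70. 70. 70. 70. 110.]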

Debugging a print DataFrame issue in Pandas

How do I debug a problem with printing a Pandas DataFrame? I call this function and then print the output (which is a Pandas DataFrame).
n=ion_tab(y_ion,cycles,t,pH)
print(n)
The last part of the output looks like this:
58 O2 1.784306e-35 4 86 7.3
60 HCO3- 5.751170e+02 5 86 7.3
61 Ca+2 1.825748e+02 5 86 7.3
62 CO2 3.928413e+01 5 86 7.3
63 CaHCO3+ 3.755015e+01 5 86 7.3
64 CaCO3 4.616840e+00 5 86 7.3
65 SO4-2 1.393365e+00 5 86 7.3
66 CO3-2 8.243118e-01 5 86 7.3
67 CaSO4 7.363058e-01 5 86 7.3
... ... ... ... ...
[65 rows x 5 columns]
But if I do an n.tail() command, I see the missing data that ... seems to suggest.
print(n.tail())
Species ppm as ion Cycles Temp F pH
68 OH- 5.516061e-03 5 86 7.3
69 CaOH+ 6.097815e-04 5 86 7.3
70 HSO4- 5.395493e-06 5 86 7.3
71 CaHSO4+ 2.632098e-07 5 86 7.3
73 O2 1.783007e-35 5 86 7.3
[5 rows x 5 columns]
If I count the number of rows showing up on the screen, I get 60. If I add the 5 extra that show up with n.tail(), I get the expected 65 rows. Is there some limit in print that would only allow 60 rows? What's causing the ... at the end of my DataFrame?
Initially I thought there was something in the ion_tab function that was limiting the printing. But once I saw the missing data in the n.tail() output, I got confused.
Any help in debugging this would be appreciated.
Pandas limits the number of rows printed by default. You can change that with pd.set_option
In [4]: pd.get_option('display.max_rows')
Out[4]: 60
In [5]: pd.set_option('display.max_rows', 100)
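If you only want to see the whole frame once without changing the global setting, you can also lift the limit temporarily with pd.option_context (a sketch, with n standing for the DataFrame from the question):
with pd.option_context('display.max_rows', None):
    print(n)  # prints all 65 rows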
