Finding the indexes of the global min in pandas - Python

Suppose you have numerical data for some function z = f(x, y) saved in a pandas dataframe, where the x values are the index, the y values are the columns, and the dataframe is populated with the z data. For example:
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
Is there a simple pandas command, or maybe a one-line string of a few simple commands, which returns the (x, y) values corresponding to a particular data attribute, specifically in my case min(z)? In the example data I'd be looking for (1.0, 0.6).
I'm really just hoping there's an answer that doesn't involve parsing the data out into some other structure, because sure, I could just linearize the data into a numpy array and correlate the numpy array index with (x, y). But if there's something cleaner/more elegant that I'm simply not finding, I'd love to learn about it.

Using pandas.DataFrame.idxmin & pandas.Series.idxmin
import pandas as pd
# df view
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
# min column
min_col_name = df.min().idxmin()
# min column index if needed
min_col_idx = df.columns.get_loc(min_col_name)
# min row index
min_row_idx = df[min_col_name].idxmin()
another option:
(df.min(axis=1).idxmin(), df.min().idxmin())
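If a single expression is preferred, stacking the frame into a Series also works: a stacked Series is indexed by (row, column) pairs, so idxmin returns both labels at once. A minimal sketch using the df above:
# stack() yields a Series indexed by (x, y) pairs; idxmin() returns the pair at the global minimum
x, y = df.stack().idxmin()
# gives (1.0, 0.6) for the example data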

Related

Retrieve next row in pandas dataframe / multiple list comprehension outputs

I have a pandas dataframe, wt, with a datetime index and three columns, as well as a dataframe t with the same datetime index and three other columns, shown below:
wt
date 0 1 2
2004-11-19 0.2 0.3 0.5
2004-11-22 0.0 0.0 0.0
2004-11-23 0.0 0.0 0.0
2004-11-24 0.0 0.0 0.0
2004-11-26 0.0 0.0 0.0
2004-11-29 0.0 0.0 0.0
2004-11-30 0.0 0.0 0.0
t
date GLD SPY TLT
2004-11-19 0.009013068949977443 -0.011116725618999457 -0.007980218051028332
2004-11-22 0.0037963376507370583 0.004769204564810003 0.005211874008610895
2004-11-23 -0.00444938820912133 0.0015256823190370472 0.0012398557258792575
2004-11-24 0.006703910614525022 0.0023696682464455776 0.0
2004-11-26 0.005327413984461682 -0.0007598784194529085 -0.00652932567826181
2004-11-29 0.002428792227864962 -0.004562737642585524 -0.010651558073654366
2004-11-30 -0.006167400881057272 0.0006790595025889523 -0.004237773450922022
2004-12-01 0.005762411347517871 0.011366528119433505 -0.0015527950310557648
I'm currently using the pandas iterrows method to run through each row for processing, and as a first step, I check if the row entries are non-zero, as below:
for dt, row in t.iterrows():
    if sum(wt.loc[dt]) <= 0:
        ...
Based on this, I'd like to assign values to dataframe wt if non-zero values don't currently exist. How can I retrieve the next row for a given dt entry (e.g., '11/22/2004' for dt = '11/19/2004')?
Part 2
As an addendum, I'm setting this up using a for loop for testing but would like to use a list comprehension once complete. Processing will return the wt dataframe described above, as well as an intermediate, secondary dataframe, again with a datetime index and a single column (sample below):
r
date r
2004-11-19 0.030202
2004-11-22 -0.01047
2004-11-23 0.002456
2004-11-24 -0.01274
2004-11-26 0.00928
Is there a way to use list comprehensions to return both the wt dataframe above and this r dataframe without simply creating two separate comprehensions?
Edit
I was able to get the desired results by changing my approach, so I'm adding it for clarification (the referenced dataframes are as described above). I wonder if there's any way to apply list comprehensions to this.
import numpy as np

r = pd.DataFrame(columns=['ret'], index=wt.index.copy())
dts = wt.reset_index().date
for i, dt in enumerate(dts):
    row = t.loc[dt]
    dt_1 = dts.shift(-1).iloc[i]
    try:
        wt.loc[dt_1] = ((wt.loc[dt].tolist() * (1 + row)).transpose()
                        / np.dot(wt.loc[dt].tolist(), (1 + row))).tolist()
        r.loc[dt] = np.dot(wt.loc[dt], row)
    except:
        print(f'Error calculating for date {dt}')
        continue
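For the original question of fetching the row after a given dt, one option is to go through the integer position of the index label. A minimal sketch, assuming dt exists in t's index and is not the last entry:
# find the integer position of dt in the index, then take the following row by position
pos = t.index.get_loc(dt)
if pos + 1 < len(t):
    next_row = t.iloc[pos + 1]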

Efficient way to randomly select all rows from pandas dataframe corresponding to a column value

I have a pandas dataframe containing about 2 Million rows which looks like the following example
ID V1 V2 V3 V4 V5
12 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
01 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
02 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
07 3.8 2.9 1.1 1.6 1.5
19 0.9 1.2 1.8 2.6 9.0
19 0.5 0.4 0.6 0.7 1.8
06 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
18 0.9 1.2 1.8 2.6 9.0
I want to create three subsets of this data such that the ID values are mutually exclusive between subsets, and each subset includes all rows from the main dataframe corresponding to its IDs.
As of now, I am randomly shuffling the ID column and collecting the unique IDs as a list. Using this list, I'm selecting all rows from the dataframe whose ID belongs to a fraction of the list.
import numpy as np
import random
distinct = list(set(df.ID.values))
random.shuffle(distinct)
X1, X2 = distinct[:1000000], distinct[1000000:2000000]
df_X1 = df.loc[df['ID'].isin(list(X1))]
df_X2 = df.loc[df['ID'].isin(list(X2))]
This is working as expected for smaller data; however, for larger data the run doesn't complete even after many hours. Is there a more efficient way to do this? I appreciate any responses.
I think the slowdown is coming from the nested isin list inside the loc slice. I tried a different approach using numpy and a boolean index that seems to double the speed.
First, set up the dataframe. I wasn't sure how many unique IDs you had, so I selected 50. I was also unsure how many columns, so I arbitrarily selected 10,000 columns and rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 10000))
ID = np.random.randint(0, 50, 10000)
df['ID'] = ID
Then I try to use mostly numpy arrays and avoid the nested list using a boolean index.
# Create a numpy array from the ID columns
a_ID = np.array(df['ID'])
# use the numpy unique method to get a unique array
# a = np.unique(np.array(df['ID']))
a = np.unique(a_ID)
# shuffle the unique array
np.random.seed(100)
np.random.shuffle(a)
# cut the shuffled array in half
X1 = a[0:25]
# create a boolean mask
mask = np.isin(a_ID, X1)
# set the index to the mask
df.index = mask
df.loc[True]
When I ran your code on my sample df, it took 817 ms; the code above runs in 445 ms.
Not sure if this helps. Good question, thanks.
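If reassigning the index feels awkward, the same boolean mask can be applied directly, and np.array_split gives the three mutually exclusive groups the question asks for. A minimal sketch, reusing the a and a_ID arrays from above:
# split the shuffled unique IDs into three disjoint groups
groups = np.array_split(a, 3)

# one subset per group, selected with a plain boolean mask (no index reassignment needed)
subsets = [df[np.isin(a_ID, g)] for g in groups]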

Data calculation in pandas python

I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And I would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from the built-in min function: comparing two whole Series forces pandas to evaluate the truth value of a Series, which it refuses as ambiguous, so min isn't going to work row-wise.
A potential solution would be to make two new calculated columns and then use the pandas DataFrame .min method.
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The .min(axis=1) method finds the minimum of the two columns row by row, and the result is assigned to the new column. This way is efficient because you're using numpy vectorization, and it is easier to read.
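If the two helper columns aren't needed afterwards, np.minimum computes the same element-wise minimum in a single expression. A minimal sketch, assuming the formula as interpreted above:
import numpy as np

df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))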

Python and importing floating-point numbers from the excel file

So I have an Excel file that looks like this:
Name R s l2 max_amplitude ref_amplitude
R_0.3_s_0.5_l2_0.1 0.3 0.5 0.1 1.45131445 1.45131445
R_0.3_s_0.5_l2_0.6 0.3 0.5 0.6 3.52145743 3.52145743
...
R_1.1_s_2.0_l2_1.6 1.1 2.0 1.6 5.07415199 5.07415199
R_1.1_s_2.0_l2_2.1 1.1 2.0 2.1 5.78820419 5.78820419
R_1.1_s_2.0_l2_2.6 1.1 2.0 2.6 5.84488964 5.84488964
R_1.1_s_2.0_l2_3.1 1.1 2.0 3.1 6.35387516 6.35387516
Using the pandas module, I import the data into a dataframe:
import pandas as pd
df = pd.read_excel("output_var.xlsx", header=0)
Everything seems to be ok:
df
in the command line produces:
R s l2 max_amplitude ref_amplitude
0 0.3 0.5 0.1 1.451314 1.451314
1 0.3 0.5 0.6 3.521457 3.521457
2 0.3 0.5 1.1 4.770226 4.770226
...
207 1.1 2.0 2.1 5.788204 5.788204
208 1.1 2.0 2.6 5.844890 5.844890
209 1.1 2.0 3.1 6.353875 6.353875
[210 rows x 5 columns]
Now I need to do some calculations based on the value of R, so I need to slice the array. Column R contains 5 different values: 0.3, 0.5, 0.7, 0.9 and 1.1. Each of these 5 values has 42 rows (5×42=210).
To remove the duplicates from "R" I try
set(df.R)
which returns:
{0.29999999999999999,
0.5,
0.69999999999999996,
0.89999999999999991,
0.90000000000000002,
1.1000000000000001}
Aside from representing 0.3 as 0.29999... etc., there are 6 (instead of 5) different values for R. It seems that sometimes 0.9 gets interpreted as 0.89999999999999991 and sometimes as 0.90000000000000002.
This can be (partially) solved with:
set(round(df.R,1))
which (at least) returns 5 values:
{0.29999999999999999,
0.5,
0.69999999999999996,
0.90000000000000002,
1.1000000000000001}
But now I come to the dangerous part. If I want to do the slicing according to the known values of R (0.3, 0.5, 0.7, 0.9 and 1.1)
len(df[df.R==0.3])
returns
42
and
len(df[df.R==0.9])
returns
41
One row goes missing! (Remember, there are 42 rows for each of the 5 R values, giving a total of 210 rows in the file.)
How to deal with this problem?
Don't check floats for equality. There are some issues with floating point arithmetic (check here for example).
Instead, check for closeness (really, really close):
import numpy as np
len(df[np.isclose(df.R, 0.9)])
Normally, if you don't convert the series to a set, pandas would handle that. So if you want to drop duplicates, I'd suggest using pandas methods:
df.drop_duplicates('R')
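If the goal is one slice per R value, grouping on the rounded column avoids writing a tolerance check for each value by hand. A minimal sketch, assuming one decimal place is enough to separate the R values:
# group rows by R rounded to one decimal; each group should have the expected 42 rows
for r_value, group in df.groupby(df.R.round(1)):
    print(r_value, len(group))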

Pandas interpolate data with units

Hi Everyone,
I've been reading Stack Overflow for a couple of years, and it has helped me a lot, so much that I never had to register before :)
But today I'm stuck on a problem using Python with pandas and quantities (it could be unum or pint as well). I'll do my best to make a clear post, but since it's my first one, I apologize if something is confusing and will try to correct any mistakes you find :)
I want to import data from a source and build a pandas dataframe as follows:
import pandas as pd
import quantities as pq
depth = [0.0,1.1,2.0] * pq.m
depth2 = [0,1,1.1,1.5,2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
This gives:
S1=
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
Now I want to extend the data to the depth2 values:
(obviously there is no point in interpolating depth over depth, but it's a test before it gets more complicated).
s2 = s1.reindex(depth2)
This gives:
S2=
depth
0.0 0.0 m
1.0 NaN
1.1 1.1 m
1.5 NaN
2.0 2.0 m
So far no problem.
But when I try to interpolate the missing values doing:
s2['depth'].interpolate(method='values')
I got the following error:
C:\Python27\lib\site-packages\numpy\lib\function_base.pyc in interp(x, xp, fp, left, right)
1067 return compiled_interp([x], xp, fp, left, right).item()
1068 else:
-> 1069 return compiled_interp(x, xp, fp, left, right)
1070
1071
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
I understand that numpy's interpolation does not work on object dtype.
But if I try now to interpolate the missing values by dropping the units, it works:
s3 = s2['depth'].astype(float).interpolate(method='values')
This gives:
s3 =
0.0 0
1.0 1
1.1 1.1
1.5 1.5
2.0 2
Name: depth, dtype: object
How can I get back the unit in the depth column?
I can't find any trick to put back the unit...
Any help will be greatly appreciated.
Thanks
Here's a way to do what you want.
Split apart the quantities and create a set of 2 columns for each quantity
In [80]: df = pd.concat([col.apply(lambda x: pd.Series([x.item(), x.dimensionality.string],
                                                       index=[c, "%s_unit" % c]))
                         for c, col in s1.iteritems()])
In [81]: df
Out[81]:
depth depth_unit
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
In [82]: df = df.reindex([0,1.0,1.1,1.5,2.0])
In [83]: df
Out[83]:
depth depth_unit
0.0 0.0 m
1.0 NaN NaN
1.1 1.1 m
1.5 NaN NaN
2.0 2.0 m
Interpolate
In [84]: df['depth'] = df['depth'].interpolate(method='values')
Propagate the units
In [85]: df['depth_unit'] = df['depth_unit'].ffill()
In [86]: df
Out[86]:
depth depth_unit
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
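If quantity objects are needed again afterwards, the numeric column and the unit column can be recombined; pq.Quantity accepts a value and a unit string. A minimal sketch (the depth_q column name is just for illustration):
# rebuild quantities from the interpolated values and the forward-filled unit strings
df['depth_q'] = [pq.Quantity(v, u) for v, u in zip(df['depth'], df['depth_unit'])]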
OK, I found a solution; it might not be the best one, but for my problem it works just fine:
import pandas as pd
import quantities as pq
def extendAndInterpolate(input, newIndex):
    """ Function to extend a pandas dataframe and interpolate """
    output = pd.concat([input, pd.DataFrame(index=newIndex)], axis=1)
    for col in output.columns:
        # (1) Try to retrieve the unit of the current column
        try:
            # if it succeeds, then store the unit
            unit = 1 * output[col][0].units
        except Exception, e:
            # if it fails, the column contains strings, so use 1
            unit = 1
        # (2) Check the type of value.
        if isinstance(output[col][0], basestring):
            # if it's a string, forward-fill the missing cells with it
            value = output[col].ffill()
        else:
            # if it's a value, to be able to interpolate, you need to:
            # - (a) drop the unit with astype(float)
            # - (b) interpolate the values
            # - (c) add the unit back
            value = [x * unit for x in output[col].astype(float).interpolate(method='values')]
        # (3) Put the interpolated values back into the extended table
        output[col] = pd.Series(value, index=output.index)
    # Return the output dataframe
    return output
Then:
depth = [0.0,1.1,2.0] * pq.m
depth2 = [0,1,1.1,1.5,2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
s2 = extendAndInterpolate(s1, depth2)
The result:
s1
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
s2
depth
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
Thanks for your help.
