Hi Everyone,
I've been reading Stack Overflow for a couple of years, and it has helped me so much that I never had to register before :)
But today I'm stuck on a problem using Python with Pandas and Quantities (it could be unum or pint as well). I'll do my best to make this a clear post, but since it's my first one, I apologize if anything is confusing, and I'll try to correct any mistakes you find :)
I want to import data from a source and build a Pandas DataFrame as follows:
import pandas as pd
import quantities as pq

depth = [0.0, 1.1, 2.0] * pq.m
depth2 = [0, 1, 1.1, 1.5, 2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
This gives:
s1 =
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
Now I want to extend the data to the depth2 values:
(obviously there is no point in interpolating depth over depth, but it's a test before things get more complicated).
s2 = s1.reindex(depth2)
This gives:
s2 =
depth
0.0 0.0 m
1.0 NaN
1.1 1.1 m
1.5 NaN
2.0 2.0 m
So far no problem.
But when I try to interpolate the missing values doing:
s2['depth'].interpolate(method='values')
I got the following error:
C:\Python27\lib\site-packages\numpy\lib\function_base.pyc in interp(x, xp, fp, left, right)
1067 return compiled_interp([x], xp, fp, left, right).item()
1068 else:
-> 1069 return compiled_interp(x, xp, fp, left, right)
1070
1071
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
I understand that NumPy's interpolation does not work on object dtype.
But if I drop the units first, interpolating the missing values works:
s3 = s2['depth'].astype(float).interpolate(method='values')
This gives:
s3 =
0.0 0
1.0 1
1.1 1.1
1.5 1.5
2.0 2
Name: depth, dtype: object
How can I get the units back in the depth column?
I can't find any trick to put them back...
Any help will be greatly appreciated.
Thanks
Here's a way to do what you want.
Split apart the quantities and create a pair of columns (value and unit) for each quantity column:

In [80]: df = pd.concat([col.apply(lambda x: pd.Series([x.item(), x.dimensionality.string],
                                                       index=[c, "%s_unit" % c]))
                         for c, col in s1.iteritems()])
In [81]: df
Out[81]:
depth depth_unit
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
In [82]: df = df.reindex([0,1.0,1.1,1.5,2.0])
In [83]: df
Out[83]:
depth depth_unit
0.0 0.0 m
1.0 NaN NaN
1.1 1.1 m
1.5 NaN NaN
2.0 2.0 m
Interpolate
In [84]: df['depth'] = df['depth'].interpolate(method='values')
Propagate the units
In [85]: df['depth_unit'] = df['depth_unit'].ffill()
In [86]: df
Out[86]:
depth depth_unit
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
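If you later need real quantity objects again, one possible way is to rebuild them from the value/unit pairs. A sketch: it relies on pq.Quantity accepting a unit string, and depth_q is just a hypothetical column name:

import quantities as pq

# Rebuild quantity objects from the split value/unit columns;
# assumes each 'depth_unit' cell holds a unit string such as 'm'
df['depth_q'] = [pq.Quantity(v, u) for v, u in zip(df['depth'], df['depth_unit'])]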
OK, I found a solution. It might not be the best one, but for my problem it works just fine:
import pandas as pd
import quantities as pq
def extendAndInterpolate(input, newIndex):
    """Extend a pandas DataFrame to a new index and interpolate the missing values."""
    output = pd.concat([input, pd.DataFrame(index=newIndex)], axis=1)
    for col in output.columns:
        # (1) Try to retrieve the unit of the current column
        try:
            # if it succeeds, store the unit
            unit = 1 * output[col][0].units
        except Exception:
            # if it fails, the column contains strings, so fall back to 1
            unit = 1
        # (2) Check the type of the values
        if isinstance(output[col][0], basestring):
            # if it's a string, forward-fill the missing cells with it
            value = output[col].ffill()
        else:
            # if it's a value, to be able to interpolate you need to:
            # - (a) drop the unit with astype(float)
            # - (b) interpolate the values
            # - (c) reattach the unit
            value = [x * unit for x in output[col].astype(float).interpolate(method='values')]
        # (3) Write the interpolated values back into the extended table
        output[col] = pd.Series(value, index=output.index)
    # Return the extended DataFrame
    return output
Then:
depth = [0.0,1.1,2.0] * pq.m
depth2 = [0,1,1.1,1.5,2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
s2 = extendAndInterpolate(s1, depth2)
The result:
s1
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
s2
depth
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
Thanks for your help.
Related
I have a data frame as shown below. I need to compare min with spec_min and max with spec_max.
If max > spec_max, the max cell should be red, and if min < spec_min, that cell also needs to be red. May I know how to do this?
min max SPEC_MIN SPEC_MAX
V_PTAT3V3[V] 1.124 1.14 1.095 1.2
You may. Here is an example. Assuming your dataframe looks somewhat like this
min max spec_min spec_max
0 1.298092 0.857875 1.0 1.2
1 1.814168 1.032747 0.8 1.0
2 1.396925 1.092014 1.0 1.2
3 1.616848 1.279176 0.8 1.0
4 1.956991 1.200024 1.0 1.2
5 1.649614 1.203371 1.0 1.2
6 1.195811 0.432663 1.2 1.4
7 1.313263 0.795951 1.2 1.4
8 1.157487 1.235014 1.0 1.2
9 1.546830 1.094696 1.2 1.4
10 1.135896 0.792172 0.8 1.0
11 1.561299 0.763911 1.2 1.4
12 1.324006 0.956222 1.0 1.2
13 1.283233 0.585565 1.0 1.2
14 1.179644 0.983332 1.2 1.4
15 1.696883 1.199471 1.2 1.4
16 1.130002 0.947254 0.8 1.0
17 1.249352 0.865932 1.2 1.4
18 1.365273 0.721204 1.0 1.2
19 1.155129 0.722179 1.2 1.4
20 1.315393 0.590603 0.8 1.0
import numpy as np

def highlight_under_spec_min(s, props=''):
    # red where the 'min' value falls below its row's spec_min
    return np.where(s < df['spec_min'], props, '')

def highlight_under_spec_max(s, props=''):
    # red where the 'max' value exceeds its row's spec_max
    return np.where(s > df['spec_max'], props, '')

df.style.apply(highlight_under_spec_min, props='color:white;background-color:red', subset=['min'], axis=0)\
        .apply(highlight_under_spec_max, props='color:white;background-color:red', subset=['max'], axis=0)
gives you the styled table, with the out-of-spec cells highlighted in red.
If this is not what you want I suggest you give an example with cells you want and don't want colored.
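If you want to keep the highlighting outside a notebook, one option (a sketch; Styler.to_html needs a reasonably recent pandas, 1.3+) is to export the styled frame to HTML:

# Build the Styler once, then write the rendered highlighting to a file
styled = df.style.apply(highlight_under_spec_min, props='color:white;background-color:red', subset=['min'], axis=0)\
                 .apply(highlight_under_spec_max, props='color:white;background-color:red', subset=['max'], axis=0)
with open('styled.html', 'w') as f:
    f.write(styled.to_html())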
You can probably try something like this.
import pandas as pd
df = pd.DataFrame([{'min': 1.124, 'max': 1.14, 'SPEC_MIN': 1.095, 'SPEC_MAX': 1.2}])
def style_cells(dataframe):
    """To be used with pandas' apply method to color the cells
    based on different conditions."""
    # Prepare styles.
    conditional_style = 'background-color: red'
    default_style = ''
    # Compare values and create a mask.
    mask = dataframe['min'] < dataframe['SPEC_MIN']
    # Create a style DataFrame with the same indices and columns as the original.
    df = pd.DataFrame(default_style, index=dataframe.index, columns=dataframe.columns)
    # Modify cell colors.
    df.loc[mask, 'min'] = conditional_style
    # The same procedure for max values.
    mask = dataframe['max'] > dataframe['SPEC_MAX']
    df.loc[mask, 'max'] = conditional_style
    return df
df.style.apply(style_cells, axis=None)
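As a side note, the same Styler can be written straight to Excel with the colors preserved; a sketch, assuming openpyxl is installed:

# Cell styles are carried over into the .xlsx file
df.style.apply(style_cells, axis=None).to_excel('styled.xlsx', engine='openpyxl')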
I have the following dataset:
dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
I want to calculate the slopes based on the timestamp index. This should be the result:
slope:
A 0.4
B -0.7
C -0.1
I tried this solution:
slope = df.apply(lambda x: np.polyfit(df.index), x, 1)[0])
But it returns an error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Any help will be greatly appreciated.
a) Don't apply() the polynomial fitting to the 'dates' Timestamp column, only to the float columns A, B, C. So either make dates the index, or don't include it in the columns passed into apply().
Make dates column your index:
df.set_index('dates', inplace=True)
A B C
dates
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
b) Now as to fixing up the apply() call:
you're missing a closing parenthesis, and you need a trailing ..., axis=1 to apply your function to each row.
also, since we changed df.index to dates rather than the autonumbered integers 0, 1, 2, you need to pass an explicit integer range into polyfit().
Solution:
#pd.options.display.float_format = '{:.3f}'.format
#pd.options.display.precision = 3
#np.set_printoptions(floatmode='fixed', precision=3, suppress=True)
df.apply(lambda x: np.polyfit(range(len(x)), x, 1), axis=1)
dates
2005-01-01 [-1.9860273225978183e-16, 1.3333333333333333]
2005-01-02 [-0.5000000000000004, 1.8333333333333341]
2005-01-04 [-0.9999999999999998, 2.3333333333333335]
(Note: I'm unsuccessfully trying to set the np and pd display options to suppress the unwanted decimal places and scientific notation on the object returned by polyfit. You can figure that part out yourself.)
And here's the boilerplate to make your example reproducible:
import numpy as np
import pandas as pd
from io import StringIO
df = """dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0"""
df = pd.read_csv(StringIO(df), sep=r'\s+', parse_dates=['dates'])
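If what you actually want is one slope per column A, B, C against the real time axis (as in the desired output in the question), a sketch along these lines should work; note it fits against days elapsed, so the numbers will differ from a fit against evenly spaced integers:

# One slope per column, fitting each against days elapsed since the first date
df = df.set_index('dates')
x = (df.index - df.index[0]).days        # [0, 1, 3]
slopes = df.apply(lambda col: np.polyfit(x, col, 1)[0])
print(slopes)                            # one slope each for A, B, C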
Suppose you have numerical data for some function z = f(x, y) saved in a pandas dataframe, where the index holds the x values, the columns hold the y values, and the dataframe is populated with the z data. For example:
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
is there a simple pandas command, or maybe a one-line chain of a few simple commands, that returns the (x, y) values corresponding to some attribute of the data, specifically in my case min(z)? In the example data I'd be looking for (1.0, 0.6).
I'm really just hoping there's an answer that doesn't involve parsing the data out into some other structure, because sure, I could just linearize the data into a numpy array and correlate the array index with (x, y). But if there's something cleaner/more elegant that I'm simply not finding, I'd love to learn about it.
Using pandas.DataFrame.idxmin & pandas.Series.idxmin
import pandas as pd
# df view
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
# min column
min_col_name = df.min().idxmin()
# min column index if needed
min_col_idx = df.columns.get_loc(min_col_name)
# min row index
min_row_idx = df[min_col_name].idxmin()
another option:
(df.min(axis=1).idxmin(), df.min().idxmin())
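For completeness, a self-contained sketch reconstructing the example grid (values copied from the question) and applying both approaches:

import pandas as pd

# Rebuild the example grid: x on the index, y on the columns, z in the cells
df = pd.DataFrame(
    [[0.0, -0.002961, -0.005921, -0.008883, -0.011845, -0.014808, -0.017772],
     [0.0, -0.002592, -0.005184, -0.007777, -0.010371, -0.012966, -0.015563],
     [0.0, -0.002084, -0.004168, -0.006253, -0.008340, -0.010428, -0.012517]],
    index=[1.0, 1.1, 1.2],
    columns=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# (x, y) of min(z): row label of the row-wise minimum, column label of the overall minimum
print((df.min(axis=1).idxmin(), df.min().idxmin()))   # (1.0, 0.6)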
I'm currently dealing with a set of similar DataFrames having a double Header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually, the main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS label in the second header row. I'm wondering if there is an easy way to save/read all these DataFrames to/from a single CSV file instead of having a separate CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

writer = pd.ExcelWriter('file.xlsx')
for i, df in enumerate(df_list):
    df.to_excel(writer, 'sheet{}'.format(i))
writer.save()
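To retrieve the frames later, a sketch of the round trip: sheet_name=None reads every sheet into a dict, and header=[0, 1] restores the double header written by to_excel:

import pandas as pd

# Dict of DataFrames keyed by sheet name, with double header and index restored
frames = pd.read_excel('file.xlsx', sheet_name=None, header=[0, 1], index_col=0)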
Taking your example again (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now it just keeps two sub-columns, which is not very elegant). If, from your point of view, this row is meaningless, just reset the headers as you like before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
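And for reading the CSV back, a sketch (assuming the two header rows were kept, as in df4):

import pandas as pd

# Restore the two header rows and the index written by to_csv
df_back = pd.read_csv('file_name.csv', header=[0, 1], index_col=0)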
I have a dataframe, and I would like to subtract one half of its values from the other half.
What I tried so far (the dataframe: http://pastebin.com/PydRHxcz):
index = pd.MultiIndex.from_tuples([key for key in dfdict], names = ['a','b','c','d'])
dfl = pd.DataFrame([dfdict[key] for key in dfdict],index=index)
dfl.columns = ['data']
dfl.sort(inplace=True)
d = dfl.unstack(['a','b'])
I can do:
d[0:5] - d[0:5]
And I get zeros for all values.
But If I do:
d[0:5] - d[5:]
I get NaNs for all values. Any ideas how I can perform such an operation?
EDIT:
What works is
dfl.unstack(['a','b'])['data'][5:] - dfl.unstack(['a','b'])['data'][0:5].values
But it feels a bit clumsy
You can use loc to select all rows that correspond to one label in the first level like this:
In [8]: d.loc[0]
Out[8]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 11.098909 9.223784 8.003650 10.014445 13.231898 10.372040
0.3 14.349606 11.420565 9.053073 10.252542 26.342501 25.219403
0.5 1.336937 2.522929 3.875139 11.161803 3.168935 6.287555
0.7 0.379158 1.061104 2.053024 12.358577 0.678352 2.133887
1.0 0.210244 0.631631 1.457333 15.117805 0.292904 1.053916
So doing the subtraction looks like:
In [11]: d.loc[0] - d.loc[1000]
Out[11]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 -3.870946 -3.239915 -3.504068 -0.722377 -2.335147 -2.460035
0.3 -65.611418 -42.225811 -25.712668 -1.028758 -65.106473 -44.067692
0.5 -84.494748 -55.186368 -34.184425 -1.619957 -89.356417 -69.008567
0.7 -92.681688 -61.636548 -37.386604 -4.227343 -110.501219 -78.925078
1.0 -101.071683 -61.758741 -37.080222 -3.081782 -103.779698 -80.337487
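The all-NaN result in the question comes from label alignment: pandas subtracts values whose index labels match, and d[0:5] and d[5:] share no labels. A minimal sketch of the effect:

import pandas as pd

a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([3, 4], index=[2, 3])

print(a - b)         # all NaN: no common labels to align on
print(a - b.values)  # positional subtraction: -2, -2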