I have the following dataset:
dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
I want to calculate the slopes based on the timestamp index. This should be the result:
slope:
A 0.4
B -0.7
C -0.1
I tried this solution:
slope = df.apply(lambda x: np.polyfit(df.index), x, 1)[0])
But it returns an error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Any help will be greatly appreciated.
a) Don't apply() the polynomial fit to the Timestamp column, only to the float columns A, B, C. So either make dates the index, or don't include it in the columns passed into apply().
Make the dates column your index:
df.set_index('dates', inplace=True)
A B C
dates
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0
b) Now as to fixing up the apply() call:
the closing parenthesis after df.index cuts the polyfit() call short; the x and 1 arguments belong inside it. A trailing ..., axis=1 makes apply() pass each row to your function; the default axis=0 passes each column, which is what one slope per column (your expected output) needs.
also, since we changed df.index to be dates rather than the autonumbered integers 0, 1, 2, you need to pass an explicit numeric range into polyfit().
For example, fitting along each row with axis=1:
#pd.options.display.float_format = '{:.3f}'.format
#pd.options.display.precision = 3
#np.set_printoptions(floatmode='fixed', precision=3, suppress=True)
df.apply(lambda x: np.polyfit(range(len(x)), x, 1), axis=1)
dates
2005-01-01 [-1.9860273225978183e-16, 1.3333333333333333]
2005-01-02 [-0.5000000000000004, 1.8333333333333341]
2005-01-04 [-0.9999999999999998, 2.3333333333333335]
(Note: I'm unsuccessfully trying to set the np and pd display options to suppress the unwanted decimal places and scientific notation on the arrays returned by polyfit. You can figure that part out yourself.)
And here's the boilerplate to make your example reproducible:
import numpy as np
import pandas as pd
from io import StringIO
df = """dates A B C
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-04 3.0 0.0 1.0"""
df = pd.read_csv(StringIO(df), sep=r'\s+', parse_dates=['dates'])
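If what you want is one slope per column, as in your expected output, here is a minimal sketch that fits each column against the number of days since the first date (an assumption about how the uneven date spacing should enter the fit; the exact numbers depend on how you scale the time axis):
df = df.set_index('dates')                           # skip if dates is already the index
x = (df.index - df.index[0]).days                    # integer day offsets: [0, 1, 3]
slope = df.apply(lambda y: np.polyfit(x, y, 1)[0])   # default axis=0 -> one slope per column
print(slope)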
I have a data frame
albumin bilirubin albumin bilirubin
mean mean count count
id class
142345 a 3.4 1.0 1.0 1.0
a 3.2 2.0 0.0 0.0
b 3.1 1.0 0.0 0.0
b 2.7 0.0 3.0 0.0
I am trying to group over the row index ['id', 'class'] and assign a different aggregation function to each label in level 1 of my columns: mean for the mean columns and sum for the count columns, in order to obtain:
albumin bilirubin albumin bilirubin
mean mean count count
id class
142345 a 3.3 1.5 1.0 1.0
b 2.9 0.5 3.0 0.0
I've found a similar issue here. But I can't find a way to adapt it.
I've tried:
def f(x):
    aggs = {"mean": np.mean, "count": np.sum}
    func = aggs.get(x.name, np.sum)
    return func(x)
grouped=df.groupby(['id', 'class'], axis=0, level=1 ).apply(f)
and I get the error
TypeError: 'Categorical' object is not callable
EDIT: suppose there are lots of columns in my df
I found a way, but I'm not totally convinced by it.
idx = pd.IndexSlice  # needed for the label-based column slice below
mean_grouped = df.groupby(['id', 'class']).mean().loc[:, idx[:, ['mean']]]
sum_grouped = df.groupby(['id', 'class']).sum().loc[:, idx[:, ['count']]]
df_out = pd.concat([mean_grouped, sum_grouped], axis=1)  # axis=1 so the mean and count blocks sit side by side
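A more compact alternative, sketched under the assumption that the aggregation can be chosen from the second level of the column labels (so it scales to however many columns the frame has):
# map every ('measure', 'mean') column to mean and every ('measure', 'count') column to sum
agg_map = {col: ('mean' if col[1] == 'mean' else 'sum') for col in df.columns}
df_out = df.groupby(level=['id', 'class']).agg(agg_map)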
I have dataframes similar to the following ones:
,A,B
2020-01-15,1,
2020-01-15,,2
2020-01-16,3,
2020-01-16,,4
2020-01-17,5,
2020-01-17,,6
,A,B,C
2020-01-15,1,
2020-01-15,,2
2020-01-15,,,3
2020-01-16,4,
2020-01-16,,5
2020-01-16,,,6
2020-01-17,7,
2020-01-17,,8
2020-01-17,,,9
I need to transform them to the following:
,A,B
2020-01-15,1,2
2020-01-16,3,4
2020-01-17,5,6
,A,B,C
2020-01-15,1,2,3
2020-01-16,4,5,6
2020-01-17,7,8,9
I have tried groupby().first() without success.
Let us do groupby + first
s=df.groupby(level=0).first()
A B
aaa
2020-01-15 1.0 2.0
2020-01-16 3.0 4.0
2020-01-17 5.0 6.0
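For reference, a minimal reproduction (assuming the frames are read with the unnamed first column as the index):
import pandas as pd
from io import StringIO

data = """,A,B
2020-01-15,1,
2020-01-15,,2
2020-01-16,3,
2020-01-16,,4
2020-01-17,5,
2020-01-17,,6"""
df = pd.read_csv(StringIO(data), index_col=0)
print(df.groupby(level=0).first())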
I have two questions:
1) Is there something like pandas groupby but applicable on columns (df.columns, not the data within)?
2) How can I extract the "date" from a datetime object?
I have lots of pandas dataframes (or csv files) that have a position column (which I use as index) and then columns of values measured at each position at different times. The column headers are datetime objects (e.g. created with pd.to_datetime).
I would like to extract data from the same date and save them into a new file.
Here is a simple example of two such dataframes.
df1:
2015-03-13 14:37:00 2015-03-13 14:38:00 2015-03-13 14:38:15 \
0.0 24.49393 24.56345 24.50552
0.5 24.45346 24.54904 24.60773
1.0 24.46216 24.55267 24.74365
1.5 24.55414 24.63812 24.80463
2.0 24.68079 24.76758 24.78552
2.5 24.79236 24.83005 24.72879
3.0 24.83691 24.78308 24.66727
3.5 24.78452 24.73071 24.65085
4.0 24.65857 24.79398 24.72290
4.5 24.56390 24.93515 24.83267
5.0 24.62161 24.96939 24.87366
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
and df2:
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30 \
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
2015-05-23 12:25:00 2015-05-23 12:26:00 2015-05-23 12:26:30
0.0 10.31052 10.132660 10.176910
0.5 10.26834 10.086910 10.252720
1.0 10.27393 10.165890 10.276670
1.5 10.29330 10.219090 10.335910
2.0 10.24432 10.193940 10.406430
2.5 10.11618 10.157470 10.323120
3.0 10.02454 10.110720 10.115360
3.5 10.08716 10.010680 9.997345
4.0 10.23868 9.905670 10.008090
4.5 10.27216 9.879425 9.979645
5.0 10.10693 9.919800 9.870361
df1 has data from 13 March and 19 May, df2 has data from 19 May and 23 May. From these two dataframes containing data from 3 days, I would like to get 3 dataframes (or csv files or any other object), one for each day.
(And for a real-life example, multiply the number of lines, columns and files by some hundred.)
In the worst case I can specify the dates in a separate list, but I am still failing to extract these dates from the dataframes.
I did have an idea of a nested loop
for df in dataframes:
    for d in dates:
        new_df = df[d]
but I can't get the date from the datetime.
First concat all DataFrames by columns, then turn the groupby object into a dictionary of DataFrames keyed by the date strings produced by strftime:
df = pd.concat([df1,df2, dfN], axis=1)
dfs = dict(tuple(df.groupby(df.columns.strftime('%Y-%m-%d'), axis=1)))
#select DataFrame
print (dfs['2015-03-13'])
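If each day should also end up in its own file, a small follow-up sketch (the file-name pattern is just an assumption):
for day, frame in dfs.items():
    frame.to_csv('data_{}.csv'.format(day))
And for a single Timestamp ts, ts.date() gives just the date part, which answers question 2 directly.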
I'm currently dealing with a set of similar DataFrames having a double Header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
The main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS label in the second header row. I'm currently wondering if there is an easy way of saving/reading all these DataFrames into/from a single CSV file instead of having a different CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

with pd.ExcelWriter('file.xlsx') as writer:
    for i, df in enumerate(df_list):
        df.to_excel(writer, sheet_name='sheet{}'.format(i))
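To get the frames back later, a sketch assuming each sheet keeps the two header rows first and the index in the first column:
sheets = pd.read_excel('file.xlsx', sheet_name=None, header=[0, 1], index_col=0)
df0 = sheets['sheet0']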
Taking back your example (with random numbers instead of your values) :
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now, it just uses two sub-columns, which is not very elegant). If, from your point of view, this row is meaningless, just reset the headers as you like before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
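Reading the file back is symmetric; a minimal sketch assuming the two header rows and the index column written by to_csv above:
df_back = pd.read_csv('file_name.csv', header=[0, 1], index_col=0)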
Hi everyone,
I've been reading Stack Overflow for a couple of years, and it has helped me a lot, so much that I never had to register before :)
But today I'm stuck on a problem using Python with Pandas and Quantities (it could be unum or pint as well). I'll do my best to make a clear post, but since it's my first one, I apologize if something is confusing, and I'll try to correct any mistakes you find :)
I want to import data from a source and build a Pandas dataframe as follow:
import pandas as pd
import quantities as pq
depth = [0.0, 1.1, 2.0] * pq.m
depth2 = [0, 1, 1.1, 1.5, 2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
This gives:
S1=
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
Now I want to extend the data to the depth2 values:
(obviously there is no point in interpolating depth over depth, but it's a test before it gets more complicated).
s2 = s1.reindex(depth2)
This gives:
S2=
depth
0.0 0.0 m
1.0 NaN
1.1 1.1 m
1.5 NaN
2.0 2.0 m
So far no problem.
But when I try to interpolate the missing values doing:
s2['depth'].interpolate(method='values')
I got the following error:
C:\Python27\lib\site-packages\numpy\lib\function_base.pyc in interp(x, xp, fp, left, right)
1067 return compiled_interp([x], xp, fp, left, right).item()
1068 else:
-> 1069 return compiled_interp(x, xp, fp, left, right)
1070
1071
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
I understand that numpy's interpolation does not work on object dtype.
But if I try now to interpolate the missing values by dropping the units, it works:
s3 = s2['depth'].astype(float).interpolate(method='values')
This gives:
s3 =
0.0 0
1.0 1
1.1 1.1
1.5 1.5
2.0 2
Name: depth, dtype: object
How can I get back the unit in the depth column?
I can't find any trick to put back the unit...
Any help will be greatly appreciated.
Thanks
Here's a way to do what you want.
Split the quantities apart and create a pair of columns (value and unit) for each quantity column
In [80]: df = pd.concat([col.apply(lambda x: pd.Series([x.item(), x.dimensionality.string],
                                                       index=[c, "%s_unit" % c]))
                         for c, col in s1.items()])
In [81]: df
Out[81]:
depth depth_unit
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
In [82]: df = df.reindex([0,1.0,1.1,1.5,2.0])
In [83]: df
Out[83]:
depth depth_unit
0.0 0.0 m
1.0 NaN NaN
1.1 1.1 m
1.5 NaN NaN
2.0 2.0 m
Interpolate
In [84]: df['depth'] = df['depth'].interpolate(method='values')
Propagate the units
In [85]: df['depth_unit'] = df['depth_unit'].ffill()
In [86]: df
Out[86]:
depth depth_unit
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
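If you need real quantity objects again afterwards, one possible sketch for recombining the value/unit pairs (not part of the answer above, just an assumption about what you might want next):
import quantities as pq
# rebuild one quantity per row from the interpolated value and the forward-filled unit string
df['depth'] = [pq.Quantity(v, u) for v, u in zip(df['depth'], df['depth_unit'])]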
OK, I found a solution; it might not be the best one, but for my problem it works just fine:
import pandas as pd
import quantities as pq
def extendAndInterpolate(input, newIndex):
    """Extend a pandas dataframe to a new index and interpolate the missing values."""
    output = pd.concat([input, pd.DataFrame(index=newIndex)], axis=1)
    for col in output.columns:
        # (1) Try to retrieve the unit of the current column
        try:
            # if it succeeds, store the unit
            unit = 1 * output[col][0].units
        except Exception:
            # if it fails, the column contains strings, so fall back to 1
            unit = 1
        # (2) Check the type of value.
        if isinstance(output[col][0], str):
            # if it's a string, fill the missing cells with the last known string
            value = output[col].ffill()
        else:
            # if it's a value, to be able to interpolate you need to:
            # - (a) drop the unit with astype(float)
            # - (b) interpolate the values
            # - (c) multiply the unit back in
            value = [x * unit for x in output[col].astype(float).interpolate(method='values')]
        # (3) Put the extended, interpolated values back into the table
        output[col] = pd.Series(value, index=output.index)
    # Return the output dataframe
    return output
Then:
depth = [0.0,1.1,2.0] * pq.m
depth2 = [0,1,1.1,1.5,2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
s2 = extendAndInterpolate(s1, depth2)
The result:
s1
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
s2
depth
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
Thanks for your help.