pandas: calculate over multiindex - python

I have a dataframe with a four-level index ('a', 'b', 'c', 'd') and a single data column, and I would like to subtract the block of values for one label of an index level from the block for another label (the first half of the rows minus the second half).
What I tried so far (the dataframe: http://pastebin.com/PydRHxcz):
# build a four-level MultiIndex from the dict keys
index = pd.MultiIndex.from_tuples([key for key in dfdict], names=['a', 'b', 'c', 'd'])
dfl = pd.DataFrame([dfdict[key] for key in dfdict], index=index)
dfl.columns = ['data']
dfl.sort_index(inplace=True)  # dfl.sort(inplace=True) in older pandas
d = dfl.unstack(['a', 'b'])   # move levels 'a' and 'b' into the columns
I can do:
d[0:5] - d[0:5]
and I get zeros for all values. But if I do:
d[0:5] - d[5:]
I get NaNs for all values. Any ideas how I can perform such an operation?
EDIT:
What works is:
dfl.unstack(['a','b'])['data'][5:] - dfl.unstack(['a','b'])['data'][0:5].values
(.values strips the index from the right-hand side, so pandas subtracts by position instead of aligning on labels.) But it feels a bit clumsy.

You can use loc to select all rows that correspond to one label in the first level like this:
In [8]: d.loc[0]
Out[8]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 11.098909 9.223784 8.003650 10.014445 13.231898 10.372040
0.3 14.349606 11.420565 9.053073 10.252542 26.342501 25.219403
0.5 1.336937 2.522929 3.875139 11.161803 3.168935 6.287555
0.7 0.379158 1.061104 2.053024 12.358577 0.678352 2.133887
1.0 0.210244 0.631631 1.457333 15.117805 0.292904 1.053916
So doing the subtraction looks like:
In [11]: d.loc[0] - d.loc[1000]
Out[11]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 -3.870946 -3.239915 -3.504068 -0.722377 -2.335147 -2.460035
0.3 -65.611418 -42.225811 -25.712668 -1.028758 -65.106473 -44.067692
0.5 -84.494748 -55.186368 -34.184425 -1.619957 -89.356417 -69.008567
0.7 -92.681688 -61.636548 -37.386604 -4.227343 -110.501219 -78.925078
1.0 -101.071683 -61.758741 -37.080222 -3.081782 -103.779698 -80.337487
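For anyone puzzled why the slice version returns NaN: pandas aligns arithmetic on index labels, and the rows selected by d[0:5] and d[5:] carry different labels in the level being compared, so nothing matches. d.loc[label] selects one label and drops that level, leaving indexes that do align. A minimal self-contained sketch (with made-up labels standing in for the frame above):
import numpy as np
import pandas as pd

# two-level index; 'c' plays the role of the level selected by d.loc[0] above
idx = pd.MultiIndex.from_product([[0, 1000], [0.0, 0.3, 0.5]], names=['c', 'd'])
d = pd.DataFrame({'data': np.arange(6, dtype=float)}, index=idx)

# positional slices keep their labels: 'c'=0 rows never match 'c'=1000 rows -> all NaN
print(d[0:3] - d[3:])

# .loc drops the selected level, so the remaining 'd' labels align and values subtract
print(d.loc[0] - d.loc[1000])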


Transforming dataframe to sparse matrix and reset index

I have a data set with ratings given by user IDs to product IDs. There are only 5000 products and 10,000 users, but the ID values span a much larger range. I would like to transform my dataframe into a coo_sparse_matrix(data, (row, col), shape) whose rows and columns are numbered by the real count of users and products, not by the raw IDs. Is there any way to do that? Below is an illustration.
Data frame:

User ID  Product ID  Rating
      1          14     0.1
      1          15     0.2
      2          14     0.3
      2          16     0.3
      5          19     0.4
and expected to have a matrix (in sparse coo form):

ProductID   14   15   16   19
UserID
1          0.1  0.2    0    0
2          0.3    0  0.3    0
5            0    0    0  0.4
because normally the sparse coo matrix would give a very large matrix indexed (1, 2, ..., 19) for the product IDs and (1, 2, 3, 4, 5) for the user IDs.
Please help me; this is for a thesis due in 3 days and I just found this error. I am coding in Python.
Thank you very much!
Hi, hope this helps, and good luck with your thesis:
import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                               'Product ID': [14, 15, 14, 16, 19],
                               'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})
row = dataframe['User ID']
col = dataframe['Product ID']
data = dataframe['Rating']

# build the (dense) matrix; rows and columns are numbered by raw ID
coo = coo_matrix((data, (row, col))).toarray()
new_dataframe = pd.DataFrame(coo)

# Drop non-existing Product IDs (all-zero columns) -- optional, delete if not intended
new_dataframe = new_dataframe.loc[:, (new_dataframe != 0).any(axis=0)]
# Drop non-existing User IDs (all-zero rows) -- optional, delete if not intended
new_dataframe = new_dataframe.loc[(new_dataframe != 0).any(axis=1)]
print(new_dataframe)
Output:
14 15 16 19
1 0.1 0.2 0.0 0.0
2 0.3 0.0 0.3 0.0
5 0.0 0.0 0.0 0.4
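If you need the sparse matrix itself with compact indices (shape = number of distinct users by number of distinct products, rather than max ID + 1), here is an alternative sketch using pandas.factorize to map the raw IDs onto 0..n-1 positions; the variable names are my own:
import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                               'Product ID': [14, 15, 14, 16, 19],
                               'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})

# factorize maps each distinct ID to a contiguous 0-based code
row, user_ids = pd.factorize(dataframe['User ID'])
col, product_ids = pd.factorize(dataframe['Product ID'])

# shape is (number of distinct users, number of distinct products)
coo = coo_matrix((dataframe['Rating'], (row, col)),
                 shape=(len(user_ids), len(product_ids)))
print(coo.toarray())
# user_ids and product_ids map matrix positions back to the real IDs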

Add a column to the original pandas data frame after grouping by 2 columns and taking dot product of two other columns

I have the following data frame in pandas (the data can be seen in the answer's output below):
I want to add an Avg Price column to the original data frame after grouping by (Date, Issuer) and taking the dot product of the Weights and Price columns, so that every row of a group carries the group's value.
Is there a way to do it without using merge or join? What would be the simplest way to do it?
One way using pandas.DataFrame.prod:
df["Avg Price"] = df[["Weights", "Price"]].prod(axis=1)  # row-wise Weights * Price
df["Avg Price"] = df.groupby(["Date", "Issuer"])["Avg Price"].transform("sum")
print(df)
Output:
Date Issuer Weights Price Avg Price
0 2019-11-12 A 0.4 100 120.0
1 2019-15-12 B 0.5 100 100.0
2 2019-11-12 A 0.2 200 120.0
3 2019-15-12 B 0.3 100 100.0
4 2019-11-12 A 0.4 100 120.0
5 2019-15-12 B 0.2 100 100.0
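For reference, a minimal reproducible version of the above, reconstructing the frame from the printed output (the Date, Issuer, Weights and Price values are read off that output):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-11-12', '2019-15-12', '2019-11-12',
             '2019-15-12', '2019-11-12', '2019-15-12'],
    'Issuer': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Weights': [0.4, 0.5, 0.2, 0.3, 0.4, 0.2],
    'Price': [100, 100, 200, 100, 100, 100],
})

# row-wise Weights * Price, then the group sum broadcast back onto
# every row of its (Date, Issuer) group via transform
df['Avg Price'] = df[['Weights', 'Price']].prod(axis=1)
df['Avg Price'] = df.groupby(['Date', 'Issuer'])['Avg Price'].transform('sum')
print(df)  # matches the output shown above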

How to merge the two columns from two dataframe into one column of a new dataframe (pandas)?

I want to merge the values of two different columns of pandas dataframe into one column of new dataframe.
pandas df1 =

        hapX
0  pos   0.0
1  721   0.2
2  735   0.5
3  739   1.0

pandas df2 =

        hapY
0  pos   0.1
1  721   0.0
2  735   0.6
3  739   1.5
I want to generate a new dataframe like:
df_joined['hapX|Y'] = df1.astype(str).add('|').add(df2.astype(str))
with expected output:
          hapX|Y
0  pos   0.0|0.1
1  721   0.2|0.0
2  735   0.5|0.6
3  739   1.0|1.5
But this outputs a bunch of NaNs:

        hapX hapY
0  pos   NaN  NaN
1  721   NaN  NaN
2  735   NaN  NaN
3  739   NaN  NaN
Is the problem that the values are floats? (I don't think so.) What is the problem with my approach?
Also, is there a way to automate the process if the columns are named like hapX1, hapX2, hapX3 in one dataframe and hapY1, hapY2, hapY3 in the other?
Thanks,
You can merge the two dataframes and then concatenate the hapX and hapY values. (Your original attempt produces NaN because add aligns on column names, and hapX in df1 never matches hapY in df2.)
Say your first column's name is no.
df_joined = df1.merge(df2, on='no')
df_joined['hapX|Y'] = df_joined['hapX'].astype(str) + '|' + df_joined['hapY'].astype(str)
df_joined = df_joined.drop(['hapX', 'hapY'], axis=1)  # drop returns a copy, so reassign
This gives you:

    no   hapX|Y
0  pos  0.0|0.1
1  721  0.2|0.0
2  735  0.5|0.6
3  739  1.0|1.5
Just to add onto the previous answer, for the general case of N DataFrames: suppose you have a number of DataFrames as follows:
import random
import pandas as pd

dfs = [pd.DataFrame({'hapY' + str(j): [random.random() for i in range(10)]}) for j in range(5)]
such that
>>> dfs[0]
hapY0
0 0.175683
1 0.353729
2 0.949848
3 0.346088
4 0.435292
5 0.837879
6 0.277274
7 0.623121
8 0.325119
9 0.709252
Then (written as a Python 3 list comprehension; the original used Python 2's map):
>>> ['|'.join(m) for m in zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])]
['0.0845464936138|0.193336164837|0.551717121013|0.113566029656|0.479590342798',
'0.275851474238|0.694161791339|0.151607726092|0.615367668451|0.498997567849',
'0.116891472119|0.258406028668|0.315137581816|0.819992354178|0.864412473301',
'0.729581942312|0.614902776003|0.443986436146|0.227782256619|0.0149481683863',
'0.745583477173|0.441456815889|0.428691631831|0.307480112319|0.136790112739',
'0.981337451224|0.0117895017035|0.415140979617|0.650957722911|0.968082350568',
'0.725618728314|0.0546057041356|0.715910454674|0.0828229441557|0.220878025678',
'0.704047455894|0.303403129266|0.0499082759635|0.49727194707|0.251623048104',
'0.453595354131|0.146042134766|0.346665276655|0.911092176243|0.291405609407',
'0.140523603089|0.117930249858|0.902071673051|0.0804933425857|0.876006332635']
which you can later put into a DataFrame.
I think the simplest way is to rename df2's columns via a dict (built with a dict comprehension) so they line up with df1's, and finally call add_suffix:
print (df1)
hapX1 hapX2 hapX3 hapX4
pos
23 1.0 0.0 1.0 1.0
24 1.0 1.0 1.5 1.0
28 1.0 0.0 0.5 0.0
print (df2)
hapY1 hapY2 hapY3 hapY4
pos
23 0.0 1.0 0.5 0.0
24 1.0 1.0 1.5 1.0
28 0.0 1.0 1.0 1.0
d = {'hapY' + str(x):'hapX' + str(x) for x in range(1,5)}
print (d)
{'hapY1': 'hapX1', 'hapY3': 'hapX3', 'hapY2': 'hapX2', 'hapY4': 'hapX4'}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')
print (df_joined)
hapX1|Y hapX2|Y hapX3|Y hapX4|Y
pos
23 1.0|0.0 0.0|1.0 1.0|0.5 1.0|0.0
24 1.0|1.0 1.0|1.0 1.5|1.5 1.0|1.0
28 1.0|0.0 0.0|1.0 0.5|1.0 0.0|1.0
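If you'd rather not hard-code range(1,5), the rename dict can be derived from df2's own columns; a small variation on the above, assuming every column follows the hapY<n> pattern:
# build the mapping from whatever hapY* columns df2 actually has
d = {c: c.replace('hapY', 'hapX') for c in df2.columns}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')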

pivoting pandas dataframe into prefixed cols, not a MultiIndex

I have a timeseries dataframe that is similar to:
import pandas as pd

ts = pd.DataFrame([['Jan 2000', 'WidgetCo', 0.5, 2], ['Jan 2000', 'GadgetCo', 0.3, 3], ['Jan 2000', 'SnazzyCo', 0.2, 4],
                   ['Feb 2000', 'WidgetCo', 0.4, 2], ['Feb 2000', 'GadgetCo', 0.5, 2.5], ['Feb 2000', 'SnazzyCo', 0.1, 4],
                   ], columns=['month', 'company', 'share', 'price'])
Which looks like:
month company share price
0 Jan 2000 WidgetCo 0.5 2.0
1 Jan 2000 GadgetCo 0.3 3.0
2 Jan 2000 SnazzyCo 0.2 4.0
3 Feb 2000 WidgetCo 0.4 2.0
4 Feb 2000 GadgetCo 0.5 2.5
5 Feb 2000 SnazzyCo 0.1 4.0
I can pivot this table like so:
pd.pivot_table(ts,index='month', columns='company')
Which gets me:
share price
company GadgetCo SnazzyCo WidgetCo GadgetCo SnazzyCo WidgetCo
month
Feb 2000 0.5 0.1 0.4 2.5 4 2
Jan 2000 0.3 0.2 0.5 3.0 4 2
This is what I want except that I need to collapse the MultiIndex so that the company is used as a prefix for share and price like so:
WidgetCo_share WidgetCo_price GadgetCo_share GadgetCo_price ...
month
Jan 2000 0.5 2 0.3 3.0
Feb 2000 0.4 2 0.5 2.5
I came up with this function to do just that but it seems like a poor solution:
def pivot_table_to_flat(df, column, index):
    res = df.set_index(index)
    cols = res.drop(column, axis=1).columns.values
    resulting_cols = []
    for prefix in res[column].unique():
        for col in cols:
            new_col_name = prefix + '_' + col
            res[new_col_name] = res[res[column] == prefix][col]
            resulting_cols.append(new_col_name)
    return res[resulting_cols]
pivot_table_to_flat(ts, index='month', column='company')
What is a better way of accomplishing a pivot resulting in a columns with prefixes as opposed to a MultiIndex?
This seems even simpler:
df.columns = [' '.join(col).strip() for col in df.columns.values]
It takes a df with a multiindex column and flattens the column labels, with the df remaining in place.
(ref: Andy Hayden's answer to "Python Pandas - How to flatten a hierarchical index in columns")
I figured it out. Using the data on the MultiIndex makes for a pretty clean solution:
def flatten_multi_index(df):
    mi = df.columns
    suffixes, prefixes = mi.levels
    # note: mi.labels was renamed to mi.codes in later pandas versions
    col_names = [prefixes[i_p] + '_' + suffixes[i_s] for (i_s, i_p) in zip(*mi.labels)]
    df.columns = col_names
    return df

flatten_multi_index(pd.pivot_table(ts, index='month', columns='company'))
The above version only handles a 2D MultiIndex but it could be generalized if needed.
An update (as of early 2017 and pandas 0.19.2): you can use .values on a MultiIndex, so this snippet should flatten MultiIndexes for those in need. The snippet is both too clever and not clever enough: it can handle either the row index or the column names of the DataFrame, but it will blow up if the result of getattr(df, way) isn't nested (i.e., isn't a MultiIndex).
def flatten_multi(df, way='index'):  # or way='columns'
    assert way in {'index', 'columns'}, "I'm sorry Dave."
    mi = getattr(df, way)
    flat_names = ["_".join(s) for s in mi.values]
    setattr(df, way, flat_names)
    return df
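And to get the company-first ordering the question asked for (WidgetCo_share rather than share_WidgetCo), a sketch against the ts frame defined above that reverses each column tuple before joining:
flat = pd.pivot_table(ts, index='month', columns='company')
# each column label is a (measure, company) tuple, e.g. ('share', 'WidgetCo');
# reverse it so the company becomes the prefix
flat.columns = ['_'.join(reversed(col)) for col in flat.columns]
print(flat)  # columns like GadgetCo_price, GadgetCo_share, ...; no MultiIndex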

Pandas interpolate data with units

Hi everyone,
I've been reading Stack Overflow for a couple of years, and it has helped me a lot, so much that I never had to register before :)
But today I'm stuck on a problem using Python with pandas and quantities (it could be unum or pint as well). I've done my best to make this a clear post, but since it's my first one, I apologize if something is confusing and will correct any mistake you find :)
I want to import data from a source and build a pandas dataframe as follows:
import pandas as pd
import quantities as pq

depth = [0.0, 1.1, 2.0] * pq.m
depth2 = [0, 1, 1.1, 1.5, 2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
This gives:
s1 =
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
Now I want to extend the data to the depth2 values
(obviously there is no point interpolating depth over depth, but it's a test before it gets more complicated):
s2 = s1.reindex(depth2)
This gives:
s2 =
depth
0.0 0.0 m
1.0 NaN
1.1 1.1 m
1.5 NaN
2.0 2.0 m
So far no problem.
But when I try to interpolate the missing values doing:
s2['depth'].interpolate(method='values')
I got the following error:
C:\Python27\lib\site-packages\numpy\lib\function_base.pyc in interp(x, xp, fp, left, right)
1067 return compiled_interp([x], xp, fp, left, right).item()
1068 else:
-> 1069 return compiled_interp(x, xp, fp, left, right)
1070
1071
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
I understand that numpy's interpolation does not work on object dtype.
But if I try now to interpolate the missing values by dropping the units, it works:
s3 = s2['depth'].astype(float).interpolate(method='values')
This gives:
s3 =
0.0 0
1.0 1
1.1 1.1
1.5 1.5
2.0 2
Name: depth, dtype: object
How can I get back the unit in the depth column?
I can't find any trick to put back the unit...
Any help will be greatly appreciated.
Thanks
Here's a way to do what you want.
Split apart the quantities and create a set of 2 columns for each quantity
In [80]: df = pd.concat([col.apply(lambda x: pd.Series([x.item(), x.dimensionality.string],
                                                       index=[c, "%s_unit" % c]))
                         for c, col in s1.iteritems()])
In [81]: df
Out[81]:
depth depth_unit
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
In [82]: df = df.reindex([0,1.0,1.1,1.5,2.0])
In [83]: df
Out[83]:
depth depth_unit
0.0 0.0 m
1.0 NaN NaN
1.1 1.1 m
1.5 NaN NaN
2.0 2.0 m
Interpolate
In [84]: df['depth'] = df['depth'].interpolate(method='values')
Propagate the units
In [85]: df['depth_unit'] = df['depth_unit'].ffill()
In [86]: df
Out[86]:
depth depth_unit
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
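For anyone who wants to run this end to end, a self-contained version of the same idea (pd.concat and pd.Series spelled out, .items() instead of the older .iteritems(), and a plain float index in place of the question's quantity index to keep the reindex simple):
import pandas as pd
import quantities as pq

depth = [0.0, 1.1, 2.0] * pq.m
s1 = pd.DataFrame({'depth': [x for x in depth]}, index=[0.0, 1.1, 2.0])

# split each quantity column into a float column and a unit column
df = pd.concat([col.apply(lambda x: pd.Series([x.item(), x.dimensionality.string],
                                              index=[c, "%s_unit" % c]))
                for c, col in s1.items()])

df = df.reindex([0.0, 1.0, 1.1, 1.5, 2.0])
df['depth'] = df['depth'].interpolate(method='values')  # interpolate the floats
df['depth_unit'] = df['depth_unit'].ffill()             # propagate the unit
print(df)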
OK, I found a solution; it might not be the best one, but for my problem it works just fine:
import pandas as pd
import quantities as pq

def extendAndInterpolate(input, newIndex):
    """Extend a pandas dataframe to a new index and interpolate the values."""
    output = pd.concat([input, pd.DataFrame(index=newIndex)], axis=1)
    for col in output.columns:
        # (1) Try to retrieve the unit of the current column
        try:
            # if it succeeds, store the unit
            unit = 1 * output[col][0].units
        except Exception:  # 'except Exception, e:' in the original Python 2 code
            # if it fails, the column contains strings, so fall back to 1
            unit = 1
        # (2) Check the type of the values
        if isinstance(output[col][0], str):  # 'basestring' under Python 2
            # strings can't be interpolated; forward-fill them instead
            value = output[col].ffill()
        else:
            # to interpolate a quantity you need to:
            # - (a) drop the unit with astype(float)
            # - (b) interpolate the bare values
            # - (c) re-attach the unit
            value = [x * unit for x in output[col].astype(float).interpolate(method='values')]
        # (3) Write the extended column with the interpolated values back
        output[col] = pd.Series(value, index=output.index)
    # Return the output dataframe
    return output
Then:
depth = [0.0, 1.1, 2.0] * pq.m
depth2 = [0, 1, 1.1, 1.5, 2] * pq.m
s1 = pd.DataFrame(
    {'depth': [x for x in depth]},
    index=depth)
s2 = extendAndInterpolate(s1, depth2)
The result:
s1
depth
0.0 0.0 m
1.1 1.1 m
2.0 2.0 m
s2
depth
0.0 0.0 m
1.0 1.0 m
1.1 1.1 m
1.5 1.5 m
2.0 2.0 m
Thanks for your help.
