If I have a pandas.DataFrame which looks like this:
Probability
0 1 2 3
C H C C 0.058619
H H C H 0.073274
C H C H 0.195398
C H H H 0.113982
C H H C 0.009770
C C C C 0.087929
H C H H 0.005234
H H H C 0.003664
H H C C 0.021982
C C H C 0.004187
H C H C 0.000449
C C H H 0.048849
H C C C 0.009421
H H H H 0.042743
C C C H 0.293096
H C C H 0.031403
The index is a tuple of length 4, and the rows correspond to all sequences of length four over the 2-letter alphabet ['H','C']. What is the best way to sum the rows with an H in position i, for each index position i?
df.ix['H'].sum() works for position 0, but I can't see how to generalize it to the 'any position' case. For example, I need to be able to perform the same calculation regardless of how long the sequence is, or with more than a 2-letter alphabet. Moreover, df.ix['H'] is awkward because it doesn't seem to accept wildcards, e.g. df.ix['*','H'] for all sequences with an H in index position 1. Does anybody have any suggestions? Thanks.
Setup
To create a dummy data frame which corresponds to your provided example, I used the following:
import pandas as pd
import numpy as np
import random
# define sequence and target
sequence = ["H", "C"]
target = "H"
# define shapes
size_col = 4
size_row = 100
# create dummy data and dummy columns
array_indices = np.random.choice(sequence, size=(size_row, size_col))
array_value = np.random.random(size=(size_row, 1))
array = np.concatenate([array_indices, array_value], axis=1)
col_indices = ["Idx {}".format(x) for x in range(size_col)]
col_values = ["Probability"]
columns = col_indices + col_values
# create pandas data frame
df = pd.DataFrame(array, columns=columns)
df[col_values] = df[col_values].astype(float)
The resulting pandas.DataFrame looks like this:
>>> print(df.head())
Idx 0 Idx 1 Idx 2 Idx 3 Probability
C C C H 0.892125
C H C H 0.633699
C C C C 0.228546
H C H C 0.766639
C H C C 0.379930
The only difference from your data frame is the reset index (you would get the same by calling df.reset_index() on yours).
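If you prefer to keep a MultiIndex like in your original frame, a hedged one-liner converts this dummy frame back (the solution below works on the flat form, though):
df_mi = df.set_index(col_indices)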
Solution
Now, to get the sums of the rows with a target value for all indices, you may use the following:
bool_indices = df[col_indices] == target
result = bool_indices.apply(lambda x: df.loc[x, col_values].sum())
First, you create a new data frame of boolean values that indicate, for each row, whether the respective index column contains the target value.
Second, you use each of these boolean series as a mask to select the matching subset of your actual value column, and finally apply an arbitrary method such as sum() to it.
The result is the following:
>>> print(result)
Idx 0 Idx 1 Idx 2 Idx 3
Probability 23.246007 23.072544 24.775996 24.683079
This solution is flexible in regard to your input sequence, the target and the shape of your data.
In addition, if you want to use slicing with wildcards, you can use pandas.IndexSlice on your original (MultiIndex) data frame, for example:
idx = pd.IndexSlice
# to get all rows which have the "H" at second index
df.loc[idx[:, "H"], :]
# to get all rows which have the "H" at third index
df.loc[idx[:, :, "H"], :]
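Tying this back to the question, a minimal sketch that sums the probability of all sequences with an H at position 1 (assuming the 4-level MultiIndex frame from the question; sorting is added because label-based slicing on a MultiIndex generally requires a sorted index):
df_sorted = df.sort_index()  # MultiIndex slicing needs a lexsorted index
p_h_pos1 = df_sorted.loc[idx[:, "H"], "Probability"].sum()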
An alternative to what pansen suggested is to use pandas.groupby:
levels = [0, 1, 2, 3]
df_l = []  # collect one single-value frame per (level, label) pair
for i in levels:
    for label, group in df.groupby(level=i):
        MI = pd.MultiIndex.from_product([[i], [label]])
        val = float(group["Probability"].sum())
        df_l.append(pd.DataFrame([val], index=MI))
result = pd.concat(df_l)
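For reference, a more compact sketch of the same per-level sums (again assuming the original 4-level MultiIndex frame; per_level is just an illustrative name):
# sum Probability for every label at every index position
per_level = pd.concat(
    {i: df.groupby(level=i)["Probability"].sum() for i in range(4)},
    names=["position", "label"],
)
print(per_level)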
Related
I have a dataframe of floats
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
1 0.433127 0.479051 0.159739 0.734577 0.113672
2 0.391228 0.516740 0.430628 0.586799 0.737838
3 0.956267 0.284201 0.648547 0.696216 0.292721
4 0.001490 0.973460 0.298401 0.313986 0.891711
5 0.585163 0.471310 0.773277 0.030346 0.706965
6 0.374244 0.090853 0.660500 0.931464 0.207191
7 0.630090 0.298163 0.741757 0.722165 0.218715
I can divide it into quantiles for a single column like so:
import numpy as np
import pandas as pd

def groupby_quantiles(df, column, groups: int):
    quantiles = df[column].quantile(np.linspace(0, 1, groups + 1))
    bins = pd.cut(df[column], quantiles, include_lowest=True)
    return df.groupby(bins)
>>> df.pipe(groupby_quantiles, "a", 2).apply(lambda x: print(x))
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
2 0.391228 0.516740 0.430628 0.586799 0.737838
4 0.001490 0.973460 0.298401 0.313986 0.891711
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
3 0.956267 0.284201 0.648547 0.696216 0.292721
5 0.585163 0.471310 0.773277 0.030346 0.706965
7 0.630090 0.298163 0.741757 0.722165 0.218715
Now, I want to repeat the same operation on each of the groups for the next column. The code becomes ridiculous
>>> (
    df
    .pipe(groupby_quantiles, "a", 2)
    .apply(
        lambda df_group: (
            df_group
            .pipe(groupby_quantiles, "b", 2)
            .apply(lambda x: print(x))
        )
    )
)
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
2 0.391228 0.51674 0.430628 0.586799 0.737838
4 0.001490 0.97346 0.298401 0.313986 0.891711
a b c d e
3 0.956267 0.284201 0.648547 0.696216 0.292721
7 0.630090 0.298163 0.741757 0.722165 0.218715
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
5 0.585163 0.471310 0.773277 0.030346 0.706965
My goal is to repeat this operation for as many columns as I want, and then aggregate the groups at the end. Here is what the final function could look like, and the desired result, assuming we aggregate with the mean.
>>> groupby_quantiles(df, columns=["a", "b"], groups=[2, 2], agg="mean")
a b c d e
0 0.229947 0.163832 0.730887 0.756813 0.150660
1 0.196359 0.745100 0.364515 0.450392 0.814774
2 0.793179 0.291182 0.695152 0.709190 0.255718
3 0.509145 0.475180 0.466508 0.382462 0.410319
Any ideas on how to achieve this?
Here is a way. First, the quantile-then-cut step can be rewritten with qcut. Then a recursive operation, similar to this, handles an arbitrary number of columns.
def groupby_quantiles(df, cols, grs, agg_func):
    # to store all the results
    _dfs = []
    # recursive function
    def recurse(_df, depth):
        col = cols[depth]
        gr = grs[depth]
        # iterate over the groups per quantile
        for _, _dfgr in _df.groupby(pd.qcut(_df[col], gr)):
            if depth != -1: recurse(_dfgr, depth+1)  # recurse if not at the last column
            else: _dfs.append(_dfgr.agg(agg_func))  # otherwise perform the aggregation
    # using a negative depth makes it easier to access the right column and quantile count
    depth = -len(cols)
    recurse(df, depth)  # start the recursion
    return pd.concat(_dfs, axis=1).T  # concat the results and transpose
print(groupby_quantiles(df, cols = ['a','b'], grs = [2,2], agg_func='mean'))
# a b c d e
# 0 0.229946 0.163832 0.730887 0.756813 0.150660
# 1 0.196359 0.745100 0.364515 0.450392 0.814774
# 2 0.793179 0.291182 0.695152 0.709190 0.255718
# 3 0.509145 0.475181 0.466508 0.382462 0.410318
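For comparison, here is a hedged iterative sketch of the same idea (groupby_quantiles_iter is a hypothetical name); as in the recursive version, each column's quantiles are computed within the groups produced by the previous column:
import pandas as pd

def groupby_quantiles_iter(df, cols, grs, agg_func):
    # start with the whole frame as one group, then split every group
    # by quantiles of the next column
    groups = [df]
    for col, g in zip(cols, grs):
        groups = [sub for grp in groups
                  for _, sub in grp.groupby(pd.qcut(grp[col], g))]
    # aggregate each final group and stack the results into rows
    return pd.concat([grp.agg(agg_func) for grp in groups], axis=1).T

print(groupby_quantiles_iter(df, cols=['a', 'b'], grs=[2, 2], agg_func='mean'))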
Probably really straightforward but I'm having no luck with Google. I have a 2 column dataframe of tuples, and I'm looking to unpack each tuple then pair up the contents from the same position in each column. For example:
Col1 Col2
(a,b,c) (d,e,f)
my desired output is
a d
b e
c f
I have a solution using loops but I would like to know a better way to do it - firstly because I am trying to eradicate loops from my life and secondly because it's potentially not as flexible as I may need it to be.
l1=[('a','b'),('c','d'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])
df
Out[547]:
Col1 Col2
0 (a, b) (j, k)
1 (c, d) (l, m)
2 (e, f, g) (n, o, p)
3 (h, i) (q, r)
for i in range(len(df)):
    for j in range(len(df.iloc[i][1])):
        print(df.iloc[i][0][j], df.iloc[i][1][j])
a j
b k
c l
d m
e n
f o
g p
h q
i r
All pythonic suggestions and guidance hugely appreciated. Many thanks.
Addition: an example including a row with differing length tuples, per Ch3steR's request below - my loop would not work in this instance ('d2' would not be included, where I would want it to be outputted paired with a null).
l1=[('a','b'),('c','d','d2'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])
Send each Series through tolist, reconstruct a DataFrame from the result, and stack it. Then concat the pieces back together. This leaves you with a MultiIndex whose first level is the original DataFrame index and whose second level is the position within the tuple.
This works for older versions of pandas (pd.__version__ < '1.3.0') and for cases where the tuples have an unequal number of elements (where explode would fail).
import pandas as pd
df1 = pd.concat([pd.DataFrame(df[col].tolist()).stack().rename(col)
for col in df.columns], axis=1)
Col1 Col2
0 0 a j
1 b k
1 0 c l
1 d m
2 0 e n
1 f o
2 g p
3 0 h q
1 i r
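With the uneven example from the question's addition, the same two lines pair 'd2' with NaN automatically; a minimal sketch, assuming the second definition of df from the question:
l1 = [('a','b'),('c','d','d2'),('e','f','g'),('h','i')]
l2 = [('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1, l2)), columns=['Col1', 'Col2'])

df1 = pd.concat([pd.DataFrame(df[col].tolist()).stack().rename(col)
                 for col in df.columns], axis=1)
# row (1, 2) now holds Col1 = 'd2' and Col2 = NaN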
If the tuple lengths always match and you don't have a newer version of pandas that lets you pass a list of columns to explode, do something like this:
import pandas as pd
pd.concat([df.Col1.explode(), df.Col2.explode()], axis=1).reset_index(drop=True)
Col1 Col2
0 a j
1 b k
2 c l
3 d m
4 e n
5 f o
6 g p
7 h q
8 i r
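For completeness, on pandas >= 1.3.0, and only when the per-row tuple lengths match, a list of columns can be passed to explode directly; a minimal sketch:
out = df.explode(['Col1', 'Col2']).reset_index(drop=True)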
I was just wondering what the best approach is to implode a DataFrame, with the values separated by a given char.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe create a copy of the original dataframe and process it column by column with pandas str.cat?
Thanks in advance!
Use DataFrame.agg with join, then convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose with DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print (df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
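The column-by-column idea from the question also works; a hedged sketch of an equivalent using Series.str.cat on each column instead of agg:
# note: run this on the original df, before the agg line above reassigns it
imploded = df.astype(str).apply(lambda col: col.str.cat(sep='#')).to_frame().T
print(imploded)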
When using the unique() method on a Series you get a numpy array as a result; the same happens when doing it on a groupby. Consider this example:
import pandas as pd
L0 = ['G','i','G','h','j','h','G','j']
L1 = ['A','A','B','B','B','B','B','B']
df = pd.DataFrame({"A":L0,"B":L1})
dg = df.groupby('B').A.unique()
Resulting in this:
Out[56]:
B
A [G, i]
B [G, h, j]
Name: A, dtype: object
I want each unique element in its own row though:
A
B
A G
A i
B G
B h
B j
I can achieve this by hand like this (I'm deliberately omitting any iteration over DataFrames and only using the underlying numpy arrays):
de = pd.DataFrame(columns=["A","B"])
for i in range(dg.index.nunique()):
    ds = pd.Series(dg.values[i]).to_frame()
    ds.columns = ["A"]
    ds["B"] = dg.index.values[i]
    de = de.append(ds)
de = de.set_index('B')
But I'm wondering if there is a shorter (and fast) way that doesn't need loops, creating new Series or DataFrames, or messing around with the numpy arrays.
If not, I might propose it as a feature.
You can use apply with Series:
dg = (df.groupby('B').A
        .apply(lambda x: pd.Series(x.unique()))
        .reset_index(level=1, drop=True)
        .to_frame())
print (dg)
A
B
A G
A i
B G
B h
B j
Another possible solution is drop_duplicates:
df = df.drop_duplicates(['A','B']).set_index('B')
print (df)
A
B
A G
A i
B G
B h
B j
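On pandas >= 0.25, a hedged alternative is to explode the grouped result from the question directly:
dg = df.groupby('B').A.unique().explode().to_frame()
print(dg)
#    A
# B
# A  G
# A  i
# B  G
# B  h
# B  j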
Is this a correct way of creating a DataFrame for tuples? (Assume that the tuples are created inside the code fragment.)
import pandas as pd
import numpy as np
import random
row = ['a','b','c']
col = ['A','B','C','D']
# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)
# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
    df2[c] = df2[c].astype(object)
print df2
for i in row:
    for j in col:
        df2.set_value(i, j, (i+j, np.round(random.uniform(0, 1), 4)))
print df2
As you can see, I first created a zeros(3,4) matrix in numpy and then made each cell an OBJECT type in Pandas so I can insert tuples. Is this the correct way to do it, or is there a better solution for ADDING/RETRIEVING tuples to and from matrices?
Results are fine:
A B C D
a 0 0 0 0
b 0 0 0 0
c 0 0 0 0
A B C D
a (aA, 0.7134) (aB, 0.006) (aC, 0.1948) (aD, 0.2158)
b (bA, 0.2937) (bB, 0.8083) (bC, 0.3597) (bD, 0.324)
c (cA, 0.9534) (cB, 0.9666) (cC, 0.7489) (cD, 0.8599)
First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:
import numpy as np
import pandas as pd
np.random.seed(2016)
row = ['a','b','c']
col = ['A','B','C','D']
data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)
print(df)
yields
A B C D
a (aA, 0.8967) (aB, 0.7302) (aC, 0.7833) (aD, 0.7417)
b (bA, 0.4621) (bB, 0.6426) (bC, 0.2249) (bD, 0.7085)
c (cA, 0.7471) (cB, 0.6251) (cC, 0.58) (cD, 0.2426)
Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, tuples require "object" dtype).
So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:
import numpy as np
import pandas as pd
np.random.seed(2016)
row=['a','b','c']
col=['A','B','C','D']
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
index=row, columns=col)
print(prevstate)
# A B C D
# a aA aB aC aD
# b bA bB bC bD
# c cA cB cC cD
print(prob)
# A B C D
# a 0.8967 0.7302 0.7833 0.7417
# b 0.4621 0.6426 0.2249 0.7085
# c 0.7471 0.6251 0.5800 0.2426
To loop through the columns, find the row with maximum probability and retrieve the corresponding prevstate, you could use .idxmax and .loc:
for col in prob.columns:
    idx = prob[col].idxmax()
    print('{}: {}'.format(prevstate.loc[idx, col], prob.loc[idx, col]))
yields
aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417
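If you want to avoid the Python-level loop entirely, a hedged NumPy sketch of the same lookup (the variable names are illustrative):
import numpy as np

rows = prob.to_numpy().argmax(axis=0)           # row position of the max in each column
cols = np.arange(prob.shape[1])
best_states = prevstate.to_numpy()[rows, cols]  # array(['aA', 'aB', 'aC', 'aD'], dtype=object)
best_probs = prob.to_numpy()[rows, cols]        # array([0.8967, 0.7302, 0.7833, 0.7417])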