Is this a correct way of creating a DataFrame for tuples? (Assume the tuples are created inside the code fragment.)
import pandas as pd
import numpy as np
import random
row = ['a','b','c']
col = ['A','B','C','D']
# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)
# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
    df2[c] = df2[c].astype(object)
print(df2)
for i in row:
    for j in col:
        # .set_value is deprecated (removed in pandas 1.0); .at is the modern equivalent
        df2.at[i, j] = (i+j, np.round(random.uniform(0, 1), 4))
print(df2)
As you can see, I first created a zeros (3,4) matrix in NumPy and then made each cell an object dtype in pandas so I can insert tuples. Is this the correct way to do it, or is there a better solution for adding/retrieving tuples to/from matrices?
Results are fine:
A B C D
a 0 0 0 0
b 0 0 0 0
c 0 0 0 0
A B C D
a (aA, 0.7134) (aB, 0.006) (aC, 0.1948) (aD, 0.2158)
b (bA, 0.2937) (bB, 0.8083) (bC, 0.3597) (bD, 0.324)
c (cA, 0.9534) (cB, 0.9666) (cC, 0.7489) (cD, 0.8599)
First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:
import numpy as np
import pandas as pd
np.random.seed(2016)
row = ['a','b','c']
col = ['A','B','C','D']
data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)
print(df)
yields
A B C D
a (aA, 0.8967) (aB, 0.7302) (aC, 0.7833) (aD, 0.7417)
b (bA, 0.4621) (bB, 0.6426) (bC, 0.2249) (bD, 0.7085)
c (cA, 0.7471) (cB, 0.6251) (cC, 0.58) (cD, 0.2426)
Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, tuples require "object" dtype).
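To make the penalty concrete, here is a minimal sketch (the variable names are illustrative) showing that tuples force "object" dtype, while plain numbers get a native dtype that vectorized operations can use:
import numpy as np
import pandas as pd
obj_col = pd.Series([(i, i) for i in range(3)])      # tuples -> object dtype
num_col = pd.Series(np.arange(3, dtype=np.float64))  # native float64 dtype
print(obj_col.dtype)  # object: operations fall back to Python-speed loops
print(num_col.dtype)  # float64: operations use fast NumPy routines
print(num_col * 2)    # vectorized; obj_col * 2 would repeat each tuple instead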
So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:
import numpy as np
import pandas as pd
np.random.seed(2016)
row=['a','b','c']
col=['A','B','C','D']
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
index=row, columns=col)
print(prevstate)
# A B C D
# a aA aB aC aD
# b bA bB bC bD
# c cA cB cC cD
print(prob)
# A B C D
# a 0.8967 0.7302 0.7833 0.7417
# b 0.4621 0.6426 0.2249 0.7085
# c 0.7471 0.6251 0.5800 0.2426
To loop through the columns, find the row with maximum probability and retrieve the corresponding prevstate, you could use .idxmax and .loc:
for c in prob.columns:
    idx = prob[c].idxmax()
    print('{}: {}'.format(prevstate.loc[idx, c], prob.loc[idx, c]))
yields
aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417
I have this array (it's the result of a similarity calculation); it's a list of lists of tuples like this:
example = [[(a,b), (c,d)], [(a1,b1), (c1,d2)] …]
In example there are 121044 lists of 30 tuples each.
I want a pandas DataFrame of just the second value of each tuple (i.e. b, d, b1, d2) without spending too much time computing it.
Do you have any ideas?
Use a nested list comprehension:
df = pd.DataFrame([[y[1] for y in x] for x in example])
print (df)
0 1
0 b d
1 b1 d2
If you also want column names, pass columns:
df = pd.DataFrame([[y[1] for y in x] for x in example], columns=['col1','col2'])
print (df)
col1 col2
0 b d
1 b1 d2
For numeric data, you can use numpy indexing directly. This should be more efficient than a list comprehension, as pandas uses numpy internally to store data in contiguous memory blocks.
import pandas as pd, numpy as np
example = [[(1,2), (3,4)], [(5,6), (7,8)]]
df = pd.DataFrame(np.array(example)[..., 1],
columns=['col1', 'col2'])
print(df)
col1 col2
0 2 4
1 6 8
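Note that np.array(example) first builds one rectangular 3-D array, so this approach assumes every inner list has the same number of tuples, which holds here (121044 lists of 30 tuples each).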
Suppose I have two dataframes:
>> df1
0 1 2
0 a b c
1 d e f
>> df2
0 1 2
0 A B C
1 D E F
How can I interleave the rows? i.e. get this:
>> interleaved_df
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
Inspired by this question (very similar, but about columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset the index, i.e.:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Output :
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
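One caveat: sort_index defaults to an unstable quicksort, so with duplicate index values the relative order of df1's and df2's rows is not strictly guaranteed. To be safe you can request a stable sort (a sketch; older pandas spells it kind='mergesort'):
concat_df = pd.concat([df1, df2]).sort_index(kind='stable').reset_index(drop=True)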
EDIT (OmerB): In case you want to keep the order regardless of the index values:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
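If you'd rather not add the toolz dependency, roughly the same interleave can be built from the standard library. A sketch that, unlike toolz.interleave, assumes equal-length frames like the example:
import pandas as pd
from itertools import chain
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
# zip pairs up the rows; chain flattens the pairs in order
rows = list(chain.from_iterable(zip(df1.values, df2.values)))
print(pd.DataFrame(rows))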
Here's an extension of @Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define the DataFrames with the full set of column/index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use @Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols col_a col_b col_c
df_id rows
1 one a b c
2 one A B C
1 two d e f
2 two D E F
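A side benefit of keeping df_id in the index is that either source frame can be recovered later, for example with .xs:
print(data.xs(1, level='df_id'))
# cols col_a col_b col_c
# rows
# one      a     b     c
# two      d     e     f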
You could also preallocate a new DataFrame, and then fill it using a slice.
import numpy as np
import pandas as pd

def interleave(dfs):
    # preallocate one empty column per dtype, then stack them side by side
    data = np.transpose(np.array([np.empty(dfs[0].shape[0] * len(dfs), dtype=dt)
                                  for dt in dfs[0].dtypes]))
    out = pd.DataFrame(data, columns=dfs[0].columns)
    # fill every len(dfs)-th row from each source frame
    for ix, df in enumerate(dfs):
        out.iloc[ix::len(dfs), :] = df.values
    return out
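A quick usage sketch (hypothetical frames; the preallocation assumes all inputs share columns, dtypes and row counts):
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['x', 'y'])
print(interleave([df1, df2]))
#    x  y
# 0  1  2
# 1  5  6
# 2  3  4
# 3  7  8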
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note: for ~200,000 rows with 20 columns of mixed string, integer and float types, the index method is around 5x faster.
You can try it this way:
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
When you call the unique() method on a Series you get a numpy array as a result; the same happens when you call it on a groupby. Consider this example:
import pandas as pd
L0 = ['G','i','G','h','j','h','G','j']
L1 = ['A','A','B','B','B','B','B','B']
df = pd.DataFrame({"A":L0,"B":L1})
dg = df.groupby('B').A.unique()
Resulting in this:
Out[56]:
B
A [G, i]
B [G, h, j]
Name: A, dtype: object
I want each unique element in its own row though:
A
B
A G
A i
B G
B h
B j
I can achieve this by hand like this (I'm deliberately omitting any iteration over DataFrames and only using the underlying numpy arrays):
de = pd.DataFrame(columns=["A","B"])
for i in range(dg.index.nunique()):
    ds = pd.Series(dg.values[i]).to_frame()
    ds.columns = ["A"]
    ds["B"] = dg.index.values[i]
    de = de.append(ds)  # note: DataFrame.append is deprecated in modern pandas (use pd.concat)
de = de.set_index('B')
But I'm wondering if there is a shorter (and faster) way that doesn't need loops, creating new Series or DataFrames, or messing around with the numpy arrays.
If not, I might propose it as a feature.
You can use apply with Series:
dg = (df.groupby('B').A
        .apply(lambda x: pd.Series(x.unique()))
        .reset_index(level=1, drop=True)
        .to_frame())
print (dg)
A
B
A G
A i
B G
B h
B j
Another possible solution is drop_duplicates:
df = df.drop_duplicates(['A','B']).set_index('B')
print (df)
A
B
A G
A i
B G
B h
B j
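On pandas 0.25 or newer, Series.explode does this unpacking directly on the grouped result from the question, which may be the shortest route of all (a sketch):
dg = df.groupby('B').A.unique()   # Series of arrays, as in the question
print(dg.explode().to_frame())
#    A
# B
# A  G
# A  i
# B  G
# B  h
# B  j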
What is an efficient way to get the diagonal of a square DataFrame? I would expect the result to be a Series with a two-level MultiIndex, the first level being the index of the DataFrame and the second being its columns.
Setup
import pandas as pd
import numpy as np
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(3, 3) * 5,
columns = list('abc'),
index = list('ABC'),
dtype=np.int64
)
I want to see this:
print(df.stack().loc[[('A', 'a'), ('B', 'b'), ('C', 'c')]])
A a 2
B b 2
C c 3
If you don't mind using numpy, you could use numpy.diag:
pd.Series(np.diag(df), index=[df.index, df.columns])
A a 2
B b 2
C c 3
dtype: int64
You could do something like this:
In [16]:
midx = pd.MultiIndex.from_tuples(list(zip(df.index,df.columns)))
pd.DataFrame(data=np.diag(df), index=midx)
Out[16]:
0
A a 2
B b 2
C c 3
np.diag will give you the diagonal values as a np array, you can then construct the multiindex by zipping the index and columns and pass this as the desired index in the DataFrame ctor.
Actually, the MultiIndex generation doesn't need to be so complicated:
In [18]:
pd.DataFrame(np.diag(df), index=[df.index, df.columns])
Out[18]:
0
A a 2
B b 2
C c 3
But johnchase's answer is neater.
You can also use iat in a list comprehension to get the diagonal.
>>> pd.Series([df.iat[n, n] for n in range(len(df))], index=[df.index, df.columns])
A a 2
B b 2
C c 3
dtype: int64
I have the following table:
import pandas as pd
import numpy as np
# DataFrame with random numbers and an a,b,c,d,e index
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
# Now I name the columns the same
df.columns = ['a','b','c','d','e']
# Resulting dataframe:
a b c d e
a 2.214229 1.621352 0.083113 0.818191 -0.900224
b -0.612560 -0.028039 -0.392266 0.439679 1.596251
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
d -0.061682 1.141558 -0.811471 0.242874 0.345159
e -0.714760 -0.172082 0.205638 0.220528 1.182013
How can I apply a function to the DataFrame's index? I want to round the numbers in every column where the index is "c".
#Numbers to round to 2 decimals:
a b c d e
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
What is the best way to do this?
For label-based indexing, use loc:
In [22]:
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
# Now I name the columns the same
df.columns = ['a','b','c','d','e']
df
Out[22]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.241418 -0.838571 -0.551222 0.662890 -1.234716
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
In [23]:
df.loc['c'] = np.round(df.loc['c'],decimals=2)
df
Out[23]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.240000 -0.840000 -0.550000 0.660000 -1.230000
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
To round values of column c:
df['c'].round(decimals=2)
To round values of row c:
df.loc['c'].round(decimals=2)
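Note that .round returns a new object rather than modifying df in place; assign the result back (as in the .loc example above) if you want the rounded values to stick:
df.loc['c'] = df.loc['c'].round(decimals=2)  # persist the rounded row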