I have the following table:
import pandas as pd
import numpy as np
#Dataframe with random numbers and with an a,b,c,d,e index
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
#Resulting dataframe:
a b c d e
a 2.214229 1.621352 0.083113 0.818191 -0.900224
b -0.612560 -0.028039 -0.392266 0.439679 1.596251
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
d -0.061682 1.141558 -0.811471 0.242874 0.345159
e -0.714760 -0.172082 0.205638 0.220528 1.182013
How can i apply a function to the dataframes index? I want to round the numbers for every column where the index is "c".
#Numbers to round to 2 decimals:
a b c d e
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
What is the best way to do this?
For label based indexing use loc:
In [22]:
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
df
Out[22]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.241418 -0.838571 -0.551222 0.662890 -1.234716
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
In [23]:
df.loc['c'] = np.round(df.loc['c'],decimals=2)
df
Out[23]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.240000 -0.840000 -0.550000 0.660000 -1.230000
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
To round values of column c:
df['c'].round(decimals=2)
To round values of row c:
df.loc['c'].round(decimals=2)
Related
I have a data frame look like below I need to give space between each letter of word in same column
import pandas as pd
df = pd.DataFrame({'sequence': ['ABCAD', 'DBAACR']})
df
Expected Output
sequence
A b C A D
D B A A C R
import pandas as pd
df = pd.DataFrame({'sequence':['ABCAD','DBAACR']})
A = []
for i in df['sequence']:
a = (" ".join(i))
A.append(a)
df = pd.DataFrame({'sequence':A})
df
If you execute above cell which will return the pandas DataFrame as below.
sequence
0 A B C A D
1 D B A A C R
Thanks and don't forget to upvote :D
pd.DataFrame({'sequence':[' '.join('ABCAD'),' '.join('DBAACR')]})
You can use apply with lambda function to process columns in pandas data frame
df.sequence.apply(lambda x: ' '.join(list(x)))
Output:
0 A B C A D
1 D B A A C R
Want to replace some rows of some columns in a bigger pandas df by data in a smaller pandas df. The column names are same in both.
Tried using combine_first but it only updates the null values.
For example lets say df1.shape is 100, 25 and df2.shape is 10,5
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
To replace values in df1 the condition is where df1.A = df2.A and df1.B = df2.B
How can it be achieved in the most pythonic way? Any help will be appreciated.
Don't know I really understood your question does this solves your problem ?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
You could play with Multiindex.
First let us create those dataframe that you are working with:
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then transform A and B in indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
I was just wondering how is the best approach to implode a DataFrame with values separated by a given char.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe to create a copy from the original dataframe and process column by column with Pandas str.concat?
Thanks in advance!
Use DataFrame.agg with join, then convert Series to one row DataFrame with Series.to_frame and transpose by DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print (df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Suppose I have two dataframes:
>> df1
0 1 2
0 a b c
1 d e f
>> df2
0 1 2
0 A B C
1 D E F
How can I interleave the rows? i.e. get this:
>> interleaved_df
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
inspired by this question (very similar, but asks on columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset the index i.e
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Output :
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
EDIT (OmerB) : Incase of keeping the order regardless of the index value then.
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
Here's an extension of #Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define Dataframes with the full set of column/ index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use #Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols col_a col_b col_c
df_id rows
1 one a b c
2 one A B C
1 two d e f
2 two D E F
You could also preallocate a new DataFrame, and then fill it using a slice.
def interleave(dfs):
data = np.transpose(np.array([np.empty(dfs[0].shape[0]*len(dfs), dtype=dt) for dt in dfs[0].dtypes]))
out = pd.DataFrame(data, columns=dfs[0].columns)
for ix, df in enumerate(dfs):
out.iloc[ix::len(dfs),:] = df.values
return out
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.
You can try this way :
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
Is this a correct way of creating DataFrame for tuples? (assume that the tuples are created inside code fragment)
import pandas as pd
import numpy as np
import random
row = ['a','b','c']
col = ['A','B','C','D']
# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)
# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
df2[c] = df2[c].astype(object)
print df2
for i in row:
for j in col:
df2.set_value(i, j, (i+j, np.round(random.uniform(0, 1), 4)))
print df2
As you can see I first created a zeros(3,4) in numpy and then made each cell an OBJECT type in Pandas so I can insert tuples. Is this correct way to do or there is a better solution to ADD/RETRIVE tuples to matrices?
Results are fine:
A B C D
a 0 0 0 0
b 0 0 0 0
c 0 0 0 0
A B C D
a (aA, 0.7134) (aB, 0.006) (aC, 0.1948) (aD, 0.2158)
b (bA, 0.2937) (bB, 0.8083) (bC, 0.3597) (bD, 0.324)
c (cA, 0.9534) (cB, 0.9666) (cC, 0.7489) (cD, 0.8599)
First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:
import numpy as np
import pandas as pd
np.random.seed(2016)
row = ['a','b','c']
col = ['A','B','C','D']
data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)
print(df)
yields
A B C D
a (aA, 0.8967) (aB, 0.7302) (aC, 0.7833) (aD, 0.7417)
b (bA, 0.4621) (bB, 0.6426) (bC, 0.2249) (bD, 0.7085)
c (cA, 0.7471) (cB, 0.6251) (cC, 0.58) (cD, 0.2426)
Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, tuples require "object" dtype).
So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:
import numpy as np
import pandas as pd
np.random.seed(2016)
row=['a','b','c']
col=['A','B','C','D']
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
index=row, columns=col)
print(prevstate)
# A B C D
# a aA aB aC aD
# b bA bB bC bD
# c cA cB cC cD
print(prob)
# A B C D
# a 0.8967 0.7302 0.7833 0.7417
# b 0.4621 0.6426 0.2249 0.7085
# c 0.7471 0.6251 0.5800 0.2426
To loop through the columns, find the row with maximum probability and retrieve the corresponding prevstate, you could use .idxmax and .loc:
for col in prob.columns:
idx = (prob[col].idxmax())
print('{}: {}'.format(prevstate.loc[idx, col], prob.loc[idx, col]))
yields
aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417