How to convert a series of tuples into a pandas dataframe? - python

Assume we have the following pandas Series, produced by calling apply on a dataframe after a groupby.
<class 'pandas.core.series.Series'>
0 (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2])
1 (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1])
2 (1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4])
3 (1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
4 (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6])
dtype: object
Can we convert this into a dataframe like the one below, given sigList = ['sig1', 'sig2', 'sig3']?
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
1 0 0.2 0.2 0.2 0.2 0.2 0.2
2 1000 0.6 0.7 0.5 0.1 0.3 0.1
1 0 0.4 0.4 0.4 0.4 0.4 0.4
1 0 0.5 0.5 0.5 0.5 0.5 0.5
3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Thanks in advance

Do it the old fashioned (and fast) way, using a list comprehension:
# note the trailing space: adjacent string literals are joined without a separator
columns = ("Length Distance sig1Max sig2Max "
"sig3Max sig1Min sig2Min sig3Min").split()
df = pd.DataFrame([[a, b, *c, *d] for a,b,c,d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
Or, if you meant to build the column names dynamically from sigList:
sigList = ['sig1', 'sig2', 'sig3']
columns = ['Length', 'Distance']
columns.extend(f'{s}{lbl}' for lbl in ('Max', 'Min') for s in sigList )
df = pd.DataFrame([[a,b,*c,*d] for a,b,c,d in series.values], columns=columns)
print(df)
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1 0 0.2 0.2 0.2 0.2 0.2 0.2
1 2 1000 0.6 0.7 0.5 0.1 0.3 0.1
2 1 0 0.4 0.4 0.4 0.4 0.4 0.4
3 1 0 0.5 0.5 0.5 0.5 0.5 0.5
4 3 14000 0.8 0.8 0.8 0.6 0.6 0.6
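For a self-contained check of the list-comprehension approach, here is a minimal sketch. The helper name tuples_to_frame is hypothetical (it is not part of pandas), and it assumes every tuple has the shape (length, distance, max_list, min_list) with both lists as long as sigList.

```python
import pandas as pd

def tuples_to_frame(series, sig_list):
    """Hypothetical helper: expand (length, distance, maxes, mins) tuples into columns."""
    columns = ["Length", "Distance"]
    # All the Max columns first, then all the Min columns, in sig_list order
    columns += [f"{s}{lbl}" for lbl in ("Max", "Min") for s in sig_list]
    rows = [[a, b, *c, *d] for a, b, c, d in series]
    return pd.DataFrame(rows, columns=columns, index=series.index)

series = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
])
df = tuples_to_frame(series, ["sig1", "sig2", "sig3"])
print(df)
```

Building the rows as plain lists lets pandas infer a dtype per column, so Length and Distance stay integers.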

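You could also try the following, where s is the Series (axis is passed by keyword, since newer pandas versions no longer accept it positionally in concat):
newdf=pd.DataFrame(s.tolist())
newdf=pd.concat([newdf[[0,1]],pd.DataFrame(newdf[2].tolist()),pd.DataFrame(newdf[3].tolist())],axis=1)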
newdf.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
newdf
Out[163]:
Length Distance sig1Max ... sig1Min sig2Min sig3Min
0 1 0 0.2 ... 0.2 0.2 0.2
1 2 1000 0.6 ... 0.1 0.3 0.1
2 1 0 0.4 ... 0.4 0.4 0.4
3 1 0 0.5 ... 0.5 0.5 0.5
4 3 14000 0.8 ... 0.6 0.6 0.6
[5 rows x 8 columns]

You can flatten each element and then convert each to a Series itself. Converting each element to a Series turns the main Series (s in the example below) into a DataFrame. Then just set the column names as you wish.
For example:
import pandas as pd
# load in your data
s = pd.Series([
(1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
(2, 1000, [0.6, 0.7, 0.5], [0.1, 0.3, 0.1]),
(1, 0, [0.4, 0.4, 0.4], [0.4, 0.4, 0.4]),
(1, 0, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
(3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6]),
])
def flatten(x):
    # note this is not very robust, but works for this case
    return [x[0], x[1], *x[2], *x[3]]
df = s.apply(flatten).apply(pd.Series)
df.columns = [
"Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
Then you have df as:
Length Distance sig1Max sig2Max sig3Max sig1Min sig2Min sig3Min
0 1.0 0.0 0.2 0.2 0.2 0.2 0.2 0.2
1 2.0 1000.0 0.6 0.7 0.5 0.1 0.3 0.1
2 1.0 0.0 0.4 0.4 0.4 0.4 0.4 0.4
3 1.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5
4 3.0 14000.0 0.8 0.8 0.8 0.6 0.6 0.6
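The Length and Distance columns come out as floats above, most likely because .apply(pd.Series) turns each mixed int/float row into a single float Series. If integer columns matter, one possible variation (a sketch, not the answer's own method) is to build the frame from the flattened lists in one go:

```python
import pandas as pd

s = pd.Series([
    (1, 0, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]),
    (3, 14000, [0.8, 0.8, 0.8], [0.6, 0.6, 0.6]),
])

def flatten(x):
    # same assumption as above: two scalars followed by two lists
    return [x[0], x[1], *x[2], *x[3]]

# One DataFrame constructor call; dtypes are inferred column by column
df_alt = pd.DataFrame([flatten(x) for x in s], index=s.index)
df_alt.columns = [
    "Length", "Distance", "sig1Max", "sig2Max", "sig3Max", "sig1Min", "sig2Min", "sig3Min"
]
print(df_alt.dtypes)
```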

Related

pandas changed column value condition of three other columns

I have the following pandas dataframe:
df = pd.DataFrame({'pred': [1, 2, 3, 4],
'a': [0.4, 0.6, 0.35, 0.5],
'b': [0.2, 0.4, 0.32, 0.1],
'c': [0.1, 0, 0.2, 0.2],
'd': [0.1, 0, 0.3, 0.2]})
I want to change the values in the 'pred' column based on columns a, b, c and d, as follows:
if the value in column a is larger than the values in columns b, c and d
and
at least one of columns b, c or d has a value larger than 0.25,
then change the value in 'pred' to 0. So the result should be:
pred a b c d
0 1 0.4 0.2 0.1 0.1
1 0 0.6 0.4 0.0 0.0
2 0 0.35 0.32 0.2 0.3
3 4 0.5 0.1 0.2 0.2
How can I do this?
Create a boolean mask, then use loc to set the value to 0 where the mask is True:
cols = ['b', 'c', 'd']
mask = df[cols].lt(df['a'], axis=0).all(axis=1) & df[cols].gt(.25).any(axis=1)
df.loc[mask, 'pred'] = 0
pred a b c d
0 1 0.40 0.20 0.1 0.1
1 0 0.60 0.40 0.0 0.0
2 0 0.35 0.32 0.2 0.3
3 4 0.50 0.10 0.2 0.2
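To see how the mask is put together, here is a small sketch that builds the two conditions separately. It assumes the d values that match the expected output ([0.1, 0, 0.3, 0.2]); the names a_is_largest and any_above are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})
cols = ['b', 'c', 'd']

# True where b, c and d are all smaller than a in the same row
a_is_largest = df[cols].lt(df['a'], axis=0).all(axis=1)
# True where at least one of b, c, d exceeds 0.25
any_above = df[cols].gt(0.25).any(axis=1)

mask = a_is_largest & any_above
df.loc[mask, 'pred'] = 0
print(df)
```

Because only the masked cells are assigned, the pred column keeps its integer dtype.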
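Alternatively, apply a function to each row:
import pandas as pd
def row_cond(row):
    # largest value among columns b, c and d
    m_val = row[['b', 'c', 'd']].max()
    if row['a'] > m_val and m_val > 0.25:
        row['pred'] = 0
    return row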
df = pd.DataFrame({'pred': [1, 2, 3, 4],
'a': [0.4, 0.6, 0.35, 0.5],
'b': [0.2, 0.4, 0.32, 0.1],
'c': [0.1, 0, 0.2, 0.2],
'd': [0.1, 0, 0.3, 0.2]})
new_df = df.apply(row_cond,axis=1)
Output:
pred a b c d
0 1.0 0.40 0.20 0.1 0.1
1 0.0 0.60 0.40 0.0 0.0
2 0.0 0.35 0.32 0.2 0.3
3 4.0 0.50 0.10 0.2 0.2
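The row-wise version returns pred as floats, which seems to follow from each row being handed to the function as one float Series. A minimal sketch of casting it back, assuming pred only ever holds whole numbers (the row.copy() is a defensive addition, not part of the answer):

```python
import pandas as pd

def row_cond(row):
    row = row.copy()  # avoid mutating the object pandas passes in
    m_val = row[['b', 'c', 'd']].max()
    if row['a'] > m_val and m_val > 0.25:
        row['pred'] = 0
    return row

df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})
new_df = df.apply(row_cond, axis=1)
# Restore the integer dtype that the row-wise apply lost
new_df['pred'] = new_df['pred'].astype(int)
print(new_df)
```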

Python: Appending a row into all rows in a dataframe

I have 2 dataframes as shown below:
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
WA WB WC
0 0.4 0.2 0.4
1 0.1 0.3 0.6
2 0.3 0.2 0.5
3 0.3 0.3 0.4
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
stv_A stv_B stv_c
0 0.5 0.2 0.4
Is there any way to append dff2, which consists of only one row, to every single row in dff? The resulting dataframe should thus have 6 columns and 4 rows.
You can use:
dff[dff2.columns] = dff2.squeeze()
print(dff)
# Output
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
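If you would rather not modify dff in place, a possible variation is to turn the single row into a dict and pass it to assign, which broadcasts each scalar down the new column. This is a sketch under the assumption that dff2 has exactly one row; out is just an illustrative name.

```python
import pandas as pd

dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3, 0.3, 0.4]],
                   columns=['WA', 'WB', 'WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns=['stv_A', 'stv_B', 'stv_c'])

# {'stv_A': 0.5, 'stv_B': 0.2, 'stv_c': 0.4} -> one new column per key
out = dff.assign(**dff2.iloc[0].to_dict())
print(out)
```

assign returns a new dataframe, so dff itself keeps its three original columns.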
Pandas does the broadcasting for you when you assign a scalar as a column:
import pandas as pd
dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3,0.3,0.4]], columns=['WA', 'WB','WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns = ['stv_A', 'stv_B', 'stv_c'])
for col in dff2.columns:
    dff[col] = dff2[col][0] # Pass a scalar
print(dff)
Output:
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
You can first repeat the row in dff2 len(dff) times, then concat the repeated dataframe to dff along the columns:
df = pd.concat([dff, pd.concat([dff2]*len(dff)).reset_index(drop=True)], axis=1)
print(df)
WA WB WC stv_A stv_B stv_c
0 0.4 0.2 0.4 0.5 0.2 0.4
1 0.1 0.3 0.6 0.5 0.2 0.4
2 0.3 0.2 0.5 0.5 0.2 0.4
3 0.3 0.3 0.4 0.5 0.2 0.4
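Pairing every row of one frame with every row of another is a cross join, so a cross merge is another way to express this. The sketch below assumes pandas 1.2 or newer, where merge accepts how='cross'.

```python
import pandas as pd

dff = pd.DataFrame([[0.4, 0.2, 0.4], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.3, 0.3, 0.4]],
                   columns=['WA', 'WB', 'WC'])
dff2 = pd.DataFrame([[0.5, 0.2, 0.4]], columns=['stv_A', 'stv_B', 'stv_c'])

# Each of the 4 rows in dff is paired with the single row in dff2
out = dff.merge(dff2, how='cross')
print(out)
```

With more than one row in dff2 this would produce len(dff) * len(dff2) rows, which may or may not be what you want.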

Converting a PD DF to a dictionary

I have a pandas dataframe (shown as an image in the original post), imported via the pd.read_csv method.
I would like to convert it to a dictionary, where the key is 'Countries', and the value is a list of the numbers 1 to 300.
What is the best way to do this? I have tried other methods listed on Stack Overflow, but since my df doesn't have column headings they are not working.
Something like this should do what your question asks:
d = {row[1]:list(row[2:]) for row in df.itertuples()}
Here is sample code showing the above in the context of your question:
records = [
['afghanistan', -0.9,-0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9],
['albania', -0.9,-0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9],
['algeria', -0.9,-0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9],
['andorra', -0.9,-0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9]
]
import pandas as pd
df = pd.DataFrame(records)
print(df)
d = {row[1]:list(row[2:]) for row in df.itertuples()}
print()
for k, v in d.items():
    print(f"{k} : {v}")
Output:
0 1 2 3 4 5 6 7 8 9 10
0 afghanistan -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
1 albania -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
2 algeria -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
3 andorra -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
afghanistan : [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
albania : [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
algeria : [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
andorra : [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
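Since the question mentions a CSV without column headings, here is a sketch of the same conversion starting from read_csv with header=None. The CSV text is made up for illustration, and set_index(0).T.to_dict(orient='list') is one alternative to the itertuples comprehension, not the answer's own method.

```python
import io
import pandas as pd

# Made-up sample standing in for the real file: country name, then the numbers
csv_text = "afghanistan,-0.9,-0.7,0.1\nalbania,-0.5,0.3,0.9\n"

# header=None tells pandas the first line is data, so columns are labeled 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Use column 0 (the country) as the index, then map each country to its row values
d = df.set_index(0).T.to_dict(orient='list')
print(d)
```

This assumes country names are unique; duplicate names would overwrite each other in the dictionary.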

Pandas row-wise addition with another column

I have a dataframe df
A B C
0.1 0.3 0.5
0.2 0.4 0.6
0.3 0.5 0.7
0.4 0.6 0.8
0.5 0.7 0.9
For each row, I would like to add the corresponding value from dataframe df1 to each element:
X
0.1
0.2
0.3
0.4
0.5
Such that the final result would be
A B C
0.2 0.4 0.6
0.4 0.6 0.8
0.6 0.8 1.0
0.8 1.0 1.2
1.0 1.2 1.4
I have tried using df_new = df.sum(df1, axis=0), but got the following error: TypeError: stat_func() got multiple values for argument 'axis'. I would be open to numpy solutions as well.
You can use np.add (with numpy imported as np):
df = np.add(df, df1.to_numpy())
print(df)
Prints:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
import pandas as pd
df = pd.DataFrame([[0.1,0.3, 0.5],
[0.2, 0.4, 0.6],
[0.3, 0.5, 0.7],
[0.4, 0.6, 0.8],
[0.5, 0.7, 0.9]],
columns=['A', 'B', 'C'])
df1 = [0.1, 0.2, 0.3, 0.4, 0.5]
# In one Pandas instruction
df = df.add(pd.Series(df1), axis=0)
Result:
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
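In the question df1 is a dataframe with a single column X rather than a list. A minimal sketch for that case, assuming both frames share the same row index: select the column as a Series and add it along axis=0 so it lines up with the rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0.3, 0.5],
                   [0.2, 0.4, 0.6],
                   [0.3, 0.5, 0.7],
                   [0.4, 0.6, 0.8],
                   [0.5, 0.7, 0.9]],
                  columns=['A', 'B', 'C'])
df1 = pd.DataFrame({'X': [0.1, 0.2, 0.3, 0.4, 0.5]})

# axis=0 matches the Series index against df's row index,
# so row i of every column gets df1['X'][i] added to it
result = df.add(df1['X'], axis=0)
print(result)
```

A plain df + df1 would instead align on column labels (A, B, C versus X) and give only NaN.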
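Try concat with .stack() and .sum() (here df1 and df2 denote the two dataframes from the question; axis is passed by keyword, as newer pandas versions require):
df_new = pd.concat([df1.stack(),df2.stack()],axis=1).bfill().sum(axis=1).unstack(1).drop('X',axis=1)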
A B C
0 0.2 0.4 0.6
1 0.4 0.6 0.8
2 0.6 0.8 1.0
3 0.8 1.0 1.2
4 1.0 1.2 1.4
df= pd.DataFrame([[0.1,0.3, 0.5],
[0.2, 0.4, 0.6],
[0.3, 0.5, 0.7],
[0.4, 0.6, 0.8],
[0.5, 0.7, 0.9]],
columns=['A', 'B', 'C'])
df["X"]=[0.1, 0.2, 0.3, 0.4, 0.5]
columns_to_add= df.columns[:-1]
for col in columns_to_add:
    df[col]+=df['X'] #this is where addition or any other operation can be performed
df = df.drop('X',axis=1) # drop the helper column (axis=1 means columns) and keep the result

How to select nested columns in a multi-indexed pandas dataframe

I created a "3D" Pandas dataframe, i.e. one with two-level (MultiIndex) columns, like this (array and nan are imported from numpy):
A= ['ECFP', 'ECFP', 'ECFP', 'FCFP', 'FCFP', 'FCFP', 'RDK5', 'RDK5', 'RDK5']
B = ['R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc']
C = array([[ 0.1 , 0.3 , 0.5 , nan, 0.6 , 0.4 ],
[ 0.4 , 0.3 , 0.3 , nan, 0.4 , 0.3 ],
[ 1.2 , 1.3 , 1.1 , nan, 1.5 , 1. ],
[ 0.4 , 0.3 , 0.4 , 0.8 , 0.1 , 0.2 ],
[ 0.2 , 0.3 , 0.3 , 0.3 , 0.5 , 0.6 ],
[ 1. , 1.2 , 1. , 0.9 , 1.2 , 1. ],
[ 0.4 , 0.7 , 0.5 , 0.4 , 0.6 , 0.6 ],
[ 0.6 , 0.5 , 0.3 , 0.3 , 0.3 , 0.5 ],
[ 1.2 , 1.5 , 1.3 , 0.97, 1.5 , 1. ]])
df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A,B)))
df = df.dropna(axis=0, how='any')
The final Dataframe looks like this:
ECFP FCFP RDK5
R tau RMSEc R tau RMSEc R tau RMSEc
0 0.1 0.4 1.2 0.4 0.2 1.0 0.4 0.6 1.2
1 0.3 0.3 1.3 0.3 0.3 1.2 0.7 0.5 1.5
2 0.5 0.3 1.1 0.4 0.3 1.0 0.5 0.3 1.3
4 0.6 0.4 1.5 0.1 0.5 1.2 0.6 0.3 1.5
5 0.4 0.3 1.0 0.2 0.6 1.0 0.6 0.5 1.0
How can I get the correlation matrix only between 'R' values for all types of data ('ECFP', 'FCFP', 'RDK5')?
use IndexSlice:
In [53]: df.loc[:, pd.IndexSlice[:, 'R']]
Out[53]:
ECFP FCFP RDK5
R R R
0 0.1 0.4 0.4
1 0.3 0.3 0.7
2 0.5 0.4 0.5
4 0.6 0.1 0.6
5 0.4 0.2 0.6
By using slice
df.loc[:,(slice(None),'R')]
Out[375]:
ECFP FCFP RDK5
R R R
0 0.1 0.4 0.4
1 0.3 0.3 0.7
2 0.5 0.4 0.5
4 0.6 0.1 0.6
5 0.4 0.2 0.6
Both answers work, but first I must lexsort the columns, otherwise I get this error:
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'
The solution is:
df.sort_index(axis=1, inplace=True)  # sortlevel() is no longer available in current pandas
print("Correlation matrix of Pearson's R values among all feature vector types:")
df.loc[:, pd.IndexSlice[:, 'R']].corr()
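Putting the pieces together, here is a runnable sketch with the question's data. It sorts the columns with sort_index(axis=1), then selects the 'R' sub-columns both with IndexSlice and with xs, which is an alternative not mentioned above that drops the inner level so the correlation matrix is labeled by feature type only. The names r_only, r_flat and corr are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
A = ['ECFP', 'ECFP', 'ECFP', 'FCFP', 'FCFP', 'FCFP', 'RDK5', 'RDK5', 'RDK5']
B = ['R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc', 'R', 'tau', 'RMSEc']
C = np.array([[0.1, 0.3, 0.5, nan, 0.6, 0.4],
              [0.4, 0.3, 0.3, nan, 0.4, 0.3],
              [1.2, 1.3, 1.1, nan, 1.5, 1.0],
              [0.4, 0.3, 0.4, 0.8, 0.1, 0.2],
              [0.2, 0.3, 0.3, 0.3, 0.5, 0.6],
              [1.0, 1.2, 1.0, 0.9, 1.2, 1.0],
              [0.4, 0.7, 0.5, 0.4, 0.6, 0.6],
              [0.6, 0.5, 0.3, 0.3, 0.3, 0.5],
              [1.2, 1.5, 1.3, 0.97, 1.5, 1.0]])

df = pd.DataFrame(C.T, columns=pd.MultiIndex.from_tuples(list(zip(A, B))))
df = df.dropna(axis=0, how='any')

# Lexsort the column MultiIndex so that slicing on the inner level is allowed
df = df.sort_index(axis=1)

r_only = df.loc[:, pd.IndexSlice[:, 'R']]   # keeps both column levels
r_flat = df.xs('R', axis=1, level=1)        # drops the inner 'R' level
corr = r_flat.corr()
print(corr)
```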
