Mapping numpy arrays into a pandas DataFrame results in ValueError

Say I have a dictionary mapping keys to arrays, such as:
In[0]: arrs = {
...: 'a': np.array([1, 2, 3]),
...: 'b': np.array([4, 5, 6])
}
And a pandas DataFrame whose index contains these keys:
In[1]: df = pd.DataFrame(index=list('abc'), columns = list('def'))
...: df
Out[1]:
d e f
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
I would like to populate the DataFrame with the values from the array dictionary.
This works:
In[2]: for idx in ['a', 'b']:
...: df.loc[idx, :] = arrs[idx]
...: df
Out[2]:
d e f
a 1 2 3
b 4 5 6
c NaN NaN NaN
Which is fine, but I would like to vectorize the operation. I tried what I thought would work:
In[3]: df.loc[('a', 'b'), :] = df.loc[('a', 'b'), :].index.map(lambda x: arrs[x])
But this results in a ValueError:
ValueError: could not broadcast input array from shape (2) into shape (2,3)
Why is my mapping only counting the number of arrays, and not actually seeing the shape of the arrays?

Build a DataFrame from your dictionary with from_dict, then update the first DataFrame.
import pandas as pd
df.update(pd.DataFrame.from_dict(arrs, orient='index', columns=['d', 'e', 'f']))
Output:
d e f
a 1 2 3
b 4 5 6
c NaN NaN NaN
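As for why the mapping fails: Index.map returns a one-dimensional Index of length 2 whose elements happen to be arrays, so the assignment tries to broadcast a shape-(2,) object into the (2, 3) target block. A minimal vectorized sketch, assuming (as in the question) that every array has one value per column, is to stack the arrays into a 2-D block before assigning:
import numpy as np
import pandas as pd

arrs = {'a': np.array([1, 2, 3]), 'b': np.array([4, 5, 6])}
df = pd.DataFrame(index=list('abc'), columns=list('def'))

keys = ['a', 'b']
# np.vstack turns the dict values into a (2, 3) array, which
# broadcasts cleanly into the two selected rows.
df.loc[keys, :] = np.vstack([arrs[k] for k in keys])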

Related

Why transpose data to get a multiindexed dataframe?

I'm a bit confused about data orientation when creating a MultiIndexed DataFrame from a DataFrame.
I import data with read_excel() and I begin with something like:
import pandas as pd
df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
columns=['k', 'k', 'm', 'm'])
df
Out[3]:
k k m m
0 A B A B
1 1 2 3 4
I want to multiindex this and to obtain:
A B A B
k k m m
0 1 2 3 4
Mainly following the pandas docs, I did:
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)
If I try
df2 = pd.DataFrame(df.values, index=multiindex)
(...)
ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
I then have to transpose the values:
df2 = pd.DataFrame(df.values.T, index=multiindex)
df2
Out[11]:
0
topLevel downLevel
A k 1
B k 2
A m 3
B m 4
Lastly, I transpose this dataframe again to obtain:
df2.T
Out[12]:
topLevel A B A B
downLevel k k m m
0 1 2 3 4
OK, this is what I want, but I don't understand why I have to transpose twice. It seems useless.
You can create the MultiIndex yourself, and then drop the row. From your starting df:
import pandas as pd
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.columns], names=[None]*2)
df = df.iloc[1:].reset_index(drop=True)
A B A B
k k m m
0 1 2 3 4
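As for the "why" in the question: the DataFrame constructor aligns index= with the rows of the values it is given, and after drop(0) the data is one row of four values while multiindex has four entries, hence the shape error. The MultiIndex describes the columns here, not the rows, so a sketch that avoids both transposes is simply to pass it as columns=:
df2 = pd.DataFrame(df.values, columns=multiindex)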

pandas stacking dataframe reshapes data

I'm trying to stack two 3-column data frames using concat, append, or merge. The result is a 5-column dataframe where the original columns have a different order in places. Here are some of the things I've tried:
dfTrain = pd.read_csv("agr_hi_train.csv")
dfTrain2 = pd.read_csv("english/agr_en_train.csv")
dfTrain2.reset_index()
frames = [dfTrain, dfTrain2]
test = dfTrain2.append(dfTrain, ignore_index=True)
test2 = dfTrain2.append(dfTrain)
test3 = pd.concat(frames, axis=0, ignore_index=True)
test4 = pd.merge(dfTrain,dfTrain2, right_index=True, left_index=True)
With the following results:
print(dfTrain.shape)
print(dfTrain2.shape)
print(test.shape)
print(test2.shape)
print(test3.shape)
print(test4.shape)
Output is:
(20198, 5)
(20198, 5)
(11998, 6)
(8200, 6)
(8200, 3)
(11998, 3)
I want the result to be:
(20198, 3) # i.e. the last two stacked on top of each other...
Any ideas why I'm getting the extra columns, etc.?
If you have different column names, then your append will separate the columns. For example:
import numpy as np
import pandas as pd

dfTrain = pd.DataFrame(np.random.rand(8200, 3), columns=['A', 'B', 'C'])
dfTrain2 = pd.DataFrame(np.random.rand(11998, 3), columns=['D', 'E', 'F'])
test = dfTrain.append(dfTrain2)
print(test)
has the output:
A B C D E F
0 0.617294 0.507264 0.330792 NaN NaN NaN
1 0.439806 0.355340 0.757864 NaN NaN NaN
2 0.740674 0.332794 0.530613 NaN NaN NaN
...
20195 NaN NaN NaN 0.295392 0.621741 0.255251
20196 NaN NaN NaN 0.096586 0.841174 0.392839
20197 NaN NaN NaN 0.071756 0.998280 0.451681
If you rename the columns in both dataframes to match, then it'll line up.
dfTrain2.columns = ['A','B','C']
test2 = dfTrain.append(dfTrain2)
print(test2)
A B C
0 0.545936 0.103332 0.939721
1 0.258807 0.274423 0.262293
2 0.374780 0.458810 0.955040
...
[20198 rows x 3 columns]
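A side note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same stacking (after renaming the columns to match) is done with concat; ignore_index=True gives a clean 0..n-1 index:
test2 = pd.concat([dfTrain, dfTrain2], ignore_index=True)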

Boolean Indexing along the row axis of a DataFrame in pandas

import numpy as np
import pandas as pd

a = [[1, 2, 3, 4, 5], [6, np.nan, 8, np.nan, 10]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])
In [5]: df
Out[5]:
a b c d e
foo 1 2.0 3 4.0 5
bar 6 NaN 8 NaN 10
I understand how normal boolean indexing works; for example, if I want to select the rows that have c > 3 I would write df[df.c > 3]. However, what if I want to do that along the row axis? Say I want only the columns where the 'bar' row is NaN.
I would have assumed that the following should do it, due to the similarity of df['a'] and df.loc['bar']:
df.loc[df.loc['bar'].isnull()]
But it doesn't, and neither does results[results.loc['hl'].isnull()]; both give the same error: pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
So how would I do it?
IIUC you want to use the boolean mask to mask the columns:
In [135]:
df[df.columns[df.loc['bar'].isnull()]]
Out[135]:
b d
foo 2.0 4.0
bar NaN NaN
Or you can use ix and decay the series to np array:
In [138]:
df.ix[:,df.loc['bar'].isnull().values]
Out[138]:
b d
foo 2.0 4.0
bar NaN NaN
The problem here is that the boolean series returned is a mask on the columns:
In [136]:
df.loc['bar'].isnull()
Out[136]:
a False
b True
c False
d True
e False
Name: bar, dtype: bool
but your index contains none of these column values as labels, hence the error. So you need to apply the mask to the columns, or pass a np array to mask the columns in ix.
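Since .ix has been removed from modern pandas, the same column masking works with .loc on current versions, because the boolean Series aligns against the columns there:
df.loc[:, df.loc['bar'].isnull()]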

Pandas groupby result into multiple columns

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.
For example: say I have the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
Group ID Value
0 A 1 66
1 C 2 2
2 B 3 98
3 A 4 90
4 C 5 85
5 C 6 38
>>>
I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.
Currently I am able to do this as follows, but I am looking for a cleaner methodology:
First, I create a dataframe with a list of the IDs in each group.
>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
ID Value
Group
A [1, 4] 98
B [3] 76
C [2, 5, 6] 204
>>>
And then I use pd.Series to split those up into columns, rename them, and then join it back.
>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>>
>>> result.join(id_df)[id_cols+['Value']]
ID1 ID2 ID3 Value
Group
A 1 4 NaN 98
B 3 NaN NaN 76
C 2 5 6 204
>>>
Is there a way to do this without first having to create the list of values?
You could use
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
to create id_df without the intermediate result DataFrame.
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
yields
ID1 ID2 ID3 Value
Group
A 1 4 NaN 77
B 3 NaN NaN 84
C 2 5 6 86
Another way of doing this is to first add a "helper" column to your data, then pivot your dataframe using the helper column, in the case below "ID_Count":
Using @unutbu's setup:
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')
#Create helper column
df['ID_Count'] = grp['ID'].cumcount() + 1
#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())
Output:
ID_Count ID1 ID2 ID3 Value
Group
A 1.0 4.0 NaN 77
B 3.0 NaN NaN 84
C 2.0 5.0 6.0 86
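A compatibility note: newer pandas versions (2.0+) require keyword arguments for pivot, so on recent installs the pivot step would be written as:
df_out = (df.pivot(index='Group', columns='ID_Count', values='ID')
            .add_prefix('ID')
            .assign(Value=grp['Value'].sum()))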
Using get_dummies and MultiLabelBinarizer (scikit-learn):
import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.DataFrame()
df['Group']=['A','C','B','A','C','C']
df['ID']=[1,2,3,4,5,6]
df['Value']=np.random.randint(1,100,6)
classes = df['ID'].unique()  # the label set MultiLabelBinarizer should decode back to
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])
df2 = pd.get_dummies(df, '', '', columns=['ID']).groupby(by='Group').sum()
df3 = pd.DataFrame(mlb.inverse_transform(df2[df['ID'].unique()].values), index=df2.index)
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[1])]
pd.concat([df3, df2['Value']], axis=1)
ID1 ID2 ID3 Value
Group
A 1 4 NaN 63
B 3 NaN NaN 59
C 2 5 6 230

Python pandas: fill a dataframe row by row

The simple task of adding a row to a pandas.DataFrame object seems to be hard to accomplish. There are three Stack Overflow questions relating to this, none of which gives a working answer.
Here is what I'm trying to do. I have a DataFrame of which I already know the shape as well as the names of the rows and columns.
>>> df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
>>> df
a b c d
x NaN NaN NaN NaN
y NaN NaN NaN NaN
z NaN NaN NaN NaN
Now, I have a function to compute the values of the rows iteratively. How can I fill in one of the rows with either a dictionary or a pandas.Series? Here are various attempts that have failed:
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df['y'] = y
AssertionError: Length of values does not match length of index
Apparently it tried to add a column instead of a row.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.join(y)
AttributeError: 'builtin_function_or_method' object has no attribute 'is_unique'
Very uninformative error message.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.set_value(index='y', value=y)
TypeError: set_value() takes exactly 4 arguments (3 given)
Apparently that is only for setting individual values in the dataframe.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.append(y)
Exception: Can only append a Series if ignore_index=True
Well, I don't want to ignore the index, otherwise here is the result:
>>> df.append(y, ignore_index=True)
a b c d
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 1 5 2 3
It did align the column names with the values, but lost the row labels.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.ix['y'] = y
>>> df
a b \
x NaN NaN
y {'a': 1, 'c': 2, 'b': 5, 'd': 3} {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z NaN NaN
c d
x NaN NaN
y {'a': 1, 'c': 2, 'b': 5, 'd': 3} {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z NaN NaN
That also failed miserably.
So how do you do it ?
df['y'] will set a column; since you want to set a row, use .loc.
Note that .ix is equivalent here: yours failed because you tried to assign a dictionary to each element of the row 'y', which is probably not what you want. Converting to a Series tells pandas that you want to align the input (for example, you then don't have to specify all of the elements).
In [6]: import pandas as pd
In [7]: df = pd.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
In [8]: df.loc['y'] = pd.Series({'a':1, 'b':5, 'c':2, 'd':3})
In [9]: df
Out[9]:
a b c d
x NaN NaN NaN NaN
y 1 5 2 3
z NaN NaN NaN NaN
Update: because append has been deprecated, use concat instead:
df = pd.DataFrame(columns=["firstname", "lastname"])
entry = pd.DataFrame.from_dict({
"firstname": ["John"],
"lastname": ["Johny"]
})
df = pd.concat([df, entry], ignore_index=True)
Here is a simpler version:
import pandas as pd

df = pd.DataFrame(columns=('col1', 'col2', 'col3'))
for i in range(5):
    df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']
If your input rows are lists rather than dictionaries, then the following is a simple solution:
import pandas as pd
list_of_lists = []
list_of_lists.append([1,2,3])
list_of_lists.append([4,5,6])
pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'])
# A B C
# 0 1 2 3
# 1 4 5 6
The logic behind the code is straightforward:
Make a df with one row from the dictionary.
Then create a df of shape (1, 4) that contains only NaN and has the same columns as the dictionary keys.
Then concatenate a NaN df, the dict df, and another NaN df.
import pandas as pd
import numpy as np
raw_datav = {'a':1, 'b':5, 'c':2, 'd':3}
datav_df = pd.DataFrame(raw_datav, index=[0])
nan_df = pd.DataFrame([[np.nan]*4], columns=raw_datav.keys())
df = pd.concat([nan_df, datav_df, nan_df], ignore_index=True)
df.index = ["x", "y", "z"]
print(df)
gives
a b c d
x NaN NaN NaN NaN
y 1.0 5.0 2.0 3.0
z NaN NaN NaN NaN
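If many rows are being added, a pattern worth knowing is to collect the rows first and construct the DataFrame once at the end, since growing a frame row by row copies data on every insert. A minimal sketch with placeholder values:
import pandas as pd

rows = []  # collect one plain dict per row
for label in ['x', 'y', 'z']:
    rows.append({'a': 1, 'b': 5, 'c': 2, 'd': 3})  # placeholder row values
df = pd.DataFrame(rows, index=['x', 'y', 'z'])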
