I have a DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
df
a b c d
x NaN NaN NaN NaN
y NaN NaN NaN NaN
z NaN NaN NaN NaN
I also have a Series like this:
s = pd.Series(np.random.randn(3), index=['b', 'c', 'd'])
s
b -0.738283
c -0.649760
d -0.777346
dtype: float64
I want to insert the Series into a row of the DataFrame, for example the row at position 1 (the second row), giving this final DataFrame:
a b c d
x NaN NaN NaN NaN
y NaN -0.738283 -0.649760 -0.777346
z NaN NaN NaN NaN
How can I do it? Thanks in advance ;)
You can use iloc if you need to select a row of df by position:
#select second row (python counts from 0)
df.iloc[1] = s
print (df)
a b c d
x NaN NaN NaN NaN
y NaN 1.71523 0.269975 -1.3663
z NaN NaN NaN NaN
Or use loc if you need to select by label (ix did the same in older pandas, but it was deprecated in 0.20 and removed in 1.0):
df.loc['y'] = s
df.loc['z'] = s
print (df)
a b c d
x NaN NaN NaN NaN
y NaN 0.619316 0.917374 -0.559769
z NaN 0.619316 0.917374 -0.559769
Just pass the series as an array:
df.iloc[1,1:] = np.asarray(s)
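The approaches above differ in alignment: assigning the Series itself matches its index labels against the column names, while assigning an array fills cells by position, left to right. A minimal sketch, assuming fixed values (1.0, 2.0, 3.0 are made up for illustration) in place of the random ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, index=['x', 'y', 'z'], columns=['a', 'b', 'c', 'd'])
s = pd.Series([1.0, 2.0, 3.0], index=['b', 'c', 'd'])

# Label-aligned: s has no entry for 'a', so that cell stays NaN.
df.loc['y'] = s

# Positional: the raw values go into the columns from position 1 onward.
df.iloc[2, 1:] = s.to_numpy()

print(df)
```

The label-aligned form keeps working if the Series is in a different order than the columns; the positional form does not check labels at all.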
I'm having some problems iteratively filling a pandas DataFrame with two different types of values. As a simple example, please consider the following initialization:
IN:
df = pd.DataFrame(data=np.nan,
index=range(5),
columns=['date', 'price'])
df
OUT:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
When I try to fill one row of the DataFrame, it won't adjust the value in the date column. Example:
IN:
df.iloc[0]['date'] = '2022-05-06'
df.iloc[0]['price'] = 100
df
OUT:
date price
0 NaN 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I suspect it has something to do with the fact that the default np.nan value cannot be replaced by a str value, but I'm not sure how to solve it. Please note that changing the date column's type to str does not seem to make a difference.
This doesn't work because df.iloc[0] creates a temporary Series, which is what you update, not the original DataFrame.
If you need to mix positional and label indexing you can use:
df.loc[df.index[0], 'date'] = '2022-05-06'
df.loc[df.index[0], 'price'] = 100
output:
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
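If you prefer to stay purely positional, a single iloc call with a column position also avoids the temporary Series. This is a sketch under one assumption that is not in the question: the date column is cast to object first, so that writing a string into it does not depend on how a given pandas version handles a string in a float64 column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.nan, index=range(5), columns=['date', 'price'])

# Assumption: cast 'date' to object so it can hold strings.
df['date'] = df['date'].astype(object)

# One indexing call per assignment: (row position, column position).
df.iloc[0, df.columns.get_loc('date')] = '2022-05-06'
df.iloc[0, df.columns.get_loc('price')] = 100

print(df)
```

Here df.columns.get_loc translates a column label into its integer position, so row and column are both resolved in the same iloc call.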
Using loc as shown below may work better:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.nan,
index=range(5),
columns=['date', 'price'])
print(df)
df.loc[0, 'date'] = '2022-05-06'
df.loc[0, 'price'] = 100
print(df)
Output:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I'm trying to forward fill specific columns in a df where a column is equal to a specific value. Using the df below, I want to fill 'Code', 'Val1', 'Val2', 'Val3' where Code is equal to 'item'.
The following works fine on this dummy data but when I apply to my actual data it's returning an error:
ValueError: Location based indexing can only have [labels (MUST BE IN THE INDEX), slices of labels (BOTH endpoints included! Can be slices of integers if the index is integers), listlike of labels, boolean] types
The function only works on my dataset when I drop null values prior to executing the update function. However, this is pointless as the df won't be filled.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'X' : ['X',np.nan,np.nan,'Y',np.nan,'Z',np.nan,np.nan,np.nan],
'Val1' : ['B',np.nan,np.nan,'A',np.nan,'C',np.nan,np.nan,np.nan],
'Val2' : ['B',np.nan,np.nan,'A',np.nan,'C',np.nan,np.nan,np.nan],
'Val3' : ['A',np.nan,np.nan,'C',np.nan,'C',np.nan,np.nan,np.nan],
'Code' : ['No',np.nan,np.nan,'item',np.nan,'Held',np.nan,np.nan,np.nan],
})
# This function works for this dummy df
df.update(df.loc[df['Code'].str.contains('item').ffill(), ['Code','Val1','Val2','Val3']].ffill())
Intended output:
X Val1 Val2 Val3 Code
0 X B B A No
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 Y A A C item
4 NaN A A C item
5 Z C C C Held
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
I think this can do what you want... It is not very elegant, but, you get the idea:
cols = ['Val1', 'Val2', 'Val3', 'Code']
len_df = len(df)
# positions of the rows whose Code contains 'item' (assumes a default 0..n-1 index)
indexes = [i for i, x in enumerate(df['Code'].str.contains('item')) if x is True]
for i in indexes:
    item_row = df.loc[i, cols]
    j = i + 1
    # copy the item row into each following row until Code is no longer missing
    while j < len_df and pd.isna(df.loc[j, 'Code']):
        df.loc[j, cols] = item_row
        j += 1
Example (I modified your example a little):
Input:
X Val1 Val2 Val3 Code
0 X B B A No
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 Y A A C item
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 Z C C C item
7 NaN NaN NaN NaN NaN
8 K T P X Held
9 NaN NaN NaN NaN NaN
Result:
X Val1 Val2 Val3 Code
0 X B B A No
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 Y A A C item
4 NaN A A C item
5 NaN A A C item
6 Z C C C item
7 NaN C C C item
8 K T P X Held
9 NaN NaN NaN NaN NaN
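The same result can probably be reached without an explicit loop. The sketch below uses the modified input from the example above and builds the mask from the forward-filled Code column with na=False, so the mask never contains NaN. A NaN left in the mask is one plausible cause of the ValueError in the question, though that is an assumption, since the real data is not shown.

```python
import numpy as np
import pandas as pd

n = np.nan
df = pd.DataFrame({
    'X':    ['X', n, n, 'Y', n, n, 'Z', n, 'K', n],
    'Val1': ['B', n, n, 'A', n, n, 'C', n, 'T', n],
    'Val2': ['B', n, n, 'A', n, n, 'C', n, 'P', n],
    'Val3': ['A', n, n, 'C', n, n, 'C', n, 'X', n],
    'Code': ['No', n, n, 'item', n, n, 'item', n, 'Held', n],
})
cols = ['Val1', 'Val2', 'Val3', 'Code']

# Carry each Code label down over the missing rows below it, then test it.
# na=False turns any remaining NaN into False, giving a plain boolean mask.
mask = df['Code'].ffill().str.contains('item', na=False)

# Forward fill everything, but write back only the rows under an 'item' label.
filled = df[cols].ffill()
df.loc[mask, cols] = filled.loc[mask, cols]

print(df)
```

Rows that follow 'No' or 'Held' are left untouched because their forward-filled label does not match.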
I am trying to insert a list of data into a multi-level pandas dataframe.
It seems to work just fine, but when I view the entire dataframe, the new sub-row is not there.
Here is an example:
Create an empty multi-index dataframe:
ind = pd.MultiIndex.from_product([['A','B','C'], ['a', 'b','c']]) #set up index
df = pd.DataFrame(columns=['col1'], index=ind) #create empty df with multi-level nested index
print(df)
col1
A a NaN
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
Inserting a new column works fine:
newcol = 'col2' #new column name
df[newcol] = np.nan #fill new column with nans
print(df)
col1 col2
A a NaN NaN
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Inserting data into an existing sub-row works with point data but not with a list:
df[newcol]['A','a'] = 1 #works with point data but not with list
print(df)
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Inserting into new sub-row looks OK when viewing just the one column:
df[newcol]['A','d'] = [1,2,3] #insert into new sub-row 'd'
print(df[newcol]) #view just new column
A a 1
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
A d [1, 2, 3]
Name: col2, dtype: object
But it's not visible when viewing the entire dataframe - why?
print(df)
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Also, when I try different methods of inserting the data, I run into issues:
Using df.loc[] works perfectly for a single data point, but not for lists:
df.loc[('A','f'), newcol] = 1 #create new row at [(row,sub-row),column] & insert point data
print(df) #works fine
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
A f NaN 1.0
Same method but inserting a list returns an error:
df.loc[('A','f'), newcol] = [1,2,3] #create new row at [(row,sub-row),column] & insert list data
TypeError: object of type 'numpy.float64' has no len()
Using df.at[] returns error with both point and list data:
df.at[('A','f'), newcol] = [1,2,3] #insert into existing sub-row 'f'
KeyError: ('A', 'f')
When you do df[newcol]['A','d'] = [1,2,3], it is a chained-indexing assignment, so the result is unpredictable. Pandas doesn't guarantee correct behavior when you use chained indexing. When you run that command, pandas executes it with a warning. The warning even includes a link to the full explanation in case you want to know more. I won't go into detail because the page linked in the warning explains chained indexing very well.
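For scalar values, the usual way around chained indexing is a single .loc call that takes the row key and the column label together. A minimal sketch, assuming a frame with the same index as in the question (both columns are created up front here, and the value 2 is made up for illustration):

```python
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']])
df = pd.DataFrame({'col1': np.nan, 'col2': np.nan}, index=ind)

# One indexing call: the row key is a tuple, the column is a label.
df.loc[('A', 'a'), 'col2'] = 1

# Setting with enlargement: a row key that does not exist yet is appended.
df.loc[('A', 'd'), 'col2'] = 2

print(df)
```

Because the assignment goes through the DataFrame itself, the new row shows up when the whole frame is printed, not only in the single column.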
Assigning a list to a cell is always a pain, but it is doable. I guess the issue with df.loc[('A','f'), newcol] = [1,2,3] is that col2 has dtype float, so pandas doesn't treat [1,2,3] as a single list object. It treats [1,2,3] as a sequence of multiple numeric values, so the assignment fails. I don't know whether this is a bug or intentional.
To solve your issue with .loc, convert col2 to dtype object and then do the assignment:
df['col2'] = df['col2'].astype('O')
df.loc[('A','f'), 'col2'] = [1,2,3]
print(df)
Out[1911]:
col1 col2
A a NaN NaN
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
A f NaN [1, 2, 3]
print(df['col2'])
Out[1912]:
A a NaN
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
A f [1, 2, 3]
Name: col2, dtype: object
I'm trying to forward fill specific columns but only where a row is equal to a certain value. For instance, using the df below, I want to .ffill() Val1, Val2, Helper where rows in Helper = 'Forward'. Everything else should remain the same.
df = pd.DataFrame({
'Col' : ['X',np.nan,np.nan,'Y',np.nan,'Z',np.nan,np.nan,np.nan],
'Val1' : ['B',np.nan,np.nan,'A',np.nan,'C',np.nan,np.nan,np.nan],
'Val2' : ['A',np.nan,np.nan,'C',np.nan,'C',np.nan,np.nan,np.nan],
'Helper' : ['No',np.nan,np.nan,'Forward',np.nan,'Held',np.nan,np.nan,np.nan],
})
mask = df['Helper'].str.contains('Forward', na = True)
df.loc[mask, 'Val1'] = df['Val1']
df['Val1'] = df['Val1'].ffill()
df.loc[mask, 'Val1'] = np.nan
Intended Output:
Col Val1 Val2 Helper
0 X B A No
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 Y A C Forward
4 NaN A C Forward
5 Z C C Held
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
Try this
df.update(df.loc[df['Helper'].str.contains('Forward').ffill(), ['Val1','Val2','Helper']].ffill())
Output
print(df)
Col Val1 Val2 Helper
0 X B A No
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 Y A C Forward
4 NaN A C Forward
5 Z C C Held
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
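The one-liner above does three things at once, which can be easier to follow when split up. The sketch below uses a reduced frame (four rows, made up for illustration) and builds the mask with ffill().eq(...) instead of str.contains, an assumption chosen so that the mask holds only True and False:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Val1':   ['A', np.nan, 'C', np.nan],
    'Helper': ['Forward', np.nan, 'Held', np.nan],
})

# 1. Mark the rows that sit under a 'Forward' label.
mask = df['Helper'].ffill().eq('Forward')

# 2. Forward fill only inside that selection.
patch = df.loc[mask].ffill()

# 3. update() writes the non-NaN values of patch back, aligned on the index.
df.update(patch)

print(df)
```

DataFrame.update modifies df in place and ignores NaN values in patch, so rows outside the mask keep whatever they had.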
Create a mask after forward filling, then use it as the condition to fill the columns with np.where. The mask is reshaped into a column so that it broadcasts across the selected columns:
>>> m = df['Helper'].ffill().str.contains('Forward')
>>> cols = ['Val1', 'Val2', 'Helper']
>>> df[cols] = np.where(m.to_numpy()[:, None], df[cols].ffill(), df[cols])
>>> df
Col Val1 Val2 Helper
0 X B A No
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 Y A C Forward
4 NaN A C Forward
5 Z C C Held
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
This is to go further from the following thread:
How to do join of multiindex dataframe with a single index dataframe?
The multi-indices of df1 are sublevel indices of df2.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import itertools
In [4]: inner = ('a','b')
In [5]: outer = ((10,20), (1,2))
In [6]: cols = ('one','two','three','four')
In [7]: sngl = pd.DataFrame(np.random.randn(2,4), index=inner, columns=cols)
In [8]: index_tups = list(itertools.product(*(outer + (inner,))))
In [9]: index_mult = pd.MultiIndex.from_tuples(index_tups)
In [10]: mult = pd.DataFrame(index=index_mult, columns=cols)
In [11]: sngl
Out[11]:
one two three four
a 2.946876 -0.751171 2.306766 0.323146
b 0.192558 0.928031 1.230475 -0.256739
In [12]: mult
Out[12]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
In [13]: mult.ix[(10,1)] = sngl
In [14]: mult
Out[14]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
# the new dataframes
sng2=pd.concat([sng1,sng1],keys=['X','Y'])
mult2=pd.concat([mult,mult],keys=['X','Y'])
In [110]:
sng2
Out[110]:
one two three four
X a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
Y a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
In [121]: mult2
Out[121]:
one two three four
X 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
Y 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
The two index levels of sng2 correspond to the 1st and 4th index levels of mult2, for example ('X','a').
#DSM proposed a solution to work with a multiindex df2 and single index df1
mult[:] = sngl.loc[mult.index.get_level_values(2)].values
But DataFrame.index.get_level_values(2) only works for one level of the index.
It's not clear from the question which index levels the data frames share. I think you need to revise the set-up code, as it gives an error at the definition of sngl. Anyway, supposing mult shares the first and second levels with sngl, you can just drop level 2 from the index of mult and use the result to index into sngl:
mult[:] = sngl.loc[mult.index.droplevel(2)].values
On a side note, you can construct a MultiIndex from a product directly using pd.MultiIndex.from_product rather than using itertools.
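A sketch of both ideas together, under stated assumptions: the frames are rebuilt with from_product and deterministic values (np.arange instead of random numbers), and for the four-level case the shared levels are taken to be positions 0 and 3, as the question describes. The use of reindex for the lookup is a choice made here, not something taken from the thread.

```python
import numpy as np
import pandas as pd

inner = ['a', 'b']
cols = ['one', 'two', 'three', 'four']
sngl = pd.DataFrame(np.arange(8.0).reshape(2, 4), index=inner, columns=cols)

# Build the three-level index directly, without itertools.
index_mult = pd.MultiIndex.from_product([[10, 20], [1, 2], inner])
mult = pd.DataFrame(np.nan, index=index_mult, columns=cols)

# Three-level case: look each row up in sngl by its innermost level.
mult[:] = sngl.loc[mult.index.get_level_values(2)].to_numpy()

# Four-level case: sng2 is keyed by (outer key, inner key).
sng2 = pd.concat([sngl, sngl + 100], keys=['X', 'Y'])
index_mult2 = pd.MultiIndex.from_product([['X', 'Y'], [10, 20], [1, 2], inner])
mult2 = pd.DataFrame(np.nan, index=index_mult2, columns=cols)

# Drop the two middle levels so the remaining keys match sng2's index,
# then reindex sng2 by those (repeated) keys and copy the values over.
keys = mult2.index.droplevel([1, 2])
mult2[:] = sng2.reindex(keys).to_numpy()

print(mult2)
```

Copying with to_numpy() sidesteps index alignment, so it relies on the looked-up rows being in the same order as the rows of the target frame, which holds here because the lookup keys are derived from the target's own index.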