The simple task of adding a row to a pandas.DataFrame object seems to be hard to accomplish. There are 3 Stack Overflow questions relating to this, none of which gives a working answer.
Here is what I'm trying to do. I have a DataFrame of which I already know the shape as well as the names of the rows and columns.
>>> df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
>>> df
a b c d
x NaN NaN NaN NaN
y NaN NaN NaN NaN
z NaN NaN NaN NaN
Now, I have a function to compute the values of the rows iteratively. How can I fill in one of the rows with either a dictionary or a pandas.Series? Here are various attempts that have failed:
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df['y'] = y
AssertionError: Length of values does not match length of index
Apparently it tried to add a column instead of a row.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.join(y)
AttributeError: 'builtin_function_or_method' object has no attribute 'is_unique'
Very uninformative error message.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.set_value(index='y', value=y)
TypeError: set_value() takes exactly 4 arguments (3 given)
Apparently that is only for setting individual values in the dataframe.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.append(y)
Exception: Can only append a Series if ignore_index=True
Well, I don't want to ignore the index, otherwise here is the result:
>>> df.append(y, ignore_index=True)
a b c d
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 1 5 2 3
It did align the column names with the values, but lost the row labels.
>>> y = {'a':1, 'b':5, 'c':2, 'd':3}
>>> df.ix['y'] = y
>>> df
                                   a                                 b                                 c                                 d
x                                NaN                               NaN                               NaN                               NaN
y  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z                                NaN                               NaN                               NaN                               NaN
That also failed miserably.
So how do you do it?
df['y'] will set a column.
Since you want to set a row, use .loc.
Note that .ix is equivalent here; yours failed because you tried to assign a dictionary to each element of the row 'y', which is probably not what you want. Converting to a Series tells pandas that you want to align the input (for example, you then don't have to specify all of the elements):
In [6]: import pandas as pd
In [7]: df = pd.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
In [8]: df.loc['y'] = pd.Series({'a':1, 'b':5, 'c':2, 'd':3})
In [9]: df
Out[9]:
a b c d
x NaN NaN NaN NaN
y 1 5 2 3
z NaN NaN NaN NaN
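Because the Series aligns on the column labels, you don't even have to supply every column. A small sketch (the values here are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], index=['x', 'y', 'z'])

# A partial Series is fine: assignment aligns on column labels,
# so 'b' and 'c' simply stay NaN.
df.loc['y'] = pd.Series({'a': 1, 'd': 3})

print(df)
```

Columns you leave out of the Series remain untouched, which is handy when the row is computed piecemeal.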
Update: because append has been deprecated (and removed in pandas 2.0), use concat instead:
df = pd.DataFrame(columns=["firstname", "lastname"])
entry = pd.DataFrame.from_dict({
    "firstname": ["John"],
    "lastname": ["Johny"]
})
df = pd.concat([df, entry], ignore_index=True)
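If you want to keep a row label rather than ignore the index, one option is to wrap the new row in a one-row DataFrame that carries the desired index, then concat without ignore_index. A sketch with illustrative values:

```python
import pandas as pd

# An existing frame with one labelled row (values are illustrative).
df = pd.DataFrame([[0, 0, 0, 0]], columns=['a', 'b', 'c', 'd'], index=['x'])

# Wrap the new row in a one-row DataFrame carrying the label 'y';
# concat then preserves that label.
row = pd.DataFrame({'a': 1, 'b': 5, 'c': 2, 'd': 3}, index=['y'])
df = pd.concat([df, row])

print(df)
```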
This is a simpler version:
import pandas as pd

df = pd.DataFrame(columns=('col1', 'col2', 'col3'))
for i in range(5):
    df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']
If your input rows are lists rather than dictionaries, then the following is a simple solution:
import pandas as pd
list_of_lists = []
list_of_lists.append([1,2,3])
list_of_lists.append([4,5,6])
pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'])
# A B C
# 0 1 2 3
# 1 4 5 6
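The same grow-a-list-then-construct pattern works when the rows arrive as dictionaries, in which case the keys become the column names. A small sketch with made-up values:

```python
import pandas as pd

rows = []
rows.append({'A': 1, 'B': 2, 'C': 3})
rows.append({'A': 4, 'B': 5, 'C': 6})

# One construction at the end is much cheaper than growing a
# DataFrame row by row inside the loop.
df = pd.DataFrame(rows)

print(df)
```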
The logic behind the code is quite simple and straightforward:
Make a df with 1 row using the dictionary.
Then create a df of shape (1, 4) that contains only NaN and has the same columns as the dictionary keys.
Then concatenate a NaN df, the dict df, and another NaN df:
import pandas as pd
import numpy as np
raw_datav = {'a':1, 'b':5, 'c':2, 'd':3}
datav_df = pd.DataFrame(raw_datav, index=[0])
nan_df = pd.DataFrame([[np.nan]*4], columns=raw_datav.keys())
df = pd.concat([nan_df, datav_df, nan_df], ignore_index=True)
df.index = ["x", "y", "z"]
print(df)
gives
a b c d
x NaN NaN NaN NaN
y 1.0 5.0 2.0 3.0
z NaN NaN NaN NaN
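For what it's worth, a shorter route to the same result, assuming the full index is known up front, is to build the one-row frame with its label and then reindex (a sketch, not part of the original answer):

```python
import pandas as pd

raw_datav = {'a': 1, 'b': 5, 'c': 2, 'd': 3}

# Scalar dict values broadcast against the index, giving one row
# labelled 'y'; reindex then inserts all-NaN rows for 'x' and 'z'.
df = pd.DataFrame(raw_datav, index=['y']).reindex(['x', 'y', 'z'])

print(df)
```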
Related
I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
    temp = subset(df2, df2$id == df1$case[i], select = df1$feature[i])
    df1$feature_value[i] = temp[, df1$feature[i]]
}
My code in python is as follows.
for i in range(0, len(df1)):
    temp = np.where(df1['case'].iloc[i] == df2['id']), df1['feature'].iloc[i]
    df1['feature_value'].iloc[i] = temp[:, df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value

df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
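A hedged sketch of that merge-based alternative, reusing the sample frames from the question (the `where`-based spelling is just one way to pick out the matched rows):

```python
import pandas as pd

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})

# A left merge keeps every df1 row; 'id' is NaN where no match exists.
merged = df1.merge(df2[['id']], how='left', left_on='case', right_on='id')

# Keep the feature only where a match was found, NaN elsewhere.
df1['feature_value'] = merged['feature'].where(merged['id'].notna())

print(df1)
```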
===================
To address the follow-up question in the comments:
You can do this by merging the databases and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
   case  feature names   id   age     a     b     c  feature_value  feature_extended_value
0     1        3     a  1.0  45.0  30.0  40.0  50.0            3.0                    30.0
1     2        4     b  NaN   NaN   NaN   NaN   NaN            NaN                     NaN
2     3        5     c  3.0  63.0  31.0  41.0  51.0            5.0                    51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
I have a list of dicts which is being converted to a dataframe. When I attempt to pass the columns argument, the output values are all NaN.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
Because if you pass a list of dictionaries, the new column names in the DataFrame constructor are created from the dictionary keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass a columns parameter containing values that do not exist among the dictionary keys, the columns are filtered to the ones you list, and for the non-existent values, columns of missing values are created, in the order given by the list of column names:
#changed order works, because the keys a and b each appear in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
#a is kept, d is filled with missing values - the key d does not appear in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
#b is kept, c is filled with missing values - the key c does not appear in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
#a and b are kept, c and d are filled with missing values - the keys c and d do not appear in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want other column names, you need to rename them or set new ones, as in your second code.
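The rename can also be chained onto the constructor in one step. A sketch (the target names 'c' and 'd' are the ones from the question):

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# Build from the dict keys first, then rename to the desired labels;
# the values stay aligned because renaming happens after construction.
df = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})

print(df)
```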
Say I have a dictionary with keys: arrays, such as:
In[0]: arrs = {
...: 'a': np.array([1, 2, 3]),
...: 'b': np.array([4, 5, 6])
}
And a pandas DataFrame whose index contains these keys:
In[1]: df = pd.DataFrame(index=list('abc'), columns = list('def'))
...: df
Out[1]:
d e f
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
I would like to populate the DataFrame with the values from the array dictionary.
This works:
In[2]: for idx in ['a', 'b']:
...: df.loc[idx, :] = arrs[idx]
...: df
Out[2]:
d e f
a 1 2 3
b 4 5 6
c NaN NaN NaN
Which is fine, but I would like to vectorize the operation. I tried what I thought would work:
In[3]: df.loc[('a', 'b'), :] = df.loc[('a', 'b'), :].index.map(lambda x: arrs[x])
But this results in a ValueError:
ValueError: could not broadcast input array from shape (2) into shape (2,3)
Why is my mapping only counting the number of arrays, and not actually seeing the shape of the arrays?
Use the DataFrame constructor on your dictionary, then update the first DataFrame.
import pandas as pd
df.update(pd.DataFrame.from_dict(arrs, orient='index', columns=['d', 'e', 'f']))
Output: df
d e f
a 1 2 3
b 4 5 6
c NaN NaN NaN
I have a pandas data frame and created a dictionary based on columns of the data frame. The dictionary is almost well generated, but the only problem is that I try to filter out the NaN values, and my code doesn't work, so there are NaN keys in the dictionary. My code is the following:
for key, row in mr.iterrows():
    # With this line I try to filter out the NaN values but it doesn't work
    if pd.notnull(row['Company nameC']) and pd.notnull(row['Company nameA']) and pd.notnull(row['NEW ID']):
        newppmr[row['NEW ID']] = row['Company nameC']
The output is:
defaultdict(<type 'list'>, {nan: '1347 PROPERTY INS HLDGS INC', 1.0: 'AFLAC INC', 2.0: 'AGCO CORP', 3.0: 'AGL RESOURCES INC', 4.0: 'INVESCO LTD', 5.0: 'AK STEEL HOLDING CORP', 6.0: 'AMN HEALTHCARE SERVICES INC', nan: 'FOREVERGREEN WORLDWIDE CORP'
So, I don't know how to filer out the nan values and what's wrong with my code.
EDIT:
An example of my pandas data frames is:
CUSIP Company nameA AÑO NEW ID Company nameC
42020 98912M201 NaN NaN NaN ZAP
42021 989063102 NaN NaN NaN ZAP.COM CORP
42022 98919T100 NaN NaN NaN ZAZA ENERGY CORP
42023 98876R303 NaN NaN NaN ZBB ENERGY CORP
Pasting an example - how to remove NaN keys from your dictionary:
Let's create a dict with NaN keys:
>>> a = float("nan")
>>> b = float("nan")
>>> d = {a: 1, b: 2, 'c': 3}
>>> d
{nan: 1, nan: 2, 'c': 3}
Now, let's remove all NaN keys:
>>> from math import isnan
>>> c = dict((k, v) for k, v in d.items() if not (type(k) == float and isnan(k)))
>>> c
{'c': 3}
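Applied to the original question, the filtering can also happen before the loop with dropna, so no NaN key is ever inserted. A sketch using a made-up `mr` frame that mirrors the question's column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same columns as in the question.
mr = pd.DataFrame({
    'NEW ID': [1.0, np.nan, 2.0],
    'Company nameA': ['A1', 'A2', np.nan],
    'Company nameC': ['AFLAC INC', 'ZAP', 'AGCO CORP'],
})

newppmr = {}
# Dropping rows with NaN in any required column up front means the
# loop body never sees a NaN key or value.
for _, row in mr.dropna(subset=['NEW ID', 'Company nameA', 'Company nameC']).iterrows():
    newppmr[row['NEW ID']] = row['Company nameC']

print(newppmr)
```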
Here is another scenario that works fine. Maybe I'm missing something?
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'a':[1,2,3,4,np.nan],'b':[np.nan,np.nan,np.nan,5,np.nan]})
In [4]: df
Out[4]:
a b
0 1 NaN
1 2 NaN
2 3 NaN
3 4 5
4 NaN NaN
In [5]: for key, row in df.iterrows(): print(pd.notnull(row['a']))
True
True
True
True
False
In [6]: for key, row in df.iterrows(): print(pd.notnull(row['b']))
False
False
False
True
False
In [7]: x = {}
In [8]: for key, row in df.iterrows():
....: if pd.notnull(row['b']) and pd.notnull(row['a']):
....: x[row['b']]=row['a']
....:
In [9]: x
Out[9]: {5.0: 4.0}
Given the following pandas data frame:
df = pd.DataFrame({'A': ['foo'] * 3 + ['bar'],
                   'B': ['w', 'x'] * 2,
                   'C': ['y', 'z', 'a', 'a'],
                   'D': numpy.random.randn(4),
                   })
print(df.to_string())
"""
A B C D
0 foo w y 0.06075020
1 foo x z 0.21112476
2 foo w a 0.01652757
3 bar x a 0.17718772
"""
Notice how there is no bar,w combination. When doing the following:
pv0 = pandas.pivot_table(df, rows=['A','B'],cols=['C'], aggfunc=numpy.sum)
pv0.ix['bar','x'] #returns result
pv0.ix['bar','w'] #key error though i would like it to return all Nan's
pv0.index #returns
[(bar, x), (foo, w), (foo, x)]
As long as there is at least one entry for a combination, as in the case of foo,x (it only has a value for 'z' in the 'C' column), it will return NaN for the other column values of 'C' not present for foo,x (e.g. 'a', 'y').
What I would like would be to have all multiindex combinations, even those that have no data for all column values.
pv0.index #I would like it to return
[(bar, w), (bar, x), (foo, w), (foo, x)]
I can wrap the .ix commands in try/except blocks, but is there a way that pandas can fill this in automatically?
You can use reindex() method:
>>> df1 = pd.pivot_table(df, rows=['A','B'], cols='C', aggfunc=np.sum)
>>> df1
D
C a y z
A B
bar x 0.161702 NaN NaN
foo w 0.749007 0.85552 NaN
x NaN NaN 0.458701
>>> import itertools
>>> index = list(itertools.product(df['A'].unique(), df['B'].unique()))
>>> df1.reindex(index)
D
C a y z
foo w 0.749007 0.85552 NaN
x NaN NaN 0.458701
bar w NaN NaN NaN
x 0.161702 NaN NaN
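Note that the rows=/cols= keywords above are from an old pandas release; modern pandas spells them index=/columns=, and the full index can also be built with MultiIndex.from_product instead of itertools. A sketch with illustrative D values:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo'] * 3 + ['bar'],
                   'B': ['w', 'x'] * 2,
                   'C': ['y', 'z', 'a', 'a'],
                   'D': [0.1, 0.2, 0.3, 0.4]})

# Modern spelling of the pivot: index=/columns= instead of rows=/cols=.
pv = pd.pivot_table(df, index=['A', 'B'], columns='C',
                    values='D', aggfunc='sum')

# from_product builds every (A, B) combination; reindex inserts
# all-NaN rows for the combinations with no data, e.g. (bar, w).
full = pd.MultiIndex.from_product([df['A'].unique(), df['B'].unique()],
                                  names=['A', 'B'])
pv = pv.reindex(full)

print(pv)
```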