I'm creating a pandas DataFrame object using the DataFrame constructor. My data is a dict of lists and categorical data Series objects. When I pass an index to the constructor, my categorical data series gets reset with NaN values. What's going on here? Thanks in advance!
Example:
import pandas as pd
import numpy as np
a = pd.Series(['a','b','c'],dtype="category")
b = pd.Series(['a','b','c'],dtype="object")
c = pd.Series(['a','b','cc'],dtype="object")
A = pd.DataFrame({'A':a,'B':[1,2,3]},index=["0","1","2"])
AA = pd.DataFrame({'A':a,'B':[1,2,3]})
B = pd.DataFrame({'A':b,'C':[4,5,6]})
print("DF A:")
print(A)
print("\nDF A, without specifying an index in the constructor:")
print(AA)
print("\nDF B:")
print(B)
This doesn't have anything to do with categories vs. object, it has to do with index alignment.
You're getting NaNs in A because you're telling the constructor you want an index of three strings. But a has an index of its own, consisting of the integers [0, 1, 2]. Since that doesn't match the index you've said you want, the data doesn't align, and so you get a DataFrame with the index you said you wanted and the NaNs highlight that the data is missing. By contrast, B is simply a list, and so there's no index to ignore, and accordingly it assumes the data is given in index-appropriate order.
This might be easier to see than to explain. Regardless of dtype, if the indices don't match, you get NaN:
In [147]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="category"),'B':[1,2,3]},
index=["0","1","2"])
Out[147]:
A B
0 NaN 1
1 NaN 2
2 NaN 3
In [148]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=["0","1","2"])
Out[148]:
A B
0 NaN 1
1 NaN 2
2 NaN 3
If you use a fully-matching index, it works:
In [149]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=[0,1,2])
Out[149]:
A B
0 a 1
1 b 2
2 c 3
And if you use a partially-matching index, you'll get values where the indices align and NaN where they don't:
In [150]: pd.DataFrame({'A':pd.Series(list("abc"), dtype="object"),'B':[1,2,3]},
index=[0,1,10])
Out[150]:
A B
0 a 1
1 b 2
10 NaN 3
Related
I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
results
The first 3 rows was nan but now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks ti7 helped me solved the problem:
new_agent = hotel['agent'].dropna() #get a series of just the
available values
n_null = hotel['agent'].isnull().sum() #length of the missing entries
new_agent.sample(n_null,replace=True).values #sample it with
repetition and get values
hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values
#fill and replace
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7
I have a huge dataframe which has values and blanks/NA's in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider below sample dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
0 1 2 3
0 1.857476 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.584926 0.527973
3 -0.243182 1.403478 2.278294
4 1.574097
I want the NaN to be removed and the next value to move up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result but is there any better way to get it done.
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
And then if need replace to empty space, what create mixed values - strings with numeric - some functions can be broken:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
0 1 2 3
0 -1.74977 0.514219 1.15304 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.02973 0.672721 0.435163
3 -0.53128 -0.438136 -0.104411
4 -1.118318
A numpy approach
The idea is to sort the columns by np.isnan so that np.nans are put last. I use kind='mergesort' to preserve the order within non np.nan. Finally, I slice the array and reassign it. I follow this up with a fillna
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
0 1 2 3
0 1.85748 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.58493 0.527973
3 -0.243182 1.40348 2.278294
4 1.574097
If you didn't want to alter the dataframe in place
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpys quickness
naive time test
Adding on to solution by piRSquared:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
what happens here is that, v is m x n, and I have made both a and b m x n, and so what we are doing is, pairing up every entry i,j in a and b to get the element at row with value of element at i,j in a and column with value of element at i,j, in b. So if we have a and b both look like the matrix above, then v[a,b] returns a matrix where the first row contains n copies of v[0][0], second row contains n copies of v[1][1] and so on.
In solution piRSquared, his i is a list not a matrix. So the list is used for v.shape[0] times, aka once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind #jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column-by-column, treating each column as a series. Using df.dropna() removes NaNs but doesn't change the index of the remaining numbers, so when this column is added back to the dataframe the numbers go back to their original positions as their indices are still the same, and the empty spaces are filled with NaN, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
(drop = True) for reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on #jezrael's answer but my rep isn't high enough!
Normally when I index a DataFrame (or a Series) with a list of integer indices, I get back a subset of the rows, unless some of my indices are out of bounds, in which case I get an IndexError:
s = pd.Series(range(4))
0 0
1 1
2 2
3 3
s.iloc[[1,3]]
1 1
3 3
s.iloc[[1,3,5]]
IndexError
But I'd like to get back a DataFrame (or Series) having an index identical to the list I queried with (i.e., parallel to the query list), with (the rows corresponding to) any out-of-bounds indices filled in with NaN :
s.something[[1,3,5]]
1 1
3 3
5 NaN
I don't think join tricks work because those want to operate on the DataFrame index (or columns). As far as I can tell there's not even an "iget" integer-based get method if I wanted to manually loop over the indices myself. That leaves something like:
indices = [1,3,5]
pd.Series([s.iloc[i] if 0 <= i < len(s) else np.nan for i in indices], index=indices)
Is that the best Pandas 0.18 can do?
You can use reindex to achieve this:
In [119]:
s.reindex([1,3,5])
Out[119]:
1 1
3 3
5 NaN
dtype: float64
This will use the passed index and return existing values or NaN
Thanks to #EdChum for inspiration, the general solution is:
s.reset_index(drop=True).reindex([1,3,5])
I want to solve a problem that essentially boils down to this:
I have identifier numbers (thousands of them) and each should be uniquely linked to a set of letters. Let's call them a through e. These can be filled from another column (y) if that helps.
Ocassionally one of the letters is missing and is registered as NAN. How can I replace such that I get all the required numbers.
Idnumber X y
1 a a
2 a a
1 b b
1 NaN d
2 b NaN
1 d c
2 c NaN
1 NaN e
2 d d
2 e e
Any given x can be missing.
The dataset it too big to simply add all posibilities and drop dupplicates.
The idea is to get:
Idnumber X
1 a
2 a
1 b
1 c
2 b
1 d
2 c
1 e
2 d
2 e
The main issue is getting a unique solution. So making sure that I replace one NaN by c and one by e.
Is this what you're looking for? Or does this use too much RAM? If it does use too much RAM, you can use the chunksize parameter in read_csv. Then write results (with duplicates and nans dropped) for each individual chunk to csv, then load those and drop duplicates again - this time just dropping duplicates that conflict across chunks.
#Loading Dataframe
from StringIO import StringIO
x=StringIO('''Idnumber,X,y
1,a,a
2,a,a
1,b,b
1,NaN,d
2,b,NaN
1,d,c
2,c,NaN
1,NaN,e
2,d,d
2,e,e''')
#Operations on Dataframe
df = pd.read_csv(x)
df1 = df[['Idnumber','X']]
df2 = df[['Idnumber','y']]
df2.rename(columns={'y': 'X'}, inplace=True)
pd.concat([df1,df2]).dropna().drop_duplicates()
I'm trying to select every entry in a pandas DataFrame D, correspoding to some certain userid, filling missing etime values with zeros as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is, for some ids, there exists exactly one entry, and thus .loc() method is returning a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seiously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: A have to apologize for unintengible formulation, this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can possibly have up to some number N corresponding dataframe rows (columns are: 'etime','requested','rejected') for which 'etime' is basically the key. For some 'userid', there exist all of the N corresponding entries, but for the most of them, there are missing entries for some 'etime'.
My intensions are: for every 'userid' construct an explicit DataFrame object, containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing index to 'etime' and then reindexing selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which.loc() subsetting returns not a dataframe with one row indexed by 'userid' but a series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing index fails. Checking dimensions and index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
full_etime_range = range(10)
df = DataFrame(index=[0,0,1],
columns=['etime','requested'],
data=[[0,1],[1,1],[1,1]])
for i in df.index:
tmp = df.loc[i]
tmp.index = tmp['etime']
tmp = tmp.reindex(full_etime_range,fill_value = 0)
print tmp
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
requested
etime
0 0 1
1 1
1 1 1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(1).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
Are you just trying to fill nas? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries.fillna(0)
Should do the trick. But if you are willing to fillna just for the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_extries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve