Pandas drop method behaves inconsistently dropping a NaN header - python

I've run into an issue trying to drop a nan column from a table.
Here's the example that works as expected:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
columns=['A', 'B', 'C'],
index=['Foo', 'Bar'])
mapping1 = pd.DataFrame([['a', 'x'], ['b', 'y']],
index=['A', 'B'],
columns=['Test', 'Control'])
# rename the columns using the mapping file
df1.columns = mapping1.loc[df1.columns, 'Test']
From here we see that the C column in df1 doesn't have an entry in the mapping file, and so that header is replaced with a nan.
# drop the nan column
df1.drop(np.nan, axis=1)
In this situation, calling np.nan finds the final header and drops it.
However, in the situation below, the df.drop does not work:
# set up table
sample1 = np.random.randint(0, 10, size=3)
sample2 = np.random.randint(0, 5, size=3)
df2 = pd.DataFrame([sample1, sample2],
index=['sample1', 'sample2'],
columns=range(3))
mapping2 = pd.DataFrame(['foo']*2, index=range(2),
columns=['test'])
# assign columns using mapping file
df2.columns = mapping2.loc[df2.columns, 'test']
# try and drop the nan column
df2.drop(np.nan, axis=1)
And the nan column remains.

This may be an answer (from https://stackoverflow.com/a/16629125/5717589):
When index is unique, pandas use a hashtable to map key to value.
When index is non-unique and sorted, pandas use binary search,
when index is random ordered pandas need to check all the keys in the
index.
So, if entries are unique, np.nan gets hashed I think. In a non-unique cases, pandas compares values, but:
np.nan == np.nan
Out[1]: False
Update
I guess it's impossible to access a NaN column by label. But it's doable by index position. Here is a workaround for dropping columns with null labels:
notnull_col_idx = np.arange(len(df.columns))[~pd.isnull(df.columns)]
df = df.iloc[:, notnull_col_idx]

Hmmm... this might be considered a bug but it seems like this problem occurs if your columns are labeled with the same label, in this case as foo. If I switch up the labels, the issue disappears:
mapping2 = pd.DataFrame(['foo','boo'], index=range(2),
columns=['test'])
I also attempted to call the columns by their index positions and the problem still occurs:
# try and drop the nan column
df2.drop(df2.columns[[2]], axis=1)
Out[176]:
test foo foo nan
sample1 4 4 4
sample2 4 0 1
But after altering the 2nd column label to something other than foo, the problem resolves itself. My best piece of advice is to have unique column labels.
Additional info:
So this also occurs when there are multiple nan columns as well...

Related

Trying to overwrite subset of pandas dataframe but get empty values

I have dataset, df with some empty values in second column col2.
so I create a new table with same column names and the lenght is equal to number of missings in col2 for df. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this will return nan for the entire rows where col2 was missing. which means that df[df['col1'].isna()] is now missins everywhere and not only in col2.
Why is that and how Can I fix that?
Assuming that by df2 you really meant a Series, so renaming as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it is relying on knowing that the number of NaNs in df is the same length as s. It would be better to know how you create the missing values. With that information, we could probably propose a better and more robust solution.

Insert NA in specific row/column pandas

I want to place NA's in certain row/column positions.
df_test = pd.DataFrame(np.random.randn(5, 3),
index=['a', 'b', 'c', 'd', 'e'],
columns=['one', 'two', 'three'])
rows = pd.Series([True, False, False, False, True], index = df_test.index)
I want to add in the NA's for the rows specified and without column 'two'. I tried this:
df_test[rows].drop(['two'], axis = 1) = np.nan
But this returns error:
SyntaxError: can't assign to function call
This is not going to work, Python simply does not support this kind of syntax, i.e., assigning to function calls. Furthermore, drop returns a copy, so dropping the column and operating on the returned DataFrame does not modify the original.
Below are a couple of alternatives you may work with.
loc + pd.Index.difference
Here, you'll want loc based assignment:
df_test.loc[rows, df_test.columns.difference(['two'])] = np.nan
df_test
one two three
a NaN 0.205799 NaN
b 0.296389 -0.508247 0.026844
c 0.970879 -0.549491 -0.056991
d -1.474168 -1.694579 1.493165
e NaN -0.159641 NaN
loc works in-place, modifying the original DataFrame as you want. You can also replace df_test.columns.difference(['two']) with ['one', 'three'] if you so please.
df.set_value
For older pandas versions, you can use df.set_value (not in-place)โ€”
df_test.set_value(df_test.index[rows], df_test.columns.difference(['two']), np.nan)
one two three
a NaN 1.562233 NaN
b -0.755127 -0.862368 -0.850386
c -0.193353 -0.033097 1.005041
d -1.679028 1.006895 -0.206164
e NaN -1.376300 NaN

Python Pandas dividing columns by column [duplicate]

I have a DataFrame (df1) with a dimension 2000 rows x 500 columns (excluding the index) for which I want to divide each row by another DataFrame (df2) with dimension 1 rows X 500 columns. Both have the same column headers. I tried:
df.divide(df2) and
df.divide(df2, axis='index') and multiple other solutions and I always get a df with nan values in every cell. What argument am I missing in the function df.divide?
In df.divide(df2, axis='index'), you need to provide the axis/row of df2 (ex. df2.iloc[0]).
import pandas as pd
data1 = {"a":[1.,3.,5.,2.],
"b":[4.,8.,3.,7.],
"c":[5.,45.,67.,34]}
data2 = {"a":[4.],
"b":[2.],
"c":[11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
or you can use df1/df2.values[0,:]
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
A B
0 0.2 0.2
1 NaN NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
A B
0 0.2 0.2
1 0.6 0.4
Small clarification just in case: the reason why you got NaN everywhere while Andy's first example (df.div(df2)) works for the first line is div tries to match indexes (and columns). In Andy's example, index 0 is found in both dataframes, so the division is made, not index 1 so a line of NaN is added. This behavior should appear even more obvious if you run the following (only the 't' line is divided):
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice the first row, you get a series, the index of which is composed by the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'], columns = range(5))
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'], columns = [1,2,3,4,5])
df_a.div(df_b)
If you want to divide each row of a column with a specific value you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this code divided each row of 'column_name' with 10,000.
to divide a row (with single or multiple columns), we need to do the below:
df.loc['index_value'] = df.loc['index_value'].div(10000)

Is there a way to copy only the structure (not the data) of a Pandas DataFrame?

I received a DataFrame from somewhere and want to create another DataFrame with the same number and names of columns and rows (indexes). For example, suppose that the original data frame was created as
import pandas as pd
df1 = pd.DataFrame([[11,12],[21,22]], columns=['c1','c2'], index=['i1','i2'])
I copied the structure by explicitly defining the columns and names:
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
I don't want to copy the data, otherwise I could just write df2 = df1.copy(). In other words, after df2 being created it must contain only NaN elements:
In [1]: df1
Out[1]:
c1 c2
i1 11 12
i2 21 22
In [2]: df2
Out[2]:
c1 c2
i1 NaN NaN
i2 NaN NaN
Is there a more idiomatic way of doing it?
That's a job for reindex_like. Start with the original:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
Construct an empty DataFrame and reindex it like df1:
pd.DataFrame().reindex_like(df1)
Out:
c1 c2
i1 NaN NaN
i2 NaN NaN
In version 0.18 of pandas, the DataFrame constructor has no options for creating a dataframe like another dataframe with NaN instead of the values.
The code you use df2 = pd.DataFrame(columns=df1.columns, index=df1.index) is the most logical way, the only way to improve on it is to spell out even more what you are doing is to add data=None, so that other coders directly see that you intentionally leave out the data from this new DataFrame you are creating.
TLDR: So my suggestion is:
Explicit is better than implicit
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
Very much like yours, but more spelled out.
Not exactly answering this question, but a similar one for people coming here via a search engine
My case was creating a copy of the data frame without data and without index. One can achieve this by doing the following. This will maintain the dtypes of the columns.
empty_copy = df.drop(df.index)
Let's start with some sample data
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
...: columns=['num', 'char'])
In [3]: df
Out[3]:
num char
0 1 a
1 2 b
2 3 c
In [4]: df.dtypes
Out[4]:
num int64
char object
dtype: object
Now let's use a simple DataFrame initialization using the columns of the original DataFrame but providing no data:
In [5]: empty_copy_1 = pd.DataFrame(data=None, columns=df.columns)
In [6]: empty_copy_1
Out[6]:
Empty DataFrame
Columns: [num, char]
Index: []
In [7]: empty_copy_1.dtypes
Out[7]:
num object
char object
dtype: object
As you can see, the column data types are not the same as in our original DataFrame.
So, if you want to preserve the column dtype...
If you want to preserve the column data types you need to construct the DataFrame one Series at a time
In [8]: empty_copy_2 = pd.DataFrame.from_items([
...: (name, pd.Series(data=None, dtype=series.dtype))
...: for name, series in df.iteritems()])
In [9]: empty_copy_2
Out[9]:
Empty DataFrame
Columns: [num, char]
Index: []
In [10]: empty_copy_2.dtypes
Out[10]:
num int64
char object
dtype: object
A simple alternative -- first copy the basic structure or indexes and columns with datatype from the original dataframe (df1) into df2
df2 = df1.iloc[0:0]
Then fill your dataframe with empty rows -- pseudocode that will need to be adapted to better match your actual structure:
s = pd.Series([Nan,Nan,Nan], index=['Col1', 'Col2', 'Col3'])
loop through the rows in df1
df2 = df2.append(s)
To preserve column type you can use the astype method,
like pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
import pandas as pd
df1 = pd.DataFrame(
[
[11, 12, 'Alice'],
[21, 22, 'Bob']
],
columns=['c1', 'c2', 'c3'],
index=['i1', 'i2']
)
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
print(df2.shape)
print(df2.dtypes)
output:
(0, 3)
c1 int64
c2 int64
c3 object
dtype: object
Working example
You can simply mask by notna() i.e
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.mask(df1.notna())
c1 c2
i1 NaN NaN
i2 NaN NaN
A simple way to copy df structure into df2 is:
df2 = pd.DataFrame(columns=df.columns)
This has worked for me in pandas 0.22:
df2 = pd.DataFrame(index=df.index.delete(slice(None)), columns=df.columns)
Convert types:
df2 = df2.astype(df.dtypes)
delete(slice(None))
In case you do not want to keep the values โ€‹โ€‹of the indexes.
I know this is an old question, but I thought I would add my two cents.
def df_cols_like(df):
"""
Returns an empty data frame with the same column names and types as df
"""
df2 = pd.DataFrame({i[0]: pd.Series(dtype=i[1])
for i in df.dtypes.iteritems()},
columns=df.dtypes.index)
return df2
This approach centers around the df.dtypes attribute of the input data frame, df, which is a pd.Series. A pd.DataFrame is constructed from a dictionary of empty pd.Series objects named using the input column names with the column order being taken from the input df.

Python: Divide each row of a DataFrame by another DataFrame vector

I have a DataFrame (df1) with a dimension 2000 rows x 500 columns (excluding the index) for which I want to divide each row by another DataFrame (df2) with dimension 1 rows X 500 columns. Both have the same column headers. I tried:
df.divide(df2) and
df.divide(df2, axis='index') and multiple other solutions and I always get a df with nan values in every cell. What argument am I missing in the function df.divide?
In df.divide(df2, axis='index'), you need to provide the axis/row of df2 (ex. df2.iloc[0]).
import pandas as pd
data1 = {"a":[1.,3.,5.,2.],
"b":[4.,8.,3.,7.],
"c":[5.,45.,67.,34]}
data2 = {"a":[4.],
"b":[2.],
"c":[11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
or you can use df1/df2.values[0,:]
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
A B
0 0.2 0.2
1 NaN NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
A B
0 0.2 0.2
1 0.6 0.4
Small clarification just in case: the reason why you got NaN everywhere while Andy's first example (df.div(df2)) works for the first line is div tries to match indexes (and columns). In Andy's example, index 0 is found in both dataframes, so the division is made, not index 1 so a line of NaN is added. This behavior should appear even more obvious if you run the following (only the 't' line is divided):
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice the first row, you get a series, the index of which is composed by the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'], columns = range(5))
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'], columns = [1,2,3,4,5])
df_a.div(df_b)
If you want to divide each row of a column with a specific value you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this code divided each row of 'column_name' with 10,000.
to divide a row (with single or multiple columns), we need to do the below:
df.loc['index_value'] = df.loc['index_value'].div(10000)

Categories