import pandas as pd
import numpy as np
df = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', np.nan])
df.pop(np.nan)
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [nan] of <class 'float'>
I tried doing
df.reset_index().dropna().set_index('index')
But then when I do df.pop('a') it gives me an error.
If s is a pandas Series, then s.reset_index() returns a DataFrame with the index of the Series as one of its columns (named index by default). Note that s.reset_index(drop=True) returns a Series, but discards the index.
One solution to your task is to select the one and only column named 0 from the DataFrame built by your last line:
# setup with the name "s" to represent a Series (keep "df" for DataFrames)
s = pd.Series([1,2,3,4], index=['a','b','c',np.nan])
res1 = s.reset_index().dropna().set_index('index')[0]
res1
index
a 1
b 2
c 3
Name: 0, dtype: int64
Another option is to drop null index labels by reindexing the Series:
res2 = s.loc[s.index.dropna()]
res2
a 1
b 2
c 3
dtype: int64
I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
       a      b
2  alpha  gamma
0   beta  alpha
1  gamma   beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
       a      b
2  alpha  alpha
0   beta   beta
1  gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
   num  num_times_two
2    1              6
0    2              2
1    3              4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you want to avoid dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to match the index of the DataFrame before assigning it to a column:
either with .set_axis() (by default this operation is not in place, so the original Series keeps its own index):
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index  # or ser.set_axis(df.index, inplace=True) on pandas < 2.0 (inplace was removed in 2.0)
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
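Both approaches from this answer can be sketched together on the example frame from the question; columns b and c end up identical to column a:

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
ser = pd.Series(['alpha', 'beta', 'gamma'])

# option 1: re-label the Series with the frame's index before assigning
df['b'] = ser.set_axis(df.index)

# option 2: strip the index entirely and assign positionally
df['c'] = ser.to_numpy()

print(df)
```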
I don't know if it is intentional, but the new column assignment is based on the index. Do you need to maintain the old indexes?
If the answer is no, you can simply reset the index before adding a new column (note that reset_index returns a new DataFrame, so reassign the result):
df = df.reset_index(drop=True)
In your example, I don't see any reason to make it a new Series at all; even something that strips the index, like converting to a list, works:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
num num_times_two
2 1 2
0 2 4
1 3 6
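To make the reset-index route concrete (reset_index returns a new frame, so the result has to be reassigned):

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])

# reassign: reset_index does not modify df in place
df = df.reset_index(drop=True)

# with a default RangeIndex, label alignment and position coincide
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
```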
I have a pandas dataframe with floating numbers in some rows and numpy datetime64 in another.
df2 = pd.DataFrame(
[[np.datetime64('2021-01-01'), np.datetime64('2021-01-01')], [2, 3]],
columns=['A', 'B'])
When I sum each row (axis=1), I get 0 for all rows:
df2.sum(axis=1)
0 0.0
1 0.0
dtype: float64
Why does this happen? I have tried the numeric_only=True option with the same result.
I would expect each row to be handled individually, and to get 5 as the result for the second row, as happens if I replace the datetime64 objects with strings:
df = pd.DataFrame(
{'A': ['2021-01-01', 2],
'B': ['2021-01-01', 3]})
print(df.sum(axis=1))
0 2021-01-012021-01-01
1 5
dtype: object
Thanks!
You can get something like what you're after if you make your rows into columns:
df2.transpose().sum(axis=0)
Your rows won't get coerced to a numerical dtype, i.e.
df2.loc[1]
results in
A 2
B 3
Name: 1, dtype: object
Rows and columns are not treated equally in pandas, rightly or wrongly.
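If you only need one row summed, another workaround (a sketch, not part of the answer above) is to coerce that row to a numeric dtype before summing:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [[np.datetime64('2021-01-01'), np.datetime64('2021-01-01')], [2, 3]],
    columns=['A', 'B'])

# the columns are object dtype, so coerce the selected row to numbers first
row_sum = pd.to_numeric(df2.loc[1]).sum()
print(row_sum)  # 5
```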
Let's say that I have a DataFrame df and a Series s like this:
>>> df = pd.DataFrame(np.random.randn(2,3), columns=["A", "B", "C"])
>>> df
A B C
0 -0.625816 0.793552 -1.519706
1 -0.955960 0.142163 0.847624
>>> s = pd.Series([1, 2, 3])
>>> s
0 1
1 2
2 3
dtype: int64
I'd like to add the values of s to each row in df. I guess I should use some apply with axis=1 or applymap but I can't figure out how (do I have to transpose at some point?).
Actually my problem is more complex than that: the final DataFrame will be composed of elements of the initial DataFrame that have been processed according to the values of two Series.
A possible solution is to add the 1d numpy array created from the Series, which prevents aligning the DataFrame's columns to the Series' index:
df = df + s.values
print (df)
          A         B         C
0  0.374184  2.793552  1.480294
1  0.044040  2.142163  3.847624
If the Series' index matches the DataFrame's column names, plain addition aligns correctly:
# index is the same as the column names
s = pd.Series([1, 2, 3], index=df.columns)
print (s)
A 1
B 2
C 3
dtype: int64
df = df + s
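Both forms can be checked side by side; with a frame of ones the result is easy to verify, and the two approaches agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=['A', 'B', 'C'])
s = pd.Series([1, 2, 3])

# positional add via the underlying array: no index alignment happens
res_pos = df + s.to_numpy()

# label-based add: works once the Series is indexed by the column names
res_lbl = df + s.set_axis(df.columns)

print(res_pos.equals(res_lbl))  # True
```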
I created a new column called 'order_num' for instance
import pandas
import numpy as np
import os
df = pandas.read_excel(os.getcwd() + r"/excel.xlsx", sheet_name=0, skiprows=0)
df['order_num']=np.nan
and I wanted to put some value to newly created column
df.set_value(index, 'order_num', 'somestr')
and this error message came up:
ValueError: could not convert string to float: 'somestr'
What is the problem? I guess the default dtype of a newly created column is float, and I want to change it to string.
How can I do it?
The problem is that you create a column of type float, because type(np.nan) returns float.
On a mock DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Column1': [0, 1, 2, 3, 4, 5],\
'Column2': ['a', 'b', 'c', 'd', 'e', 'f']})
If you create a new column and assign np.nan to it, the new column will be numeric:
df['numeric'] = np.nan
df['numeric'].dtype
Returns:
dtype('float64')
You could instead create a column with empty strings, i.e. '':
df['order_num'] = ''
Column1 Column2 order_num
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 5 f
And then add a specific string at a specific index in the column 'order_num':
index = 0
df = df.set_value(index, 'order_num', 'somestr')
This will give you the expected outcome:
Column1 Column2 order_num
0 0 a somestr
1 1 b
2 2 c
3 3 d
4 4 e
5 5 f
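On current pandas, set_value is gone (deprecated in 0.21, removed in 1.0); the same pattern with the .at scalar setter looks like this:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [0, 1, 2],
                   'Column2': ['a', 'b', 'c']})

# start from an object-dtype column of empty strings so strings fit
df['order_num'] = ''

# .at sets a single cell by label; no float coercion error occurs
df.at[0, 'order_num'] = 'somestr'
print(df['order_num'].tolist())  # ['somestr', '', '']
```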
Suppose the data at hand is in the following form:
import pandas as pd
df = pd.DataFrame({'A':[1,10,20], 'B':[4,40,50], 'C':[10,11,12]})
I can compute the minimum by row with:
df.min(axis=1)
which returns 1 10 12.
Instead of the values, I would like to create a pandas Series containing the column labels of the corresponding cells.
That is, I would like to get A A C.
Thanks for any suggestions.
You can use the idxmin(axis=1) method:
In [8]: df.min(axis=1)
Out[8]:
0 1
1 10
2 12
dtype: int64
In [9]: df.idxmin(axis=1)
Out[9]:
0 A
1 A
2 C
dtype: object
In [11]: df.idxmin(axis=1).values
Out[11]: array(['A', 'A', 'C'], dtype=object)
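The same labels can also be recovered with plain numpy via argmin, in case you want an array directly (a sketch equivalent to idxmin for frames without NaNs):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 10, 20], 'B': [4, 40, 50], 'C': [10, 11, 12]})

# argmin gives positional indices of each row's minimum; map them to labels
labels = df.columns[df.to_numpy().argmin(axis=1)]
print(list(labels))  # ['A', 'A', 'C']
```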