Boolean indexing to retain falsy values as NaN - python

Given a dataframe:
Data
1 246804
2 135272
3 898.01
4 3453.33
5 shine
6 add
7 522
8 Nan
9 string
10 29.11
11 20
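(For reference, a sketch of how this frame can be constructed; the index labels 1-11 and the values are taken from the listing above, and row 8 holds the literal string 'Nan', not a missing value.)
import pandas as pd

df = pd.DataFrame(
    {'Data': ['246804', '135272', '898.01', '3453.33', 'shine', 'add',
              '522', 'Nan', 'string', '29.11', '20']},
    index=range(1, 12),
)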
I would like two new columns Floats and Strings, both having the same length as the original dataframe. Getting the Floats column is easy:
In [176]: pd.to_numeric(df.Data, errors='coerce')
Out[176]:
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN
6 NaN
7 522.00
8 NaN
9 NaN
10 29.11
11 20.00
Name: Data, dtype: float64
As you can see, non-floats are coerced to NaN, which is exactly what I want.
To get strings, this is what I do:
In [177]: df[df.Data.str.isalpha()]
Out[177]:
Data
5 shine
6 add
8 Nan
9 string
But as you can see, it does not retain the non-string values as NaN. I want something like this:
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan (not NaN)
9 string
10 NaN
11 NaN
How can I get it to do so?

To get Strings, you can use boolean indexing on the Data column, locating where Floats is null.
df['Floats'] = pd.to_numeric(df.Data, errors='coerce')
df['Strings'] = df.Data.loc[df.Floats.isnull()] # Optional: .astype(str)
>>> df
# Output:
# Data Floats Strings
# 1 246804 246804.00 NaN
# 2 135272 135272.00 NaN
# 3 898.01 898.01 NaN
# 4 3453.33 3453.33 NaN
# 5 shine NaN shine
# 6 add NaN add
# 7 522 522.00 NaN
# 8 Nan NaN Nan
# 9 string NaN string
# 10 29.11 29.11 NaN
# 11 20 20.00 NaN

floats = pd.to_numeric(df.Data, errors='coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull())
))
floats strings
1 246804.00 NaN
2 135272.00 NaN
3 898.01 NaN
4 3453.33 NaN
5 NaN shine
6 NaN add
7 522.00 NaN
8 NaN Nan
9 NaN string
10 29.11 NaN
11 20.00 NaN
You can make it even more obvious by passing an alternative value to mask:
floats = pd.to_numeric(df.Data, errors='coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull(), '')
))
floats strings
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN shine
6 NaN add
7 522.00
8 NaN Nan
9 NaN string
10 29.11
11 20.00

How about
df.Data.where(pd.to_numeric(df.Data, errors='coerce').isnull())
Out[186]:
Data
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan #not NaN
9 string
10 NaN
11 NaN
Or, based on your df.Data.str.isalpha():
df['Data'].where(df['Data'].str.isalpha())
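Putting the pieces together, a sketch (using the same df and the where approach above) that builds both new columns in one statement:
floats = pd.to_numeric(df['Data'], errors='coerce')
# Strings keeps the original value only where the numeric conversion failed;
# everything that parsed as a float becomes NaN there.
out = df.assign(Floats=floats, Strings=df['Data'].where(floats.isnull()))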

Related

How can I update an existing dataframe to add values, without overwriting other existing values in the same column?

I have an existing dataframe with two columns as follows:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F None
8 NaN CONTINUOUS_TRADING
9 SL None
10 NaN HALTED
11 NaN None
12 NaN None
13 L None
I am trying to apply the following 3 mappings to the above dataframe:
market_info_df['market_state'] = market_info_df['reason'].map({'F': OPENING_AUCTION})
market_info_df['market_state'] = market_info_df['reason'].map({'SL': CLOSING_AUCTION})
market_info_df['market_state'] = market_info_df['reason'].map({'L': CLOSED})
But when I run the above 3 lines, it seems to overwrite the existing mappings:
market_state reason
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN F
8 NaN NaN
9 NaN SL
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 CLOSED L
(And it seems to have swapped the columns? - though this doesn't matter)
Each of the lines seems to overwrite the dataframe. Is there a way simply to update the dataframe, i.e. so it just updates the three mappings, like this:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
Combine the values into one dictionary and chain Series.fillna with the same market_state column:
d = {'F': 'OPENING_AUCTION','SL': 'CLOSING_AUCTION', 'L': 'CLOSED'}
market_info_df['market_state'] = (market_info_df['reason'].map(d)
                                  .fillna(market_info_df['market_state']))
print (market_info_df)
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
Use a single dictionary, then fillna with the original values if needed:
market_info_df['market_state'] = (
    market_info_df['reason']
    .map({'F': 'OPENING_AUCTION',   # only ONE dictionary
          'SL': 'CLOSING_AUCTION',
          'L': 'CLOSED'})
    .fillna(market_info_df['market_state'])
)
Or, to only update the NA values:
market_info_df.loc[market_info_df['market_state'].isna(), 'market_state'] = (
    market_info_df['reason']
    .map({'F': 'OPENING_AUCTION',   # only ONE dictionary
          'SL': 'CLOSING_AUCTION',
          'L': 'CLOSED'})
)
Output:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
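Another option, not shown above, is Series.update, which copies only the non-NaN values from the mapped series, so existing states are left alone:
d = {'F': 'OPENING_AUCTION', 'SL': 'CLOSING_AUCTION', 'L': 'CLOSED'}

state = market_info_df['market_state'].copy()
state.update(market_info_df['reason'].map(d))   # in place; NaN entries are ignored
market_info_df['market_state'] = state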

How to select and print some columns for all the rows that are not NA and are a specific number in Python?

I am having trouble selecting the rows I want and printing the columns of choice.
So I have 8 columns and what I am looking to do is take all the rows where column 8 is not NA and is equal to 2, and print only columns 2 to 5.
I have tried this:
df.where(df['dhch'].notnull())[['scchdg', 'dhch']]
Here I have just entered 2 columns to check that the condition that dhch is not NA worked, and I got the expected result:
scchdg dhch
0 3 1
1 -1 2
2 -1 2
3 1 1
4 3 1
... ...
12094 -9 1
12095 1 1
12096 4 1
12097 3 1
12098 4 1
[12099 rows x 2 columns]
And when I check the conditional value I get the expected output (i.e., values of 2 and NaNs in the dhch column):
df.where(df['dhch']==2)[['scchdg', 'dhch']]
Out[50]:
scchdg dhch
0 NaN NaN
1 -1.0 2.0
2 -1.0 2.0
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
But when I combine these, I just get piles of NAs:
df.where(df['dhch'].notnull() & df['dhch']==2)[['scchdg', 'dhch']]
Out[51]:
scchdg dhch
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
What am I doing wrong please?
In R, what I want to do is as follows:
df[!is.na(df$dhch) & df$dhch==2, c('scchdg', 'dhch')]
But how do I do exactly this in Python, please?
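No answer is included here, but for reference, a sketch of the pandas equivalent of that R expression. The likely culprit is operator precedence: & binds more tightly than ==, so the combined condition needs parentheses, and plain boolean indexing with .loc drops the non-matching rows instead of masking them to NaN the way where does:
# df['dhch'].notnull() & df['dhch'] == 2 is parsed as
# (df['dhch'].notnull() & df['dhch']) == 2, hence all the NaNs.
mask = df['dhch'].notnull() & (df['dhch'] == 2)   # notnull() is redundant (NaN == 2 is False) but mirrors the R code
df.loc[mask, ['scchdg', 'dhch']]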

Pandas assign series to a column based on index

I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe for every 3rd row, starting from the 3rd row.
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But, I am not getting the expected results.
Use values
Otherwise, Pandas tries to align the index values from DF1 and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
Or, build a new dataframe from DF1:
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
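For reference, a minimal setup under which both snippets reproduce the output above (assuming DF2 starts as 12 empty rows):
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Count': [98.0, 176.0, 260.5, 389.0]})
DF2 = pd.DataFrame({'Count': np.nan}, index=range(12))

DF2.loc[2::3, 'Count'] = DF1['Count'].values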

reshape a pandas dataframe index to columns

Consider the pandas Series object below:
import numpy as np
import pandas as pd

index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have tried pd.pivot_table and pd.unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
>>>
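Since the question mentions unstack, here is a sketch of that route as well (assuming the same Series df): number the occurrences of each label with cumcount, make those numbers the outer index level, then unstack the letters into columns.
s = df.copy()
# cumcount gives 0, 1, 2 within each letter, in the original order
s.index = pd.MultiIndex.from_arrays([df.groupby(level=0).cumcount(), df.index])
s.unstack()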

Error adding date column in Pandas

I need some help figuring out why my dataframe is returning all NaNs.
print df
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
3 4 3 0 3 30
4 5 1 0 3 30
Then I added a date index. I only need to increment by one day for 5 days.
import datetime
import pandas as pd
from pandas import DataFrame

date = pd.date_range(datetime.datetime.today(), periods=5)
data = DataFrame(df, index=date)
print data
0 1 2 3 4
2014-04-10 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-11 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-12 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-13 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-14 17:16:09.433000 NaN NaN NaN NaN NaN
Tried a few different things to no avail. If I switch my original dataframe to
np.random.randn(5,5)
Then it works. Anyone have an idea of what is going on here?
Edit: Going to add that the data type is float64
print df.dtypes
0 float64
1 float64
2 float64
3 float64
4 float64
dtype: object
You should overwrite the index of the original dataframe with the following:
df.index = date
What DataFrame(df, index=date) does is create a new dataframe by aligning the given index with the existing index of df, for example:
DataFrame(df, index=[0,1,2,5,5])
returns the following:
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
5 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
because 5 is not included in the index of the original dataframe.
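If you would rather not mutate df, a sketch of the same fix that returns a new frame: set_index also accepts an array-like, and because the row labels are simply replaced (no alignment happens), nothing turns into NaN.
date = pd.date_range(datetime.datetime.today(), periods=5)
data = df.set_index(date)   # same data, new row labels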
