Boolean indexing to retain falsy values as NaN - python

Given a dataframe:
Data
1 246804
2 135272
3 898.01
4 3453.33
5 shine
6 add
7 522
8 Nan
9 string
10 29.11
11 20
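(For reference, a sketch of how this frame can be constructed; the index labels 1-11 and the values are taken from the listing above, and row 8 holds the literal string 'Nan', not a missing value.)
import pandas as pd

df = pd.DataFrame(
    {'Data': ['246804', '135272', '898.01', '3453.33', 'shine', 'add',
              '522', 'Nan', 'string', '29.11', '20']},
    index=range(1, 12),
)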
I would like two new columns Floats and Strings, both having the same length as the original dataframe. Getting the Floats column is easy:
In [176]: pd.to_numeric(df.Data, errors='coerce')
Out[176]:
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN
6 NaN
7 522.00
8 NaN
9 NaN
10 29.11
11 20.00
Name: Data, dtype: float64
As you can see, non-floats are coerced to NaN, which is exactly what I want.
To get strings, this is what I do:
In [177]: df[df.Data.str.isalpha()]
Out[177]:
Data
5 shine
6 add
8 Nan
9 string
But as you can see, it does not retain the non-string values as NaN. I want something like this:
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan (not NaN)
9 string
10 NaN
11 NaN
How can I get it to do so?

To get Strings, you can use boolean indexing on the Data column, locating where Floats is null.
df['Floats'] = pd.to_numeric(df.Data, errors='coerce')
df['Strings'] = df.Data.loc[df.Floats.isnull()] # Optional: .astype(str)
>>> df
# Output:
# Data Floats Strings
# 1 246804 246804.00 NaN
# 2 135272 135272.00 NaN
# 3 898.01 898.01 NaN
# 4 3453.33 3453.33 NaN
# 5 shine NaN shine
# 6 add NaN add
# 7 522 522.00 NaN
# 8 Nan NaN Nan
# 9 string NaN string
# 10 29.11 29.11 NaN
# 11 20 20.00 NaN

floats = pd.to_numeric(df.Data, errors='coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull())
))
floats strings
1 246804.00 NaN
2 135272.00 NaN
3 898.01 NaN
4 3453.33 NaN
5 NaN shine
6 NaN add
7 522.00 NaN
8 NaN Nan
9 NaN string
10 29.11 NaN
11 20.00 NaN
You can make it even more obvious by passing an alternative value to mask:
floats = pd.to_numeric(df.Data, errors='coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull(), '')
))
floats strings
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN shine
6 NaN add
7 522.00
8 NaN Nan
9 NaN string
10 29.11
11 20.00

How about
df.Data.where(pd.to_numeric(df.Data, errors='coerce').isnull())
Out[186]:
Data
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan #not NaN
9 string
10 NaN
11 NaN
Or, based on your df.Data.str.isalpha():
df['Data'].where(df['Data'].str.isalpha())
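Putting the pieces together, a sketch (using the same df and the where approach above) that builds both new columns in one statement:
floats = pd.to_numeric(df['Data'], errors='coerce')
# Strings keeps the original value only where the numeric conversion failed;
# everything that parsed as a float becomes NaN there.
out = df.assign(Floats=floats, Strings=df['Data'].where(floats.isnull()))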

Related

How can I update an existing dataframe to add values, without overwriting other existing values in the same column?

I have an existing dataframe with two columns as follows:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F None
8 NaN CONTINUOUS_TRADING
9 SL None
10 NaN HALTED
11 NaN None
12 NaN None
13 L None
I am trying to apply the following 3 mappings to the above dataframe:
market_info_df['market_state'] = market_info_df['reason'].map({'F': OPENING_AUCTION})
market_info_df['market_state'] = market_info_df['reason'].map({'SL': CLOSING_AUCTION})
market_info_df['market_state'] = market_info_df['reason'].map({'L': CLOSED})
But when I run the above 3 lines, it seems to overwrite the existing mappings:
market_state reason
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN F
8 NaN NaN
9 NaN SL
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 CLOSED L
(And it seems to have swapped the columns? - though this doesn't matter)
Each of the lines seems to overwrite the dataframe. Is there a way simply to update the dataframe, i.e. so it just updates the three mappings, like this:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
Combine the values into one dictionary and chain Series.fillna with the same market_state column:
d = {'F': 'OPENING_AUCTION','SL': 'CLOSING_AUCTION', 'L': 'CLOSED'}
market_info_df['market_state'] = (market_info_df['reason'].map(d)
                                  .fillna(market_info_df['market_state']))
print (market_info_df)
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
Use a single dictionary, then fillna with the original values if needed:
market_info_df['market_state'] = (
    market_info_df['reason']
    .map({'F': 'OPENING_AUCTION',   # only ONE dictionary
          'SL': 'CLOSING_AUCTION',
          'L': 'CLOSED'})
    .fillna(market_info_df['market_state'])
)
Or, to only update the NA values:
market_info_df.loc[market_info_df['market_state'].isna(), 'market_state'] = (
    market_info_df['reason']
    .map({'F': 'OPENING_AUCTION',   # only ONE dictionary
          'SL': 'CLOSING_AUCTION',
          'L': 'CLOSED'})
)
Output:
reason market_state
0 NaN UNSCHEDULED_AUCTION
1 NaN None
2 NaN CLOSED
3 NaN CONTINUOUS_TRADING
4 NaN None
5 NaN UNSCHEDULED_AUCTION
6 NaN UNSCHEDULED_AUCTION
7 F OPENING_AUCTION
8 NaN CONTINUOUS_TRADING
9 SL CLOSING_AUCTION
10 NaN HALTED
11 NaN None
12 NaN None
13 L CLOSED
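Another option, not shown above, is Series.update, which copies only the non-NaN values from the mapped series, so existing states are left alone:
d = {'F': 'OPENING_AUCTION', 'SL': 'CLOSING_AUCTION', 'L': 'CLOSED'}

state = market_info_df['market_state'].copy()
state.update(market_info_df['reason'].map(d))   # in place; NaN entries are ignored
market_info_df['market_state'] = state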

How to select and print some columns for all the rows that are not NA and are a specific number in Python?

I am having trouble selecting the rows I want and printing the columns of choice.
So I have 8 columns and what I am looking to do is take all the rows where column 8 is not NA and is equal to 2, and print only columns 2 to 5.
I have tried this:
df.where(df['dhch'].notnull())[['scchdg', 'dhch']]
Here I have just entered 2 columns to check that the condition that dhch is not NA worked, and I got the expected result:
scchdg dhch
0 3 1
1 -1 2
2 -1 2
3 1 1
4 3 1
... ...
12094 -9 1
12095 1 1
12096 4 1
12097 3 1
12098 4 1
[12099 rows x 2 columns]
And when I check the conditional value I get the expected output (i.e., values of 2 and NaNs in the dhch column):
df.where(df['dhch']==2)[['scchdg', 'dhch']]
Out[50]:
scchdg dhch
0 NaN NaN
1 -1.0 2.0
2 -1.0 2.0
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
But when I combine these, I just get piles of NAs:
df.where(df['dhch'].notnull() & df['dhch']==2)[['scchdg', 'dhch']]
Out[51]:
scchdg dhch
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
What am I doing wrong please?
In R, what I want to do is as follows:
df[!is.na(df$dhch) & df$dhch==2, c('scchdg', 'dhch')]
But how do I do exactly this in Python, please?
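No answer is included here, but for reference, a sketch of the pandas equivalent of that R expression. The likely culprit is operator precedence: & binds more tightly than ==, so the combined condition needs parentheses, and plain boolean indexing with .loc drops the non-matching rows instead of masking them to NaN the way where does:
# df['dhch'].notnull() & df['dhch'] == 2 is parsed as
# (df['dhch'].notnull() & df['dhch']) == 2, hence all the NaNs.
mask = df['dhch'].notnull() & (df['dhch'] == 2)   # notnull() is redundant (NaN == 2 is False) but mirrors the R code
df.loc[mask, ['scchdg', 'dhch']]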

Pandas assign series to a column based on index

I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe for every 3rd row, starting from the 3rd row.
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But, I am not getting the expected results.
Use values
Otherwise, Pandas tries to align the index values from DF1 and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
Or, build a new dataframe from DF1:
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
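For reference, a minimal setup under which both snippets reproduce the output above (assuming DF2 starts as 12 empty rows):
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Count': [98.0, 176.0, 260.5, 389.0]})
DF2 = pd.DataFrame({'Count': np.nan}, index=range(12))

DF2.loc[2::3, 'Count'] = DF1['Count'].values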

reshape a pandas dataframe index to columns

Consider the pandas Series object below:
import numpy as np
import pandas as pd

index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have tried pd.pivot_table and pd.unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
>>>
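Since the question mentions unstack, here is a sketch of that route as well (assuming the same Series df): number the occurrences of each label with cumcount, make those numbers the outer index level, then unstack the letters into columns.
s = df.copy()
# cumcount gives 0, 1, 2 within each letter, in the original order
s.index = pd.MultiIndex.from_arrays([df.groupby(level=0).cumcount(), df.index])
s.unstack()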

Error adding date column in Pandas

I need some help figuring out why my dataframe is returning all NaNs.
print df
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
3 4 3 0 3 30
4 5 1 0 3 30
Then I added a date index. I only need to increment by one day for 5 days.
import datetime
import pandas as pd
from pandas import DataFrame

date = pd.date_range(datetime.datetime.today(), periods=5)
data = DataFrame(df, index=date)
print data
0 1 2 3 4
2014-04-10 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-11 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-12 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-13 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-14 17:16:09.433000 NaN NaN NaN NaN NaN
Tried a few different things to no avail. If I switch my original dataframe to
np.random.randn(5,5)
Then it works. Anyone have an idea of what is going on here?
Edit: Going to add that the data type is float64
print df.dtypes
0 float64
1 float64
2 float64
3 float64
4 float64
dtype: object
You should overwrite the index of the original dataframe with the following:
df.index = date
What DataFrame(df, index=date) does is create a new dataframe by aligning the given index with the existing index of df, for example:
DataFrame(df, index=[0,1,2,5,5])
returns the following:
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
5 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
because 5 is not included in the index of the original dataframe.
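If you would rather not mutate df, a sketch of the same fix that returns a new frame: set_index also accepts an array-like, and because the row labels are simply replaced (no alignment happens), nothing turns into NaN.
date = pd.date_range(datetime.datetime.today(), periods=5)
data = df.set_index(date)   # same data, new row labels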
