How to change the encoding that get_dummies / OneHotEncoder produces in pandas? - python

I want to use the function pandas.get_dummies() to convert a DataFrame column Type: [N,N,N,Y,Y,N,Y……] into Type: [0,0,0,1,1,0,1……]
import pandas as pd
# Define a DataFrame "df"
df = pd.get_dummies(df)
But the function always converts "N" to 1 and "Y" to 0, so the result is Type: [1,1,1,0,0,1,0……]. That's not what I want, because it will cause trouble in my calculation.
What should I do to change it?

You can use 1 - pd.get_dummies(df) to map 1 to 0 and vice versa, like:
>>> df
   0
0  N
1  N
2  N
3  Y
4  Y
5  N
6  Y
>>> 1 - pd.get_dummies(df)
   0_N  0_Y
0    0    1
1    0    1
2    0    1
3    1    0
4    1    0
5    0    1
6    1    0
or if you convert this to booleans, you can use the boolean negation, like:
>>> ~pd.get_dummies(df, dtype=bool)
     0_N    0_Y
0  False   True
1  False   True
2  False   True
3   True  False
4   True  False
5  False   True
6   True  False
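If what you actually want is a single 0/1 column rather than one dummy column per category, a direct mapping is simpler than post-processing get_dummies. A minimal sketch, assuming the column is named Type:
import pandas as pd

df = pd.DataFrame({'Type': ['N', 'N', 'N', 'Y', 'Y', 'N', 'Y']})
# map each category explicitly, so Y -> 1 and N -> 0 regardless of sort order
df['Type'] = df['Type'].map({'Y': 1, 'N': 0})
print(df['Type'].tolist())  # [0, 0, 0, 1, 1, 0, 1]
This avoids relying on the alphabetical column ordering that get_dummies uses.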

Related

How can I select and index the highest value in each group of a Pandas dataframe?

I have a dataframe with multiple columns, each combination of columns describing one experiment (e.g. multiple super-labels, and for each super-label multiple episodes with different numbers of timesteps). I want to set the last timestep in each episode of every experiment to True, but I can't figure out how to do this. I have tried three different approaches, all using .loc: 1) .max().index, 2) .idxmax() and 3) .tail(1).index, but they all fail (the first two with exceptions I don't understand, and the last one with wrong results).
This is my minimal example:
import numpy as np
import pandas as pd
np.random.seed(4)
def gen(t):
    results = []
    for episode_id, episode in enumerate(range(np.random.randint(2, 4))):
        for i in range(np.random.randint(2, 6)):
            results.append(
                {
                    "episode": episode_id,
                    "timestep": i,
                    "t": t,
                }
            )
    return pd.DataFrame(results)
df = pd.concat([gen("a"), gen("b")])
base_groups = ["t", "episode"]
df["last_timestep"] = False
print("Expected:")
print(df.groupby(base_groups).timestep.max())
#df.loc[df.groupby(base_groups).timestep.max().index, "last_timestep"] = True
#df.loc[df.groupby(base_groups).timestep.idxmax(), "last_timestep"] = True
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True
print("Is:")
print(df[df.last_timestep])
The output of df.groupby(base_groups).timestep.max() is exactly what I expect; the correct rows are selected:
Expected:
t  episode
a  0          3
   1          4
b  0          2
   1          1
   2          4
But when filtering the dataframe, this is what I get:
Is:
   episode  timestep  t  last_timestep
2        0         2  a           True
3        0         3  a           True
4        1         0  a           True
8        1         4  a           True
2        0         2  b           True
3        1         0  b           True
4        1         1  b           True
8        2         3  b           True
9        2         4  b           True
Rows 0, 2, 5 and 7 of this selection (counting from the top) should not have been selected.
Use GroupBy.transform to broadcast the per-group maximum back to every row, then compare it with the timestep column:
df["last_timestep"] = df.groupby(base_groups)['timestep'].transform('max').eq(df['timestep'])
print(df)
   episode  timestep  t  last_timestep
0        0         0  a          False
1        0         1  a          False
2        0         2  a          False
3        0         3  a           True
4        1         0  a          False
5        1         1  a          False
6        1         2  a          False
7        1         3  a          False
8        1         4  a           True
0        0         0  b          False
1        0         1  b          False
2        0         2  b           True
3        1         0  b          False
4        1         1  b           True
5        2         0  b          False
6        2         1  b          False
7        2         2  b          False
8        2         3  b          False
9        2         4  b           True
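As an aside, the .idxmax() attempt from the question can also work, but only with unique row labels; after pd.concat the index contains duplicates, which is what broke the .loc assignments. A minimal sketch, reusing df and base_groups from the question:
df = df.reset_index(drop=True)  # make the row labels unique first
df['last_timestep'] = False
# idxmax now returns labels that address exactly one row each
df.loc[df.groupby(base_groups)['timestep'].idxmax(), 'last_timestep'] = True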

Python Dataframe: How to check specific columns for elements

I want to check whether all elements in certain columns are 0.
I have a dataset that I read with df = pd.read_table('ad-data').
From this I get a data frame with elements:
     [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  ... 1559
[1]    3    2    3    0    0    0    0
[2]    2    3    2    0    0    0    0
[3]    3    2    2    0    0    0    0
[4]    6    7    3    0    0    0    0
[5]    3    2    1    0    0    0    0
...
(3220 rows)
I would like to check whether the data set from column 4 to 1559 contains only 0 or also other values.
You can check for equality with 0 element-wise and use all for rows:
df['all_zeros'] = (df.iloc[:, 4:1560] == 0).all(axis=1)
Small example to demonstrate it (based on columns 1 to 3 here):
import numpy as np
import pandas as pd

N = 5
df = pd.DataFrame(np.random.binomial(1, 0.4, size=(N, N)))
df['all_zeros'] = (df.iloc[:, 1:4] == 0).all(axis=1)
df
Output:
   0  1  2  3  4  all_zeros
0  0  1  1  0  0      False
1  0  0  1  1  1      False
2  0  1  1  0  0      False
3  0  0  0  0  0       True
4  1  0  0  0  0       True
Update: filtering the rows that are not all zeros:
df[~df['all_zeros']]
Output:
   0  1  2  3  4  all_zeros
0  0  1  1  0  0      False
1  0  0  1  1  1      False
2  0  1  1  0  0      False
Update 2: To show only non-zero values:
df_filtered = df[~df['all_zeros']]  # the rows that are not all zeros
pd.melt(
    df_filtered.iloc[:, 1:4].reset_index(),
    id_vars='index', var_name='column'
).query('value != 0').sort_values('index')
Output:
   index column  value
0      0      1      1
3      0      2      1
4      1      2      1
7      1      3      1
2      2      1      1
5      2      2      1
df['Check'] = df.loc[:, 4:].sum(axis=1)  # a row sum of 0 over columns 4+ means all zeros, assuming non-negative values
Here is a way to check whether all the values are zero or not. It's simple and doesn't need advanced functions like the answers above, only basic things such as filtering, if statements and variable assignment.
First is how to check whether a single column contains only zeros, and second is how to check whether all the columns do; each prints an answer statement.
The method to check whether one column has only zero values:
First make a boolean series:
has_zero = df[4] == 0
# has_zero is a series containing a bool value for each row, e.g. True, False.
# if a row holds a zero, the result is "row_number : True"
next:
rows_which_have_zero = df[has_zero]
# stores the rows which have zero, as a data frame
next:
if len(rows_which_have_zero) == total_number_rows:
    print("contains only zeros")
else:
    print("contains other numbers than zero")
# substitute 3220 for total_number_rows
This method only checks whether the length of rows_which_have_zero equals the total number of rows in the column.
The method to see whether all of the columns have only zeros:
It puts the check above inside a loop over the columns:
no_of_columns_with_only_zero = 0
value_1 = 4  # it doesn't matter whether the first 3 columns have zeros, so start at column 4
while value_1 <= 1559:
    has_zero = df[value_1] == 0
    rows_which_have_zero = df[has_zero]
    if len(rows_which_have_zero) == total_number_rows:
        no_of_columns_with_only_zero += 1
    value_1 += 1
Then check whether every examined column had only zeros:
# columns 4 to 1559 make 1556 columns in total
if no_of_columns_with_only_zero == 1556:
    print("there are only zero values")
else:
    print("there are numbers which are not zero")
The above compares no_of_columns_with_only_zero with the number of columns that were examined (1556, because only columns 4 to 1559 need to be checked).
update:
# convert value_1 to str if the column titles are str instead of int
# when updating value_1 by adding 1, convert it back to int and then back to str
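If all you need is a single yes/no answer for the whole block of columns, the element-wise comparison from the first answer above can be reduced twice with all. A minimal sketch, assuming the columns of interest sit at positions 4 to 1559:
all_zero = (df.iloc[:, 4:1560] == 0).all().all()  # True only if every value is zero
print("contains only zeros" if all_zero else "contains other numbers than zero")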

List columns in data frame that have missing values as '?'

List the names of the columns of a data frame, along with the count of missing values per column, where missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:,0].str.split(',',-1,expand=True)
bridge_sep.columns = ['IDENTIF', 'RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES',
                      'CLEAR-G', 'T-OR-D', 'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet. It's actually [107 rows x 13 columns].
  IDENTIF RIVER LOCATION ERECTED  ... MATERIAL   SPAN REL-L  TYPE
0      E2     A        ?  CRAFTS  ...     WOOD  SHORT     ?  WOOD
1      E3     A       39  CRAFTS  ...     WOOD      ?     S  WOOD
2      E5     A        ?  CRAFTS  ...     WOOD  SHORT     S  WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the occurrences with sum (True values are counted as 1), then drop the zero counts by boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print(s)
LOCATION    2
SPAN        1
REL-L       1
dtype: int64
Last, to get a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names', 'count']
print(df1)
      names  count
0  LOCATION      2
1      SPAN      1
2     REL-L      1
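The same reshaping can also be written as one chain; a minimal sketch:
s = df.eq('?').sum()
df1 = s[s != 0].rename_axis('names').reset_index(name='count')
print(df1)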
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print(df)
   0  1  2  3  4
0  8  8  3  7  7
1  0  4  2  5  2
2  2  2  1  0  8
3  4  0  9  6  2
4  4  1  5  3  4
# compare with a Series of the same length,
# whose index values match the index/columns of the DataFrame
s = pd.Series(np.arange(5))
print(s)
0    0
1    1
2    2
3    3
4    4
dtype: int32
# compare column-wise (axis=0 aligns s with the index)
print(df.eq(s, axis=0))
       0      1      2      3      4
0  False  False  False  False  False
1  False  False  False  False  False
2   True   True  False  False  False
3  False  False  False  False  False
4   True  False  False  False   True
# compare row-wise (axis=1 aligns s with the columns)
print(df.eq(s, axis=1))
       0      1      2      3      4
0  False  False  False  False  False
1   True  False   True  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False   True  False   True   True
If your DataFrame is named df, try (df == '?').sum()
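A related pattern, in case the '?' markers should behave as real missing values afterwards: replace them with NaN and use the regular missing-value tooling. A minimal sketch, assuming df is the parsed bridge frame from above:
import numpy as np

df_na = df.replace('?', np.nan)  # turn the '?' markers into real NaN
s = df_na.isna().sum()           # count missing values per column
print(s[s > 0])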

Python - Pandas - DataFrame - Explode single column into multiple boolean columns based on conditions

Good morning chaps,
Any pythonic way to explode a dataframe column into multiple columns with boolean flags, based on some condition (str.contains in this case)?
Let's say I have this:
Position Letter
1        a
2        b
3        c
4        b
5        b
And I'd like to achieve this:
Position Letter  is_a   is_b   is_c
1        a       TRUE   FALSE  FALSE
2        b       FALSE  TRUE   FALSE
3        c       FALSE  FALSE  TRUE
4        b       FALSE  TRUE   FALSE
5        b       FALSE  TRUE   FALSE
I can do this with a loop through 'abc', explicitly creating new df columns, but I wonder whether a built-in pandas method already exists. The number of possible values, and hence the number of new columns, is variable.
Thanks and regards.
Use Series.str.get_dummies():
In [31]: df.join(df.Letter.str.get_dummies())
Out[31]:
   Position Letter  a  b  c
0         1      a  1  0  0
1         2      b  0  1  0
2         3      c  0  0  1
3         4      b  0  1  0
4         5      b  0  1  0
or
In [32]: df.join(df.Letter.str.get_dummies().astype(bool))
Out[32]:
   Position Letter      a      b      c
0         1      a   True  False  False
1         2      b  False   True  False
2         3      c  False  False   True
3         4      b  False   True  False
4         5      b  False   True  False
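If you want the exact is_a / is_b / is_c names from the question, pd.get_dummies also works here and accepts a prefix. A minimal sketch, assuming the same df:
import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 3, 4, 5],
                   'Letter': ['a', 'b', 'c', 'b', 'b']})
# prefix='is' produces column names is_a, is_b, is_c; dtype=bool gives True/False
print(df.join(pd.get_dummies(df['Letter'], prefix='is', dtype=bool)))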

Pandas conditional subset for dataframe with bool values and ints

I have a dataframe with three series. Column A contains a group_id. Column B contains True or False. Column C contains a 1-n ranking (where n is the number of rows per group_id).
I'd like to store a subset of this dataframe containing each row where:
1) Column C == 1
OR
2) Column B == True
The following logic copies my old dataframe row for row into the new dataframe:
new_df = df[df.column_b | df.column_c == 1]
IIUC, starting from a sample dataframe like:
A,B,C
01,True,1
01,False,2
02,False,1
02,True,2
03,True,1
you can:
df = df[(df['C']==1) | (df['B']==True)]
which returns:
   A      B  C
0  1   True  1
2  2  False  1
3  2   True  2
4  3   True  1
You have a couple of methods for filtering; performance varies with the size of your data:
In [722]: df[(df['C']==1) | df['B']]
Out[722]:
   A      B  C
0  1   True  1
2  2  False  1
3  2   True  2
4  3   True  1
In [723]: df.query('C==1 or B==True')
Out[723]:
   A      B  C
0  1   True  1
2  2  False  1
3  2   True  2
4  3   True  1
In [724]: df[df.eval('C==1 or B==True')]
Out[724]:
   A      B  C
0  1   True  1
2  2  False  1
3  2   True  2
4  3   True  1
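For completeness, a self-contained sketch that rebuilds the sample frame from the CSV snippet above and applies the filter:
import io
import pandas as pd

csv_data = io.StringIO('A,B,C\n01,True,1\n01,False,2\n02,False,1\n02,True,2\n03,True,1')
df = pd.read_csv(csv_data)
# the parentheses around each condition matter: | binds more tightly than ==
new_df = df[(df['C'] == 1) | df['B']]
print(new_df)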
