List columns in data frame that have missing values as '?' - python

List names of the column(s) of data frame along with the count of missing number of values if missing values are coded with '?' using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:,0].str.split(',',-1,expand=True)
bridge_sep.columns = ['IDENTIF','RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES','CLEAR-G', 'T-OR-D',
'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet. Its actually [107 rows x 13 columns].
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1

Compare all values by eq (==) and for count accurencies use sum - Trues are processes like 1, then remove only False values (0) by boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Last for DataFrame add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#compare with same length Series
#same index values like index/columns of DataFrame
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True

If your DataFrame is named df, try (df == '?').sum()

Related

How can I select and index the highest value in each group of a Pandas dataframe?

I have a dataframe with multiple columns, each combination of columns describing one experiment (e.g. multiple super-labels, for each super-label multiple episodes with different number of timesteps). I want to set the last timestep in each episode for all experiments to True, but I can't figure out how to do this. I have tried three different approaches, all using .loc and 1) using .max().index, 2) .idxmax() and 3) .tail(1).index, but they all fail (the first two with for me ununderstandable exceptions and the last one being wrong.
This is my minimal example:
import numpy as np
import pandas as pd
np.random.seed(4)
def gen(t):
results = []
for episode_id, episode in enumerate(range(np.random.randint(2, 4))):
for i in range(np.random.randint(2, 6)):
results.append(
{
"episode": episode_id,
"timestep": i,
"t": t,
}
)
return pd.DataFrame(results)
df = pd.concat([gen("a"), gen("b")])
base_groups = ["t", "episode"]
df["last_timestep"] = False
print("Expected:")
print(df.groupby(base_groups).timestep.max())
#df.loc[df.groupby(base_groups).timestep.max().index, "last_timestep"] = True
#df.loc[df.groupby(base_groups).timestep.idxmax(), "last_timestep"] = True
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True
print("Is:")
print(df[df.last_timestep])
The output of df.groupby(base_groups).timestep.max() is exactly what I expect, the correct rows are selected:
Expected:
t episode
a 0 3
1 4
b 0 2
1 1
2 4
But when filtering the dataframe, this is what I get:
Is:
episode timestep t last_timestep
2 0 2 a True
3 0 3 a True
4 1 0 a True
8 1 4 a True
2 0 2 b True
3 1 0 b True
4 1 1 b True
8 2 3 b True
9 2 4 b True
The rows 0, 2, 5 and 7 should not be selected.
Use GroupBy.transform for repeat max aggregated values and compare by column timestep:
df["last_timestep"] = df.groupby(base_groups)['timestep'].transform(max).eq(df['timestep'])
print (df)
episode timestep t last_timestep
0 0 0 a False
1 0 1 a False
2 0 2 a False
3 0 3 a True
4 1 0 a False
5 1 1 a False
6 1 2 a False
7 1 3 a False
8 1 4 a True
0 0 0 b False
1 0 1 b False
2 0 2 b True
3 1 0 b False
4 1 1 b True
5 2 0 b False
6 2 1 b False
7 2 2 b False
8 2 3 b False
9 2 4 b True

pandas groupby return a boolean vector

I have a time series database where I would like to group the data to compare them both to another cell in the same row, and the previous value.
The code below will return a vector against the whole dataframe, but if I try to group it I get a dataframe with apply() and an error with agg or transform.
Sample data frame
df = pd.DataFrame({ 'group': [1, 1, 1, 2,2,2,1,2, 1], 'target': [100,100,100,100,10,10,10,10,50],'val' :[90,80,70,4,120,6,60,8, 50] })
df
group target val
0 1 100 90
1 1 100 80
2 1 100 70
3 2 100 4
4 2 10 120
5 2 10 6
6 1 10 60
7 2 10 8
8 1 50 50
Here is my attempt at a function
def spike(df):
high = df['val'] > df['target']+25
rising = df['val'] > df['val'].shift()
return high & rising
print(spike(df))
print( df.groupby('group').apply(spike))
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
8 False
dtype: bool
0 1 2 6 8
group
1 False False False False False
2 False True False False True
Here is my output, I was trying to get the second output to look like the first except row 6 should be false.
You are over thinking it:
shift = df.groupby('group')['val'].shift()
df['val'].gt(df['target']+25) & df['val'].gt(shift)
Output:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
dtype: bool

How to edit the realization process of OneHotEncoder in pandas?

I want to use a function "pandas.get_dummies()" to convert a DataFrame "Type: [N,N,N,Y,Y,N,Y……]" to be "Type: [0,0,0,1,1,0,1……]"
import pandas as pd
# Define a DataFrame "df"
df = pd.get_dummies(df)
But the function always convert "N" to be "1" and "Y" to be "0". So the result is "Type: [1,1,1,0,0,1,0……]". That's not what I want ,because it will bring some trouble to my calculation.
What should I do to change it?
You can use 1-pd.get_dummies(df) to map 1 to 0 and vice versa, like:
>>> df
0
0 N
1 N
2 N
3 Y
4 Y
5 N
6 Y
>>> 1 - pd.get_dummies(df)
0_N 0_Y
0 0 1
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
or if you convert this to booleans, you can use the boolean negation, like:
>>> ~pd.get_dummies(df, dtype=bool)
0_N 0_Y
0 False True
1 False True
2 False True
3 True False
4 True False
5 False True
6 True False

Pandas incremental values when boolean changes

There is a dataframe that contains column in which boolean values alternate. I want to create incremental value series based on those boolean changes. I want to increment only when boolean value differs from previous value. I want to do this without loop.
Example, Here's dataframe:
column
0 True
1 True
2 False
3 False
4 False
5 True
I want to get this:
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Compare shifted column by not equal and add cumulative sum:
df['inc'] = df['column'].ne(df['column'].shift()).cumsum()
print (df)
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Detail:
print (df['column'].ne(df['column'].shift()))
0 True
1 False
2 True
3 False
4 False
5 True
Name: column, dtype: bool

Python - Pandas - DataFrame - Explode single column into multiple boolean columns based on conditions

Good morning chaps,
Any pythonic way to explode a dataframe column into multiple columns with boolean flags, based on some condition (str.contains in this case)?
Let's say I have this:
Position Letter
1 a
2 b
3 c
4 b
5 b
And I'd like to achieve this:
Position Letter is_a is_b is_C
1 a TRUE FALSE FALSE
2 b FALSE TRUE FALSE
3 c FALSE FALSE TRUE
4 b FALSE TRUE FALSE
5 b FALSE TRUE FALSE
Can do with a loop through 'abc' and explicitly creating new df columns, but wondering if some built-in method already exists in pandas. Number of possible values, and hence number of new columns is variable.
Thanks and regards.
use Series.str.get_dummies():
In [31]: df.join(df.Letter.str.get_dummies())
Out[31]:
Position Letter a b c
0 1 a 1 0 0
1 2 b 0 1 0
2 3 c 0 0 1
3 4 b 0 1 0
4 5 b 0 1 0
or
In [32]: df.join(df.Letter.str.get_dummies().astype(bool))
Out[32]:
Position Letter a b c
0 1 a True False False
1 2 b False True False
2 3 c False False True
3 4 b False True False
4 5 b False True False

Categories