pandas: delete rows/columns where 0 appears most of the time - python

I have a DataFrame like this:
a b c d
1 0 0 0
0 1 0 7
5 2 0 4
6 3 0 0
0 0 8 8
0 7 7 7
0 0 0 1
1: for each row, if the count of 0s is > 90% of the number of columns (in this case 0.9*4), delete the row.
2: for each column, if the count of 0s is > 90% of the number of rows (in this case 0.9*7), delete the column.

I guess you want something like:
mask_rows = (df == 0).sum(axis=1) > 0.9 * len(df.columns)
mask_cols = (df == 0).sum(axis=0) > 0.9 * len(df)
This creates the masks following my interpretation of your question...

First create a mask that reveals where the zeros are:
df_temp = (df == 0)
Then drop the rows:
df.drop(df.index[df_temp.mean(axis=1) > 0.9], inplace=True)
And finally the columns:
df.drop(columns=df.columns[df_temp.mean(axis=0) > 0.9], inplace=True)
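Putting the two steps together, here is a minimal runnable sketch of this approach (the sample frame and the 90% threshold come from the question; boolean indexing replaces the in-place drops):

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 5, 6, 0, 0, 0],
                   'b': [0, 1, 2, 3, 0, 7, 0],
                   'c': [0, 0, 0, 0, 8, 7, 0],
                   'd': [0, 7, 4, 0, 8, 7, 1]})

zeros = (df == 0)
keep_rows = zeros.mean(axis=1) <= 0.9   # share of zeros per row
keep_cols = zeros.mean(axis=0) <= 0.9   # share of zeros per column
df = df.loc[keep_rows, keep_cols]

Note that nothing in this particular sample crosses the 90% threshold, so df comes back unchanged; lower the threshold (say to 0.6) to see rows and columns actually disappear.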

Related

How to write conditional for dataframe for first and last occurrences of conditions that are met?

I have the following dataframe:
row  issue_status  market_phase  trade_type
0    20            0
1    10            0
2    20            0
3    10            0
4    10            0
5    10            0
I would like to map the first instance of (issue_status == 10 & market_phase == 0) to OPENING_AUCTION.
And any subsequent occurrences of the above, I would like to map to CONTINUOUS_TRADING.
So I would like the dataframe to look like this:
row  issue_status  market_phase   trade_type
0    20            0              ->
1    10            0              -> OPENING_AUCTION
2    20            0              ->
3    10            0              -> CONTINUOUS_TRADING
4    10            0              -> CONTINUOUS_TRADING
5    10            0              -> CONTINUOUS_TRADING
Here is my code:
market_info_df.loc[market_info_df['issue_status' == '10', 'market_phase' == '0'].iloc[0]] = MARKET_STATES.OPENING_AUCTION
market_info_df.loc[market_info_df['issue_status' == '10', 'market_phase' == '0']].iloc[1:] = MARKET_STATES.INTRADAY_AUCTION
When I run the above I get KeyError: (False, False, False)
Note that I need to use iloc in the code- any ideas how I would achieve the above?
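An aside on the error itself: 'issue_status' == '10' is evaluated by Python before pandas ever sees it (two plain strings, giving False), so .loc receives a tuple of raw booleans, hence KeyError: (False, False, False). A minimal sketch of the boolean indexing presumably intended (market_info_df and MARKET_STATES are the question's own names; compare against the ints 10 and 0 instead if the columns are numeric, as in the answers below):

mask = (market_info_df['issue_status'] == '10') & (market_info_df['market_phase'] == '0')
matches = market_info_df.index[mask]
market_info_df.loc[matches[:1], 'trade_type'] = MARKET_STATES.OPENING_AUCTION
market_info_df.loc[matches[1:], 'trade_type'] = MARKET_STATES.CONTINUOUS_AUCTION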
Disregarding performance, you can iterate through your dataframe:
import pandas as pd

df = pd.DataFrame({'row': [0, 1, 2, 3, 4, 5],
                   'issue_status': [20, 10, 20, 10, 10, 10],
                   'market_phase': [0, 0, 0, 0, 0, 0],
                   'trade_type': None})

trade_type_column = df.columns.get_loc("trade_type")  # so you can use iloc

already_done = False
for row in range(len(df)):
    if df.loc[row, 'issue_status'] == 10 and df.loc[row, 'market_phase'] == 0:
        if not already_done:
            df.iloc[row, trade_type_column] = 'OPENING_AUCTION'
            already_done = True
        else:
            df.iloc[row, trade_type_column] = 'CONTINUOUS_TRADING'
This is what I get:
   row  issue_status  market_phase          trade_type
0    0            20             0                None
1    1            10             0     OPENING_AUCTION
2    2            20             0                None
3    3            10             0  CONTINUOUS_TRADING
4    4            10             0  CONTINUOUS_TRADING
5    5            10             0  CONTINUOUS_TRADING
This might not be the most efficient solution, but it's possible using df.loc and np.argmax.
import pandas as pd

df = pd.DataFrame({"row": [0, 1, 2, 3, 4, 5],
                   "issue_status": [20, 10, 20, 10, 10, 10],
                   "market_phase": [0, 0, 0, 0, 0, 0],
                   "trade_type": [""] * 6})
# First set the trade_type column for all rows where the condition matches to `CONTINUOUS_TRADING`
df.loc[(df.issue_status.values == 10) & (df.market_phase.values == 0), 'trade_type'] = 'CONTINUOUS_TRADING'
# Set trade_type for first occurrence of condition to "OPENING_AUCTION"
df.loc[((df.issue_status.values == 10) & (df.market_phase.values == 0)).argmax(), 'trade_type'] = 'OPENING_AUCTION'
print(df)
Output:
   row  issue_status  market_phase          trade_type
0    0            20             0
1    1            10             0     OPENING_AUCTION
2    2            20             0
3    3            10             0  CONTINUOUS_TRADING
4    4            10             0  CONTINUOUS_TRADING
5    5            10             0  CONTINUOUS_TRADING
Identify the rows where issue_status is 10 and market_phase is 0, then identify the duplicate rows. Using this information together with loc, fill in the corresponding values in the trade_type column:
cols = ['issue_status', 'market_phase']
m1 = df[cols].eq([10, 0]).all(1)
m2 = df[cols].duplicated()
df.loc[m1 & ~m2, 'trade_type'] = 'OPENING_AUCTION'
df.loc[m1 & m2, 'trade_type'] = 'CONTINUOUS_TRADING'
Result
   row  issue_status  market_phase          trade_type
0    0            20             0                 NaN
1    1            10             0     OPENING_AUCTION
2    2            20             0                 NaN
3    3            10             0  CONTINUOUS_TRADING
4    4            10             0  CONTINUOUS_TRADING
5    5            10             0  CONTINUOUS_TRADING
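A variant sketch of the same split, not from the thread (it assumes the df built in the answers above): cummax marks everything from the first match onward, so shifting it down and negating isolates the first matching row.

m = (df['issue_status'] == 10) & (df['market_phase'] == 0)
first = m & ~m.shift(fill_value=False).cummax()   # True only at the first match
df.loc[first, 'trade_type'] = 'OPENING_AUCTION'
df.loc[m & ~first, 'trade_type'] = 'CONTINUOUS_TRADING'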

How to get the name of a column at a specific index and print it into another column using python and pandas

I'm a Python newbie and need help with a specific task. For every row, I want to identify the values greater than 0 together with their column names, and write these value/column pairs, one below the other, into a 'summary' column of the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new column that sums up the row
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
   A  B  C  D summary
0  0  4  2  1  NoData
1  2  1  9  0  NoData
2  0  3  0  1  NoData
3  5  0  6  6  NoData
# get the number of rows and columns
row = len(df.index)
column = len(df.columns)

# If a value at a specific index is greater than 0, take the column name and
# the value at that index and print it into the column 'summary'. Also write
# all values greater than 0 within a row below each other.
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is a bit clear what I want to achieve. Here is a picture of how the result should look in the column 'summary'
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"

# Assign the summary column
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
   A  B  C  D     summary
0  0  4  2  1  4B\n2C\n1D
1  2  1  9  0  2A\n1B\n9C
2  0  3  0  1     3B\n1D
3  5  0  6  6  5A\n6C\n6D
It works for longer column names as well:
   one  two  three  four              summary
0    0    4      2     1  4two\n2three\n1four
1    2    1      9     0   2one\n1two\n9three
2    0    3      0     1          3two\n1four
3    5    0      6     6   5one\n6three\n6four
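For comparison, a more compact sketch of the same row-wise idea (my own, not from the thread); it assumes df still holds only the numeric columns:

df['summary'] = df.apply(
    lambda row: '\n'.join(f'{v}{c}' for c, v in row.items() if v > 0),
    axis=1)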
Try this:
import pandas as pd

table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print('\n\n-------------BREAK-----------\n\n')

def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if temp[x] <= 0:
            pass
        else:
            if templist == '':  # first value collected for this row
                templist = f"{temp[x]}{list_col[x]}"
            else:
                templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist

df['summary'] = df.apply(func, axis=1)
print(df)
Output:
   A  B  C  D
0  0  4  2  1
1  2  1  9  0
2  0  3  0  1
3  5  0  6  6

-------------BREAK-----------

   A  B  C  D     summary
0  0  4  2  1  4B\n2C\n1D
1  2  1  9  0  2A\n1B\n9C
2  0  3  0  1     3B\n1D
3  5  0  6  6  5A\n6C\n6D

unique and replace in python

I have a huge dataset with more than 100 columns containing non-null values that I want to replace, leaving all the null values as they are. Some columns, however, should stay untouched.
I am planning to do the following:
1) find the unique values in these columns
2) replace these values with 1
Problem:
1) something like this is hardly usable for 100+ columns:
np.unique(df[['Col1', 'Col2']].values)
2) how do I then apply loc to all these columns? The code below does not work:
df_2.loc[df_2[['col1','col2','col3']] !=0, ['col1','col2','col3']] = 1
Maybe there is a more reasonable and elegant way to solve the problem. Thanks!
Use DataFrame.mask:
c = ['col1','col2','col3']
df_2[c] = df_2[c].mask(df_2[c] != 0, 1)
Or compare with not-equal via DataFrame.ne and cast the mask to integers with DataFrame.astype:
df_2 = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [0, 5, 0, 5, 5, 0],
    'col2': [7, 8, 9, 0, 2, 0],
    'col3': [0, 0, 5, 7, 0, 0],
    'E': [5, 0, 6, 9, 2, 0],
})
c = ['col1', 'col2', 'col3']
df_2[c] = df_2[c].ne(0).astype(int)
print(df_2)
   A  col1  col2  col3  E
0  a     0     1     0  5
1  b     1     1     0  0
2  c     0     1     1  6
3  d     1     0     1  9
4  e     1     1     0  2
5  f     0     0     0  0
EDIT: To select columns by position, use DataFrame.iloc:
idx = np.r_[6:71,82]
df_2.iloc[:, idx] = df_2.iloc[:, idx].ne(0).astype(int)
Or with the first solution (mask):
df_2.iloc[:, idx] = df_2.iloc[:, idx].mask(df_2.iloc[:, idx] != 0, 1)
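In case np.r_ is unfamiliar: it concatenates slices and scalars into a single index array, which is convenient for picking scattered column positions. A tiny sketch (the positions are made up for illustration):

import numpy as np

idx = np.r_[2:5, 7]
print(idx)  # [2 3 4 7] -- the slice 2:5 is half-open, then 7 is appended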

Python: combine boolean columns in Pandas dataframes

I have the following data
attr1_A attr1_B attr1_C attr1_D attr2_A attr2_B attr2_C
1 0 0 1 1 0 0
0 1 1 0 0 0 1
0 0 0 0 0 1 0
1 1 1 0 1 1 0
I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A attr1_B attr1_others attr2_A attr2_others
1 0 1 1 0
0 1 1 0 1
0 0 0 0 1
1 1 1 1 1
In other words, for any group of attr columns, I want to retain a few known columns and combine the remaining ones (I don't know in advance how many remaining attr columns a group has).
I am thinking of doing each group separately: processing all attr1_*, and then attr2_* because there are a limited number of groups in my dataset, but many attr under each group.
What I can think of right now is to retrieve the "others" columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis='column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
Dataframes have the great ability to manipulate data in place, without having to write complex python logic.
To create your attr1_others and attr2_others columns, you can combine the columns with or conditions using this:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead, you wanted an and condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
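A small aside (mine, not from the answer): the four del statements can equally be written as a single drop call.

# Equivalent to the four del statements above (a sketch):
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])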
Create a list of kept columns. Drop those columns and assign the left-over columns to a new dataframe df1. Group df1's columns by the prefix before the underscore, call any along axis=1, add the suffix '_others', and assign the result to df2. Finally, join and sort_index:
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(columns=keep_cols)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
You can use a custom list to select the columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd

df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1

group values in intervals

I have a pandas series containing zeros and ones:
df1 = pd.Series([ 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
df1
Out[3]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 0
I would like to create a dataframe df2 that contains the start and the end of intervals with the same value, together with the associated value... df2 in this case should be:
df2
Out[5]:
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
My attempt was:
from operator import itemgetter
from itertools import groupby
a=[next(group) for key, group in groupby(enumerate(df1), key=itemgetter(1))]
df2 = pd.DataFrame(a,columns=['Start','Value'])
but I don't know how to get the 'End' indices.
You can group by a helper Series created by comparing df1 with its shifted self (shift) and taking the cumulative sum (cumsum). Then apply a custom function and finally reshape with unstack:
s = df1.ne(df1.shift()).cumsum()
df2 = (df1.groupby(s)
          .apply(lambda x: pd.Series([x.index[0], x.index[-1], x.iat[0]],
                                     index=['Start', 'End', 'Value']))
          .unstack().reset_index(drop=True))
print(df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
Another solution aggregates with agg using first and last, but some extra code is needed to massage the result into the desired shape:
s = df1.ne(df1.shift()).cumsum()
d = {'first': 'Start', 'last': 'End'}
df2 = df1.reset_index(name='Value') \
         .groupby([s, 'Value'])['index'] \
         .agg(['first', 'last']) \
         .reset_index(level=0, drop=True) \
         .reset_index() \
         .rename(columns=d) \
         .reindex(columns=['Start', 'End', 'Value'])
print(df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
You could use the pd.Series.diff() method so as to identify the starting indexes:
df2 = pd.DataFrame()
df2['Start'] = df1[df1.diff().fillna(1) != 0].index
Then compute end indexes from this:
df2['End'] = [e - 1 for e in df2['Start'][1:]] + [df1.index.max()]
And finally gather the associated values :
df2['Value'] = df1[df2['Start']].values
Output:
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
The thing you are looking for is getting the first and last values in a groupby:
import pandas as pd

def first_last(df):
    # helper illustrating the idea; the agg(['first', 'last']) below achieves the same
    return df.iloc[[0, -1]]

df = pd.DataFrame([3]*4 + [4]*4 + [1]*4 + [3]*3, columns=['value'])
print(df)

df['block'] = (df.value.shift(1) != df.value).astype(int).cumsum()
df = df.reset_index().groupby(['block', 'value'])['index'].agg(['first', 'last']).reset_index()
del df['block']
print(df)
You can group by using shift and cumsum and collect the first and last index of each run:
import numpy as np
df2 = df1.groupby((df1 != df1.shift()).cumsum()).apply(
    lambda x: np.ravel([x.index[0], x.index[-1], x.unique()]))
df2 = pd.DataFrame(df2.values.tolist()).rename(columns={0: 'Start', 1: 'End', 2: 'Value'})
You get
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
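On newer pandas (0.25+), the same run-length idea can be written with named aggregation; a sketch of my own, not from the thread:

import pandas as pd

df1 = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
run_id = df1.ne(df1.shift()).cumsum()  # label consecutive runs of equal values
df2 = (df1.rename('Value').rename_axis('idx').reset_index()
          .groupby(run_id.values)
          .agg(Start=('idx', 'first'), End=('idx', 'last'), Value=('Value', 'first'))
          .reset_index(drop=True))
print(df2)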
