When I use boolean indexing to access a pandas.DataFrame, it raises UserWarnings that don't affect the output. I'd like to know why these UserWarnings occur and what I should do to avoid them. Thanks for everyone's attention.
df = pandas.DataFrame([[k, ass4Dict[k], k[:2], k[-2:]] for k in ass4Dict])
df.columns = ['string', 'count', 'lstr', 'rstr']
df = df[df['count'] >= 10]
df = df[df['lstr'].map(lambda x: x in gram2Dict)][df['rstr'].map(lambda x: x in gram2Dict)]
df['lstrCNT'] = df['lstr'].map(lambda x: float(gram2Dict[x]))
df['rstrCNT'] = df['rstr'].map(lambda x: float(gram2Dict[x]))
df['conPow'] = df['lstrCNT'] * df['rstrCNT']
df['lstrPow'] = df['count'] / df['lstrCNT']
df['rstrPow'] = df['count'] / df['rstrCNT']
df['aux4Ratio'] = df['count'] / df['conPow']
df['aux4Log'] = df['aux4Ratio'].map(lambda x: -log(x))
df = df[df['aux4Log'] < 11][df['lstrPow'] >= 0.5][df['rstrPow'] >= 0.5]
....
沈钦言 359
纪小蕊 158
顾持钧 949
林晋修 642
4
0.256721019745 1.22976207733
ch_extract.py:153: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[df['lstr'].map(lambda x:x in gram2Dict)][df['rstr'].map(lambda x:x in gram2Dict)]
ch_extract.py:161: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[df['aux4Log'] < 11][df['lstrPow'] >= 0.5][df['rstrPow'] >= 0.5]
If we just take the last line and recreate it as the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,50,size=(50, 4)), columns=list('ABCD'))
df[df.A < 11][df.B >= 25][df.C >= 25]
The final line strings together a series of slices. After the first slice, each subsequent boolean key has to be reindexed, because the intermediate data frame no longer contains all the rows the key was built from.
In this case, the proper fix is to combine your boolean conditions into one expression:
df[(df.A < 11) & (df.B >= 25) & (df.C >= 25)]
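The same filter can also be written with DataFrame.query, which evaluates the combined condition in one step:
df.query("A < 11 and B >= 25 and C >= 25")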
Some other cases that might cause this warning would be as follows:
df[df.sort_values(['A'], ascending=[False]).duplicated(subset='B', keep='first')]
In which case, use the loc command:
df.loc[df.sort_values(['A'], ascending=[False]).duplicated(subset='B', keep='first')]
I have a dataframe and need to drop every column in which more than 55% of the values are repeated/duplicates.
Would anyone be able to assist me with how to do this?
Please try this:
Let df1 be your dataframe:
drop_columns = []
drop_threshold = 0.55  # define the percentage criterion for the drop
for col in df1.columns:
    # fraction of rows taken up by each distinct value in the column
    value_share = df1[col].value_counts(normalize=True)
    if (value_share > drop_threshold).any():
        drop_columns.append(col)
df1 = df1.drop(columns=drop_columns)
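As a quick sanity check on made-up data (the frame below is illustrative, not from the question):
import pandas as pd
# 'b' holds 'x' in 3 of 4 rows (75% > 55%), while 'a' has no repeats
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['x', 'x', 'x', 'y']})
# running the loop above drops 'b' and keeps only 'a'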
Let's use pd.Series.duplicated:
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
df[cols_to_keep]
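Note that duplicated does not flag the first occurrence of a value, so the 0.55 cutoff here counts only the repeats beyond the first. On illustrative data:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': ['x', 'x', 'x', 'x', 'y']})
# 'b': mean of duplicated is 3/5 = 0.6 > 0.55, so only 'a' is kept
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]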
If you're referring to columns in which the most common value is repeated in more than 55% of rows, here's a solution:
from collections import Counter
# assuming some DataFrame named df
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df = df.loc[:, bool_idx]
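On an illustrative frame, this keeps 'a' and drops 'b', since max(Counter(b).values()) is 3 and 3 < 4 * 0.55 is False:
from collections import Counter
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['x', 'x', 'x', 'y']})
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df.loc[:, bool_idx]  # only column 'a' remains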
If you're talking about non-unique values, this works:
bool_idx = df.apply(
    lambda x: sum(
        y for y in Counter(x).values() if y > 1
    ) < .55 * len(x),
    axis=0
)
df = df.loc[:, bool_idx]
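If it helps, the same non-unique count can be computed without Counter, since duplicated(keep=False) marks every member of a duplicate group; this is an equivalent sketch, not the author's code:
bool_idx = df.apply(lambda x: x.duplicated(keep=False).sum() < .55 * len(x), axis=0)
df = df.loc[:, bool_idx]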
I'm trying to apply pandas styling to my dataset and add a column containing a string describing the matching result.
This is what I want to achieve:
Link
Below is my code. An expert from Stack Overflow helped me apply df.style, so I believe the df.style part is correct based on my testing. However, how can I run iterrows() to check the cells in each column and return/store a string in the new column 'check'? Thank you so much. I'm trying to debug but can't display what I want.
df = pd.DataFrame([[10,3,1], [3,7,2], [2,4,4]], columns=list("ABC"))
df['check'] = None
def highlight(x):
    c1 = 'background-color: yellow'
    m = pd.concat([(x['A'] > 6), (x['B'] > 2), (x['C'] < 3)], axis=1)
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    return df1.mask(m, c1)
def check(v):
    for index, row in v[[A]].iterrows():
        if row[A] > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row[A]) + ">6"
            return A_check
    for index, row in v[[B]].iterrows():
        if row[B] > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row[B]) + ">2"
            return B_check
    for index, row in v[[C]].iterrows():
        if row[C] < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row[C]) + "<3"
            return C_check
df['check'] = df.apply(lambda v: check(v), axis=1)
df.style.apply(highlight, axis=None)
This is the error message I got:
NameError: name 'A' is not defined
My understanding is that the following produces what you are trying to achieve with the check function:
def check(v):
    row_str = 'row:{}, '.format(v.name)
    checks = []
    if v['A'] > 6:
        checks.append(row_str + '{:.1f}'.format(v['A']) + ">6")
    if v['B'] > 2:
        checks.append(row_str + '{:.1f}'.format(v['B']) + ">2")
    if v['C'] < 3:
        checks.append(row_str + '{:.1f}'.format(v['C']) + "<3")
    return '\n'.join(checks)
df['check'] = df.apply(check, axis=1)
Result (print(df)):
A B C check
0 10 3 1 row:0, 10.0>6\nrow:0, 3.0>2\nrow:0, 1.0<3
1 3 7 2 row:1, 7.0>2\nrow:1, 2.0<3
2 2 4 4 row:2, 4.0>2
(Replace \n with ' ' if you don't want the line breaks in the result.)
The axis=1 option in apply gives the check function one row of df as a Series, with the column names of df as its index (-> v). With v.name you get the corresponding row index. Therefore I don't see the need to use .iter.... Did I miss something?
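A tiny sketch of what apply hands to the function with axis=1 (my own illustration):
# each v is one row as a Series; v.name is the row label, v.index the column names
df.apply(lambda v: 'name={}, index={}'.format(v.name, list(v.index)), axis=1)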
There are a few mistakes in the program, which we will fix one by one:
Import pandas
import pandas as pd
In the function check(v), the variables A, B, and C are not defined; replace them with 'A', 'B', 'C'. Then v[['A']] becomes a Series, and to iterate over a Series we use items() (iteritems() in older pandas), not iterrows(); also, the index will be the column name in that Series. Making these replacements gives:
def check(v):
    truth = []
    for index, row in v[['A']].items():
        if row > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row) + ">6"
            truth.append(A_check)
    for index, row in v[['B']].items():
        if row > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row) + ">2"
            truth.append(B_check)
    for index, row in v[['C']].items():
        if row < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row) + "<3"
            truth.append(C_check)
    return '\n'.join(truth)
This should give the expected output, although you also need additional logic so that the check column doesn't get the yellow color. This answer makes minimal changes, but I recommend trying the axis argument to apply the style per column or row, as it seems more convenient. You can also refer to the pandas style guide.
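One way to keep the check column unstyled is the subset argument of Styler.apply; this is a sketch, with the highlight function unchanged:
df.style.apply(highlight, axis=None, subset=['A', 'B', 'C'])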
I have a pandas dataframe like the one in the following picture:
The goal is to select the fewest rows such that every column contains a "1" in at least one selected row. In this scenario, the final selection should be these two rows:
The algorithm should still work if I add columns or rows, and if I change the combination of 1s and 0s in any given row.
Use sum per row, then compare with Series.ge (>=) and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test 1 or 0 values first, compare with DataFrame.eq (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
import pandas as pd
from itertools import combinations

def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
        1. df: DataFrame to use
        2. cols: Columns of the binary variables in the DataFrame object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
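Called on made-up data (and with the self parameter dropped, since the method is shown outside its class), it would work like this:
df = pd.DataFrame({'c1': [1, 0, 0], 'c2': [0, 1, 0], 'c3': [0, 1, 1]})
# _getBestRowsFinalSelection(df, ['c1', 'c2', 'c3']) returns rows 0 and 1,
# the smallest combination whose column sums are all > 0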
I have a pandas dataframe with the following general format:
id,atr1,atr2,orig_date,fix_date
1,bolt,l,2000-01-01,nan
1,screw,l,2000-01-01,nan
1,stem,l,2000-01-01,nan
2,stem,l,2000-01-01,nan
2,screw,l,2000-01-01,nan
2,stem,l,2001-01-01,2001-01-01
3,bolt,r,2000-01-01,nan
3,stem,r,2000-01-01,nan
3,bolt,r,2001-01-01,2001-01-01
3,stem,r,2001-01-01,2001-01-01
The desired result would be the following:
id,atr1,atr2,orig_date,fix_date,failed_part_ind
1,bolt,l,2000-01-01,nan,0
1,screw,l,2000-01-01,nan,0
1,stem,l,2000-01-01,nan,0
2,stem,l,2000-01-01,nan,1
2,screw,l,2000-01-01,nan,0
2,stem,l,2001-01-01,2001-01-01,0
3,bolt,r,2000-01-01,nan,1
3,stem,r,2000-01-01,nan,1
3,bolt,r,2001-01-01,2001-01-01,0
3,stem,r,2001-01-01,2001-01-01,0
Any tips or tricks most welcome!
Update 2:
A better way to describe what I need: within a .groupby(['id','atr1','atr2']), create a new indicator column that flags the records in each group that meet the following criterion:
(df['orig_date'] < df['fix_date'])
I think this should work:
df['failed_part_ind'] = df.apply(lambda row: 1 if ((row['id'] == row['id']) &
                                                   (row['atr1'] == row['atr1']) &
                                                   (row['atr2'] == row['atr2']) &
                                                   (row['orig_date'] < row['fix_date']))
                                 else 0, axis=1)
Update: I think this is what you want:
import pandas as pd

def f(g):
    min_fix_date = g['fix_date'].min()
    if pd.isna(min_fix_date):  # no fix recorded in this group
        g['failed_part_ind'] = 0
    else:
        g['failed_part_ind'] = g['orig_date'].apply(lambda d: 1 if d < min_fix_date else 0)
    return g

df.groupby(['id', 'atr1', 'atr2']).apply(f)
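A vectorized sketch of the same idea, assuming orig_date and fix_date are parsed as datetimes (comparisons against NaT are False, which reproduces the all-NaN branch):
min_fix = df.groupby(['id', 'atr1', 'atr2'])['fix_date'].transform('min')
df['failed_part_ind'] = (df['orig_date'] < min_fix).astype(int)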
How do you replace values in a dataframe based on a condition over the entire data frame, not just one column? I have tried to use df.where, but this doesn't work as planned:
import operator
df = df.where(operator.and_(df > (-1 * .2), df < 0), 0)
df = df.where(df > 0, df * 1.2)
Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns in my dataframe, and multiply all values greater than zero by 1.2.
You've misunderstood the way pandas where works: it keeps the values of the original object where the condition is True and replaces them otherwise, so you can reverse your logic:
df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0, df * 1.2)
where allows you to have a one-line solution, which is great. I prefer to use a mask, like so:
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer breaking this into two lines because it is easier to read that way. You could force this onto a single line as well:
df[(df < 0) & (df >= -0.2)] = 0
Just another option.
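Both rules can also be chained in one expression with DataFrame.mask, which is where with the condition inverted; a sketch on made-up data:
import pandas as pd
df = pd.DataFrame({'x': [-0.3, -0.1, 0.5], 'y': [0.1, -0.05, -0.25]})
# zero out values in (-0.2, 0), then scale positive values by 1.2
df = df.mask((df > -0.2) & (df < 0), 0).mask(df > 0, df * 1.2)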