Switch boolean output to string in pandas python

When comparing two dataframes and putting the result back into a dataframe:
dfA = pd.DataFrame({'Column':[1,2,3,4]})
or in human readable form:
   Column
0       1
1       2
2       3
3       4
dfB = pd.DataFrame({'Column':[1,2,4,3]})
or in human readable form:
   Column
0       1
1       2
2       4
3       3
pd.DataFrame(dfA > dfB)
pandas outputs a DataFrame of True/False values.
   Column
0   False
1   False
2   False
3    True
Is it possible to change the labels from True/False to 'lower'/'higher'? I ask because I want to know whether the outcome is higher, lower, or equal: if the output is neither higher nor lower (True or False), it must be equal.

You may use map:
In [10]: pd.DataFrame(dfA > dfB)['Column'].map({True: 'higher', False: 'lower'})
Out[10]:
0     lower
1     lower
2     lower
3    higher
Name: Column, dtype: object
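One caveat: because (dfA > dfB) is only True/False, equal values get labeled 'lower' by the map above. A minimal sketch covering all three cases with np.sign (my extension, not part of the original answer):
import numpy as np
diff = np.sign(dfA['Column'] - dfB['Column'])   # 1, -1 or 0 per row
diff.map({1: 'higher', -1: 'lower', 0: 'equal'})
0     equal
1     equal
2     lower
3    higher
Name: Column, dtype: object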

I'd recommend np.where for performance/simplicity.
pd.DataFrame(np.where(dfA > dfB, 'higher', 'lower'), columns=['col'])
      col
0   lower
1   lower
2   lower
3  higher
You can also nest conditions if needed with np.where:
m1 = dfA > dfB
m2 = dfA < dfB
pd.DataFrame(
    np.where(m1, 'higher', np.where(m2, 'lower', 'equal')),
    columns=['col']
)
Or, follow a slightly different approach with np.select.
pd.DataFrame(
    np.select([m1, m2], ['higher', 'lower'], default='equal'),
    columns=['col']
)
      col
0   equal
1   equal
2   lower
3  higher

Related

pandas dataframe loc usage: what does supplying length of index to loc actually mean?

I have read about DataFrame.loc, but I could not understand why the length of the DataFrame (indexPD) is being supplied to loc as the first argument. What does this loc call actually do?
tp_DataFrame = pd.DataFrame(columns=list(props_file_data["PART_HEADER"].split("|")))
indexPD = len(tp_DataFrame)
tp_DataFrame.loc[indexPD, 'item_id'] = something
The first argument to .loc is a row label (or set of labels). With the default RangeIndex, the existing labels run from 0 to len(df) - 1, so len(df) is always a label that does not exist yet. Assigning through .loc with a missing label enlarges the DataFrame: df.loc[len(df), 'item_id'] = something appends a new row, filling the remaining columns with NaN. Consider this pandas DataFrame:
df = pd.DataFrame(zip([1,2,3], [4,5,6]), columns=['a', 'b'])
   a  b
0  1  4
1  2  5
2  3  6
Here len(df) is 3, so df.loc[len(df), 'b'] = -1 adds a new row with label 3; the unassigned column a gets NaN and is upcast to float (exact dtypes vary by pandas version):
     a    b
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN -1.0
More generally, the first argument specifies which row labels the assignment applies to. For instance, if you only want the first two rows to be modified, you can specify them like this:
df.loc[[0, 1], 'b'] = -1
   a  b
0  1 -1
1  2 -1
2  3  6
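A minimal sketch of the append idiom from the question (the column names here are hypothetical stand-ins for the props-file headers):
import pandas as pd

# start from an empty frame, as in the question
tp_DataFrame = pd.DataFrame(columns=['item_id', 'qty'])

for item_id, qty in [('A1', 3), ('B2', 5)]:
    idx = len(tp_DataFrame)              # first label past the current end
    tp_DataFrame.loc[idx, 'item_id'] = item_id
    tp_DataFrame.loc[idx, 'qty'] = qty   # same row, so no new label is created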

A faster method than "for" to scan a DataFrame - Python

I'm looking for a way (using a built-in pandas function) to scan a column of a DataFrame, comparing its own values at different indices.
Here is an example using a for loop. I have a DataFrame with a single column col_1, and I want to create a column col_2 with True/False in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the next N indices. I'd like to do the same operation without a for loop. I've already tried shift and df.loc, but the computational time is similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last occurrence. That's done by changing the keep argument of duplicated() to 'last' instead of the default 'first'.
As suggested by #QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False

Indexing rows by boolean expression and column by position pandas data frame

How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
   a  b
0  1  5
1  2  5
2  3  7
3  4  7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
   a  b
0  1  0
1  2  0
2  3  7
3  4  7
This works as expected on this small sample, but gives me the following warning on my real life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
One way is to use index of column header and loc (label based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer position indexing); np.where returns a tuple of index arrays (one per dimension) giving the positions where the condition is True:
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
This can also be attempted with a combination of loc and iloc, such as df.loc[df.iloc[:,1] == vals[0]].iloc[:, 1] = 0, but note that it does not work reliably: the chained indexing assigns to a (possible) copy and raises the same SettingWithCopyWarning. To stay fully positional, pass a boolean NumPy array to iloc instead:
df.iloc[(df.iloc[:, 1] == vals[0]).to_numpy(), 1] = 0
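Putting the recommended label-based form into a complete, runnable sketch:
import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})

# translate the positional column into a label once, then use .loc throughout
col = df.columns[1]
df.loc[df[col] == vals[0], col] = 0
print(df)
#    a  b
# 0  1  0
# 1  2  0
# 2  3  7
# 3  4  7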

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
            1           3               0       0
            2           1               1       5
First I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column for each unique cm. In this example there should be 2 new columns.
Finally, each new column should store the total count of non-zero original column values, i.e.:
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
            1           3               0       0    2    0
            2           1               1       5    2    1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3 so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because the original data set might have many cm variations.)
You can use:
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5
First you can filter the columns containing the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
Now you can count the non-zero values: cast df1 to a boolean DataFrame and sum, since True is converted to 1 and False to 0. You need the count per unique column name, so group by the columns and sum the values.
df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True
print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
You need the unique columns, which are added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Last, you can add the new columns (df[['cm1','cm2']]) from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
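On recent pandas (2.x) the axis=1 form of groupby is deprecated; a sketch of the same idea that extracts the cm token with a regex and groups the transposed frame instead (my adaptation, not from the original answer):
import pandas as pd

df = pd.DataFrame({'old_dt_cm1_tt': [1, 2], 'old_dm_cm1': [3, 1],
                   'old_rr_cm2_epf': [0, 1], 'old_gt': [0, 5]})

df1 = df.filter(regex='cm')
# pull the cm token out of each column name, e.g. 'old_dt_cm1_tt' -> 'cm1'
keys = df1.columns.str.extract(r'(cm\d+)', expand=False)
counts = df1.astype(bool).T.groupby(keys).sum().T
df[counts.columns] = counts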
Once you know which columns have cm in them (the list ind from the question), you can map them with a dict to the desired new column:
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
# len('cm') is hard-coded here, so you might as well use 2
This takes 'cm' plus the single character directly following it, so the mapping in this case would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts were historically unordered, so it may be preferable to add the new columns in order according to the mapped values:
import operator
for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
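A hedged, pandas-native alternative to the per-element list comprehensions (my rewrite, reusing the same col_map):
for new_col in sorted(set(col_map.values())):
    cols = [c for c, mapped in col_map.items() if mapped == new_col]
    # count non-zero entries across all columns mapped to this cm token
    df[new_col] = df[cols].ne(0).sum(axis=1)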

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer, how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them I was hoping someone else might be.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
  entrytype
0         0
1         1
2       NaN
3     asdas
4         2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
  entrytype
0         0
1         1
4         2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
  entrytype
0         0
1         1
4         2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
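Another common pattern (my suggestion, not from the answers above) is pd.to_numeric with errors='coerce': anything non-numeric becomes NaN, and the whole-number and range checks can then be vectorized:
import pandas as pd

df = pd.DataFrame({'entrytype': [0, 1, float('nan'), 'asdas', 2]})

num = pd.to_numeric(df['entrytype'], errors='coerce')
# keep rows that are numeric, whole, and inside the 0-2 range
df_clean = df[num.notna() & (num % 1 == 0) & num.between(0, 2)]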
We have multiple ways to do the same thing, but I found this method easy and efficient.
Quick examples:
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Boolean indexing keeps only the matching rows
df2 = df[df.Fee >= 24000]
# If the column name contains a space,
# specify it in quotes with bracket notation
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Select rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN in Discount
df2 = df[df.Discount.notnull()]
