unique and replace in python

I have a huge dataset with more than 100 columns containing non-null values that I want to replace (leaving all the null values as they are). Some columns, however, should stay untouched.
I am planning to do the following:
1) find the unique values in these columns
2) replace these values with 1
Problem:
1) something like this is barely usable for 100+ columns:
np.unique(df[['Col1', 'Col2']].values)
2) how do I then apply loc to all these columns? The code below does not work:
df_2.loc[df_2[['col1','col2','col3']] != 0, ['col1','col2','col3']] = 1
Maybe there is a more reasonable and elegant way to solve the problem. Thanks!

Use DataFrame.mask:
c = ['col1','col2','col3']
df_2[c] = df_2[c].mask(df_2[c] != 0, 1)
Or compare with not-equal using DataFrame.ne and cast the boolean mask to integers with DataFrame.astype:
import pandas as pd

df_2 = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [0, 5, 0, 5, 5, 0],
    'col2': [7, 8, 9, 0, 2, 0],
    'col3': [0, 0, 5, 7, 0, 0],
    'E': [5, 0, 6, 9, 2, 0],
})
c = ['col1', 'col2', 'col3']
df_2[c] = df_2[c].ne(0).astype(int)
print (df_2)
   A  col1  col2  col3  E
0  a     0     1     0  5
1  b     1     1     0  0
2  c     0     1     1  6
3  d     1     0     1  9
4  e     1     1     0  2
5  f     0     0     0  0
EDIT: To select columns by position use DataFrame.iloc:
import numpy as np

idx = np.r_[6:71, 82]
df_2.iloc[:, idx] = df_2.iloc[:, idx].ne(0).astype(int)
Or with the first solution:
df_2.iloc[:, idx] = df_2.iloc[:, idx].mask(df_2.iloc[:, idx] != 0, 1)
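If the columns to leave untouched are instead known by name, the column list can also be built with Index.difference (a sketch based on the example above, where A and E are the columns to leave alone):
cols = df_2.columns.difference(['A', 'E'])
df_2[cols] = df_2[cols].ne(0).astype(int)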

Related

Adding new rows with default value based on dataframe values into dataframe

I have data with a large number of columns:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
4 a z 1 0
...
98 a z 1 1
100 a x 1 0
I want to fill in the missing ID values with a default row that indicates the data is missing there. For example, here the missing IDs are 3 and 99, and hypothetically let's say the missing row data should look like the row for ID 100:
ID col1 col2 col3 ... col100
3 a x 1 0
99 a x 1 0
Expected output:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
3 a x 1 0
4 a z 1 0
...
98 a z 1 1
99 a x 1 0
100 a x 1 0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy()  # assume that this will net us row 100
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index=True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list, noresponserows, with the intention of concatenating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only one row, when I know my data has more than one row that needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, and why I can't append directly to the dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs. constructing a new one. The reason I take a copy of an existing row for noresponse is that there are a large number of columns, so it is easier than constructing a new row.
Say you have a dataframe like this:
>>> df
col1 col2 col100 ID
0 a x 0 1
1 a y 3 2
2 a z 1 4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1 a
col2 z
col100 1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
3 a z 1
Then use sort_index to sort the rows by the index:
>>> df = df.sort_index()
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
3 a z 1
4 a z 1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
ID col1 col2 col100
0 1 a x 0
1 2 a y 3
2 3 a z 1
3 4 a z 1
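A note on the loop in the question: df.append and pd.concat both return a new DataFrame rather than modifying df in place, so the result must be reassigned; that is why the original loop appeared to do nothing. Here is a sketch of a corrected loop, reusing the question's noresponse row and collecting the copies so that pd.concat runs only once (much cheaper than concatenating inside the loop):
rows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        rows.append(temp)

df = pd.concat([df] + rows, ignore_index=True)  # reassign the result
And if the IDs are unique (the question's sample has a duplicate ID 1, which would need handling first), the gap-filling can be done without any loop via reindex plus fillna with the default row from above (note that reindex introduces NaN, so numeric columns may come back as float):
filled = (df.set_index('ID')
            .reindex(range(1, df['ID'].max() + 1))  # NaN rows for missing IDs
            .fillna(default_row)                    # fill from the default row
            .reset_index())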

Best way to get same number of each case of binary data in pandas

Say there's a DataFrame df with columns A and B:
A B
0 1 1
1 0 1
2 0 1
3 0 1
4 1 0
If I want to 'equalize' the cases of column A I just have to drop one of the rows [1, 2, 3]. If I want to equalize the cases of col B then I'd have to drop three of the rows [0, 1, 2, 3].
However, if I want to equalize the cases of both columns so that the general imbalance is minimized how could I do that through pandas? Bear in mind that efficiency is very important.
Use:
def remove(df, col):
    # get counts of each value in the column
    s = df[col].value_counts()
    # how many rows of the majority value must be removed
    d = s.sub(s.min())
    # drop a random sample of that size from the majority value's rows
    return df.drop(df[df[col].eq(d.idxmax())].sample(d.max()).index)
df = remove(df, 'A')
print (df)
A B
0 1 1
1 0 1
3 0 1
4 1 0
df = remove(df, 'B')
print (df)
A B
3 0 1
4 1 0
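On newer pandas (1.1+), the same per-column equalization can be written with GroupBy.sample, which downsamples every class of a column to the size of its smallest class. A sketch of the same idea (still sequential per column, not a joint minimization over both):
def equalize(df, col):
    # size of the smallest class determines the sample size
    n = df[col].value_counts().min()
    return df.groupby(col).sample(n=n)

df = equalize(equalize(df, 'A'), 'B')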

How to exclude few columns and replace negative values in Big data?

I have a dataframe like as shown below
import pandas as pd
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})
In my real data, I have more than 100 columns. What I would like to do is, excluding two columns, replace the negative values in all the other columns with zero.
I tried this, but it applies to all columns:
df[df < 0] = 0
Is the only way to have all the column names in a list and run through a loop, like shown below?
col_list = ['a1','a2','a3','a4', ..., 'a100']  # columns a21 and a22 are excluded from the list
for col in col_list:
    df.loc[df[col] < 0, col] = 0
As you can see, it's lengthy and inefficient.
Can you help me with any efficient approach to do this?
The problem is that df[df < 0] = 0 cannot be restricted to specified column names (df[col_list][df[col_list] < 0] = 0 is chained assignment and does not modify the original), so it is necessary to use DataFrame.mask:
col_list = df.columns.difference(['a21','a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
EDIT:
To use the numeric columns without a21 and a22, use DataFrame.select_dtypes with Index.difference:
import numpy as np

df = pd.DataFrame({
    'a21': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [-7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, -7, 1, 'a'],  # object column because of the last 'a'
    'E': [5, 3, -6, 9, 2, -4],
    'a22': list('aaabbb'),
})
col_list = df.select_dtypes(np.number).columns.difference(['a21', 'a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
print (df)
  a21  B  C  D  E a22
0   a  4  0  1  5   a
1   b  5  8  3  3   a
2   c  4  9  5  0   a
3   d  5  4 -7  9   b
4   e  5  2  1  2   b
5   f  4  3  a  0   b
How about simply clipping at 0?
df[col_list] = df[col_list].clip(0)
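clip also composes with the same column selection as above (a sketch reusing the a21/a22 exclusions and select_dtypes from the EDIT):
col_list = df.select_dtypes(np.number).columns.difference(['a21', 'a22'])
df[col_list] = df[col_list].clip(lower=0)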

Pandas Python : how to create multiple columns from a list

I have a list of columns to create:
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero:
df[new_cols] = 0
Get error:
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate, as I want to create them...
EDIT: This is a duplicate of this question: Add multiple empty columns to pandas DataFrame. However, I keep this one too because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer over there.
EDIT 2 : While the accepted answer is the most simple, interesting one-liner solutions were posted below
You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
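A loop-free alternative, if useful, is reindex with fill_value, which appends all the missing columns in one call (a sketch; it assumes none of new_cols already exists in df):
df = df.reindex(columns=df.columns.tolist() + new_cols, fill_value=0)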
Use assign with a dictionary:
df = pd.DataFrame({
    'A': ['a','a','a','a','b','b','b','c','d'],
    'B': list(range(9))
})
print (df)
   A  B
0  a  0
1  a  1
2  a  2
3  a  3
4  b  4
5  b  5
6  b  6
7  c  7
8  d  8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for?
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
Try looping through the column names before creating the column:
for col in new_cols:
    df[col] = 0
We can use the apply function to loop through a column in the dataframe and assign each element of its lists to a new field.
For instance, for a dataframe with a column named keys whose values are lists like
[10, 20, 30]
In your case, since it's all 0, we can directly assign 0 instead of looping through. But if we have values, we can populate them as below:
...
df['new_01'] = df['keys'].apply(lambda x: x[0])
df['new_02'] = df['keys'].apply(lambda x: x[1])
df['new_03'] = df['keys'].apply(lambda x: x[2])
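If every list in keys has the same length, the three apply calls can also be replaced by a single constructor call (a sketch using the hypothetical keys column from above):
new = pd.DataFrame(df['keys'].tolist(), index=df.index)
df[['new_01', 'new_02', 'new_03']] = new.to_numpy()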

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B; columns B and C; or any other number of columns.
I have written code to save the column names (other than A) in a list, as well as the unique permutations of the values in those columns in another list. For instance, in this example, columns B and C are saved in col:
col = ['B','C']
The permutations in this simple df are (1, 7), (2, 8), and (3, 9). For simplicity, assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations of the columns other than A.)
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's solution above works well if I only have two columns. I need a solution that works regardless of the size of the df (which will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
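To tie this back to the question's plan of looping over permutations, here is a minimal sketch that builds one filtered subset per unique permutation, using the all(axis=1) comparison above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Bob', 'Jean', 'Sally', 'Sue'],
                   'B': [1, 2, 3, 2],
                   'C': [7, 8, 9, 8]})
col = ['B', 'C']

# one filtered frame per unique (B, C) combination
subsets = {tuple(p): df[(df[col] == p).all(axis=1)]
           for p in df[col].drop_duplicates().to_numpy()}
print (subsets[(2, 8)])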
