unique and replace in python

I have a huge dataset with more than 100 columns containing non-null values that I want to replace (leaving all the null values as they are). Some columns, however, should stay untouched.
I am planning to do the following:
1) find the unique values in these columns
2) replace these values with 1
Problem:
1) something like this is barely usable for 100+ columns:
np.unique(df[['Col1', 'Col2']].values)
2) how do I then apply loc to all these columns? The code below does not work:
df_2.loc[df_2[['col1','col2','col3']] != 0, ['col1','col2','col3']] = 1
Maybe there is a more reasonable and elegant way to solve the problem. Thanks!

Use DataFrame.mask:
c = ['col1','col2','col3']
df_2[c] = df_2[c].mask(df_2[c] != 0, 1)
Or compare with not-equal using DataFrame.ne and cast the boolean mask to integers with DataFrame.astype:
import pandas as pd

df_2 = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [0, 5, 0, 5, 5, 0],
    'col2': [7, 8, 9, 0, 2, 0],
    'col3': [0, 0, 5, 7, 0, 0],
    'E': [5, 0, 6, 9, 2, 0],
})
c = ['col1', 'col2', 'col3']
df_2[c] = df_2[c].ne(0).astype(int)
print (df_2)
   A  col1  col2  col3  E
0  a     0     1     0  5
1  b     1     1     0  0
2  c     0     1     1  6
3  d     1     0     1  9
4  e     1     1     0  2
5  f     0     0     0  0
EDIT: To select columns by position use DataFrame.iloc:
import numpy as np

idx = np.r_[6:71, 82]
df_2.iloc[:, idx] = df_2.iloc[:, idx].ne(0).astype(int)
Or with the first solution:
df_2.iloc[:, idx] = df_2.iloc[:, idx].mask(df_2.iloc[:, idx] != 0, 1)
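If the columns to leave untouched are instead known by name, the column list can also be built with Index.difference (a sketch based on the example above, where A and E are the columns to leave alone):
cols = df_2.columns.difference(['A', 'E'])
df_2[cols] = df_2[cols].ne(0).astype(int)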

Related

Adding new rows with default value based on dataframe values into dataframe

I have data with a large number of columns:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
4 a z 1 0
...
98 a z 1 1
100 a x 1 0
I want to fill in the missing ID values with a default row that indicates the data is missing there. For example, here the missing IDs are 3 and 99, and hypothetically let's say the missing row data should look like the row for ID 100:
ID col1 col2 col3 ... col100
3 a x 1 0
99 a x 1 0
Expected output:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
3 a x 1 0
4 a z 1 0
...
98 a z 1 1
99 a x 1 0
100 a x 1 0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy()  # assume that this will net us row 100
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index=True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list, noresponserows, with the intention of concatenating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only one row, when I know my data has more than one row that needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, and why I can't append directly to the dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs. constructing a new one. The reason I take a copy of an existing row for noresponse is that there are a large number of columns, so it is easier than constructing a new row.
Say you have a dataframe like this:
>>> df
col1 col2 col100 ID
0 a x 0 1
1 a y 3 2
2 a z 1 4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1 a
col2 z
col100 1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
3 a z 1
Then use sort_index to sort the rows by the index:
>>> df = df.sort_index()
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
3 a z 1
4 a z 1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
ID col1 col2 col100
0 1 a x 0
1 2 a y 3
2 3 a z 1
3 4 a z 1
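A note on the loop in the question: df.append and pd.concat both return a new DataFrame rather than modifying df in place, so the result must be reassigned; that is why the original loop appeared to do nothing. Here is a sketch of a corrected loop, reusing the question's noresponse row and collecting the copies so that pd.concat runs only once (much cheaper than concatenating inside the loop):
rows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        rows.append(temp)

df = pd.concat([df] + rows, ignore_index=True)  # reassign the result
And if the IDs are unique (the question's sample has a duplicate ID 1, which would need handling first), the gap-filling can be done without any loop via reindex plus fillna with the default row from above (note that reindex introduces NaN, so numeric columns may come back as float):
filled = (df.set_index('ID')
            .reindex(range(1, df['ID'].max() + 1))  # NaN rows for missing IDs
            .fillna(default_row)                    # fill from the default row
            .reset_index())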

Best way to get same number of each case of binary data in pandas

Say there's a DataFrame df with columns A and B:
A B
0 1 1
1 0 1
2 0 1
3 0 1
4 1 0
If I want to 'equalize' the cases of column A I just have to drop one of the rows [1, 2, 3]. If I want to equalize the cases of col B then I'd have to drop three of the rows [0, 1, 2, 3].
However, if I want to equalize the cases of both columns so that the general imbalance is minimized how could I do that through pandas? Bear in mind that efficiency is very important.
Use:
def remove(df, col):
    # get counts of each value in the column
    s = df[col].value_counts()
    # how many rows of the majority value must be removed
    d = s.sub(s.min())
    # drop a random sample of that size from the majority value's rows
    return df.drop(df[df[col].eq(d.idxmax())].sample(d.max()).index)
df = remove(df, 'A')
print (df)
A B
0 1 1
1 0 1
3 0 1
4 1 0
df = remove(df, 'B')
print (df)
A B
3 0 1
4 1 0
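On newer pandas (1.1+), the same per-column equalization can be written with GroupBy.sample, which downsamples every class of a column to the size of its smallest class. A sketch of the same idea (still sequential per column, not a joint minimization over both):
def equalize(df, col):
    # size of the smallest class determines the sample size
    n = df[col].value_counts().min()
    return df.groupby(col).sample(n=n)

df = equalize(equalize(df, 'A'), 'B')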

How to exclude few columns and replace negative values in Big data?

I have a dataframe like as shown below
import pandas as pd
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})
In my real data, I have more than 100 columns. What I would like to do is, excluding two columns, replace the negative values in all the other columns with zero.
I tried this, but it applies to all columns:
df[df < 0] = 0
Is the only way to have all the column names in a list and run through a loop, like shown below?
col_list = ['a1','a2','a3','a4', ..., 'a100']  # columns a21 and a22 are excluded from the list
for col in col_list:
    df.loc[df[col] < 0, col] = 0
As you can see, it's lengthy and inefficient.
Can you help me with any efficient approach to do this?
The problem is that df[df < 0] = 0 cannot be restricted to specified column names (df[col_list][df[col_list] < 0] = 0 is chained assignment and does not modify the original), so it is necessary to use DataFrame.mask:
col_list = df.columns.difference(['a21','a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
EDIT:
To use the numeric columns without a21 and a22, use DataFrame.select_dtypes with Index.difference:
import numpy as np

df = pd.DataFrame({
    'a21': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [-7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, -7, 1, 'a'],  # object column because of the last 'a'
    'E': [5, 3, -6, 9, 2, -4],
    'a22': list('aaabbb'),
})
col_list = df.select_dtypes(np.number).columns.difference(['a21', 'a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
print (df)
  a21  B  C  D  E a22
0   a  4  0  1  5   a
1   b  5  8  3  3   a
2   c  4  9  5  0   a
3   d  5  4 -7  9   b
4   e  5  2  1  2   b
5   f  4  3  a  0   b
How about simply clipping at 0?
df[col_list] = df[col_list].clip(0)
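clip also composes with the same column selection as above (a sketch reusing the a21/a22 exclusions and select_dtypes from the EDIT):
col_list = df.select_dtypes(np.number).columns.difference(['a21', 'a22'])
df[col_list] = df[col_list].clip(lower=0)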

Pandas Python : how to create multiple columns from a list

I have a list of columns to create:
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero:
df[new_cols] = 0
Get error:
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate, as I want to create them...
EDIT: This is a duplicate of this question: Add multiple empty columns to pandas DataFrame. However, I keep this one too because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer over there.
EDIT 2 : While the accepted answer is the most simple, interesting one-liner solutions were posted below
You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
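A loop-free alternative, if useful, is reindex with fill_value, which appends all the missing columns in one call (a sketch; it assumes none of new_cols already exists in df):
df = df.reindex(columns=df.columns.tolist() + new_cols, fill_value=0)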
Use assign with a dictionary:
df = pd.DataFrame({
    'A': ['a','a','a','a','b','b','b','c','d'],
    'B': list(range(9))
})
print (df)
   A  B
0  a  0
1  a  1
2  a  2
3  a  3
4  b  4
5  b  5
6  b  6
7  c  7
8  d  8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for?
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
Try looping through the column names before creating the column:
for col in new_cols:
    df[col] = 0
We can use the apply function to loop through a column in the dataframe and assign each element of its lists to a new field.
For instance, for a dataframe with a column named keys whose values are lists like
[10, 20, 30]
In your case, since it's all 0, we can directly assign 0 instead of looping through. But if we have values, we can populate them as below:
...
df['new_01'] = df['keys'].apply(lambda x: x[0])
df['new_02'] = df['keys'].apply(lambda x: x[1])
df['new_03'] = df['keys'].apply(lambda x: x[2])
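If every list in keys has the same length, the three apply calls can also be replaced by a single constructor call (a sketch using the hypothetical keys column from above):
new = pd.DataFrame(df['keys'].tolist(), index=df.index)
df[['new_01', 'new_02', 'new_03']] = new.to_numpy()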

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B; columns B and C; or any other number of columns.
I have written code to save the column names (other than A) in a list, as well as the unique permutations of the values in those columns in another list. For instance, in this example, columns B and C are saved in col:
col = ['B','C']
The permutations in this simple df are (1, 7), (2, 8), and (3, 9). For simplicity, assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations of the columns other than A.)
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's solution above works well if I only have two columns. I need a solution that works regardless of the size of the df (which will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
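To tie this back to the question's plan of looping over permutations, here is a minimal sketch that builds one filtered subset per unique permutation, using the all(axis=1) comparison above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Bob', 'Jean', 'Sally', 'Sue'],
                   'B': [1, 2, 3, 2],
                   'C': [7, 8, 9, 8]})
col = ['B', 'C']

# one filtered frame per unique (B, C) combination
subsets = {tuple(p): df[(df[col] == p).all(axis=1)]
           for p in df[col].drop_duplicates().to_numpy()}
print (subsets[(2, 8)])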
