Replace numeric values in a pandas dataframe - python

Problem: Polluted Dataframe.
Details: The frame consists of NaNs, string values whose meaning I know, and numeric values.
Task: Replacing the numeric values with NaNs.
Example
import numpy as np
import pandas as pd
df = pd.DataFrame([['abc', 'cdf', 1], ['k', 'sum', 'some'], [1000, np.nan, 'nothing']])
out:
0 1 2
0 abc cdf 1
1 k sum some
2 1000 NaN nothing
Attempt 1 (Does not work, because regex only looks at string cells)
df.replace({r'\d+': np.nan}, regex=True)
out:
0 1 2
0 abc cdf 1
1 k sum some
2 1000 NaN nothing
Preliminary Solution
val_set = set()
[val_set.update(i) for i in df.values]

def dis_nums(myset):
    str_s = set()
    num_replace_dict = {}
    for i in range(len(myset)):
        val = myset.pop()
        if type(val) == str:
            str_s.update([val])
        else:
            num_replace_dict.update({val: np.nan})
    return str_s, num_replace_dict

strs, rpl_dict = dis_nums(val_set)
df.replace(rpl_dict, inplace=True)
out:
0 1 2
0 abc cdf NaN
1 k sum some
2 NaN NaN nothing
Question
Is there any easier/ more pleasant solution?

You can do a round-trip conversion to str to replace the values, then convert back.
df.astype('str').replace({r'\d+': np.nan, 'nan': np.nan}, regex=True).astype('object')
# the 'nan' entry makes sure already existing np.nan values are not lost by the str conversion
Output
0 1 2
0 abc cdf NaN
1 k sum some
2 NaN NaN nothing
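One caveat (my addition): in regex mode an unanchored pattern also matches inside longer strings, and with a non-string replacement the whole cell is replaced on any match, so a cell like 'abc123', or 'banana' (which contains 'nan'), would be wiped out entirely; plain \d+ also misses floats. Anchoring the patterns restricts the replacement to cells that are numeric end to end:
# a hedged variant with anchored patterns; \.?\d* also catches floats like '10.5'
df.astype('str').replace({r'^\d+\.?\d*$': np.nan, r'^nan$': np.nan}, regex=True).astype('object')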

You can use a loop to go through each column and check each item. If it is an integer or a float, replace it with np.nan. This can be done easily with the map function applied to the column.
You can change the condition of the if to incorporate any data type you want.
for x in df.columns:
    df[x] = df[x].map(lambda item: np.nan if type(item) == int or type(item) == float else item)
This is a naive approach, and there are likely better solutions than this!
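As an alternative sketch (my addition, not from the original answers), you could build a boolean mask of the numeric cells and hand it to mask, avoiding the explicit loop over columns:
import numbers

import numpy as np
import pandas as pd

df = pd.DataFrame([['abc', 'cdf', 1], ['k', 'sum', 'some'], [1000, np.nan, 'nothing']])

# applymap (DataFrame.map in pandas >= 2.1) tests every cell;
# mask replaces the True cells with NaN
is_num = df.applymap(lambda v: isinstance(v, numbers.Number))
print(df.mask(is_num))
#      0    1        2
# 0  abc  cdf      NaN
# 1    k  sum     some
# 2  NaN  NaN  nothing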

Related

Create a numerical column out of a url, with 1 for url present and 0 for all NaNs

I'm trying to create a column that identifies whether a url is present or not in an existing column called "links". I'd like all NaN values to become zeros and any urls to be denoted as 1 in the new column. I tried the following but was unable to get the correct values.
def url(x):
    if x == 'NaN':
        return 0
    else:
        return 1
df['url1'] = df['links'].apply(url)
df.head()
You can use pd.isnull(x) instead of the x == 'NaN' comparison; a missing value is the float NaN, not the string 'NaN'.
import pandas as pd
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
See my comment, but the simplest and most performant thing you can do to get your desired output is to use a pandas method:
input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links' : [np.nan, 'a', 'b', np.nan]})
In[1]:
links
0 NaN
1 a
2 b
3 NaN
output:
df['url1'] = df['links'].notnull().astype(int)
df
Out[801]:
links url1
0 NaN 0
1 a 1
2 b 1
3 NaN 0
notnull() returns True or False, and .astype(int) changes True to 1 and False to 0, because True and False are boolean values with underlying values of 1 and 0, respectively, even though they display as True and False. So, when you change the data type to int, each entry shows its underlying integer value of 1 or 0.
Related to my comment: the string 'True' would also not be equal to True, and 'False' not equal to False, just like 'NaN' does not equal NaN (notice the quotes versus no quotes).
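A quick demonstration of that point (a minimal sketch):
import numpy as np
import pandas as pd

print('NaN' == np.nan)    # False: a string is never equal to the float NaN
print(np.nan == np.nan)   # False: NaN does not even equal itself
print(pd.isnull(np.nan))  # True: the reliable way to test for missing values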

How to remove certain values from a pandas dataframe, which are not in a list?

By writing the following code I create a dataframe
data = [['A', 'B','D'], ['A','D'], ['F', 'G','C','B','A']]
df = pd.DataFrame(data)
df
My goal is to remove the values from the dataframe that are not in the list below.
list_items = ['A','B','C']
My expected output (shown as a screenshot in the original post, not reproduced here) keeps only the listed values, as in the answers below.
I have tried traversing the values in loops and checking them one by one, but the dataframe is very large in size (9108, 1616) and the list has over 130 items that need to be checked; in that case the code takes too long to run. Please suggest the most efficient way to achieve the expected output.
I don't think doing it in pandas is a good idea, as columns don't matter here. It's easier to do it with lists, which you can convert back to a pandas dataframe at the end if you really need to.
# convert df to list of lists
data = df.values.tolist()
# filter each element of the list to contain only list_items values
data_filtered = [ [el for el in l if el in list_items] for l in data]
# convert back to dataframe
df_filtered = pd.DataFrame(data_filtered)
print(df_filtered)
# 0 1 2
#0 A B None
#1 A None None
#2 C B A
Let us try not to use a for loop:
s = df.where(df.isin(list_items)).reset_index().melt('index').dropna()
s = s.assign(Key=s.groupby('index').cumcount()).pivot(index='index', columns='Key', values='value')
Key 0 1 2
index
0 A B NaN
1 A NaN NaN
2 C B A
Method two, which is not good for a large dataframe:
s=df.where(df.isin(list_items)).T.apply(lambda x : sorted(x,key=pd.isnull)).T.dropna(thresh=1, axis=1)
0 1 2
0 A B NaN
1 A NaN NaN
2 C B A
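Both variants of the previous answer start from the same df.where(df.isin(...)) masking step; a minimal sketch (my addition) of what it produces on the sample data, before any reshaping:
import pandas as pd

data = [['A', 'B', 'D'], ['A', 'D'], ['F', 'G', 'C', 'B', 'A']]
df = pd.DataFrame(data)
list_items = ['A', 'B', 'C']

# isin builds a boolean mask; where keeps the True cells and NaNs the rest
print(df.where(df.isin(list_items)))
#      0    1    2    3    4
# 0    A    B  NaN  NaN  NaN
# 1    A  NaN  NaN  NaN  NaN
# 2  NaN  NaN    C    B    A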

Compare the columns of a dataframe in reverse order and create a new column with the index of the column which has value 0

I have imported data from an Excel file into my program and then used set_index to set 'rule_id' as the index. I used this code:
df = pd.read_excel('stack.xlsx')
df = df.set_index('rule_id')
and the data looks like this (screenshot not reproduced; it has columns a, b and c, as in the answers below):
Now I want to compare one column with another, but in reverse order. For example, I want to compare the 'c' data with 'b', then compare 'b' with 'a', and so on, and create another column after each comparison which contains the index of the column where the value was zero. If both columns have the value 0, then Null should be written to the new column, and if both compared values are non-zero, Null should also be written.
The result should look like this (screenshot not reproduced; see the comp1/comp2 columns in the answers below):
I am not able to work out how to approach this problem; if you could help me, that would be great.
Edit: a minor edit. I have imported the data from an Excel file which looks like this (screenshot not reproduced); this is just a part of the data, and there are multiple columns:
Then I used pivot_table to manipulate the data as per my requirement, using this code:
df = df.pivot_table(index='rule_id', columns=['date'], values='rid_fc', fill_value=0)
and my data looks like this now (screenshot not reproduced):
Now I want to compare one column with another, but in reverse order. For example, I want to compare the '2019-04-25 16:36:32' data with '2019-04-25 16:29:05', then compare '2019-04-25 16:29:05' with '2019-04-25 16:14:14', and so on, creating another column after each comparison which contains the index of the column where the value was zero. If both columns have the value 0, then Null should be written to the new column, and likewise if both compared values are non-zero.
IIUC you can try with:
# map each column name to its positional index
d = {i: e for e, i in enumerate(df.columns)}
m1 = df[['c', 'b']]
m2 = df[['b', 'a']]
# a boolean frame dotted with its column labels concatenates the names of the True cells
df['comp1'] = m1.eq(0).dot(m1.columns).map(d)
m3 = m2.eq(0).dot(m2.columns)
# if not exactly one column is zero, there is no single column name -> NaN
m3.loc[m3.str.len() != 1] = np.nan
df['comp2'] = m3.map(d)
print(df)
a b c comp1 comp2
rule_id
51234 0 7 6 NaN 0.0
53219 0 0 1 1.0 NaN
56195 0 2 2 NaN 0.0
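In case the eq(0).dot(...) trick is unfamiliar: a boolean DataFrame dotted with its column labels concatenates the labels of the True cells row by row, because True * 'b' == 'b' and False * 'b' == ''. A minimal illustration (my addition):
import pandas as pd

m = pd.DataFrame({'c': [6, 1, 2], 'b': [7, 0, 2]})
print(m.eq(0).dot(m.columns))
# 0         <- no zeros in the row: empty string
# 1    b    <- only column 'b' is zero
# 2
# dtype: object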
I suggest using numpy: compare shifted values with logical_and, build the positions from a reversed np.arange, and combine them with numpy.where in the DataFrame constructor:
df = pd.DataFrame({
    'a': [0, 0, 0],
    'b': [7, 0, 2],
    'c': [6, 1, 2],
})

# reverse the column order of the array
x = df.values[:, ::-1]
# compare: one column equal 0 and the column before it (in reversed order) not equal 0
a = np.logical_and(x[:, 1:] == 0, x[:, :-1] != 0)
# create a range from the top position down to 0
b = np.arange(a.shape[1] - 1, -1, -1)
# new column names
c = [f'comp{i+1}' for i in range(x.shape[1] - 1)]
# take positions from b where the boolean mask a is True, NaN elsewhere
df1 = pd.DataFrame(np.where(a, b[None, :], np.nan), columns=c, index=df.index)
print (df1)
comp1 comp2
0 NaN 0.0
1 1.0 NaN
2 NaN 0.0
You can make use of this code snippet. I did not have time to generalize it with loops etc., so please adapt it to your requirements.
import pandas as pd
import numpy as np
# Data
print(df.head())
a b c
0 0 7 6
1 0 0 1
2 0 2 2
cp = df.copy()
cp[cp != 0] = 1                  # indicator frame: 1 where non-zero, 0 where zero
cp['comp1'] = cp['a'] + cp['b']  # sum of two indicators: 0, 1 or 2
cp['comp2'] = cp['b'] + cp['c']
# Logic: map sum 0 -> 1, sum 1 -> NaN, sum 2 -> 0 (this also clobbers a, b, c)
cp = cp.replace([0, 1, 2], [1, np.nan, 0])
# restore the original data columns
cp[['a', 'b', 'c']] = df[['a', 'b', 'c']]
# Results
print(cp.head())
a b c comp1 comp2
0 0 7 6 NaN 0.0
1 0 0 1 1.0 NaN
2 0 2 2 NaN 0.0

How to remove blanks/NA's from dataframe and shift the values up

I have a huge dataframe which has values and blanks/NA's in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider the sample dataframe below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
          0         1         2         3
0  1.857476 -0.540645 -0.462941 -0.600606
1  0.000267  0.036393 -0.803889  0.492480
2  0.566922 -0.221294 -1.584926  0.527973
3 -0.243182            1.403478  2.278294
4                                1.574097
I want the NaN to be removed and the next value moved up. df.shift was not helpful. I tried multiple loops and if statements and achieved the desired result, but is there a better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
And then, if you need to replace NaN with empty strings, note that this creates mixed values (strings with numerics), so some functions can break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
          0         1         2         3
0  -1.74977  0.514219   1.15304 -0.252436
1  0.981321  0.816847 -0.458027 -1.070043
2 -0.583595   1.02973  0.672721  0.435163
3  -0.53128           -0.438136 -0.104411
4                               -1.118318
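To illustrate that caveat (a sketch, reusing the df defined above):
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print(df1.dtypes)  # columns that received '' are now object dtype
# numeric operations such as df1.mean() then fail or skip those columns,
# depending on the pandas version, since floats are mixed with strings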
A numpy approach
The idea is to sort the columns by np.isnan so that NaNs are put last. I use kind='mergesort' to preserve the order within the non-NaN values. Finally, I index the array and reassign it. I follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
          0         1         2         3
0   1.85748 -0.540645 -0.462941 -0.600606
1  0.000267  0.036393 -0.803889  0.492480
2  0.566922 -0.221294  -1.58493  0.527973
3 -0.243182             1.40348  2.278294
4                                1.574097
If you didn't want to alter the dataframe in place
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Naive time test:
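The benchmark results from the original answer are not reproduced here; a minimal sketch of how one might time the apply-based solution against this numpy one (my reconstruction, no results claimed):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4))
df[df > 1] = np.nan  # sprinkle in some missing values

def pandas_way(d):
    return d.apply(lambda x: pd.Series(x.dropna().values))

def numpy_way(d):
    v = d.values.copy()  # copy so repeated runs don't mutate the input
    i = np.arange(v.shape[1])
    a = np.isnan(v).argsort(0, kind='mergesort')
    return pd.DataFrame(v[a, i], d.index, d.columns)

print(timeit.timeit(lambda: pandas_way(df), number=100))
print(timeit.timeit(lambda: numpy_way(df), number=100))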
Adding on to the solution by piRSquared:
This variant shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull instead of np.isnan:
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n, so we are pairing up every entry (i, j) of a and b: the result at (i, j) is the element of v at row a[i][j] and column b[i][j]. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
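A tiny concrete example of this kind of fancy indexing (my addition):
import numpy as np

v = np.array([[10, 11],
              [20, 21]])
a = [[0, 0], [1, 1]]  # row index for every output position
b = [[1, 0], [0, 1]]  # column index for every output position
print(v[a, b])
# [[11 10]
#  [20 21]]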
In piRSquared's solution, his i is a 1-D array, not a matrix, so it is broadcast across the v.shape[0] rows, i.e. used once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column-by-column, treating each column as a series. Using dropna() removes NaNs but doesn't change the index of the remaining numbers, so when this column is added back to the dataframe the numbers go back to their original positions as their indices are still the same, and the empty spaces are filled with NaN, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
(drop = True) for reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!

pandas dataframe remove constant column

I have a dataframe that may or may not have columns that are the same value. For example
row A B
1 9 0
2 7 0
3 5 0
4 2 0
I'd like to return just
row A
1 9
2 7
3 5
4 2
Is there a simple way to identify if any of these columns exist and then remove them?
I believe this option will be faster than the other answers here as it will traverse the data frame only once for the comparison and short-circuit if a non-unique value is found.
>>> df
0 1 2
0 1 9 0
1 2 7 0
2 3 7 0
>>> df.loc[:, (df != df.iloc[0]).any()]
0 1
0 1 9
1 2 7
2 3 7
Ignoring NaNs like usual, a column is constant if nunique() == 1. So:
>>> df
A B row
0 9 0 1
1 7 0 2
2 5 0 3
3 2 0 4
>>> df = df.loc[:,df.apply(pd.Series.nunique) != 1]
>>> df
A row
0 9 1
1 7 2
2 5 3
3 2 4
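If your pandas is 0.20 or newer, DataFrame.nunique works column-wise directly, so the apply can be dropped; a minimal equivalent:
df = df.loc[:, df.nunique() != 1]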
I compared various methods on a data frame of size 120 x 10000, and found the most efficient one to be:
def drop_constant_column(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    return dataframe.loc[:, (dataframe != dataframe.iloc[0]).any()]
1 loop, best of 3: 237 ms per loop
The other contenders are
def drop_constant_columns(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    result = dataframe.copy()
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            result = result.drop(column, axis=1)
    return result
1 loop, best of 3: 19.2 s per loop
def drop_constant_columns_2(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            dataframe.drop(column, inplace=True, axis=1)
    return dataframe
1 loop, best of 3: 317 ms per loop
def drop_constant_columns_3(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = [col for col in dataframe.columns if len(dataframe[col].unique()) > 1]
    return dataframe[keep_columns].copy()
1 loop, best of 3: 358 ms per loop
def drop_constant_columns_4(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = dataframe.columns[dataframe.nunique() > 1]
    return dataframe.loc[:, keep_columns].copy()
1 loop, best of 3: 1.8 s per loop
Assuming that the DataFrame is completely of numeric type, you can try:
>>> df = df.loc[:, df.var() != 0.0]
which will remove the constant (i.e. variance = 0) columns.
If the DataFrame is of type both numeric and object, then you should try:
>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)
which will drop constant columns of numeric type only.
If you also want to ignore/delete constant enum columns, you should try:
>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> enum_df = enum_df.loc[:, [len(np.unique(x, return_counts=True)[-1]) != 1 for x in enum_df.T.values]]
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)
Here is my solution, since I needed to handle both object and numerical columns. Not claiming it's super efficient or anything, but it gets the job done.
def drop_constants(df):
    """Iterate through columns and remove columns with constant values (all same)."""
    columns = df.columns.values
    for col in columns:
        # drop col if its number of unique values is 1
        if df[col].nunique(dropna=False) == 1:
            del df[col]
    return df
Extra caveat: it won't work on columns of lists or arrays, since they are not hashable.
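A quick sketch of that caveat:
import pandas as pd

s = pd.Series([[1, 2], [1, 2]])  # a column whose cells are lists
try:
    s.nunique()
except TypeError as err:
    print(err)  # unhashable type: 'list'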
Many examples in this thread do not work properly. Check my answer for a collection of examples that do work.
