I have a dataframe containing some string values:
df:
column1
0 | a
1 | b
2 | c
3 | d
Now I also have a list = ('b', 'c'). It contains some of the values of the df.
For each value in the dataframe, I want to find out whether it appears in the list:
0 | False
1 | True
2 | True
3 | False
So far I have used x = df['column1'].isin(list), but it says False for all of the observations in the dataframe. I assume that is because it checks whether all the values in the df are in the list. How can I achieve the desired result?
Thanks
The following code works for me:
import pandas as pd
data = ['a','b','c','d']
df = pd.DataFrame(data = data, columns=['Column 1'])
list1 = ('b', 'c')  # If you are using round brackets then that is not a list, it's a tuple.
df.isin(list1)
Output:
Column 1
0 False
1 True
2 True
3 False
Note: if it still doesn't work, recheck all the values in the dataframe and the list; they might contain unnecessary spaces or something else.
Let me know if it works for you or not.
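If you want to keep that result next to the original data, a minimal sketch (assuming the same df and list1 as above) is to assign the boolean Series to a new column:
df['in_list'] = df['Column 1'].isin(list1)  # new boolean column: False, True, True, False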
Related
I'm looking for a way (using a built-in pandas function) to scan a column of a DataFrame, comparing its own values at different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1. I want to create a column col_2 with TRUE/FALSE in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the next N indices.
I'd like to do the same operation without using a for loop. I've already tried using shift and df.loc, but the computational time is similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last. That's done by just changing the keep argument to 'last' instead of the default 'first'.
As suggested by #QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False
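If you specifically need the original next-N-rows window (rather than flagging duplicates anywhere in the column), a minimal vectorized sketch, assuming col_1 and N as in the question, is to OR together comparisons against each of the next N rows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
N = 5
# flag a row if any of the N rows after it holds the same value
match_any = pd.Series(False, index=df.index)
for i in range(1, N + 1):
    match_any |= df['col_1'].eq(df['col_1'].shift(-i))
df['col_2'] = match_any
This loops only over the N shift offsets, not over every row, so it should stay fast for large frames.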
By writing the following code I create a dataframe:
data = [['A', 'B','D'], ['A','D'], ['F', 'G','C','B','A']]
df = pd.DataFrame(data)
df
My goal is to remove the values from the dataframe that are not in the list below.
list_items = ['A','B','C']
My expected output is the same dataframe, but with only the values from the list kept in each row.
I have tried traversing the values in loops and checking them one by one, but the dataframe is very large (9108, 1616) and the list has over 130 items that need to be checked, so it takes too long to run. Please suggest the most efficient way to achieve the expected output.
I don't think doing it in pandas is a good idea, as columns don't matter here. It's easier to do it with lists, which you can convert back to a pandas dataframe at the end if you really need to.
# convert df to list of lists
data = df.values.tolist()
# filter each element of the list to contain only list_items values
data_filtered = [ [el for el in l if el in list_items] for l in data]
# convert back to dataframe
df_filtered = pd.DataFrame(data_filtered)
print(df_filtered)
# 0 1 2
#0 A B None
#1 A None None
#2 C B A
Let us try not to use a for loop:
s = df.where(df.isin(list_items)).reset_index().melt('index').dropna()
s = s.assign(Key=s.groupby('index').cumcount()).pivot(index='index', columns='Key', values='value')
Key 0 1 2
index
0 A B NaN
1 A NaN NaN
2 C B A
Method two, not good for a large dataframe:
s = df.where(df.isin(list_items)).T.apply(lambda x: sorted(x, key=pd.isnull)).T.dropna(thresh=1, axis=1)
0 1 2
0 A B NaN
1 A NaN NaN
2 C B A
Suppose I have a pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied with df.groupby('col1').apply(myfunc), the result is a Series whose length equals the number of categories in col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)
A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result such that wherever there is 'a' in column X, Result will be True.
You can map the groupby result back to the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
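An equivalent way to broadcast the per-group result, assuming the same A and myfunc as above, is to rename the grouped Series and join it back on X:
result = A.groupby('X').apply(myfunc).rename('Result')
A = A.join(result, on='X')  # aligns each row's X value with its per-group result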
My solution, which uses a loop, may not be the best one, but I think it's pretty good.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp, then add the result column (in my example it is c) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp.apply(sum)['a'] # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
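For this particular example (a plain per-group sum of one column), a shorter route that should produce the same c values, assuming the same df as above, is groupby-transform; note it keeps the original row order instead of re-ordering by group:
df['c'] = df.groupby('a')['a'].transform('sum')  # broadcasts each group's sum back to its rows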
Is there a way to return the name/header of a column as a string in a pandas dataframe? I want to work with the columns of a row that share the same prefix. The dataframe header looks like this:
col_00 | col_01 | ... | col_51 | bc_00 | cd_00 | cd_01 | ... | cd_90
I'd like to apply a function to each row, but only from col_00 to col_51 and from cd_00 to cd_90, separately. To do this, I thought I'd collect the column names into a list, e.g. to_work_with would be the list of columns starting with the prefix 'col', and apply the function to df[to_work_with]. Then I'd change to_work_with so that it contains the list of columns starting with the 'cd' prefix, et cetera. But I don't know how to iterate through the column names.
So basically, the thing I'm looking for is this function:
to_work_with = column names in the df that start with "thisstring"
How can I do that? Thank you!
You can use boolean indexing with str.startswith:
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
Sample:
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 6 7 8 9
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
#if want apply some function for filtered columns only
def f(x):
    return x + 1
df[cols] = df[cols].apply(f)
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 7 8 9 10
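Since the question wants to repeat this for each prefix, one possible sketch (assuming the same f as above) is to loop over the prefixes and apply the function to each column subset:
for prefix in ['col', 'cd']:
    cols = df.columns[df.columns.str.startswith(prefix)]
    df[cols] = df[cols].apply(f)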
Another solution with list comprehension:
cols = [col for col in df.columns if col.startswith("cd")]
print (cols)
['cd_00', 'cd_01', 'cd_02', 'cd_90']
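Another option that may be convenient here is DataFrame.filter, which selects matching columns directly, for example with a regex anchored at the start of the name:
cd_cols = df.filter(regex='^cd')  # sub-DataFrame containing only the cd_* columns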
The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
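If you only need to transform the one column, a common sketch that avoids the temporary column, assuming the same df and map_array, is map followed by fillna with the original values (the column dtype will still become object once strings are mixed in):
df[0] = df[0].map(map_array).fillna(df[0])  # unmapped values keep their original integers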