So I have a dataframe, let's call it df1, that looks like the following.
Index ID
1 90
2 508
3 692
4 944
5 1172
6 1998
7 2022
Now if I call (508 == df1['ID']).any() it returns True, as it should. But if I have another dataframe, df2, that looks like the following:
Index Num
1 83
2 508
3 912
and I want to check whether the Nums are contained in the IDs from df1, using iloc raises the error len() of unsized object. This is the exact code I've used:
(df2.iloc[1][0] == df1['ID']).any()
which raises the error mentioned above. I've also tried assigning df2.iloc[1][0] to a variable, which didn't work, and calling int() on that variable, which also didn't work. Can anyone provide some insight on this?
Try turning it around.
(df1['ID'] == df2.iloc[1][0]).any()
True
This is happening as a result of how == is handled for the two objects being passed to it.
In this case the first object is of type
type(df2.iloc[1][0])
numpy.int64
and the second is of type
pandas.core.series.Series
== (i.e. __eq__) doesn't handle that combination well.
However, this works too:
(int(df2.iloc[1][0]) == df1['ID']).any()
Or:
(int(df2.iloc[1, 0]) == df1['ID']).any()
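Another option, if you would rather avoid int(): NumPy scalars expose an .item() method that converts them to plain Python scalars, so the following sketch should behave the same way:
(df2.iloc[1, 0].item() == df1['ID']).any()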
This works:
(df1['ID'] == df2.iloc[1][0]).any()
Something like this to check if the ID column is in the Num column of df2:
>>> df1.ID.isin(df2.Num)
Index
1 False
2 True
3 False
4 False
5 False
6 False
7 False
Name: ID, dtype: bool
or:
>>> df2.Num.isin(df1.ID)
Index
1 False
2 True
3 False
Name: Num, dtype: bool
Or if you just want to see the matching numbers by index location:
>>> df2.Num.where(df2.Num.isin(df1.ID))
Index
1 NaN
2 508.0
3 NaN
Name: Num, dtype: float64
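If you want the matching rows themselves rather than a boolean mask, plain boolean indexing with the same isin condition works too (a small sketch using the frames above):
>>> df2[df2.Num.isin(df1.ID)]
Num
Index
2 508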
Related
I have a dataframe that looks like this
date id type
02/02/2020 2 A
29/02/2020 2 B
04/03/2020 2 B
02/01/2020 3 B
15/01/2020 3 A
19/01/2020 3 C
... ... ...
I want to create a new column, called flagged. For each row, I want the value of flagged to be True if there exists another row with
The same id
Type A
A date for which the difference in days from the current row's date is greater than 0 and smaller than 30
I would want the dataframe above to be transformed to this
date id type flagged
02/02/2020 2 A False
29/02/2020 2 B True
04/03/2020 2 B False
02/01/2020 3 B False
15/01/2020 3 A False
19/01/2020 3 C True
... ... ... ...
My Approach:
I created the following function
def check_type(id, date):
    if df[(df.id == id) & (df.type == 'A') & (date - df.date > datetime.timedelta(0)) & (date - df.date < datetime.timedelta(30))].empty:
        return False
    else:
        return True
so that if I run
df['flagged'] = df.apply(lambda x: check_type(x.id, x.date), axis = 1)
I get the desired result.
Questions:
How do I change the function check_type so that it works on any dataframe, no matter its name? The current function only works if the dataframe it is used on is called df.
How do I make this process faster? I want to run this function on a large dataframe, and it's not performing as fast as I would want.
Thanks in advance!
I would find the last date with type A and propagate it through each id group with ffill, then take the difference:
last_dates = df.date.where(df['type'].eq('A')).groupby(df['id']).ffill()
# this is the new column
df.date.sub(last_dates).lt(pd.to_timedelta('30D')) & df['type'].ne('A')
Output:
0 False
1 True
2 False
3 False
4 False
5 True
dtype: bool
Note: this works given that you always mask A with False.
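As for the first question in the post above: the usual fix is to pass the dataframe in as a parameter instead of relying on the global name df. A minimal sketch that keeps the original logic (still slower than the vectorized approach above):
def check_type(frame, id, date):
    # same condition as the original, but against the frame passed in
    mask = ((frame.id == id) & (frame.type == 'A')
            & (date - frame.date > datetime.timedelta(0))
            & (date - frame.date < datetime.timedelta(30)))
    return not frame[mask].empty

df['flagged'] = df.apply(lambda x: check_type(df, x.id, x.date), axis=1)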
I was passing an Index-type variable (pandas.Index) containing the labels of the columns I want to drop from my DataFrame, and it was working correctly. It was an Index type because I was extracting the column names from the DataFrame itself based on a certain condition.
Afterwards, I needed to add another column name to that list, so I converted the Index object to a Python list so I could append the additional label name. But on passing the list as the columns parameter to the drop() method on the DataFrame, I now keep getting the error:
ValueError: Need to specify at least one of 'labels', 'index' or 'columns'
How to resolve this error?
The code I use is like this:
unique_count = df.apply(pd.Series.nunique)
redundant_columns = unique_count[unique_count == 1].index.values.tolist()
redundant_columns.append('DESCRIPTION')
print(redundant_columns)
# Output: None
df.drop(columns=redundant_columns, inplace=True)
I found why the error is occurring: after the append() statement, redundant_columns becomes None, though I don't know why. I would love it if someone could explain why this is happening.
For me, your solution works fine.
Another solution for remove columns by boolean indexing:
df = pd.DataFrame({'A':list('bbbbbb'),
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'DESCRIPTION':list('aaabbb')})
print (df)
A C D DESCRIPTION E
0 b 7 1 a 5
1 b 8 3 a 3
2 b 9 5 a 6
3 b 4 7 b 9
4 b 2 1 b 2
5 b 3 0 b 4
mask = df.nunique().ne(1)
mask['DESCRIPTION'] = False
df = df.loc[:, mask]
print (df)
C D E
0 7 1 5
1 8 3 3
2 9 5 6
3 4 7 9
4 2 1 2
5 3 0 4
Explanation:
First get the number of unique values per column with nunique and compare with ne for not equal.
Then change the boolean mask - set column DESCRIPTION to False so it is always removed.
Finally filter by boolean indexing.
Details:
print (df.nunique())
A 1
C 6
D 5
DESCRIPTION 2
E 6
dtype: int64
mask = df.nunique().ne(1)
print (mask)
A False
C True
D True
DESCRIPTION True
E True
dtype: bool
mask['DESCRIPTION'] = False
print (mask)
A False
C True
D True
DESCRIPTION False
E True
dtype: bool
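For completeness, the same columns can be removed with drop itself, which is closer to what the question attempted (a sketch using the sample frame above):
df = df.drop(columns=df.columns[df.nunique().eq(1)].tolist() + ['DESCRIPTION'])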
After trying things out, I fixed this by using a numpy.ndarray instead of a plain Python list, although I don't know why that matters.
In my trials, a plain Python list gives the ValueError, while a pandas.Index or numpy.ndarray object containing the labels works fine. So I went with np.ndarray, since it can be appended to via np.append.
Current working code:
unique_count = df.apply(pd.Series.nunique)
redundant_columns: np.ndarray = unique_count[unique_count == 1].index.values
redundant_columns = np.append(redundant_columns, 'DESCRIPTION')
df.drop(columns=redundant_columns, inplace=True)
I had the same error when calling .remove on the same line as the initialization:
myNewList = [i for i in myOldList].remove('Last Item')
myNewList would become NoneType. Using .tolist() on a separate line, and assigning its result, might help you:
redundant_columns = unique_count[unique_count == 1].index.values
redundant_columns = redundant_columns.tolist()
redundant_columns.append('DESCRIPTION')
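The underlying cause in both snippets above is the same: Python list methods such as append and remove mutate the list in place and return None, so assigning their return value replaces your list with None. A tiny illustration:
cols = ['A', 'B']
result = cols.append('C')  # append mutates cols in place
print(result)              # None
print(cols)                # ['A', 'B', 'C']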
I am trying to find the null values in a DataFrame. Though I reviewed the following post from Stack Overflow that describes the process to determine the null values, I am having a hard time doing the same for my dataset.
How to count the Nan values in the column in Panda Data frame
Working code:
import pandas as pd
a = ['america','britain','brazil','','china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.isnull()
#Output:
0
0 False
1 False
2 False
3 False
4 False
5 False
a.isnull().sum()
#Output
#0 0
#dtype: int64
What am I doing wrong?
If you want '', None and NaN to all count as null, you can use the applymap method to run a check on each value in the dataframe, then take the .sum of the result:
import pandas as pd
import numpy as np
a = ['america','britain','brazil',None,'', np.nan, 'china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.applymap(lambda x: not x or pd.isnull(x)).sum()
# 0 3
# dtype: int64
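Note that in recent pandas versions (2.1+) applymap has been deprecated in favour of DataFrame.map, so there the equivalent would presumably be:
a.map(lambda x: not x or pd.isnull(x)).sum()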
I hope this helps.
The '' in your list isn't a null value, it's an empty string. To get a null, use None instead. As described in the pandas.isnull() documentation, missing values are "NaN in numeric arrays, [or] None/NaN in object arrays".
import pandas as pd
a = ['america','britain','brazil',None,'china','jamaica']
a = pd.DataFrame(a)
a.isnull()
0
0 False
1 False
2 False
3 True
4 False
5 False
You can see the difference by printing the two dataframes. In the first case, the dataframe looks like:
pd.DataFrame(['america','britain','brazil','','china','jamaica'])
0
0 america
1 britain
2 brazil
3
4 china
5 jamaica
Notice that the value at index 3 is an empty string.
In the second case, you get:
pd.DataFrame(['america','britain','brazil',None,'china','jamaica'])
0
0 america
1 britain
2 brazil
3 None
4 china
5 jamaica
The other posts addressed that '' is not a null value and therefore isn't counted as such with the isnull method...
...However, '' does evaluate to False when interpreted as a bool.
a.astype(bool)
0
0 True
1 True
2 True
3 False
4 True
5 True
This might be useful if you have '' in your dataframe and want to process it this way.
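A related sketch, if you would rather make isnull itself count empty strings: replace '' with NaN first, then count (assuming numpy is imported as np):
a.replace('', np.nan).isnull().sum()
# 0    1
# dtype: int64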
I have a regular DataFrame with a string type (object) column. When I try to filter on the column using the equivalent of a WHERE clause, I get a KeyError when I use the dot notation. When in bracket notation, all is well.
I am referring to these instructions:
df[df.colA == 'blah']
df[df['colA'] == 'blah']
The first raises
KeyError: False
Not posting an example as I cannot reproduce the issue on a bespoke DataFrame built for the purpose of illustration: when I do, both notations yield the same result.
So I'm asking whether there is a difference between the two, and why.
The dot notation is just a convenient shortcut for accessing things vs. the standard brackets. Notably, they don't work when the column name is something like sum that is already a DataFrame method. My bet would be that the column name in your real example is running into that issue, and so it works fine with the bracket selection but is otherwise testing whether a method is equal to 'blah'.
Quick example below:
In [67]: df = pd.DataFrame(np.arange(10).reshape(5,2), columns=["number", "sum"])
In [68]: df
Out[68]:
number sum
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [69]: df.number == 0
Out[69]:
0 True
1 False
2 False
3 False
4 False
Name: number, dtype: bool
In [70]: df.sum == 0
Out[70]: False
In [71]: df['sum'] == 0
Out[71]:
0 False
1 False
2 False
3 False
4 False
Name: sum, dtype: bool
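Bracket notation is also the only option when the column name is not a valid Python identifier, e.g. when it contains spaces. A quick sketch:
In [72]: df2 = pd.DataFrame({'col a': [1, 2, 3]})
In [73]: df2['col a'] == 1   # df2.col a would be a SyntaxError
Out[73]:
0     True
1    False
2    False
Name: col a, dtype: bool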
Very simple question everyone, but it's nearly impossible to find answers to basic questions in the official documentation.
I have a dataframe object in Pandas that has rows and columns.
One of the columns, named "CBSM", contains "Y"/"N" values. I need to delete all rows from the dataframe where the value of the CBSM column is "Y".
I see that there is a method called dataframe.drop()
labels, axis, and level are three of the parameters that the drop() method takes. I have no clue what values to provide for these parameters to accomplish deleting the rows in the fashion I described above. I have a feeling the drop() method is not the right way to do what I want.
Please advise, thanks.
This method is called boolean indexing.
You can try loc with str.contains:
df.loc[~df['CBSM'].str.contains('Y')]
Sample:
print df
A CBSM L
0 1 Y 4
1 1 N 6
2 2 N 3
print df['CBSM'].str.contains('Y')
0 True
1 False
2 False
Name: CBSM, dtype: bool
#inverted boolean Series
print ~df['CBSM'].str.contains('Y')
0 False
1 True
2 True
Name: CBSM, dtype: bool
print df.loc[~df['CBSM'].str.contains('Y')]
A CBSM L
1 1 N 6
2 2 N 3
Or:
print df.loc[~(df['CBSM'] == 'Y')]
A CBSM L
1 1 N 6
2 2 N 3
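Since the question asked specifically about drop(): you can also pass the index of the matching rows to drop, though the boolean indexing above is the more idiomatic route. A sketch with the same sample frame:
print df.drop(df[df['CBSM'] == 'Y'].index)
A CBSM L
1 1 N 6
2 2 N 3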