I've been working on a DataFrame with User_IDs, DateTime objects and other information, like the following extract:
User_ID;Latitude;Longitude;Datetime
222583401;41.4020375;2.1478710;2014-07-06 20:49:20
287280509;41.3671346;2.0793115;2013-01-30 09:25:47
329757763;41.5453577;2.1175164;2012-09-25 08:40:59
189757330;41.5844998;2.5621569;2013-10-01 11:55:20
624921653;41.5931846;2.3030671;2013-07-09 20:12:20
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
I've grouped Users with:
g = df.groupby(['User_ID','Datetime'])
and then checked which users have more than one Datetime entry:
df = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)
I've obtained the following boolean Series:
User_ID
189757330 False
222583401 False
287280509 False
329757763 False
414673119 True
624921653 False
Name: Datetime, dtype: bool
which is fine for my purposes: keep only the User_IDs with a True value. Now I would like to select the rows associated with those True values and write them to a new file with pandas.to_csv, for instance. The expected output would contain only the User_ID with more than one Datetime entry:
User_ID;Latitude;Longitude;Datetime
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
How can I access the boolean value for each User_ID? Thanks for your kind help.
Assign the result of df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1) to a variable so you can perform boolean indexing, then use the resulting index to call isin and filter your original df:
In [366]:
users = df.groupby('User_ID')['Datetime'].apply(lambda g: len(g)>1)
users
Out[366]:
User_ID
189757330 False
222583401 False
287280509 False
329757763 False
414673119 True
624921653 False
Name: Datetime, dtype: bool
In [367]:
users[users]
Out[367]:
User_ID
414673119 True
Name: Datetime, dtype: bool
In [368]:
users[users].index
Out[368]:
Int64Index([414673119], dtype='int64')
In [361]:
df[df['User_ID'].isin(users[users].index)]
Out[361]:
User_ID Latitude Longitude Datetime
5 414673119 41.555014 2.096583 2014-02-24 20:15:30
6 414673119 41.555014 2.097583 2014-02-24 20:16:30
7 414673119 41.555014 2.098583 2014-02-24 20:17:30
You can then call to_csv on the above as normal.
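As a hedged alternative sketch (the inline data below is a hypothetical cut-down of the question's extract), `groupby(...).filter(...)` collapses the mask-and-`isin` steps into a single call:

```python
import pandas as pd
from io import StringIO

# A few rows mirroring the question's extract (hypothetical inline data)
raw = """User_ID;Latitude;Longitude;Datetime
222583401;41.4020375;2.1478710;2014-07-06 20:49:20
414673119;41.5550136;2.0965829;2014-02-24 20:15:30
414673119;41.5550136;2.0975829;2014-02-24 20:16:30
414673119;41.5550136;2.0985829;2014-02-24 20:17:30
"""
df = pd.read_csv(StringIO(raw), sep=';')

# filter keeps every row of each group for which the predicate is True
repeated = df.groupby('User_ID').filter(lambda g: len(g) > 1)
print(repeated['User_ID'].unique())  # [414673119]

# Write the kept rows back out in the same semicolon format
out = StringIO()
repeated.to_csv(out, sep=';', index=False)
```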
First, make sure you have no duplicate entries:
df = df.drop_duplicates()
Then, figure out the row counts for each user:
counts = df.groupby('User_ID').Datetime.count()
Finally, keep the rows whose User_ID has a count above one:
df[df.User_ID.isin(counts[counts > 1].index)]
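The three steps above, run end to end on a toy frame (the user IDs and timestamps here are made up):

```python
import pandas as pd

# Toy data: user 2 has two distinct rows; user 3 has an exact duplicate row
df = pd.DataFrame({
    'User_ID':  [1, 2, 2, 3, 3],
    'Datetime': ['2014-01-01 10:00:00', '2014-01-02 11:00:00',
                 '2014-01-02 11:01:00', '2014-01-03 12:00:00',
                 '2014-01-03 12:00:00'],
})

df = df.drop_duplicates()                        # user 3 collapses to one row
counts = df.groupby('User_ID').Datetime.count()  # rows per user
result = df[df.User_ID.isin(counts[counts > 1].index)]
print(result['User_ID'].tolist())  # [2, 2]
```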
I have this excruciatingly annoying problem (I'm quite new to Python):
df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1=df['col1']
Why does col1[1] in col1 return False?
To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the matching rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check multiple values at once (an OR across them), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
And from the docs, a note about using in for testing membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
You are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False
I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on your pandas Series. You probably want:
col1[1] in col1.values
Because as mentioned above, pandas is looking through the index, and you need to specifically ask it to look at the values of the series, not the index.
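A minimal sketch of the index-versus-values distinction, using the question's Series:

```python
import pandas as pd

col1 = pd.Series(['1', '2', '3', '4'])  # default integer index 0..3

print(1 in col1)               # True: 1 is an index label
print('2' in col1)             # False: '2' is a value, not a label
print('2' in col1.values)      # True: membership among the values
print(col1.isin(['2']).any())  # True: the pandas-native way
```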
After doing some operations, I get a dataframe with an index and a column of boolean values. I just need the indexes whose boolean value is True. How can I get them?
My output is like this ("AC name" is the index of the output):
AC name
Agiaon False
Alamnagar False
Alauli True
Alinagar False
Ziradei True
Name: Vote percentage, Length: 253, dtype: bool
Considering that the dataframe is df, it would be:
res = df[df['Vote percentage']].index
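A sketch with a reconstructed Series (only the constituency names shown in the question are used; the rest of the 253 rows are unknown). If the boolean column is a standalone Series s rather than a DataFrame column, the same boolean indexing applies:

```python
import pandas as pd

# Reconstruction of the question's output (truncated to the visible rows)
s = pd.Series([False, False, True, False, True],
              index=['Agiaon', 'Alamnagar', 'Alauli', 'Alinagar', 'Ziradei'],
              name='Vote percentage')
s.index.name = 'AC name'

# Index the Series with itself: only rows whose value is True survive
res = s[s].index
print(list(res))  # ['Alauli', 'Ziradei']
```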
I need to check whether each value in column A contains the corresponding value from column B.
I tried using the isin() method:
import pandas as pd
df = pd.DataFrame({'A': ['filePath_en_GB_LU_file', 'filePath_en_US_US_file', 'filePath_en_GB_PL_file'],
'B': ['_LU_', '_US_', '_GB_']})
df['isInCheck'] = df.A.isin(df.B)
For some reason it's not working.
It returns only False values, whereas for the first two rows it should return True.
What am I missing in there?
I think you need DataFrame.apply, but note that the last row is also a match:
df['isInCheck'] = df.apply(lambda x: x.B in x.A, axis=1)
print (df)
A B isInCheck
0 filePath_en_GB_LU_file _LU_ True
1 filePath_en_US_US_file _US_ True
2 filePath_en_GB_PL_file _GB_ True
Try to use an apply:
df['isInCheck'] = df.apply(lambda r: r['B'] in r['A'], axis=1)
This will check row-wise. If you want to check whether multiple elements are present, maybe you should create a column for each one of them:
for e in df['B'].unique():
df[f'has_"{e}"'] = df.apply(lambda r: e in r['A'], axis=1)
print(df)
A B has_"_LU_" has_"_US_" has_"_GB_"
0 filePath_en_GB_LU_file _LU_ True False True
1 filePath_en_US_US_file _US_ False True False
2 filePath_en_GB_PL_file _GB_ False False True
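For the single-column check, a list comprehension over the zipped columns is an equivalent (and typically faster) alternative to apply, sketched on the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['filePath_en_GB_LU_file', 'filePath_en_US_US_file',
                         'filePath_en_GB_PL_file'],
                   'B': ['_LU_', '_US_', '_GB_']})

# Row-wise substring test without apply
df['isInCheck'] = [b in a for a, b in zip(df['A'], df['B'])]
print(df['isInCheck'].tolist())  # [True, True, True]
```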
I have a dataframe with a column named diff. I am able to group this column and get the number of True and False occurrences in the data frame.
df.groupby('diff').size()
returns
diff
True 5101
False 61
dtype: int64
I want to access the value for True, 5101.
I have already tried
df.groupby('diff').size().loc['True']
It is a Series, so loc can be omitted and you can index with the bool True directly:
s = pd.Series([5101, 61], index=[True, False])
print (s)
True 5101
False 61
dtype: int64
print (s[True])
5101
The answer is:
df_merged.groupby('diff').size().loc[True]
Explanation: note that
df_merged.groupby('diff').size().index
returns
Index([True, False], dtype='object', name='diff')
It's the bool True, not the string "True".
Using .loc with a lambda (the callable must return a valid indexer, so return the bool index itself, which doubles as a boolean mask):
s = df.groupby('diff').size().loc[lambda x: x.index]
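Reconstructing the counts from the question, a quick sketch of both access patterns:

```python
import pandas as pd

# Rebuilt from the question's groupby output
s = pd.Series([5101, 61], index=[True, False], name='diff')

print(s.loc[True])  # 5101: plain label lookup with the bool True
print(s[True])      # 5101: loc can be omitted

# Callable form: the bool index doubles as a boolean mask,
# keeping only the row labelled True
print(s.loc[lambda x: x.index].tolist())  # [5101]
```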
Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
Will not work because False has a value of 0, hence a sum of zeroes will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>>> df['boolean_column'].value_counts()
True 3
False 2
Name: boolean_column, dtype: int64
If you want to count False and True separately you can use pd.Series.sum() + ~:
>>> df['boolean_column'].values.sum() # True
3
>>> (~df['boolean_column']).values.sum() # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To calculate True or False values separately, don't compare against True / False explicitly, just sum and take the reverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
Alternatively, you can calculate counts using NumPy:
print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
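A runnable sketch of the three options above on the same toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, False, True]})

print(df['A'].value_counts().to_dict())  # {True: 3, False: 2}
print(df['A'].sum())                     # 3: True counts as 1
print((~df['A']).sum())                  # 2

values, counts = np.unique(df['A'], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {False: 2, True: 3}
```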
I couldn't find exactly what I needed here. I needed the number of True and False occurrences for further calculations, so I used:
true_count = df['column'].value_counts()[True]
false_count = df['column'].value_counts()[False]
where df is your DataFrame and column is the column with booleans.
This alternative works for multiple columns and/or rows as well.
df[df==True].count(axis=0)
Will get you the total amount of True values per column. For row-wise count, set axis=1.
df[df==True].count().sum()
Adding a sum() in the end will get you the total amount in the entire DataFrame.
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
Will yield the number of "False" elements.
df.isnull()
returns a boolean value. True indicates a missing value.
df.isnull().sum()
returns column wise sum of True values.
df.isnull().sum().sum()
returns total no of NA elements.
In case you have a column in a DataFrame with boolean values, or even more interesting, in case you do not have it but you want to find the number of values in a column satisfying a certain condition you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
the parentheses turn the comparison into a boolean Series, and its value_counts() result can be indexed directly for other calculations, even without creating an additional variable; index by the bool labels rather than by position:
(df['col']<=value).value_counts()[False] # for Falses
(df['col']<=value).value_counts()[True] # for Trues
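A safer variant, sketched on a hypothetical column: Series.get avoids a KeyError when one of the two outcomes never occurs:

```python
import pandas as pd

df = pd.DataFrame({'col': [3, 7, 2, 9]})
value = 5

vc = (df['col'] <= value).value_counts()
print(vc.get(True, 0), vc.get(False, 0))  # 2 2

# Edge case: every row satisfies the condition, so False never appears,
# yet .get still returns the 0 default instead of raising
vc_all = (df['col'] <= 100).value_counts()
print(vc_all.get(False, 0))  # 0
```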
Here is an attempt to be as literal and brief as possible in providing an answer. The value_counts() strategies are probably more flexible in the end. Accumulating with sum and counting with count are different operations, each expressing its own analytical intent, with sum depending on the data type.
"Count occurences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64