Iterating with conditions over Distinct Values in a Column - python

I have a dataset that looks something like this:
category  value  ID
A         1      x
A         0.5    y
A         0.33   y
B         0.5    z
B         0.33   z
C         5      w
C         0.33   w
For each category, I want to grab all the instances that have a value <= 0.5, and I want a count of those instances for each category.
My ideal end goal would be to have a dataframe or list with the counts for each of these categories.
Thanks so much for your help.
EDIT:
To get more complex, let's say I want the count for each category where value is <=0.5 but only count each ID once.
Whereas before the values would be:
cat A -> 2, cat B -> 2, cat C -> 1
Now ideal values would be:
cat A -> 1, cat B -> 1, cat C -> 1

You can use groupby.sum on the boolean Series obtained by comparing value to 0.5:
out = df['value'].le(0.5).groupby(df['category']).sum()
Alternatively, use boolean indexing and value_counts:
df.loc[df['value'].le(0.5), 'category'].value_counts()
output:
category
A 2
B 2
C 1
Name: value, dtype: int64
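For the edited requirement (count each ID at most once per category), a minimal sketch, assuming the columns are named category, value and ID as in the question, is to filter first and then count unique IDs per category with nunique:

# filter to rows with value <= 0.5, then count unique IDs per category
out_unique = (
    df.loc[df['value'].le(0.5)]
      .groupby('category')['ID']
      .nunique()
)
print(out_unique)

With the sample data this gives 1 for each of A, B and C, matching the edited expectation.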

Related

How do I get the latest entries in a DataFrame up to a certain time, for a given list of column values?

Say I have the following DataFrame df:
time person attributes
----------------------------
1 1 a
2 2 b
3 1 c
4 3 d
5 2 e
6 1 f
7 3 g
... ... ...
I want to write a function get_latest() that, when given a request_time and a list of person ids, returns a DataFrame containing the latest entry (row) for each person, up to the request_time.
So for instance, if I called get_latest(request_time = 4.5, ids = [1, 2]), then I want it to return
time person attributes
----------------------------
2 2 b
3 1 c
since those are the latest entries for persons 1 and 2 up to time 4.5.
I've thought about truncating the DataFrame and then searching upward from there, but that is only O(n), and I was wondering whether there are functions or logic that would make this computation faster.
EDIT: I made this example DataFrame on the fly but it is perhaps important that I point out that the times are Python datetimes.
How about pd.DataFrame.query?
def latest_entries(request_time: float, ids: list) -> pd.DataFrame:
    return (
        df
        .query("time <= @request_time & person in @ids")
        .sort_values(["time"], ascending=False)
        .drop_duplicates(subset=["person"], keep="first")
        .reset_index(drop=True)
    )
print(latest_entries(4.5, [1, 2]))
time person attributes
0 3 1 c
1 2 2 b
def get_latest(tme, ids):
    df2 = df[(df['time'] <= tme) &
             (df['person'].isin(ids))]
    return df2[~df2.duplicated(subset=['person'], keep='last')]
get_latest(4.5, [1,2])
time person attributes
1 2 2 b
2 3 1 c
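Another possible approach, sketched here under the same column names (the function name get_latest_tail is just illustrative), is to filter and then keep the last row per person with groupby.tail, which also works when time holds datetimes:

def get_latest_tail(request_time, ids):
    # keep rows up to request_time for the requested persons,
    # then take the latest (last after sorting by time) row per person
    subset = df[(df['time'] <= request_time) & (df['person'].isin(ids))]
    return subset.sort_values('time').groupby('person').tail(1)

get_latest_tail(4.5, [1, 2]) returns the rows for persons 2 and 1 at times 2 and 3, the same result as above.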

Python - delete a row based on condition from a pandas.core.series.Series after groupby

I have this pandas.core.series.Series after grouping by 2 columns case and area
case  area
A     1         2494
      2         2323
B     1        59243
      2        27125
      3           14
I want to keep only the areas that appear in case A, which means the result should look like this:
case  area
A     1         2494
      2         2323
B     1        59243
      2        27125
I tried this code :
a = df['B'][~df['B'].index.isin(df['A'].index)].index
df['B'].drop(a)
And it worked, the output was:
But it didn't drop the rows from the original; it stayed the same.
When I assign the result of dropping, all the values become NaN:
df['B'] = df['B'].drop(a)
What should I do?
It is possible to drop after grouping; here's one way:
import pandas as pd
import numpy as np

np.random.seed(1)
ungroup_df = pd.DataFrame({
    'case': [
        'A','A','A','A','A','A',
        'A','A','A','A','A','A',
        'B','B','B','B','B','B',
        'B','B','B','B','B','B',
    ],
    'area': [
        1,2,1,2,1,2,
        1,2,1,2,1,2,
        1,2,3,1,2,3,
        1,2,3,1,2,3,
    ],
    'value': np.random.random(24),
})
df = ungroup_df.groupby(['case','area'])['value'].sum()
print(df)
#index into the multi-index to just the 'A' areas
#the ":" is saying any value at the first level (A or B)
#then the df.loc['A'].index is filtering to second level of index (area) that match A's
filt_df = df.loc[:,df.loc['A'].index]
print(filt_df)
Test df:
case area
A 1 1.566114
2 2.684593
B 1 1.983568
2 1.806948
3 2.079145
Name: value, dtype: float64
Output after dropping
case area
A 1 1.566114
2 2.684593
B 1 1.983568
2 1.806948
Name: value, dtype: float64

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a little suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I have the minimum associated with A, B, and C. Does anybody have any suggestions? It would also help if I could somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
filter out all the values in the original dataset that are more than 20% above the group minimum:
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
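A sketch of an equivalent filter using transform, which broadcasts each group's minimum back onto its rows and so avoids relying on how apply aligns its result:

# per-row group minimum, aligned with the original index
group_min = df.groupby(0)[1].transform('min')
out = df[df[1] <= group_min * 1.2]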
You can simply do it by
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also do it by using the built-in pandas function groupby():
>>> df.groupby(["column_1"]).min()
The above will give the same result.
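To also keep all the values associated with each string, as the question mentions, one possible sketch collects them into lists per key:

# one list of values per key, indexed by the key column (0)
grouped_values = df.groupby(0)[1].agg(list)
print(grouped_values)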

Improving performance of Python for loops with Pandas data frames

please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Based on the condition contained in the column condition, I have to define a new column in this DataFrame which counts how many ids are in that condition.
However, please note that since the DataFrame is ordered by the timestamp column, one could have multiple entries of the same id and then a simple .cumsum() is not a viable option.
I have come up with the following code, which works properly but is extremely slow:
# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
    df.loc[r, 'count'] = ids_with_condition_a.size
Keeping these NumPy arrays is very useful to me because they give the list of ids in a particular condition. I would also like to be able to dynamically put these arrays into a corresponding cell of the df DataFrame.
Are you able to come out with a better solution in terms of performance?
You need to use groupby on the column 'condition' and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):
df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0
with your input sample, you get:
id condition count
0 1234 A 1
1 2323 B 1
2 3843 B 2
3 1234 C 1
4 8574 A 2
5 9483 A 3
which is faster than using a for loop.
And if you just want the rows with condition A, for example, you can use a mask: if you do print(df[df['condition'] == 'A']), you see only the rows whose condition equals A. So to get an array:
arr_A = df.loc[df['condition'] == 'A','id'].values
print (arr_A)
array([1234, 8574, 9483])
EDIT: to create two columns per condition, you can do, for example for condition A:
# put 1 in the column where the condition is met
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, None)
# then use cumsum to increment the count, ffill to carry the same number down
# where the condition is not met, and fillna(0) to fill the remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (here another might exist, but it's one way)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])
the output looks like this:
id condition nb_condition_A partial_arr_A nb_cond_A
0 1234 A 1 [1234] 1
1 2323 B 1 [1234] 1
2 3843 B 1 [1234] 1
3 1234 C 1 [1234] 1
4 8574 A 2 [1234, 8574] 2
5 9483 A 3 [1234, 8574, 9483] 3
Then the same thing for B and C. A loop over for cond in set(df['condition']) would be practical for generalisation; a sketch of that loop follows below.
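A minimal sketch of that generalisation loop, assuming pandas and numpy are imported as pd and np, and using hypothetical column names nb_cond_<cond> and partial_arr_<cond>:

for cond in df['condition'].unique():
    # 1 where the condition is met, NaN elsewhere, then a running count
    flag = np.where(df['condition'] == cond, 1, np.nan)
    df[f'nb_cond_{cond}'] = (
        pd.Series(flag, index=df.index).cumsum().ffill().fillna(0).astype(int)
    )
    # full array of ids for this condition, then the per-row partial list
    arr = df.loc[df['condition'] == cond, 'id'].values
    df[f'partial_arr_{cond}'] = df[f'nb_cond_{cond}'].apply(lambda x, a=arr: a[:x])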
EDIT 2: one idea to do what you explained in the comments, though I'm not sure it improves the performance:
# array of unique condition
arr_cond = df.condition.unique()
# use apply to create, row-wise, the list of ids for each condition
df[arr_cond] = (
    df.apply(lambda row: (df.loc[:row.name]
                            .drop_duplicates('id', 'last')
                            .groupby('condition').id.apply(list)),
             axis=1)
      .applymap(lambda x: [] if not isinstance(x, list) else x)
)
Some explanations: for each row, select the DataFrame up to this row with loc[:row.name], drop the duplicated ids keeping the last one with drop_duplicates('id', 'last') (in your example, this means that once we reach row 3, row 0 is dropped, as the id 1234 appears twice), then the data is grouped by condition with groupby('condition'), and the ids for each condition are put into a list with id.apply(list). The part starting with applymap fills the missing values with empty lists (you can't use fillna([]); it's not possible).
For the length of each list per condition, you can do:
for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)
The result looks like this:
id condition A B C len_A len_B len_C
0 1234 A [1234] [] [] 1 0 0
1 2323 B [1234] [2323] [] 1 1 0
2 3843 B [1234] [2323, 3843] [] 1 2 0
3 1234 C [] [2323, 3843] [1234] 0 2 1
4 8574 A [8574] [2323, 3843] [1234] 1 2 1
5 9483 A [8574, 9483] [2323, 3843] [1234] 2 2 1

How to access values returned by the column names idxmin/idxmax?

Let's say I have this dataframe
> df = pd.DataFrame({'A': [1,5], 'B':[3,4]})
A B
0 1 3
1 5 4
I can get the minimum value of each row with:
> df.min(1)
0 1
1 4
dtype: int64
Or its indexes with:
> df.idxmin(1)
0 A
1 B
dtype: object
Nevertheless, this implies searching the minimum values twice. Is there a way to use the idxmin results to access the respective columns and get the minimum value (without calling min)?
Edit: I am looking for something that is faster than calling min again. In theory, this should be possible as columns are indexed.
To get the values in a list, you could do the following:
> indices = df.idxmin(1)
> [df.iloc[k][indices[k]] for k in range(len(indices))]
[1, 4]
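If the goal is also to avoid the Python-level loop, one hedged sketch reuses the idxmin labels through a NumPy fancy-index lookup:

import numpy as np

idx = df.idxmin(axis=1)                  # column label of each row's minimum
cols = df.columns.get_indexer(idx)       # positions of those labels
vals = df.to_numpy()[np.arange(len(df)), cols]
# vals is array([1, 4]), obtained without calling min a second time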
