I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of both variables (Domestic and Catsize) results in Zero (0) such that
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g=edata.groupby('Type')
q3=g.apply(lambda x:x[((x['Domestic']==0) & (x['Catsize']==0) |
(x['Domestic']==0) & (x['Catsize']==1) |
(x['Domestic']==1) & (x['Catsize']==0)
)]
['Count'].sum()
)
q3
Type
1 1
2 11
3 14
4 31
This code works fine, however, if the number of variables in the dataframe increases then the number of conditions grows rapidly. So, is there a smart way to write a condition that states that if the ANDing the two (or more) variables result in a zero then perform the sum() function
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
how about
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
for logical AND the product does the trick nicely.
for logcal OR you can use sum(axis=1) with proper negation in advance.
Related
This question is an extension of this question. Consider the pandas DataFrame visualized in the table below.
respondent
brand
engine
country
aware
aware_2
aware_3
age
tesst
set
0
a
volvo
p
swe
1
0
1
23
set
set
1
b
volvo
None
swe
0
0
1
45
set
set
2
c
bmw
p
us
0
0
1
56
test
test
3
d
bmw
p
us
0
1
1
43
test
test
4
e
bmw
d
germany
1
0
1
34
set
set
5
f
audi
d
germany
1
0
1
59
set
set
6
g
volvo
d
swe
1
0
0
65
test
set
7
h
audi
d
swe
1
0
0
78
test
set
8
i
volvo
d
us
1
1
1
32
set
set
To convert a column with String entries, one should do a map and then pandas.replace().
For example:
mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
respondent
brand
engine
country
aware
aware_2
aware_3
age
tesst
set
0
a
volvo
p
swe
1
0
1
23
1
1
1
b
volvo
None
swe
0
0
1
45
1
1
2
c
bmw
p
us
0
0
1
56
2
2
3
d
bmw
p
us
0
1
1
43
2
2
4
e
bmw
d
germany
1
0
1
34
1
1
5
f
audi
d
germany
1
0
1
59
1
1
6
g
volvo
d
swe
1
0
0
65
2
1
7
h
audi
d
swe
1
0
0
78
2
1
8
i
volvo
d
us
1
1
1
32
1
1
As seen above, the last two column's strings are replaced with numbers representing these strings.
The question is then: Is there a faster and not so hands-on approach to replace all the strings into a number? Can one automatically create a mapping (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
respondent
brand
engine
country
aware
aware_2
aware_3
age
tesst
set
0
1
1
1
1
1
0
1
23
1
1
1
2
1
2
1
0
0
1
45
1
1
2
3
2
1
2
0
0
1
56
2
2
3
4
2
1
2
0
1
1
43
2
2
4
5
2
3
3
1
0
1
34
1
1
5
6
3
3
3
1
0
1
59
1
1
6
7
1
3
1
1
0
0
65
2
1
7
8
3
3
1
1
0
0
78
2
1
8
9
1
3
2
1
1
1
32
1
1
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
You can adapte the code given in this response
https://stackoverflow.com/a/39989896/15320403 (inside the post you linked) to generate a mapping for each column of your choice and apply replace as you suggested
all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
You will need to first change the type of the columns to Categorical and then create a new column or overwrite the existing column with codes:
df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes
If you need the mapping:
dict(enumerate(df['brand'].cat.categories )) #This will work only after you've converted the column to categorical
From the other answers, I've written this function to do solve the problem:
import pandas as pd
def convertStringColumnsToNum(data):
columns = data.columns
columns_dtypes = data.dtypes
maps = []
for col_idx in range(0, len(columns)):
# don't change columns already comprising of numbers
if(columns_dtypes[col_idx] == 'int64'): # can be extended to more dtypes
continue
# inspired from Shivam Roy's answer
col = columns[col_idx]
tmp = pd.Categorical(data[col])
data[col] = tmp.codes
maps.append(tmp.categories)
return maps
This function returns the mapss used to replace strings with a numeral code. The code is the index in which a string resides inside the list. This function works, yet it comes with the SettingWithCopyWarning.
if it ain't broke don't fix it, right? ;)
*but if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. Yet it works *shrugs* *
This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with DataFrameGroupBy.size and lambda function for compare values by Series.eq and count by sum of True values (Trues are processes like 1):
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0',lambda x: x.eq(0).sum())])
.reset_index())
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If values are only 1 and 0 is possible use sum and last subtract:
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0','sum')])
.reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
in sql it would be simple aggregation
select department_id,count(*) as number_of_orders,
sum(case when reordered=0 then 1 else 0 end) as number_of_reordered_0
from tabl_name
group by department_id
I have dataframe like so,
ID,CLASS_ID,ACTIVE
1,123,0
2,123,0
3,456,1
4,123,0
5,456,1
11,123,1
18,123,0
7,456,0
19,123,0
8,456,1
I'm trying to get the cumulative counts of the CLASS_ID having same value for ACTIVE. In case of the dataframe given above, CLASS_ID is continuously having ACTIVE as 0, until the 4th record post which next value is 1. So up until 4th record, count should be 3. This process has to be continued and the count has to be resetted every time value of ACTIVE changes for the CLASS_ID The expected output is as follows..
ID,CLASS_ID,ACTIVE,ACTIVE_COUNT
1,123,0,3
2,123,0,3
3,456,1,2
4,123,0,3
5,456,1,2
11,123,1,1
18,123,0,2
7,456,0,1
19,123,0,2
8,456,1,1
I tried using df.groupby(..).transform(..) but its not working out for me. Could someone help me out a bit?
You can do this with groupby:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
lambda x: x.ne(x.shift()).cumsum()
)
df['ACTIVE_COUNT'] = df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
df
ID CLASS_ID ACTIVE ACTIVE_COUNT
0 1 123 0 3
1 2 123 0 3
2 3 456 1 2
3 4 123 0 3
4 5 456 1 2
5 11 123 1 1
6 18 123 0 2
7 7 456 0 1
8 19 123 0 2
9 8 456 1 1
Details
First, create an indicator column marking rows with the same value per group:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
lambda x: x.ne(x.shift()).cumsum()
)
ind
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 2
8 3
9 3
Name: ACTIVE, dtype: int64
We then use ind as a grouper argument to df.groupby along with "CLASS_ID", and then compute the size of each group using transform.
df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
0 3
1 3
2 2
3 3
4 2
5 1
6 2
7 1
8 2
9 1
Name: ACTIVE, dtype: int64
I am working with some advertising data, such as email data. I have two data sets:
one at the mail level, that for each person, states what days they were mailed, and then what day they were converted.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that says, for each person:
1 to max date,
whether they were emailed (0,1) and on the last day in the dataframe,
whether they converted or not (0,1).
We are assuming they convert on the last day in the dataframe.
I know to do to this using a nested for loop, but I think that is just incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high level approach involves modifying the df_summary (alias df2) to get our output. We'll need to
set_index operation on the days_max column on df2. We'll also change the name to days (which will help later on)
groupby to group on person
apply a reindex operation on the index (days, so we get rows for each day leading upto the last day)
fillna to fill NaNs in the convert column generated as a result of the reindex
assign to create a dummy column for emailed that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
def f(x):
return x.reindex(np.arange(1, x.index.max() + 1))
df = df2.set_index('days_max')\
.rename_axis('day')\
.groupby('person')['convert']\
.apply(f)\
.fillna(0)\
.astype(int)\
.to_frame()\
.assign(emailed=0)
df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
I have a dataframe which looks like this:
Trial Measurement Data
0 0 12
1 4
2 12
1 0 12
1 12
2 0 12
1 12
2 NaN
3 12
I want to resample my data so that every trial has just two measurements
So I want to turn it into something like this:
Trial Measurement Data
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12
This rather uncommon task stems from the fact that my data has an intentional jitter on the part of the stimulus presentation.
I know pandas has a resample function, but I have no idea how to apply it to my second-level index while keeping the data in discrete categories based on the first-level index :(
Also, I wanted to iterate, over my first-level indices, but apparently
for sub_df in np.arange(len(df['Trial'].max()))
Won't work because since 'Trial' is an index pandas can't find it.
Well, it's not the prettiest I've ever seen, but from a frame looking like
>>> df
Trial Measurement Data
0 0 0 12
1 0 1 4
2 0 2 12
3 1 0 12
4 1 1 12
5 2 0 12
6 2 1 12
7 2 2 NaN
8 2 3 12
then we can manually build the two "average-like" objects and then use pd.melt to reshape the output:
avg = df.groupby("Trial")["Data"].agg({0: lambda x: x.head((len(x)+1)//2).mean(),
1: lambda x: x.tail((len(x)+1)//2).mean()})
result = pd.melt(avg.reset_index(), "Trial", var_name="Measurement", value_name="Data")
result = result.sort("Trial").set_index(["Trial", "Measurement"])
which produces
>>> result
Data
Trial Measurement
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12