how to find whether one value is associated with multiple values in pandas - python

I have the following dataframe in pandas:
code tank product
1234 1 MS
1234 2 HS
1234 1 HS
1234 1 HS
1235 1 MS
1235 1 HS
1235 1 MS
1245 1 MS
1245 2 HS
I want to find how many tanks have multiple products associated with them. In the above dataframe, e.g. for code 1234, tank 1 has both MS and HS.
There are 2 such cases in the above dataframe.
My desired dataframe would be:
code tank flag
1234 1 yes
1234 2 no
1235 1 yes
1245 1 no
1245 2 no
How can I do it in pandas?

Use SeriesGroupBy.nunique to count unique values per group:
df = df.groupby(['code','tank'])['product'].nunique().reset_index()
print (df)
code tank product
0 1234 1 2
1 1234 2 1
2 1235 1 2
3 1245 1 1
4 1245 2 1
Then extract the column with pop and set the values with numpy.where:
df['flag'] = np.where(df.pop('product') == 1, 'no', 'yes')
print (df)
code tank flag
0 1234 1 yes
1 1234 2 no
2 1235 1 yes
3 1245 1 no
4 1245 2 no
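For convenience, here is a self-contained sketch of the same approach with the sample data built inline (the variable name out is just illustrative; it assumes pandas and numpy are available):
import numpy as np
import pandas as pd
# sample data copied from the question
df = pd.DataFrame({'code': [1234, 1234, 1234, 1234, 1235, 1235, 1235, 1245, 1245],
                   'tank': [1, 2, 1, 1, 1, 1, 1, 1, 2],
                   'product': ['MS', 'HS', 'HS', 'HS', 'MS', 'HS', 'MS', 'MS', 'HS']})
# count distinct products per (code, tank), then flag groups with more than one product
out = df.groupby(['code', 'tank'])['product'].nunique().reset_index()
out['flag'] = np.where(out.pop('product') == 1, 'no', 'yes')
print(out)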


How to create new columns based off of columns from groupby results in Pandas? [duplicate]

There is a pandas DataFrame:
print(df)
call_id calling_number call_status
1 123 BUSY
2 456 BUSY
3 789 BUSY
4 123 NO_ANSWERED
5 456 NO_ANSWERED
6 789 NO_ANSWERED
Records with other call_status values (say "ERROR", or something else I cannot predict) may appear in the dataframe. I need to add a new column on the fly for such a value.
I have applied the pivot_table() function and I get the result I want:
df1 = df.pivot_table(index='calling_number', columns='call_status', values='call_id', aggfunc='count').fillna(0).astype('int64')
calling_number ANSWERED BUSY NO_ANSWER
123 0 1 1
456 0 1 1
789 0 1 1
Now I need to add one more column that would contain the percentage of answered calls with the given calling_number, calculated as the ratio of ANSWERED to the total.
The source dataframe 'df' may not contain entries with call_status = 'ANSWERED', so in that case the percentage column should naturally have a zero value.
Expected result is :
calling_number ANSWERED BUSY NO_ANSWER ANS_PERC(%)
123 0 1 1 0
456 0 1 1 0
789 0 1 1 0
Use crosstab:
df1 = pd.crosstab(df['calling_number'], df['call_status'])
Or, if you need to avoid the NaNs produced by the count aggregation, use pivot_table with the parameter fill_value=0:
df1 = df.pivot_table(index='calling_number',
                     columns='call_status',
                     values='call_id',
                     aggfunc='count',
                     fill_value=0)
Then, for the ratio, divide by the summed values per row:
df1 = df1.div(df1.sum(axis=1), axis=0)
print (df1)
ANSWERED BUSY NO_ANSWER
calling_number
123 0.333333 0.333333 0.333333
456 0.333333 0.333333 0.333333
789 0.333333 0.333333 0.333333
EDIT: To include categories that may not exist in the data, use DataFrame.reindex:
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1['ANSWERED'].sum()).fillna(0)
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
If you need the total per row:
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
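As a side note (not part of the original answer), crosstab can compute row proportions directly via its normalize parameter, which avoids the manual division; a small sketch, with the names props and ans_share purely illustrative:
# row-wise proportions in one step (each row sums to 1)
props = pd.crosstab(df['calling_number'], df['call_status'], normalize='index')
# share of ANSWERED calls per calling_number; falls back to 0 if the column is absent
ans_share = props.get('ANSWERED', 0)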
EDIT1: Solution that replaces unexpected values with ERROR:
print (df)
call_id calling_number call_status
0 1 123 ttt
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
L = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
df['call_status'] = df['call_status'].where(df['call_status'].isin(L), 'ERROR')
print (df)
call_id calling_number call_status
0 1 123 ERROR
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=L + ['ERROR'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ERROR ANS_PERC(%)
calling_number
123 0 0 1 1 0.0
456 0 1 1 0 0.0
789 0 1 1 0 0.0
I like the crosstab idea, but I am a fan of column manipulation so that it's easy to refer back to:
# define a function to capture all the other call_statuses into one bucket
def tester(x):
    if x not in ['ANSWERED', 'BUSY', 'NO_ANSWERED']:
        return 'OTHER'
    else:
        return x

# capture the simplified status in a new column
df['refined_status'] = df['call_status'].apply(tester)

# do the pivot (or crosstab) to capture the counts:
df1 = df.pivot_table(values="call_id", index='calling_number', columns='refined_status', aggfunc='count')

# apply a division to get the percentages:
df1["TOTAL"] = df1[['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER']].sum(axis=1)
df1["ANS_PERC"] = df1["ANSWERED"] / df1.TOTAL * 100
print(df1)

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2, where I need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series directly to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
Alternative: cast the boolean mask to integer to convert True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around 'sum_column':
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
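For completeness, the same lookup can also be done with a merge instead of map; a hedged sketch assuming df1 and df2 as in the question (sums and merged are illustrative names):
# aggregate df2, then attach the per-code sums to df1 by merging on code / c_id
sums = df2.groupby('c_id', as_index=False)['sum_column'].sum()
merged = df1.merge(sums, left_on='code', right_on='c_id', how='left')
# compare row by row; to_numpy() sidesteps any index-alignment surprises
df1['seq'] = (merged['amount'] > merged['sum_column']).astype(int).to_numpy()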

select rows in a dataframe in python based on two criteria

Based on the dataframe (1) below, I wish to create a dataframe (2) where either y or z is equal to 2. Is there a way to do this conveniently?
And if I were to create a dataframe (3) that only contains rows from dataframe (1) but not dataframe (2), how should I approach it?
id x y z
0 324 1 2
1 213 1 1
2 529 2 1
3 347 3 2
4 109 2 2
...
df[df[['y','z']].eq(2).any(axis=1)]
Out[1205]:
id x y z
0 0 324 1 2
2 2 529 2 1
3 3 347 3 2
4 4 109 2 2
You can create df2 easily enough using a condition:
df2 = df1[df1.y.eq(2) | df1.z.eq(2)]
df2
x y z
id
0 324 1 2
2 529 2 1
3 347 3 2
4 109 2 2
Given df2 and df1, you can perform a set difference operation on the index, like this:
df3 = df1.loc[df1.index.difference(df2.index)]
df3
x y z
id
1 213 1 1
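Alternatively (a small sketch, not from the original answer), df3 can be built directly by negating the same mask used for df2:
# keep only rows where neither y nor z equals 2
mask = df1.y.eq(2) | df1.z.eq(2)
df3 = df1[~mask]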
You can do the following:
import pandas as pd
df = pd.read_csv('data.csv')
df2 = df[(df.y == 2) | (df.z == 2)]
print(df2)
Results:
id x y z
0 0 324 1 2
2 2 529 2 1
3 3 347 3 2
4 4 109 2 2

Modify timestamps to sequence per ID

I have a Pandas dataframe (Python 3.5.1) with a timestamp column and an ID column.
Timestamp ID
0 2016-04-01T00:15:36.688 123
1 2016-04-01T00:12:52.688 123
2 2016-04-01T00:35:41.688 543
3 2016-04-01T00:01:12.688 543
4 2016-03-31T23:50:59.688 123
5 2016-04-01T01:05:52.688 543
I would like to sequence the timestamps per ID.
Timestamp ID Sequence
0 2016-04-01T00:15:36.688 123 3
1 2016-04-01T00:12:52.688 123 2
2 2016-04-01T00:35:41.688 543 2
3 2016-04-01T00:01:12.688 543 1
4 2016-03-31T23:50:59.688 123 1
5 2016-04-01T01:05:52.688 543 3
What is the best way to order the timestamps per ID, and generate a sequence number unique to each ID?
You can use sort_values(), groupby() and cumcount():
In [10]: df['Sequence'] = df.sort_values('Timestamp').groupby('ID').cumcount() + 1
In [11]: df
Out[11]:
Timestamp ID Sequence
0 2016-04-01 00:15:36.688 123 3
1 2016-04-01 00:12:52.688 123 2
2 2016-04-01 00:35:41.688 543 2
3 2016-04-01 00:01:12.688 543 1
4 2016-03-31 23:50:59.688 123 1
5 2016-04-01 01:05:52.688 543 3
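If the Timestamp column is still stored as strings, an equivalent approach (a sketch, not from the original answer, assuming timestamps are unique within each ID) is to parse it and rank within each group:
# parse the ISO timestamps, then rank chronologically within each ID
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Sequence'] = df.groupby('ID')['Timestamp'].rank(method='first').astype(int)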

Python Pandas operate on row

Hi, my dataframe looks like:
Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264
Simply, I need to add another column called 'id' as the concatenation of Store, Dept and Date, like "1_1_2010-02-05". I assumed I could do it with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turned out not to work.
Similarly, I also need to add a new column as the log of Sales. I tried df['logSales'] = math.log(df['Sales']); again, it did not work.
You can first convert the integer columns to strings before concatenating with +:
In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
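As an aside (not part of the original answer), the same concatenation can also be written with Series.str.cat, which takes a separator:
# equivalent way to build the id column
df['id'] = df['Store'].astype(str).str.cat([df['Dept'].astype(str), df['Date']], sep='_')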
For the log, you are better off using the numpy function. It is vectorized (math.log only works on single scalar values):
In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from using vectorized functions (which work on the full column), but when your real dataframe becomes larger, it will.
Apart from that, I think the above solution also has the simpler syntax.
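To make the performance point concrete, one could time both variants; a minimal sketch using timeit (the 1,000,000-row size and the variable names are illustrative, not from the original thread):
import math
import timeit
import numpy as np
import pandas as pd
# a larger frame, so the difference between apply and the vectorized call is visible
big = pd.DataFrame({'Sales': np.random.randint(1, 1000, size=1_000_000)})
t_vectorized = timeit.timeit(lambda: np.log(big['Sales']), number=10)
t_apply = timeit.timeit(lambda: big['Sales'].apply(math.log), number=10)
print(t_vectorized, t_apply)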
In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
Store Dept Date Sales
0 1 1 2010-02-05 245
1 1 1 2010-02-12 449
2 1 1 2010-02-19 455
3 1 1 2010-02-26 154
4 1 1 2010-03-05 29
5 1 1 2010-03-12 239
6 1 1 2010-03-19 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; you need to convert Store and Dept to strings in order to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
[7 rows x 5 columns]
In [155]:
import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
[7 rows x 6 columns]
