How to select lists with the same id in python? - python

I have a dataframe that looks like this:
id place age gender
13 1 3 1
13 2 4 1
13 3 3 2
13 4 4 2
14 1 3 1
14 2 4 1
14 3 3 2
I want to select place, age and gender for each distinct id in Python. For example, for id=13, I want to select the matrix:
place age gender
1 3 1
2 4 1
3 3 2
4 4 2
Note that the ids don't all have the same number of rows.
Thank you for your help.

You can select all the rows whose id is 13 with df[df['id'] == 13].
If you only want the remaining columns place, age and gender, then:
df.loc[df['id'] == 13, ['place', 'age', 'gender']]
# or
df[df['id'] == 13][['place', 'age', 'gender']]
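If the same selection is needed for every id rather than just 13, one possible approach is to group by id and collect one matrix per group. This is only a sketch: the DataFrame below is rebuilt from the sample in the question, and the name matrices is made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'id':     [13, 13, 13, 13, 14, 14, 14],
    'place':  [1, 2, 3, 4, 1, 2, 3],
    'age':    [3, 4, 3, 4, 3, 4, 3],
    'gender': [1, 1, 2, 2, 1, 1, 2],
})

# One NumPy matrix per id; groups may have different numbers of rows
matrices = {
    key: group[['place', 'age', 'gender']].to_numpy()
    for key, group in df.groupby('id')
}

print(matrices[13])
```

Because each group is stored separately in a dict, it does not matter that id 13 has four rows and id 14 only three.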

Related

Adding multiple columns randomly to a dataframe from columns in another dataframe

I've looked everywhere but can't find a solution.
Let's say I have two tables:
Year
1
2
3
4
and
ID Value
1 10
2 50
3 25
4 20
5 40
I need to pick randomly from both columns of the 2nd table to add to the first table - so if ID=3 is picked randomly as a column to add to the first table, I also add Value=25 i.e. end up with something like:
Year ID Value
1 3 25
2 1 10
3 1 10
4 5 40
5 2 50
If I understand correctly, is this what you want?
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
Output:
Year ID Value
0 1 4 20
1 2 4 20
2 3 2 50
3 4 3 25
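For reference, here is a self-contained sketch of the same idea, assuming the two tables are called df_year and df_id as in the answer. It samples whole rows (so ID and Value stay paired) and attaches them with pd.concat instead of a NumPy assignment; the random_state argument is only there to make a run repeatable and is an addition, not part of the original answer.

```python
import pandas as pd

# The two tables from the question
df_year = pd.DataFrame({'Year': [1, 2, 3, 4]})
df_id = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [10, 50, 25, 20, 40]})

# Draw one row of df_id per row of df_year, with replacement,
# so an ID can be picked more than once
sampled = df_id.sample(n=len(df_year), replace=True, random_state=0)

# Drop the sampled index so the rows line up with df_year positionally
result = pd.concat([df_year, sampled.reset_index(drop=True)], axis=1)
print(result)
```

Sampling rows rather than columns is what keeps each ID next to its own Value.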

Newly created column in a data frame needs to be updated with values from another column, based on a condition

The DataFrame has four columns. The 'Id' column is unique, and the rows are grouped by the 'idhogar' column.
The 'parentesco1' column has status 0 or 1. The 'Target' column has values that can differ between rows sharing the same 'idhogar' value.
INDEX Id parentesco1 idhogar Target
0 ID_fe8c32eba 0 4616164 2
1 ID_ca701e058 1 4616164 2
2 ID_5ad4372cd 0 4983866 3
3 ID_1e320689c 1 4983866 3
4 ID_700e30a8d 0 5905417 2
5 ID_bc99ecfb8 0 5905417 2
6 ID_308a05a16 1 5905417 2
7 ID_00186dde5 1 7.56E+06 4
8 ID_34570a74c 1 20713493 4
9 ID_b13870a19 1 27651991 3
10 ID_74e989389 1 45038655 4
11 ID_726ba7d34 0 60027579 4
12 ID_b75d7c648 0 60027579 4
13 ID_37e7b3aaa 1 60027579 4
14 ID_396da5a70 0 104578907 2
15 ID_4381374bb 1 104578907 2
16 ID_272a9b4d5 0 119024319 4
17 ID_1225f3779 0 119024319 4
18 ID_fc5dfaa2e 0 119024319 4
19 ID_7390a3f99 1 119024319 4
I created a new column 'Rev_target'. For all rows in the same 'idhogar' group, it should hold the 'Target' value of the row where 'parentesco1' is 1.
I tried the following, without success:
for idhogar in df['idhogar'].unique():
    if len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1:
        rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
        df['Rev_target'] = rev_target_val
# NOT WORKING AS REQUIRED ---- gives output as NaN in all rows of newly created column
I then tried the following, but it throws an error:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = np.where(len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1,
                                rev_target_val, df['Target'])
ValueError: operands could not be broadcast together with shapes () (0,) (9557,)
I also tried the following, but it does not work as intended: every row of the new 'Rev_target' column gets the same value, 2.
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = df.apply(lambda x: rev_target_val if (len(df[df['idhogar'] == idhogar]
                                ['Target'].unique()) != 1) else df['Target'], axis=1)
I would appreciate a solution. Thanks in advance.
I would sort the dataframe on parentesco1 in descending order, so that within each idhogar group the row with parentesco1 equal to 1 comes first. A transform can then pick up that first row for every member of the group:
df['Rev_target'] = df.sort_values('parentesco1', ascending=False).groupby(
'idhogar').transform(lambda x: x.iloc[0])['Target']
It gives:
INDEX Id parentesco1 idhogar Target Rev_target
0 0 ID_fe8c32eba 0 4616164.0 2 2
1 1 ID_ca701e058 1 4616164.0 2 2
2 2 ID_5ad4372cd 0 4983866.0 3 3
3 3 ID_1e320689c 1 4983866.0 3 3
4 4 ID_700e30a8d 0 5905417.0 2 2
5 5 ID_bc99ecfb8 0 5905417.0 2 2
6 6 ID_308a05a16 1 5905417.0 2 2
7 7 ID_00186dde5 1 7560000.0 4 4
8 8 ID_34570a74c 1 20713493.0 4 4
9 9 ID_b13870a19 1 27651991.0 3 3
10 10 ID_74e989389 1 45038655.0 4 4
11 11 ID_726ba7d34 0 60027579.0 4 4
12 12 ID_b75d7c648 0 60027579.0 4 4
13 13 ID_37e7b3aaa 1 60027579.0 4 4
14 14 ID_396da5a70 0 104578907.0 2 2
15 15 ID_4381374bb 1 104578907.0 2 2
16 16 ID_272a9b4d5 0 119024319.0 4 4
17 17 ID_1225f3779 0 119024319.0 4 4
18 18 ID_fc5dfaa2e 0 119024319.0 4 4
19 19 ID_7390a3f99 1 119024319.0 4 4
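An alternative that avoids sorting is to build a lookup from idhogar to the Target of the parentesco1 == 1 row and map it back onto every row. The sketch below uses a small made-up DataFrame (not the question's data) in which Target really does differ within a group. The name head_target is hypothetical, and falling back to the row's own Target when a group has no parentesco1 == 1 row is an assumption about the desired behaviour.

```python
import pandas as pd

# Small illustrative data: Target differs inside each idhogar group
df = pd.DataFrame({
    'Id':          ['a', 'b', 'c', 'd', 'e'],
    'parentesco1': [0, 1, 0, 0, 1],
    'idhogar':     [1, 1, 2, 2, 2],
    'Target':      [3, 2, 4, 4, 1],
})

# Lookup: idhogar -> Target of the row where parentesco1 == 1
# (drop_duplicates keeps the first such row if a group has several)
head_target = (df.loc[df['parentesco1'] == 1]
                 .drop_duplicates('idhogar')
                 .set_index('idhogar')['Target'])

# Map the lookup onto every row; keep the row's own Target where
# the group has no parentesco1 == 1 row (assumed fallback)
df['Rev_target'] = (df['idhogar'].map(head_target)
                      .fillna(df['Target'])
                      .astype(df['Target'].dtype))
print(df)
```

This also sidesteps the loop in the question, where each iteration overwrote the whole Rev_target column.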

How to append a specific string according to each value in a string pandas dataframe column?

Let's take these sample dataframes :
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to append to the Id column of df the Name of the person (available in df_name), in brackets, if it exists. I know how to do this with a for loop over the Id column of df, but that is inefficient with large dataframes. Is there a better way to do this?
Expected output :
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to look up the matching names, add the parentheses, and use Series.fillna to replace the unmatched values with the original column:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
            .fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
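To see why the fillna step is needed, it can help to look at the intermediate result. The sketch below reruns the same technique on the sample frames from the question; the variable names names and labelled are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'], 'Value': [9, 8, 7, 6, 5]})
df_name = pd.DataFrame({'Id': ['1', '2', '4'], 'Name': ['Andrew', 'Jason', 'John']})

# Look up each Id in df_name; Ids without a name become missing values
names = df['Id'].map(df_name.set_index('Id')['Name'])

# Concatenating a string with a missing value stays missing,
# so only the matched rows get the "Id (Name)" form
labelled = df['Id'] + ' (' + names + ')'

# Fall back to the original Id where no name was found
df['Id'] = labelled.fillna(df['Id'])
print(df)
```

The whole operation is vectorised, so there is no Python-level loop over the rows.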

Pandas groupby on one column without losing other columns?

I have a problem with groupby in pandas. I start with this data:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nick B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now, with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I got this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 14
3 3 Krishx C 12
4 4 Jack kr D 12
5 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing that matters to me, together with the sum of T.
Thanks
There are 2 ways. The first is to add an aggregation function for each column you want to keep: first, last or ', '.join for string columns, and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some column has the same value throughout each group, like Cat here, add that column to the groupby keys. Only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
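The ', '.join option mentioned above keeps every distinct name in a group instead of picking one. A minimal sketch on the question's data follows; the lambda that joins only the unique names is an illustrative choice, not something from the original answer.

```python
import pandas as pd

data = {'Code_Name': [1, 2, 3, 4, 1, 2, 3, 4],
        'Name': ['Tom', 'Nicko', 'Krish', 'Jack kr', 'Tom', 'Nick', 'Krishx', 'Jacks'],
        'Cat': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'T': [9, 7, 14, 12, 4, 3, 12, 11]}
df = pd.DataFrame(data)

# Keep all distinct names per Code_Name, the first Cat, and the sum of T
out = df.groupby('Code_Name', as_index=False).agg({
    'Name': lambda s: ', '.join(s.unique()),
    'Cat': 'first',
    'T': 'sum',
})
print(out)
```

This is useful when the differing spellings of Name should stay visible rather than be silently dropped.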
If you don't care about the other variables, then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now, to get your output, you can take the last value of Name and the first value of Cat for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
   .agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.

How to use Pandas aggregate functions on this DataFrame?

This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, counting the number of orders that come from each department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with size to count the rows per group, and a lambda function that compares the values with Series.eq and counts the matches by summing the True values (each True is treated as 1):
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders','size'), ('number_of_reordered_0',lambda x: x.eq(0).sum())])
         .reset_index())
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If the values are only 1 and 0, it is also possible to use sum and then subtract the result from the group size:
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders','size'), ('number_of_reordered_0','sum')])
         .reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
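The same result can also be written with pandas named aggregation, where each output column is given as a keyword argument of the form name=(column, function). This is a sketch of an alternative spelling, not the answer's original code; the DataFrame is rebuilt from the table in the question.

```python
import pandas as pd

# Table from the question
df = pd.DataFrame({
    'order_id':      [2, 2, 2, 2, 3, 3, 3, 3, 4],
    'product_id':    [33120, 28985, 9327, 45918, 17668, 46667, 17461, 32665, 46842],
    'reordered':     [1, 1, 0, 1, 1, 1, 1, 1, 0],
    'department_id': [16, 4, 13, 13, 16, 4, 12, 3, 3],
})

# Named aggregation: output_column=(input_column, aggregation)
df1 = (df.groupby('department_id')
         .agg(number_of_orders=('reordered', 'size'),
              number_of_reordered_0=('reordered', lambda x: x.eq(0).sum()))
         .reset_index())
print(df1)
```

The keyword form makes the output column names explicit without a rename step afterwards.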
In SQL it would be a simple aggregation:
select department_id, count(*) as number_of_orders,
       sum(case when reordered = 0 then 1 else 0 end) as number_of_reordered_0
from table_name
group by department_id
