From the u.data file, which has [100000 rows x 4 columns],
I have to find out which are the best movies.
For each unique item_id (there are 1682 of them), I am trying to find the overall rating separately.
import pandas as pd

ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", delim_whitespace=True,
                      names=["user_id", "item_id", "rating", "timestamp"])
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output:
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used best_m = ratings.groupby("item_id")["rating"].sum()
followed by best_m = best_m.sort_values(ascending=False),
and the output looks like:
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
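The groupby/sum itself looks right; the totals above are just ordered by value rather than by item_id. A minimal sketch of both orderings, assuming the ratings frame read above:
best_m = ratings.groupby("item_id")["rating"].sum()
print(best_m.sort_index())                          # totals listed in item_id order
print(best_m.sort_values(ascending=False).head())   # highest totals first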
So I have been trying to use pandas to create a DataFrame that reports the number of graduates working at jobs that do require college degrees ('college_jobs') and at jobs that do not ('non_college_jobs').
Note: the name of the DataFrame I am dealing with is recent_grads.
I tried the following code:
df1 = recent_grads.groupby(['major_category']).college_jobs.non_college_jobs.sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs','non_college_jobs'].sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs'],['non_college_jobs'].sum()
None of them worked! What am I supposed to do? Can somebody give me a simple explanation? I have been trying to read through the pandas documentation and did not find the explanation I wanted.
Here is the head of the DataFrame:
rank major_code major major_category \
0 1 2419 PETROLEUM ENGINEERING Engineering
1 2 2416 MINING AND MINERAL ENGINEERING Engineering
2 3 2415 METALLURGICAL ENGINEERING Engineering
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering
4 5 2405 CHEMICAL ENGINEERING Engineering
total sample_size men women sharewomen employed ... \
0 2339 36 2057 282 0.120564 1976 ...
1 756 7 679 77 0.101852 640 ...
2 856 3 725 131 0.153037 648 ...
3 1258 16 1123 135 0.107313 758 ...
4 32260 289 21239 11021 0.341631 25694 ...
part_time full_time_year_round unemployed unemployment_rate median \
0 270 1207 37 0.018381 110000
1 170 388 85 0.117241 75000
2 133 340 16 0.024096 73000
3 150 692 40 0.050125 70000
4 5180 16697 1672 0.061098 65000
p25th p75th college_jobs non_college_jobs low_wage_jobs
0 95000 125000 1534 364 193
1 55000 90000 350 257 50
2 50000 105000 456 176 0
3 43000 80000 529 102 0
4 50000 75000 18314 4440 972
[5 rows x 21 columns]
You could filter the initial DataFrame by the columns you're interested in and then perform the groupby and summation as below:
recent_grads[['major_category', 'college_jobs', 'non_college_jobs']].groupby('major_category').sum()
Alternatively, if you don't perform the initial column filter and call .sum() directly on recent_grads.groupby('major_category'), it will be applied to every numeric column.
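Equivalently, you can select the columns after the groupby, which some find reads more directly:
recent_grads.groupby('major_category')[['college_jobs', 'non_college_jobs']].sum()
Note the double brackets: the outer pair indexes the groupby object, and the inner pair is the list of columns to keep.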
I need to do a 'pandas non-equi join',
where the first table is joined to the second table on a range condition.
first_table
EMPLOYEE_ID SALARY
100 3000.00
101 17000.00
102 17000.00
103 9000.00
104 6000.00
105 4800.00
106 4800.00
………….. …………
………………. …………
second_table
grade_id lowest_sal highest_sal grade_level
1 0 3500 GRADE-A
2 3501 7000 GRADE-B
3 7001 10000 GRADE-C
4 10000 20000 GRADE-D
Needed table (OUTPUT):
EMPLOYEE_ID SALARY grade_level
100 3000 GRADE-A
101 17000 GRADE-D
102 17000 GRADE-D
103 9000 GRADE-C
104 6000 GRADE-B
105 4800 GRADE-B
106 4800 GRADE-B
The equivalent SQL query is:
SELECT f.EMPLOYEE_ID,
f.SALARY,
s.grade_level
FROM first_table f JOIN second_table s
ON f.SALARY BETWEEN s.lowest_sal AND s.highest_sal
I can't use the pd.merge method to join the tables because they don't have any common column.
Please help me find a method.
Thanks!
If df1 is your first table and df2 is your second table, you could, for example, do this:
import numpy as np

# index: the grade levels; data: [grade_id, lowest_sal, highest_sal] per grade
d = df2.set_index('grade_level').to_dict('split')
# for each salary, take the first grade whose band contains it
df1['GRADE'] = df1['SALARY'].apply(
    lambda x: next((c for i, c in enumerate(d['index'])
                    if d['data'][i][1] <= x <= d['data'][i][2]), np.nan)
)
print(df1)
Prints:
EMPLOYEE_ID SALARY GRADE
0 100 3000.0 GRADE-A
1 101 17000.0 GRADE-D
2 102 17000.0 GRADE-D
3 103 9000.0 GRADE-C
4 104 6000.0 GRADE-B
5 105 4800.0 GRADE-B
6 106 4800.0 GRADE-B
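As an aside, the apply above rescans every grade band per salary; since the bands here are contiguous and non-overlapping, pd.cut is a vectorized alternative (a sketch reusing the same df1/df2 names):
import pandas as pd
# bin edges: just below the first band's lower bound, then each band's upper bound
edges = [df2['lowest_sal'].iloc[0] - 1] + df2['highest_sal'].tolist()
df1['GRADE'] = pd.cut(df1['SALARY'], bins=edges, labels=df2['grade_level'])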
One option is conditional_join from pyjanitor, which avoids a cartesian join (helpful for memory and performance, depending on the data size):
# pip install pyjanitor
import pandas as pd
import janitor
(first_table
.astype({'SALARY':int})
.conditional_join(
second_table,
('SALARY', 'lowest_sal', '>='),
('SALARY', 'highest_sal', '<='))
.loc[:, ['EMPLOYEE_ID', 'SALARY', 'grade_level']]
)
EMPLOYEE_ID SALARY grade_level
0 100 3000 GRADE-A
1 101 17000 GRADE-D
2 102 17000 GRADE-D
3 103 9000 GRADE-C
4 104 6000 GRADE-B
5 105 4800 GRADE-B
6 106 4800 GRADE-B
I have a DataFrame df as below:
Stud_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two DataFrames as below:
df_1:
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and unique Code, take the sum of Amount (e.g. Card 1, Code 543 → 350).
2. Avg_Amount: divide Total_Amount by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350, 2 unique yearmonths, so 350 / 2 = 175).
df_2:
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of the df_1 Avg_Amount values for each Code (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divide by the number of rows, 4, so 625 / 4 = 156.25).
Code to create the DataFrame df:
df = pd.DataFrame({'Stud_id': (111,111,111,111,111,222,222,222,333,333,333,333),
'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
'Nation':('India','India','India','India','India','India','India','India','India','India','India','India'),
'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
'Age':('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
'Code':(543,543,543,612,715,715,543,543,543,543,543,612),
'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
'yearmonth':(201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card', 'Code'])[['yearmonth', 'Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount) / len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card', 'Code', 'Total_Amount', 'Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x) / len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge so it takes a long time. I am looking for optimized code; I think the apply function is what's slow. Is there a better, more optimized approach, please?
For DataFrame 1 you can do this:
# per (Card, Code): total Amount and the number of distinct yearmonth values
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
# average = total amount / number of distinct months, then drop the helper column
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
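On pandas 0.25+, named aggregation expresses the same computation in one chain and produces the Total_Amount column name directly; a sketch on the same df:
df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique'))   # n_months is a helper column
         .assign(Avg_Amount=lambda t: t.Total_Amount / t.n_months)
         .drop(columns='n_months'))
df2 = df1.groupby('Code', as_index=False).agg(Avg_Amount=('Avg_Amount', 'mean'))
Everything stays vectorized, so the per-group Python lambdas from the original apply version disappear.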
I have data that I've left in a format that will allow me to pivot on dates; it looks like:
Region 0 1 2 3
Date 2005-01-01 2005-02-01 2005-03-01 ....
East South Central 400 500 600
Pacific 100 200 150
.
.
Mountain 500 600 450
I need to pivot this table so it looks like:
   Date        Region              value
0  2005-01-01  East South Central  400
1  2005-02-01  East South Central  500
2  2005-03-01  East South Central  600
.
.
   2005-01-01  Pacific             100
   2005-02-01  Pacific             200
   2005-03-01  Pacific             150
.
.
Since Date and Region are stacked under one another as rows, I'm not sure how to melt or pivot around these strings to get my desired output.
How can I go about this?
I think this is the solution you are looking for. Shown by example.
import pandas as pd
import numpy as np

N = 100
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)), list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)), list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)
print(df)
This gives
0 1 2 3 4 5 6 7 \
Region 0 1 2 3 4 5 6 7
Date 2016-0 2016-1 2016-2 2016-3 2016-4 2016-5 2016-6 2016-7
a 96 432 181 64 87 355 339 314
b 360 23 162 98 450 78 114 109
c 143 375 420 493 321 277 208 317
d 371 144 207 108 163 67 465 130
And the solution to pivot this into the form you want is
df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])
which gives
Date variable value
0 2016-0 a 96
1 2016-1 a 432
2 2016-2 a 181
3 2016-3 a 64
4 2016-4 a 87
5 2016-5 a 355
6 2016-6 a 339
7 2016-7 a 314
8 2016-8 a 111
9 2016-9 a 121
10 2016-10 a 124
11 2016-11 a 383
12 2016-12 a 424
13 2016-13 a 453
...
393 2016-93 d 176
394 2016-94 d 277
395 2016-95 d 256
396 2016-96 d 174
397 2016-97 d 349
398 2016-98 d 414
399 2016-99 d 132
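Applied to the frame in the question, the same pattern would look like this (the value_vars list is assumed from the sample shown, not the full set of regions):
out = df.transpose().melt(
    id_vars=['Date'],
    value_vars=['East South Central', 'Pacific', 'Mountain'],  # hypothetical full list
    var_name='Region')
Passing var_name='Region' renames melt's default 'variable' column, giving the Date / Region / value layout in the desired output.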