Pandasql with conditions - python

I have two dataframes:
The first one has student information. I will call it df1:
user_id | plan | subplan | matrix_code | student_semester
102532 | GADMSSP | GSP10 | 1501 | 8
106040 | GRINTSP | | 1901 | 4
106114 | GCSOSSULA | | 1901 | 4
106504 | GCSOSSP | | 1902 | 3
106664 | GCINESP | | 1901 | 4
The second one has the elective requirements for an institution. I will call it df2:
plan | subplan | matrix_code | semester | credits| cumulative_credits
GADMSSP | | 1501 | 5 | 4 | 4
GADMSSP | | 1501 | 6 | 4 | 8
GADMSSP | | 1501 | 7 | 4 | 12
GADMSSP | | 1501 | 8 | 0 | 12
GRINTSP | | 1901 | 7 | 2 | 2
GRINTSP | | 1901 | 8 | 0 | 2
GCSOSSULA | | 1901 | 3 | 4 | 4
GCSOSSULA | | 1901 | 4 | 0 | 4
GCSOSSULA | | 1901 | 5 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 5 | 4 | 8
GCSOSSULA | | 1901 | 6 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 6 | 0 | 8
GCSOSSULA | | 1901 | 7 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 7 | 0 | 8
GCSOSSULA | | 1901 | 8 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 8 | 0 | 8
GCSOSSP | | 1902 | 5 | 4 | 4
GCSOSSP | | 1902 | 6 | 4 | 8
GCSOSSP | | 1902 | 7 | 4 | 12
GCSOSSP | | 1902 | 8 | 0 | 12
GCINESP | | 1901 | 2 | 4 | 4
GCINESP | | 1901 | 3 | 4 | 8
GCINESP | | 1901 | 4 | 4 | 12
GCINESP | | 1901 | 5 | 4 | 16
GCINESP | | 1901 | 6 | 4 | 24
GCINESP | | 1901 | 7 | 4 | 32
GCINESP | | 1901 | 8 | 4 | 40
So I have to merge the dataframes considering some conditions:
plan and matrix_code must be the same for df1 and df2.
df1.subplan is either the same as df2.subplan, or df2.subplan can be null. So user_id 102532 in line 1 of df1 will get the requirements where df2.subplan is null, since there are no subplan-specific requirements for this plan and matrix_code.
Get student_semester + 1, but treat the maximum df2.semester as the cap on student_semester. So user_id 102532 in line 1 must remain in semester 8: I cannot add +1 semester there, but I would like to flag that this user did not reach the requirements by the last semester (a rough pandas sketch of the capping follows this list).
I am only interested in cumulative_credits.
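Roughly, condition 3 in pandas would be something like this (a sketch with hypothetical names: merged stands for the joined frame, and max_semester for df2's maximum semester per plan and matrix_code, carried along in the merge):
# hypothetical names: `merged` is the joined frame, `max_semester` its per-plan cap
merged['next_semester'] = (merged['student_semester'] + 1).clip(upper=merged['max_semester'])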
For these two dfs the result should be something like this:
user_id | plan | subplan | matrix_code | semester | student_semester | cumulative_credits
102532 | GADMSSP | GSP10 | 1501 | 8 | 9 | 12
106040 | GRINTSP | | 1901 | 5 | 4 | 0
106114 | GCSOSSULA | | 1901 | 5 | 4 | 4
106504 | GCSOSSP | | 1902 | 4 | 3 | 0
106664 | GCINESP | | 1901 | 5 | 4 | 16
But if there is no feasible way to include the students with 0 cumulative_credits, the result should be:
user_id | plan | subplan | matrix_code | semester | student_semester | cumulative_credits
102532 | GADMSSP | GSP10 | 1501 | 8 | 9 | 12
106114 | GCSOSSULA | | 1901 | 5 | 4 | 4
106664 | GCINESP | | 1901 | 5 | 4 | 16
What I did until now is the following:
pip install -U pandasql
import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df2 = df2.groupby(['plan', 'subplan', 'matrix_code', 'semester']).cumulative_credits.max()
df2 = df2.to_frame()
df2 = df2.reset_index()
electives = """
SELECT user_id
,a.plan
,a.subplan as "student_subplan"
,a.matrix_code
,a.student_semester
,b.subplan as "matrix_subplan"
,b.semester
,cumulative_credits
FROM df1 a
LEFT JOIN df2 b
ON a.plan = b.plan
AND a.matrix_code = b.matrix_code
WHERE (b.subplan = '' OR a.subplan = b.subplan)
"""
electives = pysqldf(electives)
Then I was trying to implement the 3rd condition, but I have no clue about the right way to do this. I think I could use a lambda, but I am not sure how.
df_s['semester_x'] = df_s['student_semester'] +1 | df_s['student_semester'] == df_s['semester'].max()
Also, if there is a better way to do the previous condition steps using a merge with a condition, that would be nice.
EDIT - SOLUTION:
I used part of Parfait's solution. I just added conditional logic to get the cumulative credits of the student's next semester instead of the max cumulative credits of the matrix code.
Here is what I've done:
First part - Parfait's solution:
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan': 'student_subplan', 'subplan_': 'matrix_subplan',
                  'semester': 'matrix_semester'}, axis='columns')
      )
Second part:
y = """
with a as
(
SELECT DISTINCT plan
,CASE
WHEN plan LIKE '%SULB%' OR plano LIKE '%SULC%' THEN 10
WHEN plan LIKE '%SULD%' OR plano LIKE '%SULE%' THEN 12
ELSE 8
END as "semester_max"
FROM agg
)
SELECT DISTINCT
user_id
,student_semester
,plan
,student_subplan
,matrix_code
,matrix_subplan
,cumulative_credits
,matrix_semester
,semester_max
,CASE
WHEN student_semester < semester_max THEN (student_semester)+1
WHEN student_semester = semester_max THEN student_semester
END as "next_semester"
FROM
(
SELECT DISTINCT
user_id
,student_semester
,b.plan
,student_subplan
,matrix_code
,matrix_subplan
,cumulative_credits
,matrix_semester
,semester_max
FROM agg b
INNER JOIN a ON b.plano = a.plano
) x
WHERE matrix_semester = next_semester
"""
z = pysqldf(x)

Consider adding a CASE statement in the SQL query:
SELECT d1.user_id
, d1.plan
, d1.subplan AS student_subplan
, d1.matrix_code
, d1.student_semester
, d2.subplan AS matrix_subplan
, CASE
WHEN d1.student_semester = MAX(d2.semester)
THEN d1.student_semester
ELSE d1.student_semester + 1
END AS semester
, MAX(d2.cumulative_credits) AS cumulative_credits
FROM df1 d1
LEFT JOIN df2 d2
ON d1.plan = d2.plan
AND d1.matrix_code = d2.matrix_code
WHERE (d2.subplan IS NULL OR d1.subplan = d2.subplan)
GROUP BY d1.user_id
, d1.plan
, d1.subplan
, d1.matrix_code
, d1.student_semester
, d2.subplan;
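It can be run with the same pysqldf helper defined in the question:
# assuming the CASE query above is saved in the string `q`
electives = pysqldf(q)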
In Pandas, the translation would use merge + groupby + Series.where for the CASE conditional logic:
# MERGE
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan': 'student_subplan', 'subplan_': 'matrix_subplan'}, axis='columns')
      )
# AGGREGATION
agg = (agg.groupby(['user_id', 'plan', 'student_subplan', 'matrix_code',
                    'student_semester', 'matrix_subplan'], as_index=False)
          .agg({'semester': 'max', 'cumulative_credits': 'max'})
      )
# CONDITIONAL LOGIC
agg['semester'] = agg['student_semester'].where(agg['semester'] == agg['student_semester'],
                                                agg['student_semester'].add(1))
agg
#    user_id       plan student_subplan  matrix_code  student_semester matrix_subplan  semester  cumulative_credits
# 0   102532    GADMSSP           GSP10         1501                 8                        8                  12
# 1   106040    GRINTSP                         1901                 4                        5                   2
# 2   106114  GCSOSSULA                         1901                 4                        5                   4
# 3   106504    GCSOSSP                         1902                 3                        4                  12
# 4   106664    GCINESP                         1901                 4                        5                  40

Related

Modify a column according to another dataframe's column in Python

I have two dataframes. One is the master dataframe and the other df is used to fill my master dataframe.
What I want is to fill one column according to another column, without altering the other columns.
This is example of master df
| id | Purch. order | cost | size | code |
| 1 | G918282 | 8283 | large| hchs |
| 2 | EE18282 | 1283 | small| ueus |
| 3 | DD08282 | 5583 | large| kdks |
| 4 | GU88912 | 8232 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
This is example of the another df
| id | Purch. order | cost |
| 1 | G918282 | 7728 |
| 2 | EE18282 | 2211 |
| 3 | DD08282 | 5321 |
| 4 | GU88912 | 4778 |
| 5 | NaN | 4283 |
| 6 | Nan | 9993 |
| 7 | Nan | 3442 |
This is the result I'd like
| id | Purch. order | cost | size | code |
| 1 | G918282 | 7728 | large| hchs |
| 2 | EE18282 | 2211 | small| ueus |
| 3 | DD08282 | 5321 | large| kdks |
| 4 | GU88912 | 4778 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
Only the cost column should be modified, and only where the secondary df coincides on Purch. order and it is not NaN.
I hope you can help me... and I'm sorry if my English is basic; it is not my mother language. Thanks a lot.
Let's try update, which works along indexes. By default overwrite is set to True, which will overwrite overlapping values in your target dataframe; use overwrite=False if you only want to change NA values.
master_df = master_df.set_index(['id','Purch. order'])
another_df = another_df.dropna(subset=['Purch. order']).set_index(['id','Purch. order'])
master_df.update(another_df)
print(master_df)
cost size code
id Purch. order
1 G918282 7728.0 large hchs
2 EE18282 2211.0 small ueus
3 DD08282 5321.0 large kdks
4 GU88912 4778.0 large jdhd
5 NaN 1283.0 large jdjd
6 Nan 5583.0 large qqas
7 Nan 8232.0 large djjs
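Since update modifies master_df in place, if you want id and Purch. order back as regular columns afterwards, just reset the index:
master_df = master_df.reset_index()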
You can do it with a merge followed by updating the cost column based on where the NaNs are:
final_df = df1.merge(df2[~df2["Purch. order"].isna()], on = 'Purch. order', how="left")
final_df.loc[~final_df['Purch. order'].isnull(), "cost"] = final_df['cost_y'] # not nan
final_df.loc[final_df['Purch. order'].isnull(), "cost"] = final_df['cost_x'] # nan
final_df = final_df.drop(['id_y','cost_x','cost_y'],axis=1)
Output:
id_x Purch. order size code cost
0 1 G918282 large hchs 7728.0
1 2 EE18282 small ueus 2211.0
2 3 DD08282 large kdks 5321.0
3 4 GU88912 large jdhd 4778.0
4 5 NaN large jdjd 1283.0
5 6 NaN large qqas 5583.0
6 7 NaN large djjs 8232.0

Split a column and combine rows where there are multiple data measures

I'm trying to use Python to solve my data analysis problem.
I have a table like this:
+----------+-----+------+--------+-------------+--------------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | Value_column |
+----------+-----+------+--------+-------------+--------------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 |
| 11 | 1 | 2020 | Name1 | QTRMAX | 6 |
| 11 | 2 | 2020 | Name1 | QTRMAX | 9 |
| 11 | 3 | 2020 | Name1 | QTRMAX | 7 |
| 11 | 4 | 2020 | Name1 | QTRMAX | 10 |
+----------+-----+------+--------+-------------+--------------+
I want to arrange Value_column in a way that captures when there are multiple Qtr_Measures for unique IDs and MEF_IDs. Doing this reduces the overall size of the table, and I would like columns for each Qtr_Measure type, as below:
+----------+-----+------+--------+-------------+--------+--------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | QTRAVG | QTRMAX |
+----------+-----+------+--------+-------------+--------+--------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 | 6 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 | 9 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 | 7 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 | 10 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 | |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 | |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 | |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 | |
+----------+-----+------+--------+-------------+--------+--------+
How can I do this with Python?
Thank you.
Use pivot_table with reset_index and rename_axis:
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1)
      )
print(piv)
ID QTR Year MEF_ID QTRAVG QTRMAX
0 11 1 2020 Name1 5.0 6.0
1 11 2 2020 Name1 8.0 9.0
2 11 3 2020 Name1 6.0 7.0
3 11 4 2020 Name1 9.0 10.0
4 15 1 2020 Name2 67.0 NaN
5 15 2 2020 Name2 89.0 NaN
6 15 3 2020 Name2 100.0 NaN
7 15 4 2020 Name2 121.0 NaN
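If you prefer blanks instead of NaN, as in the desired output, one option is to fill after pivoting (note this converts the numeric measure columns to object dtype):
print(piv.fillna(''))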

How to calculate percentage change on this simple data frame?

I have data that looks like this:
+------+---------+------+-------+
| Year | Cluster | AREA | COUNT |
+------+---------+------+-------+
| 2016 | 0 | 10 | 2952 |
| 2016 | 1 | 10 | 2556 |
| 2016 | 2 | 10 | 8867 |
| 2016 | 3 | 10 | 9786 |
| 2017 | 0 | 10 | 2470 |
| 2017 | 1 | 10 | 3729 |
| 2017 | 2 | 10 | 8825 |
| 2017 | 3 | 10 | 9114 |
| 2018 | 0 | 10 | 1313 |
| 2018 | 1 | 10 | 3564 |
| 2018 | 2 | 10 | 7245 |
| 2018 | 3 | 10 | 6990 |
+------+---------+------+-------+
I have to get the percentage changes for each cluster compared to the previous year, e.g.
+------+---------+-----------+-------+----------------+
| Year | Cluster | AREA | COUNT | Percent Change |
+------+---------+-----------+-------+----------------+
| 2016 | 0 | 10 | 2952 | NaN |
| 2017 | 0 | 10 | 2470 | -16.33% |
| 2018 | 0 | 10 | 1313 | -46.84% |
| 2016 | 1 | 10 | 2556 | NaN |
| 2017 | 1 | 10 | 3729 | 45.89% |
| 2018 | 1 | 10 | 3564 | -4.42% |
| 2016 | 2 | 10 | 8867 | NaN |
| 2017 | 2 | 10 | 8825 | -0.47% |
| 2018 | 2 | 10 | 7245 | -17.90% |
| 2016 | 3 | 10 | 9786 | NaN |
| 2017 | 3 | 10 | 9114 | -6.87% |
| 2018 | 3 | 10 | 6990 | -23.30% |
+------+---------+-----------+-------+----------------+
Is there an easy way to do this?
I've tried a few things below; this seemed to make the most sense, but it returns NaN for each pct_change.
df['pct_change'] = df.groupby(['Cluster','Year'])['COUNT '].pct_change()
+------+---------+------+------------+------------+
| Year | Cluster | AREA | Count | pct_change |
+------+---------+------+------------+------------+
| 2016 | 0 | 10 | 295200.00% | NaN |
| 2016 | 1 | 10 | 255600.00% | NaN |
| 2016 | 2 | 10 | 886700.00% | NaN |
| 2016 | 3 | 10 | 978600.00% | NaN |
| 2017 | 0 | 10 | 247000.00% | NaN |
| 2017 | 1 | 10 | 372900.00% | NaN |
| 2017 | 2 | 10 | 882500.00% | NaN |
| 2017 | 3 | 10 | 911400.00% | NaN |
| 2018 | 0 | 10 | 131300.00% | NaN |
| 2018 | 1 | 10 | 356400.00% | NaN |
| 2018 | 2 | 10 | 724500.00% | NaN |
| 2018 | 3 | 10 | 699000.00% | NaN |
+------+---------+------+------------+------------+
Basically, I simply want the function to compare the year-on-year change for each cluster.
df['pct_change'] = df.groupby(['Cluster'])['Count'].pct_change()
df.sort_values('Cluster', axis = 0, ascending = True)
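To get it formatted like the desired output, the fraction can be scaled and rendered as strings (a sketch; it assumes the column is named COUNT as in the question, and it makes the new column non-numeric):
pct = df.groupby('Cluster')['COUNT'].pct_change().mul(100)
df['Percent Change'] = pct.map('{:.2f}%'.format, na_action='ignore')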
Another method, going old school with transform:
df['p'] = df.groupby('cluster')['count'].transform(lambda x: (x-x.shift())/x.shift())
df = df.sort_values(by='cluster')
print(df)
year cluster area count p
0 2016 0 10 2952 NaN
4 2017 0 10 2470 -0.163279
8 2018 0 10 1313 -0.468421
1 2016 1 10 2556 NaN
5 2017 1 10 3729 0.458920
9 2018 1 10 3564 -0.044248
2 2016 2 10 8867 NaN
6 2017 2 10 8825 -0.004737
10 2018 2 10 7245 -0.179037
3 2016 3 10 9786 NaN
7 2017 3 10 9114 -0.068670
11 2018 3 10 6990 -0.233048

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use DataFrame.merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
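For the prev_month_mean_sale part (the mean of all sales in the previous month), a sketch along the same lines; note it leaves NaN at year boundaries (e.g. the first month of a year) rather than rolling back to the prior year:
import pandas as pd
# overall mean of sales per (year, month)
monthly_mean = df.groupby(['year', 'month'])['sales'].mean()
# look each row's previous (year, month) up in that table
idx = pd.MultiIndex.from_arrays([df['year'], df['month'] - 1])
df['prev_month_mean_sale'] = monthly_mean.reindex(idx).to_numpy()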

Merging a pandas column from one dataframe to another based on their indices

I have a dataframe, df_one, that looks like this, where video_id is the index:
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched | avg_time_watched |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| video_id | | | | | | |
| 5 | 17 | 12.000000 | 17 | 1 | 1 | 1.000000 |
| 10 | 22 | 10.000000 | 1 | 1 | 1 | 0.045455 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 | 1.000000 |
| 22 | 29 | 20.000000 | 5 | 1 | 1 | 0.172414 |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
And I have another dataframe, df_two, that looks like this, where video_id is also the index:
+----------+--------------+---------------+--------------+----------------+------------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched_yeterday |
+----------+--------------+---------------+--------------+----------------+------------------------+
| video_id | | | | | |
| 5 | 102 | 11.333333 | 73 | 6 | 6 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 |
| 16 | 44 | 2.000000 | 15 | 1 | 1 |
| 17 | 180 | 23.333333 | 53 | 6 | 6 |
| 18 | 40 | 1.000000 | 40 | 1 | 1 |
+----------+--------------+---------------+--------------+----------------+------------------------+
What I want to do is merge the count_watched_yeterday column from df_two to df_one based on the index of each.
I tried:
video_base = pd.merge(df_one, df_two['count_watched_yeterday'], how='left', on=[df_one.index, df_two.index])
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Actually I think the easiest thing to do here is to directly assign:
In [13]:
df_one['count_watched_yesterday'] = df_two['count_watched_yeterday']
df_one['count_watched_yesterday']
Out[13]:
video_id
5 6
10 NaN
15 1
22 NaN
Name: count_watched_yesterday, dtype: float64
This works because it aligns on the index values; where there are no matching values, NaN is assigned.
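An equivalent one-liner is an index-aligned join (a sketch; the new column keeps df_two's original spelling):
df_one = df_one.join(df_two['count_watched_yeterday'], how='left')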
