Pandasql with conditions - python

I have two dataframes:
The first one has student information. I will call it df1:
user_id | plan | subplan | matrix_code | student_semester
102532 | GADMSSP | GSP10 | 1501 | 8
106040 | GRINTSP | | 1901 | 4
106114 | GCSOSSULA | | 1901 | 4
106504 | GCSOSSP | | 1902 | 3
106664 | GCINESP | | 1901 | 4
The second one has the elective requirements for an institution. I will call it df2:
plan | subplan | matrix_code | semester | credits| cumulative_credits
GADMSSP | | 1501 | 5 | 4 | 4
GADMSSP | | 1501 | 6 | 4 | 8
GADMSSP | | 1501 | 7 | 4 | 12
GADMSSP | | 1501 | 8 | 0 | 12
GRINTSP | | 1901 | 7 | 2 | 2
GRINTSP | | 1901 | 8 | 0 | 2
GCSOSSULA | | 1901 | 3 | 4 | 4
GCSOSSULA | | 1901 | 4 | 0 | 4
GCSOSSULA | | 1901 | 5 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 5 | 4 | 8
GCSOSSULA | | 1901 | 6 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 6 | 0 | 8
GCSOSSULA | | 1901 | 7 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 7 | 0 | 8
GCSOSSULA | | 1901 | 8 | 0 | 4
GCSOSSULA | GSUL5 | 1901 | 8 | 0 | 8
GCSOSSP | | 1902 | 5 | 4 | 4
GCSOSSP | | 1902 | 6 | 4 | 8
GCSOSSP | | 1902 | 7 | 4 | 12
GCSOSSP | | 1902 | 8 | 0 | 12
GCINESP | | 1901 | 2 | 4 | 4
GCINESP | | 1901 | 3 | 4 | 8
GCINESP | | 1901 | 4 | 4 | 12
GCINESP | | 1901 | 5 | 4 | 16
GCINESP | | 1901 | 6 | 4 | 24
GCINESP | | 1901 | 7 | 4 | 32
GCINESP | | 1901 | 8 | 4 | 40
So I have to merge the dataframes considering some conditions:
plan and matrix_code must be the same for df1 and df2.
df1.subplan is either the same as df2.subplan, or df2.subplan can be null. So user_id 102532 in line 1 of df1 will get the requirements where df2.subplan is null, since there are no subplan-specific requirements for this plan and matrix_code.
Get student_semester + 1, but treat the maximum df2.semester as the cap on student_semester. So user_id 102532 in line 1 must remain in semester 8: I cannot add +1 semester there, but I would like to flag that this user did not reach the requirements by the last semester (a rough pandas sketch of the capping follows this list).
I am only interested in cumulative_credits.
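Roughly, condition 3 in pandas would be something like this (a sketch with hypothetical names: merged stands for the joined frame, and max_semester for df2's maximum semester per plan and matrix_code, carried along in the merge):
# hypothetical names: `merged` is the joined frame, `max_semester` its per-plan cap
merged['next_semester'] = (merged['student_semester'] + 1).clip(upper=merged['max_semester'])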
For these two dfs the result should be something like this:
user_id | plan | subplan | matrix_code | semester | student_semester | cumulative_credits
102532 | GADMSSP | GSP10 | 1501 | 8 | 9 | 12
106040 | GRINTSP | | 1901 | 5 | 4 | 0
106114 | GCSOSSULA | | 1901 | 5 | 4 | 4
106504 | GCSOSSP | | 1902 | 4 | 3 | 0
106664 | GCINESP | | 1901 | 5 | 4 | 16
But if there is no feasible way to include the students with 0 cumulative_credits, the result should be:
user_id | plan | subplan | matrix_code | semester | student_semester | cumulative_credits
102532 | GADMSSP | GSP10 | 1501 | 8 | 9 | 12
106114 | GCSOSSULA | | 1901 | 5 | 4 | 4
106664 | GCINESP | | 1901 | 5 | 4 | 16
What I did until now is the following:
pip install -U pandasql
import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df2 = df2.groupby(['plan', 'subplan', 'matrix_code', 'semester']).cumulative_credits.max()
df2 = df2.to_frame()
df2 = df2.reset_index()
electives = """
SELECT user_id
,a.plan
,a.subplan as "student_subplan"
,a.matrix_code
,a.student_semester
,b.subplan as "matrix_subplan"
,b.semester
,cumulative_credits
FROM df1 a
LEFT JOIN df2 b
ON a.plan = b.plan
AND a.matrix_code = b.matrix_code
WHERE (b.subplan = '' OR a.subplan = b.subplan)
"""
electives = pysqldf(electives)
Then I was trying to implement the 3rd condition, but I have no clue about the right way to do this. I think I could use a lambda, but I am not sure how.
df_s['semester_x'] = df_s['student_semester'] +1 | df_s['student_semester'] == df_s['semester'].max()
Also, if there is a better way to do the previous condition steps using a merge with a condition, that would be nice.
EDIT - SOLUTION:
I used part of Parfait's solution. I just added conditional logic to get the cumulative credits of the student's next semester instead of the max cumulative credits of the matrix code.
Here is what I've done:
First part - Parfait's solution:
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan': 'student_subplan', 'subplan_': 'matrix_subplan',
                  'semester': 'matrix_semester'}, axis='columns')
      )
Second part:
y = """
with a as
(
SELECT DISTINCT plan
,CASE
WHEN plan LIKE '%SULB%' OR plano LIKE '%SULC%' THEN 10
WHEN plan LIKE '%SULD%' OR plano LIKE '%SULE%' THEN 12
ELSE 8
END as "semester_max"
FROM agg
)
SELECT DISTINCT
user_id
,student_semester
,plan
,student_subplan
,matrix_code
,matrix_subplan
,cumulative_credits
,matrix_semester
,semester_max
,CASE
WHEN student_semester < semester_max THEN (student_semester)+1
WHEN student_semester = semester_max THEN student_semester
END as "next_semester"
FROM
(
SELECT DISTINCT
user_id
,student_semester
,b.plan
,student_subplan
,matrix_code
,matrix_subplan
,cumulative_credits
,matrix_semester
,semester_max
FROM agg b
INNER JOIN a ON b.plano = a.plano
) x
WHERE matrix_semester = next_semester
"""
z = pysqldf(x)

Consider adding a CASE statement in the SQL query:
SELECT d1.user_id
, d1.plan
, d1.subplan AS student_subplan
, d1.matrix_code
, d1.student_semester
, d2.subplan AS matrix_subplan
, CASE
WHEN d1.student_semester = MAX(d2.semester)
THEN d1.student_semester
ELSE d1.student_semester + 1
END AS semester
, MAX(d2.cumulative_credits) AS cumulative_credits
FROM df1 d1
LEFT JOIN df2 d2
ON d1.plan = d2.plan
AND d1.matrix_code = d2.matrix_code
WHERE (d2.subplan IS NULL OR d1.subplan = d2.subplan)
GROUP BY d1.user_id
, d1.plan
, d1.subplan
, d1.matrix_code
, d1.student_semester
, d2.subplan;
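It can be run with the same pysqldf helper defined in the question:
# assuming the CASE query above is saved in the string `q`
electives = pysqldf(q)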
In Pandas, the translation would use merge + groupby + Series.where for the CASE conditional logic:
# MERGE
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan': 'student_subplan', 'subplan_': 'matrix_subplan'}, axis='columns')
      )
# AGGREGATION
agg = (agg.groupby(['user_id', 'plan', 'student_subplan', 'matrix_code',
                    'student_semester', 'matrix_subplan'], as_index=False)
          .agg({'semester': 'max', 'cumulative_credits': 'max'})
      )
# CONDITIONAL LOGIC
agg['semester'] = agg['student_semester'].where(agg['semester'] == agg['student_semester'],
                                                agg['student_semester'].add(1))
agg
#    user_id       plan student_subplan  matrix_code  student_semester matrix_subplan  semester  cumulative_credits
# 0   102532    GADMSSP           GSP10         1501                 8                        8                  12
# 1   106040    GRINTSP                         1901                 4                        5                   2
# 2   106114  GCSOSSULA                         1901                 4                        5                   4
# 3   106504    GCSOSSP                         1902                 3                        4                  12
# 4   106664    GCINESP                         1901                 4                        5                  40

Related

Modify a column according to another dataframe's column in Python

I have two dataframes. One is the master dataframe and the other df is used to fill my master dataframe.
What I want is to fill one column according to another column, without altering the other columns.
This is example of master df
| id | Purch. order | cost | size | code |
| 1 | G918282 | 8283 | large| hchs |
| 2 | EE18282 | 1283 | small| ueus |
| 3 | DD08282 | 5583 | large| kdks |
| 4 | GU88912 | 8232 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
This is example of the another df
| id | Purch. order | cost |
| 1 | G918282 | 7728 |
| 2 | EE18282 | 2211 |
| 3 | DD08282 | 5321 |
| 4 | GU88912 | 4778 |
| 5 | NaN | 4283 |
| 6 | Nan | 9993 |
| 7 | Nan | 3442 |
This is the result I'd like
| id | Purch. order | cost | size | code |
| 1 | G918282 | 7728 | large| hchs |
| 2 | EE18282 | 2211 | small| ueus |
| 3 | DD08282 | 5321 | large| kdks |
| 4 | GU88912 | 4778 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
Only the cost column should be modified, and only where the secondary df coincides on Purch. order and it is not NaN.
I hope you can help me... and I'm sorry if my English is basic; it is not my mother language. Thanks a lot.
Let's try update, which works along indexes. By default overwrite is set to True, which will overwrite overlapping values in your target dataframe; use overwrite=False if you only want to change NA values.
master_df = master_df.set_index(['id','Purch. order'])
another_df = another_df.dropna(subset=['Purch. order']).set_index(['id','Purch. order'])
master_df.update(another_df)
print(master_df)
cost size code
id Purch. order
1 G918282 7728.0 large hchs
2 EE18282 2211.0 small ueus
3 DD08282 5321.0 large kdks
4 GU88912 4778.0 large jdhd
5 NaN 1283.0 large jdjd
6 Nan 5583.0 large qqas
7 Nan 8232.0 large djjs
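Since update modifies master_df in place, if you want id and Purch. order back as regular columns afterwards, just reset the index:
master_df = master_df.reset_index()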
You can do it with a merge followed by updating the cost column based on where the NaNs are:
final_df = df1.merge(df2[~df2["Purch. order"].isna()], on = 'Purch. order', how="left")
final_df.loc[~final_df['Purch. order'].isnull(), "cost"] = final_df['cost_y'] # not nan
final_df.loc[final_df['Purch. order'].isnull(), "cost"] = final_df['cost_x'] # nan
final_df = final_df.drop(['id_y','cost_x','cost_y'],axis=1)
Output:
id_x Purch. order size code cost
0 1 G918282 large hchs 7728.0
1 2 EE18282 small ueus 2211.0
2 3 DD08282 large kdks 5321.0
3 4 GU88912 large jdhd 4778.0
4 5 NaN large jdjd 1283.0
5 6 NaN large qqas 5583.0
6 7 NaN large djjs 8232.0

Split a column and combine rows where there are multiple data measures

I'm trying to use Python to solve my data analysis problem.
I have a table like this:
+----------+-----+------+--------+-------------+--------------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | Value_column |
+----------+-----+------+--------+-------------+--------------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 |
| 11 | 1 | 2020 | Name1 | QTRMAX | 6 |
| 11 | 2 | 2020 | Name1 | QTRMAX | 9 |
| 11 | 3 | 2020 | Name1 | QTRMAX | 7 |
| 11 | 4 | 2020 | Name1 | QTRMAX | 10 |
+----------+-----+------+--------+-------------+--------------+
I want to arrange Value_column in a way that captures when there are multiple Qtr_Measures for unique IDs and MEF_IDs. Doing this reduces the overall size of the table, and I would like columns for each Qtr_Measure type, as below:
+----------+-----+------+--------+-------------+--------+--------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | QTRAVG | QTRMAX |
+----------+-----+------+--------+-------------+--------+--------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 | 6 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 | 9 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 | 7 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 | 10 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 | |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 | |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 | |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 | |
+----------+-----+------+--------+-------------+--------+--------+
How can I do this with Python?
Thank you.
Use pivot_table with reset_index and rename_axis:
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1)
      )
print(piv)
ID QTR Year MEF_ID QTRAVG QTRMAX
0 11 1 2020 Name1 5.0 6.0
1 11 2 2020 Name1 8.0 9.0
2 11 3 2020 Name1 6.0 7.0
3 11 4 2020 Name1 9.0 10.0
4 15 1 2020 Name2 67.0 NaN
5 15 2 2020 Name2 89.0 NaN
6 15 3 2020 Name2 100.0 NaN
7 15 4 2020 Name2 121.0 NaN
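If you prefer blanks instead of NaN, as in the desired output, one option is to fill after pivoting (note this converts the numeric measure columns to object dtype):
print(piv.fillna(''))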

How to calculate percentage change on this simple data frame?

I have data that looks like this:
+------+---------+------+-------+
| Year | Cluster | AREA | COUNT |
+------+---------+------+-------+
| 2016 | 0 | 10 | 2952 |
| 2016 | 1 | 10 | 2556 |
| 2016 | 2 | 10 | 8867 |
| 2016 | 3 | 10 | 9786 |
| 2017 | 0 | 10 | 2470 |
| 2017 | 1 | 10 | 3729 |
| 2017 | 2 | 10 | 8825 |
| 2017 | 3 | 10 | 9114 |
| 2018 | 0 | 10 | 1313 |
| 2018 | 1 | 10 | 3564 |
| 2018 | 2 | 10 | 7245 |
| 2018 | 3 | 10 | 6990 |
+------+---------+------+-------+
I have to get the percentage changes for each cluster compared to the previous year, e.g.
+------+---------+-----------+-------+----------------+
| Year | Cluster | AREA | COUNT | Percent Change |
+------+---------+-----------+-------+----------------+
| 2016 | 0 | 10 | 2952 | NaN |
| 2017 | 0 | 10 | 2470 | -16.33% |
| 2018 | 0 | 10 | 1313 | -46.84% |
| 2016 | 1 | 10 | 2556 | NaN |
| 2017 | 1 | 10 | 3729 | 45.89% |
| 2018 | 1 | 10 | 3564 | -4.42% |
| 2016 | 2 | 10 | 8867 | NaN |
| 2017 | 2 | 10 | 8825 | -0.47% |
| 2018 | 2 | 10 | 7245 | -17.90% |
| 2016 | 3 | 10 | 9786 | NaN |
| 2017 | 3 | 10 | 9114 | -6.87% |
| 2018 | 3 | 10 | 6990 | -23.30% |
+------+---------+-----------+-------+----------------+
Is there an easy way to do this?
I've tried a few things below; this seemed to make the most sense, but it returns NaN for each pct_change.
df['pct_change'] = df.groupby(['Cluster','Year'])['COUNT '].pct_change()
+------+---------+------+------------+------------+
| Year | Cluster | AREA | Count | pct_change |
+------+---------+------+------------+------------+
| 2016 | 0 | 10 | 295200.00% | NaN |
| 2016 | 1 | 10 | 255600.00% | NaN |
| 2016 | 2 | 10 | 886700.00% | NaN |
| 2016 | 3 | 10 | 978600.00% | NaN |
| 2017 | 0 | 10 | 247000.00% | NaN |
| 2017 | 1 | 10 | 372900.00% | NaN |
| 2017 | 2 | 10 | 882500.00% | NaN |
| 2017 | 3 | 10 | 911400.00% | NaN |
| 2018 | 0 | 10 | 131300.00% | NaN |
| 2018 | 1 | 10 | 356400.00% | NaN |
| 2018 | 2 | 10 | 724500.00% | NaN |
| 2018 | 3 | 10 | 699000.00% | NaN |
+------+---------+------+------------+------------+
Basically, I simply want the function to compare the year-on-year change for each cluster.
df['pct_change'] = df.groupby(['Cluster'])['Count'].pct_change()
df.sort_values('Cluster', axis = 0, ascending = True)
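To get it formatted like the desired output, the fraction can be scaled and rendered as strings (a sketch; it assumes the column is named COUNT as in the question, and it makes the new column non-numeric):
pct = df.groupby('Cluster')['COUNT'].pct_change().mul(100)
df['Percent Change'] = pct.map('{:.2f}%'.format, na_action='ignore')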
Another method, going old school with transform:
df['p'] = df.groupby('cluster')['count'].transform(lambda x: (x-x.shift())/x.shift())
df = df.sort_values(by='cluster')
print(df)
year cluster area count p
0 2016 0 10 2952 NaN
4 2017 0 10 2470 -0.163279
8 2018 0 10 1313 -0.468421
1 2016 1 10 2556 NaN
5 2017 1 10 3729 0.458920
9 2018 1 10 3564 -0.044248
2 2016 2 10 8867 NaN
6 2017 2 10 8825 -0.004737
10 2018 2 10 7245 -0.179037
3 2016 3 10 9786 NaN
7 2017 3 10 9114 -0.068670
11 2018 3 10 6990 -0.233048

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use DataFrame.merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
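For the prev_month_mean_sale part (the mean of all sales in the previous month), a sketch along the same lines; note it leaves NaN at year boundaries (e.g. the first month of a year) rather than rolling back to the prior year:
import pandas as pd
# overall mean of sales per (year, month)
monthly_mean = df.groupby(['year', 'month'])['sales'].mean()
# look each row's previous (year, month) up in that table
idx = pd.MultiIndex.from_arrays([df['year'], df['month'] - 1])
df['prev_month_mean_sale'] = monthly_mean.reindex(idx).to_numpy()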

Merging a pandas column from one dataframe to another based on their indices

I have a dataframe, df_one, that looks like this, where video_id is the index:
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched | avg_time_watched |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| video_id | | | | | | |
| 5 | 17 | 12.000000 | 17 | 1 | 1 | 1.000000 |
| 10 | 22 | 10.000000 | 1 | 1 | 1 | 0.045455 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 | 1.000000 |
| 22 | 29 | 20.000000 | 5 | 1 | 1 | 0.172414 |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
And I have another dataframe, df_two, that looks like this, where video_id is also the index:
+----------+--------------+---------------+--------------+----------------+------------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched_yeterday |
+----------+--------------+---------------+--------------+----------------+------------------------+
| video_id | | | | | |
| 5 | 102 | 11.333333 | 73 | 6 | 6 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 |
| 16 | 44 | 2.000000 | 15 | 1 | 1 |
| 17 | 180 | 23.333333 | 53 | 6 | 6 |
| 18 | 40 | 1.000000 | 40 | 1 | 1 |
+----------+--------------+---------------+--------------+----------------+------------------------+
What I want to do is merge the count_watched_yeterday column from df_two to df_one based on the index of each.
I tried:
video_base = pd.merge(df_one, df_two['count_watched_yeterday'], how='left', on=[df_one.index, df_two.index])
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Actually I think the easiest thing to do here is to directly assign:
In [13]:
df_one['count_watched_yesterday'] = df_two['count_watched_yeterday']
df_one['count_watched_yesterday']
Out[13]:
video_id
5 6
10 NaN
15 1
22 NaN
Name: count_watched_yesterday, dtype: float64
This works because it aligns on the index values; where there are no matching values, NaN is assigned.
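An equivalent one-liner is an index-aligned join (a sketch; the new column keeps df_two's original spelling):
df_one = df_one.join(df_two['count_watched_yeterday'], how='left')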
