Split a column and combine rows where there are multiple data measures - python

I'm trying to use python to solve my data analysis problem.
I have a table like this:
+----------+-----+------+--------+-------------+--------------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | Value_column |
+----------+-----+------+--------+-------------+--------------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 |
| 11 | 1 | 2020 | Name1 | QTRMAX | 6 |
| 11 | 2 | 2020 | Name1 | QTRMAX | 9 |
| 11 | 3 | 2020 | Name1 | QTRMAX | 7 |
| 11 | 4 | 2020 | Name1 | QTRMAX | 10 |
+----------+-----+------+--------+-------------+--------------+
I want to rearrange Value_column so that it captures cases where there are multiple Qtr_Measure types for a unique ID and MEF_ID. Doing this reduces the overall size of the table, and I would like the Qtr_Measure types to become their own columns, as below:
+----------+-----+------+--------+-------------+--------+--------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | QTRAVG | QTRMAX |
+----------+-----+------+--------+-------------+--------+--------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 | 6 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 | 9 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 | 7 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 | 10 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 | |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 | |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 | |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 | |
+----------+-----+------+--------+-------------+--------+--------+
How can I do this with python?
Thank you

Use pivot_table with reset_index and rename_axis:
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1)
      )
print(piv)
   ID  QTR  Year MEF_ID  QTRAVG  QTRMAX
0  11    1  2020  Name1     5.0     6.0
1  11    2  2020  Name1     8.0     9.0
2  11    3  2020  Name1     6.0     7.0
3  11    4  2020  Name1     9.0    10.0
4  15    1  2020  Name2    67.0     NaN
5  15    2  2020  Name2    89.0     NaN
6  15    3  2020  Name2   100.0     NaN
7  15    4  2020  Name2   121.0     NaN
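For a self-contained check, here is a minimal sketch that rebuilds the sample frame from the question and applies the same pivot (all values are taken from the question; the QTRMAX cells for ID 15 are simply absent, which is what produces the NaN entries):
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    'ID': [11] * 4 + [15] * 4 + [11] * 4,
    'QTR': [1, 2, 3, 4] * 3,
    'Year': [2020] * 12,
    'MEF_ID': ['Name1'] * 4 + ['Name2'] * 4 + ['Name1'] * 4,
    'Qtr_Measure': ['QTRAVG'] * 8 + ['QTRMAX'] * 4,
    'Value_column': [5, 8, 6, 9, 67, 89, 100, 121, 6, 9, 7, 10],
})

piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1))
Because the missing QTRMAX values introduce NaN, the measure columns come back as floats; if nullable integers are preferred, they could be restored with .astype({'QTRAVG': 'Int64', 'QTRMAX': 'Int64'}).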

Related

cumsum bounded within a range (python, pandas)

I have a df where I'd like the cumsum to be bounded within a range of 0 to 6, with any total above 6 wrapping around (7 becomes 0, 8 becomes 1, and so on). The adj_cumsum column is what I'm trying to get. I've searched and found a couple of posts using loops; however, since mine is more straightforward, I'm wondering whether there is a less complicated or more current approach.
+----+-------+------+----------+----------------+--------+------------+
| | month | days | adj_days | adj_days_shift | cumsum | adj_cumsum |
+----+-------+------+----------+----------------+--------+------------+
| 0 | jan | 31 | 3 | 0 | 0 | 0 |
| 1 | feb | 28 | 0 | 3 | 3 | 3 |
| 2 | mar | 31 | 3 | 0 | 3 | 3 |
| 3 | apr | 30 | 2 | 3 | 6 | 6 |
| 4 | may | 31 | 3 | 2 | 8 | 1 |
| 5 | jun | 30 | 2 | 3 | 11 | 4 |
| 6 | jul | 31 | 3 | 2 | 13 | 6 |
| 7 | aug | 31 | 3 | 3 | 16 | 2 |
| 8 | sep | 30 | 2 | 3 | 19 | 5 |
| 9 | oct | 31 | 3 | 2 | 21 | 0 |
| 10 | nov | 30 | 2 | 3 | 24 | 3 |
| 11 | dec | 31 | 3 | 2 | 26 | 5 |
+----+-------+------+----------+----------------+--------+------------+
data = {"month": ['jan','feb','mar','apr',
'may','jun','jul','aug',
'sep','oct','nov','dec'],
"days": [31,28,31,30,31,30,31,31,30,31,30,31]}
df = pd.DataFrame(data)
df['adj_days'] = df['days'] - 28
df['adj_days_shift'] = df['adj_days'].shift(1)
df['cumsum'] = df.adj_days_shift.cumsum()
df.fillna(0, inplace=True)
Kindly advise
What you are looking for is called a modulo operation.
Use df['adj_cumsum'] = df['cumsum'].mod(7).
Intuition:
df["adj_cumsum"] = df["cumsum"].apply(lambda x:x%7)
Am I right?
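A minimal follow-up sketch, reusing the df built in the question above, to confirm the modulo result reproduces the expected adj_cumsum column:
df['adj_cumsum'] = df['cumsum'].mod(7).astype(int)  # wrap the running total back into the 0-6 range
print(df[['month', 'days', 'cumsum', 'adj_cumsum']])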

Filter rows based on condition in Pandas

I have a dataframe df_groups that contains sample number, group number and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe containing only the rows whose accuracy is above a threshold (e.g. 92),
so the results will be like this:
Table 1: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So the result should be based on the condition of being greater than or equal to a predefined accuracy (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
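A minimal sketch of how this might look end to end (column and variable names follow the question; df.query is shown only as an equivalent alternative):
predefined_accuracy = 92

# keep only rows at or above the threshold
above = df_groups.loc[df_groups['Accuracy'] >= predefined_accuracy]

# the same filter written with query syntax
above_q = df_groups.query('Accuracy >= @predefined_accuracy')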

determine chain of predecessors and successor from a list of first predecessor in python

I have a list like the following
+----+-------------------+
| id | first_predecessor |
+----+-------------------+
| 0 | 4 |
| 1 | 5 |
| 2 | 6 |
| 3 | 17,18 |
| 4 | 7 |
| 5 | 8 |
| 6 | 9 |
| 7 | 10,11,12 |
| 8 | 13,14,15 |
| 9 | 16 |
| 10 | Input |
| 11 | Input |
| 12 | Input |
| 13 | Input |
| 14 | Input |
| 15 | Input |
| 16 | Input |
| 17 | 19 |
| 18 | 20 |
| 19 | 21 |
| 20 | 22 |
| 21 | Input |
+----+-------------------+
One item can have multiple immediate incoming ids, as in the case of id=3, which is immediately preceded by id=17 and id=18.
I need Python code to determine this result by following the chain of predecessors both ways
(it is easiest to read the all_successors column to understand the logic; all_predecessors is the same logic backwards):
+----+-------------------+------------------+----------------+
| id | first_predecessor | all_predecessors | all_successors |
+----+-------------------+------------------+----------------+
| 0 | 4 | 4,7,10,11,12 | |
| 1 | 5 | 5,8,13,14,15 | |
| 2 | 6 | 6,9,16 | |
| 3 | 17,18 | 19,21,20,22 | |
| 4 | 7 | 7,10,11,12 | 0 |
| 5 | 8 | 8,13,14,15 | 1 |
| 6 | 9 | 9,16 | 2 |
| 7 | 10,11,12 | 10,11,12 | 0,4 |
| 8 | 13,14,15 | 13,14,15 | 1,5 |
| 9 | 16 | 16 | 2,6 |
| 10 | Input | | 0,4,7 |
| 11 | Input | | 0,4,7 |
| 12 | Input | | 0,4,7 |
| 13 | Input | | 1,5,8 |
| 14 | Input | | 1,5,8 |
| 15 | Input | | 1,5,8 |
| 16 | Input | | 2,6,9 |
| 17 | 19 | 19,21 | 3 |
| 18 | 20 | 20,22 | 3 |
| 19 | 21 | 21 | 3,17 |
| 20 | 22 | 22 | 3,18 |
| 21 | Input | | 3,17,19 |
| 22 | Input | | 3,18,20 |
+----+-------------------+------------------+----------------+
I need some kind of recursive solution, or should I use some graph package?
You can use networkx; the following functions find all predecessors and all successors:
ancestors(G, source): Returns all nodes having a path to source in G.
descendants(G, source): Returns all nodes reachable from source in G.
To run the following example, make sure you first replace 'Input' in the first_predecessor column with NaN.
import pandas as pd
import networkx as nx

df_ = df.copy()
# split multi-predecessor cells and give each predecessor its own row
df_['first_predecessor'] = df_['first_predecessor'].str.split(',')
df_ = df_.explode('first_predecessor')
df_['first_predecessor'] = df_['first_predecessor'].fillna(-1).astype(int)
# build a directed graph with edges pointing from predecessor to successor
G = nx.from_pandas_edgelist(df_, 'first_predecessor', 'id', create_using=nx.DiGraph())
G.remove_node(-1)  # drop the placeholder node created by the NaN (formerly 'Input') rows
df['all_predecessors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.ancestors(G, x)))))
df['all_successors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.descendants(G, x)))))
print(df)
    id  first_predecessor   all_predecessors  all_successors
0    0                   4       4,7,10,11,12
1    1                   5       5,8,13,14,15
2    2                   6             6,9,16
3    3               17,18  17,18,19,20,21,22
4    4                   7         7,10,11,12               0
5    5                   8         8,13,14,15               1
6    6                   9               9,16               2
7    7            10,11,12           10,11,12             0,4
8    8            13,14,15           13,14,15             1,5
9    9                  16                 16             2,6
10  10                 NaN                              0,4,7
11  11                 NaN                              0,4,7
12  12                 NaN                              0,4,7
13  13                 NaN                              1,5,8
14  14                 NaN                              1,5,8
15  15                 NaN                              1,5,8
16  16                 NaN                              2,6,9
17  17                  19              19,21               3
18  18                  20              20,22               3
19  19                  21                 21            3,17
20  20                  22                 22            3,18
21  21                 NaN                            3,17,19
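If you would rather avoid a graph package, a plain-Python recursive sketch of the same idea is below (first_pred here is an illustrative dict mapping each id to its list of immediate predecessor ids; successors can be computed the same way on a reversed mapping):
def walk_predecessors(node, first_pred, acc=None):
    # recursively collect every id reachable by following predecessor links
    if acc is None:
        acc = set()
    for p in first_pred.get(node, []):
        if p not in acc:
            acc.add(p)
            walk_predecessors(p, first_pred, acc)
    return acc

# illustrative subset of the table; 'Input' rows become empty lists
first_pred = {0: [4], 4: [7], 7: [10, 11, 12], 10: [], 11: [], 12: []}
print(sorted(walk_predecessors(0, first_pred)))   # [4, 7, 10, 11, 12]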

How to calculate percentage change on this simple data frame?

I have data that looks like this:
+------+---------+------+-------+
| Year | Cluster | AREA | COUNT |
+------+---------+------+-------+
| 2016 | 0 | 10 | 2952 |
| 2016 | 1 | 10 | 2556 |
| 2016 | 2 | 10 | 8867 |
| 2016 | 3 | 10 | 9786 |
| 2017 | 0 | 10 | 2470 |
| 2017 | 1 | 10 | 3729 |
| 2017 | 2 | 10 | 8825 |
| 2017 | 3 | 10 | 9114 |
| 2018 | 0 | 10 | 1313 |
| 2018 | 1 | 10 | 3564 |
| 2018 | 2 | 10 | 7245 |
| 2018 | 3 | 10 | 6990 |
+------+---------+------+-------+
I have to get the percentage changes for each cluster compared to the previous year, e.g.
+------+---------+-----------+-------+----------------+
| Year | Cluster | AREA | COUNT | Percent Change |
+------+---------+-----------+-------+----------------+
| 2016 | 0 | 10 | 2952 | NaN |
| 2017 | 0 | 10 | 2470 | -16.33% |
| 2018 | 0 | 10 | 1313 | -46.84% |
| 2016 | 1 | 10 | 2556 | NaN |
| 2017 | 1 | 10 | 3729 | 45.89% |
| 2018 | 1 | 10 | 3564 | -4.42% |
| 2016 | 2 | 10 | 8867 | NaN |
| 2017 | 2 | 10 | 8825 | -0.47% |
| 2018 | 2 | 10 | 7245 | -17.90% |
| 2016 | 3 | 10 | 9786 | NaN |
| 2017 | 3 | 10 | 9114 | -6.87% |
| 2018 | 3 | 10 | 6990 | -23.30% |
+------+---------+-----------+-------+----------------+
Is there an easy way to do this?
I've tried a few things; the attempt below seemed to make the most sense, but it returns NaN for every pct_change.
df['pct_change'] = df.groupby(['Cluster','Year'])['COUNT '].pct_change()
+------+---------+------+------------+------------+
| Year | Cluster | AREA | Count | pct_change |
+------+---------+------+------------+------------+
| 2016 | 0 | 10 | 295200.00% | NaN |
| 2016 | 1 | 10 | 255600.00% | NaN |
| 2016 | 2 | 10 | 886700.00% | NaN |
| 2016 | 3 | 10 | 978600.00% | NaN |
| 2017 | 0 | 10 | 247000.00% | NaN |
| 2017 | 1 | 10 | 372900.00% | NaN |
| 2017 | 2 | 10 | 882500.00% | NaN |
| 2017 | 3 | 10 | 911400.00% | NaN |
| 2018 | 0 | 10 | 131300.00% | NaN |
| 2018 | 1 | 10 | 356400.00% | NaN |
| 2018 | 2 | 10 | 724500.00% | NaN |
| 2018 | 3 | 10 | 699000.00% | NaN |
+------+---------+------+------------+------------+
Basically, I simply want the function to compare the year-on-year change for each cluster.
Your attempt groups by both Cluster and Year, so each group contains a single row and pct_change has nothing to compare against. Group by Cluster alone instead:
df['pct_change'] = df.groupby('Cluster')['COUNT'].pct_change()
df = df.sort_values('Cluster', ascending=True)
Another method, going old school with transform:
df['p'] = df.groupby('cluster')['count'].transform(lambda x: (x-x.shift())/x.shift())
df = df.sort_values(by='cluster')
print(df)
    year  cluster  area  count         p
0   2016        0    10   2952       NaN
4   2017        0    10   2470 -0.163279
8   2018        0    10   1313 -0.468421
1   2016        1    10   2556       NaN
5   2017        1    10   3729  0.458920
9   2018        1    10   3564 -0.044248
2   2016        2    10   8867       NaN
6   2017        2    10   8825 -0.004737
10  2018        2    10   7245 -0.179037
3   2016        3    10   9786       NaN
7   2017        3    10   9114 -0.068670
11  2018        3    10   6990 -0.233048
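If the formatted Percent Change column from the desired output is also needed, a small follow-up sketch (assuming pandas is imported as pd and the original column names Year/Cluster/COUNT) might be:
df['pct_change'] = df.groupby('Cluster')['COUNT'].pct_change()
# render as a percentage string such as -16.33%, leaving NaN for each cluster's first year
df['Percent Change'] = df['pct_change'].map(lambda x: f'{x:.2%}' if pd.notna(x) else x)
df = df.sort_values(['Cluster', 'Year'])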

Merging pandas column from dataframe to another dataframe based on their indices

I have a data frame, df_one, that looks like this, where video_id is the index:
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched | avg_time_watched |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| video_id | | | | | | |
| 5 | 17 | 12.000000 | 17 | 1 | 1 | 1.000000 |
| 10 | 22 | 10.000000 | 1 | 1 | 1 | 0.045455 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 | 1.000000 |
| 22 | 29 | 20.000000 | 5 | 1 | 1 | 0.172414 |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
And I have another dataframe, df_two, that looks like this, where video_id is also the index:
+----------+--------------+---------------+--------------+----------------+------------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched_yeterday |
+----------+--------------+---------------+--------------+----------------+------------------------+
| video_id | | | | | |
| 5 | 102 | 11.333333 | 73 | 6 | 6 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 |
| 16 | 44 | 2.000000 | 15 | 1 | 1 |
| 17 | 180 | 23.333333 | 53 | 6 | 6 |
| 18 | 40 | 1.000000 | 40 | 1 | 1 |
+----------+--------------+---------------+--------------+----------------+------------------------+
What I want to do is merge the count_watched_yeterday column from df_two to df_one based on the index of each.
I tried:
video_base = pd.merge(df_one, df_two['count_watched_yeterday'], how='left', on=[df_one.index, df_two.index])
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Actually I think the easiest thing to do here is to directly assign:
In [13]:
df_one['count_watched_yesterday'] = df_two['count_watched_yeterday']
df_one['count_watched_yesterday']
Out[13]:
video_id
5       6
10    NaN
15      1
22    NaN
Name: count_watched_yesterday, dtype: float64
This works because assignment aligns on the index values; wherever there is no matching index value, NaN is assigned.
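If an explicit merge is still preferred, a sketch of the index-based variant (a left join that keeps df_one's rows) could be:
# join aligns on the index by default
video_base = df_one.join(df_two['count_watched_yeterday'], how='left')

# equivalent with merge, joining on the indices of both frames
video_base = pd.merge(df_one, df_two[['count_watched_yeterday']],
                      how='left', left_index=True, right_index=True)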
