Multiindex Roll-up Indicator - python

How do you roll up rows by Date & ID and create indicators? Given this input:
+--------+-----+------+-------------+
| Date | ID | Flag | Action Type |
+--------+-----+------+-------------+
| 201712 | 123 | - | Delete |
| 201712 | 456 | + | Add |
| 201712 | 123 | + | Add |
| 201801 | 123 | + | Change |
+--------+-----+------+-------------+
with an output of:
+--------+-----+------+--------------+
| Date | ID | Flag | Action Type |
+--------+-----+------+--------------+
| 201712 | 123 | * | Add & Delete |
| 201712 | 456 | + | Add |
| 201801 | 123 | + | Added Chg |
+--------+-----+------+--------------+

You can use groupby and join:
s=df.groupby(['Date','ID'],as_index=False).agg('&'.join)
s.Flag.str.len().gt(1)
Out[285]:
0 True
1 False
2 False
Name: Flag, dtype: bool
s.loc[s.Flag.str.len().gt(1),'Flag']='*'
s
Out[287]:
Date ID Flag Actiontype
0 201712 123 * Delete&Add
1 201712 456 + Add
2 201801 123 + Change
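For a self-contained variant, here is a minimal sketch that rebuilds the sample table and rolls it up with named aggregation. The rule "use `*` when a group mixes flags" and the alphabetical ordering of the joined action types are assumptions made to match the "Add & Delete" row in the question; the "Added Chg" label is not reproduced because the question does not say how it is derived.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": [201712, 201712, 201712, 201801],
    "ID": [123, 456, 123, 123],
    "Flag": ["-", "+", "+", "+"],
    "Action Type": ["Delete", "Add", "Add", "Change"],
})

out = (
    df.groupby(["Date", "ID"], as_index=False)
      .agg(**{
          # Assumption: a group with more than one distinct flag becomes "*".
          "Flag": ("Flag", lambda s: s.iloc[0] if s.nunique() == 1 else "*"),
          # Assumption: distinct action types are joined in alphabetical order.
          "Action Type": ("Action Type", lambda s: " & ".join(sorted(s.unique()))),
      })
)
print(out)
```

The dictionary unpacking is only there because "Action Type" contains a space and cannot be written as a keyword argument.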

Related

merge / concat two dataframes on column values and drop subsequent rows from the resulting dataframe

I have 2 data frames
df1
| email | ack |
| -------- | -------------- |
| first#abc.com | 1 |
| second#abc.com | 1 |
| third#abc.com | 1 |
| fourth#abc.com | 1 |
| fifth#abc.com | 1 |
| sixth#abc.com | 1 |
| seventh#abc.com | 1 |
| eight#abc.com | 1 |
df2
| email | ack |name| date|
| -------- | -------------- |-------------- |-------------- |
|first#abc.com | 0 |abc | 01/01/2022 |
| second#abc.com | 0 |xyz | 01/02/2022 |
| third#abc.com | 0 |mno | 01/03/2022 |
| fourth#abc.com | 0 |pqr | 01/04/2022 |
| fifth#abc.com | 0 |adam| 01/05/2022 |
| sixth#abc.com | 0 |eve |01/06/2022|
| seventh#abc.com | 0 |mary|01/07/2022|
| eight#abc.com | 0 |john|01/08/2022|
| nine#abc.com | 0 |kate|01/09/2022|
| ten#abc.com | 0 |matt|01/10/2022|
How do I merge the above two dataframes so that the values in the 'ack' column of df2 are replaced wherever applicable, i.e., matched on email address?
result
df2
| email | ack |name| date|
| -------- | -------------- |-------------- |-------------- |
|first#abc.com | 1 |abc|01/01/2022|
| second#abc.com | 1 |xyz|01/02/2022|
| third#abc.com | 1 |mno|01/03/2022|
| fourth#abc.com | 1 |pqr|01/04/2022|
| fifth#abc.com | 1 |adam|01/05/2022|
| sixth#abc.com | 1 |eve|01/06/2022|
| seventh#abc.com | 1 |mary|01/07/2022|
| eight#abc.com | 1 |john|01/08/2022|
| nine#abc.com | 0 |kate|01/09/2022|
| ten#abc.com | 0 |matt|01/10/2022|
I tried a left join and an outer join, but both appended rows to the existing rows.
Assuming df1['ack'] is always 1, the following code should work:
df2.loc[df2['email'].isin(df1['email']), 'ack'] = 1
In English:
If df2['email'] is found in df1['email'], set df2['ack'] = 1
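If df1['ack'] is not guaranteed to be 1, a lookup with `Series.map` is one possible alternative. This is only a sketch on a shortened copy of the sample data; `lookup` is a hypothetical helper name, and dropping duplicate emails from df1 is an assumption about how repeats should be handled.

```python
import pandas as pd

# Shortened versions of the frames from the question.
df1 = pd.DataFrame({
    "email": ["first#abc.com", "second#abc.com"],
    "ack": [1, 1],
})
df2 = pd.DataFrame({
    "email": ["first#abc.com", "second#abc.com", "nine#abc.com"],
    "ack": [0, 0, 0],
    "name": ["abc", "xyz", "kate"],
})

# Build an email -> ack lookup (assumption: keep the first row per email).
lookup = df1.drop_duplicates("email").set_index("email")["ack"]

# Take the ack value from df1 where the email matches; keep df2's value otherwise.
df2["ack"] = df2["email"].map(lookup).fillna(df2["ack"]).astype(int)
print(df2)
```

Unlike a merge, this never changes the number of rows in df2.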

Fetch values corresponding to id of each row python

Is it possible to fetch a column containing the values that correspond to an id column?
Example:
df1
| ID | Value | Salary |
|:--:|:-----:|:------:|
| 1 | amr | 34 |
| 1 | ith | 67 |
| 2 | oaa | 45 |
| 1 | eea | 78 |
| 3 | anik | 56 |
| 4 | mmkk | 99 |
| 5 | sh_s | 98 |
| 5 | ahhi | 77 |
df2
| ID | Dept |
|:--:|:----:|
| 1 | hrs |
| 1 | cse |
| 2 | me |
| 1 | ece |
| 3 | eee |
Expected Output
| ID | Dept | Value |
|:--:|:----:|------:|
| 1 | hrs | amr |
| 1 | cse | ith |
| 2 | me | oaa |
| 1 | ece | eea |
| 3 | eee | anik |
I want to fetch each value in the 'Value' column that corresponds to the values in df2's ID column, and create a column containing those values in df2. The two dfs do not have the same number of rows. I have tried this, but it did not work.
IIUC, you can try df.merge after assigning a helper column built with groupby + cumcount on ID:
out = (df1.assign(k=df1.groupby("ID").cumcount())
       .merge(df2.assign(k=df2.groupby("ID").cumcount()), on=['ID', 'k'])
       .drop(columns="k"))  # positional axis (drop("k", 1)) no longer works in pandas 2.x
print(out)
ID Value Dept
0 1 amr hrs
1 1 ith cse
2 2 oaa me
3 1 eea ece
4 3 anik eee
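To try the helper-column idea end to end, here is a runnable sketch on the sample data. It merges from df2's side with `how="left"` so that df2's row order and length are kept, which is an assumption about the desired result; `k`, `left` and `right` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 1, 2, 1, 3, 4, 5, 5],
    "Value": ["amr", "ith", "oaa", "eea", "anik", "mmkk", "sh_s", "ahhi"],
    "Salary": [34, 67, 45, 78, 56, 99, 98, 77],
})
df2 = pd.DataFrame({
    "ID": [1, 1, 2, 1, 3],
    "Dept": ["hrs", "cse", "me", "ece", "eee"],
})

# k numbers the occurrences of each ID (0, 1, 2, ...) within its own frame.
left = df2.assign(k=df2.groupby("ID").cumcount())
right = df1.assign(k=df1.groupby("ID").cumcount())[["ID", "k", "Value"]]

# The n-th occurrence of an ID in df2 is paired with the n-th occurrence in df1.
out = left.merge(right, on=["ID", "k"], how="left").drop(columns="k")
print(out)
```

This pairing relies on both frames listing the repeated IDs in the same relative order.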
Is this what you want to do?
df1.merge(df2, how='inner',on ='ID')
Since you have duplicated IDs in both dfs, but these are ordered, try:
df1 = df1.drop(columns="ID")
df3 = df2.merge(df1, left_index=True, right_index=True)

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and fill it with values from a different DataFrame, after first matching on the columns that both DataFrames have. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 with the values of the dates stored in the dates column in df1 after matching the teams and the weeks columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-21 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
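One possible approach, sketched here rather than taken from an answer in the thread, is a left merge on team and week against a de-duplicated lookup built from df1. It assumes each (team, week) pair has a single date in df1; `schedule` is a hypothetical name, and only some of df2's columns are reproduced.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["maho", "went", "lock", "dris", "went", "maho"],
    "team": ["KC", "PHI", "DEN", "DEN", "PHI", "KC"],
    "week": [1, 1, 1, 2, 2, 2],
    "dates": ["2020-09-10", "2020-09-13", "2020-09-14",
              "2020-09-20", "2020-09-20", "2020-09-21"],
})
df2 = pd.DataFrame({
    "name": ["ertz", "fant", "kelc", "fant", "kelc", "ertz", "goed"],
    "team": ["PHI", "DEN", "KC", "DEN", "KC", "PHI", "PHI"],
    "week": [1, 2, 2, 1, 1, 2, 2],
    "catches": [5, 6, 8, 3, 6, 7, 2],
})

# One row per (team, week); assumption: the first date seen is the one to keep.
schedule = df1[["team", "week", "dates"]].drop_duplicates(["team", "week"])

# A left merge keeps every df2 row, in order, and adds the matching date.
out = df2.merge(schedule, on=["team", "week"], how="left")
print(out)
```

A merge like this is vectorized, so it should scale to larger frames much better than nested loops.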

Transform a pandas DataFrame into a DataFrame with MultiIndex columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B", because that matches your expected output and I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
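For reference, a small runnable sketch of the same idea on two rows of rounded sample data. It assumes the "amount_B" spelling; the variable `a_block` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price_A": [0.65, 0.40],
    "amount_A": [0.94, 0.60],
    "price_B": [0.82, 0.19],
    "amount_B": [0.73, 0.27],
})
df.index.name = "id"

# "price_A" -> ("price", "A"); swaplevel then puts the letter on the top level.
df.columns = df.columns.str.split("_", expand=True).swaplevel()

# Selecting a top-level key returns the sub-frame for that letter.
a_block = df["A"]
print(df)
print(a_block)
```

With the letter on the top level, `df["A"]` or `df["B"]` gives the price and amount columns for that group directly.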

How to group a Pandas DataFrame by url without the query string?

I have a Pandas DataFrame that is structured like this:
+-------+------------+------------------------------------+----------+
| index | Date | path | Count |
+-------+------------+------------------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 2893 |
| 2 | 2020-06-10 | about/v1/?status=active?name=craig | 264 |
| 3 | 2020-06-09 | about/v1/?status=active?name=craig | 182 |
+-------+------------+------------------------------------+----------+
How do I group by the path, and the date without the query string so that the table looks like this?
+-------+------------+-------------------------+----------+
| index | Date | path | Count |
+-------+------------+-------------------------+----------+
| 0 | 2020-06-10 | about/v1/ | 10865 |
| 1 | 2020-06-10 | about/v1/?status=active | 3157 |
| 3 | 2020-06-09 | about/v1/?status=active | 182 |
+-------+------------+-------------------------+----------+
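Strip the ?name=... part from the path, then group by the Date and path columns. The pattern is a regular expression, so pass regex=True (pandas 2.x treats the pattern as a literal string by default):
result = (df.assign(path=df.path.str.replace(r"\?name=.*", "", regex=True))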
.drop("index",axis=1)
.groupby(["Date","path"],sort=False)
.sum()
)
result
Count
Date path
2020-06-10 about/v1/ 10865
about/v1/?status=active 3157
2020-06-09 about/v1/?status=active 182
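A self-contained version of the above, rebuilt from the sample table, may be easier to experiment with. It leaves out the "index" column and keeps Date and path as regular columns via `as_index=False`; both are choices made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-06-10", "2020-06-10", "2020-06-10", "2020-06-09"],
    "path": [
        "about/v1/",
        "about/v1/?status=active",
        "about/v1/?status=active?name=craig",
        "about/v1/?status=active?name=craig",
    ],
    "Count": [10865, 2893, 264, 182],
})

result = (
    # Remove the trailing "?name=..." part; regex=True is required for the pattern.
    df.assign(path=df["path"].str.replace(r"\?name=.*", "", regex=True))
      # sort=False keeps groups in order of first appearance.
      .groupby(["Date", "path"], sort=False, as_index=False)["Count"]
      .sum()
)
print(result)
```

The pattern only strips the name parameter, as in the expected output; removing the whole query string would need a different pattern.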
