Pandas merge columns with the same name - python

I have the following Dataframe:
Timestamp  participant  level  gold  participant  level  gold
1          1            100    6000  2            76     4200
2          1            150    5000  2            120    3700
I am trying to reshape the Dataframe so that all rows from columns with the same name are stacked below each other, while keeping the Timestamp column:
Timestamp  participant  level  gold
1          1            100    6000
2          1            150    5000
1          2            76     4200
2          2            120    3700
To be clear, the example above is a small sample; the actual Dataframe has a lot of columns with the same name, and a lot more rows. Hence, the solution needs to take that into account.
Thanks!

The idea is to deduplicate the repeated column names with GroupBy.cumcount as a counter, and then reshape with DataFrame.stack:
df = df.set_index('Timestamp')
# build a counter per repeated name: participant -> 0, 1; level -> 0, 1; ...
s = df.columns.to_series()
df.columns = [df.columns, s.groupby(s).cumcount()]
# stack the counter level, so each duplicate group becomes its own set of rows
df = df.stack().reset_index(level=1, drop=True).reset_index()
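For reference, a self-contained run of this approach, rebuilding the sample frame from the question's data:
import pandas as pd

# sample frame with duplicated column names, as in the question
df = pd.DataFrame(
    [[1, 1, 100, 6000, 2, 76, 4200],
     [2, 1, 150, 5000, 2, 120, 3700]],
    columns=['Timestamp', 'participant', 'level', 'gold',
             'participant', 'level', 'gold'])

df = df.set_index('Timestamp')
s = df.columns.to_series()
df.columns = [df.columns, s.groupby(s).cumcount()]
df = df.stack().reset_index(level=1, drop=True).reset_index()
print(df)   # four rows: one per (Timestamp, duplicate group)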
If the column names are not duplicated but instead suffixed with . and a number:
print (df)
Timestamp participant level gold participant.1 level.1 gold.1
0 1 1 100 6000 2 76 4200
1 2 1 150 5000 2 120 3700
df = df.set_index('Timestamp')
# split 'participant.1' into ('participant', '1'); names without a dot get NaN,
# which fillna('0') turns into the counter for the first occurrence
df.columns = pd.MultiIndex.from_frame(df.columns.str.split('.', expand=True)
                                        .to_frame().fillna('0'))
df = df.stack().reset_index(level=1, drop=True).reset_index()
print (df)
0 Timestamp gold level participant
0 1 6000 100 1
1 1 4200 76 2
2 2 5000 150 1
3 2 3700 120 2
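Such suffixed names typically appear when a file with repeated headers is read, because read_csv renames duplicated columns to participant.1, level.1, and so on. A small sketch of how that frame arises (hypothetical CSV, same numbers as the sample):
import io
import pandas as pd

csv = io.StringIO(
    "Timestamp,participant,level,gold,participant,level,gold\n"
    "1,1,100,6000,2,76,4200\n"
    "2,1,150,5000,2,120,3700\n")

df = pd.read_csv(csv)   # duplicated headers get .1 suffixes
print(df.columns.tolist())
# ['Timestamp', 'participant', 'level', 'gold',
#  'participant.1', 'level.1', 'gold.1']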

Hope this helps:
df1 = pd.concat([df.iloc[:, 0], df.loc[:, df.columns.duplicated()]], axis=1)   # Timestamp + second occurrences
df2 = df.loc[:, ~df.columns.duplicated()]                                      # Timestamp + first occurrences
df = pd.concat([df2, df1], axis=0, ignore_index=True)                          # stack them below each other

Related

Python Pandas: Conditional subtraction of data between two dataframes?

I'm trying to use conditional subtraction between two dataframes.
Dataframe df1 has columns name and price. name is not unique:
>>df1
name price
0 mark 50
1 mark 200
2 john 10
3 chris 500
Another dataframe, df2, has two columns, name and paid. Here name is unique:
>>df2
name paid
0 mark 150
1 john 10
How can I conditionally subtract both dataframes to get the following output?
Expected final output:
name price paid
0 mark 50 50
1 mark 200 100
2 john 10 10
3 chris 500 0
IIUC, you can use:
# mapper for paid values
s = df2.set_index('name')['paid']

df1['paid'] = (df1
   .groupby('name')['price']                       # for each name
   .apply(lambda g: g.cumsum()                     # sum the total owed
          .clip(upper=s.get(g.name, default=0))    # in the limit of the paid
          .pipe(lambda s: s.diff().fillna(s))      # compute reverse cumsum
   )
)
output:
name price paid
0 mark 50 50.0
1 mark 200 100.0
2 john 10 10.0
3 chris 500 0.0
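To see why the cumsum/clip/diff chain distributes the payment, here is a hand trace for the duplicated name mark (a sketch with the question's values; 150 is mark's paid amount):
import pandas as pd

g = pd.Series([50, 200])               # mark's prices
total = g.cumsum()                     # 50, 250: running total owed
capped = total.clip(upper=150)         # 50, 150: capped at the amount paid
paid = capped.diff().fillna(capped)    # 50, 100: undo the cumsum
print(paid.tolist())                   # [50.0, 100.0]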

Pandas create column based on values in other rows and columns

I am currently trying to add a new column in a pandas dataset whose data is based on the contents of other rows in the dataset.
In the example, for each row x I want to find the id_real entry from the row y whose id matches the id_par of row x. See the following example.
id_real id id_par
100 1 2
200 2 3
300 3 4
id_real id id_par new_col
100 1 2 200
200 2 3 300
300 3 4 NaN
I have tried a lot of things and the last thing I tried was the following:
df["new_col"] = df[df["id"] == df["id_par"]]["node_id"]
Unfortunately, the new column then only contains NaN entries. Can you help me?
Use Series.map with DataFrame.drop_duplicates to match the first id rows:
df["new_col"] = df['id_par'].map(df.drop_duplicates('id').set_index('id')['id_real'])
print (df)
id_real id id_par new_col
0 100 1 2 200.0
1 200 2 3 300.0
2 300 3 4 NaN
Use map to match the "id_par" using "id" as index and "id_real" as values:
df['new_col'] = df['id_par'].map(df.set_index('id')['id_real'])
output:
id_real id id_par new_col
0 100 1 2 200.0
1 200 2 3 300.0
2 300 3 4 NaN
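For reference, a self-contained run of the map approach on the question's data:
import pandas as pd

df = pd.DataFrame({'id_real': [100, 200, 300],
                   'id': [1, 2, 3],
                   'id_par': [2, 3, 4]})

# look up each id_par in the id -> id_real mapping; misses become NaN
df['new_col'] = df['id_par'].map(df.set_index('id')['id_real'])
print(df)
#    id_real  id  id_par  new_col
# 0      100   1       2    200.0
# 1      200   2       3    300.0
# 2      300   3       4      NaN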

Merging dataframes with multiple key columns

I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Please notice that the numbers in df2 are chosen to be like this for simplicity, and they can include random numbers in practice.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
     .reset_index().melt(id_vars='index')
     .merge(df2, left_on='value', right_on='ID')
     .set_index('index')['Value']
)
# index
# 0    422
# 1    278
# 0     72
# 2    198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
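One thing to watch with both variants: every matched value yields one output row, so a row of df1 that hits several IDs is duplicated (row 0 above matches both A=1 and B=10). A quick sanity check, sketched against the frames defined in the question, counts the matches per row before joining:
# how many of each row's values appear in df2['ID']?
matches = df1.isin(df2['ID'].tolist()).sum(axis=1)
print(matches)
# 0    2   <- row 0 appears twice in the output
# 1    1
# 2    1
# dtype: int64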

Pandas Multilevel header, creating multiple rows for entries within a line

I have a multilevel column pandas dataframe with orders from an online retailer. Each line has info from a single order. This results in multiple items within the same row. I need to create a single row for each item sold in the orders. Because of this there are hundreds of columns.
  orderID  orderinfo     orderline    orderline    orderline    orderline
  nan      orderaspect1  0            0            1            1
  nan      nan           itemaspect1  itemaspect2  itemaspect1  itemaspect2
0 1        2             3            4            5            6
1 10       20            30           40           50           60
2 100      200           300          400          500          600
Each row with the same number under orderline needs to have its own row, which includes the info from orderID and all the order aspects. It needs to look something like this:
  orderID  orderinfo  orderline    orderline    item #
  nan      nan        itemaspect1  itemaspect2
0 1        2          3            4            0
1 10       20         30           40           0
2 100      200        300          400          0
3 1        2          5            6            1
4 10       20         50           60           1
5 100      200        500          600          1
This way there is a row for each ITEM instead of a row for each ORDER.
I've tried using melt and stack to unpivot, followed by pivoting, but I've run into issues with the MultiIndex columns; nothing formats it correctly.
[edit]
It can be reconstructed with code like this:
import numpy as np
import pandas as pd

i = pd.DatetimeIndex(['2011-03-31', '2011-04-01', '2011-04-04', '2011-04-05',
                      '2011-04-06', '2011-04-07', '2011-04-08', '2011-04-11',
                      '2011-04-12', '2011-04-13'])
cols = pd.MultiIndex.from_product([['orderID', 'orderlines'], ['order data', 0, 1]])
df = pd.DataFrame(np.random.randint(10, size=(len(i), 6)), index=i, columns=cols)
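No answer is recorded for this question, but a minimal sketch of the usual direction, shown on a shortened three-row variant of the synthetic frame above and assuming the item number sits in the second column level, is to stack that level:
import numpy as np
import pandas as pd

i = pd.date_range('2011-03-31', periods=3)
cols = pd.MultiIndex.from_product([['orderID', 'orderlines'], ['order data', 0, 1]])
df = pd.DataFrame(np.random.randint(10, size=(3, 6)), index=i, columns=cols)

# move the item-number level of the columns into the index:
# one row per (order, item) pair
out = (df.stack(level=1)
         .rename_axis([None, 'item #'])
         .reset_index(level='item #'))
print(out)
With the real ragged header, the order-level columns (orderID and the order aspects) would first be moved into the index so that only the orderline block gets stacked.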

Summarizing a dataset and creating new variables

I have a Dataset that lists individual transactions by country, quarter, division, the transaction type and the value. I would like to sum it up based on the first three variables but create new columns for the other two. The dataset looks like this:
Country Quarter Division Type Value
A 1 Sales A 50
A 2 Sales A 150
A 3 Sales B 20
A 1 Sales A 250
A 2 Sales B 50
A 3 Sales B 50
A 2 Marketing A 50
Now I would like to aggregate the data to get the number of transactions by type as a new variable. The overall number of transactions grouped by the first three variables is easy:
df.groupby(['Country', 'Quarter', 'Division'], as_index=False).agg({'Type':'count', 'Value':'sum'})
However, I would like my new dataframe to look as follows:
Country Quarter Division Type_A Type_B Value_A Value_B
A 1 Sales 2 0 300 0
A 2 Sales 1 1 150 50
A 3 Sales 0 2 0 70
A 2 Marketing 1 0 50 0
How do I do that?
Specify the column after groupby and pass tuples of (new name, aggregation function) to agg, then reshape with DataFrame.unstack, and finally flatten the MultiIndex columns with map:
df1 = (df.groupby(['Country', 'Quarter', 'Division', 'Type'])['Value']
         .agg([('Type', 'count'), ('Value', 'sum')])
         .unstack(fill_value=0))
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index()
print (df1)
Country Quarter Division Type_A Type_B Value_A Value_B
0 A 1 Sales 2 0 300 0
1 A 2 Marketing 1 0 50 0
2 A 2 Sales 1 1 150 50
3 A 3 Sales 0 2 0 70
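An equivalent route, sketched with pivot_table instead of the tuple-style agg (the construction below just rebuilds the sample data):
import pandas as pd

df = pd.DataFrame({
    'Country':  ['A'] * 7,
    'Quarter':  [1, 2, 3, 1, 2, 3, 2],
    'Division': ['Sales'] * 6 + ['Marketing'],
    'Type':     ['A', 'A', 'B', 'A', 'B', 'B', 'A'],
    'Value':    [50, 150, 20, 250, 50, 50, 50],
})

# count and sum per Type in one pass; missing combinations become 0
out = df.pivot_table(index=['Country', 'Quarter', 'Division'],
                     columns='Type', values='Value',
                     aggfunc=['count', 'sum'], fill_value=0)
out.columns = [('Type_' if f == 'count' else 'Value_') + t
               for f, t in out.columns]
print(out.reset_index())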
