I have a wide dataframe that I want to reshape, with some columns that I want to preserve. I have been exploring melt and wide_to_long, but I'm not sure either is what I need.
Imagine I have some columns named: 'id', 'classroom', 'city'
And other columns called: 'alumn_x_subject_y_mark', 'alumn_x_subject_y_name', 'alumn_x_subject_y_teacher'
where x and y range over the Cartesian product of range(20) and range(10).
I would like to end up with a df that has the columns: id, classroom, city, alumn, subject, mark, name, teacher, with all the original 20*10 (alumn, subject) combinations converted to rows.
An empty dataframe with that structure can be generated this way:
import pandas as pd
import itertools

vals = list(itertools.product(range(20), range(10)))
pd.DataFrame(
    columns=['id', 'classroom', 'city']
    + ['alumn_{0}_subject_{1}_mark'.format(x, y) for x, y in vals]
    + ['alumn_{0}_subject_{1}_name'.format(x, y) for x, y in vals]
    + ['alumn_{0}_subject_{1}_teacher'.format(x, y) for x, y in vals],
    dtype=object,
)
I'm not building this dataframe but receiving it from a file; that's why it has so many columns, and I cannot change that.
If you had only two parameters to extract, wide_to_long would work. Here you have three, so you can instead perform the reshaping manually with a MultiIndex:
regex = r'alumn_(\d+)_subject_(\d+)_(.*)'

out = (df
       .set_index(['id', 'classroom', 'city'])
       .pipe(lambda d: d.set_axis(
           pd.MultiIndex.from_frame(d.columns.str.extract(regex),
                                    names=['alumn', 'subject', None]),
           axis=1))
       .stack(['alumn', 'subject'])
       .reset_index()
       )
output:
Empty DataFrame
Columns: [id, classroom, city, alumn, subject, mark, name, teacher]
Index: []
output with a single row (after df.loc[0] = range(df.shape[1])):
id classroom city alumn subject mark name teacher
0 0 1 2 0 0 3 203 403
1 0 1 2 0 1 4 204 404
2 0 1 2 0 2 5 205 405
3 0 1 2 0 3 6 206 406
4 0 1 2 0 4 7 207 407
.. .. ... ... ... ... ... ... ...
195 0 1 2 9 5 98 298 498
196 0 1 2 9 6 99 299 499
197 0 1 2 9 7 100 300 500
198 0 1 2 9 8 101 301 501
199 0 1 2 9 9 102 302 502
[200 rows x 8 columns]
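For completeness, wide_to_long can still be pressed into service here, provided you first move the stub ('mark'/'name'/'teacher') to the front of each column name so that the two numeric parts form a single suffix. A sketch under that assumption (the renaming logic is mine, not part of the answer above):

# 'alumn_0_subject_1_mark' -> 'mark_alumn_0_subject_1'
renamed = df.rename(columns=lambda c: '_'.join(reversed(c.rsplit('_', 1)))
                    if c.startswith('alumn') else c)

out = (pd.wide_to_long(renamed,
                       stubnames=['mark', 'name', 'teacher'],
                       i=['id', 'classroom', 'city'],  # assumes these identify rows uniquely
                       j='alumn_subject',
                       sep='_',
                       suffix=r'alumn_\d+_subject_\d+')
         .reset_index())
out[['alumn', 'subject']] = out['alumn_subject'].str.extract(r'alumn_(\d+)_subject_(\d+)')
out = out.drop(columns='alumn_subject')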
I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns: ["A", "B", "C"].
Please note that the numbers in df2 were chosen like this for simplicity; in practice they can be arbitrary.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column of values while keeping the index. Then merge to get the matches. Finally, join back to the original DataFrame.
s = (df1
     .reset_index().melt(id_vars='index')
     .merge(df2, left_on='value', right_on='ID')
     .set_index('index')['Value']
     )
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
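One caveat worth adding (my note, not part of the answer above): Series.map requires the lookup index to be unique, so the stack + map variant assumes df2['ID'] contains no duplicates, whereas merge would simply produce one output row per match.

# sanity check before using the map-based variant
assert df2['ID'].is_unique, 'IDs repeat; fall back to the merge variant'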
For each category in the Customer_Acquisition_Channel column, I would like to put all values of the Days_To_Acquisition column into a separate df.
All Customer_ID values are unique in the dataset below.
DF
Customer_ID Customer_Acquisition_Channel Days_To_Acquisition
323 Organic 2
583 Organic 5
838 Organic 2
193 Website 7
241 Website 7
642 Website 1
Desired Output:
Days_To_Acq_Organic_Df
Index Days_To_Acquisition
0 2
1 5
2 2
Days_To_Acq_Website_Df
Index Days_To_Acquisition
0 7
1 7
2 1
This is what I have tried so far, but I would like to use a for loop instead of going through each channel manually:
sub_1 = df.loc[df['Customer_Acquisition_Channel'] == 'Organic']
Days_To_Acq_Organic_Df = sub_1[['Days_To_Acquisition']]

sub_2 = df.loc[df['Customer_Acquisition_Channel'] == 'Website']
Days_To_Acq_Website_Df = sub_2[['Days_To_Acquisition']]
You can iterate through unique values of the channel column and create new dataframes, change the column names, and append them to a list:
dataframes = []
for channel in df.Customer_Acquisition_Channel.unique():
    new_df = df[df['Customer_Acquisition_Channel'] == channel][['Customer_ID', 'Days_To_Acquisition']]
    new_df.columns = ['Customer_ID', f'Days_To_Acquisition_{channel}_df']
    dataframes.append(new_df)
OUTPUT:
for frame in dataframes:
    print(frame, '\n__________')
Customer_ID Days_To_Acquisition_Organic_df
0 323 2
1 583 5
2 838 2
__________
Customer_ID Days_To_Acquisition_Website_df
3 193 7
4 241 7
5 642 1
__________
Alternatively, you can store the dataframes to a dictionary so you can name them and call them individually:
dataframes = {}
for channel in df.Customer_Acquisition_Channel.unique():
    new_df = df[df['Customer_Acquisition_Channel'] == channel][['Customer_ID', 'Days_To_Acquisition']]
    new_df.columns = ['Customer_ID', f'Days_To_Acquisition_{channel}']
    dataframes[f'Days_To_Acquisition_{channel}_df'] = new_df
OUTPUT:
print(dataframes['Days_To_Acquisition_Organic_df'])
Customer_ID Days_To_Acquisition_Organic
0 323 2
1 583 5
2 838 2
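A more idiomatic variant of the same idea (a sketch, not from the answer above) lets groupby do the splitting in a single pass; note it resets each group's index, matching the desired output in the question:

dataframes = {
    f'Days_To_Acquisition_{channel}_df':
        group[['Customer_ID', 'Days_To_Acquisition']].reset_index(drop=True)
    for channel, group in df.groupby('Customer_Acquisition_Channel')
}
print(dataframes['Days_To_Acquisition_Organic_df'])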
I have a pandas dataframe:
id colA colB colC
194 1 0 1
194 1 1 0
194 2 1 3
195 1 1 2
195 0 1 0
197 1 1 2
I would like to calculate the occurrence of each value, grouped by id. In my case, the expected result is:
id countOfValue0 countOfValue1 countOfValue2 countOfValue3
194 2 3 1 1
195 1 2 1 0
197 0 1 1 0
If a value appears more than once in the same row, it is only counted once for that row (this is why, for id=194, countOfValue1 = 3).
I thought of separating the data into 3 dataframes using groupby on id-colA, id-colB, id-colC,
something like df.groupby(['id', 'colA']), but I can't find a proper way to calculate those dataframe values based on id. There is probably a more efficient way of doing this.
Try:
res = (df.set_index("id", append=True).stack()
         .reset_index(level=0).reset_index(level=1, drop=True)
         .drop_duplicates().assign(_dummy=1)
         .rename(columns={0: "countOfValue"})
         .pivot_table(index="id", columns="countOfValue",
                      values="_dummy", aggfunc="sum")
         .fillna(0).astype(int))
res = res.add_prefix("countOfValue")
res.columns.name = None
Outputs:
countOfValue0 ... countOfValue3
id ...
194 2 ... 1
195 1 ... 0
197 0 ... 0
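A shorter route to the same table (my sketch, relying on the per-row deduplication rule described in the question) is melt + drop_duplicates + crosstab:

long = df.reset_index().melt(id_vars=['index', 'id'])
long = long.drop_duplicates(['index', 'value'])  # count each value once per row
res = pd.crosstab(long['id'], long['value']).add_prefix('countOfValue')
res.columns.name = None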
I have used pandas merge to bring together two dataframes (24 columns each), based on a set of conditions, to generate a dataframe containing the rows that share the same key values; naturally there are many other columns in each dataframe with different values. The code used to do this is:
Merged = pd.merge(Buy_MD, Sell_MD, on=['ID', 'LocName', 'Sub-Group', 'Month'], how='inner')
The result is a dataframe with 48 columns. I would now like to bring these together (possibly using melt). To visualise this:
Deal_x ID_x Location_x ... (21 other columns with _x postfix)
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
Deal_y ID_y Location_y ... (21 other columns with _y postfix)
0 155 9545 B
1 155 0345 C
2 155 0445 D
I want this to become:
Deal ID Location
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
0 155 9545 B
1 155 0345 C
2 155 0445 D
How do I do this?
You can do something with the suffixes: split the columns into a MultiIndex, and then stack:
Merged = pd.merge(Buy_MD, Sell_MD, on=['ID', 'LocName', 'Sub-Group', 'Month'],
                  how='inner', suffixes=('_buy', '_sell'))
# n=1 so only the trailing suffix is split off
Merged.columns = pd.MultiIndex.from_tuples(Merged.columns.str.rsplit('_', n=1).map(tuple),
                                           names=('key', 'transaction'))
Merged = Merged.stack(level='transaction')
transaction Deal ID Location
0 buy 130 5845 A
0 sell 155 9545 B
1 buy 155 5845 B
1 sell 155 345 C
2 buy 138 6245 C
2 sell 155 445 D
If you want to get rid of the MultiIndex you can do:
Merged.index = Merged.index.droplevel('transaction')
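Incidentally, the operation is reversible as long as the 'transaction' level is still in the index; a sketch:

wide_again = Merged.unstack(level='transaction')  # back to one row per pair, columns (key, transaction)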
First, get rid of the suffixes using df.columns.str.split and taking the first split value from each sub-list in the result.
df_list = [df1, df2, ...]  # a generic solution for 2 or more frames
for i, df in enumerate(df_list):
    df_list[i].columns = df.columns.str.split('_').str[0]
Now, concatenate the result -
df = pd.concat(df_list, ignore_index=True)
df
Deal ID Location
0 130 5845 A
1 155 5845 B
2 138 6245 C
3 152 7345 A
4 155 9545 B
5 155 345 C
6 155 445 D
Also, if you're interested, use str.zfill on ID to get your expected output -
v = df.ID.astype(str)
v.str.zfill(v.str.len().max())
0 5845
1 5845
2 6245
3 7345
4 9545
5 0345
6 0445
Name: ID, dtype: object
Assign the result back.
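That is, assuming you want the padded strings to replace the original column (sketch):

df['ID'] = v.str.zfill(v.str.len().max())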
I have some data imported from a CSV; to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains 2 variables (group number and level). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
low high
split sex group
0 0 95 265
0 0 1 123 54
1 0 120 220
1 1 98 111
1 0 0 150 190
0 1 211 300
1 0 139 86
1 1 132 250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
mean
split sex group level
0 0 group0 High 3
Low 2
group1 High 5
Low 4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
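As a rough illustration (my sketch, not the linked answer verbatim), a regex via str.extract avoids hard-coding position 6:

# everything before the trailing 'Low'/'High' is the group
stacked[['group', 'level']] = stacked['group_level'].str.extract(r'(?P<group>.+?)(?P<level>Low|High)$')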
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266
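From here the reshaped frame aggregates as usual; for instance (illustrative only):

# mean of low/high per group, averaged over split and sex
df.stack(level='group').groupby(level='group').mean()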