Explode data frame columns into multiple rows - python

I have a large dataframe a that I would like to split or explode to become dataframe b (the real dataframe a contains 90 columns).
I looked for solutions to similar problems but did not find any, since this problem concerns the column names rather than the values in the cells.
Any pointer to a solution, or to an existing function in the pandas library, would be appreciated.
Thank you in advance.
from pandas import DataFrame
import numpy as np
# current df
a = DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1':'b1','C-1':'c1', 'A-2': 'a2', 'B-2':'b2','C-2':'c2'}])
# desired df
b = DataFrame([{'ID': 'ID_1', 'A': 'a1', 'B': 'b1', 'C': 'c1'},
               {'ID': 'ID_1', 'A': 'a2', 'B': 'b2', 'C': 'c2'}])
One idea I have is to split this dataframe into two dataframes (dataframe 1 would contain the columns A-1 to C-1 and dataframe 2 the columns A-2 to C-2), rename the columns to A/B/C, and then concatenate both. But I am not sure how efficient that is, since I have 90 columns and the number will grow over time.
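For reference, here is a minimal sketch of that split/rename/concat idea on the toy example above, with the two column groups hard-coded (which is exactly the part that becomes unwieldy at 90 columns):
import pandas as pd

# hard-coded column groups for the toy example; with 90 columns this bookkeeping is the pain point
part1 = a[['ID', 'A-1', 'B-1', 'C-1']].rename(columns=lambda c: c.split('-')[0])
part2 = a[['ID', 'A-2', 'B-2', 'C-2']].rename(columns=lambda c: c.split('-')[0])
b = pd.concat([part1, part2], ignore_index=True)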

This approach will generate some intermediate columns which will be removed later on.
First bring down those labels (A-1,...) from the header into a column
df = pd.melt(a, id_vars=['ID'], var_name='label')
Then split the label into character and number
df[['char', 'num']] = df['label'].str.split('-', expand=True)
Finally drop the label, set_index before unstack, and take care of the final table formats.
df.drop('label', axis=1)\
.set_index(['ID', 'num', 'char'])\
.unstack()\
.droplevel(0, axis=1)\
.reset_index()\
.drop('num', axis=1)
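Put together, a rough end-to-end version of the steps above (assuming pandas is imported as pd) reproduces the desired b:
df = pd.melt(a, id_vars=['ID'], var_name='label')
df[['char', 'num']] = df['label'].str.split('-', expand=True)
b = (df.drop('label', axis=1)
       .set_index(['ID', 'num', 'char'])
       .unstack()
       .droplevel(0, axis=1)
       .reset_index()
       .drop('num', axis=1))
#      ID   A   B   C
# 0  ID_1  a1  b1  c1
# 1  ID_1  a2  b2  c2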

pd.wide_to_long works well here assuming a small number of known stubnames:
b = (
    pd.wide_to_long(a, stubnames=['A', 'B', 'C'], sep='-', i='ID', j='to_drop')
    .droplevel(level='to_drop')
    .reset_index()
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Alternatively set_index, split the columns on '-' with str.split and stack:
b = a.set_index('ID')
b.columns = b.columns.str.split('-', expand=True)
b = b.stack().droplevel(-1).reset_index()
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2

One option is with the pivot_longer function from pyjanitor, which abstracts the reshaping process and is also efficient:
# pip install pyjanitor
import janitor
import pandas as pd
a.pivot_longer(index="ID", names_to=".value", names_pattern="(.).+")
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
The .value tells the function which part of the column names to retain. It takes its cue from names_pattern, which should be a regular expression with groups; the grouped parts of the names are what remain as headers. In this case, the first letter of each column is what we are interested in, which is captured by (.).
Another option, with pivot_longer, is to use the names_sep parameter:
(a.pivot_longer(index="ID", names_to=(".value", "num"), names_sep="-")
   .drop(columns="num")
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Again, only values in the columns associated with .value remain as headers.

import pandas as pd
import math

# toy frame: an even number of columns that we want to fold in half
df = pd.DataFrame(data={k: [i * k for i in range(1, 5)] for k in range(1, 9)})
assert df.shape[1] % 2 == 0
# split into left and right halves, give both halves the same column names, then stack them
df_1 = df.iloc[:, 0:math.floor(df.shape[1] / 2)]
df_2 = df.iloc[:, math.floor(df.shape[1] / 2):]
df_2.columns = df_1.columns
df_sum = pd.concat((df_1, df_2), axis=0)
display(df_sum)  # display() is available in Jupyter/IPython; use print(df_sum) in a plain script
Like this?

Related

(Python)Transforming dataframe

My goal is to transform a dataframe. The source and target forms are shown below; the date column of the target is the index. How can I transform the source table into the target form? (I tried pd.DataFrame([sum(list(df.values()), []), but it does not work.)
#Source form
#date is 2021-11-24
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30],'B': [100, 200, 300]})
A B
10 100
20 200
30 300
#Target form (date is index)
# date A0 A1 A2 B0 B1 B2
# 2021-11-24 10 20 30 100 200 300
You can do it like this:
df_out = df.unstack().to_frame().T
df_out.columns = [f'{i}{j}' for i, j in df_out.columns]
df_out['date'] = pd.to_datetime('2021-11-24')
print(df_out)
Output:
   A0  A1  A2   B0   B1   B2       date
0  10  20  30  100  200  300 2021-11-24
I would first pre-process the df using pd.melt to bring the column names down to the table under a new column called variable.
df_melt = pd.melt(df, var_name='variable', value_name='value')
Then append a numerical suffix to the column names
df_melt['variable'] += df_melt.groupby('variable').cumcount().astype(str)
Now, it's a good time to put the date in,
df_melt['date'] = pd.to_datetime('2021-11-24')
so that I can use it as the index for my final table. variable is also set as the index so that I can use unstack to bring it up to the column names.
df_melt.set_index(['date', 'variable']).unstack().droplevel(0, axis=1)
The final droplevel removes the unwanted extra level of the column names.
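As a rough end-to-end sketch (reusing the df and date from the question), the whole chain would be:
df_melt = pd.melt(df, var_name='variable', value_name='value')
df_melt['variable'] += df_melt.groupby('variable').cumcount().astype(str)
df_melt['date'] = pd.to_datetime('2021-11-24')
out = df_melt.set_index(['date', 'variable']).unstack().droplevel(0, axis=1)
# variable     A0  A1  A2   B0   B1   B2
# date
# 2021-11-24   10  20  30  100  200  300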

How to compare a value of a single column over multiple columns in the same row using pandas?

I have a dataframe that looks like this:
np.random.seed(21)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B1', 'B2', 'B3'])
df['current_State'] = [df['B1'][0], df['B1'][1], df['B2'][2], df['B2'][3], df['B3'][4], df['B3'][5], df['B1'][6], df['B2'][7]]
df
I need to create a new column that contains the name of the column whose value matches 'current_State' in that row (see new_column in the output below).
I tried many combinations of apply and lambda functions but without success. Any help is very welcome!
You can compare the current_State column with all the remaining columns to create a boolean mask, then use idxmax along axis=1 on this mask to get the name of the column whose value in the given row equals the corresponding value in current_State:
c = 'current_State'
df['new_column'] = df.drop(columns=c).eq(df[c], axis=0).idxmax(axis=1)
In case there is a possibility of no matching values, we can instead use:
c = 'current_State'
m = df.drop(columns=c).eq(df[c], axis=0)
df['new_column'] = m.idxmax(axis=1).mask(~m.any(axis=1))
>>> df
A B1 B2 B3 current_State new_column
0 -0.051964 -0.111196 1.041797 -1.256739 -0.111196 B1
1 0.745388 -1.711054 -0.205864 -0.234571 -1.711054 B1
2 1.128144 -0.012626 -0.613200 1.373688 -0.613200 B2
3 1.610992 -0.689228 0.691924 -0.448116 0.691924 B2
4 0.162342 0.257229 -1.275456 0.064004 0.064004 B3
5 -1.061857 -0.989368 -0.457723 -1.984182 -1.984182 B3
6 -1.476442 0.231803 0.644159 0.852123 0.231803 B1
7 -0.464019 0.697177 1.567882 1.178556 1.567882 B2

Sort a Pandas Dataframe by Multiple Columns Using Key Argument

I have a pandas dataframe with the following columns:
df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]],
    columns=['one', 'two'])
Which I am hoping to sort primarily by column 'two', then by column 'one'. For the secondary sort, I would like to use a custom sorting rule that sorts column 'one' by the alphabetic character [A-Z] and then the trailing number [0-100]. So, the outcome of the sort would be:
one two
A1 1
B1 1
A2 1
A1 2
B1 2
A2 2
I have sorted a list of strings similar to column 'one' before using a sorting rule like so:
def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)
If I try to apply this rule via a pandas sort, I run into a number of issues including:
The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column 'one', I get the error "TypeError: cannot convert the series to <class 'int'>"
When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column 'two' using a native numerical sort.
How would I go about sorting the DataFrame as mentioned above?
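For reference, a naive one-column translation of that list key (a sketch of such an attempt, not necessarily the exact code used) reproduces the first error, because sort_values passes the whole column to the key function:
# hypothetical failing attempt: `value` here is the entire Series, not a single cell,
# so int(value[1:]) raises "TypeError: cannot convert the series to <class 'int'>"
df.sort_values('one', key=lambda value: (value[0], int(value[1:])))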
You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.
>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
.sort_values(['two', 'rhs', 'lhs'])
.drop(columns=['lhs', 'rhs']))
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2
Use str.extract to create some temp columns based on 1) the alphabetic part ([a-zA-Z]+) and 2) the numeric part (\d+), and then drop them:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract(r'(\d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]:
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".
But I have some requirements where I cannot coerce unknown/unexpected values, and unfortunately that is what pd.Categorical does. Also, None is not supported as a category and is coerced automatically.
So my solution was to use a key to sort on multiple columns with a custom sorting order:
import pandas as pd
df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]],
    columns=['one', 'two'])

def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}

        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret
# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)
print(df)
Output:
  one  two
5  A1    1
3  A2    1
1  B1    1
2  A1    2
0  A2    2
4  B1    2
Be aware that this solution can be slow.
I have created a function to solve the issue of using the key argument for multiple columns, following the suggestion from @Alexander. It also avoids duplicating names when creating the temporary columns. Furthermore, it can sort the whole dataframe including the index (using the index.names).
It can be improved, but using copy-paste should work:
https://github.com/DavidDB33/pandas_helpers/blob/main/pandas_helpers/helpers.py
With pandas >= 1.1.0 and natsort, you can also do this now:
import natsort
sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
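If the primary sort should still be on column 'two' as described in the question, the same key works with the columns reordered (a small variation, assuming natsort is installed):
sorted_df = df.sort_values(["two", "one"], key=natsort.natsort_keygen())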

Union of two pandas DataFrames

Say I have two data frames:
df1:
A
0 a
1 b
df2:
A
0 a
1 c
I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:
A B
0 a df1, df2
1 b df1
2 c df2
I can get the concatenated data frame (df3) without duplicates as follows:
import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)
I can't think of/find a method to have control over what element goes where. How can I add the extra column?
Thank you very much for any tips.
Merge with an indicator argument, and remap the result:
m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}
result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)
result
A B
0 a df1, df2
1 b df1
2 c df2
Use the command below:
df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) \
        .groupby('A') \
        .aggregate(list) \
        .reset_index()
The result will be:
A source
0 a [df1, df2]
1 b [df1]
2 c [df2]
The assign call adds a column named source, with value df1 or df2, to the corresponding dataframe. The groupby call groups rows with the same A value into a single row, and aggregate describes how to combine the other columns (here, source) within each group. I have used the list aggregation so that the source column becomes the list of sources that share the same A.
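If you want the source column to match the desired output exactly (a comma-separated string rather than a list), one possible follow-up step is to join each list:
df3['source'] = df3['source'].str.join(', ')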
We can use an outer join to solve this:
df1 = pd.DataFrame({'A': ['a', 'b']})
df2 = pd.DataFrame({'A': ['a', 'c']})
# tag each frame with its source name
df1['col1'] = 'df1'
df2['col2'] = 'df2'
# outer merge keeps rows from both frames; missing tags become ''
df = pd.merge(df1, df2, on=['A'], how="outer").fillna('')
# combine the two tag columns and trim any leading/trailing comma
df['B'] = df['col1'] + ',' + df['col2']
df['B'] = df['B'].str.strip(',')
df = df[['A', 'B']]
df
A B
0 a df1,df2
1 b df1
2 c df2

Combine and complete values of two pandas dataframes from each other

I have 2 dataframes with missing values that I want to merge, completing the missing data in each from the other.
A simple visualisation:
df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3
df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2
The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
Basically, merge the dataframes without duplicating values of column "A", completing any missing values in a row by comparing rows with the same "A" value across the two dataframes.
I tried many things I saw in the pandas documentation, plus solutions on Stack Exchange, but failed every time.
These are all the different things I tried :
pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))
(You can see I was quite desperate at the end)
Any idea how to do that ?
Did you try combine_first with A as the index?
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
Or you can use first (assuming numpy is imported as np):
pd.concat([df1, df2]).replace('', np.nan).groupby('A', as_index=False).first()
Out[53]:
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
Setup
Since you wrote them as csvs, I'm going to assume they were csvs.
df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)
Solution
Use fillna after having used align. df1.align(df2) returns both frames reindexed to the union of their indexes, so the starred call below simply fills the gaps in the aligned df1 with the values from the aligned df2.
pd.DataFrame.fillna(*df1.align(df2))
B C
A
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
You can use reset_index if you insist but I think it's prettier to leave it as it is.
You can use the pandas categorical data type to set an ordered list of categories, sort on these ordered categories, and drop rows with null values to get your desired results:
from pandas.api.types import CategoricalDtype
# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', '', 'C3']})
# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
'B': ['', 'B4', 'B2'],
'C': ['C1', 'C4', 'C2']})
# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])
# specify the letters in your dataset
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}
# iterate over the letters
for let in letters:
    # create the ordered categories - set the range for the max # of values
    cats = ['{}{}'.format(let, num) for num in list(range(1000))]
    # create ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into placeholder
    cat_dict[let] = cat_type
# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')
# which outputs desired results
A B C
0 A1 B1 C1
2 A2 B2 C2
2 A3 B3 C3
1 A4 B4 C4
While this is a bit longer, one nice thing about doing it this way is that your data can be in any order upon input. This assigns an inherent rank to the values within each column, so A1 < A2 < A3 and so on. This also enables you to sort the columns.
