Data Transforming/formatting in Python

I have the following pandas data:
import pandas as pd

df = pd.DataFrame({'ID_1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
                   'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']})
display(df)
ID_1 ID_2
1 a
1 b
1 c
2 f
2 g
3 d
4 v
4 x
4 y
4 z
For each ID_1, I need to find all pairwise combinations (order doesn't matter) of ID_2. For example:
When ID_1 = 1, the combinations are ab, ac, bc.
When ID_1 = 2, the only combination is fg.
Note: if an ID_1 value occurs fewer than two times, there is no combination for it (see ID_1 = 3, for example).
Finally, I need to store the combination results in df2 as follows:

One way using itertools.combinations:
from itertools import combinations
def comb_df(ser):
    return pd.DataFrame(list(combinations(ser, 2)), columns=["from", "to"])

new_df = df.groupby("ID_1")["ID_2"].apply(comb_df).reset_index(drop=True)
Output:
from to
0 a b
1 a c
2 b c
3 f g
4 v x
5 v y
6 v z
7 x y
8 x z
9 y z
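A merge-based alternative (a sketch, not from the answer above): self-join the frame on ID_1 and keep each unordered pair once by requiring the left value to sort before the right one. This assumes ID_2 values are unique within each ID_1 group.

# Self-join pairs every ID_2 with every other ID_2 sharing the same ID_1;
# keeping only rows where the left value sorts before the right one leaves
# each unordered pair exactly once and drops singleton groups (ID_1 == 3).
pairs = df.merge(df, on="ID_1", suffixes=("_from", "_to"))
pairs = pairs[pairs["ID_2_from"] < pairs["ID_2_to"]]
new_df = (pairs.rename(columns={"ID_2_from": "from", "ID_2_to": "to"})
               [["from", "to"]]
               .reset_index(drop=True))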

Grouping DataFrame rows given a row-dependent condition

I have a problem I cannot solve.
I have a DataFrame in which every row is a "person" with one or two connections to another row. Every person has an ID, and the connections are expressed in the columns COMPANION1 and COMPANION2, which hold the ID of the connected person.
I have to bind every "group" with Pandas, maybe by creating a new column with a number associated with the group.
It's easier to look at the DF:
import numpy as np
import pandas as pd

array = np.array([['A', 'B', 0], ['B', 'A', 0], ['C', 'D', 0],
                  ['D', 'C', 0], ['E', 'G', 'F'], ['F', 'E', 'G'],
                  ['G', 0, 0]])
index_values = ['0', '1', '2', '3', '4', '5', '6']
column_values = ['ID', 'COMPANION1', 'COMPANION2']
df = pd.DataFrame(data=array, index=index_values, columns=column_values)
df['GROUP'] = np.zeros(len(df))
df
[DataFrame screenshot]
The original dataset is way bigger than this (circa 1600 rows).
In this example, A-B are bound, as are C-D, and then E-F-G (yes, not every "person" lists links, but it is sufficient to check whether others have links to the ones without any).
How can I assign an "index" to every family? I'm sure there are no unbound people, every family is a "closed system", and no family is bigger than 3.
I hope I've been clear enough!
Thanks a lot,
Samuel
EDIT
Ok, I think this solves the issue of G being in its own group:
# Step 1: create groups
>>> groups = [sorted([j for j in i if j != '0']) for i in list(df['ID'] + df['COMPANION1'] + df['COMPANION2'])]
# Keep groups that are actually groups, not just one-offs
>>> groups = [g for g in groups if len(g) > 1]
[['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E', 'F', 'G'], ['E', 'F', 'G']]
# Convert groups to strings and keep only the unique ones
>>> groups = set("".join(g) for g in groups)
{'AB', 'CD', 'EFG'}
# Step 2: fill in groups
>>> df['Group'] = df['ID'].apply(lambda x: [i for i in groups if x in i][0])
ID COMPANION1 COMPANION2 Group
0 A B 0 AB
1 B A 0 AB
2 C D 0 CD
3 D C 0 CD
4 E G F EFG
5 F E G EFG
6 G 0 0 EFG
Then you could continue as below to assign numbers to the groups.
Original
The simplest way I'm seeing to do this is to create a new column with the group members:
df['Members'] = df['ID'] + df['COMPANION1'] + df['COMPANION2']
df['Members'] = df['Members'].apply(lambda x: "".join(sorted(x)))
ID COMPANION1 COMPANION2 Members
0 A B 0 0AB
1 B A 0 0AB
2 C D 0 0CD
3 D C 0 0CD
4 E G F EFG
5 F E G EFG
6 G 0 0 00G
If you wanted to have numeric group IDs instead, you could do:
df["Group_Id"] = df["Members"].copy().replace({gid: i for i, gid in enumerate(df["Members"].unique())})
ID COMPANION1 COMPANION2 Members Group_Id
0 A B 0 0AB 0
1 B A 0 0AB 0
2 C D 0 0CD 1
3 D C 0 0CD 1
4 E G F EFG 2
5 F E G EFG 2
6 G 0 0 00G 3
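Both versions above rely on every group appearing complete in at least one row. If that guarantee ever broke (e.g. chains like A linking to B and B linking to C with no row listing all three), a small union-find over the companion links would generalize the grouping. A sketch, not part of the original answer:

parent = {}

def find(x):
    # Path-compressing find: walk up to the root, flattening as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    # Merge the sets containing a and b under one root.
    parent[find(a)] = find(b)

for _, row in df.iterrows():
    for companion in (row['COMPANION1'], row['COMPANION2']):
        if companion != '0':  # '0' marks "no companion" in this data
            union(row['ID'], companion)

# Number the groups by first appearance of each root.
roots = df['ID'].map(find)
df['GROUP'] = roots.map({r: i for i, r in enumerate(roots.unique())})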

Combine Columns in Pandas

Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]
See how the columns have the same name? The output I want is to have columns with the same name combined (not summed or concatenated), meaning the second column 1 is added to the end of the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1, 2]
How do I do this? I can do it manually, except I actually have like 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in actual datasets are unequal in length.
Try with groupby and explode:
output = (df1.groupby(level=0, axis=1)
             .agg(lambda x: x.values.tolist())
             .explode(df1.columns.unique().tolist()))
>>> output
1 2
0 a e
0 1 5
1 b f
1 2 6
2 c g
2 3 7
3 d h
3 4 8
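For intuition, this is the intermediate frame the explode step receives: the agg turns each group of duplicate-named columns into per-row lists (a sketch using the same df1 as above):

intermediate = df1.groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
print(intermediate)
#         1       2
# 0  [a, 1]  [e, 5]
# 1  [b, 2]  [f, 6]
# 2  [c, 3]  [g, 7]
# 3  [d, 4]  [h, 8]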
Edit:
To reorder the rows, you can do:
output = (output.assign(order=output.groupby(level=0).cumcount())
                .sort_values("order", ignore_index=True)
                .drop("order", axis=1))
>>> output
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that:
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

dictionary = {}
for column in df1.columns.unique():
    items = []
    # df1[column] selects every column sharing this name;
    # stack those columns end to end, one column at a time
    for _, col in df1[column].items():
        items += col.tolist()
    dictionary[column] = items
new_df = pd.DataFrame(dictionary)
print(new_df)
You can use a dictionary whose default value is list and loop through the dataframe columns. Use the column name as dictionary key and append the column value to the dictionary value.
from collections import defaultdict

d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
For df1.columns = [1,1,2,3], the output is
1 2 3
0 a e 5
1 b f 6
2 c g 7
3 d h 8
4 1 None None
5 2 None None
6 3 None None
7 4 None None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
value value
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
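Another compact option (a sketch, assuming every duplicated name covers the same number of columns): select all columns sharing a name and flatten them column by column with NumPy, which also yields the row order the question asks for.

out = pd.DataFrame({
    # .T.ravel() walks the selected block column by column, stacking the
    # duplicate columns end to end instead of interleaving their rows
    name: df1.loc[:, df1.columns == name].to_numpy().T.ravel()
    for name in df1.columns.unique()
})
print(out)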

How do I group values in different rows that have the same name in Pandas?

I have this Pandas DataFrame df:
column1 column2
0 x a
1 x b
2 x c
3 y d
4 y e
5 y f
6 y g
7 z h
8 z i
9 z j
How do I group the values in column2 according to the value in column1?
Expected output:
x y z
0 a d h
1 b e i
2 c f j
3 g
I'm new to Pandas, I'd really appreciate your help.
This is a pivot problem with some preprocessing work:
(df.assign(index=df.groupby('column1').cumcount())
   .pivot(index='index', columns='column1', values='column2'))
column1 x y z
index
0 a d h
1 b e i
2 c f j
3 NaN g NaN
We're pivoting using "column1" as the header and "column2" as the values. To make pivoting possible, we need a 3rd column which identifies the uniqueness of the values being pivoted, so we build that with groupby and cumcount.
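For intuition, this is what the helper column looks like on the question's data (a sketch): a running count within each column1 group, which becomes the row index after the pivot.

print(df.groupby('column1').cumcount().tolist())
# [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]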
Some extra code is needed because each column in the solution (res) DataFrame has a different length.
Code:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'column1': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z'],
                        'column2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
print(df)

new_columns = df['column1'].unique().tolist()  # ['x', 'y', 'z']
res = pd.DataFrame(columns=new_columns)
res[new_columns[0]] = df[df['column1'] == new_columns[0]]['column2']  # adding first column 'x'
for new_column in new_columns[1:]:
    new_col_ser = df[df['column1'] == new_column]['column2']
    no_of_rows_to_add = len(new_col_ser) - len(res)
    for i in range(no_of_rows_to_add):
        res.loc[len(res) + 1, :] = np.nan
    res[new_column][:len(new_col_ser)] = new_col_ser
print(res)
Output:
column1 column2
0 x a
1 x b
2 x c
3 y d
4 y e
5 y f
6 y g
7 z h
8 z i
9 z j
x y z
0 a d h
1 b e i
2 c f j
4 NaN g NaN
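A shorter route (a sketch, not from the answers above): build one Series per group, reset each to a 0-based index, and let concat align them side by side, padding shorter groups with NaN.

# Each group becomes a column keyed by its column1 value; reset_index
# lines the groups up from row 0 so concat can align them.
out = pd.concat(
    {name: grp.reset_index(drop=True)
     for name, grp in df.groupby('column1')['column2']},
    axis=1,
)
print(out)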

pandas: groupby sum conditional on other column

I have a dataframe which looks like this:
df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', check if any value in column 'b' is 'Y' or True and keep that value, and then just sum on 'c'.
the resulting dataframe should look like this
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the below but get an error:
df.groupby('a')['b'].max()['c'].sum()
You can use agg with max and sum. Taking the max of column 'b' works because 'Y' > 'N' in string comparison.
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
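If 'b' held booleans or strings that don't happen to sort the right way, an explicit any per group is a safer spelling (a sketch, same df as above):

# 'any' makes the intent explicit instead of relying on string ordering.
out = (df.groupby('a', as_index=False)
         .agg(b=('b', lambda s: 'Y' if s.eq('Y').any() else 'N'),
              c=('c', 'sum')))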

Pandas Dataframe pivot with rolling window

I am trying to prepare data for some time-series modeling with Python Pandas (first timer). My DataFrame looks like this:
df = pd.DataFrame({
    'time': [0, 1, 2, 3, 4],
    'colA': ['a', 'b', 'c', 'd', 'e'],
    'colB': ['v', 'w', 'x', 'y', 'z'],
    'value': [10, 11, 12, 13, 14]
})
# time colA colB value
# 0 0 a v 10
# 1 1 b w 11
# 2 2 c x 12
# 3 3 d y 13
# 4 4 e z 14
Is there a combination of functions that could transform it into the following format?
# colA-2 colA-1 colA colB-2 colB-1 colB value
# _ _ a _ _ v 10
# _ a b _ v w 11
# a b c v w x 12
# b c d w x y 13
# c d e x y z 14
I am very new to Python/Pandas and I do not have any concrete code/results that got me even close to what I need...
You can use the shift function:
df['colA-2'] = df['colA'].shift(2, fill_value='-')
df['colA-1'] = df['colA'].shift(1, fill_value='-')
...
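The "..." can be replaced by a loop over columns and lags (a sketch; '_' matches the filler shown in the desired output):

# One shifted copy per (column, lag) pair; fill_value pads the first rows.
for col in ['colA', 'colB']:
    for lag in (2, 1):
        df[f'{col}-{lag}'] = df[col].shift(lag, fill_value='_')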
I'd use pd.concat
pd.concat([
    df[['colA', 'colB']].shift(i).add_suffix(f'-{i}')
    for i in range(1, 3)
], axis=1).fillna('-').join(df)
colA-1 colB-1 colA-2 colB-2 time colA colB value
0 - - - - 0 a v 10
1 a v - - 1 b w 11
2 b w a v 2 c x 12
3 c x b w 3 d y 13
4 d y c x 4 e z 14
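If you want the exact column layout from the question, a final reindex does it (a sketch; here `out` is a hypothetical name for the concat/join result above):

# Select the columns in the order the question's desired output shows.
out = out[['colA-2', 'colA-1', 'colA', 'colB-2', 'colB-1', 'colB', 'value']]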