I have a problem I cannot solve.
I have a DataFrame in which every row is a "person" with one or two connections to another row. Every person has an ID, and the connections are expressed in the columns COMPANION1 and COMPANION2, which hold the IDs of the connected persons.
I have to bind every "group" with Pandas, perhaps by creating a new column with a number associated with the group.
It's easier to look at the DF:
import numpy as np
import pandas as pd

array = np.array([['A', 'B', 0], ['B', 'A', 0], ['C', 'D', 0],
                  ['D', 'C', 0], ['E', 'G', 'F'], ['F', 'E', 'G'],
                  ['G', 0, 0]])
index_values = ['0', '1', '2', '3', '4', '5', '6']
column_values = ['ID', 'COMPANION1', 'COMPANION2']
df = pd.DataFrame(data=array, index=index_values, columns=column_values)
df['GROUP'] = np.zeros(len(df))
df
The original dataset is way bigger than this (circa 1600 rows).
In this example, A-B are bound, as are C-D, and then E-F-G (yes, not every "person" lists links, but it is sufficient to check whether others have links to the ones without any).
How can I assign an "index" to every family? I'm sure there are no unbound people, every family is a "closed system", and no family is bigger than 3.
I hope I've been clear enough!
Thanks a lot,
Samuel
EDIT
Ok, I think this solves the issue of G being in its own group:
# Step 1: create groups
>>> groups = [sorted(j for j in i if j != '0') for i in df['ID'] + df['COMPANION1'] + df['COMPANION2']]
# Keep groups that are actually groups, not just one-offs
>>> groups = [g for g in groups if len(g) > 1]
>>> groups
[['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E', 'F', 'G'], ['E', 'F', 'G']]
# Convert groups to strings and keep only the unique ones
>>> groups = set("".join(g) for g in groups)
>>> groups
{'AB', 'CD', 'EFG'}
# Step 2, fill in groups
>>> df['Group'] = df['ID'].apply(lambda x: [i for i in groups if x in i][0])
ID COMPANION1 COMPANION2 Group
0 A B 0 AB
1 B A 0 AB
2 C D 0 CD
3 D C 0 CD
4 E G F EFG
5 F E G EFG
6 G 0 0 EFG
Then you could continue as in the Original section below to assign numbers to the groups.
Original
The simplest way I'm seeing to do this is to create a new column with the group members:
df['Members'] = df['ID'] + df['COMPANION1'] + df['COMPANION2']
df['Members'] = df['Members'].apply(lambda x: "".join(sorted(x)))
ID COMPANION1 COMPANION2 Members
0 A B 0 0AB
1 B A 0 0AB
2 C D 0 0CD
3 D C 0 0CD
4 E G F EFG
5 F E G EFG
6 G 0 0 00G
If you wanted to have numeric group IDs instead, you could do:
df["Group_Id"] = df["Members"].copy().replace({gid: i for i, gid in enumerate(df["Members"].unique())})
ID COMPANION1 COMPANION2 Members Group_Id
0 A B 0 0AB 0
1 B A 0 0AB 0
2 C D 0 0CD 1
3 D C 0 0CD 1
4 E G F EFG 2
5 F E G EFG 2
6 G 0 0 00G 3
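As a side note, a more compact route to numeric IDs (a sketch, reusing the Members column built above) is pd.factorize, which numbers values in order of first appearance:
# pd.factorize returns (codes, uniques); the codes are 0-based integers,
# identical for identical Members strings
df["Group_Id"] = pd.factorize(df["Members"])[0]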
Related
Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]
See how the columns have the same name? I want columns with the same name combined (not summed or joined side by side): the second column 1 should be appended to the end of the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1, 2]
How do I do this? I could do it manually, except I actually have about 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in the actual datasets are unequal in length.
Try with groupby and explode:
output = (df1.groupby(level=0, axis=1)
             .agg(lambda x: x.values.tolist())
             .explode(df1.columns.unique().tolist()))
>>> output
1 2
0 a e
0 1 5
1 b f
1 2 6
2 c g
2 3 7
3 d h
3 4 8
Edit:
To reorder the rows, you can do:
output = (output.assign(order=output.groupby(level=0).cumcount())
                .sort_values("order", ignore_index=True)
                .drop("order", axis=1))
>>> output
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

dictionary = {}
for column in df1.columns.unique():
    items = []
    # df1[column] selects every column sharing this name; transpose so we
    # walk one original column at a time and append them end to end
    for sub_column in df1[column].values.T.tolist():
        items += sub_column
    dictionary[column] = items
new_df = pd.DataFrame(dictionary)
print(new_df)
You can use a dictionary whose default value is list and loop through the dataframe columns. Use the column name as dictionary key and append the column value to the dictionary value.
from collections import defaultdict

d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
For df1.columns = [1,1,2,3], the output is
1 2 3
0 a e 5
1 b f 6
2 c g 7
3 d h 8
4 1 None None
5 2 None None
6 3 None None
7 4 None None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
value value
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
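Another route, a sketch assuming nothing beyond NumPy and the frame above: selecting a duplicated name returns all columns sharing it as one block, and ravel(order="F") walks that block column by column, which appends the duplicates end to end. pd.concat pads names of unequal total length with NaN.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

# df1[name] is an (n_rows, n_duplicates) block; column-major ravel stacks
# its columns end to end, giving one Series per unique name
out = pd.concat(
    {name: pd.Series(df1[name].to_numpy().ravel(order="F"))
     for name in df1.columns.unique()},
    axis=1,
)
print(out)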
I have the following pandas data:
import pandas as pd

df = {'ID_1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
      'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']}
df = pd.DataFrame(df)
display(df)
ID_1 ID_2
1 a
1 b
1 c
2 f
2 g
3 d
4 v
4 x
4 y
4 z
For each ID_1, I need to find the combinations (order doesn't matter) of ID_2. For example,
When ID_1 = 1, the combinations are ab, ac, bc.
When ID_1 = 2, the combination is fg.
Note that if the frequency of ID_1 is less than 2, there is no combination (see ID_1 = 3, for example).
Finally, I need to store the combination results in a new DataFrame df2.
One way using itertools.combinations:
from itertools import combinations

def comb_df(ser):
    return pd.DataFrame(list(combinations(ser, 2)), columns=["from", "to"])

new_df = df.groupby("ID_1")["ID_2"].apply(comb_df).reset_index(drop=True)
Output:
from to
0 a b
1 a c
2 b c
3 f g
4 v x
5 v y
6 v z
7 x y
8 x z
9 y z
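As a usage note, groups with a single member drop out automatically: drawing pairs from one element yields nothing, which is why ID_1 = 3 contributes no rows. A quick check:
from itertools import combinations

# ID_1 = 3 has only 'd', so it produces no pairs
print(list(combinations(['d'], 2)))  # []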
I want to group a df by a column col_2, which contains mostly integers, but some cells contain a range of integers. In my real life example, each unique integer represents a specific serial number of an assembled part. Each row in the dataframe represents a single part, which is allocated to the assembled part by col_2. Some parts can only be allocated to the assembled part with a given uncertainty (range).
The expected output would be one single group for each referenced integer (assembled part S/N). For example, the entry col_1 = c should be allocated to both groups where col_2 = 1 and col_2 = 2.
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
print(df.groupby(['col_2']).groups)
The code above gives an error because groupby tries to sort the group keys and a range cannot be compared with an int:
TypeError: '<' not supported between instances of 'range' and 'int'
I think this does what you want:
s = df.col_2.apply(pd.Series).set_index(df.col_1).stack().astype(int)
s.reset_index().groupby(0).col_1.apply(list)
The first step gives you:
col_1
a 0 1
b 0 2
c 0 1
1 2
d 0 3
e 0 2
1 3
2 4
f 0 5
And the final result is:
1 [a, c]
2 [b, c, e]
3 [d, e]
4 [e]
5 [f]
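On pandas 0.25 or newer you could reach the same result with DataFrame.explode, which expands each range into one row per member while leaving plain integers untouched; a sketch under that version assumption:
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})

# After explode, every row carries a single serial number, so an
# ordinary groupby gives one group per assembled part
out = (df.explode('col_2')
         .astype({'col_2': int})
         .groupby('col_2')['col_1']
         .apply(list))
print(out)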
Try this:
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
# Wrap each bare serial number in a single-element range so every key has
# the same type; sort=False because range objects cannot be ordered
df['col_2'] = df.col_2.map(lambda x: x if isinstance(x, range) else range(x, x + 1))
print(df.groupby('col_2', sort=False).groups)
My data frame looks like this:
[screenshot: pandas data frame with multiple categorical variables per user]
I made sure there are no duplicates in it. I want to encode it, and I want my final output like this
I tried using pandas get_dummies directly but I am not getting the desired result.
Can anyone help me with this?
IIUC, your user column is empty and everything is on name. If that's the case, you can:
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor get_dummies with groupby on the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 0 1 1 1
Assuming the following dataframe:
user name
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 3 d
8 3 e
9 3 f
You could groupby user and then use get_dummies:
import pandas as pd

# create the data frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])

# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
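Since each (user, name) pair occurs at most once here, pd.crosstab on the same df gives the identical 0/1 table in a single step:
# Every (user, name) pair appears once, so the counts are already 0/1
print(pd.crosstab(df.user, df.name))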
How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to accomplish what I am trying to do?
Example: make a unique indicator (col5),
then set up a join with another dataframe using the same logic.
col1 col2 col3 col4 col5
apple pear mango tea applepearmangotea
Then do a join, something like:
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key. You just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))

d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
A B C D
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 1 1 g h
d2
A B E F
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 2 0 g h
We can perform the equivalent of a left join using pd.DataFrame.merge:
d1.merge(d2, 'left')
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
We can be explicit with the columns:
d1.merge(d2, 'left', on=['A', 'B'])
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
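If you want to verify which left rows actually found a match, merge also accepts an indicator flag; a small sketch with the frames above:
# _merge is 'both' for matched rows and 'left_only' for unmatched ones
d1.merge(d2, how='left', on=['A', 'B'], indicator=True)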