I would like to do a one-to-many mapping with the following lists and mapping dictionary:
import pandas as pd
from collections import ChainMap

l1 = ['a', 'b', 'c']
l2 = ['a', 'c', 'd']
l3 = ['d', 'e', 'f']
mapping_dict = ChainMap(
    dict.fromkeys(l1, 'A'),
    dict.fromkeys(l2, 'B'),
    dict.fromkeys(l3, 'C'))
This is my dataframe:
df = pd.DataFrame({'code': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [1, 2, 3, 4, 5, 6]})
print(df)
code value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
When I do the mapping as follows:
df['mapping'] = df['code'].map(mapping_dict.get)
code value mapping
0 a 1 A
1 b 2 A
2 c 3 A
3 d 4 B
4 e 5 C
5 f 6 C
The problem is that I want a one-to-many mapping, and this doesn't capture all the relationships. The desired outcome would be something like the following, which creates a new row whenever a code maps to more than one group.
code value mapping
0 a 1 A
1 a 1 B
2 b 2 A
3 c 3 A
4 c 3 B
5 d 4 B
6 d 4 C
7 e 5 C
8 f 6 C
Thank you for your support.
Here ChainMap can't be used since it won't preserve duplicate keys. The solution is to create an intermediate dataframe from each pair of (mapping, code list) and then left-merge that with the original dataframe:
pairs = [('A', l1), ('B', l2), ('C', l3)]
mapping = pd.DataFrame(pairs, columns=['mapping', 'code'])
df.merge(mapping.explode('code'), how='left')
Result
code value mapping
0 a 1 A
1 a 1 B
2 b 2 A
3 c 3 A
4 c 3 B
5 d 4 B
6 d 4 C
7 e 5 C
8 f 6 C
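For reference, this is roughly what the intermediate exploded mapping looks like before the merge (a quick sketch built from the same lists):

print(mapping.explode('code'))
#   mapping code
# 0       A    a
# 0       A    b
# 0       A    c
# 1       B    a
# 1       B    c
# 1       B    d
# 2       C    d
# 2       C    e
# 2       C    f

Note that explode keeps the original index (0, 1, 2) for every value taken from the same list, which is harmless here because merge matches on the columns, not the index.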
Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
import numpy as np

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1,1,2,2]
See how the columns have the same name? The output I want is to have columns with the same name combined (not summed together or string-concatenated), meaning the values of the second column 1 are appended below the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1,2]
How do I do this? I can do it manually, except I actually have like 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in actual datasets are unequal in length.
Try with groupby and explode:
output = (df1.groupby(level=0, axis=1)
             .agg(lambda x: x.values.tolist())
             .explode(df1.columns.unique().tolist()))
>>> output
1 2
0 a e
0 1 5
1 b f
1 2 6
2 c g
2 3 7
3 d h
3 4 8
Edit:
To reorder the rows, you can do:
output = (output.assign(order=output.groupby(level=0).cumcount())
                .sort_values("order", kind="stable", ignore_index=True)  # stable sort keeps the original row order within each group
                .drop("order", axis=1))
>>> output
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
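Side note: on recent pandas releases (2.1+, if that's what you're on) groupby(..., axis=1) is deprecated. A minimal sketch of an equivalent, assuming pandas 1.3+ for the multi-column explode, is to group the transposed frame instead:

output = (df1.T.groupby(level=0)                       # group the duplicate labels, now as rows
             .agg(lambda x: x.values.tolist())         # collect each label's values into lists
             .T                                        # back to the original orientation
             .explode(df1.columns.unique().tolist()))  # one row per collected value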
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that:
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

dictionary = {}
for column in df1.columns.unique():
    items = []
    # df1[column] selects every column sharing this label as a DataFrame;
    # transpose so each sub-column is appended whole, one after the other
    for sub_column in df1[column].values.T.tolist():
        items += sub_column
    dictionary[column] = items

new_df = pd.DataFrame(dictionary)
print(new_df)
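With the example df1 above, this should print something like the following (all values are strings, since the np.array input coerces everything to str):

1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8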
You can use a dictionary whose default value is a list and loop through the dataframe columns. Use the column name as the dictionary key and extend the dictionary value with the column's values.
from collections import defaultdict

d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
For df1.columns = [1,1,2,3], the output is
1 2 3
0 a e 5
1 b f 6
2 c g 7
3 d h 8
4 1 None None
5 2 None None
6 3 None None
7 4 None None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
In [3]: pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Out[3]:
value value
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
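If you want the original labels 1 and 2 back instead of the repeated value headers, one option (a sketch, assuming the labels should follow the same sorted order that groupby uses for its keys) is to reassign the column names afterwards:

out = pd.concat([s.reset_index(drop=True)
                 for _, s in df1.melt().groupby("variable")["value"]], axis=1)
out.columns = sorted(df1.columns.unique())  # groupby sorts its keys, so sort the labels to match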
I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I hope to reorder the rows based on the values in column A.
I don't want to sort the values alphabetically, but to reorder them in a specific order like ['b', 'd', 'c', 'a'].
What I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
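On the example frame above this should give roughly the following (note the original index is kept, as in the reindex answer below):

A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1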
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
       .sort_values()
       .index]
Use a dictionary-like mapping for the order of the strings, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing the datatype of A, you can set 'A' as the index and select the rows in the desired order defined by sk:
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
df.assign(S=df.A.map({v:k for k,v in enumerate(sk)}))
.sort_values(by='S')
.drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df
    .assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
    .sort_values('A')
)
I have dataframe A and dataframe B, and I want to join B onto A, but only a certain column from B. Like this:
dataA = ['a', 'c', 'd', 'e']
A = pd.DataFrame(dataA, columns=['testA'])
dataB = [['a', 1, 'asdf'],
         ['b', 2, 'asdf'],
         ['c', 3, 'asdf'],
         ['d', 4, 'asdf'],
         ['e', 5, 'asdf']]
B = pd.DataFrame(dataB, columns=['testB', 'num', 'asdf'])
Out[1]: A
testA
0 a
1 c
2 d
3 e
Out[2]: B
testB num asdf
0 a 1 asdf
1 b 2 asdf
2 c 3 asdf
3 d 4 asdf
4 e 5 asdf
My current code is:
In [3]: A.join(B.set_index('testB'), on='testA')
Out[3]:
testA num asdf
0 a 1 asdf
1 c 3 asdf
2 d 4 asdf
3 e 5 asdf
My desired output is to join only the 'num' column, as below, and ignore the 'asdf' column (or any other columns if there were more):
Out[4]: A
testA num
0 a 1
1 c 3
2 d 4
3 e 5
One way may be to use merge:
new_df = A.merge(B, how='left', left_on='testA', right_on='testB')[['testA', 'num']]
Result:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
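A small variation on the same idea (just a sketch) is to cut B down to the two needed columns before the merge, so extra columns such as 'asdf' never enter the result:

new_df = (A.merge(B[['testB', 'num']], how='left', left_on='testA', right_on='testB')
           .drop(columns='testB'))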
Use map: first create a pd.Series with the column you are bringing over as the values and the "mapping" column ('testB') as the index. This ignores, and does no work on, the other columns that are not needed in the desired result:
A['num'] = A['testA'].map(B.set_index('testB')['num'])
A
Output:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
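For clarity, the intermediate Series being passed to map looks like this (index = testB, values = num):

print(B.set_index('testB')['num'])
# testB
# a    1
# b    2
# c    3
# d    4
# e    5
# Name: num, dtype: int64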
Using what you already have, and keeping only the columns you want:
z = A.join(B.set_index('testB'), on='testA')[["testA", "num"]]
outputs:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
I want to group a df by a column col_2, which contains mostly integers, but some cells contain a range of integers. In my real life example, each unique integer represents a specific serial number of an assembled part. Each row in the dataframe represents a single part, which is allocated to the assembled part by col_2. Some parts can only be allocated to the assembled part with a given uncertainty (range).
The expected output would be one single group for each referenced integer (assembled part S/N). For example, the entry col_1 = c should be allocated to both groups where col_2 = 1 and col_2 = 2.
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
print(df.groupby(['col_2']).groups)
The code above gives an error:
TypeError: '<' not supported between instances of 'range' and 'int'
I think this does what you want:
s = df.col_2.apply(pd.Series).set_index(df.col_1).stack().astype(int)
s.reset_index().groupby(0).col_1.apply(list)
The first step gives you:
col_1
a      0    1
b      0    2
c      0    1
       1    2
d      0    3
e      0    2
       1    3
       2    4
f      0    5
And the final result is:
1 [a, c]
2 [b, c, e]
3 [d, e]
4 [e]
5 [f]
Try this:
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
# normalise every entry to a list, explode to one row per value, then group
df['col_2'] = df.col_2.map(lambda x: list(x) if isinstance(x, range) else [x])
print(df.explode('col_2').groupby(['col_2']).groups)
Given the following DataFrame, I try to aggregate over columns 'A' and 'C': for 'A', count the unique appearances of the strings, and for 'C', sum the values.
The problem arises when some of the samples in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}
# This will result in an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This results because for 'ID' = 1 we had 'a', 'b', 'c' and 'd', and for 'ID' = 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain

A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
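As a side note, on pandas 0.25+ the flattening step can also be written with DataFrame.explode, which leaves scalar entries in 'A' untouched and repeats 'ID' and 'C' automatically. A minimal sketch:

res = df.explode('A')               # lists in 'A' become one row per element
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)   # sum 'C' on the original df, as above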