How can I add a new column value based on a condition? I have two data sets, as follows.
The first data set contains 2 columns:
Start  End
A      B
A      C
A      D
B      A
B      C
B      E
The second data set contains 3 columns:
start  End  time
A      B    8
A      D    9
A      E    10
B      A    7
B      E    4
If Start and End are the same in both data sets, add the time to the first data set. How can I merge these two data sets in Python to get the following?
Start  End  Time
A      B    8
A      C    nan
A      D    9
B      A    7
B      C    nan
B      E    4
import pandas as pd

df1 = pd.DataFrame({'Start': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'End': ['B', 'C', 'D', 'A', 'C', 'E']})
df2 = pd.DataFrame({'Start': ['A', 'A', 'A', 'B', 'B'],
                    'End': ['B', 'D', 'E', 'A', 'E'],
                    'time': [8, 9, 10, 7, 4]})
# With no `on` given, merge joins on the columns both frames share (Start and End).
result = df1.merge(df2, how='left')
Start  End  time
A      B    8
A      C    nan
A      D    9
B      A    7
B      C    nan
B      E    4
Here I am assuming that both of your dataframes use the same column names, Start and End.
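If the key columns are not named identically (the question's second table shows a lowercase start), one option is to rename before merging and pass the keys explicitly. This is a minimal sketch, assuming the second frame really uses start; the rename step is an illustration and not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'Start': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'End': ['B', 'C', 'D', 'A', 'C', 'E']})
# Assumption: the second frame spells its first key column in lowercase.
df2 = pd.DataFrame({'start': ['A', 'A', 'A', 'B', 'B'],
                    'End': ['B', 'D', 'E', 'A', 'E'],
                    'time': [8, 9, 10, 7, 4]})

# Align the key names, then join explicitly on both keys.
# how='left' keeps every row of df1; unmatched rows get NaN in `time`.
result = df1.merge(df2.rename(columns={'start': 'Start'}),
                   on=['Start', 'End'], how='left')
print(result)
```

Naming the keys with on= also guards against joining on an unrelated column that happens to share a name in both frames.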
I would like to do a one-to-many mapping with the following lists and mapping dictionary:
from collections import ChainMap

l1 = ['a', 'b', 'c']
l2 = ['a', 'c', 'd']
l3 = ['d', 'e', 'f']
mapping_dict = ChainMap(
    dict.fromkeys(l1, 'A'),
    dict.fromkeys(l2, 'B'),
    dict.fromkeys(l3, 'C'))
This is my dataframe:
df = pd.DataFrame({'code': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [1, 2, 3, 4, 5, 6]})
print(df)
code value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
When I do the mapping as follows:
df['mapping'] = df['code'].map(mapping_dict.get)
code value mapping
0 a 1 A
1 b 2 A
2 c 3 A
3 d 4 B
4 e 5 C
5 f 6 C
The problem is that I want a one-to-many mapping, and this lookup does not capture every relationship. The desired outcome would be something like the following, with a new row created whenever a code has multiple mappings:
code value mapping
0 a 1 A
1 a 1 B
2 b 2 A
3 c 3 A
4 c 3 B
5 d 4 B
6 d 4 C
7 e 5 C
8 f 6 C
Thank you for your support.
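ChainMap can't be used here because a lookup returns only the value from the first mapping that contains the key, so the other mappings for a duplicated key are lost. One solution is to build an intermediate dataframe from the (mapping, code) pairs, explode it to one row per code, and then left merge it with the original dataframe: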
pairs = [('A', l1), ('B', l2), ('C', l3)]
mapping = pd.DataFrame(pairs, columns=['mapping', 'code'])
df.merge(mapping.explode('code'), how='left')
Result
code value mapping
0 a 1 A
1 a 1 B
2 b 2 A
3 c 3 A
4 c 3 B
5 d 4 B
6 d 4 C
7 e 5 C
8 f 6 C
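To make the difference concrete, the sketch below compares a ChainMap lookup with a long-form table holding one row per (code, mapping) pair. The names chained, long_map, first_match and labels_for_a are introduced here for illustration only.

```python
from collections import ChainMap

import pandas as pd

l1 = ['a', 'b', 'c']
l2 = ['a', 'c', 'd']
l3 = ['d', 'e', 'f']

# A ChainMap lookup stops at the first mapping that contains the key,
# so 'a' resolves to 'A' only and its 'B' relationship is never seen.
chained = ChainMap(dict.fromkeys(l1, 'A'),
                   dict.fromkeys(l2, 'B'),
                   dict.fromkeys(l3, 'C'))
first_match = chained['a']

# A long-form table keeps one row per (code, mapping) pair instead.
long_map = pd.DataFrame(
    [(code, label)
     for label, codes in [('A', l1), ('B', l2), ('C', l3)]
     for code in codes],
    columns=['code', 'mapping'],
)
labels_for_a = long_map.loc[long_map['code'] == 'a', 'mapping'].tolist()

print(first_match)   # A
print(labels_for_a)  # ['A', 'B']
```

Merging the original dataframe with long_map on code then yields one output row per pair, which is what the explode-based answer achieves.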
I want to find local duplicates and give them a unique id, directly in pandas.
Real-life example:
Time-ordered purchase data in which a customer ID occurs multiple times (because the customer visits a shop several times a week), but I want to identify the occasions on which the customer purchases multiple items at the same time.
My current approach would look like this:
def follow_ups(lst):
    lst2 = [None] + lst[:-1]
    i = 0
    l = []
    for e1, e2 in zip(lst, lst2):
        if e1 != e2:
            i += 1
        l.append(i)
    return l

follow_ups(['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'])
# [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]
# for pandas (pass a plain list, since the function concatenates lists)
df['out'] = follow_ups(df['test'].tolist())
But I have the feeling there might be a much simpler and cleaner approach in pandas which I am unable to find.
Pandas Sample data
import pandas as pd
df = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C']})
# test
# 0 A
# 1 B
# 2 B
# 3 C
# 4 B
# 5 D
# 6 D
# 7 D
# 8 E
# 9 A
# 10 B
# 11 C
df_out = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'], 'out':[1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]})
# test out
# 0 A 1
# 1 B 2
# 2 B 2
# 3 C 3
# 4 B 4
# 5 D 5
# 6 D 5
# 7 D 5
# 8 E 6
# 9 A 7
# 10 B 8
# 11 C 9
You can check where your column test is not equal to its shifted version, using shift() with ne(), and apply cumsum() to the result:
df['out'] = df['test'].ne(df['test'].shift()).cumsum()
Which prints:
df
test out
0 A 1
1 B 2
2 B 2
3 C 3
4 B 4
5 D 5
6 D 5
7 D 5
8 E 6
9 A 7
10 B 8
11 C 9
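The one-liner is easier to follow when each intermediate step is laid out as its own column. A small sketch on a shortened sample; the names prev, changed, run_id and steps are introduced here for illustration.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'B', 'C', 'B'])

prev = s.shift()           # previous row's value; missing for the first row
changed = s.ne(prev)       # True wherever a new run starts
run_id = changed.cumsum()  # running count of run starts, i.e. the run id

steps = pd.DataFrame({'test': s, 'prev': prev,
                      'changed': changed, 'out': run_id})
print(steps)
```

The first row always counts as a change because comparing a value with the missing shifted entry yields True, which is why the ids start at 1.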
I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I want to reorder the rows based on the values in column A.
I don't want to sort the values alphabetically but to arrange them in a specific order, such as ['b', 'd', 'c', 'a'].
What I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
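# Pass index=df.index so the sorted labels line up with df even without a default RangeIndex.
df.loc[
    pd.Series(pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True),
              index=df.index)
    .sort_values()
    .index
]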
Use a dictionary that maps each string to its position in the desired order, then sort by those positions and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing datatype of A, you can set 'A' as index and select elements in the desired order defined by sk.
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
    .sort_values(by='S')
    .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df
    .assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
    .sort_values('A')
)
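Another option, not taken from the answers above, is the key argument of sort_values (available from pandas 1.1 onward), which sorts by mapped positions without a helper column and without changing the dtype of A. A minimal sketch; the names order and rank are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'a', 'd', 'd', 'c'],
                   'B': [3, 1, 1, 4, 1, 4],
                   'C': [3, 2, 2, 4, 2, 5],
                   'D': [4, 1, 1, 1, 1, 6]})

order = ['b', 'd', 'c', 'a']
rank = {value: position for position, value in enumerate(order)}

# key= receives the column as a Series and must return a Series of sort keys.
# mergesort is stable, so rows sharing the same A keep their original relative order.
result = df.sort_values('A', key=lambda col: col.map(rank), kind='mergesort')
print(result)
```

Values of A that are missing from order map to NaN and are placed last by default.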
Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count the unique strings, and for 'C', sum the values.
A problem arises when some of the entries in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
'A' : ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
'C' : [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}
# This raises TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This is because for 'ID' = 1 we have 'a', 'b', 'c' and 'd', and for 'ID' = 2 we have 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain

import numpy as np
import pandas as pd

# Wrap scalar entries in a list so every row of A is a list.
A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
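On pandas 0.25 or later, DataFrame.explode can replace the manual flattening step. This is an alternative sketch rather than the method given above; the name flat is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'],
                         'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# explode gives each list element its own row; scalar entries pass through unchanged.
flat = df.explode('A')

# Count unique strings on the flattened frame, but sum C on the original
# frame so exploded rows are not counted more than once.
agg_df = pd.concat([flat.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```

Summing C on the original dataframe matters: on the exploded frame, the value 13 would be added three times for ID 1.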
I have a dataset where two columns are almost perfectly correlated, meaning that when one column has a certain value, there is a very high chance that the second column has a particular corresponding value. Example:
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g']})
print(df)
Out[6]:
A B
0 1 a
1 1 a
2 1 a
3 1 a
4 1 a
5 1 a
6 2 b
7 2 c
8 3 d
9 3 d
10 4 e
11 4 e
12 5 f
13 5 f
14 5 g
When column A has a value of 1, B is always a. That is a perfect correlation, since no row with A equal to 1 has a B value other than a. The same holds for 3->d and 4->e.
5 and 2 are not perfectly correlated.
How can I find all the A values that have more than one matching B value, so that I can print them all out?
In this case, my desired output would be something like
find_imperfect_correlations(df, 'A', 'B')
Out[7]:
2 -> 'b', 'c'
5 -> 'f', 'g'
EDIT:
Preferably a generalized answer for when the dtype of B could be ints, dates, etc.
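def find_imperfect_correlations(df, col1, col2):
    # Keep groups with more than one distinct col2 value, then list those values per key.
    df_out = (df.groupby(col1).filter(lambda x: x[col2].nunique() > 1)
                .groupby(col1)[col2].apply(lambda x: x.unique()))
    # items() replaces iteritems(), which was removed in pandas 2.0.
    for key, values in df_out.items():
        print(str(key) + ' -> ' + str(values))

find_imperfect_correlations(df, 'A', 'B')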
Output:
2 -> ['b' 'c']
5 -> ['f' 'g']
IIUC, you can do:
In [562]: s = df.groupby('A')['B'].unique()
In [563]: s[s.str.len() > 1]
Out[563]:
A
2 [b, c]
5 [f, g]
dtype: object
Or
In [564]: s[s.str.len() > 1].str.join(', ')
Out[564]:
A
2 b, c
5 f, g
dtype: object
This one would also work:
grouped = df.groupby('A').B.nunique()
df_grouped = grouped.to_frame().reset_index()
decorrelated = df_grouped[df_grouped['B'] > 1]
print(decorrelated['A'])
The first line counts the distinct values in column B for each value in column A. The second line converts the resulting series to a dataframe. The third line selects the rows where the number of distinct values is greater than 1. Then the last line prints the A values.
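For the generalized case raised in the question's edit (B holding ints, dates, and so on), counting with map(len) avoids depending on the .str accessor. A minimal sketch: the function body and the integer sample data are illustrative assumptions, and returning a dict instead of printing is a choice made here. Only integers are shown; other dtypes are expected to behave the same way but are not demonstrated.

```python
import pandas as pd


def find_imperfect_correlations(df, col1, col2):
    """Map each col1 value with more than one distinct col2 value to those values."""
    uniques = df.groupby(col1)[col2].unique()
    # map(len) counts the distinct values per key for any dtype of col2.
    multi = uniques[uniques.map(len) > 1]
    return {key: list(values) for key, values in multi.items()}


# Illustrative data with an integer B column.
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 10, 20, 30, 40, 40]})

result = find_imperfect_correlations(df, 'A', 'B')
for key, values in result.items():
    print(key, '->', values)
```

Returning a dict leaves the formatting to the caller, who can print the pairs or feed them into further checks.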