I want to find local duplicates and give them a unique id, directly in pandas.
Real-life example:
Time-ordered purchase data where a customer id occurs multiple times (because the customer visits a shop multiple times a week), but I want to identify occasions where the customer purchases multiple items at the same time.
My current approach would look like this:
def follow_ups(lst):
    lst2 = [None] + lst[:-1]
    i = 0
    l = []
    for e1, e2 in zip(lst, lst2):
        if e1 != e2:
            i += 1
        l.append(i)
    return l
follow_ups(['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'])
# [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]
# for pandas (convert the column to a plain list first)
df['out'] = follow_ups(df['test'].tolist())
But I have the feeling there might be a much simpler and cleaner approach in pandas which I am unable to find.
Pandas Sample data
import pandas as pd
df = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C']})
# test
# 0 A
# 1 B
# 2 B
# 3 C
# 4 B
# 5 D
# 6 D
# 7 D
# 8 E
# 9 A
# 10 B
# 11 C
df_out = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'], 'out':[1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]})
# test out
# 0 A 1
# 1 B 2
# 2 B 2
# 3 C 3
# 4 B 4
# 5 D 5
# 6 D 5
# 7 D 5
# 8 E 6
# 9 A 7
# 10 B 8
# 11 C 9
You can compare whether your column test is not equal to its shifted version, using shift() with ne(), and use cumsum() on that. The first comparison is against the NaN that shift() introduces, so it evaluates to True and the counter starts at 1:
df['out'] = df['test'].ne(df['test'].shift()).cumsum()
Which prints:
df
test out
0 A 1
1 B 2
2 B 2
3 C 3
4 B 4
5 D 5
6 D 5
7 D 5
8 E 6
9 A 7
10 B 8
11 C 9
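If, as in the real-life example, a new occasion should also start whenever the customer changes (not only when the value in one column changes), the same shift/ne/cumsum idea extends by OR-ing one comparison per key column. This is only a sketch; the column names customer_id and timestamp below are hypothetical:
# Sketch with hypothetical columns: a new occasion starts whenever either
# the customer or the purchase timestamp differs from the previous row.
new_occasion = (df['customer_id'].ne(df['customer_id'].shift())
                | df['timestamp'].ne(df['timestamp'].shift()))
df['occasion_id'] = new_occasion.cumsum()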
This question already has answers here: Pandas Merging 101 (8 answers). Closed 7 months ago.
How to add a new column value based on a condition? I have two data sets as follows:
The first data set contains 2 columns:

Start  End
A      B
A      C
A      D
B      A
B      C
B      E
The second data set contains 3 columns:

start  End  time
A      B    8
A      D    9
A      E    10
B      A    7
B      E    4
If the Start and End values are the same, add the time to the first data set. How can I merge these two data sets in Python to get the following?
Start  End  Time
A      B    8
A      C    nan
A      D    9
B      A    7
B      C    nan
B      E    4
df1 = pd.DataFrame({'Start': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'End': ['B', 'C', 'D', 'A', 'C', 'E']})

df2 = pd.DataFrame({'Start': ['A', 'A', 'A', 'B', 'B'],
                    'End': ['B', 'D', 'E', 'A', 'E'],
                    'time': [8, 9, 10, 7, 4]})
result = df1.merge(df2, how='left')
Start  End  time
A      B    8
A      C    nan
A      D    9
B      A    7
B      C    nan
B      E    4
Here I am assuming that both of your dataframes use the same column names, Start and End.
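If the key columns are not named identically (the question's second data set shows a lowercase start), one hedged option is to rename before merging so the keys line up:
# Sketch: align the column names first, then merge on the now-common keys.
result = df1.merge(df2.rename(columns={'start': 'Start'}), how='left')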
Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]
See how the columns have the same name? The output I want is to have columns with the same name combined (not summed or concatenated), meaning the second column 1 is added to the end of the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1, 2]
How do I do this? I can do it manually, except I actually have like 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in actual datasets are unequal in length.
Try with groupby and explode:
output = (df1.groupby(level=0, axis=1)
             .agg(lambda x: x.values.tolist())
             .explode(df1.columns.unique().tolist()))
>>> output
1 2
0 a e
0 1 5
1 b f
1 2 6
2 c g
2 3 7
3 d h
3 4 8
Edit:
To reorder the rows, you can do:
output = (output.assign(order=output.groupby(level=0).cumcount())
                .sort_values("order", ignore_index=True)
                .drop("order", axis=1))
>>> output
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that:
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

dictionary = {}
for column in df1.columns.unique():
    items = []
    # df1[column] selects every column that shares this label; walk them
    # column by column (hence the transpose) so each one is appended below
    # the previous one rather than interleaved row by row
    for col_values in df1[column].values.T.tolist():
        items += col_values
    dictionary[column] = items

new_df = pd.DataFrame(dictionary)
print(new_df)
You can use a dictionary whose default value is a list and loop through the dataframe columns. Use the column name as the dictionary key and append the column's values to the dictionary value.
from collections import defaultdict

d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())

df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
For df1.columns = [1,1,2,3], the output is
1 2 3
0 a e 5
1 b f 6
2 c g 7
3 d h 8
4 1 None None
5 2 None None
6 3 None None
7 4 None None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
In [3]: pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Out[3]:
value value
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
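To get back the original column labels instead of the repeated value headers, one could (a sketch; out is just a name for the result) relabel afterwards:
out = pd.concat([s.reset_index(drop=True)
                 for _, s in df1.melt().groupby("variable")["value"]], axis=1)
out.columns = df1.columns.unique()  # restore the original labels 1 and 2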
Given the following DataFrame, I try to aggregate over columns 'A' and 'C': for 'A', count unique appearances of the strings, and for 'C', sum the values.
Problem arises when some of the samples in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}

# This will result in an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
That is because for 'ID' = 1 we had 'a', 'b', 'c' and 'd', and for 'ID' = 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
import numpy as np

A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)

res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
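As a side note (not part of the answer above): on pandas 0.25 or newer, DataFrame.explode can replace the manual flattening step, since it expands list entries and leaves scalar entries untouched:
# Sketch, assuming pandas >= 0.25
flat = df.explode('A')
agg_df = pd.concat([flat.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
#     A   C
# ID
# 1   4  36
# 2   3  14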
This question already has answers here: Get the row(s) which have the max value in groups using groupby (15 answers). Closed 4 years ago.
I have a dataframe that I group according to an id-column. For each group I want to get the row (the whole row, not just the value) containing the max value. I am able to do this by first getting the max value for each group, then creating a filter array, and then applying the filter to the original dataframe, like so:
import pandas as pd
# Dummy data
df = pd.DataFrame({'id' : [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],
'other_value' : ['a', 'e', 'b', 'b', 'a', 'd', 'b', 'f' ,'a' ,'c', 'e', 'f'],
'value' : [1, 3, 5, 2, 5, 6, 2, 4, 6, 1, 7, 3]
})
# Get the max value in each group
df_max = df.groupby('id')['value'].max()
# Create row filter
row_filter = [df_max[i]==v for i, v in zip(df['id'], df['value'])]
# Filter
df_target = df[row_filter]
df_target
Out[58]:
id other_value value
2 1 b 5
5 2 d 6
7 3 f 4
10 4 e 7
This solution works, but somehow seems overly cumbersome. Does anybody know of a nicer way to do this? Preferably a one-liner. Regarding potential duplicates, I'll deal with those later :)
Use DataFrameGroupBy.idxmax if you need to select only one max value per group:
df = df.loc[df.groupby('id')['value'].idxmax()]
print (df)
id other_value value
2 1 b 5
5 2 d 6
7 3 f 4
10 4 e 7
If there are multiple max values and you want to select all rows with the max value:
df = pd.DataFrame({'id' : [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],
'other_value' : ['a', 'e', 'b', 'b', 'a', 'd', 'b', 'f' ,'a' ,'c', 'e', 'f'],
'value' : [1, 3, 5, 2, 5, 6, 2, 4, 6, 1, 7, 7]
})
print (df)
id other_value value
0 1 a 1
1 1 e 3
2 1 b 5
3 2 b 2
4 2 a 5
5 2 d 6
6 3 b 2
7 3 f 4
8 4 a 6
9 4 c 1
10 4 e 7
11 4 f 7
df = df[df.groupby('id')['value'].transform('max') == df['value']]
print (df)
id other_value value
2 1 b 5
5 2 d 6
7 3 f 4
10 4 e 7
11 4 f 7
I have a dataset where two columns have an almost perfect correlation, meaning that when one column has a certain value, there is a very high chance that the second column will have a particular corresponding value. Example:
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'B': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e', 'f', 'f', 'g']})
print(df)
Out[6]:
A B
0 1 a
1 1 a
2 1 a
3 1 a
4 1 a
5 1 a
6 2 b
7 2 c
8 3 d
9 3 d
10 4 e
11 4 e
12 5 f
13 5 f
14 5 g
When column A has a value of 1, B will have a - that's a perfect correlation, as there is no A value of 1 with a B value different from a. The same holds for 3->d and 4->e.
5 and 2 are not perfectly correlated.
How can I find all the A values that have more than one matching B value, so I can print them all out?
In this case, my desired output would be something like
find_imperfect_correlations(df, 'A', 'B')
Out[7]:
2 -> 'b', 'c'
5 -> 'f', 'g'
EDIT:
Preferably a generalized answer for when the dtype of B could be ints, dates, etc.
def find_imperfect_correlations(df, col1, col2):
    df_out = (df.groupby(col1)
                .filter(lambda x: x[col2].nunique() > 1)
                .groupby(col1)[col2]
                .apply(lambda x: x.unique()))
    for key, vals in df_out.items():
        print(str(key) + ' -> ' + str(vals))

find_imperfect_correlations(df, 'A', 'B')
find_imperfect_correlations(df, 'A', 'B')
Output:
2 -> ['b' 'c']
5 -> ['f' 'g']
IIUIC, you can
In [562]: s = df.groupby('A')['B'].unique()
In [563]: s[s.str.len() > 1]
Out[563]:
A
2 [b, c]
5 [f, g]
dtype: object
Or
In [564]: s[s.str.len() > 1].str.join(', ')
Out[564]:
A
2 b, c
5 f, g
dtype: object
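If you want this wrapped in the function signature the question asks for, and generalized so that col2 can hold ints, dates, etc. (as the EDIT requests), here is a sketch along the same lines, using map(len) instead of .str.len():
def find_imperfect_correlations(df, col1, col2):
    # unique col2 values per col1 value; keep only groups with more than one
    s = df.groupby(col1)[col2].unique()
    for key, vals in s[s.map(len) > 1].items():
        print(f"{key} -> {', '.join(map(str, vals))}")

find_imperfect_correlations(df, 'A', 'B')
# 2 -> b, c
# 5 -> f, g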
This one would also work:
grouped = df.groupby('A').B.nunique()
df_grouped = grouped.to_frame().reset_index()
decorrelated = df_grouped[df_grouped['B'] > 1]
print(decorrelated['A'])
The first line counts the distinct values in column B for each value in column A. The second line converts the resulting series to a dataframe. The third line selects the rows where the number of distinct values is greater than 1. Then the last line prints the A values.
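To also list which B values belong to each of those A values (closer to the output shown in the question), a small sketch building on decorrelated from above:
# Sketch: look up the distinct B values for the imperfectly correlated A values.
b_values = df[df['A'].isin(decorrelated['A'])].groupby('A')['B'].unique()
for a, bs in b_values.items():
    print(f"{a} -> {', '.join(map(str, bs))}")
# 2 -> b, c
# 5 -> f, g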