I consulted a lot of the posts on ValueError: cannot reindex from a duplicate axis (What does `ValueError: cannot reindex from a duplicate axis` mean? and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can't quite figure out what exactly is throwing the error.
Below is my best attempt at reproducing the spirit of the dataframe, and it does throw the error.
d = {"id" : [1,2,3,4,5],
"cata" : [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
"catb" : [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
"catc" : [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}
df = pd.DataFrame(d)
df.head()
id cata catb catc
0 1 [aaa1, bbb2, ccc3] [ddd1, eee2, fff3, ggg4] [hhh1, iii2, jjj3, kkk4, lll5]
1 2 [aaa4, bbb5, ccc6] [ddd5, eee6, fff7, ggg8] [hhh6, iii7, jjj8, kkk9, lll10]
2 3 [aaa7, bbb8, ccc9] [ddd9, eee10, fff11, ggg12] [hhh11, iii12, jjj13, kkk14, lll15]
3 4 [aaa10, bbb11, ccc12] [ddd13, eee14, fff15, ggg16] [hhh16, iii17, jjj18, kkk18, lll19]
4 5 [aaa13, bbb14, ccc15] [ddd17, eee18, fff19, ggg20] [hhh20, iii21, jjj22, kkk23, lll24]
df.set_index(['id']).apply(pd.Series.explode).reset_index()
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
The dataset I'm using is a few hundred MBs and it's a pain - lots of lists inside lists, but the example above is a fair representation of where I'm stuck. Even when I try to generate a fake dataframe with unique values, I still don't understand why I'm getting the ValueError.
I have explored other ways to explode the lists, like df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1), which doesn't throw a ValueError; however, it's definitely not as fast, and I'd probably have to reconsider how I'm processing the df. Still, I want to understand why I'm getting the ValueError when I don't have any obvious duplicate values.
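For what it's worth, exploding each column on its own (a minimal sketch against the example df above) shows where the duplicate axis likely comes from: each column explodes to a different length per id, so pandas cannot realign the per-column results on the now-duplicated 'id' index when apply() reassembles them.
s = df.set_index('id')
print(s['cata'].explode().shape)  # (15,) -- 3 items per id
print(s['catb'].explode().shape)  # (20,) -- 4 items per id
print(s['catc'].explode().shape)  # (25,) -- 5 items per id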
Thanks!
Adding the desired output below, which I generated by chaining apply/stack and dropping levels.
id cata catb catc
0 1 aaa1 ddd1 hhh1
1 1 bbb2 eee2 iii2
2 1 ccc3 fff3 jjj3
3 1 NaN ggg4 kkk4
4 1 NaN NaN lll5
5 2 aaa4 ddd5 hhh6
6 2 bbb5 eee6 iii7
7 2 ccc6 fff7 jjj8
8 2 NaN ggg8 kkk9
9 2 NaN NaN lll10
10 3 aaa7 ddd9 hhh11
11 3 bbb8 eee10 iii12
12 3 ccc9 fff11 jjj13
13 3 NaN ggg12 kkk14
14 3 NaN NaN lll15
15 4 aaa10 ddd13 hhh16
16 4 bbb11 eee14 iii17
17 4 ccc12 fff15 jjj18
18 4 NaN ggg16 kkk18
19 4 NaN NaN lll19
20 5 aaa13 ddd17 hhh20
21 5 bbb14 eee18 iii21
22 5 ccc15 fff19 jjj22
23 5 NaN ggg20 kkk23
24 5 NaN NaN lll24
I couldn't solve the pd.Series.explode() error itself, but a long form with an 'id' column can be created:
tmp = pd.concat([df['id'], df['cata'].apply(pd.Series), df['catb'].apply(pd.Series), df['catc'].apply(pd.Series)], axis=1)
tmp2 = tmp.unstack().to_frame().reset_index()
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2.drop('level_1', axis=1, inplace=True)
tmp2 = tmp2.rename(columns={'level_0': 'id', 0: 'value'})
tmp2.reset_index(drop=True, inplace=True)
id value
0 0 aaa1
1 0 aaa4
2 0 aaa7
3 0 aaa10
4 0 aaa13
5 1 bbb2
6 1 bbb5
7 1 bbb8
8 1 bbb11
9 1 bbb14
10 2 ccc3
11 2 ccc6
12 2 ccc9
...
I had to rethink how I was parsing the data. What I accidentally omitted from this post was that I ended up with unbalanced lists as a consequence of using .str.findall(regex_pattern).to_frame() on different columns. The unbalanced lists resulted because certain metadata fields were missing over the years (e.g., "name"). Because I started with a column of lists of lists, I exploded that using df.explode and then used findall to extract patterns to new columns, which meant that null values could be created too.
For a 500MB dataset of several hundred thousand rows of fields with string type data, the whole process took probably less than 5 min.
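To illustrate (a hypothetical sketch; the pattern and strings below are invented, not from my dataset), .str.findall() naturally produces ragged lists when a field is absent:
meta = pd.Series(['name: ann; year: 1999', 'year: 2001'])
print(meta.str.findall(r'(\w+):').tolist())
# [['name', 'year'], ['year']] -> unequal list lengths per row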
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"id" : [1,2,3],
0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']]},
)
print(df)
"""
id 0 1 2
0 1 [x, y, z] [a, b, c] [a, b, c]
1 2 [a, b, c] [a, b, c] [x, y, z]
2 3 [a, b, c] [a, b, c] [a, b, c]
"""
Stacking first avoids the error entirely: stack() produces a single Series (one list per (id, column) pair), so explode() never has to realign several columns of different lengths.
bb = (
    df.set_index('id').stack().explode()
      .reset_index(name='val')
      .drop(columns='level_1')
)
print (bb)
"""
id val
0 1 x
1 1 y
2 1 z
3 1 a
4 1 b
5 1 c
6 1 a
7 1 b
8 1 c
9 2 a
10 2 b
11 2 c
12 2 a
13 2 b
14 2 c
15 2 x
16 2 y
17 2 z
18 3 a
19 3 b
20 3 c
21 3 a
22 3 b
23 3 c
24 3 a
25 3 b
26 3 c
"""
Because every list in a given row has the same length here, the apply(pd.Series.explode) pattern from the question works without complaint:
aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
id 0 1 2
0 1 x a a
1 1 y b b
2 1 z c c
3 2 a a x
4 2 b b y
5 2 c c z
6 3 a a a
7 3 b b b
8 3 c c c
"""
I'm trying to drop rows from a df where certain conditions are met. Below, I'm grouping values using column C. For each unique group, I want to drop ALL of its rows if A is less than 1 AND B is greater than 100 anywhere in the group; both conditions have to occur on the same row, though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
'A' : [1,0,1,0,1,0,0,1,0,1],
'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
(As written, this raises a TypeError: GroupBy.filter expects the function to return a single boolean per group, not a row-wise Series.)
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition, and then filter the original column C by Series.isin in boolean indexing with an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
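Broken into steps (the intermediate names here are just for illustration):
bad_rows = df['A'].lt(1) & df['B'].gt(100)   # rows that violate the rule
bad_groups = df.loc[bad_rows, 'C']           # the C groups containing them
df1 = df[~df['C'].isin(bad_groups)]          # keep only the untouched groups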
Another idea is to use GroupBy.transform with 'any' to test whether at least one row in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible with any plus not, to reduce each group to a scalar boolean (the second line is the same test after De Morgan's laws); on a large DataFrame it should be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
I'm working with a DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'group' : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
'brand' : ['A', 'B', 'X', 'A', 'B', 'C', 'X', 'B', 'C', 'X', 'A', 'B'],
'code' : [2185, 2185, 0, 1410, 1390, 1390, 0, 3670, 4870, 0, 2000, 0]})
print(df)
group brand code
0 1 A 2185
1 1 B 2185
2 1 X 0
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
10 4 A 2000
11 4 B 0
My goal is to view only the groups having at least two different codes. Missing codes, labelled with 0's, should not be taken into consideration in the filtering criterion. For example, even though the two records from group 4 have different codes, we don't keep this group in the final DataFrame since one of the codes is missing.
The resulting DataFrame on the above example should look like this:
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
I didn't manage to do much with this problem. I think that the first step should be to create a mask to remove the records with a missing (0) code. Something like:
mask = df['code'].eq(0)
df = df[~mask]
print(df)
group brand code
0 1 A 2185
1 1 B 2185
3 2 A 1410
4 2 B 1390
5 2 C 1390
7 3 B 3670
8 3 C 4870
10 4 A 2000
And now only keep the groups having at least two different codes, but I don't know how to work this out in Python. Also, this method would remove the records with a missing code from my final DataFrame, which I don't want. I want a view of the full group.
Any additional help would be appreciated.
This is a job for transform():
mask = (df.groupby('group')['code']
.transform(lambda x: x.mask(x==0) # mask out the 0 values
.nunique() # count the nunique
)
.gt(1)
)
df[mask]
Output:
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
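For intuition, the intermediate transform (before .gt(1)) is the count of distinct non-zero codes per group, broadcast back to every row:
df.groupby('group')['code'].transform(lambda x: x.mask(x == 0).nunique())
# 0     1
# 1     1
# 2     1
# 3     2
# 4     2
# 5     2
# 6     2
# 7     2
# 8     2
# 9     2
# 10    1
# 11    1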
Option 2: Similar idea, but without the lambda function:
mask = (df['code'].mask(df['code']==0) # mask out the 0 values
.groupby(df['group']) # groupby
.transform('nunique') # count uniques
.gt(1) # at least 2
)
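Then select rows the same way as before:
df[mask]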
We can also use groupby.filter:
df.groupby('group').filter(lambda x: x.code.mask(x.code.eq(0)).nunique()>1)
or, probably faster than the previous one:
import numpy as np

(df.assign(code=df['code'].replace(0, np.nan))
   .groupby('group')
   .filter(lambda x: x.code.nunique() > 1)
   .fillna({'code': 0}))
Output
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
I have the following data frame.
>>> df = pd.DataFrame({'selected': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'D'], 'presented': ['A|B|D', 'B|D|A', 'A|B|C', 'D|C|B|A','A|C|D|B', 'D|B|C','D|C|B|A','D|B|C']})
>>> df
This is a large data set with 500K rows (a date column was taken out to keep the example simple):
selected presented
0 A A|B|D
1 B B|D|A
2 C A|B|C
3 A D|C|B|A
4 B A|C|D|B
5 C D|B|C
6 A D|C|B|A
7 D D|B|C
The goal is to calculate the selected/presented ratio for each item in the selected column. For example, A was presented to the user 6 times but was selected only 3 of those 6 times, for a ratio of 0.5.
I would like to create following resulting data.frame:
item, selected, presented, ratio
A, 3, 6, 0.5
B, 2, 8, 0.25
I started with the following, but I can't figure out the grouping, because if I just group by selected and start counting, it only captures the selections, not all of the times each item was presented.
>>> df['ratio'] = df.apply(lambda x:1 if x.selected in x.presented.split('|') else 0, axis=1)
>>> df
selected presented ratio
0 A A|B|D 1
1 B B|D|A 1
2 C A|B|C 1
3 A D|C|B|A 1
4 B A|C|D|B 1
5 C D|B|C 1
6 A D|C|B|A 1
7 D D|B|C 1
You can use get_dummies + value_counts, then concat the results:
s1 = df.presented.str.get_dummies('|').sum().to_frame('presented')
s2 = df.selected.value_counts()  # on pandas >= 2.0, append .rename('selected') to keep this column name
yourdf = pd.concat([s1, s2], axis=1, sort=True)
yourdf['ratio'] = yourdf['selected'] / yourdf['presented']
yourdf
Out[488]:
presented selected ratio
A 6 3 0.500000
B 8 2 0.250000
C 6 2 0.333333
D 7 1 0.142857
How about this one-liner:
df['presented'].str.split('|', expand=True).stack().value_counts(sort=False).to_frame('presented')\
.assign(selected = df['selected'].value_counts())\
.eval('ratio = selected / presented')
Output:
presented selected ratio
A 6 3 0.500000
C 6 2 0.333333
B 8 2 0.250000
D 7 1 0.142857
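Another hedged sketch along the same lines, for pandas >= 0.25 where Series.explode is available:
presented = df['presented'].str.split('|').explode().value_counts()
selected = df['selected'].value_counts()
out = pd.concat({'presented': presented, 'selected': selected}, axis=1)
out['ratio'] = out['selected'] / out['presented']
out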
I know this is probably a basic question, but somehow I can't find the answer. I was wondering how it's possible to return a value from a dataframe if I know the row and column to look for? E.g. If I have a dataframe with columns 1-4 and rows A-D, how would I return the value for B4?
You can use ix for this (note: .ix was deprecated in pandas 0.20 and removed in 1.0; see the .loc equivalent after the example):
In [236]:
df = pd.DataFrame(np.random.randn(4,4), index=list('ABCD'), columns=[1,2,3,4])
df
Out[236]:
1 2 3 4
A 1.682851 0.889752 -0.406603 -0.627984
B 0.948240 -1.959154 -0.866491 -1.212045
C -0.970505 0.510938 -0.261347 -1.575971
D -0.847320 -0.050969 -0.388632 -1.033542
In [237]:
df.ix['B',4]
Out[237]:
-1.2120448782618383
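On modern pandas, the same label-based lookup is done with .loc, which replaces the removed .ix:
df.loc['B', 4]  # same value as df.ix['B', 4] above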
Use at, if rows are A-D and columns 1-4:
print (df.at['B', 4])
If rows are 1-4 and columns A-D:
print (df.at[4, 'B'])
Fast scalar value getting and setting.
Sample:
df = pd.DataFrame(np.arange(16).reshape(4,4),index=list('ABCD'), columns=[1,2,3,4])
print (df)
1 2 3 4
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
D 12 13 14 15
print (df.at['B', 4])
7
df = pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4], columns=list('ABCD'))
print (df)
A B C D
1 0 1 2 3
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
print (df.at[4, 'B'])
13
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
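For intuition, df.index.repeat(df.n) alone yields the repeated labels that .loc then materializes (the exact Index repr varies across pandas versions):
print(df.index.repeat(df.n))
# Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')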
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np

df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
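A hedged illustration of that pitfall, with hypothetical duplicate labels (not from the question):
dup = what_i_have.copy()
dup.index = [0, 0, 1]  # duplicate labels, for illustration
print(len(dup.loc[np.repeat(dup.index.values, dup.n)]))      # 9 rows: each label-0 lookup matches two rows
print(len(dup.iloc[np.repeat(np.arange(len(dup)), dup.n)]))  # 6 rows, as intended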
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share it: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the index.