For the DataFrame below, I want to know, for each value of the X index (i.e. 1, 2 and 3), which values appear in the Y index and how many there are.
So for X index 1 the Y values are a, b and c. I then want to know whether that set equals the Y values under X indices 2 and 3. Here the Y values for X index 1 equal those for X index 3 (both have a, b and c), while the values for X index 2 are not the same.
X Y
1 a A
b B
c C
2 a A
b B
3 a A
b B
c D
If the longest X group contains all possible unique Y values, you can reshape with Series.unstack:
print (type(s))
<class 'pandas.core.series.Series'>
print (s.unstack())
Y a b c
X
1 A B C
2 A B NaN
3 A B D
Then remove incomplete rows, i.e. rows with missing values, with DataFrame.dropna:
df1 = s.unstack().dropna()
print (df1)
Y a b c
X
1 A B C
3 A B D
print (df1.columns.tolist())
['a', 'b', 'c']
print (len(df1))
2
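The steps above can be reproduced end to end; the Series `s` below is a hypothetical reconstruction of the data shown in the question:

```python
import pandas as pd

# hypothetical reconstruction of the MultiIndex Series from the question
s = pd.Series(['A', 'B', 'C', 'A', 'B', 'A', 'B', 'D'],
              index=pd.MultiIndex.from_tuples(
                  [(1, 'a'), (1, 'b'), (1, 'c'),
                   (2, 'a'), (2, 'b'),
                   (3, 'a'), (3, 'b'), (3, 'c')],
                  names=['X', 'Y']))

# Y labels become columns; X groups missing a label get NaN there,
# so dropna keeps only the groups with the full set of Y values
df1 = s.unstack().dropna()
print(df1.columns.tolist())  # ['a', 'b', 'c']
print(len(df1))              # 2
```

Only X indices 1 and 3 survive, since index 2 has no 'c'.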
I am trying to get the value of one column corresponding to the highest cumulative sum of a different column. My dataframe looks like this:
df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'], 'value': [1, 3, 1, 5, 1, 9, 2]})
indx constant value
0 a 1
1 b 3
2 b 1
3 c 5
4 c 1
5 d 9
6 a 2
I am trying to add a new field, with the constant that has the highest cumulative sum of value up to that point in the dataframe. The final dataframe would look like this:
indx constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
As you can see, at index 1, a has the highest cumulative sum of value for all prior rows. At index 2, b has the highest cumulative sum of value for all prior rows, and so on.
Anyone have a solution?
As presented, you just need a shift. However, try the following for other scenarios.
Steps
Find the cumulative maximum
Where the cumulative max equals df['value'], copy 'constant'; otherwise insert NaN
Forward-filling the NaNs then broadcasts the constant corresponding to the max value
Outcome
df = df.assign(new_field=np.where(df['value'] == df['value'].cummax(), df['constant'], np.nan)).ffill()
df = df.assign(new_field=df['new_field'].shift())
constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
I think you should try and approach this as a pivot table, which would allow you to use np.argmax over the column axis.
import numpy as np

# cumulative sum of `value` for each level of `constant`, one column per constant
X = df.pivot_table(
    index=df.index,
    columns=['constant'],
    values='value'
).fillna(0.0).cumsum(axis=0)
# index of the column that maximizes the cumulative value in each row - i.e., the "winner"
colix = np.argmax(X.values, axis=1)
# fetch the corresponding column names, shifted one row down so each row sees only prior rows
df['winner'] = np.r_[[np.nan], X.columns[colix].values[:-1]]
# and there you go
df
constant value winner
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
You should be a little more careful (values can be negative, which decreases the cumsum); here is what you probably need to do:
df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()
df["new"] = np.where(df["cumsum"] == df["cummax"], df['constant'], np.nan)
df["new"] = df.ffill()["new"].shift()
df
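The snippet above, made self-contained with the question's data (a minimal sketch; `new` corresponds to the question's `new_field`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'],
                   'value': [1, 3, 1, 5, 1, 9, 2]})

# running total, and the highest running total seen so far
# (they differ only when negative values pull the cumsum back down)
df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()

# rows that set a new high carry their 'constant'; ffill propagates it,
# and shift() restricts the answer to "all prior rows"
df["new"] = np.where(df["cumsum"] == df["cummax"], df["constant"], np.nan)
df["new"] = df["new"].ffill().shift()
print(df["new"].tolist())  # [nan, 'a', 'b', 'b', 'c', 'c', 'd']
```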
Having the following data frame, I'd like to find the 2 most frequent groups in the 'first' column, and within each group find the 2 most frequent groups in the 'second' column.
df = pd.DataFrame({'first': list('cbbcbcbccabc'), 'second': list('zvvzwyzyxxwz')})
df
gets
first second
0 c z
1 b v
2 b v
3 c z
4 b w
5 c y
6 b z
7 c y
8 c x
9 a x
10 b w
11 c z
and by df.groupby(['first']).size() we get
first
a 1
b 5
c 6
so, 'c' and 'b' are the most frequent items in the 'first' column. We want the 2 most frequent items in the 'second' column within 'c' and 'b' groups. If we do df.groupby(['first', 'second']).size() we get
first second
a x 1
b v 2
w 2
z 1
c x 1
y 2
z 3
therefore we're interested in 'z' and 'y' within 'c', and 'v' and 'w' within 'b', that is
first second
c z 3
y 2
b v 2
w 2
I think Series.value_counts can be used here, because it sorts by counts by default: first filter the top-2 values of 'first', then take the top-2 'second' values per group, restoring the group order with reindex by idx:
Notice - filtering by m is not strictly necessary, but it improves performance (only 2 groups are processed instead of all of them)
df = pd.DataFrame({'first': list('cbbcbcbccabc'), 'second': list('zvvzwyzyxxwz')})
idx = df['first'].value_counts().head(2).index
m = df['first'].isin(idx)
df = (df[m].groupby(['first'])['second']
.apply(lambda x: x.value_counts().iloc[:2])
.reindex(idx, level=0)
.rename_axis(['first','second']))
print (df)
first second
c z 3
y 2
b w 2
v 2
Name: second, dtype: int64
Solution for 3 levels:
df = pd.DataFrame({'second': list('cbbcbcbccabc'),
'third': list('zvvzwyzyxxwz')})
#3 column df
df = (pd.concat([df, df], keys=('a','b'))
.reset_index(level=1, drop=True)
.rename_axis('first')
.reset_index())
# print (df)
idx = df['first'].value_counts().head(2).index
m = df['first'].isin(idx)
idx1 = (df[m].groupby(['first'])['second']
.apply(lambda x: x.value_counts().iloc[:2])
.index)
print (idx1)
df = df.set_index(['first','second'])
df = (df.loc[idx1].groupby(['first','second'], sort=False)['third']
.apply(lambda x: x.value_counts().iloc[:2])
.rename_axis(['first','second','third']))
print (df)
first second third
a c z 3
y 2
b w 2
v 2
b c z 3
y 2
b w 2
v 2
Name: third, dtype: int64
Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How to get df2 where, for each value in column C of df1, a new column D has the difference between the df1 values in column B where A == 'a' and where A == 'b':
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns = ['A'],values = 'B', index = 'C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
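A minimal runnable sketch of the pivot-table approach, using a hypothetical `df1` reconstructed from the table in the question:

```python
import pandas as pd

# hypothetical reconstruction of df1 from the question
df1 = pd.DataFrame({'A': list('ababab'),
                    'B': [7, 3, 5, 4, 5, 3],
                    'C': list('xxyyzz')})

# pivot_table aligns the 'a' and 'b' rows by C, regardless of row order
df = df1.pivot_table(columns=['A'], values='B', index='C')
df2 = pd.DataFrame({'D': df['a'] - df['b']}).reset_index()
print(df2['D'].tolist())  # [4.0, 1.0, 2.0]
```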
Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':['A','A','B','B','C','C'],
'B':['Y','Y','N','N','Y','N'],
})
df
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C N
I need a line of code that:
1. identifies whether there is more than 1 unique value in column B for each category of A (i.e. category "C" in column A has 2 unique values in column B, whereas categories "A" and "B" each have only 1).
2. changes the value in column B to "Y" only if there is more than 1 unique value for that category (i.e. column B should have "Y" for both rows of category "C" in column A).
Here's the desired result:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
Thanks in advance!
You could:
df['B'] = df.groupby('A')['B'].transform(lambda x: 'Y' if x.nunique() > 1 else x)
to get:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
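A self-contained version of that one-liner, rebuilding the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'B': ['Y', 'Y', 'N', 'N', 'Y', 'N']})

# within each A-group: if B has more than one unique value, broadcast 'Y';
# otherwise keep the group's values as they are
df['B'] = df.groupby('A')['B'].transform(
    lambda x: 'Y' if x.nunique() > 1 else x)
print(df['B'].tolist())  # ['Y', 'Y', 'N', 'N', 'Y', 'Y']
```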
This should work:
import pandas as pd
df = pd.DataFrame(
{'A':['A','A','B','B','C','C'],
'B':['Y','Y','N','N','Y','N'],
})
# Get unique items in each column A group
group_counts = df.groupby('A').B.apply(lambda x: len(x.unique()))
# Find all of them with more than 1 unique value
cols_to_impute = group_counts[group_counts > 1].index.values
# Change column B to 'Y' for such groups
df.loc[df.A.isin(cols_to_impute),'B'] = 'Y'
In [20]: df
Out[20]:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
Is there any way to merge two data frames while one of them has duplicated indices such as following:
dataframe A:
value
key
a 1
b 2
b 3
b 4
c 5
a 6
dataframe B:
number
key
a I
b X
c V
after merging, I want to have a data frame like the following:
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Or maybe there are better ways to do it using groupby?
>>> a.join(b).sort_values('value')
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Use join:
>>> a = pd.DataFrame(range(1,7), index=list('abbbca'), columns=['value'])
>>> b = pd.DataFrame(['I', 'X', 'V'], index=list('abc'), columns=['number'])
>>> a.join(b)
value number
a 1 I
a 6 I
b 2 X
b 3 X
b 4 X
c 5 V
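For reference, `join` here is equivalent to a left merge on both indices; a minimal sketch of that alternative (same data as above):

```python
import pandas as pd

a = pd.DataFrame(range(1, 7), index=list('abbbca'), columns=['value'])
b = pd.DataFrame(['I', 'X', 'V'], index=list('abc'), columns=['number'])

# a left merge on both indices does the same many-to-one alignment as join
merged = a.merge(b, left_index=True, right_index=True, how='left')
print(merged.sort_values('value')['number'].tolist())
# ['I', 'X', 'X', 'X', 'V', 'I']
```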