Aggregate over difference of levels of factor in Pandas DataFrame?


Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2 where, for each value of C in df1, a new column D holds the difference between the df1 values in column B where A == a and where A == b:
C D
0 x 4
1 y 1
2 z 2

I'd use a pivot table:
df = df1.pivot_table(columns = ['A'],values = 'B', index = 'C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
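As a minimal sketch of why the pivot approach does not depend on row order, the example below rebuilds df1 from the question, reverses its rows so that b comes before a within each value of C, and still gets the same differences. The names `shuffled` and `wide` are illustrative only. Note that `pivot_table` aggregates with the mean by default, so D comes back as floats.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['a', 'b', 'a', 'b', 'a', 'b'],
    'B': [7, 3, 5, 4, 5, 3],
    'C': ['x', 'x', 'y', 'y', 'z', 'z'],
})

# Reverse the rows so that 'b' precedes 'a' for every value of C.
shuffled = df1.iloc[::-1]

# One row per C, one column per level of A (default aggfunc is the mean).
wide = shuffled.pivot_table(index='C', columns='A', values='B')

# Columns are selected by label, so the original row order is irrelevant.
df2 = (wide['a'] - wide['b']).rename('D').reset_index()
print(df2)
```

If a combination of A and C can occur more than once, the default mean silently averages the duplicates, so it may be worth passing an explicit `aggfunc`.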

Related

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match columns A and B, based on column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (
    pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                  left_by='A', right_by='B', direction='backward', tolerance=2)
    .dropna()
    .drop(labels='idx', axis='columns')
    .reset_index(drop=True)
)
Gets me what I want.

IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
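For reference, here is a self-contained sketch of the `merge_asof` call from the question's edit, run on the sample frames. It assumes `idx` is an ordinary, already sorted column (the post does not say whether it is a column or the index), and the final integer cast of B is an addition for readability.

```python
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6),
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6),
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# For each row of df_1, take the most recent row of df_2 (idx not later than
# the left idx, and at most 2 positions earlier) whose B equals the left A.
matched = pd.merge_asof(df_1, df_2, on='idx',
                        left_by='A', right_by='B',
                        direction='backward', tolerance=2)

# Rows without a partner carry NaN in B and Y; drop them and tidy up.
df_result = (matched.dropna()
                    .drop(columns='idx')
                    .astype({'B': int})
                    .reset_index(drop=True))
print(df_result)
```

The `tolerance` of 2 is what keeps the row with idx = 4 in df_1 from pairing with a much earlier B == 1 row in df_2.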

Create and fill new columns using values in rows pandas

I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each value in Col2 of Dataframe A (in this case, X and Y), and fill them with the values in column "Value" after merging the two dataframes on Col1. Like this:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!

B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If Col2 is going to have hundreds of distinct values, run the same assignment in a loop, like this:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)
B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
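The answer above relies on the rows of A appearing in the same Col1 order as B. As an alternative that is not part of that answer, a reshape with `DataFrame.pivot` followed by a merge on Col1 aligns by key instead of by position. This is only a sketch; the names `wide` and `result` are illustrative.

```python
import pandas as pd

A = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Col2': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Value': [1, 2, 3, 2, 5, 4],
})
B = pd.DataFrame({'Col1': ['A', 'B', 'C']})

# One row per Col1, one column per distinct Col2 value.
wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None  # drop the leftover 'Col2' axis label

# Merge on the key so the row order of A no longer matters.
result = B.merge(wide, on='Col1', how='left')
print(result)
```

`pivot` raises an error if a (Col1, Col2) pair occurs more than once, which is usually preferable to silently misaligned values.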

Pandas: How do I repeat dataframe for each value in a series?

I have a dataframe (df) as such:
A B
1 a
2 b
3 c
And a series:
S = pd.Series(['x','y','z'])
I want to repeat the dataframe df for each value in the series. The expected result looks like this:
result:
S A B
x 1 a
y 1 a
z 1 a
x 2 b
y 2 b
z 2 b
x 3 c
y 3 c
z 3 c
How do I achieve this kind of output? I'm thinking of merge or join, but merging gives me a memory error. I am dealing with a rather large dataframe and series. Thanks!

Using numpy, let's say you have a series and a df of different lengths:
s= pd.Series(['X', 'Y', 'Z', 'A']) #added a character to s to make it length 4
s_n = len(s)
df_n = len(df)
pd.DataFrame(np.repeat(df.values,s_n, axis = 0), columns = df.columns, index = np.tile(s,df_n)).rename_axis('S').reset_index()
S A B
0 X 1 a
1 Y 1 a
2 Z 1 a
3 A 1 a
4 X 2 b
5 Y 2 b
6 Z 2 b
7 A 2 b
8 X 3 c
9 Y 3 c
10 Z 3 c
11 A 3 c

UPDATE:
Here is a slightly modified version of @A-Za-z's solution, which might save a bit of memory but is slower:
x = pd.DataFrame(index=range(len(df) * len(S)))
for col in df.columns:
    # use the underlying array so the repeated values are assigned by position
    x[col] = np.repeat(df[col].values, len(S))
x['S'] = np.tile(S, len(df))
Old incorrect answer:
In [94]: pd.concat([df.assign(S=S)] * len(s))
Out[94]:
A B S
0 1 a x
1 2 b y
2 3 c z
0 1 a x
1 2 b y
2 3 c z
0 1 a x
1 2 b y
2 3 c z

Setup
df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 'a', 1: 'b', 2: 'c'}})
S = pd.Series(['x','y','z'], name='S')
Solution
#Convert the Series to a Dataframe with desired shape of the output filled with S values.
#Join df_S to df to get As and Bs
df_S = pd.DataFrame(index=np.repeat(S.index,3), columns=['S'], data= np.tile(S.values,3))
df_S.join(df)
Out[54]:
S A B
0 x 1 a
0 y 1 a
0 z 1 a
1 x 2 b
1 y 2 b
1 z 2 b
2 x 3 c
2 y 3 c
2 z 3 c
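The solution above hard-codes 3 for both lengths. A hedged sketch of the same repeat/tile/join idea with the lengths taken from the data is shown below; the fourth value in S is added only to show that the two lengths may differ, and `result` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
S = pd.Series(['x', 'y', 'z', 'w'], name='S')  # length 4 on purpose

# Each df index label is repeated len(S) times; S is tiled once per df row.
df_S = pd.DataFrame(
    {'S': np.tile(S.to_numpy(), len(df))},
    index=np.repeat(df.index, len(S)),
)

# Joining on the repeated index pulls in the matching A and B values.
result = df_S.join(df).reset_index(drop=True)
print(result)
```

The output still has len(df) * len(S) rows, so this does not reduce the size of the result itself, only the intermediate work needed to build it.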

Pandas conditionally replace value if >1 unique values for other column

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':['A','A','B','B','C','C'],
'B':['Y','Y','N','N','Y','N'],
})
df
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C N
I need a line of code that:
1. identifies if there are more than 1 unique values in column B for each category of A (i.e. category "C" in column A has 2 unique values in column B whereas categories "A" and "B" in column A only have 1 unique value each).
2. Changes the value in column B to "Y" only if there is more than 1 unique value for that category (i.e. column B should have "Y" for both rows of category "C" in column A).
Here's the desired result:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
Thanks in advance!

You could:
df['B'] = df.groupby('A')['B'].transform(lambda x: 'Y' if x.nunique() > 1 else x)
to get:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y

This should work:
import pandas as pd
df = pd.DataFrame(
{'A':['A','A','B','B','C','C'],
'B':['Y','Y','N','N','Y','N'],
})
# Get unique items in each column A group
group_counts = df.groupby('A').B.apply(lambda x: len(x.unique()))
# Find all of them with more than 1 unique value
cols_to_impute = group_counts[group_counts > 1].index.values
# Change column B to 'Y' for such columns
df.loc[df.A.isin(cols_to_impute),'B'] = 'Y'
In [20]: df
Out[20]:
A B
0 A Y
1 A Y
2 B N
3 B N
4 C Y
5 C Y
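Both answers come down to counting the distinct B values within each A group. A compact sketch of that idea, using `transform('nunique')` to build a row-aligned boolean mask, might look like this (the name `mixed` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'A', 'B', 'B', 'C', 'C'],
    'B': ['Y', 'Y', 'N', 'N', 'Y', 'N'],
})

# Number of distinct B values in each A group, broadcast back to every row.
mixed = df.groupby('A')['B'].transform('nunique') > 1

# Overwrite B only in groups that contain more than one distinct value.
df.loc[mixed, 'B'] = 'Y'
print(df)
```

Because the mask has the same index as df, no intermediate list of group labels is needed.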

pandas get last value of column x when column y is equal to z

Suppose I create a pandas DataFrame with two columns, one of which contains some numbers and the other contains letters. Like this:
import pandas as pd
from pprint import pprint
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
pprint(df)
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Now say that I want to make a third column (c) whose value is equal to the last value of a when b was equal to x. In the cases where a value of x was not encountered in b yet, the value in c should default to 0.
The procedure should produce pretty much the following result:
last_a = 0
c = []
for i,b in enumerate(df['b']):
    if b == 'x':
        last_a = df.iloc[i]['a']
    c += [last_a]
df['c'] = c
pprint(df)
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
Is there a more elegant way to accomplish this either with or without pandas?

In [140]: df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
In [141]: df
Out[141]:
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Find the rows where column 'b' == 'x' and take the value of 'a' there (the value, not the location); every other row becomes NaN
In [142]: df['c'] = df.loc[df['b']=='x','a']
Fill the rest of the values forward, then fill the remaining holes with 0 and restore the integer dtype
In [143]: df['c'] = df['c'].ffill().fillna(0).astype(int)
In [144]: df
Out[144]:
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
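The mask-then-forward-fill steps above can also be written as a single chained expression. The sketch below uses `Series.where` in place of the `.loc` assignment, which is a substitution rather than the answer's exact code:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': ['y', 'x', 'y', 'x', 'y', 'y']})

df['c'] = (
    df['a']
    .where(df['b'].eq('x'))  # keep 'a' only where b == 'x', NaN elsewhere
    .ffill()                 # carry the last seen value forward
    .fillna(0)               # rows before the first 'x' default to 0
    .astype(int)             # NaN forced a float dtype; convert back
)
print(df)
```

The `astype(int)` step assumes column a holds integers, as it does in the question.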
