Pandas - Attach column to a DataFrame - python

I have two dataframes, which for simplicity look like:
A B C D E
1 2 3 4 5
5 4 3 2 1
1 3 5 7 9
9 7 5 3 1
And the second one looks like:
F
0
1
0
1
So, both dataframes have the SAME number of rows.
I want to attach column F to the first dataframe:
A B C D E F
1 2 3 4 5 0
5 4 3 2 1 1
1 3 5 7 9 0
9 7 5 3 1 1
I have already tried various methods such as joins, iloc, and adding df['F'] manually, but I can't find an answer. Most of the time F is added to the dataframe but filled with NaN: the rows that came from the first dataframe have NaN in F, and the result has double the number of rows, with the extra rows holding NaN everywhere except F, where the data is OK.

It seems you want to add column F to the first dataframe regardless of the index of either dataframe. In that case, assign the underlying ndarray of column F, which bypasses index alignment:
df1['F'] = df2['F'].to_numpy()
Out[131]:
A B C D E F
0 1 2 3 4 5 0
1 5 4 3 2 1 1
2 1 3 5 7 9 0
3 9 7 5 3 1 1
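The NaN symptoms described in the question are what pandas produces when the two indexes do not share labels. The sketch below reproduces that under an assumed, hypothetical index for df2 (10 to 13, as might be left over after filtering); the real indexes in the question are not shown, so this is only an illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 1, 9], "E": [5, 1, 9, 1]})
# Hypothetical: df2 carries labels that do not occur in df1's index.
df2 = pd.DataFrame({"F": [0, 1, 0, 1]}, index=[10, 11, 12, 13])

# Assigning a Series aligns on index labels, so no value finds a match.
aligned = df1.assign(F=df2["F"])

# Concatenating along the columns takes the union of both indexes,
# which doubles the row count and leaves NaN in the gaps.
doubled = pd.concat([df1, df2], axis=1)

# A plain ndarray has no index, so values are placed by position.
positional = df1.assign(F=df2["F"].to_numpy())
print(positional)
```

Positional assignment requires both dataframes to have the same number of rows, as they do in the question.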

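You just have to create a new column on the original dataframe and assign the column from the second dataframe to it. Note that this assignment aligns on the index, so it only gives the expected result when both dataframes share the same index (as in the example below, where both use the default RangeIndex):
# generating the example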
import pandas as pd
data1 = {"A": [1, 5, 1, 9],
"B": [2, 4, 3, 7],
"C": [3, 3, 5, 5],
"D": [4, 2, 7, 3],
"E": [5, 1, 9, 1]}
data2 = {"F": [0, 1, 0, 1]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#creating the column
df1["F"] = df2.F
df1
> A B C D E F
> 0 1 2 3 4 5 0
> 1 5 4 3 2 1 1
> 2 1 3 5 7 9 0
> 3 9 7 5 3 1 1

Related

Splitting a Dataframe not based on a string, but a value in a column

I have a dataframe cut from a much larger dataframe:
import pandas as pd
data = {'Name': [5, 5, 6, 6, 7, 7],
'Value': [1, 2, 1, 2, 1, 2]
}
df = pd.DataFrame(data)
Name Value
0 5 1
1 5 2
2 6 1
3 6 2
4 7 1
5 7 2
Ideal Output:
Name Value Value2
0 5 1 2
1 6 1 2
2 7 1 2
I need a way to split the dataframe into 2 separate dataframes based on the "Value" column. The rows with '1' in the Value column and the rows with '2' in the Value column need to be split up.
The best/end goal solution is to have one Name with the 1 and 2 be separate columns in the same dataframe. My idea so far is to split the two and combine them so the data is side by side all tracing back to a single Name.
You can group by your Name column and aggregate your values into a list.
out = df.groupby('Name').agg(list).reset_index()
Use the DataFrame constructor to split the lists into separate columns and assign them back:
out[['Value1','Value2']] = pd.DataFrame(out.Value.tolist(), index= out.index)
>>> out.drop('Value',axis=1)
Name Value1 Value2
0 5 1 2
1 6 1 2
2 7 1 2
Use pd.concat:
>>> pd.concat([out['Name'],
pd.DataFrame(out["Value"].to_list(), columns=['Value1', 'Value2'])],
axis=1)
Name Value1 Value2
0 5 1 2
1 6 1 2
2 7 1 2
Complete code to answer your comments:
Sample DF
data = {'Name': [5, 5, 6, 6, 7, 7],
'Value': [1, 2, 1, 2, 1, 2]
}
df = pd.DataFrame(data)
Name Value
0 5 1
1 5 2
2 6 1
3 6 2
4 7 1
5 7 2
Answer:
out = df.groupby('Name').agg(list).reset_index()
out[['Value1','Value2']] = pd.DataFrame(out.Value.tolist(), index= out.index)
out.drop('Value',axis=1)
Name Value1 Value2
0 5 1 2
1 6 1 2
2 7 1 2
Additional columns:
data = {'Name': [5, 5, 6, 6, 7, 7],
'Value': [1, 2, 1, 2, 1, 2],
'extra':[1,2,3,4,5,6]
}
df = pd.DataFrame(data)
out = df.groupby('Name').agg({'Value':list}).reset_index()
out[['Value1','Value2']] = pd.DataFrame(out.Value.tolist(), index= out.index)
out.drop('Value',axis=1,inplace=True)
result = pd.merge(df.drop('Value',axis=1),out,on='Name',how='left')
>>>result
Name extra Value1 Value2
0 5 1 1 2
1 5 2 1 2
2 6 3 1 2
3 6 4 1 2
4 7 5 1 2
5 7 6 1 2
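As an alternative to aggregating into lists, the same wide layout can be sketched with a pivot. The helper column name pos and the variable name wide are my own choices for illustration, and the sketch assumes each Name may have a different number of rows (missing positions would become NaN).

```python
import pandas as pd

df = pd.DataFrame({'Name': [5, 5, 6, 6, 7, 7],
                   'Value': [1, 2, 1, 2, 1, 2]})

# Number the rows inside each Name group: 1, 2, ...
pos = df.groupby('Name').cumcount() + 1

# Spread the values so that each position becomes its own column.
wide = df.assign(pos=pos).pivot(index='Name', columns='pos', values='Value')
wide.columns = [f'Value{i}' for i in wide.columns]
wide = wide.reset_index()
print(wide)
```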
Use pandas.DataFrame.groupby():
>>> df
Name Value
0 5 1
1 5 2
2 6 1
3 6 2
4 7 1
5 7 2
>>> dfs = [d for _, d in df.groupby('Value')]
>>> dfs
[ Name Value
0 5 1
2 6 1
4 7 1,
Name Value
1 5 2
3 6 2
5 7 2]
>>> dfs[0]
Name Value
0 5 1
2 6 1
4 7 1
>>> dfs[1]
Name Value
1 5 2
3 6 2
5 7 2
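To get from the two split dataframes to the side-by-side layout the question asks for, one option is to merge them back on Name. This is a sketch only; the dictionary name parts and the suffix choice are assumptions made to reproduce the column names in the ideal output.

```python
import pandas as pd

df = pd.DataFrame({'Name': [5, 5, 6, 6, 7, 7],
                   'Value': [1, 2, 1, 2, 1, 2]})

# One sub-dataframe per distinct entry of the Value column.
parts = {value: group for value, group in df.groupby('Value')}

# Join the two halves on Name; the right-hand Value column gets the suffix "2".
side_by_side = parts[1].merge(parts[2], on='Name', suffixes=('', '2'))
print(side_by_side)
```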

Replicating rows on Pandas Dataframe based on a value column and then affixing a counter column

Suppose I have this dataframe df:
A B count
0 1 2 3
1 3 4 2
2 5 6 1
3 7 8 2
Then I want to do row-replication operation depending on the count column, and then add a new column that does the counter. So the resulting outcome is:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
My idea was to duplicate the rows accordingly (using numpy and a pandas df), then add a counter column that increments for every repeated row and resets to 0 when a new row starts. But I think this may be slow. Is there an easier and faster way to do it?
Let's try index.repeat to scale up the DataFrame, then groupby cumcount to create the counter within each group of repeated rows and insert it into the DataFrame at the front:
df = df.loc[df.index.repeat(df['count'])]
df.insert(0, 'counter', df.groupby(level=0).cumcount())
df = df.reset_index(drop=True)
df:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
DataFrame constructor:
import pandas as pd
df = pd.DataFrame({
'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8], 'count': [3, 2, 1, 2]
})
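As a variation on the answer above, the counter can also be built directly with NumPy instead of groupby. This is a sketch, not a benchmark; the names out and counter are mine, and it assumes the dataframe has at least one row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8], 'count': [3, 2, 1, 2]
})

# Repeat every row as many times as its count says.
out = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)

# One 0..n-1 run per original row, glued together in order.
counter = np.concatenate([np.arange(n) for n in df['count']])
out.insert(0, 'counter', counter)
print(out)
```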

Pandas add new column with CumSum of two columns, restart with new value in other column

I have the following df:
A B C
1 10 2
1 15 0
2 5 2
2 5 0
I add column D through:
df["D"] = (df.B - df.C).cumsum()
A B C D
1 10 2 8
1 15 0 23
2 5 2 26
2 5 0 31
I want the cumsum to restart in row 3 where the value in column A is different from the value in row 2.
Desired output:
A B C D
1 10 2 8
1 15 0 23
2 5 2 3
2 5 0 8
Try with
df['new'] = (df.B-df.C).groupby(df.A).cumsum()
Out[343]:
0 8
1 23
2 3
3 8
dtype: int64
Use groupby and cumsum
df['D'] = df.assign(D=df['B']-df['C']).groupby('A')['D'].cumsum()
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8
import pandas as pd
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
# subtract first, then take the cumulative sum of D within each group of A
df['D'] = df['B'] - df['C']
df['D'] = df.groupby('A')['D'].cumsum()
print(df)
output:
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8
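Grouping by A restarts the sum once per distinct value of A. If the same value can come back later (for example 1, 1, 2, 2, 1) and the sum should restart at every change, a block label can serve as the group key. The extra fifth row and the name block are hypothetical, added only to show the difference.

```python
import pandas as pd

# Hypothetical data: A returns to 1 in the last row.
df = pd.DataFrame({"A": [1, 1, 2, 2, 1],
                   "B": [10, 15, 5, 5, 4],
                   "C": [2, 0, 2, 0, 1]})

# A new block starts wherever A differs from the previous row.
block = df["A"].ne(df["A"].shift()).cumsum()

# Cumulative sum of B - C, restarted at the start of every block.
df["D"] = (df["B"] - df["C"]).groupby(block).cumsum()
print(df)
```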

Pandas max for rows, top n max

I'm trying to create "top" columns holding the row-wise maximum across several columns. Pandas has an nlargest method, but I cannot get it to work across rows. Pandas also has max and idxmax, which do exactly what I want, but only for the single largest value.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]), columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()
df['max_1_val'] = df[cols].max(axis=1)
df['max_1_col'] = df[cols].idxmax(axis=1)
Output:
a b c d e f max_1_val max_1_col
0 1 2 3 5 1 9 5 d
1 4 5 6 2 5 9 6 c
2 7 8 9 2 5 10 9 c
But I am trying to get max_n_val and max_n_col so the expected output for top 3 would be:
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val max_3_col
0 1 2 3 5 1 9 5 d 3 c 2 b
1 4 5 6 2 5 9 6 c 5 b 5 e
2 7 8 9 2 5 10 9 c 8 b 7 a
To improve performance, numpy.argsort is used to get the positions; to get descending order, the last 3 positions of each row are taken and reversed by slicing:
N = 3
a = df[cols].to_numpy().argsort()[:, :-N-1:-1]
print (a)
[[3 2 1]
[2 4 1]
[2 1 0]]
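The slice :-N-1:-1 can be hard to read, so here is a small sketch on a single row (the first row of the example). A stable sort is requested here only to make the tie between the two 1s deterministic.

```python
import numpy as np

row = np.array([1, 2, 3, 5, 1])
N = 3

# Positions that would sort the row in ascending order.
ascending = row.argsort(kind='stable')

# Walk backwards from the end and stop after N items:
# the positions of the N largest values, largest first.
top_n = ascending[:-N-1:-1]
print(top_n, row[top_n])
```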
Then get the column names by indexing the array of column names with a (stored in c), and reorder the values the same way with positional indexing (stored in d):
c = np.array(cols)[a]
d = df[cols].to_numpy()[np.arange(a.shape[0])[:, None], a]
Finally, create the DataFrames, join them with concat, and reorder the columns with DataFrame.reindex:
df1 = pd.DataFrame(c).rename(columns=lambda x : f'max_{x+1}_col')
df2 = pd.DataFrame(d).rename(columns=lambda x : f'max_{x+1}_val')
c = df.columns.tolist() + [y for x in zip(df2.columns, df1.columns) for y in x]
df = pd.concat([df, df1, df2], axis=1).reindex(c, axis=1)
print (df)
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val \
0 1 2 3 5 1 9 5 d 3 c 2
1 4 5 6 2 5 9 6 c 5 e 5
2 7 8 9 2 5 10 9 c 8 b 7
max_3_col
0 b
1 b
2 a
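In row 1 the value 5 occurs in both b and e, and the output above lists e before b, while the expected output in the question lists b first. If ties should go to the leftmost column, one possible sketch is a stable sort on the negated values. It assumes signed numeric columns (negation is not meaningful otherwise); the names values, order, top_vals and top_cols are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9],
                            [4, 5, 6, 2, 5, 9],
                            [7, 8, 9, 2, 5, 10]]),
                  columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()
N = 3

values = df[cols].to_numpy()
# Ascending stable sort of the negated values: largest first,
# and equal values keep their left-to-right column order.
order = np.argsort(-values, axis=1, kind='stable')[:, :N]
top_vals = np.take_along_axis(values, order, axis=1)
top_cols = np.array(cols)[order]

for i in range(N):
    df[f'max_{i+1}_val'] = top_vals[:, i]
    df[f'max_{i+1}_col'] = top_cols[:, i]
print(df)
```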

use meshgrid for rows with common values in column

My dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both dataframes in all pairings, but only where the values in columns b and c are equal.
Right now I only have a solution that produces every combination, using this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to group by common rows in the selected columns b and c.
Try DataFrame.merge, which performs an inner merge by default:
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
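For completeness, here is a self-contained sketch of the merge above. An inner merge on b and c yields every pairing of rows whose keys match, which is the filtered version of the meshgrid result. The explicit suffixes and the name pairs are my choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]), columns=['a', 'b', 'c'])

# Pair up rows with equal (b, c); the two a columns are told apart by suffix.
pairs = df1.merge(df2, on=['b', 'c'], how='inner', suffixes=('_1', '_2'))[['a_1', 'a_2']]
print(pairs)
```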
