How to add new rows to a dataframe?

I have this dataframe:
df = pd.DataFrame({'Type':['A','A','B','B'], 'Variants':['A3','A6','Bxy','Byz']})
It looks like this:
Type Variants
0 A A3
1 A A6
2 B Bxy
3 B Byz
I need a function that adds n new rows at the end of each group of rows that share the same Type value.
For n=2 the result should look like this:
Type Variants
0 A A3
1 A A6
2 A NaN
3 A NaN
4 B Bxy
5 B Byz
6 B NaN
7 B NaN
Can anyone help me with this? I would appreciate it a lot. Thanks in advance.

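Create a dataframe holding the extra rows and concatenate it with your original one:
import numpy as np
import pandas as pd

def add_rows(df, n):
    df1 = pd.DataFrame(np.repeat(df['Type'].unique(), n), columns=['Type'])
    # a stable sort keeps the original rows ahead of the appended ones
    return pd.concat([df, df1]).sort_values('Type', kind='stable').reset_index(drop=True)

out = add_rows(df, 2)
print(out)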
# Output
Type Variants
0 A A3
1 A A6
2 A NaN
3 A NaN
4 B Bxy
5 B Byz
6 B NaN
7 B NaN
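
As a variation on the answer above, here is a minimal sketch that avoids sorting altogether by appending the filler rows group by group. The function name add_rows_per_group and its key parameter are made up for this illustration, and it assumes the rows of each Type should stay in their order of first appearance:

```python
import pandas as pd

def add_rows_per_group(df, n, key="Type"):
    """Append n rows to each group of `key`; other columns are left as NaN."""
    pieces = []
    # sort=False keeps the groups in the order they first appear in df
    for value, group in df.groupby(key, sort=False):
        filler = pd.DataFrame({key: [value] * n})
        pieces.append(pd.concat([group, filler]))
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({"Type": ["A", "A", "B", "B"],
                   "Variants": ["A3", "A6", "Bxy", "Byz"]})
out = add_rows_per_group(df, 2)
print(out)
```

Because nothing is sorted, this also works when the Type values are not in alphabetical order in the input.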

Related

Pandas groupby two columns and expand the third

I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
And I will like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby A and B and expand C into different columns.
Knowing that the length of each group is different.
C is already ordered
Shorter groups can have NAN or NULL values (empty), it does not matter.

Use GroupBy.cumcount and Series.add with 1 to number the rows within each group from 1 onwards, then pass this to DataFrame.pivot and add DataFrame.add_prefix to rename the columns (C1, C2, C3, etc.). Finally, use DataFrame.rename_axis to remove the original name of the column index ('g') and DataFrame.reset_index to turn the MultiIndex levels A, B back into columns:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
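
For comparison, the same reshaping can be sketched with set_index and unstack instead of pivot. This is an alternative to the answer above, not a restatement of it; the sample data is rebuilt from the question and the name wide is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a"] * 3 + ["c"] * 5 + ["e"] * 4,
    "B": ["b"] * 3 + ["d"] * 5 + ["b"] * 4,
    "C": [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

# Position of each row inside its (A, B) group, counted from 1.
g = df.groupby(["A", "B"]).cumcount().add(1).rename("g")

wide = (
    df.set_index(["A", "B", g])["C"]   # index levels: A, B, g
      .unstack("g")                    # one column per position
      .add_prefix("C")                 # 1 -> C1, 2 -> C2, ...
      .rename_axis(columns=None)       # drop the leftover 'g' label
      .reset_index()
)
print(wide)
```

Shorter groups are padded with NaN, so the C columns come out as floats, just as in the pivot version.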
Because NaN is a float by default, the new columns come out as floats. If you need integer columns, add DataFrame.astype with the nullable Int64 dtype:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(['A','B'], 'g', 'C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If at most N new columns may be added, a group with more than N rows has to be split over several output rows, so the A, B values will be duplicated. This needs helper groups g1, g2, built with integer and modulo division, with g1 added as a new level in the index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
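
To make the integer and modulo division step concrete, here is a small sketch of what g1 and g2 look like for a single group of five rows with N = 4 (the arrays are illustrative and not part of the answer's code):

```python
import numpy as np

N = 4
g = np.arange(5)       # cumcount within one group: 0, 1, 2, 3, 4

g1 = g // N            # which output row the value lands in
g2 = g % N + 1         # which column (C1..C4) the value lands in

print(g1.tolist())     # [0, 0, 0, 0, 1]
print(g2.tolist())     # [1, 2, 3, 4, 1]
```

The fifth value wraps around to a second output row (g1 = 1) and starts again at C1, which is why the group c, d appears twice in the result.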

df1.astype({'C':str}).groupby([*'AB'])\
.agg(','.join).C.str.split(',',expand=True)\
.add_prefix('C').reset_index()
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None

The accepted solution, but passing index, columns and values to pivot as keyword arguments to avoid the deprecation warning:
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
                .add_prefix('C_')
                .astype('Int64')
                .droplevel(-1)
                .rename_axis(columns=None)
                .reset_index())

Merging columns using pandas

I am trying to merge multiple-choice question columns using pandas so I can then manipulate them. An example of what my questions look like is:
   C1  C2  C3
0   A       A
1       B   B
2       C   C
3   D       D
The data is currently presented as C1 and C2 but I need it to be combined into 1 column as represented in C3.

One option, assuming NaN in empty cells, is to bfill the first column and copy it:
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
This way is extensible to any number of initial columns.
Output:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
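
To illustrate the point about any number of initial columns, here is a minimal sketch with three source columns. The column names Q1 to Q3 and merged are made up for this example, and it assumes empty cells are NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Q1": ["A", np.nan, np.nan, np.nan],
    "Q2": [np.nan, "B", np.nan, np.nan],
    "Q3": [np.nan, np.nan, "C", np.nan],
})

# Back-fill along each row, then take the first column:
# it now holds the first non-null value of every row.
df["merged"] = df[["Q1", "Q2", "Q3"]].bfill(axis=1).iloc[:, 0]
print(df)
```

A row with no answer in any column, like the last one here, stays NaN in the merged column.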

You may try with fillna
df['C3'] = df['C1'].fillna(df['C2'])
df
Out[483]:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D

You can also use combine_first:
df['C3'] = df['C1'].combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
If your cells contain empty strings rather than null values, temporarily replace them with NaN:
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
# Output
   C1  C2  C3
0   A       A
1       B   B
2       C   C
3   D       D
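
If you would rather not go through NaN at all, one possible alternative (a sketch, not part of the answer above) is Series.mask, which swaps in the C2 value wherever C1 is an empty string:

```python
import pandas as pd

df = pd.DataFrame({"C1": ["A", "", "", "D"],
                   "C2": ["", "B", "C", ""]})

# Where C1 is the empty string, take the value from C2 instead.
df["C3"] = df["C1"].mask(df["C1"].eq(""), df["C2"])
print(df)
```

This only handles the two-column case; with more columns the NaN-based approaches are easier to extend.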

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 which match the values of the primary keys of df4.
For example, in this case,
I need to get the rows with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
   A  B   C  A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
4  5  e  e5                               5       e       Q
1  2  b  b2       2       b       Y
Any ideas on how to work this out will be very helpful.

Use two consecutive DataFrame.merge operations to left merge df4 with df2 and then df3, applying DataFrame.add_suffix to the right dataframe each time. Finally, use DataFrame.fillna to replace the missing values with empty strings:
df = (
    df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
    .merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
    .fillna('')
)
Result:
# print(df)
   A  B   C  A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
0  5  e  e5                               5       e       Q
1  2  b  b2       2       b       Y
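
If the duplicated key columns are not needed, a leaner variant is to merge directly on the primary key list. The sketch below rebuilds the sample frames from the question, uses a fixed selection instead of df1.sample so the result is reproducible, and introduces a helper named lookup that is made up for this example:

```python
import pandas as pd

PK = ["A", "B"]
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                    "B": list("abcdef"),
                    "C": ["a1", "b2", "c3", "d4", "e5", "f6"]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
df3 = pd.DataFrame({"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
                   index=[3, 4, 5])
df4 = df1.loc[[4, 1]]          # stands in for df1.sample(n=2)

def lookup(keys, other, label):
    """Left-merge `other` onto `keys` by PK, tagging its C column with `label`."""
    renamed = other.rename(columns={"C": f"C({label})"})
    return keys.merge(renamed, on=PK, how="left")

out = df4.pipe(lookup, df2, "df2").pipe(lookup, df3, "df3")
print(out)
```

Rows without a match in df2 or df3 get NaN in the corresponding column. Note that merge does not keep the original index of df4, so the result is indexed 0, 1.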

Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or just take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Replace column and extend index in DataFrame

I have DataFrame x and I would like to replace one column with Series y
x = DataFrame([[1,2],[3,4]], columns=['C1','C2'], index=['a','b'])
C1 C2
a 1 2
b 3 4
y = Series([5,6,7], index=['a','b','c'])
a 5
b 6
c 7
Simple replacement works fine but keeps original index of DataFrame
x['C1'] = y
C1 C2
a 5 2
b 6 4
I need the union of the indices of x and y. One solution would be to reindex before the replacement:
x = x.reindex(x.index.union(y.index), copy=False)
x['C1'] = y
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Is there a simpler way?

combine_first
Turn y into a DataFrame first with to_frame
y.to_frame('C1').combine_first(x)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
align and assign
Use align to... align the indices
x, y = x.align(y, axis=0)
x.assign(C1=y)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN

You can try using join:
pd.DataFrame(y,columns=['C1']).join(x[['C2']])
Output:
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN

Similar to your solution but more succinct: reindex on the union of both indexes, then assign:
res = x.reindex(x.index.union(y.index)).assign(C1=y)
print(res)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN

You can use concat, but you will have to fix the column names afterwards:
import pandas as pd
pd.concat([x.loc[:, 'C2'], y], axis = 1)
which gives,
C2 0
a 2.0 5
b 4.0 6
c NaN 7
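
One way to handle the column name, sketched here as an assumption about what the answer has in mind, is to name the Series before concatenating. Putting y first also restores the original column order:

```python
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4]], columns=["C1", "C2"], index=["a", "b"])
y = pd.Series([5, 6, 7], index=["a", "b", "c"])

# rename gives the Series a name, which concat uses as the column label
res = pd.concat([y.rename("C1"), x["C2"]], axis=1)
print(res)
```

The result has columns C1, C2 and the union of both indexes, with NaN in C2 for the new row c.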

How to select rows which matches certain row

I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, I would like to cut df at each row where df.A starts with "a".
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
So I would like to cut df into the pieces below.
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then I would like to extract, within each piece, the rows whose column B value matches that of the row whose column A starts with "a".
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then I would like to append the pieces.
result
A B
a0 1
b0 1
a1 3
b2 3
How can I cut and append df like this?
I tried the cut method, but it didn't work well.

I think you can use where with a boolean mask to replace B with NaN in every row where A does not start with "a", and then forward fill those NaN values with ffill.
Note that the row whose A value starts with "a" has to come first in each group for ffill to work:
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
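
An alternative sketch of the same idea labels the pieces explicitly with cumsum and then compares each row against the first B value of its piece. Like the answer above, it assumes every piece begins with a row whose A value starts with "a"; the names is_head, block and head_value are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a0", "b0", "c0", "a1", "b1", "b2"],
                   "B": [1, 1, 2, 3, 4, 3]})

is_head = df["A"].str.startswith("a")
block = is_head.cumsum()                  # 1, 1, 1, 2, 2, 2

# B value of the first row of each block, broadcast to every row in it
head_value = df.groupby(block)["B"].transform("first")

result = df[df["B"] == head_value]
print(result)
```

Since no NaN values are introduced, column B keeps its integer dtype throughout.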
