Pandas replace Na using merge or join - python

I want to replace Na in column A based on the shared values in column B, so that rows with x in column B have 1 in column A and rows with y in column B have 2 in column A:
A B C D E
1 x d e q
Na x v s f
Na x v e j
2 y w e v
Na y b d g

Use groupby.transform('first'), which takes the first non-null value of A in each group, optionally combined with convert_dtypes to get an integer column back:
df['A'] = df.groupby('B')['A'].transform('first').convert_dtypes()
output:
A B C D E
0 1 x d e q
1 1 x v s f
2 1 x v e j
3 2 y w e v
4 2 y b d g
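Since the question title mentions merge or join, the same fill can also be sketched as a lookup: build a B-to-A mapping from the rows where A is known and map it onto the missing rows. This is only a sketch. The sample frame is rebuilt here with NaN standing in for Na, the name lookup is introduced for illustration, and the final astype(int) assumes every B group has at least one known A value.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with NaN standing in for "Na"
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 2, np.nan],
    "B": ["x", "x", "x", "y", "y"],
    "C": ["d", "v", "v", "w", "b"],
    "D": ["e", "s", "e", "e", "d"],
    "E": ["q", "f", "j", "v", "g"],
})

# One known A value per B value (hypothetical helper name: lookup)
lookup = df.dropna(subset=["A"]).drop_duplicates("B").set_index("B")["A"]

# Fill only the missing A values from the lookup, then restore an integer dtype
# (assumes no group is left entirely without a known value)
df["A"] = df["A"].fillna(df["B"].map(lookup)).astype(int)
print(df)
```

Unlike transform('first'), this version leaves existing A values untouched and only fills the gaps.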

How to transpose only a specific amount and add it to the existing dataframe

I searched the internet for a solution to my problem, but I could not find one.
I have the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
and I want to append the following dataframe to it:
pos1 pos2 pos3
0 A B C
1 A B C
2 A B C
3 D E F
4 D E F
5 D E F
6 G H I
7 G H I
8 G H I
so that I get the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A B C
10 A B C
11 A B C
12 D E F
13 D E F
14 D E F
15 G H I
16 G H I
17 G H I
I know that the number of rows is always a multiple of the number of columns. That means if I have 4 columns, then the number of rows is 4, 8, 12, 16, etc. In my example there are 3 columns and 9 rows.
What I want to do is transpose the rows into columns, but only in blocks of that size. So I want the first 3 rows to be transposed with the columns, then the next 3 rows, and so forth.
I currently have the following code:
import pandas as pd
import io
s = """pos1 pos2 pos3
A A A
B B B
C C C
D D D
E E E
F F F
G G G
H H H
I I I
"""
df = pd.read_csv(io.StringIO(s), sep=r"\s+")
final_df = df.copy()
index_values = final_df.index.values
value = 0
while value < len(df.index):
    # take the next block of 3 rows, transpose it, and restore the column names
    sub_df = df[value:value+3]
    sub_df.columns = index_values[value: value + 3]
    sub_df = sub_df.T
    sub_df.columns = df.columns
    final_df = pd.concat([final_df, sub_df])
    value += len(df.columns)
final_df = final_df.reset_index(drop=True)
print(final_df)
The code I have now is slow because of the loop.
Is it possible to get the same result without using a loop?
You can use the underlying numpy array with ravel and reshape with the order='F' parameter (column-major order), and pass the result to the pandas.DataFrame constructor.
Then concat the output with the original DataFrame:
pd.concat([df,
           pd.DataFrame(df.to_numpy().ravel().reshape(df.shape, order='F'),
                        columns=df.columns)
           ], ignore_index=True)
output:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A D G
10 A D G
11 A D G
12 B E H
13 B E H
14 B E H
15 C F I
16 C F I
17 C F I
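Note that the appended rows in the output above (A D G, B E H, C F I) are grouped differently from the layout requested in the question (A B C, D E F, G H I). A possible sketch that transposes each block of n rows separately, where n is the number of columns, is to view the array as a stack of n-by-n blocks and swap the last two axes. The names n, arr and transposed are introduced here for illustration, and the sketch assumes the row count is a multiple of the column count, as the question states.

```python
import pandas as pd

# Sample data from the question: 9 rows, 3 identical columns
df = pd.DataFrame({c: list("ABCDEFGHI") for c in ["pos1", "pos2", "pos3"]})

n = df.shape[1]        # block size equals the number of columns
arr = df.to_numpy()

# Split into (blocks, n, n), transpose each block, flatten back to 2D
transposed = arr.reshape(-1, n, n).swapaxes(1, 2).reshape(-1, n)

result = pd.concat(
    [df, pd.DataFrame(transposed, columns=df.columns)],
    ignore_index=True,
)
print(result)
```

This stays fully vectorized, so there is no Python-level loop over the blocks.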
This is somewhat efficient if you want to use pandas only:
n = len(df.columns)  # block size: each block has as many rows as there are columns
blocks = [df]
for start in range(0, len(df), n):
    block = df.iloc[start:start + n].T  # transpose one n-by-n block
    block.columns = df.columns          # restore the original column names
    blocks.append(block)
df = pd.concat(blocks, ignore_index=True)

How do I merge categories for crosstab in pandas?

Suppose my pandas dataframe has 3 categories for variable X: [A, B, C] and 2 categories for variable Y:[D,E]. I want to cross-tab this, with something like:
+--------+----------------------+-----+
| X/Y | D | E |
+--------+----------------------+-----+
| A or B | count(X=A or B, Y=D) | ... |
| C | count(X=C, Y=D) | ... |
+--------+----------------------+-----+
Is this what you are looking for?
import pandas as pd
import numpy as np
x = np.random.choice(['A', 'B', 'C'], size=10)
y = np.random.choice(['D', 'E'], size=10)
df = pd.DataFrame({'X':x, 'Y':y})
df.head()
Output:
X Y
0 A D
1 B D
2 B E
3 B D
4 A E
Dataframe modifications:
df['X'] = df['X'].apply(lambda x: 'A or B' if x == 'A' or x == 'B' else x)
Crosstab application:
pd.crosstab(df.X, df.Y)
Output:
Y D E
X
A or B 1 3
C 4 2
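As a variation on the approach above, the merged labels can be passed straight to pd.crosstab without overwriting the X column. This is a minimal sketch with fixed (non-random) sample data so the counts are reproducible; the names merged and table are introduced for illustration.

```python
import pandas as pd

# Fixed sample data instead of random draws, so the counts are reproducible
df = pd.DataFrame({
    "X": list("ABBBACCCAC"),
    "Y": list("DDEDEDEDED"),
})

# Map A and B onto one shared label; C is left unchanged
merged = df["X"].replace({"A": "A or B", "B": "A or B"})

# Cross-tabulate the merged labels against Y; df["X"] itself is not modified
table = pd.crosstab(merged, df["Y"])
print(table)
```

Keeping the original column intact makes it easy to try different groupings on the same data.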
You can use pandas.pivot_table() for this purpose. This should do the trick (df refers to the input dataframe):
import numpy as np
df["catX"]=np.where(df["X"].isin(["A","B"]), "AB", np.where(df["X"]=="C", "C", "other"))
df2=df.pivot_table(index="catX", columns="Y", aggfunc='count', values="X")
Sample output:
#input - df with extra categorical column - catX
X Y catX
0 A D AB
1 B D AB
2 C E C
3 B E AB
4 C D C
5 B D AB
6 C D C
7 A E AB
8 A D AB
9 A E AB
10 C E C
11 C E C
12 A E AB
#result:
Y D E
catX
AB 4 4
C 2 3

How do I split a dataframe by a repeating index and enumerate?

I have a pandas dataframe that looks like the following:
0 1 2
# A B C
1 D E F
2 G H I
# J K L
1 M N O
2 P Q R
3 S T U
The index has a repeating 'delimiter', namely #. I am seeking an efficient way to transform this to the following:
0 1 2 3
# A B C 1
1 D E F 1
2 G H I 1
# J K L 2
1 M N O 2
2 P Q R 2
3 S T U 2
I would like a new column (3) that enumerates the chunks delimited by the # symbol in the index. This is for an NLP application, and the dataset I am working with can be found here for context: https://sites.google.com/site/germeval2014ner/data.
By the way, I know I can do this with a simple iteration, but I am wondering if there is a vectorized approach or a split capability I am not aware of.
Thanks for your help!
Something like
df['new_col'] = (df.index == '#').cumsum()
Output:
1 2 3 new_col
0
# A B C 1
1 D E F 1
2 G H I 1
# J K L 2
1 M N O 2
2 P Q R 2
3 S T U 2
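To cover the "split" part of the question as well, the enumerated column can be fed to groupby to obtain one sub-frame per chunk. The following is a small sketch on the sample data; the column name chunk and the dictionary chunks are names chosen here for illustration.

```python
import pandas as pd

# Sample data from the question, with "#" marking the start of each chunk
df = pd.DataFrame(
    [list("ABC"), list("DEF"), list("GHI"), list("JKL"),
     list("MNO"), list("PQR"), list("STU")],
    index=["#", "1", "2", "#", "1", "2", "3"],
)

# True at every "#" row; the running sum therefore numbers the chunks 1, 2, ...
df["chunk"] = (df.index == "#").cumsum()

# Split into one DataFrame per chunk number
chunks = {chunk_id: part for chunk_id, part in df.groupby("chunk")}

for chunk_id, part in chunks.items():
    print(chunk_id, len(part))
```

The comparison and the cumulative sum both run on the whole index at once, so no explicit row iteration is needed to build the column.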

Pandas: Strip white spaces from columns with mixed string-float values

I have a dataframe with mixed string and float/int values in column 'k':
>>> df
a b k
0 1 a q
1 2 b 1
2 3 c e
3 4 d r
When I do this to remove whitespace from all columns:
df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
It converts the integer 1 to a NaN:
a b k
0 1 a q
1 2 b NaN
2 3 c e
3 4 d r
How can I overcome this?
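You can do this with mask and to_numeric: to_numeric with errors='coerce' turns every non-numeric value into NaN, and mask replaces exactly those cells with their stripped string version, leaving the numeric cells untouched.
df = df.mask(df.apply(pd.to_numeric, errors='coerce').isnull(), df.astype(str).apply(lambda x: x.str.strip()))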
df
Out[572]:
a b k
0 1 a q
1 2 b 1
2 3 c e
3 4 d r
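The NaN in the question appears because the .str accessor returns NaN for elements that are not strings. An alternative sketch is to strip element by element and only touch values that really are strings. The helper strip_if_str is a hypothetical name, and the sample frame below adds some surrounding spaces that are assumed for illustration.

```python
import pandas as pd

# Sample data modelled on the question; the extra spaces are assumed
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [" a", "b ", "c", " d "],
    "k": [" q", 1, "e ", "r"],
})

def strip_if_str(value):
    """Strip surrounding whitespace from strings; return anything else unchanged."""
    return value.strip() if isinstance(value, str) else value

# Apply the helper to every element of every column
cleaned = df.apply(lambda col: col.map(strip_if_str))
print(cleaned)
```

This is element-wise and therefore slower than the vectorized string methods on large frames, but it keeps the integer 1 in column k as an integer.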

How to split a string and assign as column name for a pandas dataframe?

I have a dataframe which has a single column like this:
a;d;c;d;e;r;w;e;o
--------------------
0 h;j;r;d;w;f;g;t;r
1 a;f;c;x;d;e;r;t;y
2 b;h;g;t;t;t;y;u;f
3 g;t;u;n;b;v;d;s;e
When I split it, I get this:
0 1 2 3 4 5 6 7 8
------------------------------
0 h j r d w f g t r
1 a f c x d e r t y
2 b h g t t t y u f
3 g t u n b v d s e
I need to assign a d c d e r w e o instead of 0 1 2 3 4 5 6 7 8 as column names.
I tried:
df = dataframe
df = df.iloc[:,0].str.split(';')
res = pd.DataFrame(df.columns.tolist())
res = pd.DataFrame(df.values.tolist())
I am getting the values assigned to each column, but not the column headers. What should I do?
I think you need to create a new DataFrame with the expand=True parameter and then assign the new column names:
res = df.iloc[:,0].str.split(';', expand=True)
res.columns = df.columns[0].split(';')
print (res)
a d c d e r w e o
0 h j r d w f g t r
1 a f c x d e r t y
2 b h g t t t y u f
3 g t u n b v d s e
But if the data ended up in a single column because the file was read with the wrong separator, you may only need sep=';' in read_csv:
res = pd.read_csv(file, sep=';')
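One thing to be aware of with the read_csv route: the header in this example contains repeated names (d and e), and read_csv renames repeated column names by appending a numeric suffix. The sketch below reads the sample data from an in-memory string; the variable raw stands in for the file and is an assumption for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the semicolon-separated file
raw = """a;d;c;d;e;r;w;e;o
h;j;r;d;w;f;g;t;r
a;f;c;x;d;e;r;t;y
b;h;g;t;t;t;y;u;f
g;t;u;n;b;v;d;s;e
"""

res = pd.read_csv(io.StringIO(raw), sep=";")

# Repeated header names are made unique: the second "d" becomes "d.1", etc.
print(res.columns.tolist())
print(res)
```

If the exact duplicate names are required, they can be reassigned afterwards with res.columns = raw.splitlines()[0].split(";").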
