Encoding the same values in different columns with the same integer in Python

I have a data frame with true/false values stored in string format. Some values are null in the data frame.
I need to encode this data such that TRUE/FALSE/null values are encoded with the same integer in every column.
Input:
col1 col2 col3
True True False
True True True
null null True
I am using:
le = preprocessing.LabelEncoder()
df.apply(le.fit_transform)
Output:
2 1 0
2 1 1
1 0 1
But I want the output as:
2 2 0
2 2 2
1 1 2
How do I do this?

What works for me is stacking everything into a one-column DataFrame, so that a single encoder sees all the values, and then unstacking:
df = df.stack(dropna=False).to_frame().apply(le.fit_transform)[0].unstack()
print (df)
   col1  col2  col3
0     1     1     0
1     1     1     1
2     2     2     1
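Stacking helps because one encoder is fitted on every cell at once, so each value gets the same code in every column. Here is a minimal sketch of the same idea with pandas.factorize instead of LabelEncoder. It assumes the nulls are stored as the string 'null' (the question does not say), and the sample frame is rebuilt from the question's input.

```python
import pandas as pd

# Sample data rebuilt from the question; 'null' as a string is an assumption.
df = pd.DataFrame({
    "col1": ["True", "True", "null"],
    "col2": ["True", "True", "null"],
    "col3": ["False", "True", "True"],
})

# Flatten all cells into one array so a single code table is shared by every column.
# sort=True assigns codes in sorted order of the unique values.
codes, uniques = pd.factorize(df.to_numpy().ravel(), sort=True)

# Put the codes back into the original shape.
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)
print(uniques)
print(encoded)
```

The actual integers depend on the sort order of the values, so this guarantees consistency across columns rather than any particular code for True or False.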
Another idea is to use DataFrame.replace with the string 'True' rather than the boolean True, because the question says the true/false values are stored in string format.
If the nulls are missing values (NaN):
df = df.replace({'True':2, 'False':1, np.nan:0})
If the nulls are the string 'null':
df = df.replace({'True':2, 'False':1, 'null':0})
print (df)
   col1  col2  col3
0     2     2     1
1     2     2     2
2     0     0     2
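If specific integers are required, an explicit dictionary gives full control over them. The following sketch uses Series.map on each column with one shared dictionary. The codes are picked to reproduce the output requested in the question, and the nulls are again assumed to be the string 'null'.

```python
import pandas as pd

# Sample data rebuilt from the question; 'null' as a string is an assumption.
df = pd.DataFrame({
    "col1": ["True", "True", "null"],
    "col2": ["True", "True", "null"],
    "col3": ["False", "True", "True"],
})

# Illustrative mapping chosen to match the output asked for in the question.
codes = {"False": 0, "null": 1, "True": 2}

# Look every value up in the same dictionary, column by column.
encoded = df.apply(lambda s: s.map(codes))
print(encoded)
```

Any value missing from the dictionary becomes NaN with map, which makes unexpected inputs easy to spot.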

Related

Adding new column to dataframe with value based on existing columns

How would I go through a dataframe and add a new column containing values based on whether the existing columns have the same values for each row?
For example in the following Dataframe I want to add a new column that contains 1 in the rows where Col1 and Col2 contain 1s and 0 if they do not all contain 1.
Col1 Col2
1 1
1 1
1 0
0 0
0 1
1 1
1 1
The output that I would want is
Col1 Col2 Col3
1 1 1
1 1 1
1 0 0
0 0 0
0 1 0
1 1 1
1 1 1
Ideally this would be scalable for more columns in the future (new column would only contain 1 if all columns contain 1)
If there are only 0s and 1s, you can multiply the columns with Series.mul:
df['Col3'] = df['Col1'].mul(df['Col2'])
To check whether all columns are 1, use DataFrame.all and cast to integers; this works only if the data contain nothing but 1 and 0:
df['Col3'] = df.all(axis=1).astype(int)
To test specifically for 1 with any data, use DataFrame.eq (the method form of ==):
df['Col3'] = df.eq(1).all(axis=1).astype(int)
To check only selected columns, take a subset first:
cols = ['Col1', 'Col2']
df['Col3'] = df[cols].all(axis=1).astype(int)
Or:
df['Col3'] = df[cols].eq(1).all(axis=1).astype(int)
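The difference between the plain all and the eq(1) variant only shows up when the data contain values other than 0 and 1. Here is a small sketch with a made-up frame that includes a 2 to illustrate it:

```python
import pandas as pd

# Made-up data: the 2 in Col1 is there only to show the difference.
df = pd.DataFrame({"Col1": [1, 1, 2, 0], "Col2": [1, 0, 1, 0]})
cols = ["Col1", "Col2"]

# all() treats every non-zero value as True, so the row (2, 1) counts as a match.
truthy = df[cols].all(axis=1).astype(int)

# eq(1) compares with 1 first, so only rows made up entirely of 1s match.
only_ones = df[cols].eq(1).all(axis=1).astype(int)

print(truthy.tolist())
print(only_ones.tolist())
```

Adding more column names to cols is all that is needed to scale either version.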

pandas printing a column if columns value is 1 applied to all columns

I have rows and columns, with the columns representing actual entities. The column values apart from the first column are either 1 or 0. The first column is a key. The objective is to return the column names (second to last column) where the column value is 1.
This is the function that I have written, and it works. I was wondering if there is a better way to express this in pandas, or even a better way to represent this form of data to make it more pandas-friendly.
def return_keys(df, productname):
    df2 = df[df['Product'] == productname]
    print(df2)
    columns = list(df2)
    cust = []
    for col in columns[1:]:
        if df2[col].to_list()[0] == 1:
            cust.append(col)
    return cust
If your key column does not contain 0/1, you can try using apply row-wise. Below is an example dataset:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Product':np.random.choice(['A','B','C'],10),
'Col1':np.random.binomial(1,0.5,10),
'Col2':np.random.binomial(1,0.5,10),
'Col3':np.random.binomial(1,0.5,10)})
df
  Product  Col1  Col2  Col3
0       A     0     1     1
1       A     1     0     0
2       A     1     1     1
3       A     1     0     0
4       C     1     1     1
5       B     0     1     1
6       C     1     0     0
7       C     0     1     0
8       C     1     1     1
9       A     0     1     0
Compare the frame with 1 to get a boolean DataFrame, then use apply with axis=1 to pick out, for each row, the names of the columns that are True:
(df == 1).apply(lambda x:df.columns[x].tolist(),axis=1)
0 [Col2, Col3]
1 [Col1]
2 [Col1, Col2, Col3]
3 [Col1]
4 [Col1, Col2, Col3]
5 [Col2, Col3]
6 [Col1]
7 [Col2]
8 [Col1, Col2, Col3]
9 [Col2]
Try the following:
df = pd.DataFrame({'a':[1,2,3], 'b':[0,1,2], 'c':[1,2,4], 'd':[0,2,4]})
cols_with_first_element_1 = df.columns[df.iloc[0]==1].to_list()
print(cols_with_first_element_1)
results in ['a', 'c'].
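Putting the two answers together, the original function can be written without the loop, and the data can also be reshaped into a long format that is often easier to work with in pandas. The sketch below uses a small made-up frame. The column name customer in the long format is a hypothetical label (borrowed from the cust list in the question), not something the question specifies.

```python
import pandas as pd

# Made-up data with one row per product.
df = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Col1": [1, 0, 1],
    "Col2": [0, 0, 1],
    "Col3": [1, 1, 0],
})

def return_keys(df, productname):
    # First row whose Product matches, without the key column.
    row = df.loc[df["Product"] == productname].drop(columns="Product").iloc[0]
    # Keep the column names where the value equals 1.
    return row.index[(row == 1).to_numpy()].tolist()

print(return_keys(df, "A"))

# Long format: one row per (Product, column) pair, keeping only the 1s.
long = df.melt(id_vars="Product", var_name="customer", value_name="flag")
long = long[long["flag"] == 1]
keys_by_product = long.groupby("Product")["customer"].agg(list).to_dict()
print(keys_by_product)
```

Like the original function, return_keys looks only at the first matching row, so it assumes each product appears once.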

Create readable string in pandas dataframe

I have a single column dataframe:
col1
1
2
3
4
I need to create another column where it will be a string like:
Result:
col1 col2
1 Value is 1
2 Value is 2
3 Value is 3
4 Value is 4
I know about formatted strings, but I am not sure how to apply them to a DataFrame.
Convert column to string and prepend values:
df['col2'] = 'Value is ' + df['col1'].astype(str)
Or use f-strings with Series.map:
df['col2'] = df['col1'].map(lambda x: f'Value is {x}')
print (df)
   col1        col2
0     1  Value is 1
1     2  Value is 2
2     3  Value is 3
3     4  Value is 4

PySpark : How to duplicate the rows of a dataframe based on the values in one column

Starting from a simple DataFrame like this in PySpark:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to duplicate the rows so that each value of col1 is paired with each value of col2, with the count column filled with 0 for the combinations that are missing from the original data. It would look like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
from pyspark.sql.functions import col
data = df.select('col1', 'col2')
# All combinations of col1 and col2; distinct() drops the duplicates that the self cross join produces.
all_combinations = data.alias('a').crossJoin(data.alias('b')).select('a.col1', 'b.col2').distinct()
# Left join the original counts back; combinations absent from df get null, which fillna replaces with 0.
result = (all_combinations.alias('a')
    .join(df.alias('b'), on=(col('a.col1') == col('b.col1')) & (col('a.col2') == col('b.col2')), how='left')
    .select('a.*', 'b.count')
    .fillna(0, subset=['count']))
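For comparison only, the same "complete the missing combinations" idea can be sketched in pandas, which makes it easy to check the expected result on the sample data. This is a different library from the PySpark answer above, not a replacement for it, and it assumes the data fit in memory.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "C"],
    "col2": [1, 2, 3, 1, 1],
    "count": [4, 8, 2, 3, 6],
})

# Every (col1, col2) pair built from the distinct values of each column.
full_index = pd.MultiIndex.from_product(
    [df["col1"].unique(), df["col2"].unique()], names=["col1", "col2"]
)

# Reindex onto the full set of pairs; pairs missing from df get count 0.
result = (
    df.set_index(["col1", "col2"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
print(result)
```

Here reindex with fill_value plays the role of the left join followed by filling nulls with 0.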

Pandas DataFrame: use column value to slice string in another column

I have a pandas DataFrame as follows:
   col1  col2      col3
0     1     3   ABCDEFG
1     1     5  HIJKLMNO
2     1     2   PQRSTUV
I want to add another column which should be a substring of col3 from position as indicated in col1 to position as indicated in col2. Something like col3[(col1-1):(col2-1)], which should result in:
   col1  col2      col3 new_col
0     1     3   ABCDEFG     ABC
1     1     5  HIJKLMNO    HIJK
2     1     2   PQRSTUV      PQ
I tried with the following:
my_df['new_col'] = my_df.col3.str.slice(my_df['col1']-1, my_df['col2']-1)
and
my_df['new_col'] = data['col3'].str[(my_df['col1']-1):(my_df['col2']-1)]
Both of them result in a column of NaN, while if I insert two numerical values (i.e. data['col3'].str[1:3]) it works fine. I checked and the types are correct (int64, int64 and object). Also, outside such a context (e.g. using a for loop) I can get the job done, but I'd prefer a one-liner that exploits the DataFrame. What am I doing wrong?
Use apply, because each row has to be processed separately:
my_df['new_col'] = my_df.apply(lambda x: x['col3'][x['col1']-1:x['col2']], 1)
print (my_df)
   col1  col2      col3 new_col
0     1     3   ABCDEFG     ABC
1     1     5  HIJKLMNO   HIJKL
2     1     2   PQRSTUV      PQ
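Since str.slice expects single integers for its start and stop arguments rather than a Series, the slicing has to happen row by row. As an alternative to apply, the same slice can be sketched with a list comprehension over the zipped columns. It reuses the slice bounds from the apply version above and is often faster on larger frames, though that depends on the data.

```python
import pandas as pd

# Sample data from the question.
my_df = pd.DataFrame({
    "col1": [1, 1, 1],
    "col2": [3, 5, 2],
    "col3": ["ABCDEFG", "HIJKLMNO", "PQRSTUV"],
})

# Walk the three columns in parallel and slice each string with its own bounds.
my_df["new_col"] = [
    text[start - 1:stop]
    for text, start, stop in zip(my_df["col3"], my_df["col1"], my_df["col2"])
]
print(my_df)
```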
