I have a single-column dataframe:
col1
1
2
3
4
I need to create another column whose values are strings like:
Result:
col1 col2
1 Value is 1
2 Value is 2
3 Value is 3
4 Value is 4
I know about formatted strings, but I'm not sure how to apply them to a DataFrame column.
Convert column to string and prepend values:
df['col2'] = 'Value is ' + df['col1'].astype(str)
Or use f-strings with Series.map:
df['col2'] = df['col1'].map(lambda x: f'Value is {x}')
print (df)
col1 col2
0 1 Value is 1
1 2 Value is 2
2 3 Value is 3
3 4 Value is 4
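For completeness, a minimal end-to-end sketch of the first approach, building the DataFrame from the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

# cast to str, then use plain string concatenation (vectorized)
df['col2'] = 'Value is ' + df['col1'].astype(str)

print(df)
#    col1        col2
# 0     1  Value is 1
# 1     2  Value is 2
# 2     3  Value is 3
# 3     4  Value is 4
```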
Related
I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns, if there are more), without writing the below assign statement for each column.
df = pd.DataFrame({'col1':[1,2,3,4], 'col2':[2,5,6,8], 'col3':[5,5,5,9]})
df = (df
...
.assign(col2 = lambda x: x.col2 - x.col1)
)
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit: (using **kwargs with method chaining)
As mentioned in your comment, if you want to chain methods on the intermediate (partially computed) dataframe, you need to build a dictionary of per-column functions to unpack into assign, as follows (you can't use a lambda to construct the dictionary directly inside assign).
In this example I add 5 to the dataframe before chaining assign, to show that it works on the intermediate result of the chain, as you want.
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
col1 col2 col3
0 1 2 5
1 2 5 5
2 3 6 5
3 4 8 9
In [64]: df_final
Out[64]:
col1 col2 col3
0 6 1 4
1 7 3 3
2 8 3 2
3 9 4 5
Note: df_final.col1 differs from df.col1 because of the add operation before assign. Don't forget cl=cl in the dictionary's lambdas; it is there to avoid Python's late-binding issue.
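To see why the cl=cl default matters, here is a small sketch of the late-binding pitfall (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20], 'col3': [100, 200]})

# WITHOUT cl=cl: every lambda closes over the same loop variable,
# which holds 'col3' by the time the lambdas actually run
bad = {cl: lambda x: x[cl] - x['col1'] for cl in ['col2', 'col3']}

# WITH cl=cl: each lambda captures its own column name at definition time
good = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2', 'col3']}

df_bad = df.assign(**bad)
df_good = df.assign(**good)

print(df_bad['col2'].tolist())   # [99, 198]  <- computed from col3, not col2!
print(df_good['col2'].tolist())  # [9, 18]
```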
Use DataFrame.sub:
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
col1 col2 col3 sub_col2 sub_col3
0 1 2 5 1 4
1 2 5 5 3 3
2 3 6 5 3 2
3 4 8 9 4 5
If you want to assign the values back to col2 and col3, use an additional update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
col1 col2 col3
0 1 1 4
1 2 3 3
2 3 3 2
3 4 4 5
I have a data frame with true/false values stored in string format. Some values are null in the data frame.
I need to encode this data such that TRUE/FALSE/null values are encoded with the same integer in every column.
Input:
col1 col2 col3
True True False
True True True
null null True
I am using:
le = preprocessing.LabelEncoder()
df.apply(le.fit_transform)
Output:
2 1 0
2 1 1
1 0 1
But I want the output as:
2 2 0
2 2 2
1 1 2
How do I do this?
One way is to reshape to a single-column DataFrame, so the encoder is fitted on all values together:
df = df.stack(dropna=False).to_frame().apply(le.fit_transform)[0].unstack()
print (df)
col1 col2 col3
0 1 1 0
1 1 1 1
2 2 2 1
Another idea is to use DataFrame.replace with the string 'True' rather than the boolean True, because:
I have a data frame with true/false values stored in string format.
If the nulls are actual missing values:
df = df.replace({'True':2, 'False':1, np.nan:0})
If the nulls are the literal string 'null':
df = df.replace({'True':2, 'False':1, 'null':0})
print (df)
col1 col2 col3
0 2 2 1
1 2 2 2
2 0 0 2
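A self-contained sketch of the replace approach, assuming the nulls are stored as the literal string 'null' (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['True', 'True', 'null'],
                   'col2': ['True', 'True', 'null'],
                   'col3': ['False', 'True', 'True']})

# one fixed mapping applied to every column, so the codes are
# consistent across columns (unlike a per-column LabelEncoder fit)
df = df.replace({'True': 2, 'False': 1, 'null': 0})

print(df)
#    col1  col2  col3
# 0     2     2     1
# 1     2     2     2
# 2     0     0     2
```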
I have a dataframe with columns like this -
Name Id 2019col1 2019col2 2019col3 2020col1 2020col2 2020col3 2021col1 2021Ccol2 2021Ccol3
That is, the columns are repeated for each year.
I want to take the year out and make it a column, so that the final dataframe looks like -
Name Id Year col1 col2 col3
Is there a way in pandas to achieve something like this?
Use wide_to_long, but first move the year to the end of each column name (e.g. 2019col1 to col12019) with a list comprehension:
print (df)
Name Id 2019col1 2019col2 2019col3 2020col1 2020col2 2020col3 \
0 a 456 4 5 6 2 3 4
2021col1 2021col2 2021col3
0 5 2 1
df.columns = [x[4:] + x[:4] if x[:4].isnumeric() else x for x in df.columns]
df = (pd.wide_to_long(df.reset_index(),
['col1','col2', 'col3'],
i='index',
j='Year').reset_index(level=0, drop=True).reset_index())
print (df)
Year Id Name col1 col2 col3
0 2019 456 a 4 5 6
1 2020 456 a 2 3 4
2 2021 456 a 5 2 1
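Put together as a runnable sketch (the single-row frame mirrors the answer's sample data):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a'], 'Id': [456],
                   '2019col1': [4], '2019col2': [5], '2019col3': [6],
                   '2020col1': [2], '2020col2': [3], '2020col3': [4],
                   '2021col1': [5], '2021col2': [2], '2021col3': [1]})

# move the leading year to the end of each name: 2019col1 -> col12019
df.columns = [x[4:] + x[:4] if x[:4].isnumeric() else x for x in df.columns]

# wide_to_long splits each column into stub ('col1') + numeric suffix (year)
df = (pd.wide_to_long(df.reset_index(),
                      ['col1', 'col2', 'col3'],
                      i='index',
                      j='Year')
        .reset_index(level=0, drop=True)
        .reset_index())
print(df)
```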
From a simple dataframe like that in PySpark :
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to duplicate the rows in order to have each value of col1 with each value of col2 and the column count filled with 0 for those we don't have the original value. It would be like that :
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
# all combinations of col1 and col2 (distinct values of each)
all_combinations = (df.select('col1').distinct()
                      .crossJoin(df.select('col2').distinct()))
# left-join the original counts back in; pairs that were missing get 0
result = (all_combinations.join(df, on=['col1', 'col2'], how='left')
                          .fillna(0, subset=['count']))
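For reference, the same fill-all-combinations idea can be sketched in plain pandas (not PySpark) with a reindex over the product of the key values:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'C'],
                   'col2': [1, 2, 3, 1, 1],
                   'count': [4, 8, 2, 3, 6]})

# every (col1, col2) pair; missing pairs get count 0 via fill_value
full = pd.MultiIndex.from_product(
    [df['col1'].unique(), df['col2'].unique()], names=['col1', 'col2'])

out = (df.set_index(['col1', 'col2'])
         .reindex(full, fill_value=0)
         .reset_index())
print(out)
```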
I have a data frame that contains a column as the following:
1 string;string
2 string;string;string
I would like to iterate through the whole column and replace the values with the count of ";" + 1 (the number of strings), to get:
1 2
2 3
Thank you for any help.
You can use the str.count function:
print (df)
col
1 string;string
2 string;string;string
df['col'] = df['col'].str.count(';') + 1
print (df)
col
1 2
2 3
Or, equivalently, with Series.add:
df['col'] = df['col'].str.count(';').add(1)
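An equivalent that counts the pieces directly rather than the separators:

```python
import pandas as pd

df = pd.DataFrame({'col': ['string;string', 'string;string;string']},
                  index=[1, 2])

# split on ';' and count the pieces: number of separators + 1
df['col'] = df['col'].str.split(';').str.len()

print(df)
#    col
# 1    2
# 2    3
```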