I have a df with a column of strings like so:
col1
a
b
c
d
I also have a string variable x = 'x' and a list of strings list1 = ['ax', 'cx'].
I want to create a new column that checks if the concatenated string of col1 + x is in list1. If yes then col2 = 1 else col2 = 0.
Here is my attempt:
df['col2'] = 1 if str(df['col1'] + x) in list1 else 0
Which doesn't work.
df['col2'] = 1 if df['col1'] + x in list1 else 0
Doesn't work either.
What would be the correct way to format this?
Thank you for any help.
col1 col2 <-- should be this
a 1
b 0
c 1
d 0
Use isin:
df['col2'] = df.col1.add('x').isin(list1).astype(int)
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
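The attempts in the question evaluate `1 if ... else 0` once for the whole column rather than once per row, which is why they fail. As a minimal, self-contained sketch of the vectorized check above (the frame is rebuilt from the question; writing `+ x` instead of `.add('x')` is only a stylistic assumption):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd']})
x = 'x'
list1 = ['ax', 'cx']

# Concatenate element-wise, then test membership for the whole column at once.
df['col2'] = (df['col1'] + x).isin(list1).astype(int)
print(df)
```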
You can use the map function as follows.
df['col2'] = df['col1'].map(lambda val: 1 if val + x in list1 else 0)
Another solution using apply:
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c', 'd']})
def func(row):
    list1 = {'ax', 'cx'}
    row['col2'] = 1 if row.col1 + 'x' in list1 else 0
    return row
df2 = df.apply(func, axis='columns')
# OUTPUTS :
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
Related
I have a df
index col1
0 a,c
1 d,f
2 o,k
I need a df like this
index col1
0 {"col1":"a,c"}
1 {"col1":"d,f"}
2 {"col1":"o,k"}
This needs to be applied for all columns in the df.
Tried with orient, but not as expected.
For all columns, use a nested apply; the column name is available as x.name. To get dictionaries:
df = df.apply(lambda x: x.apply(lambda y: {x.name: y}))
For json use:
import json
df = df.apply(lambda x: x.apply(lambda y: json.dumps({x.name: y})))
print (df)
col1
0 {"col1": "a,c"}
1 {"col1": "d,f"}
2 {"col1": "o,k"}
Alternative solution for dictionaries:
df = pd.DataFrame({c: [{c: x} for x in df[c]] for c in df.columns}, index=df.index)
Second alternative for JSON (works well if all columns contain strings):
df = '{"' + df.columns + '": "' + df.astype(str) + '"}'
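The caveat about string-only columns matters because plain concatenation copies each value verbatim, while json.dumps escapes it. A small sketch of the difference, using a made-up value containing double quotes (the sample data is an assumption, not from the question):

```python
import json
import pandas as pd

df = pd.DataFrame({'col1': ['a,c', 'say "hi"']})

# json.dumps escapes the embedded quotes, so every cell stays valid JSON.
safe = df.apply(lambda col: col.apply(lambda v: json.dumps({col.name: v})))

# Plain concatenation copies the value verbatim and can produce invalid JSON.
naive = '{"col1": "' + df['col1'] + '"}'

print(safe['col1'].tolist())
print(naive.tolist())
```

If the values may contain quotes, backslashes, or non-string types, the json.dumps variant is likely the safer choice.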
If you want strings exactly as shown, use:
df['col1'] = '{col1:'+df['col1']+'}'
# or
c = 'col1'
df[c] = f'{{{c}:'+df[c]+'}'
output:
0 {col1:a,c}
1 {col1:d,f}
2 {col1:o,k}
Name: col1, dtype: object
or, with quotes:
df['col1'] = '{"col1":"'+df['col1']+'"}'
# or
c = 'col1'
df[c] = f'{{"{c}":"'+df[c]+'"}'
output:
index col1
0 0 {"col1":"a,c"}
1 1 {"col1":"d,f"}
2 2 {"col1":"o,k"}
for all columns:
df = df.apply(lambda c: f'{{"{c.name}":"'+c.astype(str)+'"}')
NB. ensure "index" is the index
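As a sketch of what that note means in practice, assuming "index" starts out as an ordinary column (the frame below is reconstructed from the question), it can be moved into the index with set_index before applying the formatter:

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2], 'col1': ['a,c', 'd,f', 'o,k']})

# Move "index" out of the data columns so it is not wrapped as well.
df = df.set_index('index')

# Wrap every remaining column's values as {"<column name>":"<value>"}.
out = df.apply(lambda c: f'{{"{c.name}":"' + c.astype(str) + '"}')
print(out)
```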
for dictionaries:
df['col1'] = [{'col1': x} for x in df['col1']]
output:
index col1
0 0 {'col1': 'a,c'}
1 1 {'col1': 'd,f'}
2 2 {'col1': 'o,k'}
I have a dataframe:
df
Col1 Col2 Col3
A B 5
C D 4
E F 1
I want to see only those rows which contribute to 90% of Col3. In this case the expected output will be :
Col1 Col2 Col3
A B 5
C D 4
I tried the below, but it doesn't work as expected:
df['col3'].value_counts(normalize=True) * 100
Is there any solution for the same?
Are you looking for this?
df = df[df.Col3 > 0] # optionally remove 0 valued rows
df = df.sort_values(by='Col3', ascending=False).reset_index(drop=True)
totals = df.Col3.cumsum()
cutoff = totals[totals >= df.Col3.sum() * .9].idxmin()  # first row where the running total reaches 90%
print(df[:cutoff + 1])
Output
Col1 Col2 Col3
0 A B 5
1 C D 4
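The same idea can be wrapped in a small helper. This is only a sketch: the function name `top_rows_covering` and its parameters are hypothetical, and it assumes non-negative values and a share between 0 and 1.

```python
import pandas as pd

def top_rows_covering(df, col, share):
    """Return the largest rows whose running total first reaches `share` of the column sum."""
    ordered = df.sort_values(by=col, ascending=False).reset_index(drop=True)
    totals = ordered[col].cumsum()
    # Label of the first row where the running total reaches the threshold.
    cutoff = (totals >= share * totals.iloc[-1]).idxmax()
    return ordered.iloc[:cutoff + 1]

df = pd.DataFrame({'Col1': list('ACE'), 'Col2': list('BDF'), 'Col3': [5, 4, 1]})
result = top_rows_covering(df, 'Col3', 0.9)
print(result)
```

Note that this keeps the row that crosses the threshold, whereas filtering on "cumulative sum <= target" drops it whenever the running total jumps past the target.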
@RSM, when you say 90% of the data, do you want the calculation of 90% to always start from the top, or does it need to be random?
import pandas as pd
import numpy as np
from io import StringIO
d = '''Col1 Col2 Col3
A B 5
C D 4
E F 1'''
df = pd.read_csv(StringIO(d), sep=r'\s+')
total_value = df['Col3'].sum()
target_value = 0.9 * total_value
df['Cumulative_Sum'] = df['Col3'].cumsum()
desired_df = df.loc[df['Cumulative_Sum'] <=target_value]
print(desired_df)
I have a data frame like this
col1 col2
[A, B] 1
[A, C] 2
I would like to split col1 into two columns, with the output in this form:
col1_A col1_B col2
A B 1
A C 2
I have tried this df['col1'].str.rsplit(',',n=2, expand=True)
but it showed TypeError: list indices must be integers or slices, not str
join + pop
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
                          columns=['col1_A', 'col1_B']))
print(df)
col2 col1_A col1_B
0 1 A B
1 2 A C
It's good practice to try to avoid pd.Series.apply, which often amounts to a Python-level loop with additional overhead.
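For a reproducible version of the join + pop idea, here is a minimal sketch with the sample frame built from the question. Joining the new columns on the left, so that they come first as in the requested output, is a small variation and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'col1': [['A', 'B'], ['A', 'C']], 'col2': [1, 2]})

# pop removes col1 from df and returns it; each list becomes one row of a new frame.
parts = pd.DataFrame(df.pop('col1').tolist(),
                     index=df.index,
                     columns=['col1_A', 'col1_B'])

# Join the remaining columns onto the new ones so col1_A and col1_B come first.
df = parts.join(df)
print(df)
```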
You can use apply:
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
df = df[df.columns[[1,2,0]]]
print(df)
col1_A col1_B col2
0 A B 1
1 A C 2
You can do this:
>> df_expanded = df['col1'].apply(pd.Series).rename(
       columns=lambda x: 'col1_' + str(x))
>> df_expanded
col1_0 col1_1
0 A B
1 A C
Adding these columns to the original dataframe:
>> pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)
col1_0 col1_1 col2
0 A B 1
1 A C 2
If columns need to be named as the first element in the rows:
df_expanded.columns = ['col1_' + value
                       for value in df_expanded.iloc[0,:].values.tolist()]
col1_A col1_B
0 A B
1 A C
Zip the values with the column names and use insert to place each new column at the right position.
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
Full example
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
print(df)
Returns:
col1_A col1_B col2
0 A B 1
1 A C 2
I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
Course_66 0\nCourse_67 1\nCourse_68 0 a c
Course_66 1\nCourse_67 0\nCourse_68 0 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please, note that col1 contains one long string.
Please, any help would be very appreciated.
Many Thanks in advance.
Best Regards,
Carlo
Use:
# first split on whitespace into a DataFrame
df1 = df['col1'].str.split(expand=True)
# in each column, split on the literal \n and keep the first part (the value)
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
# take the column names from the first row: the second part of each split
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
# split on whitespace, which also splits on \n
df1 = df['course_vector'].str.split(expand=True)
# select every second column, starting from the second
df2 = df1.iloc[:, 1::2]
# use every second value of the first row, starting from the first, as column names
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
Since your data are ordered in value, key pairs, you can split on newlines and multiple spaces with regex to get a list, and then take every other value starting at the first position for values and the second position for labels and return a Series object. By applying, you will get back a DataFrame from these multiple series, which you can then combine with the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
'0\nCourse_66 1\nCourse_67 0\nCourse_68'],
'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
    # take the numeric values for each series and column labels and return as a series
    # by taking every other value
    return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
I have an example of Dataframe df:
Col1 Col2
a "some string AXA some string "
b "some string2"
I would like to:
if df.Col2 contains "AXA" then change the value to 1, if not then change it to 0.
So I get:
Col1 Col2
a 1
b 0
I've tried something like,
if "AXA" in df['Col2']:
df['Col2'] = 1
or if I can do something like
df.loc[df['Col2'] contains "AXA"] = 1
Thank you for help !
You can use str.contains to get a boolean mask and then cast it to int:
print (df.Col2.str.contains('AXA'))
0 True
1 False
Name: Col2, dtype: bool
df['Col2'] = df.Col2.str.contains('AXA').astype(int)
print (df)
Col1 Col2
0 a 1
1 b 0
EDIT: If you need the output to depend on two conditions, a fast option is a nested numpy.where:
print (df)
Col1 Col2
0 a some string AXA some string
1 a some string AXE some string
2 b some string2
import numpy as np
df['Col2'] = np.where(df.Col2.str.contains('AXA'), 1,
             np.where(df.Col2.str.contains('AXE'), 2, 0))
print (df)
Col1 Col2
0 a 1
1 a 2
2 b 0
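With more than two patterns, nested numpy.where calls become hard to read. One possible alternative, not part of the answer above, is numpy.select, which takes a list of conditions and a matching list of values and uses the first condition that holds for each row. The sample data mirrors the frame printed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['a', 'a', 'b'],
    'Col2': ['some string AXA some string',
             'some string AXE some string',
             'some string2'],
})

# Conditions are checked in order; the first match decides the value.
conditions = [df['Col2'].str.contains('AXA'), df['Col2'].str.contains('AXE')]
choices = [1, 2]

# Rows matching no condition get the default.
df['Col2'] = np.select(conditions, choices, default=0)
print(df)
```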