I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
Course_66 0\nCourse_67 1\nCourse_68 0 a c
Course_66 1\nCourse_67 0\nCourse_68 0 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please note that col1 contains one long string.
Any help would be much appreciated.
Many thanks in advance.
Best Regards,
Carlo
Use:
#first split on whitespace into separate columns
df1 = df['col1'].str.split(expand=True)
#for each column, split on the literal \n and keep the first part (the value)
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
#for the column names, use the first row only and keep the second part of each split
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
#split values on whitespace, which also splits on \n
df1 = df['course_vector'].str.split(expand=True)
#select every second column, starting from the second
df2 = df1.iloc[:, 1::2]
#for the column names, take every second value of the first row, starting from the first
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
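The EDIT above has no sample data, so here is a minimal, self-contained sketch of the same steps. The contents of 'course_vector' are an assumption (the real column is not shown in the thread): "name value" pairs separated by real newline characters. The names parts, values and result are only illustrative.

```python
import pandas as pd

# Hypothetical data: assumed layout is "name value" pairs joined by real newlines
df = pd.DataFrame({
    'course_vector': ['Course_66 0\nCourse_67 0\nCourse_68 1',
                      'Course_66 0\nCourse_67 1\nCourse_68 0'],
    'col2': ['a', 'a'],
    'col3': ['c', 'd'],
})

# str.split() without a pattern splits on any whitespace, newlines included
parts = df['course_vector'].str.split(expand=True)

# positions 1, 3, 5 hold the values
values = parts.iloc[:, 1::2].copy()
# positions 0, 2, 4 of the first row hold the course names
values.columns = parts.iloc[0, 0::2].tolist()

# attach the remaining original columns
result = values.join(df.drop('course_vector', axis=1))
print(result)
```

Note that the values stay strings after the split; add .astype(int) on the course columns if numbers are needed.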
Since your data are ordered in (value, key) pairs, you can split on newlines and runs of spaces with a regex to get a list. Taking every other element starting at the first position gives the values, and every other element starting at the second position gives the labels; return these as a Series. Applying this function to each row gives back a DataFrame built from those Series, which you can then combine with the rest of the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
'0\nCourse_66 1\nCourse_67 0\nCourse_68'],
'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
    # take the numeric values for each series and column labels and return as a series
    # by taking every other value
    return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Related
I have two df with the same numbers of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is everything in df2 that was not already in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work with different rows of df.
Use DataFrame.explode on the split values of column col2, then use DataFrame.merge with a right join and the indicator parameter, filter with boolean indexing to keep only the right_only rows, and finally aggregate with join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
        .groupby('col1')['col2']
        .agg(','.join)
        .reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
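For small frames, the same result can also be reached without explode and merge. The sketch below is an alternative technique (plain Python sets), not the method from the answer above. It assumes each col1 key appears at most once in df1; seen, rows and new_only are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c'],
                    'col2': ['1,2,3,4', '1,2,3', '1']})
df2 = pd.DataFrame({'col1': ['b', 'c', 'd', 'e'],
                    'col2': ['1,3', '1,2', '1,2,3', '1,2']})

# map each key in df1 to the set of values already known for it
# (assumes keys in df1 are unique; a repeated key would overwrite the earlier one)
seen = {k: set(v.split(',')) for k, v in zip(df1['col1'], df1['col2'])}

rows = []
for key, values in zip(df2['col1'], df2['col2']):
    # keep df2's order and drop values already present in df1 for the same key
    new_values = [v for v in values.split(',') if v not in seen.get(key, set())]
    if new_values:
        rows.append({'col1': key, 'col2': ','.join(new_values)})

new_only = pd.DataFrame(rows, columns=['col1', 'col2'])
print(new_only)
```

This loops in Python, so for large frames the explode and merge approach is probably the better choice.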
I have a df with a column of strings like so:
col1
a
b
c
d
I also have a string variable x = 'x' and a list of strings list1 = ['ax', 'cx']
I want to create a new column that checks if the concatenated string of col1 + x is in list1. If yes then col2 = 1 else col2 = 0.
Here is my attempt:
df['col2'] = 1 if str(df['col1'] + x) in list1 else 0
Which doesn't work.
df['col2'] = 1 if df['col1'] + x in list1 else 0
Doesn't work either.
What would be the correct way to format this?
Thank you for any help.
col1 col2 <-- should be this
a 1
b 0
c 1
d 0
Use isin:
df['col2'] = df.col1.add('x').isin(list1).astype(int)
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
You can use the map function as follows:
df['col2'] = df['col1'].map(lambda val: 1 if val + x in list1 else 0)
Another solution using apply:
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c', 'd']})
def func(row):
    list1 = {'ax', 'cx'}
    row['col2'] = 1 if row.col1 + 'x' in list1 else 0
    return row
df2 = df.apply(func, axis='columns')
# OUTPUTS :
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
I have a df with several columns. Based on the value (1-6) in each of these columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column by column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this, (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both groups of columns with DataFrame.loc, build the boolean mask from the first group, rename its columns to those of the second group, and finally use DataFrame.add on the second group's columns only:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
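Note that the NumPy version above overwrites colA to colC, which gives the same result only because they start at zero. If the target columns may already hold counts, a variant that increments them could look like the sketch below. It reuses the question's sample data; src, dst and mask are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 6, 3, 5, 2],
                   'col2': [4, 5, 6, 6, 1, 3],
                   'col3': [3, 6, 5, 1, 1, 6],
                   'colA': [0, 0, 0, 0, 0, 0],
                   'colB': [0, 0, 0, 0, 0, 0],
                   'colC': [0, 0, 0, 0, 0, 0]})

src = ['col1', 'col2', 'col3']   # columns holding the values 1-6
dst = ['colA', 'colB', 'colC']   # corresponding target columns, in the same order

# boolean mask with the question's condition, one column per source column
mask = (df[src] != 1) & (df[src] < 6)

# to_numpy() drops the labels, so the mask lines up with dst by position;
# True counts as 1 when added to the integer columns
df[dst] = df[dst] + mask.to_numpy()
print(df)
```

The two lists carry the column pairing, so they must be kept in matching order.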
Given below is my dataframe
df = pd.DataFrame({'Col1':['1','2'],'Col2':[{'a':['a1','a2']},{'b':['b1']}]})
Col1 Col2
0 1 {u'a': [u'a1', u'a2']}
1 2 {u'b': [u'b1']}
I need to reformat this data frame as below
Col1 NCol2 NCol3
0 1 a a1
1 1 a a2
2 2 b b1
Basically, for each key-value pair in the dictionary, I am adding a row with the key in NCol2 and each value in NCol3.
Thanks in advance for your help.
You can use the following solution:
df1 = df['Col2'].apply(pd.Series).apply(lambda x: x.explode())\
.stack().reset_index(level=1)
df1.columns = ['Col2', 'Col3']
df.drop('Col2', axis=1).merge(df1, left_index=True, right_index=True)\
.reset_index(drop=True)
Output:
Col1 Col2 Col3
0 1 a a1
1 1 a a2
2 2 b b1
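If the chained version is hard to follow, the rows can also be built directly with a nested comprehension. This is a different technique from the answer above, shown as a rough sketch on the question's sample data; rows and out are illustrative names, and the output columns use the names from the question.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1', '2'],
                   'Col2': [{'a': ['a1', 'a2']}, {'b': ['b1']}]})

# one output row per (Col1, key, list item) combination
rows = [
    (col1, key, item)
    for col1, mapping in zip(df['Col1'], df['Col2'])
    for key, items in mapping.items()
    for item in items
]

out = pd.DataFrame(rows, columns=['Col1', 'NCol2', 'NCol3'])
print(out)
```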
I have a data frame like this
col1 col2
[A, B] 1
[A, C] 2
I would like to split col1 into two columns, with the output in this form:
col1_A col1_B col2
A B 1
A C 2
I have tried df['col1'].str.rsplit(',',n=2, expand=True)
but it raised TypeError: list indices must be integers or slices, not str
join + pop
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
columns=['col1_A', 'col1_B']))
print(df)
col2 col1_A col1_B
0 1 A B
1 2 A C
It's good practice to try to avoid pd.Series.apply, which often amounts to a Python-level loop with additional overhead.
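To get a feel for the difference, here is a rough timing sketch. The data size and the helper names with_apply and with_constructor are made up for illustration, and the timings will vary by machine and pandas version, so no numbers are quoted here.

```python
import timeit
import pandas as pd

# hypothetical larger frame with the same list-column layout as the question
df = pd.DataFrame({'col1': [['A', 'B'], ['A', 'C']] * 1000,
                   'col2': range(2000)})

def with_apply():
    # builds one Series per row at Python level
    return df['col1'].apply(pd.Series)

def with_constructor():
    # hands the whole list of lists to the DataFrame constructor at once
    return pd.DataFrame(df['col1'].tolist(), index=df.index)

# both approaches produce the same values
same = with_apply().values.tolist() == with_constructor().values.tolist()
print(same)

print('apply:      ', timeit.timeit(with_apply, number=1))
print('constructor:', timeit.timeit(with_constructor, number=1))
```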
You can use apply:
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
df = df[df.columns[[1,2,0]]]
print(df)
col1_A col1_B col2
0 A B 1
1 A C 2
You can do this:
>> df_expanded = df['col1'].apply(pd.Series).rename(
columns = lambda x : 'col1_' + str(x))
>> df_expanded
col1_0 col1_1
0 A B
1 A C
Adding these columns to the original dataframe:
>> pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)
col1_0 col1_1 col2
0 A B 1
1 A C 2
If columns need to be named as the first element in the rows:
df_expanded.columns = ['col1_' + value
for value in df_expanded.iloc[0,:].values.tolist()]
col1_A col1_B
0 A B
1 A C
Zip the values with the column names and use insert to put each new column in the right position.
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
Full example
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
print(df)
Returns:
col1_A col1_B col2
0 A B 1
1 A C 2