I have a df
index col1
0 a,c
1 d,f
2 o,k
I need a df like this
index col1
0 {"col1":"a,c"}
1 {"col1":"d,f"}
2 {"col1":"o,k"}
This needs to be applied for all columns in the df.
Tried with orient, but not as expected.
For all columns use double apply, columns name is passed by x.name, get dictionary:
df = df.apply(lambda x: x.apply(lambda y: {x.name: y}))
For json use:
import json
df = df.apply(lambda x: x.apply(lambda y: json.dumps({x.name: y})))
print (df)
col1
0 {"col1": "a,c"}
1 {"col1": "d,f"}
2 {"col1": "o,k"}
Alternative solution for dictionaries:
df = pd.DataFrame({c: [{c: x} for x in df[c]] for c in df.columns}, index=df.index)
Alterative2 solution for json (working well if all columns are filled by strings):
df = '{"' + df.columns + '": "' + df.astype(str) + '"}'
If you want strings exactly as shown, use:
df['col1'] = '{col1:'+df['col1']+'}'
# or
c = 'col1'
df[c] = f'{{{c}:'+df[c]+'}'
output:
0 {col1:a,c}
1 {col1:d,f}
2 {col1:o,k}
Name: col1, dtype: object
or, with quotes:
df['col1'] = '{"col1":"'+df['col1']+'"}'
# or
c = 'col1'
df[c] = f'{{"{c}":"'+df[c]+'"}'
output:
index col1
0 0 {"col1":"a,c"}
1 1 {"col1":"d,f"}
2 2 {"col1":"o,k"}
for all columns:
df = df.apply(lambda c: f'{{"{c.name}":"'+c.astype(str)+'"}')
NB. ensure "index" is the index
for dictionaries:
df['col1'] = [{'col1': x} for x in df['col1']]
output:
index col1
0 0 {'col1': 'a,c'}
1 1 {'col1': 'd,f'}
2 2 {'col1': 'o,k'}
Related
I have a df with a column of strings like so:
col1
a
b
c
d
I also have a string variable x = 'x' and a list of strings list1 = ['ax', cx']
I want to create a new column that checks if the concatenated string of col1 + x is in list1. If yes then col2 = 1 else col2 = 0.
Here is my attempt:
df['col2'] = 1 if str(df['col1'] + x) in list1 else 0
Which doesn't work.
df['col2'] = 1 if df['col1'] + x in list1 else 0
Doesn't work either.
What would be the correct way to format this?
Thank you for any help.
col1 col2 <-- should be this
a 1
b 0
c 1
d 0
Use isin:
df['col2'] = df.col1.add('x').isin(list1).astype(int)
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
Check Results
You can use map function as follows.
df['col2'] = df['col1'].map(lambda val: 1 if x + val in list1 else 0)
An other solution using apply :
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c', 'd']})
def func(row):
list1 = {'ax', 'cx'}
row['col2'] = 1 if row.col1 + 'x' in list1 else 0
return row
df2 = df.apply(func, axis='columns')
# OUTPUTS :
# col1 col2
#0 a 1
#1 b 0
#2 c 1
#3 d 0
I have a dataframe which has a dictionary inside a nested list and i want to split the column 'C' :
A B C
1 a [ {"id":2,"Col":{"x":3,"y":4}}]
2 b [ {"id":5,"Col":{"x":6,"y":7}}]
expected output :
A B C_id Col_x Col_y
1 a 2 3 4
2 b 5 6 7
From the comments, json_normalize might help you.
After extracting id and col columns with:
df[["Col", "id"]] = df["C"].apply(lambda x: pd.Series(x[0]))
You can explode the dictionary in Col with json_normalize and use concat to merge with existing dataframe:
df = pd.concat([df, json_normalize(df.Col)], axis=1)
Also, use drop to remove old columns.
Full code:
# Import modules
import pandas as pd
from pandas.io.json import json_normalize
# from flatten_json import flatten
# Create dataframe
df = pd.DataFrame([[1, "a", [ {"id":2,"Col":{"x":3,"y":4}}]],
[2, "b", [ {"id":5,"Col":{"x":6,"y":7}}]]],
columns=["A", "B", "C"])
# Add col and id column + remove old "C" column
df = pd.concat([df, df["C"].apply(lambda x: pd.Series(x[0]))], axis=1) \
.drop("C", axis=1)
print(df)
# A B Col id
# 0 1 a {'x': 3, 'y': 4} 2
# 1 2 b {'x': 6, 'y': 7} 5
# Show json_normalize behavior
print(json_normalize(df.Col))
# x y
# 0 3 4
# 1 6 7
# Explode dict in "col" column + remove "Col" colun
df = pd.concat([df, json_normalize(df.Col)], axis=1) \
.drop(["Col"], axis=1)
print(df)
# A B id x y
# 0 1 a 2 3 4
# 1 2 b 5 6 7
You can try .apply method
df['C_id'] = df['C'].apply(lambda x: x[0]['id'])
df['C_x'] = df['C'].apply(lambda x: x[0]['Col']['x'])
df['C_y'] = df['C'].apply(lambda x: x[0]['Col']['y'])
Code
import pandas as pd
A = [1, 2]
B = ['a', 'b']
C = [{"id":2,"Col":{"x":3,"y":4}}, {"id":5,"Col":{"x":6,"y":7}}]
df = pd.DataFrame({"A": A, "B": B, "C_id": [element["id"] for element in C],
"Col_x": [element["Col"]["x"] for element in C],
"Col_y": [element["Col"]["y"] for element in C]})
Ouput:
I have below dataframe
clm1, clm2, clm3
10, a, clm4=1|clm5=5
11, b, clm4=2
My desired result is
clm1, clm2, clm4, clm5
10, a, 1, 5
11, b, 2, Nan
I have tried below method
rows = list(df.index)
dictlist = []
for index in rows: #loop through each row to convert clm3 to dict
i = df.at[index, "clm3"]
mydict = dict(map(lambda x: x.split('='), [x for x in i.split('|') if '=' in x]))
dictlist.append(mydict)
l=json_normalize(dictlist) #convert dict column to flat dataframe
resultdf = example.join(l).drop('clm3',axis=1)
This is giving me desired result but I am looking for a more efficient way to convert clm3 to dict which does not involve looping through each row.
two steps :
idea is to create a double split and then group by the index and unstack the values as columns
s = (
df["clm3"]
.str.split("|", expand=True)
.stack()
.str.split("=", expand=True)
.reset_index(level=1, drop=True)
)
final = pd.concat([df, s.groupby([s.index, s[0]])[1].sum().unstack()], axis=1).drop(
"clm3", axis=1
)
print(final)
clm1 clm2 clm4 clm5
0 10 a 1 5
1 11 b 2 NaN
Using str.extractall to get your values and unstack to pivot them to a column for each unique value.
And str.get_dummies to get a column for each unique clm.
values = (
df['clm3'].str.extractall('(=\d)')[0]
.str.replace('=', '')
.unstack()
.rename_axis(None, axis=1)
)
columns = df['clm3'].str.replace('=\d', '').str.get_dummies(sep='|').columns
values.columns = columns
dfnew = pd.concat([df[['clm1', 'clm2']], values], axis=1)
clm1 clm2 0 1
0 10 a 1 5
1 11 b 2 NaN
I have a data frame like this
col1 col2
[A, B] 1
[A, C] 2
I would like to separate col1 into two columns and the output, I would like it out in this form
col1_A col1_B col2
A B 1
A C 2
I have tried this df['col1'].str.rsplit(',',n=2, expand=True)
but it showed TypeError: list indices must be integers or slices, not str
join + pop
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
columns=['col1_A', 'col1_B']))
print(df)
col2 col1_A col1_B
0 1 A B
1 2 A C
It's good practice to try and avoid pd.Series.apply, which often amounts a Python-level loop with an additional overhead.
You can use apply:
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
df = df[df.columns[[1,2,0]]]
print(df)
col1_A col1_B col2
0 A B 1
1 A C 2
You can do this:
>> df_expanded = df['col1'].apply(pd.Series).rename(
columns = lambda x : 'col1_' + str(x))
>> df_expanded
col1_0 col1_1
0 A B
1 A C
Adding these columns to the original dataframe:
>> pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)
col1_0 col1_1 col2
0 A B 1
1 A C 2
If columns need to be named as the first element in the rows:
df_expanded.columns = ['col1_' + value
for value in df_expanded.iloc[0,:].values.tolist()]
col1_A col1_B
0 A B
1 A C
Zip values and column name and use insert to get right position.
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
df.insert(ind, v, k)
Full example
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
for ind,(k,v) in enumerate(zip(zip(*df.pop('col1').tolist()),['col1_A', 'col1_B'])):
df.insert(ind, v, k)
print(df)
Returns:
col1_A col1_B col2
0 A B 1
1 A C 2
I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
Course_66 0\nCourse_67 1\nCourse_68 0 a c
Course_66 1\nCourse_67 0\nCourse_68 0 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please, note that col1 contains one long string.
Please, any help would be very appreciated.
Many Thanks in advance.
Best Regards,
Carlo
Use:
#first split by whitespaces to df
df1 = df['col1'].str.split(expand=True)
#for each column split by \n and select first value
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
#for columns select only first row and select second splitted value
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
#split values by whitespaces - it split by \n too
df1 = df['course_vector'].str.split(expand=True)
#select each pair columns
df2 = df1.iloc[:, 1::2]
#for columns select each unpair value in first row
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
Since your data are ordered in value, key pairs, you can split on newlines and multiple spaces with regex to get a list, and then take every other value starting at the first position for values and the second position for labels and return a Series object. By applying, you will get back a DataFrame from these multiple series, which you can then combine with the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
'0\nCourse_66 1\nCourse_67 0\nCourse_68'],
'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
# take the numeric values for each series and column labels and return as a series
# by taking every other value
return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d