Partially merge pandas data frame columns into a dictionary column - python

Given a Pandas df:
col1 col2 col3 col4
a 1 2 56
a 3 4 1
a 5 6 1
b 7 8 2
b 9 10 -11
c 11 12 9
...
Using pandas, how can I reshape this data frame so that several columns are represented by a single dictionary column, with the column names as keys?
col1 dict_col
a { 'col2':1 ,'col3':2 , 'col4':56 }
a { 'col2':3 ,'col3':4 , 'col4':1 }
a { 'col2':5 ,'col3':6 , 'col4':1 }
b { 'col2':7 ,'col3':8 , 'col4':2 }
b { 'col2':9 ,'col3':10, 'col4':-11}
c { 'col2':11 ,'col3':12, 'col4':9 }
Note that this transformation needs to be done with pandas only, and only for some of the columns, across all rows of the data frame.

You can use this code:
import pandas as pd
df = pd.DataFrame({
'col1': ['a', 'a', 'a', 'b', 'b', 'c'],
'col2': [ 1, 3, 5, 7, 9, 11],
'col3': [ 2, 4, 6, 8, 10, 12],
'col4': [ 56, 1, 1, 2, -11, 9]
})
cols = ['col2', 'col3', 'col4']
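lst = []
# Build one {column: value} dict per row from the selected columns
for _, row in df[cols].iterrows():
    lst.append({col: row[col] for col in cols})
df['dict_col'] = lst
# Keep only the key column and the new dictionary column
df = df[['col1', 'dict_col']]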
print(df)
Output :
col1 dict_col
0 a {'col2': 1, 'col3': 2, 'col4': 56}
1 a {'col2': 3, 'col3': 4, 'col4': 1}
2 a {'col2': 5, 'col3': 6, 'col4': 1}
3 b {'col2': 7, 'col3': 8, 'col4': 2}
4 b {'col2': 9, 'col3': 10, 'col4': -11}
5 c {'col2': 11, 'col3': 12, 'col4': 9}

Try this command:
pd.DataFrame({'col1': df['col1'].values, 'dict_col': df.drop('col1', axis=1).to_dict(orient='records')})

Build the content of the dict_col column (NB: with orient="records", to_dict() returns a list of dictionaries):
dict_col = df.loc[:, ["col2", "col3","col4"]].to_dict(orient="records")
Then create df2 as a new dataframe (having 2 columns, col1, dict_col):
df2 = pd.DataFrame({"col1": df["col1"], "dict_col": dict_col})
print(df2)
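The to_dict(orient="records") approach from the answers above can be wrapped in a small reusable function. This is only a minimal sketch: the helper name pack_columns and its parameters are hypothetical and do not come from the original answers.

```python
import pandas as pd

def pack_columns(frame, cols, new_col="dict_col"):
    """Replace `cols` with a single column of per-row dictionaries.

    Hypothetical helper for illustration; name and signature are assumptions.
    """
    # orient="records" yields one {column: value} dict per row.
    records = frame[cols].to_dict(orient="records")
    out = frame.drop(columns=cols)
    # Assigning a plain list is positional, so the original index is kept.
    out[new_col] = records
    return out

df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'c'],
    'col2': [1, 3, 5, 7, 9, 11],
    'col3': [2, 4, 6, 8, 10, 12],
    'col4': [56, 1, 1, 2, -11, 9],
})
result = pack_columns(df, ['col2', 'col3', 'col4'])
print(result)
```

Because the list is assigned positionally, this variant should also behave sensibly when the frame has a non-default index, which is not the case for every answer above.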

Related

Python appending a list to dataframe column

I have a dataframe loaded from a Stata file and I would like to add a new column in which each row holds a list of numbers. How can one accomplish this? I have been trying plain assignment, but it complains about the index size.
I tried initializing a new column of strings (also tried integers) and then something like this, but it didn't work:
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
import numpy as np
# end_val is inclusive, hence the +1 for np.arange's exclusive stop
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
                                                           testdf['end_val'] + 1))
testdf
Output:
col_1 col_2 start_val end_val list
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
Let's use comprehension and zip with a pd.Series constructor and np.arange to create the lists.
If you'd rather stick with the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
col_1 col_2 start_val end_val range
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
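Since the stated goal is to call explode on the list column afterwards, the two steps can be sketched together. This is an illustration under the assumption that the toy data from the question is used. It builds plain Python lists with range instead of np.arange, which is a swapped-in detail rather than the method from the answers.

```python
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'],
        'start_val': [1, 7, 9, 10], 'end_val': [3, 11, 12, 15]}
testdf = pd.DataFrame(data)

# One list per row; end_val is inclusive, so the range stops at end_val + 1.
testdf['list'] = [list(range(start, end + 1))
                  for start, end in zip(testdf['start_val'], testdf['end_val'])]

# explode() emits one row per list element and repeats the other columns.
exploded = testdf.explode('list').reset_index(drop=True)
print(exploded)
```

Note that explode leaves the exploded column with object dtype, so a follow-up astype(int) may be needed before doing arithmetic on it.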

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
Both dataframes use 'A' as their index.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
Desired result for df1:
A B
1 8
3 10
5 12
Maybe isin() is what you're looking for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
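A further option, not taken from the answers above, is to treat df2 as a lookup table and use Series.map. The sketch below assumes that the values in df2's column 'A' are unique; the variable name lookup is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# Series indexed by A whose values are df2's B: acts as a key -> value lookup.
lookup = df2.set_index('A')['B']

# map() looks up each key of df1['A'] in the lookup Series.
df1['B'] = df1['A'].map(lookup)
print(df1)
```

Unlike the isin() variant, this does not depend on the two frames being sorted the same way. Keys that are missing from df2 would come back as NaN, so chaining .fillna() with the original column may be needed in that case.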

Pandas sample by filter criteria

I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
var1 var2 class
0 1 5 a
1 2 6 a
2 3 7 c
3 4 8 b
I would like to be able to change the proportions of the class column. For example, I would like to randomly down-sample class a by 50% but keep the number of rows for the other classes the same. The result would be:
df
var1 var2 class
0 1 5 a
1 3 7 c
2 4 8 b
How can this be done?
My approach was to first split the DataFrame into df_selection and df_remaining.
I then reduced df_selection by REMOVE_PERCENTAGE and concatenated the result with df_remaining again.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
# A list selector always returns a DataFrame, even if only one row matches
df_selection = df.loc[['a']].reset_index()
df_remaining = df.drop('a').reset_index()
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]) \
.reset_index(drop=True)
print(df_result)
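The same split, reduce, and recombine idea can probably be written more compactly with DataFrame.sample. The following is a sketch rather than the answerer's code: KEEP_FRACTION and the fixed random_state are assumptions added for illustration and reproducibility.

```python
import pandas as pd

d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)

KEEP_FRACTION = 0.5  # assumed name: share of class 'a' rows to keep

is_a = df['class'] == 'a'
# Randomly keep a fraction of the class 'a' rows; random_state makes it repeatable.
kept_a = df[is_a].sample(frac=KEEP_FRACTION, random_state=0)

# Recombine with the untouched classes and restore the original row order.
df_result = pd.concat([kept_a, df[~is_a]]).sort_index().reset_index(drop=True)
print(df_result)
```

Which of the two class a rows survives depends on the random seed; the other classes are never touched.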

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'n' : [ 1, 2, 2, 3, 3, 3],
'v' : [ 10, 13, 13, 8, 8, 8],
'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
Without the copy, if you modify values in df1 later you will find that the modifications do not propagate back to the original data (df), and that pandas may issue a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
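To see why groupby(level=0).cumcount() produces the repetition index, it may help to inspect the intermediate index. The sketch below reuses the data from the question; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# Repeating the index labels duplicates each row n times and keeps its label.
repeated = df.loc[df.index.repeat(df['n'])]
index_after_repeat = repeated.index.tolist()  # [0, 1, 1, 2, 2, 2]

# Grouping on that duplicated index and counting within each group
# numbers the copies of every original row starting from 0.
repeat_id = repeated.groupby(level=0).cumcount().tolist()  # [0, 0, 1, 0, 1, 2]
print(index_after_repeat, repeat_id)
```

This only works before reset_index(drop=True) is called, because the duplicated labels are what identify the copies of each original row.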

Pandas dataframe addition on selecting 2 or more columns

When two dataframes have the same columns, how can I select particular columns and add the dataframes?
The dataframes in pandas are as follows:
a_val = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]}
b_val = {'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]}
a = pd.DataFrame(a_val)
b = pd.DataFrame(b_val)
How can I produce the resulting dataframe C?
[The original post illustrated dataframes A, B, and C at this point; the illustrations are missing.]
I think you need merge, then add the two col3 columns together:
c = (pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
       .assign(col3=lambda x: x.col3 + x.col3_)
       .drop(columns='col3_'))
Which is the same as:
c = pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
c.col3 = c.col3.add(c.col3_)
c = c.drop(columns='col3_')
print (c)
col1 col2 col3
0 1 3 14
1 2 4 41
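An alternative that relies on index alignment instead of merge might look like the following. It is a sketch under the assumption that each (col1, col2) pair occurs at most once per dataframe; it is not the method from the answer above.

```python
import pandas as pd

a = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]})
b = pd.DataFrame({'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]})

keys = ['col1', 'col2']

# DataFrame arithmetic aligns on the index, so keys found in only one
# frame end up as NaN in the sum.
summed = a.set_index(keys).add(b.set_index(keys))

# Drop the unmatched keys, restore the integer dtype, and turn the keys
# back into ordinary columns.
c = summed.dropna().astype(int).reset_index()
print(c)
```

The dropna() step is what reproduces the inner-join behaviour of the merge; passing fill_value=0 to add() instead would keep the unmatched row from b.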
