How to aggregate columns in pandas

How to aggregate columns in pandas - python

I have a data frame with number of columns I want to group them under two main groups A and B while preserving the old columns names as dictionary as follow
index userid col1 col2 col3 col4 col5 col6 col7
0 1 6 3 Nora 100 11 22 44
the desired data frame is as follow
index userid A B
0 1 {"col1":6, "col2":3, "col3":"Nora","col4":100} {"col5":11, "col6":22, "col7":44}

To match your desired dataframe exactly:
>>> import pandas as pd
# recreating your data
>>> df = pd.DataFrame.from_dict({'index': [0], 'userid': [1], 'col1': [6], 'col2': [3], 'col3': ['Nora'], 'col4': [100], 'col5': [11], 'col6': [22], 'col7': [44]})
# copy of unchanged columns
>>> df_new = df[['index', 'userid']].copy()
# grouping columns together
>>> df_new['A'] = df[['col1', 'col2', 'col3', 'col4']].copy().to_dict(orient='records')
>>> df_new['B'] = df[['col5', 'col6', 'col7']].copy().to_dict(orient='records')
>>> df_new
index userid A B
0 0 1 {'col1': 6, 'col2': 3, 'col3': 'Nora', 'col4': 100} {'col5': 11, 'col6': 22, 'col7': 44}

You can try something like this:
d = {'col1': 'A',
'col2': 'A',
'col3': 'A',
'col4': 'A',
'col5': 'B',
'col6': 'B',
'col7': 'B'}
df.groupby(d, axis=1).apply(pd.DataFrame.to_dict, orient='series').to_frame().T
Output:
A B
0 {'col1': [6], 'col2': [3], 'col3': ['Nora'], '... {'col5': [11], 'col6': [22], 'col7': [44]}

Working with the original dataframe.
import pandas as pd
df1 = pd.DataFrame({'index':[0], 'userid':[1],
'col1': [6], 'col2': [3], 'col3': ['Nora'] ,'col4':[100],
'col5':[11], 'col6': [22], 'col7':[44]})
df1['A'] = df1[['col1', 'col2', 'col3', 'col4']].to_dict(orient='records')
df1['B'] = df1[['col5', 'col6', 'col7']].to_dict(orient='records')
df1.drop(df1.columns[range(2, 9)], axis=1, inplace=True)
print(df1)

Related

Combine rows of df based on sequence in 1 column

I have df like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2]}
df = pd.DataFrame(data=d)
I'd like to iterate over col3 whenever the next row is increasing, until the same value reoccurs, then combine rows/columns like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2], 'col4': ['B/Done;C/Open;K/Open', 'C/Open;K/Open', 'None', 'None', 'M/Open', 'None']}
df = pd.DataFrame(data=d)
I have thousands of rows, so I am trying to avoid using a for loop if possible.

I believe you can't perform this in a vectorial way.
Here is a working approach, but using a loop in a custom function:
def combine(series):
out = []
for s in series.iloc[1:]:
out.append(out[-1]+';'+s if out else s)
out = out[::-1]
out.append(None)
return pd.Series(out, index=series.index)
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]
df['col4'] = (df.assign(col=df['col1']+'/'+df['col2'])
.groupby(group, sort=False)['col']
.apply(combine)
)
output:
col1 col2 col3 col4
0 A Open 1 B/Done;C/Open;K/Open
1 B Done 2 B/Done;C/Open
2 C Open 3 B/Done
3 K Open 3 None
4 L Done 1 M/Open
5 M Open 2 None

How to show data with different number of columns Pandas?

I load CSV document with different number of columns. Therefore I got this error:
Expected 12 fields in line 29, saw 13
To avoid this error I use the hack names=range(24)
df = pd.read_csv(filename, header=None, quoting=csv.QUOTE_NONE, dtype='object', sep=data_file_delimiter, engine='python', encoding = "utf-8", names=range(24))
Problem is I need to know the real number of columns to group this data further into dict data:
data = {}
for row in df.rows:
line = line.strip()
row = line.split(' ')
if len(row) not in data:
data[ len(row) ] = []
data[ len(row) ].append(row)

You can have the number of columns using len(df.columns) but if you only want to convert a pandas df to a dictionary then there are already many built-in methods as given below,
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]},index=['row1', 'row2'])
df
col1 col2
row1 1 0.50
row2 2 0.75
df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
# You can specify the return orientation.
df.to_dict('series')
{'col1': row1 1
row2 2
Name: col1, dtype: int64,
'col2': row1 0.50
row2 0.75
Name: col2, dtype: float64}
df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
# You can also specify the mapping type.
from collections import OrderedDict, defaultdict
df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
Taken from here

Stack different column values into one column in a pandas dataframe

I have the following dataframe -
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
'g1'],
'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019'],
'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
.merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem be working. Please help.

you can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index = ['ID',0], columns = 'level_2', values = 'Date', aggfunc = ''.join, fill_value= '').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explaination:
After stack / reset_index df will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now, we can use ID and column 0 as index / level_2 as column / Date column as value.
Finally, we need to rename the columns to get the desired result.

My approach is to build and attain the target df step by step. The first step is an extension of your code using melt() and merge(). The merge is done based on the columns 'Current' and 'Prior' to get the start and end date.
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
df2 = pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current').drop('variable',1).drop_duplicates().sort_values('ID')
df2 = df2.merge(df[['Current', 'Date']], how='left', left_on='Prior_Current', right_on='Current').drop('Current',1)
df2 = df2.merge(df[['Prior', 'Date']], how='left', left_on='Prior_Current', right_on='Prior').drop('Prior',1)
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
Alternative way is to define a custom function to get date, then use lambda function:
def get_date(x, col):
try:
return df['Date'][df[col]==x].values[0]
except:
return ''
df2 = pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current').drop('variable',1).drop_duplicates().sort_values('ID').reset_index(drop=True)
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output

Turn pandas columns into JSON string

I have a pandas dataframe with columns col1, col2 and col3 and respective values. I would need to transform column names and values into a JSON string.
For instance, if the dataset is
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
I need to obtain an output like this
newdata= pd.DataFrame({'index': [0,1,2], 'payload': ['{"col1":"bravo", "col2":"1", "col3":"alpha"}', '{"col1":"charlie", "col2":"2", "col3":"beta"}', '{"col1":"price", "col2":"3", "col3":"gamma"}']})
I didn't find any function or iterative tool to perform this.
Thank you in advance!

You can use:
df = data.agg(lambda s: dict(zip(s.index, s)), axis=1).rename('payload').to_frame()
Result:
# print(df)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}

Here you go:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
new_data = pd.DataFrame({
'payload': data.to_dict(orient='records')
})
print(new_data)
## -- End pasted text --
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}

If my understanding is correct, you want the index and the data records as a dict.
So:
dict(index=list(data.index), payload=data.to_dict(orient='records'))
For your example data:
>>> import pprint
>>> pprint.pprint(dict(index=list(data.index), payload=data.to_dict(orient='records')))
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}

This is one approach using .to_dict('index').
Ex:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
newdata = data.to_dict('index')
print({'index': list(newdata.keys()), 'payload': list(newdata.values())})
#OR -->newdata= pd.DataFrame({'index': list(newdata.keys()), 'payload': list(newdata.values())})
Output:
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}

Use to_dict: newdata = data.T.to_dict()
>>> print(newdata.values())
[
{'col2': 1, 'col3': 'alpha', 'col1': 'bravo'},
{'col2': 2, 'col3': 'beta', 'col1': 'charlie'},
{'col2': 3, 'col3': 'gamma', 'col1': 'price'}
]

Convert Pandas DataFrame to dictionairy

I have a simple DataFrame:
Name Format
0 cntry int
1 dweight str
2 pspwght str
3 pweight str
4 nwspol str
I want a dictionairy as such:
{
"cntry":"int",
"dweight":"str",
"pspwght":"str",
"pweight":"str",
"nwspol":"str"
}
Where dict["cntry"] would return int or dict["dweight"] would return str.
How could I do this?

How about this:
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'D'], 'col_2': [1, 1, 2, 3], 'col_3': ['Bla', 'Foo', 'Sup', 'Asdf']})
res_dict = dict(zip(df['col_1'], df['col_3']))
Contents of res_dict:
{'A': 'Bla', 'B': 'Foo', 'C': 'Sup', 'D': 'Asdf'}

You're looking for DataFrame.to_dict()
From the documentation:
>>> df = pd.DataFrame({'col1': [1, 2],
... 'col2': [0.5, 0.75]},
... index=['row1', 'row2'])
>>> df
col1 col2
row1 1 0.50
row2 2 0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can always invert an internal dictionary if it's not mapped how you'd like it to be:
inv_dict = {v: k for k, v in original_dict['Name'].items()}

I think you want is:
df.set_index('Name').to_dict()['Format']
Since you want to use the values in the Name column as the keys to your dict.
Note that you might want to do:
df.set_index('Name').astype(str).to_dict()['Format']
if you want the values of the dictionary to be strings.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to aggregate columns in pandas - python

Related

Combine rows of df based on sequence in 1 column

How to show data with different number of columns Pandas?

Stack different column values into one column in a pandas dataframe

Turn pandas columns into JSON string

Convert Pandas DataFrame to dictionairy

Categories

Resources