Python Pandas: group by and add a new column with a number sequence

d = {'ID': ['H1', 'H1', 'H2', 'H2', 'H3', 'H3'], 'Year': ['2012', '2013', '2014', '2013', '2014', '2015'], 'Unit': [5, 10, 15, 7, 15, 20]}
df_input= pd.DataFrame(data=d)
df_input
I want to group the above df_input and derive 'lag' and 'lag_u' columns. 'lag' is the row sequence number within each 'ID' and 'Year' group.
'lag_u' is just the first Unit value within each 'ID' and 'Year' group.
Expected Output:
d = {'ID': ['H1', 'H1', 'H2', 'H2', 'H3', 'H3'], 'Year': ['2012', '2013', '2014', '2013', '2014', '2015'], 'Unit': [5, 10, 15, 7, 15, 20], 'lag': [0, 1, 2, 0, 1, 2], 'lag_u': [5, 5, 5, 7, 7, 7]}
df_output= pd.DataFrame(data=d)
df_output

IIUC, you need GroupBy.cumcount together with GroupBy.transform and 'first':
g = df_input.groupby('ID')
#if need group by both columns
#g = df_input.groupby(['ID','Year'])
df_input['lag'] = g.cumcount()
df_input['lag_u'] = g['Unit'].transform('first')
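For reference, a self-contained version of the same idea (a minimal sketch, assuming pandas and grouping by 'ID' only, as above) can add both columns in one step with DataFrame.assign:
import pandas as pd

d = {'ID': ['H1', 'H1', 'H2', 'H2', 'H3', 'H3'],
     'Year': ['2012', '2013', '2014', '2013', '2014', '2015'],
     'Unit': [5, 10, 15, 7, 15, 20]}
df_input = pd.DataFrame(data=d)

g = df_input.groupby('ID')               # or df_input.groupby(['ID', 'Year']) to group by both columns
df_output = df_input.assign(
    lag=g.cumcount(),                    # running row number within each group
    lag_u=g['Unit'].transform('first'),  # first Unit of each group, broadcast to every row
)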


How to convert these loops into a list comprehension?

I want to convert these loops into a list comprehension but I don't know how to do it. Can anyone help me, please?
This is the code I want to convert:
students = ['Tommy', 'Kitty', 'Jessie', 'Chester', 'Curie', 'Darwing', 'Nancy', 'Sue',
'Peter', 'Andrew', 'Karren', 'Charles', 'Nikhil', 'Justin', 'Astha','Victor',
'Samuel', 'Olivia', 'Tony']
assignment = [2, 5, 5, 7, 1, 5, 2, 7, 5, 1, 1, 1, 2, 1, 5, 2, 7, 2, 7]
x = list(zip(students, assignment))
Output = {}
for ke, y in x:
    y = "Group {}".format(y)
    if y in Output:
        Output[y].append(ke)
    else:
        Output[y] = [ke]
print(Output)
This is what I have tried (which is not valid syntax):
{Output[y].append((ke)) if y in Output else Output[y]=[(ke)]for ke, y in x}
You could do this with a nested dictionary/list comprehension:
Output = { f'Group {group}' : [ name for name, g in x if g == group ] for group in set(assignment) }
Output:
{
'Group 2': ['Tommy', 'Nancy', 'Nikhil', 'Victor', 'Olivia'],
'Group 5': ['Kitty', 'Jessie', 'Darwing', 'Peter', 'Astha'],
'Group 7': ['Chester', 'Sue', 'Samuel', 'Tony'],
'Group 1': ['Curie', 'Andrew', 'Karren', 'Charles', 'Justin']
}
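As a side note (not part of the original answers): if the real goal is just to avoid the if/else bookkeeping rather than to use a comprehension, collections.defaultdict keeps the loop readable. A minimal sketch:
from collections import defaultdict

Output = defaultdict(list)
for name, group in zip(students, assignment):
    Output[f'Group {group}'].append(name)  # missing keys are created as empty lists automatically
Output = dict(Output)  # back to a plain dict, same shape as the comprehension result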
data1 = {'students': ['Tommy', 'Kitty', 'Jessie', 'Chester', 'Curie', 'Darwing', 'Nancy', 'Sue',
'Peter', 'Andrew', 'Karren', 'Charles', 'Nikhil', 'Justin', 'Astha','Victor',
'Samuel', 'Olivia', 'Tony'],
'assignment': [2, 5, 5, 7, 1, 5, 2, 7, 5, 1, 1, 1, 2, 1, 5, 2, 7, 2, 7]}
df1 = pd.DataFrame(data1)
df1.groupby('assignment')['students'].agg(set).to_dict()
Output
{1: {'Andrew', 'Charles', 'Curie', 'Justin', 'Karren'},
2: {'Nancy', 'Nikhil', 'Olivia', 'Tommy', 'Victor'},
5: {'Astha', 'Darwing', 'Jessie', 'Kitty', 'Peter'},
7: {'Chester', 'Samuel', 'Sue', 'Tony'}}
You want a dict comprehension which will create a dict whose values come from a list comprehension.
itertools.groupby can help:
from itertools import groupby
pairs = sorted(zip(assignment, students))
out = {f'Group {key}': [name for _, name in grp] for key, grp in groupby(pairs, key=lambda p: p[0])}
{'Group 1': ['Andrew', 'Charles', 'Curie', 'Justin', 'Karren'], 'Group 2': ['Nancy', 'Nikhil', 'Olivia', 'Tommy', 'Victor'], 'Group 5': ['Astha', 'Darwing', 'Jessie', 'Kitty', 'Peter'], 'Group 7': ['Chester', 'Samuel', 'Sue', 'Tony']}

Conditional subtraction of Pandas Columns

I have a dataframe with stock returns and I would like to create a new column that contains the difference between that stock return and the return of the sector ETF it belongs to:
dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5],
'sector': [np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN, np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN, np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN,]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
That is, for instance, for AAPL on 1/1/2020 the return is 2. Since it belongs to the Tech sector, the relevant benchmark return is that of the ETF XLK (I have a dictionary that maps sectors to ETF tickers). The new column would then hold AAPL's return of 2 minus the XLK return of 4 on that day.
I have asked a similar question in the post below, where I simply wanted to compute the difference of each stock's return to one ticker, namely SPY.
Computing excess returns
The solution presented there was this:
def func(row):
    date, asset = row.name
    return df.loc[(date, asset), 'returns'] - df.loc[(date, 'SPY'), 'returns']
dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
df['excess_returns'] = df.apply(func, axis=1)
But I haven't been able to modify it so that I can do this on a sector basis. I appreciate any suggestions.
You are almost there:
def func(row):
    date, asset = row.name
    # look up the benchmark ETF for this row's sector
    # (assumes sector_to_index_mapping also covers rows whose sector is NaN, e.g. SPY and the ETFs themselves)
    index = sector_to_index_mapping[row.sector]
    return df.loc[(date, asset), 'returns'] - df.loc[(date, index), 'returns']
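As an alternative to the row-by-row apply (a sketch, not part of the original answer): the same result can be computed with a vectorized self-merge, assuming a hypothetical sector_to_etf dict like the one the question mentions:
import pandas as pd

# hypothetical sector -> benchmark ETF mapping; the question says such a dict exists
sector_to_etf = {'Tech': 'XLK', 'Cons Disc': 'XLY'}

flat = df.reset_index()                          # back to plain date/ticker columns
flat['etf'] = flat['sector'].map(sector_to_etf)  # benchmark ticker per stock (NaN for SPY and the ETFs)

# self-merge the benchmark's return for the same date, then subtract
bench = flat[['date', 'ticker', 'returns']].rename(columns={'ticker': 'etf', 'returns': 'etf_return'})
flat = flat.merge(bench, on=['date', 'etf'], how='left')
flat['excess_returns'] = flat['returns'] - flat['etf_return']
Rows without a mapped ETF (SPY and the ETFs themselves) simply end up with NaN in excess_returns.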

Turn pandas columns into JSON string

I have a pandas dataframe with columns col1, col2 and col3 and their respective values. I need to transform the column names and values of each row into a JSON string.
For instance, if the dataset is
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
I need to obtain an output like this
newdata= pd.DataFrame({'index': [0,1,2], 'payload': ['{"col1":"bravo", "col2":"1", "col3":"alpha"}', '{"col1":"charlie", "col2":"2", "col3":"beta"}', '{"col1":"price", "col2":"3", "col3":"gamma"}']})
I didn't find any function or iterative tool to perform this.
Thank you in advance!
You can use:
df = data.agg(lambda s: dict(zip(s.index, s)), axis=1).rename('payload').to_frame()
Result:
# print(df)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
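If the payload column genuinely needs to hold JSON strings, as in the question's expected output, rather than Python dicts, a small variation (a sketch using Series.to_json per row) is:
import pandas as pd

data = pd.DataFrame({'col1': ['bravo', 'charlie', 'price'],
                     'col2': [1, 2, 3],
                     'col3': ['alpha', 'beta', 'gamma']})

# serialize each row to a JSON string; reset_index exposes the index as a column
newdata = (data.apply(lambda row: row.to_json(), axis=1)
               .rename('payload')
               .reset_index())
Note that numeric columns stay numeric in the JSON (e.g. "col2":1); cast with data.astype(str) first if the quoted "1" from the expected output is required.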
Here you go:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
new_data = pd.DataFrame({
'payload': data.to_dict(orient='records')
})
print(new_data)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
If my understanding is correct, you want the index and the data records as a dict.
So:
dict(index=list(data.index), payload=data.to_dict(orient='records'))
For your example data:
>>> import pprint
>>> pprint.pprint(dict(index=list(data.index), payload=data.to_dict(orient='records')))
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
This is one approach using .to_dict('index').
Ex:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
newdata = data.to_dict('index')
print({'index': list(newdata.keys()), 'payload': list(newdata.values())})
#OR -->newdata= pd.DataFrame({'index': list(newdata.keys()), 'payload': list(newdata.values())})
Output:
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
Use to_dict: newdata = data.T.to_dict()
>>> print(list(newdata.values()))
[
{'col2': 1, 'col3': 'alpha', 'col1': 'bravo'},
{'col2': 2, 'col3': 'beta', 'col1': 'charlie'},
{'col2': 3, 'col3': 'gamma', 'col1': 'price'}
]

Create a DataFrame birds from this dictionary data which has the index labels

Consider the following Python dictionary data and Python list labels:
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Create a DataFrame birds from this dictionary data which has the index labels using Pandas
Assuming your dictionary values are already in the correct order for the labels:
import numpy as np
import pandas as pd

data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
data['labels'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, columns=['birds', 'age', 'visits', 'priority', 'labels'])
df = df.set_index('labels')  # assign back (or use inplace=True) so the index actually sticks
Try the code below:
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
    'labels': labels
}
df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')
You can reduce the code a little:
DataFrame gives us the flexibility to pass values such as data, columns, index, and so on.
If we are dealing with a dictionary, then by default the dictionary keys are treated as columns and the values become the rows.
In the following code I use the index's name attribute on the DataFrame object:
df = pd.DataFrame(data, index=labels)  # custom index
df.index.name = 'labels'  # the index has no name by default; this sets one
I hope this is helpful for you.
I encountered the exact same issue a few days back, and there is a very nice library for handling dataframes that, in my opinion, has more to offer than pandas.
Search for turicreate in Python; it is very similar to pandas but has a lot more to offer.
You can define SFrames in Turi Create, somewhat similar to a pandas DataFrame. After that you just have to run:
dataframe_name.show()
.show() visualizes any data structure in Turi Create.
You can visit this notebook for a better understanding: https://colab.research.google.com/drive/1DIFmRjGYx0UOiZtvMi4lOZmaBMnu_VlD
You can try this out:
import pandas as pd
import numpy as np
from pandas import DataFrame
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df=DataFrame(data,index=labels)
print(df)

Converting list of dictionaries to 1D numpy arrays

I have a list of dictionaries, all with the same 10 keywords.
I'm looking for a neat way to convert it into 10 1-D numpy arrays; efficiency is not important.
It takes 20 lines of code at the moment, like this:
names = [x['name'] for x in fields]
names = np.asarray(names)
etc.
You could use a nested list comprehension:
[np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']]
or a dict comprehension:
{attribute:np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']}
As an example :
>>> fields = [{'name': 'A', 'age': 25, 'address' : 'NYC'}, {'name': 'B', 'age': 32, 'address' : 'LA'}]
>>> [np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']]
[array(['A', 'B'],
dtype='|S1'), array([25, 32]), array(['NYC', 'LA'],
dtype='|S3')]
>>> {attribute:np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']}
{'age': array([25, 32]), 'name': array(['A', 'B'],
dtype='|S1'), 'address': array(['NYC', 'LA'],
dtype='|S3')}
To get the attributes in an automatic way, you could use:
>>> fields[0].keys()
['age', 'name', 'address']
Finally, a pandas DataFrame probably is the most suitable type for your data:
>>> pd.DataFrame(fields)
address age name
0 NYC 25 A
1 LA 32 B
It will be plenty fast and should allow you to do any operation you'd like to do on a list of arrays.
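If the end goal is still one 1-D numpy array per key, the DataFrame route gets there in one extra step (a sketch, assuming a reasonably recent pandas; older versions use .values instead of .to_numpy()):
import pandas as pd

frame = pd.DataFrame(fields)
arrays = {col: frame[col].to_numpy() for col in frame.columns}  # e.g. arrays['age'] -> array([25, 32])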
Let A be the input list of dictionaries.
For a 2D array output with each row holding data per keyword -
np.column_stack([i.values() for i in A])
Sample run -
In [217]: A # input list of 2 dictionaries, each with same 3 keywords
Out[217]:
[{'a': array([6, 8, 2]), 'b': array([7, 7, 3]), 'c': array([6, 6, 4])},
{'a': array([4, 4, 3]), 'b': array([7, 1, 6]), 'c': array([6, 1, 5])}]
In [244]: np.column_stack([i.values() for i in A])
Out[244]:
array([[6, 8, 2, 4, 4, 3], # key : a
[6, 6, 4, 6, 1, 5], # key : c
[7, 7, 3, 7, 1, 6]]) # key : b
# Get those keywords per row with `keys()` :
In [263]: A[0].keys()
Out[263]: ['a', 'c', 'b']
One more sample run-
In [245]: fields # sample from #Eric's solution
Out[245]:
[{'address': 'NYC', 'age': 25, 'name': 'A'},
{'address': 'LA', 'age': 32, 'name': 'B'}]
In [246]: np.column_stack([i.values() for i in fields])
Out[246]:
array([['25', '32'],
['A', 'B'],
['NYC', 'LA']],
dtype='|S21')
In [267]: fields[0].keys()
Out[267]: ['age', 'name', 'address']
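One caveat not covered in the answer above: those sample runs are from Python 2, where dict.values() returns a list. On Python 3 it returns a view, so the same idea needs an explicit list(), and since Python 3.7 the row order follows the dicts' insertion order rather than being arbitrary:
import numpy as np

np.column_stack([list(d.values()) for d in A])  # Python 3 version of the same stacking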
