I have a pandas DataFrame like this (representing an investment portfolio):
data = {'category': ['stock', 'bond', 'cash', 'stock', 'cash'],
        'name': ['AA', 'BB', 'CC', 'DD', 'EE'],
        'quantity': [2, 2, 10, 4, 3],
        'price': [10, 15, 4, 2, 4],
        'value': [20, 30, 40, 8, 12]}
df = pd.DataFrame(data)
I would like to generate a report in a text file that looks like this:
Stock: Total: 28
Name quantity price value
AA 2 10 20
DD 4 2 8
Bond: Total: 30
Name quantity price value
BB 2 15 30
Cash: Total: 52
Name quantity price value
CC 10 4 40
EE 3 4 12
I found a way to do this by looping through a list of DataFrames, but it is kind of ugly. I think there should be a way with iterrows or iteritems, but I can't make it work.
Thank you for your help!
You can loop over the groupby object and write a custom header before each group's data:
with open('out.csv', 'w') as f:
    for i, g in df.groupby('category', sort=False):
        f.write(f'{i}: Total: {g["value"].sum()}\n')
        # drop the grouping column; the shared file handle keeps appending
        g.drop('category', axis=1).to_csv(f, index=False, sep='\t', lineterminator='\n')
        f.write('\n')
Output:
stock: Total: 28
name quantity price value
AA 2 10 20
DD 4 2 8
bond: Total: 30
name quantity price value
BB 2 15 30
cash: Total: 52
name quantity price value
CC 10 4 40
EE 3 4 12
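If you prefer the space-aligned layout shown in the question over tab-separated fields, here is a sketch using to_string instead of to_csv (same grouping idea; report.txt is an assumed filename):

```python
import pandas as pd

data = {'category': ['stock', 'bond', 'cash', 'stock', 'cash'],
        'name': ['AA', 'BB', 'CC', 'DD', 'EE'],
        'quantity': [2, 2, 10, 4, 3],
        'price': [10, 15, 4, 2, 4],
        'value': [20, 30, 40, 8, 12]}
df = pd.DataFrame(data)

with open('report.txt', 'w') as f:
    for cat, g in df.groupby('category', sort=False):
        f.write(f'{cat}: Total: {g["value"].sum()}\n')
        # to_string keeps the columns space-aligned like the desired report
        f.write(g.drop(columns='category').to_string(index=False) + '\n\n')
```

The only difference from the to_csv version is the column alignment in the output file.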
I have a dataframe derived from a massive list of market tickers from a crypto exchange.
The list includes ALL pair combinations, yet I only need the tickers quoted against USD stablecoins.
The first 15 entries of the original dataframe...
Asset Price
0 1INCHBTC 0.00009650
1 1INCHBUSD 5.74340000
2 1INCHUSDT 5.74050000
3 AAVEBKRW 164167.00000000
4 AAVEBNB 0.77600000
5 AAVEBTC 0.00615200
6 AAVEBUSD 365.00200000
7 AAVEDOWNUSDT 2.02505200
8 AAVEETH 0.17212000
9 AAVEUPUSDT 81.89500000
10 AAVEUSDT 365.57600000
11 ACMBTC 0.00018420
12 ACMBUSD 10.91700000
13 ACMUSDT 10.89500000
14 ADAAUD 1.59600000
Now...there are many USD stablecoins, however not every ticker has a pair with one.
So I used the most popular ones in order to make sure every asset has at least one match.
df = df.loc[(df.Asset.str[-3:] == 'DAI') |
            (df.Asset.str[-4:] == 'USDT') |
            (df.Asset.str[-4:] == 'BUSD') |
            (df.Asset.str[-4:] == 'TUSD')]
The first 15 entries of the new but 'messy' dataframe...
Asset Price
0 1INCHBUSD 5.74340000
1 1INCHUSDT 5.74050000
2 AAVEBUSD 365.00200000
3 AAVEDOWNUSDT 2.02505200
4 AAVEUPUSDT 81.89500000
5 AAVEUSDT 365.57600000
6 ACMBUSD 10.91700000
7 ACMUSDT 10.89500000
8 ADABUSD 1.21439000
9 ADADOWNUSDT 3.46482700
10 ADATUSD 1.21284000
11 ADAUPUSDT 76.12900000
12 ADAUSDT 1.21394000
13 AERGOBUSD 0.43012000
14 AIONBUSD 0.07210000
How do I filter/merge entries in this dataframe so that duplicates are removed?
I also need the stablecoin suffix at the end removed, so I'm left with just the asset and the USD price.
It should look something like this...
Asset Price
0 1INCH 5.74340000
2 AAVE 365.00200000
3 AAVEDOWN 2.02505200
4 AAVEUP 81.89500000
6 ACM 10.91700000
8 ADA 1.21439000
9 ADADOWN 3.46482700
11 ADAUP 76.12900000
13 AERGO 0.43012000
14 AION 0.07210000
This is for a portfolio tracker.
Also if there is a better way to do this without the middle step I'm all ears.
According to your expected output, you want to remove duplicates but keep the first item:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.drop_duplicates(subset="Asset", keep="first")
print(df)
Prints:
Asset Price
0 1INCH 5.743400
2 AAVE 365.002000
3 AAVEDOWN 2.025052
4 AAVEUP 81.895000
6 ACM 10.917000
8 ADA 1.214390
9 ADADOWN 3.464827
11 ADAUP 76.129000
13 AERGO 0.430120
14 AION 0.072100
EDIT: To group and average:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.groupby("Asset")["Price"].mean().reset_index()
print(df)
Prints:
Asset Price
0 1INCH 5.741950
1 AAVE 365.289000
2 AAVEDOWN 2.025052
3 AAVEUP 81.895000
4 ACM 10.906000
5 ADA 1.213723
6 ADADOWN 3.464827
7 ADAUP 76.129000
8 AERGO 0.430120
9 AION 0.072100
Just do
import numpy as np

con1 = df.Asset.str[-3:] == 'DAI'
con2 = df.Asset.str[-4:] == 'USDT'
con3 = df.Asset.str[-4:] == 'BUSD'
con4 = df.Asset.str[-4:] == 'TUSD'
# pass the boolean Series themselves to np.select, not their names as strings
df['new'] = np.select([con1, con2, con3, con4],
                      ['DAI', 'USDT', 'BUSD', 'TUSD'])
out = df[con1 | con2 | con3 | con4].groupby('new').head(1)
or
out = df[con1 | con2 | con3 | con4].drop_duplicates('new')
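Put together on a tiny made-up frame, the approach looks like this (note it keeps the first row per stablecoin suffix, so this is a sketch of the mechanics rather than a drop-in answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Asset': ['1INCHBUSD', '1INCHUSDT', 'AAVEBUSD'],
                   'Price': [5.7434, 5.7405, 365.002]})

con1 = df.Asset.str[-3:] == 'DAI'
con2 = df.Asset.str[-4:] == 'USDT'
con3 = df.Asset.str[-4:] == 'BUSD'
con4 = df.Asset.str[-4:] == 'TUSD'

# label each row with the stablecoin suffix it matched
df['new'] = np.select([con1, con2, con3, con4],
                      ['DAI', 'USDT', 'BUSD', 'TUSD'], default='')

# keeps the first row per suffix: 1INCHBUSD and 1INCHUSDT survive,
# AAVEBUSD is dropped as a second 'BUSD' row
out = df[con1 | con2 | con3 | con4].drop_duplicates('new')
print(out)
```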
I have a dataframe of people with Age as a column. I would like to match each age to a group: Baby = 0-2 years old, Child = 3-12, Young = 13-18, Young Adult = 19-30, Adult = 31-50, Senior Adult = 51-65.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
Small input: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0, 2),
            df.Age.between(3, 12),
            df.Age.between(13, 18),
            df.Age.between(19, 30),
            df.Age.between(31, 50),
            df.Age.between(51, 65)]
choicelist = ['Baby', 'Child', 'Young',
              'Young Adult', 'Adult', 'Senior Adult']
df['Adult'] = np.select(condlist, choicelist)
Output:
Age Adult
0 3 Child
1 20 Young Adult
2 40 Adult
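One detail worth hedging: np.select falls back to a default of 0 for rows matching none of the conditions, so an age over 65 would come out rendered as the string '0'. A small sketch with an explicit default label ('Other' is a made-up choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [1, 70]})
# only two conditions shown for brevity; 70 matches neither
df['Group'] = np.select([df.Age.between(0, 2), df.Age.between(51, 65)],
                        ['Baby', 'Senior Adult'],
                        default='Other')  # catches out-of-range ages
print(df)
```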
Here's a way to do that using pd.cut:
import numpy as np
import pandas as pd

df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for 'binning', you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)
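One caveat to hedge: pd.cut's bins are right-closed by default, so an age of exactly 0 falls outside the first bin built this way. A small sketch keeping the lowest edge:

```python
import pandas as pd

ages = pd.Series([0, 10, 30, 70])
groups = pd.cut(ages, bins=[0, 18, 50, 100],
                labels=['child', 'adult', 'senior'],
                include_lowest=True)  # keeps age 0 inside the first bin
print(groups.tolist())
```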
To make things clearer for beginners, you can define a function that returns each person's age group, then use DataFrame.apply() to fill a 'Group' column:
import pandas as pd

def age(row):
    a = row['Age']
    if 0 < a <= 2:
        return 'Baby'
    elif 2 < a <= 12:
        return 'Child'
    elif 12 < a <= 18:
        return 'Young'
    elif 18 < a <= 30:
        return 'Young Adult'
    elif 30 < a <= 50:
        return 'Adult'
    elif 50 < a <= 65:
        return 'Senior Adult'

df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young
I have a dataframe like so:
Class price demand
1 22 8
1 60 7
3 32 14
2 72 9
4 45 20
5 42 25
What I'd like to do is group classes 1-3 in one category and classes 4-5 in one category. Then I'd like to get the sum of price for each category and the sum of demand for each category. I'd like to also get the mean. The result should look something like this:
Class TotalPrice TotalDemand AveragePrice AverageDemand
P 186 38 46.5 9.5
E 87 45 43.5 22.5
Where P is classes 1-3 and E is classes 4-5. How can I group by categories in pandas? Is there a way to do this?
In [8]: df.groupby(np.where(df['Class'].isin([1, 2, 3]), 'P', 'E'))[['price', 'demand']].agg(['sum', 'mean'])
Out[8]:
price demand
sum mean sum mean
E 87 43.5 45 22.5
P 186 46.5 38 9.5
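For reference, the one-liner above reproduced end-to-end on the sample frame from the question (a sketch; the columns are the raw sum/mean MultiIndex, not the renamed Total*/Average* labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': [1, 1, 3, 2, 4, 5],
                   'price': [22, 60, 32, 72, 45, 42],
                   'demand': [8, 7, 14, 9, 20, 25]})

# classes 1-3 become group 'P', everything else 'E'
out = (df.groupby(np.where(df['Class'].isin([1, 2, 3]), 'P', 'E'))
         [['price', 'demand']]
         .agg(['sum', 'mean']))
print(out)
```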
You can create a dictionary that defines your groups.
mapping = {**dict.fromkeys([1, 2, 3], 'P'), **dict.fromkeys([4, 5], 'E')}
Then if you pass a dictionary or callable to a groupby it automatically gets mapped onto the index. So, let's set the index to Class
d = df.set_index('Class').groupby(mapping).agg(['sum', 'mean']).sort_index(axis=1, level=1)
Finally, we do some tweaking to get column names the way you specified.
rename_dict = {'sum': 'Total', 'mean': 'Average'}
d.columns = d.columns.map(lambda c: f"{rename_dict[c[1]]}{c[0].title()}")
d.rename_axis('Class').reset_index()
Class TotalPrice TotalDemand AveragePrice AverageDemand
0 E 87 45 43.5 22.5
1 P 186 38 46.5 9.5
In general, you can form arbitrary bins to group your data using pd.cut, specifying the right bin edges:
import pandas as pd
pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E'])
#0 P
#1 P
#2 P
#3 P
#4 E
#5 E
df2 = (df.groupby(pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E']))[['demand', 'price']]
         .agg(['sum', 'mean']).reset_index())
# Get rid of the multi-level columns
df2.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in df2.columns]
Output:
Class demand_sum demand_mean price_sum price_mean
0 P 38 9.5 186 46.5
1 E 45 22.5 87 43.5
I have written the function below in Python:
def proc_summ(df, var_names_in, var_names_group):
    df['Freq'] = 1
    df_summed = pd.pivot_table(df, index=var_names_group,
                               values=var_names_in,
                               aggfunc=[np.sum], fill_value=0,
                               margins=True, margins_name='Total').reset_index()
    df_summed.columns = df_summed.columns.map(''.join)
    df_summed.columns = [x.strip().replace('sum', '') for x in df_summed.columns]
    string_repr = df_summed.to_string(index=False, justify='center').splitlines()
    string_repr.insert(1, "-" * len(string_repr[0]))
    string_repr.insert(len(df_summed.index) + 1, "-" * len(string_repr[0]))
    out = '\n'.join(string_repr)
    print(out)
And below is the code I am using to call the function:
proc_summ(df,
          var_names_in=["Freq", "sal"],
          var_names_group=["name", "age"])
and below is the output:
name age Freq sal
--------------------
Arik 32 1 100
David 44 2 260
John 33 1 200
John 34 1 300
Peter 33 1 100
--------------------
Total 6 960
Please let me know how can I print the data to the center of the screen like :
name age Freq sal
--------------------
Arik 32 1 100
David 44 2 260
John 33 1 200
John 34 1 300
Peter 33 1 100
--------------------
Total 6 960
If you are using Python 3, you can try something like this:
import shutil
columns = shutil.get_terminal_size().columns
print("hello world".center(columns))
As you are using a DataFrame, you can try something like this:
import shutil
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
# convert DataFrame to string
df_string = df.to_string()
df_split = df_string.split('\n')
columns = shutil.get_terminal_size().columns
# iterate over every line: df_split also contains a header line,
# so range(len(df)) would skip the last row
for line in df_split:
    print(line.center(columns))
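A related sketch: centering each line independently can skew rows of different lengths against each other, so one option is to shift the whole block by a single common margin (a toy two-row frame is assumed here):

```python
import shutil
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
lines = df.to_string(index=False).splitlines()

# pad every line by the same margin so the table's internal
# alignment survives the centering
columns = shutil.get_terminal_size().columns
margin = max((columns - max(len(l) for l in lines)) // 2, 0)
centered = [' ' * margin + l for l in lines]
print('\n'.join(centered))
```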
The ordering of my age, height and weight columns changes with each run of the code. I need the order of my agg columns to stay static because I ultimately refer to this output file by column position. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')
I think you can use a column subset after the agg:
df_out = (df.groupby(df.index_col)
            .agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})
            [['age', 'height', 'weight']])
Also you can use pandas function names:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']])
Sample:
df = pd.DataFrame({'name':['q','q','a','a'],
'address':['a','a','s','s'],
'age':[7,8,9,10],
'height':[1,3,5,7],
'weight':[5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']])
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT, as suggested: add reset_index; as_index=False does not work here if you also need the index values as columns:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']]
            .reset_index())
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
If you care mostly about the order when written to a file and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
{'age': [28,63,28,45],
'height': [183,156,170,201],
'weight': [70.2, 62.5, 65.9, 81.0],
'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
columns=['name','age','weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
{'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
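As a closing note (an assumption worth verifying against your own pandas version): on Python ≥ 3.7 with a modern pandas, the insertion order of the .agg dict is preserved in the result, so the explicit column subset mainly protects against older versions. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'name': ['q', 'q', 'a'],
                   'age': [7, 8, 9],
                   'height': [1, 3, 5],
                   'weight': [5, 3, 6]})

# modern pandas keeps the dict's key order in the output columns
out = df.groupby('name').agg({'age': 'mean', 'height': 'sum', 'weight': 'sum'})
print(list(out.columns))
```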