Is there another, easier way to add the information to the dataframe?
I have a dataframe which looks like the following:
             Product  Price
2023-01-03     Apple   2.00
2023-01-04     Apple   2.10
2023-01-05     Apple   1.90
2023-01-03    Banana   1.10
2023-01-04    Banana   1.15
2023-01-05    Banana   1.30
2023-01-03  Cucumber   0.50
2023-01-04  Cucumber   0.80
2023-01-05  Cucumber   0.55
I have two additional pieces of information:
Apple Fruit
Banana Fruit
Cucumber Vegetable
and
Apple UK
Banana Columbia
Cucumber Mexico
These need to be added as additional columns, as shown in the result below.
             Product  Price   Category   Country
2023-01-03     Apple   2.00      Fruit        UK
2023-01-04     Apple   2.10      Fruit        UK
2023-01-05     Apple   1.90      Fruit        UK
2023-01-03    Banana   1.10      Fruit  Columbia
2023-01-04    Banana   1.15      Fruit  Columbia
2023-01-05    Banana   1.30      Fruit  Columbia
2023-01-03  Cucumber   0.50  Vegetable    Mexico
2023-01-04  Cucumber   0.80  Vegetable    Mexico
2023-01-05  Cucumber   0.55  Vegetable    Mexico
The code looks like the following:
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Apple', 'Apple',
                'Banana', 'Banana', 'Banana',
                'Cucumber', 'Cucumber', 'Cucumber'],
    'Price': [2.0, 2.1, 1.9,
              1.1, 1.15, 1.3,
              0.5, 0.8, 0.55]},
    index=['2023-01-03', '2023-01-04', '2023-01-05',
           '2023-01-03', '2023-01-04', '2023-01-05',
           '2023-01-03', '2023-01-04', '2023-01-05'],
)
df.index = pd.to_datetime(df.index)
print(df)
def category(row):
    if row['Product'] == 'Apple':
        return 'Fruit'
    elif row['Product'] == 'Banana':
        return 'Fruit'
    else:
        return 'Vegetable'

def country(row):
    if row['Product'] == 'Apple':
        return 'UK'
    elif row['Product'] == 'Banana':
        return 'Columbia'
    else:
        return 'Mexico'

df['Category'] = df.apply(category, axis=1)
df['Country'] = df.apply(country, axis=1)
print(df)
You can put the information in a nested dict, then build an intermediate DataFrame for the merge:
infos = pd.DataFrame({"Category": {"Apple": "Fruit", "Banana": "Fruit", "Cucumber": "Vegetable"},
                      "Country": {"Apple": "UK", "Banana": "Columbia", "Cucumber": "Mexico"}})
df = df.merge(infos, left_on="Product", right_index=True, how="left")
Output:
print(df)
             Product  Price   Category   Country
2023-01-03     Apple   2.00      Fruit        UK
2023-01-04     Apple   2.10      Fruit        UK
2023-01-05     Apple   1.90      Fruit        UK
2023-01-03    Banana   1.10      Fruit  Columbia
2023-01-04    Banana   1.15      Fruit  Columbia
2023-01-05    Banana   1.30      Fruit  Columbia
2023-01-03  Cucumber   0.50  Vegetable    Mexico
2023-01-04  Cucumber   0.80  Vegetable    Mexico
2023-01-05  Cucumber   0.55  Vegetable    Mexico
Alternatively, sticking with apply:
df['Category'] = df.apply(lambda row: 'Fruit' if (row['Product'] == 'Apple') or (row['Product'] == 'Banana') else 'Vegetable', axis=1)
df['Country'] = df.apply(lambda row: 'UK' if row['Product'] == 'Apple' else ('Columbia' if row['Product'] == 'Banana' else 'Mexico'), axis=1)
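If you prefer to skip both apply and the intermediate merge frame, a minimal sketch using Series.map with plain dicts (assuming the same df as above) gives the same result and is usually the simplest option for a pure lookup:

category_map = {'Apple': 'Fruit', 'Banana': 'Fruit', 'Cucumber': 'Vegetable'}
country_map = {'Apple': 'UK', 'Banana': 'Columbia', 'Cucumber': 'Mexico'}

# map() looks each Product up in the dict; unmatched products become NaN
df['Category'] = df['Product'].map(category_map)
df['Country'] = df['Product'].map(country_map)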
I'm trying to match the final output below, calculating the 3-period moving average (MA3) of Count.
Expected Output
   Classification   Name  Count   MA3
0          Fruits  Apple    inf   NaN
1          Fruits  Apple    inf   NaN
2          Fruits  Apple    inf   NaN
3          Fruits  Apple    inf   NaN
4          Fruits  Apple    5.0   5.0
5          Fruits  Apple    6.0   5.5
6          Fruits  Apple    7.0   6.0
7          Fruits  Apple    8.0   7.0
8             Veg   Broc   10.0   NaN
9             Veg   Broc   11.0   NaN
10            Veg   Broc   12.0  11.0
But the pandas .rolling code does not take the inf values into account; is there any workaround for this?
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.rolling(3,3).mean())
Current Output
   Classification   Name  Count   MA3
0          Fruits  Apple    inf   NaN
1          Fruits  Apple    inf   NaN
2          Fruits  Apple    inf   NaN
3          Fruits  Apple    inf   NaN
4          Fruits  Apple    5.0   NaN
5          Fruits  Apple    6.0   NaN
6          Fruits  Apple    7.0   6.0
7          Fruits  Apple    8.0   7.0
8             Veg   Broc   10.0   NaN
9             Veg   Broc   11.0   NaN
10            Veg   Broc   12.0  11.0
Create a series S that contains the same calculation but with inf replaced by nan and min_periods=1. Then create a mask for the rows that need to be modified, that is, the ones that are one or two positions after an inf:
import numpy as np

# Strict version: NaN unless all 3 window values are finite.
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(
    lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=3).mean())
# Lenient version: mean of whatever finite values fall in the window.
S = df.groupby(['Classification', 'Name'])['Count'].transform(
    lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=1).mean())
# Rows with a finite Count, a missing MA3, and an inf one or two rows back.
mask = (df['Count'].lt(np.inf) & df['MA3'].isnull()
        & (df['Count'].shift(1).eq(np.inf) | df['Count'].shift(2).eq(np.inf)))
df.loc[mask, 'MA3'] = S.loc[mask]
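For reference, a quick way to rebuild the question's frame and check the result (a sketch; the Classification/Name/Count values are copied from the tables above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Classification': ['Fruits'] * 8 + ['Veg'] * 3,
    'Name': ['Apple'] * 8 + ['Broc'] * 3,
    'Count': [np.inf] * 4 + [5.0, 6.0, 7.0, 8.0, 10.0, 11.0, 12.0],
})
# ...apply the three steps above, then:
print(df)
# Row 4 gets 5.0 and row 5 gets 5.5 from the lenient series S,
# while rows 6-10 keep the strict min_periods=3 values.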
I have the following data frame:
import pandas as pd

df = pd.DataFrame()
df['fruit'] = ['apple', 'pear', 'banana', 'banana', 'pear', 'banana',
               'apple', 'apple', 'pear', 'apple', 'apple', 'apple']
df['price'] = [2, 1, 3, 3, 1, 3.3, 1.8, 1.8, 1, 1.6, 1.6, 1.6]
df['date_buy'] = ['01/01/2005', '01/01/2005', '01/01/2005', '01/01/2005',
                  '01/02/2005', '01/02/2005', '01/02/2005', '01/02/2005',
                  '01/03/2005', '01/03/2005', '01/03/2005', '01/03/2005']
df.date_buy = pd.to_datetime(df.date_buy)
df.set_index('date_buy', inplace=True)
The data is:
             fruit  price
date_buy
2005-01-01   apple    2.0
2005-01-01    pear    1.0
2005-01-01  banana    3.0
2005-01-01  banana    3.0
2005-01-02    pear    1.0
2005-01-02  banana    3.3
2005-01-02   apple    1.8
2005-01-02   apple    1.8
2005-01-03    pear    1.0
2005-01-03   apple    1.6
2005-01-03   apple    1.6
2005-01-03   apple    1.6
I have converted this dataframe into a pivot table:
df.pivot_table(index=['date_buy'], columns=['fruit'], values=['price'],
               aggfunc=len).fillna(0).resample('D', level=0).sum()
            price
fruit       apple banana pear
date_buy
2005-01-01    1.0    2.0  1.0
2005-01-02    2.0    1.0  1.0
2005-01-03    3.0    0.0  1.0
I want to subset this dataset based on one criterion: the top two slopes of the trend lines. For apple the slope is 1, for banana it is -1, and for pear it is 0. The result should be:
            price
fruit       apple pear
date_buy
2005-01-01    1.0  1.0
2005-01-02    2.0  1.0
2005-01-03    3.0  1.0
This dataset is just a small concept version of a much larger dataset; that's why I'm not simply subsetting by the names of the two fruits I can see. Any help will be greatly appreciated.
You can use polyfit from numpy to get the slopes. While not strictly necessary, you can use the delta of the index in days as x, with the full pivoted dataframe as y. Then use argsort and select the number of top slopes you want. Finally, use iloc to get the columns:
import numpy as np

pv_df = (df.pivot_table(index=['date_buy'], columns=['fruit'],
                        values=['price'], aggfunc=len)
           .fillna(0).resample('D', level=0).sum()
         )

# number of top slopes
nb_top = 2

# get the slopes: fit a degree-1 polynomial per column
slopes = np.polyfit(x=(pv_df.index - pv_df.index.min()).days,
                    y=pv_df, deg=1)[0]

# select the columns with the nb_top largest slopes
res = pv_df.iloc[:, np.argsort(slopes)[-nb_top:]]
print(res)
            price
fruit        pear apple
date_buy
2005-01-01    1.0   1.0
2005-01-02    1.0   2.0
2005-01-03    1.0   3.0
Note: for the slopes, you can directly use slopes = np.polyfit(x=pv_df.index.astype(int), y=pv_df, deg=1)[0], but the values are less obvious compared to the 1, 0 and -1 quoted in your question.
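As a quick sanity check on the example data, the slopes come out as stated in the question (a sketch, reusing pv_df and numpy from above):

# columns are apple, banana, pear; the fitted slopes should be ~1, -1 and 0
x = (pv_df.index - pv_df.index.min()).days
slopes = np.polyfit(x=x, y=pv_df, deg=1)[0]
print(dict(zip(pv_df.columns.get_level_values('fruit'), slopes.round(2))))
# {'apple': 1.0, 'banana': -1.0, 'pear': 0.0}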
Suppose I have a very basic dataset:
name    food       city  rating
paul    cream      LA    2
daniel  chocolate  NY    3
paul    chocolate  LA    4
john    cream      NY    5
daniel  jam        LA    1
daniel  butter     NY    3
john    jam        NY    9
I want to compute the descriptive stats for each person's food preferences, which is easy enough:
df1 = pd.pivot_table(df, values='rating', index=['city', 'name', 'food'], aggfunc=['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'], margins=True, margins_name="Total")
But I want to add subtotals for each name and city.
I can get subtotals for name and city in separate objects:
df2 = df.groupby('name').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df2.index = pd.MultiIndex.from_arrays([df2.index + '_total', len(df2.index) * ['']])
df3 = df.groupby('city').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df3.index = pd.MultiIndex.from_arrays([df3.index + '_total', len(df3.index) * ['']])
But I'm struggling to combine the three tables.
The output of df1 has index levels for 'city', 'name' and 'food' on each row:
city  name    food    count  nunique ...
LA    daniel  jam     1      1
      paul    choc    1      1
              cream   1      1
NY    daniel  butter  1      1
but the outputs for df2 and df3 just have 'name' (df2) or 'city' (df3):
name          count  nunique
daniel_total  3      1
john_total    2      1
I want to merge these tables so the name totals are placed in the 'name' column and the city totals in the 'city' column, like so:
city      name          food    count
LA        daniel        jam     1
          paul          choc    1
                        cream   1
LA_total                        3
NY        daniel        butter  1
NY_total                        2
          daniel_total          3
          john_total            2
          paul_total            2
I've tried using pandas concat, but it groups the descriptive columns together:
pd.concat([df1, df2, df3]).sort_index()
I think I need to tell pandas which column to join the df2 and df3 frames on, but I'm not sure how.
Let's try this:
# Subtotals per (city, name): same index levels as df1, with food left blank.
df2 = df.groupby(['city', 'name']).agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df2 = df2.rename(index=lambda x: x + '_total', level=1)
# groupby().agg puts columns as (value, aggfunc) while pivot_table produced
# (aggfunc, value), so swap the column levels to match df1.
df2 = df2.swaplevel(0, 1, axis=1)
df2 = df2.assign(food='').set_index('food', append=True)

# Subtotals per city: blank name and food levels.
df3 = df.groupby('city').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df3.index = pd.MultiIndex.from_arrays([df3.index + '_total', len(df3.index) * ['']])
df3 = df3.assign(name='', food='').set_index(['name', 'food'], append=True)
df3 = df3.swaplevel(0, 1, axis=1)

# Now all three frames share the same index and column structure.
df_out = pd.concat([df1, df2, df3]).sort_index()
df_out
Output:
                                 count nunique sum min max      mean       std       sem median       mad       var      skew
                                rating  rating rating rating rating  rating    rating    rating rating    rating    rating    rating
city     name         food
LA       daniel       jam            1       1   1   1   1  1.000000       NaN       NaN      1  0.000000       NaN       NaN
         daniel_total                1       1   1   1   1  1.000000       NaN       NaN      1  0.000000       NaN       NaN
         paul         chocolate      1       1   4   4   4  4.000000       NaN       NaN      4  0.000000       NaN       NaN
                      cream          1       1   2   2   2  2.000000       NaN       NaN      2  0.000000       NaN       NaN
         paul_total                  2       2   6   2   4  3.000000  1.414214  1.000000      3  1.000000  2.000000       NaN
LA_total                             3       3   7   1   4  2.333333  1.527525  0.881917      2  1.111111  2.333333  0.935220
NY       daniel       butter         1       1   3   3   3  3.000000       NaN       NaN      3  0.000000       NaN       NaN
                      chocolate      1       1   3   3   3  3.000000       NaN       NaN      3  0.000000       NaN       NaN
         daniel_total                2       1   6   3   3  3.000000  0.000000  0.000000      3  0.000000  0.000000       NaN
         john         cream          1       1   5   5   5  5.000000       NaN       NaN      5  0.000000       NaN       NaN
                      jam            1       1   9   9   9  9.000000       NaN       NaN      9  0.000000       NaN       NaN
         john_total                  2       2  14   5   9  7.000000  2.828427  2.000000      7  2.000000  8.000000       NaN
NY_total                             4       3  20   3   9  5.000000  2.828427  1.414214      4  2.000000  8.000000  1.414214
Total                                7       6  27   1   9  3.857143  2.609506  0.986301      3  1.836735  6.809524  1.398866
I have this dataframe, talismen_players:
                        Team  TotalShots  TotalCreated  Goals  ...  WeightBetweenness  %ShotInv                     Player    Color
0                Aston Villa        55.0          68.0    7.0  ...               0.45      0.36              Jack Grealish  #770038
1            Manchester City        76.0          96.0    8.0  ...               0.44      0.32            Kevin De Bruyne  #97C1E7
2                    Watford        62.0          43.0    4.0  ...               0.37      0.34            Gerard Deulofeu  #FBEE23
3             Leicester City        60.0          67.0    6.0  ...               0.32      0.34             James Maddison  #0053A0
4               Norwich City        29.0          69.0    0.0  ...               0.31      0.32           Emiliano Buendia  #00A14E
5                    Chelsea        63.0          40.0    5.0  ...               0.28      0.23                Mason Mount  #034694
6          Tottenham Hotspur        64.0          30.0    9.0  ...               0.28      0.29              Son Heung-Min  #132257
7                    Everton        66.0          30.0   10.0  ...               0.22      0.26                Richarlison  #274488
8                    Arsenal        64.0          18.0   17.0  ...               0.21      0.27  Pierre-Emerick Aubameyang  #EF0107
9                Bournemouth        25.0          40.0    1.0  ...               0.21      0.23                Ryan Fraser  #D3151B
10            Crystal Palace        42.0          20.0    6.0  ...               0.20      0.24                Jordan Ayew  #C4122E
11                   Burnley        33.0          40.0    2.0  ...               0.20      0.27              Dwight McNeil  #630F33
12          Newcastle United        41.0          24.0    1.0  ...               0.20      0.24                  Joelinton  #231F20
13                 Liverpool        89.0          41.0   14.0  ...               0.18      0.31              Mohamed Salah  #CE1317
14  Brighton and Hove Albion        27.0          52.0    2.0  ...               0.16      0.23                Pascal Groß  #005DAA
15   Wolverhampton Wanderers        86.0          38.0   11.0  ...               0.16      0.38               Raul Jimenez  #FDB913
16          Sheffield United        31.0          35.0    5.0  ...               0.15      0.24                 John Fleck  #E52126
17           West Ham United        48.0          18.0    6.0  ...               0.15      0.25           Sebastien Haller  #7C2C3B
18               Southampton        64.0          21.0   15.0  ...               0.11      0.24                 Danny Ings  #ED1A3B
19         Manchester United        37.0          31.0    1.0  ...               0.10      0.17            Andreas Pereira  #E80909
And I have singled out one element in a series being plotted by altair, like so:
target_team = 'Liverpool'

# the following prints 'Mohamed Salah'
target_player = talismen_players.loc[talismen_players['Team'] == target_team, 'Player'].item()

# all elements
talisman_chart = alt.Chart(talismen_players).mark_bar().encode(
    alt.Y('Player:N', title="", sort='-x'),
    alt.X('WeightBetweenness:Q', title="Interconectividade do craque com o resto do time"),
    alt.Color('Color', legend=None, scale=None),
    tooltip=[alt.Tooltip('Player:N'),
             alt.Tooltip('Team:N'),
             alt.Tooltip('TotalShots:Q'),
             alt.Tooltip('%ShotInv:Q')],
).properties(
    width=800
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
)
This plots all elements, but I want to highlight one of them.
There is example code for achieving that result with lines: the Multiline highlight example.
I've tried with this code:
highlight = alt.selection(type='single', on='mouseover',
                          fields=['Player'], nearest=True)

base = alt.Chart(talismen_players).encode(
    alt.X('WeightBetweenness:Q'),
    alt.Y('Player:N', title=''),
    alt.Color('Color', legend=None, scale=None),
    tooltip=[alt.Tooltip('Player:N'),
             alt.Tooltip('Team:N'),
             alt.Tooltip('TotalShots:Q'),
             alt.Tooltip('%ShotInv:Q')],
)

points = base.mark_circle().encode(
    opacity=alt.value(0)
).add_selection(
    highlight
).properties(
    width=700
)

lines = base.mark_bar().encode(
    size=alt.condition(~highlight, alt.value(18), alt.value(20))
)
But the result is not ideal: the highlight is not working.
QUESTION:
Ideally, I'd like to have the target_player bar highlighted by default, and THEN be able to highlight the other bars with mouseover. But having the target_player bar statically highlighted suffices.
The highlight should not change the bar colors, but rather give the selected bar a focus quality while the other bars are out of focus.
You can do this using an initialized selection, along with an opacity for highlighting. For example:
import altair as alt
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': ['A', 'B', 'C', 'D'],
    'color': ['W', 'X', 'Y', 'Z'],
})

hover = alt.selection_single(fields=['y'], on='mouseover', nearest=True, init={'y': 'C'})

alt.Chart(df).mark_bar().encode(
    x='x',
    y='y',
    color='color',
    opacity=alt.condition(hover, alt.value(1.0), alt.value(0.3))
).add_selection(hover)
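Applied to your data, that would look something like the sketch below (assuming talismen_players and target_player as defined earlier; the selection starts out initialized on the target player, so his bar is in focus by default, and moves with mouseover):

hover = alt.selection_single(fields=['Player'], on='mouseover', nearest=True,
                             init={'Player': target_player})

talisman_chart = alt.Chart(talismen_players).mark_bar().encode(
    alt.Y('Player:N', title='', sort='-x'),
    alt.X('WeightBetweenness:Q'),
    alt.Color('Color', legend=None, scale=None),
    opacity=alt.condition(hover, alt.value(1.0), alt.value(0.3)),
).add_selection(hover).properties(width=800)

Because the highlight is expressed through opacity rather than color, the team colors are preserved and the non-selected bars simply fall out of focus.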
I have a DataFrame with a column named 'ID' that has duplicate observations. Each 'ID' row has values in one or more 'Article' columns. I want to transpose the whole dataframe, grouping by 'ID' and adding new columns on the same row for each unique 'ID'. Order is important.
My DataFrame:
ID  Article    Article_2
1   Banana     NaN
2   Apple      NaN
1   Apple      Coconut
3   Tomatoe    Coconut
1   Pineapple  Tropical
2   Banana     Coconut
4   Apple      Coconut
5   Apple      Coconut
3   Apple      Pineapple
My code (by @Erfan):
dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')
Output:
      Article_1 Article_2  Article_3  Article_4 Article_5 Article_6
0001     Banana     Apple  Pineapple        NaN   Coconut  Tropical
0002      Apple    Banana        NaN    Coconut       NaN       NaN
0003    Tomatoe     Apple    Coconut  Pineapple       NaN       NaN
0004      Apple   Coconut        NaN        NaN       NaN       NaN
0005      Apple   Coconut        NaN        NaN       NaN       NaN
In the first row, Article_4 is NaN while Article_5 and Article_6 have values; it should be Article_4 Coconut, Article_5 Tropical and Article_6 NaN. Likewise in the second row, Article_3 is NaN and Article_4 holds a valid value; it should be Article_3 that is valid and the rest (4, 5, 6) NaN.
Needed output:
      Article_1 Article_2  Article_3  Article_4 Article_5 Article_6
0001     Banana     Apple  Pineapple    Coconut  Tropical       NaN
0002      Apple    Banana    Coconut        NaN       NaN       NaN
0003    Tomatoe     Apple    Coconut  Pineapple       NaN       NaN
0004      Apple   Coconut        NaN        NaN       NaN       NaN
0005      Apple   Coconut        NaN        NaN       NaN       NaN
Add DataFrame.dropna after melt to remove the rows with missing values in the value column:
dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2']).dropna(subset=['value'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')
print(dfn)
  Article_1 Article_2  Article_3  Article_4 Article_5
1    Banana     Apple  Pineapple    Coconut  Tropical
2     Apple    Banana    Coconut        NaN       NaN
3   Tomatoe     Apple    Coconut  Pineapple       NaN
4     Apple   Coconut        NaN        NaN       NaN
5     Apple   Coconut        NaN        NaN       NaN
If you need all the columns, use a slightly modified justify function:
import numpy as np

dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')

# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notna(a)
    else:
        mask = a != invalid_val
    # Sorting the boolean mask pushes all valid (True) slots to one side.
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

dfn = pd.DataFrame(justify(dfn.values, invalid_val=np.nan, axis=1, side='left'),
                   index=dfn.index, columns=dfn.columns)
print(dfn)
  Article_1 Article_2  Article_3  Article_4 Article_5 Article_6
1    Banana     Apple  Pineapple    Coconut  Tropical       NaN
2     Apple    Banana    Coconut        NaN       NaN       NaN
3   Tomatoe     Apple    Coconut  Pineapple       NaN       NaN
4     Apple   Coconut        NaN        NaN       NaN       NaN
5     Apple   Coconut        NaN        NaN       NaN       NaN