I'm trying to add incremental values for each id in this pandas dataframe.
initial table
id col1 col2
1 0.12 10
1 0.23 20
1 1.1 30
2 0.25 10
2 2.1 20
2 1.2 30
what I want to achieve:
id col1 col2
1 0.12 10
1 0.23 20
1 1.1 30
1 0 40
1 0 50
2 0.25 10
2 2.1 20
2 1.2 30
2 0 40
2 0 50
I tried:
def func(row):
    for i in row["id"]:
        for j in range(40, 50+1, 10):
            row["id"] = i
            row["col1"] = 0
            row["col2"] = j

df = df.apply(lambda row: func(row))
but this raises an error saying that id doesn't exist:
KeyError: 'id'
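A note on the error itself: DataFrame.apply defaults to axis=0, so func receives each column as a Series, and looking up row["id"] inside a column is what raises the KeyError. Passing axis=1 hands the function rows instead, though apply alone still cannot append new rows:
df.apply(func, axis=1)  # rows, not columns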
No need for a loop; you can approach this with MultiIndex.from_product:
N = 50  # <- adjust here the limit to reach

# the groupby is only used here to collect the distinct ids
gr = df.groupby("id", as_index=False).count()

# Cartesian product of every id with the full range of col2 values
idx = pd.MultiIndex.from_product([gr["id"], range(10, N + 10, 10)], names=["id", "col2"])

# reindex on that product, filling the newly created rows with 0
out = (df.set_index(["id", "col2"])
         .reindex(idx, fill_value=0)
         .reset_index()[df.columns])
Output:
print(out)
id col1 col2
0 1 0.12 10
1 1 0.23 20
2 1 1.10 30
3 1 0.00 40
4 1 0.00 50
5 2 0.25 10
6 2 2.10 20
7 2 1.20 30
8 2 0.00 40
9 2 0.00 50
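A small aside on the groupby step: only the distinct ids are needed to build the product, so the count could equally be dropped in favour of df["id"].unique():
idx = pd.MultiIndex.from_product([df["id"].unique(), range(10, N + 10, 10)], names=["id", "col2"])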
You can group by id and append the missing rows via pandas concatenation:
df_to_append = pd.DataFrame([[0, 40], [0, 50]], columns=['col1', 'col2'])
df = (df.groupby('id', as_index=False)
        .apply(lambda x: pd.concat([x, df_to_append], ignore_index=True))
        .reset_index(drop=True)
        .ffill())  # forward-fill the id into the appended rows
id col1 col2
0 1.0 0.12 10
1 1.0 0.23 20
2 1.0 1.10 30
3 1.0 0.00 40
4 1.0 0.00 50
5 2.0 0.25 10
6 2.0 2.10 20
7 2.0 1.20 30
8 2.0 0.00 40
9 2.0 0.00 50
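Note that the forward fill passes through NaN first, which is why id comes out as a float; if the integer dtype matters, a cast afterwards restores it:
df['id'] = df['id'].astype(int)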
So, my dataframe looks like this
index Client Manager Score
0 1 1 0.89
1 1 2 0.78
2 1 3 0.65
3 2 1 0.91
4 2 2 0.77
5 2 3 0.97
6 3 1 0.35
7 3 2 0.61
8 3 3 0.81
9 4 1 0.69
10 4 2 0.22
11 4 3 0.93
12 5 1 0.78
13 5 2 0.55
14 5 3 0.44
15 6 1 0.64
16 6 2 0.99
17 6 3 0.22
My expected output looks like this
index Client Manager Score
0 1 1 0.89
1 2 3 0.97
2 3 2 0.61
3 4 3 0.93
4 5 1 0.78
5 6 2 0.99
We have 3 managers and 6 clients. I want each manager to have 2 clients, chosen by highest Score. Each client must be assigned to only one manager, so if one client is the best for two managers, one of them takes its second-best score, and so on. May I have your suggestions? Thank you in advance.
df = df.drop("index", axis=1)
df = df.sort_values("Score", ascending=False)  # highest scores first

selected_client = []
selected_manager = []
selected_df = []
for i, d in df.iterrows():
    client = int(d["Client"])
    manager = int(d["Manager"])
    # keep the row only if the client is still free and the manager has fewer than 2 clients
    if client not in selected_client and selected_manager.count(manager) != 2:
        selected_client.append(client)
        selected_manager.append(manager)
        selected_df.append(d)

result = pd.concat(selected_df, axis=1, sort=False)
print(result.T)
Try this:
df = df.sort_values('Score', ascending=False)  # sort values to prioritize high scores
d = {i: [] for i in df['Manager']}  # an empty dict to fill with the client/manager pairs
n = 2  # set number of clients per manager

for c, m in zip(df['Client'], df['Manager']):  # iterate over client and manager pairs
    # append the pair if the manager has fewer than n clients
    # and the client has not already been assigned
    if len(d[m]) < n and c not in [c2 for v in d.values() for c2, m2 in v]:
        d[m].append((c, m))

# filter for just the pairs found above
ndf = (pd.merge(df, pd.DataFrame([k for v in d.values() for k in v],
                                 columns=['Client', 'Manager']))
         .sort_values('Client'))
Output:
index Client Manager Score
3 0 1 1 0.89
1 5 2 3 0.97
5 7 3 2 0.61
2 11 4 3 0.93
4 12 5 1 0.78
0 16 6 2 0.99
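As an aside, both answers are greedy passes over the score-sorted rows, which happens to be optimal for this data but is not guaranteed to maximise the total score in general. For a provably optimal pairing the task can be framed as an assignment problem; here is a minimal sketch using scipy.optimize.linear_sum_assignment, assuming exactly 2 slots per manager (the duplicated columns are this sketch's own device, not part of either answer):
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# clients x managers score matrix
scores = df.pivot(index='Client', columns='Manager', values='Score')
# duplicate each manager column so every manager gets 2 slots; negate to maximise
cost = -np.hstack([scores.values, scores.values])
rows, cols = linear_sum_assignment(cost)
pairs = pd.DataFrame({'Client': scores.index[rows],
                      'Manager': np.tile(scores.columns, 2)[cols]})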
I have the following dataframe:
midPrice Change % Spike New Oilprice
92.20000 0.00 0 92.043405
92.26454 0.07 0 92.049689
91.96950 -0.32 0 91.979751
91.73958 -0.25 0 91.844369
91.78985 0.05 0 91.724690
91.41000 -0.41 0 91.568880
91.18148 -0.25 0 91.690812
91.24257 0.07 0 91.858391
90.95352 -0.32 0 92.016806
93.24000 2.51 1 92.139872
93.31013 0.08 0 92.321622
93.00690 -0.32 0 92.542687
92.77438 -0.25 0 92.727070
92.86400 0.10 0 92.949655
and whenever there is a Spike (1) in that column, I want to replace the 5 rows starting at the spike (the spike row included) with the new oil prices. The rest of the rows are kept as they are.
Any ideas how to solve that?
I tried code based on the following approach:
Iterate through the df (for loop)
If/else: if Spike == 1, replace the following 5 rows with the values of New Oilprice; else keep the old prices
def spike(i):
    for i in df['Spike']:
        if i.loc == 1:
            df['midPrice'].replace(df['New Oilprice'][i:5])
It unfortunately doesn't work, and I'm not so strong with pandas. I tried mapping the function onto the dataframe as well, which didn't work either. I would appreciate any help.
Assuming the df is sorted by time in ascending order (as I've seen in the edit history of your question that you had a time column), you could use a mask like so:
# True at each spike, NaN elsewhere; the forward fill extends True to the 4 rows after each spike
mask = df['Spike'].eq(1).where(df['Spike'].eq(1)).ffill(limit=4).fillna(False)
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.242570 0.07 0 91.858391
8 90.953520 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
EDIT - 2 rows before, 3 rows after:
You can stretch the mask with a backfill as well:
mask = (df['Spike'].eq(1).where(df['Spike'].eq(1))
          .bfill(limit=2)   # extend to the 2 rows before each spike
          .ffill(limit=3)   # and to the 3 rows after
          .fillna(False))
df.loc[mask, 'midPrice'] = df['New Oilprice']
print(df)
midPrice Change % Spike New Oilprice
0 92.200000 0.00 0 92.043405
1 92.264540 0.07 0 92.049689
2 91.969500 -0.32 0 91.979751
3 91.739580 -0.25 0 91.844369
4 91.789850 0.05 0 91.724690
5 91.410000 -0.41 0 91.568880
6 91.181480 -0.25 0 91.690812
7 91.858391 0.07 0 91.858391
8 92.016806 -0.32 0 92.016806
9 92.139872 2.51 1 92.139872
10 92.321622 0.08 0 92.321622
11 92.542687 -0.32 0 92.542687
12 92.727070 -0.25 0 92.727070
13 92.949655 0.10 0 92.949655
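The same before/after pattern generalises; here is a small hypothetical helper (the name spike_mask and its defaults are this sketch's own) built on the same bfill/ffill idea:
def spike_mask(spike, before=0, after=4):
    # True at each spike, NaN elsewhere
    m = spike.eq(1).where(spike.eq(1))
    if before:
        m = m.bfill(limit=before)  # stretch the True values backwards
    if after:
        m = m.ffill(limit=after)   # and forwards
    return m.fillna(False).astype(bool)

df.loc[spike_mask(df['Spike'], before=2, after=3), 'midPrice'] = df['New Oilprice']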
I have 2 DataFrames: one holds monthly totals, and the other holds the values that I want to divide by those totals in order to get monthly percentage contributions.
Here are some example DataFrames:
MonthlyTotals = pd.DataFrame(data={'Month': [1, 2, 3], 'Value': [100, 200, 300]})
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})
I am using df.div(), so I set the index like so:
MonthlyTotals.set_index('Month', inplace=True)
Data.set_index('Month', inplace=True)
Then I do the division
Contributions = Data.div(MonthlyTotals, axis='index')
The resulting DataFrame is what I want, but I cannot see the ID that each Value relates to, since ID isn't in the MonthlyTotals frame. How would I use df.div(), but only selectively on certain columns?
Here is an example dataframe of the result I am looking for
result = pd.DataFrame(data={'ID':[1,2,3,1,2,3,1,2,3],'Value':[0.4,0.3,0.3,0.3,0.35,0.35,0.5,0.2,0.3]})
You may not need MonthlyTotals at all if Data is complete. You can compute MonthlyTotal with groupby + transform('sum'), which broadcasts each month's sum back onto that month's rows, and then calculate Contributions.
Data = pd.DataFrame(data={'ID': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                          'Month': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                          'Value': [40, 30, 30, 60, 70, 70, 150, 60, 90]})

Data['MonthlyTotal'] = Data.groupby('Month')['Value'].transform('sum')
Data['Contributions'] = Data['Value'] / Data['MonthlyTotal']
Output
ID Month Value MonthlyTotal Contributions
0 1 1 40 100 0.40
1 2 1 30 100 0.30
2 3 1 30 100 0.30
3 1 2 60 200 0.30
4 2 2 70 200 0.35
5 3 2 70 200 0.35
6 1 3 150 300 0.50
7 2 3 60 300 0.20
8 3 3 90 300 0.30
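For what it's worth, the two steps collapse into one line, since div broadcasts row-wise against the transformed sums:
Data['Contributions'] = Data['Value'].div(Data.groupby('Month')['Value'].transform('sum'))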
Also, if you would like to stay close to your original approach, you can fix your code with reindex + update:
Data.update(Data['Value'].div(MonthlyTotals['Value'].reindex(Data.index), axis=0))
Data
ID Value
Month
1 1 0.40
1 2 0.30
1 3 0.30
2 1 0.30
2 2 0.35
2 3 0.35
3 1 0.50
3 2 0.20
3 3 0.30
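One detail to keep in mind: Data still carries Month as its index after the update, so a plain reset brings it back as a column if the original shape is needed:
Data = Data.reset_index()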
I have two pd data tables. I want to create a new column in df2 by assigning a random Rate, using the Weights from df1 as sampling probabilities.
df1
Income_Group Rate Weight
0 1 3.5 0.5
1 1 2.5 0.25
2 1 3.75 0.15
3 1 5.0 0.15
4 2 4.5 0.35
5 2 2.5 0.25
6 2 4.75 0.20
7 2 5.0 0.20
....
30 8 2.25 0.75
31 8 4.15 0.05
32 8 6.35 0.20
df2
ID Income_Group State Rate
0 12 1 9 3.5
1 13 2 6 4.5
2 15 8 1 6.35
3 8 1 5 2.5
4 9 8 4 6.35
5 17 2 3 4.75
......
100 50 1 4 3.75
I tried the following code:
df2['Rate'] = df1.groupby('Income_Group').apply(lambda gp.np.random.choice(a=gp.Rate, p=gp.Weight, replace=True))
Of course, the code didn't work. Can someone help me on this? Thank you in advance.
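For the record, two things go wrong there: lambda gp.np.random.choice(...) is not valid Python (it would need to be lambda gp: np.random.choice(...)), and even with the syntax fixed, groupby().apply returns one result per group, which does not align with df2's rows on assignment.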
Your data is pretty small, so we can do:
# collect each group's rates and weights as lists, indexed by Income_Group
rate_dict = df1.groupby('Income_Group')[['Rate', 'Weight']].agg(list)

df2['Rate'] = df2.Income_Group.apply(
    lambda x: np.random.choice(rate_dict.loc[x, 'Rate'], p=rate_dict.loc[x, 'Weight'])
)
Or you can do groupby on df2 as well:
(df2.groupby('Income_Group')
.Income_Group
.transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
size=len(x),
p=rate_dict.loc[x.iloc[0], 'Weight']))
)
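Note that the transform expression above is not assigned as written; in use it would be, e.g.:
df2['Rate'] = (df2.groupby('Income_Group')
                  .Income_Group
                  .transform(lambda x: np.random.choice(rate_dict.loc[x.iloc[0], 'Rate'],
                                                        size=len(x),
                                                        p=rate_dict.loc[x.iloc[0], 'Weight'])))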
You can try:
df1 = pd.DataFrame([[1,3.5,.5], [1,2.5,.25], [1,3.75,.15]],
columns=['Income_Group', 'Rate', 'Weight'])
df2 = pd.DataFrame()
weights = np.random.rand(df1.shape[0])
df2['Rate'] = df1.Rate.values * weights
I have a dataframe where values have been assigned to groups:
import pandas as pd
df = pd.DataFrame({'num': [0.43, 5.2, 1.3, 0.33, .74, .5, .2, .12],
                   'group': [1, 2, 2, 2, 3, 4, 5, 5]})
df
group num
0 1 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 3 0.74
5 4 0.50
6 5 0.20
7 5 0.12
I would like to ensure that no value is in a group alone. If a value is an "orphan", it should be reassigned to the next highest group with more than one member. So the resultant dataframe should look like this instead:
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
What's the most pythonic way to achieve this result?
Here is one solution I found; there may be much better ways to do this...
import bisect

# Find the orphans
count = df.group.value_counts().sort_index()
orphans = count[count == 1].index.values.tolist()
# Find the sets
sets = count[count > 1].index.values.tolist()
# Find where orphans should be remapped
where = [bisect.bisect(sets, x) for x in orphans]
remap = [sets[x] for x in where]
# Create a dictionary for remapping, and replace original values
change = dict(zip(orphans, remap))
df = df.replace({'group': change})
df
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
It is possible to use only vectorised operations for this task. You can use pd.Series.bfill to create a mapping from the original group labels to new ones:
import numpy as np

counts = df['group'].value_counts().sort_index().reset_index()
counts['original'] = counts['index']
# blank out the orphan groups, then backfill each gap with the next higher group label
counts.loc[counts['group'] == 1, 'index'] = np.nan
counts['index'] = counts['index'].bfill().astype(int)
print(counts)
index group original
0 2 1 1
1 2 3 2
2 5 1 3
3 5 1 4
4 5 2 5
Then use pd.Series.map to perform your mapping:
df['group'] = df['group'].map(counts.set_index('original')['index'])
print(df)
group num
0 2 0.43
1 2 5.20
2 2 1.30
3 2 0.33
4 5 0.74
5 5 0.50
6 5 0.20
7 5 0.12
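One hedge on this approach: it assumes every orphan has a larger multi-member group to fall back to. If the highest group were itself an orphan, bfill would leave a NaN behind and the astype(int) cast would raise; chaining a forward fill, e.g. counts['index'] = counts['index'].bfill().ffill().astype(int), would send such an orphan to the next lowest multi-member group instead.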