I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large 6-digit integers. I want a way to simplify them, starting from 10, so that 542588 becomes 10, 542594 becomes 11, and so on.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize enumerates the keys in the order in which they occur in your data, while the groupby().ngroup() solution enumerates the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, and you can replicate the data order with groupby() by passing sort=False to it.
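Both variants are a one-line tweak; a quick sketch (factorize's sort parameter, which sorts the uniques for you, and groupby's sort parameter are both standard pandas options):
# increasing-order labels: sort=True numbers the unique keys in sorted order
df['id'] = df['id'].factorize(sort=True)[0] + 10
# data-order labels: sort=False makes ngroup follow order of appearance
df['id'] = df.groupby('id', sort=False).ngroup() + 10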
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of doing it: loop through the IDs, and every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10, incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)
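The same mapping can be built in one pass from the unique values; a minimal sketch:
# drop_duplicates keeps first occurrences in order; enumerate numbers them from 10
new_ids = {old: new for new, old in enumerate(df['id'].drop_duplicates(), start=10)}
df['id'] = df['id'].map(new_ids)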
This is what my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small Python script that outputs a sum for each column. I input an n value (the number of days), the summing runs, and the values are output.
However, given for example n=5 (days), I want to output only the Price A/B/C rows starting from the next day (which is day 6). Hence, the row for Day 5 should be '0'.
How can I implement this logic in pandas?
My idea is to use the n input value to truncate the values in the rows corresponding to that particular day value. But how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter the rows by condition and set all columns except the first with iloc[mask, 1:]; to move the condition to the next row, add Series.shift:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print(df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
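The per-column totals the question asks for are then just a sum over the price columns:
totals = df[['Price A', 'Price B', 'Price C']].sum()
print(totals)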
Pseudo code:
Make sure to sort by day
Shift columns 'A', 'B' and 'C' by n and fill in with 0
Sum accordingly
All of that can be done in one line as well, as sketched below.
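A minimal sketch of those steps, assuming the Price A/B/C column names from the question:
n = 5
cols = ['Price A', 'Price B', 'Price C']
df = df.sort_values('Day')
# shifting up by n pushes the first n days out and leaves NaN at the tail;
# fill with 0 so only days n+1 onward contribute to the totals
totals = df[cols].shift(-n).fillna(0).sum()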
It is simply
dataframe.iloc[:n+1] = 0
This sets the values of all columns in the first n+1 rows to 0
# Sample output
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This truncates the values for all the preceding days. If you want to zero out only the row at position n:
dataframe.iloc[n] = 0
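Note that iloc is positional: if the days start at 1, day d sits at row d-1, so zeroing exactly day n would be (a sketch):
dataframe.iloc[n-1] = 0  # row position n-1 holds day n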
This is my desired output:
I am trying to calculate the columns df[Value] and df[Value_Compensed]. To do that, I need the previous row's value of df[Value_Compensed]. In terms of my table:
In the first row, all the values are 0
In the following rows: df[Remained] = previous df[Value_Compensed]. Then df[Value] = df[Initial_value] + df[Remained]. Then df[Value_Compensed] = df[Value] - df[Compensation]
...and so on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but as you can see in the following image the values in df[Value_Compensed] come out wrong: the value is not static and changes from row to row, so it did not work. Any idea??
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
import numpy as np
import pandas as pd
from itertools import zip_longest
# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply (axis=1) to do row-wise iteration, passing the initial dataframe as an argument; from it you can get the previous row via x.name - 1 and do your calculations. I'm not sure I fully understood the intended result, but you can adjust the individual column calculations in the function.
def f(x, data):
    # first row: everything starts at 0
    if x.name == 0:
        return [0] * data.shape[1]
    else:
        # previous row's value_compensed becomes this row's remained
        x_remained = data.loc[x.name - 1, 'value_compensed']
        x_value = data.loc[x.name - 1, 'initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]

adj = df.apply(f, args=(df,), axis=1)
# apply returns a Series of lists; rebuild a DataFrame with the original columns
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
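If the goal is exactly the recurrence from the question (Remained = previous Value_Compensed, then Value = Initial_value + Remained, then Value_Compensed = Value - Compensation), a plain loop is often the clearest way to carry a value from one row to the next; a sketch using the dummy column names above:
value_compensed = 0  # the first row is all zeros
for i in df.index[1:]:
    remained = value_compensed                        # previous row's value_compensed
    value = df.loc[i, 'initial_value'] + remained
    value_compensed = value - df.loc[i, 'compensation']
    df.loc[i, ['remained', 'value', 'value_compensed']] = remained, value, value_compensed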
I am working with ICD-9 codes for a data mining project using Python and I am having trouble converting the specific codes into categories. For example, I am trying to replace everything between 001 and 139 with 0, everything between 140 and 239 with 1, etc.
This is what I have tried:
df = df.replace({'diag_1' : {'(1-139)' : 0, '(140-239)' : 1}})
You can use pd.cut to achieve this:
In [175]:
df = pd.DataFrame({'value':np.random.randint(0,20,10)})
df
Out[175]:
value
0 12
1 2
2 10
3 5
4 19
5 2
6 8
7 14
8 12
9 16
here we set bin intervals of (0-5), (5-15), (15-20):
In [183]:
df['new_value'] = pd.cut(df['value'], bins=[0,5,15,20], labels=[0,1,2])
df
Out[183]:
value new_value
0 12 1
1 2 0
2 10 1
3 5 0
4 19 2
5 2 0
6 8 1
7 14 1
8 12 1
9 16 2
I think in your case the following should work (two bins need two labels; the edges are chosen so 001-139 maps to 0 and 140-239 maps to 1):
df['diag_1'] = pd.cut(df['diag_1'], bins=[0, 139, 239], labels=[0, 1])
you can set the bins and labels dynamically using np.arange or similar
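For instance, a purely illustrative sketch with evenly spaced edges (the 100-wide bins are hypothetical, not the real ICD-9 ranges):
import numpy as np
bins = np.arange(0, 1001, 100)      # edges 0, 100, ..., 1000
labels = np.arange(len(bins) - 1)   # one label per interval
df['diag_1'] = pd.cut(df['diag_1'], bins=bins, labels=labels)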
There is nothing wrong with an if-statement.
newvalue = 1 if oldvalues <= 139 else 2
Apply this function as a lambda expression with map.
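For example, applied to the diag_1 column from the question (assuming it holds numeric codes):
df['diag_1'] = df['diag_1'].map(lambda oldvalue: 1 if oldvalue <= 139 else 2)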
Given a DataFrame with a hierarchical index containing three levels (experiment, trial, slot) and a second DataFrame with a hierarchical index containing two levels (experiment, trial), how do I drop all the rows in the first DataFrame whose (experiment, trial) are not contained in the second dataframe?
Example data:
from io import StringIO
import pandas as pd
df1_data = StringIO(u',experiment,trial,slot,token\n0,btn144a10_p_RDT,0,0,4.0\n1,btn144a10_p_RDT,0,1,14.0\n2,btn144a10_p_RDT,1,0,12.0\n3,btn144a10_p_RDT,1,1,14.0\n4,btn145a07_p_RDT,0,0,6.0\n5,btn145a07_p_RDT,0,1,19.0\n6,btn145a07_p_RDT,1,0,17.0\n7,btn145a07_p_RDT,1,1,13.0\n8,chn004b06_p_RDT,0,0,6.0\n9,chn004b06_p_RDT,0,1,8.0\n10,chn004b06_p_RDT,1,0,2.0\n11,chn004b06_p_RDT,1,1,5.0\n12,chn008a06_p_RDT,0,0,12.0\n13,chn008a06_p_RDT,0,1,14.0\n14,chn008a06_p_RDT,1,0,6.0\n15,chn008a06_p_RDT,1,1,4.0\n16,chn008b06_p_RDT,0,0,3.0\n17,chn008b06_p_RDT,0,1,13.0\n18,chn008b06_p_RDT,1,0,12.0\n19,chn008b06_p_RDT,1,1,19.0\n20,chn008c04_p_RDT,0,0,17.0\n21,chn008c04_p_RDT,0,1,2.0\n22,chn008c04_p_RDT,1,0,1.0\n23,chn008c04_p_RDT,1,1,6.0\n')
df1 = pd.read_csv(df1_data, index_col=0).set_index(['experiment', 'trial', 'slot'])
df2_data = StringIO(u',experiment,trial,target\n0,btn145a07_p_RDT,1,13\n1,chn004b06_p_RDT,1,9\n2,chn008a06_p_RDT,0,15\n3,chn008a06_p_RDT,1,15\n4,chn008b06_p_RDT,1,1\n5,chn008c04_p_RDT,1,12\n')
df2 = pd.read_csv(df2_data, index_col=0).set_index(['experiment', 'trial'])
The first dataframe looks like:
token
experiment trial slot
btn144a10_p_RDT 0 0 4
1 14
1 0 12
1 14
btn145a07_p_RDT 0 0 6
1 19
1 0 17
1 13
chn004b06_p_RDT 0 0 6
1 8
1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 0 0 3
1 13
1 0 12
1 19
chn008c04_p_RDT 0 0 17
1 2
1 0 1
1 6
The second dataframe looks like:
target
experiment trial
btn145a07_p_RDT 1 13
chn004b06_p_RDT 1 9
chn008a06_p_RDT 0 15
1 15
chn008b06_p_RDT 1 1
chn008c04_p_RDT 1 12
The result I want:
token
experiment trial slot
btn145a07_p_RDT 1 0 17
1 13
chn004b06_p_RDT 1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 1 0 12
1 19
chn008c04_p_RDT 1 0 1
1 6
One way to do it would be by using merge:
merged = pd.merge(
    df2.reset_index(),
    df1.reset_index(),
    on=['experiment', 'trial'],
    how='left')
You just need to reindex merged to whatever you like (I could not tell exactly from the question).
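For example, to match the desired output above, one option (a sketch) is to restore the three-level index and keep just the token column:
result = merged.set_index(['experiment', 'trial', 'slot'])[['token']]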
What should work is
df1.loc[df2.index]
but multi indexing still has some problems. What does work is
df1.reset_index(2).loc[df2.index].set_index('slot', append=True)
which is a bit of a hack around this problem. Note that
df1.loc[df2.index[:1]]
gives garbage, while
df1.loc[df2.index[0]]
gives what you would expect. So passing multiple values from an m-level index to an n-level index where n > m ≥ 2 doesn't work, though it should.
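A sketch of another workaround that sidesteps the multi-level .loc issue entirely, using only standard index methods: drop the slot level from df1's index and build a boolean mask against df2's index:
# True where df1's (experiment, trial) pair appears in df2's index
mask = df1.index.droplevel('slot').isin(df2.index)
result = df1[mask]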