Replace everything that starts with a number - python

I am working with ICD-9 codes for a data mining project using Python, and I am having trouble converting the specific codes into categories. For example, I am trying to replace everything between 001 and 139 with 0, everything between 140 and 239 with 1, and so on.
This is what I have tried:
df = df.replace({'diag_1' : {'(1-139)' : 0, '(140-239)' : 1}})

You can use pd.cut to achieve this:
In [175]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': np.random.randint(0, 20, 10)})
df
Out[175]:
value
0 12
1 2
2 10
3 5
4 19
5 2
6 8
7 14
8 12
9 16
Here we set bin intervals of (0-5), (5-15), (15-20):
In [183]:
df['new_value'] = pd.cut(df['value'], bins=[0,5,15,20], labels=[0,1,2])
df
Out[183]:
value new_value
0 12 1
1 2 0
2 10 1
3 5 0
4 19 2
5 2 0
6 8 1
7 14 1
8 12 1
9 16 2
I think in your case the following should work (note that the number of labels must be one less than the number of bin edges):
df['diag_1'] = pd.cut(df['diag_1'], bins=[0, 139, 239], labels=[0, 1])
You can set the bins and labels dynamically using np.arange or similar, as sketched below.
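As a rough illustration (the chapter boundaries below are my assumption of the standard ICD-9 ranges; check them against your own grouping scheme), you can list the bin edges once and generate the labels with np.arange:
import numpy as np
import pandas as pd

# assumed ICD-9 chapter boundaries -- verify before use
bins = [0, 139, 239, 279, 289, 319, 389, 459, 519, 579, 629, 679, 709, 739, 759, 779, 799, 999]
labels = np.arange(len(bins) - 1)  # 0, 1, 2, ... one label per interval

# diag_1 may be stored as strings; coerce to numbers first
df['diag_1'] = pd.cut(pd.to_numeric(df['diag_1'], errors='coerce'), bins=bins, labels=labels)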

There is nothing wrong with an if-statement:
newvalue = 1 if oldvalue <= 139 else 2
Apply this function as a lambda expression with map, as sketched below.
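A minimal sketch of that approach, assuming the codes are (or can be cast to) integers; category_of is just an illustrative name:
def category_of(code):
    # chain the range checks; extend with further elif branches for the remaining ranges
    code = int(code)
    if code <= 139:
        return 0
    elif code <= 239:
        return 1
    else:
        return 2

df['diag_1'] = df['diag_1'].map(category_of)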


Auto re-assign ids in a dataframe

I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
        'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large (6-digit) integers. I want a way to simplify them, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order in which they occur in your data, while the groupby().ngroup() solution will enumerate the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, and you can replicate the data order with groupby() by passing sort=False to it. Both conventions are sketched below.
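A short sketch of the two order conventions (factorize also accepts a sort flag, which should give the increasing order without pre-sorting):
appearance_order = df['id'].factorize()[0] + 10            # keys numbered in order of first appearance
increasing_order = df['id'].factorize(sort=True)[0] + 10   # keys numbered in increasing order
appearance_via_groupby = df.groupby('id', sort=False).ngroup() + 10  # groupby variant, keeping data order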
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs, and every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10, incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)

DataFrame.loc does not work on a MultiIndex, very simple

I have a pandas data frame that looks like this:
(Screenshot of the DataFrame omitted.) As you can see, I have the indexes "State" and "City".
And I want to filter by state using loc, for example using:
nuevo4.loc["Bulgaria"]
(The name of the DataFrame is "nuevo4".) But instead of getting the results I want, I get the error:
KeyError: 'Bulgaria'
I read the loc documentation online and I cannot see the problem here. I'm sorry if this is too obvious; the names are spelled correctly.
Your command should work; see the example below. You might have whitespace issues in your data (see the stripping sketch after the example):
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(25).reshape(-1,5), index = pd.MultiIndex.from_tuples([('A',1), ('A',2),('B', 0), ('C', 1), ('C', 2)]))
print(df)
print(df.loc['A'])
Output:
0 1 2 3 4
A 1 0 1 2 3 4
2 5 6 7 8 9
B 0 10 11 12 13 14
C 1 15 16 17 18 19
2 20 21 22 23 24
Using loc:
0 1 2 3 4
1 0 1 2 3 4
2 5 6 7 8 9
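If stray whitespace in the index labels is the culprit, one way to clean it up (a sketch, assuming the string levels of the MultiIndex hold plain Python strings) is:
# strip whitespace from every string level of the MultiIndex
nuevo4.index = nuevo4.index.set_levels(
    [lvl.str.strip() if lvl.dtype == object else lvl for lvl in nuevo4.index.levels]
)
nuevo4.loc["Bulgaria"]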

How do I perform rolling division with several columns in Pandas?

I'm having problems with the pd.rolling() method: it returns several outputs even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using DataFrame.iterrows(), but that approach is inefficient on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method, using iterrows(), produces the output I want:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100))
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs((((df2.values / vals.reset_index(drop=True).values) - 1) * 100))
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums produces several columns of values, and they are nowhere near the results I get with the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves: it windows over each column separately, and I don't know that there is a way around it. One solution is to apply rolling to a single column and use the indexes from those windows to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np

def SumOfAverageFunction(vals):
    return (abs(np.divide(df2.values, df1.loc[vals.index].values) - 1) * 100).sum()

vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)

Replacing all values in a Pandas column, with no conditions

I have a Pandas dataframe with a column full of values I want to replace with another value, unconditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, unconditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) returns KeyError(True,). I tried data.loc[1, 'ColumnName'] = new_value and it works but it really looks like a shitty solution.
If I knew the len() of data['ColumnName'], I could create an array of that size filled with copies of my new_value and simply replace the column with that array. Ten lines of code to do something that, done conditionally, takes one line: this is also not OK.
How can I tell Pandas in 1 line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
..and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
import pandas as pd

df = pd.DataFrame(
    {'num1': [3, 5, 9, 9, 14, 1],
     'num2': [3, 5, 9, 9, 14, 1]},
    index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1

Iterative subtraction of each row using pandas?

I have a dataframe like this:
abc
9 32.242063
3 24.419279
8 25.464011
6 25.029761
10 18.851918
2 26.027582
1 27.885187
4 20.141231
5 31.179138
7 22.893074
11 31.640625
0 33.150434
I want to subtract the first row from 100, then subtract the second row from the remainder (100 minus the first row), and so on.
I tried:
a = 100 - df["abc"]
but every time it subtracts from 100 rather than from the running remainder.
Can anybody suggest the correct way to do it?
It seems you need:
df['new'] = 100 - df['abc'].cumsum()
print (df)
abc new
9 32.242063 67.757937
3 24.419279 43.338658
8 25.464011 17.874647
6 25.029761 -7.155114
10 18.851918 -26.007032
2 26.027582 -52.034614
1 27.885187 -79.919801
4 20.141231 -100.061032
5 31.179138 -131.240170
7 22.893074 -154.133244
11 31.640625 -185.773869
0 33.150434 -218.924303
Option 1
np.cumsum -
df["abc"] = 100 - np.cumsum(df.abc.values)
df
abc
9 67.757937
3 43.338658
8 17.874647
6 -7.155114
10 -26.007032
2 -52.034614
1 -79.919801
4 -100.061032
5 -131.240170
7 -154.133244
11 -185.773869
0 -218.924303
This is faster than pd.Series.cumsum in the other answer.
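A quick way to check that claim on your own data (a sketch; timings will vary with machine and series length):
%timeit 100 - np.cumsum(df.abc.values)
%timeit 100 - df['abc'].cumsum()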
Option 2
Loopy equivalent, cythonized.
%load_ext Cython
%%cython
def foo(r):
    x = [100 - r[0]]
    for i in r[1:]:
        x.append(x[-1] - i)
    return x

df['abc'] = foo(df['abc'])
df
abc
9 66.849566
3 42.430287
8 16.966276
6 -8.063485
10 -26.915403
2 -52.942985
1 -80.828172
4 -100.969403
5 -132.148541
7 -155.041615
11 -186.682240
0 -219.832674
