Pandas: how to apply normalization to a column with a condition - python

I have a data frame like the one below and I want to normalize the values per customer. Please help me figure out how to achieve this. I tried MinMaxScaler from sklearn on the complete price column, but it gives me values close to zero.
Dataframe
Customer price
A 0
A 3
A 7
A 0
A 0
B 2
B 2
B 0
C 5
C 1
D 0
D 0
D 15
D 0

If you want it per customer:
from sklearn.preprocessing import MinMaxScaler

df.groupby('Customer').price.transform(
    lambda s: MinMaxScaler().fit_transform(s.values.reshape(-1, 1)).ravel()
)
0 0.000000
1 0.428571
2 1.000000
3 0.000000
4 0.000000
5 1.000000
6 1.000000
7 0.000000
8 1.000000
9 0.000000
10 0.000000
11 0.000000
12 1.000000
13 0.000000
Name: price, dtype: float64
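The same per-customer min-max scaling can also be done without sklearn; a minimal sketch, assuming each customer has at least two distinct prices (otherwise the denominator is zero and the result is NaN):
df.groupby('Customer').price.transform(lambda s: (s - s.min()) / (s.max() - s.min()))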

You can solve it without MinMaxScaler:
df["norm"]=df.groupby("Customer").apply(\
lambda grp: grp.price.div(grp.price.max()) ).values
Customer price norm
0 A 0 0.000000
1 A 3 0.428571
2 A 7 1.000000
3 A 0 0.000000
4 A 0 0.000000
5 B 2 1.000000
6 B 2 1.000000
7 B 0 0.000000
8 C 5 1.000000
9 C 1 0.200000
10 D 0 0.000000
11 D 0 0.000000
12 D 15 1.000000
13 D 0 0.000000
Edit:
For another normalization, you can divide by grp.price.sum() instead of grp.price.max().
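A minimal sketch of that variant using transform, which keeps row alignment so no .values step is needed (assuming each customer's total is non-zero):
df["norm_sum"] = df.groupby("Customer").price.transform(lambda s: s / s.sum())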
Edit2:
For more columns you can do:
cols=["price","weights"] # group the requested column names
df2= df.groupby("Customer").apply(lambda grp: grp[cols].div(grp[cols].max()) )
new_df=pd.concat([df,df2],axis=1)
You must then rename the last (normalized) columns:
new_df.columns
Index(['Customer', 'price', 'weight', 'price', 'weight'], dtype='object')
new_df.columns= df.columns.append(pd.Index(["norm_"+c for c in df2.columns]))
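As an alternative sketch (assuming the extra column is called "weight", as in the output above), a groupby transform avoids the concat and rename steps:
cols = ["price", "weight"]
norm = df.groupby("Customer")[cols].transform(lambda s: s / s.max())
new_df = df.join(norm.add_prefix("norm_"))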

Related

How to make a pivot table with the index and values in Python pandas? [duplicate]

Here is an example:
data:
Group1 Group2 date value1 value2
0 A 01/20/20 0 1
1 A 01/25/20 0 3
2 A 02/28/20 0 2
3 B 01/25/20 0 1
4 B 01/27/20 2 2
5 C 01/29/20 0 5
6 C 01/30/20 2 6
I want to get a pivot table that counts the frequencies of the different values in Group2 and makes the Group2 column the index of the final table. It's very easy in pandas when the index and values of the pivot table are not the same column, but when they are the same, pandas raises an error and I can't solve this problem.
The output I want to get is a table like the following, giving the frequency of the different values in column 'Group 1' of the data:
Group 1  Frequency
A        3
B        2
C        2
I'm not sure what your desired output looks like, but it sounds like you are looking for something like this:
import pandas as pd
from io import StringIO
data = StringIO("""
Group1,Group2,date,value1,value2
0,A,01/20/20,0,1
1,A,01/25/20,0,3
2,A,02/28/20,0,2
3,B,01/25/20,0,1
4,B,01/27/20,2,2
5,C,01/29/20,0,5
6,C,01/30/20,2,6
""")
df = pd.read_csv(data, sep = ",")
pd.crosstab(index=df['Group2'], columns=[df['value1'], df['value2']],
normalize='index')
Output
value1 0 2
value2 1 2 3 5 2 6
Group2
A 0.333333 0.333333 0.333333 0.0 0.0 0.0
B 0.500000 0.000000 0.000000 0.0 0.5 0.0
C 0.000000 0.000000 0.000000 0.5 0.0 0.5
Or are you just interested in one value column?
pd.crosstab(index=df['Group2'], columns=df['value2'],
normalize='index')
Output
value2 1 2 3 5 6
Group2
A 0.333333 0.333333 0.333333 0.0 0.0
B 0.500000 0.500000 0.000000 0.0 0.0
C 0.000000 0.000000 0.000000 0.5 0.5
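If only the frequency table sketched in the question is needed (a count per Group2 value), a minimal sketch without crosstab (column names as in the CSV above):
df['Group2'].value_counts()
# or as a two-column frame:
df['Group2'].value_counts().rename_axis('Group2').reset_index(name='Frequency')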

Pandas - Adjust stock prices to stock splits

I have a DataFrame, with prices data and stock splits data.
I want to put all the prices on the same scale, so for example if there was a stock split of 0.11111 (1/9),
then from that point on all the stock prices should be multiplied by 9.
So for example if this is my initial dataframe:
df= Price Stock_Splits
0 100 0
1 99 0
2 10 0.1111111
3 8 0
4 8.5 0
5 4 0.5
The "Price" column will become:
df= Price Stock_Splits
0 100 0
1 99 0
2 90 0.1111111
3 72 0
4 76.5 0
5 72 0.5
Here is one example:
df['New_Price'] = (1 / df.Stock_Splits).replace(np.inf, 1).cumprod() * df.Price
Price Stock_Splits New_Price
0 100.0 0.000000 100.000000
1 99.0 0.000000 99.000000
2 10.0 0.111111 90.000009
3 8.0 0.000000 72.000007
4 8.5 0.000000 76.500008
5 4.0 0.500000 72.000007
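For reference, a self-contained sketch of the same idea (sample data assumed from the question): each split row contributes a factor of 1/split, rows without a split keep a factor of 1, and cumprod carries the factor forward.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price': [100, 99, 10, 8, 8.5, 4],
                   'Stock_Splits': [0, 0, 0.1111111, 0, 0, 0.5]})
factor = (1 / df.Stock_Splits).replace(np.inf, 1).cumprod()  # approximately 1, 1, 9, 9, 9, 18
df['New_Price'] = df.Price * factor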

Divide several columns in a python dataframe where both the numerator and denominator columns will vary based on a picklist

I'm creating a dataframe by paring down a very large dataframe (approximately 400 columns) based on choices an end user makes on a picklist. One of the picklist choices is the type of denominator the end user would like. Here is one example table with all the information before the final calculation is made.
county _tcount _tvote _f_npb_18_count _f_npb_18_vote
countycode
35 San Benito 28194 22335 2677 1741
36 San Bernardino 912653 661838 108724 61832
countycode _f_npb_30_count _f_npb_30_vote
35 384 288
36 76749 53013
However, I am having trouble creating code that will automatically divide every column starting with the 5th (not including the index) by the column before it (skipping every other column). I've seen examples (Divide multiple columns by another column in pandas), but they all use fixed column names, which is not achievable for this aspect. I've been able to divide variable columns (based on position) by fixed columns, but not variable columns by other variable columns based on position. I've tried modifying the code in the above link based on the column positions:
calculated_frame = [county_select_frame[county_select_frame.columns[5: : 2]].div(county_select_frame[4: :2], axis=0)]
output:
[ county _tcount _tvote _f_npb_18_count _f_npb_18_vote \
countycode
35 NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN]
RuntimeWarning: invalid value encountered in greater
(abs_vals > 0)).any()
The use of [5: :2] does work when the dividend is a fixed field. If I can't get this to work, it's not a big deal (but it would be great to have all the options I wanted).
My preference would be to set the index and use filter to split out separate counts and votes dataframes, then use join:
d1 = df.set_index('county', append=True)
counts = d1.filter(regex=r'.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex=r'.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))
d1[['_tcount', '_tvote']].join(votes / counts)
_tcount _tvote _f_npb_18 _f_npb_30
countycode county
35 San Benito 28194 22335 0.650355 0.750000
36 San Bernardino 912653 661838 0.568706 0.690732
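A hedged follow-up on the same approach: because stripping the _count / _vote suffixes leaves matching column names, votes / counts aligns column-by-column. To make the ratio columns self-describing before the join, something like:
ratios = (votes / counts).add_suffix('_share')
d1[['_tcount', '_tvote']].join(ratios)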
I think you can divide the numpy arrays obtained from .values, because then column names are not aligned. Finally, create a new DataFrame with the constructor:
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
Sample:
np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10,10)),
columns=list('abcdefghij'))
print (county_select_frame)
a b c d e f g h i j
0 9 4 0 1 9 0 1 8 9 0
1 8 6 4 3 0 4 6 8 1 8
2 4 1 3 6 5 3 9 6 9 1
3 9 4 2 6 7 8 8 9 2 0
4 6 7 8 1 7 1 4 0 8 5
5 4 7 8 8 2 6 2 8 8 6
6 6 5 6 0 0 6 9 1 8 9
7 1 2 8 9 9 5 0 2 7 3
8 0 4 2 0 3 3 1 2 5 9
9 0 1 0 1 9 0 9 2 1 1
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
f h j
0 0.000000 8.000000 0.000000
1 inf 1.333333 8.000000
2 0.600000 0.666667 0.111111
3 1.142857 1.125000 0.000000
4 0.142857 0.000000 0.625000
5 3.000000 4.000000 0.750000
6 inf 0.111111 1.125000
7 0.555556 inf 0.428571
8 1.000000 2.000000 1.800000
9 0.000000 0.222222 1.000000
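The inf values above come from zero denominators; if those are unwanted, one possible follow-up (assuming numpy is imported as np):
df1 = df1.replace([np.inf, -np.inf], np.nan)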
How about something like:
cols = my_df.columns
for i in range(2, 6):
    print('Creating new col %s' % cols[i])
    my_df['new_{0}'.format(cols[i])] = my_df[cols[i]] / my_df[cols[i - 1]]

Pandas Efficient Way to Calculate Annual Inventory Based on Equipment Date Ranges

I'm looking to translate a dataframe of equipment date ranges and characteristics into the total annual install time by characteristic grouping, starting from a dataframe like this:
import numpy as np
import pandas as pd

df_eq = pd.DataFrame({'equip': np.arange(0, 10),
                      'char1': [4]*4 + [1, 2, 3] + [5]*3,
                      'char2': ['A']*3 + ['B']*3 + ['C']*4,
                      'start': pd.to_datetime(['2010-01-10', '2010-01-10', '2011-02-24', '2011-06-06', '2013-09-30',
                                               '2010-01-10', '2010-01-10', '2011-02-24', '2011-06-06', '2013-09-30']),
                      'end': pd.to_datetime(['2014-05-05']*2 + ['2015-01-01']*3 + [None]*5)})
df_eq
char1 char2 end equip start
0 4 A 2014-05-05 0 2010-01-10
1 4 A 2014-05-05 1 2010-01-10
2 4 A 2015-01-01 2 2011-02-24
3 4 B 2015-01-01 3 2011-06-06
4 1 B 2015-01-01 4 2013-09-30
5 2 B NaT 5 2010-01-10
6 3 C NaT 6 2010-01-10
7 5 C NaT 7 2011-02-24
8 5 C NaT 8 2011-06-06
9 5 C NaT 9 2013-09-30
The NaT values in end represent equipment that has not yet been retired. Using this dataframe, I'm looking to produce the following outputs, where the quantities are the installed time of the units within the given year:
char1 2011 2012 2013 2014
0 1 0.000000 0 0.254795 1.000000
1 2 1.000000 1 1.000000 1.000000
2 3 1.000000 1 1.000000 1.000000
3 4 3.424658 4 4.000000 2.684932
4 5 1.424658 2 2.254795 3.000000
char1 char2 2011 2012 2013 2014
0 1 B 0.000000 0 0.254795 1.000000
1 2 B 1.000000 1 1.000000 1.000000
2 3 C 1.000000 1 1.000000 1.000000
3 4 A 2.852055 3 3.000000 1.684932
4 4 B 0.572603 1 1.000000 1.000000
5 5 C 1.424658 2 2.254795 3.000000
I can produce the desired tables with the following code, but I'm looking to see if there is a more pythonic way using pandas to produce the same output tables:
import datetime

df_eq.end = df_eq.end.fillna(pd.to_datetime(datetime.date.today()))

def days_in_year(start, end, year):
    start_of_year = pd.to_datetime(datetime.date(year, 1, 1))
    end_of_year = pd.to_datetime(datetime.date(year, 12, 31))
    if start.year > year or end.year < year:
        return 0
    initial_date = start_of_year if start_of_year > start else start
    ending_date = end_of_year if end_of_year < end else end
    return (ending_date - initial_date + pd.Timedelta(days=1)) / (end_of_year - start_of_year + pd.Timedelta(days=1))

df_inv_yr = pd.DataFrame({year: [days_in_year(srt, end, year) for srt, end in zip(df_eq.start, df_eq.end)]
                          for year in [2011, 2012, 2013, 2014]})
first_sample_output = pd.concat([df_eq, df_inv_yr], axis=1).groupby(['char1'])[[2011, 2012, 2013, 2014]].sum().reset_index()
second_sample_output = pd.concat([df_eq, df_inv_yr], axis=1).groupby(['char1', 'char2'])[[2011, 2012, 2013, 2014]].sum().reset_index()
I think you can vectorize some of your code using .where like this:
def days_in_year(years, df_eq):
    df = df_eq.copy()
    for year in years:
        beg = pd.Timestamp(year, 1, 1)
        end = pd.Timestamp(year + 1, 1, 1)
        df[year] = (df.end.where(df.end <= end, other=end)
                    - df.start.where(df.start <= end, other=end).where(df.start > beg, other=beg)) / (end - beg)
    return df
years = list(range(2011, 2015))
df = days_in_year(years,df_eq)
first_sample_output=df.groupby(['char1'])[years].sum().reset_index()
second_sample_output=df.groupby(['char1','char2'])[years].sum().reset_index()
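To illustrate the clipping that .where performs in that expression, a small standalone sketch (dates chosen purely for illustration):
s = pd.Series(pd.to_datetime(['2010-01-10', '2013-09-30', '2014-05-05']))
beg, end = pd.Timestamp(2013, 1, 1), pd.Timestamp(2014, 1, 1)
# values above end are pulled back to end; values at or below beg are pushed up to beg
clipped = s.where(s <= end, other=end).where(s > beg, other=beg)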

Pandas sum across columns and divide each cell from that value

I have read a csv file and pivoted it to get the following structure:
pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)
Section of the result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 group
user_id
2 33653 2325 916 720 867 187 31 0 6 3 42 56 92 15 l-1
4 18895 414 1116 570 1190 55 92 0 122 23 78 6 4 2 l-2
16 1383 70 27 17 17 1 0 0 0 0 1 0 0 0 l-2
50 396 72 34 5 18 0 0 0 0 0 0 0 0 0 l-3
51 3915 1170 402 832 2791 316 12 5 118 51 32 9 62 27 l-4
I want to sum columns 0 through 13 for each row and divide each cell by that row's sum. I am still getting used to pandas; if I understand correctly, for loops should be avoided for operations like this. In other words, how can I do this the 'pandas' way?
More simply:
result.div(result.sum(axis=1), axis=0)
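Note that result still contains the non-numeric group column; a hedged variant that restricts the division to the numeric columns first:
num_cols = result.columns.drop('group')
result[num_cols] = result[num_cols].div(result[num_cols].sum(axis=1), axis=0)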
Try the following:
In [1]: import pandas as pd
In [2]: df = pd.read_csv("test.csv")
In [3]: df
Out[3]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [4]: df["sum"] = df.sum(axis=1)
In [5]: df
Out[5]:
id value1 value2 value3 sum
0 A 1 2 3 6
1 B 4 5 6 15
2 C 7 8 9 24
In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [7]: df_new
Out[7]:
value1 value2 value3
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
Or you can do the following:
In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [9]: df
Out[9]:
id value1 value2 value3 sum
0 A 0.166667 0.333333 0.500 6
1 B 0.266667 0.333333 0.400 15
2 C 0.291667 0.333333 0.375 24
Or just straight up from the beginning:
In [10]: df = pd.read_csv("test.csv")
In [11]: df
Out[11]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)
In [13]: df
Out[13]:
id value1 value2 value3
0 A 0.166667 0.333333 0.500
1 B 0.266667 0.333333 0.400
2 C 0.291667 0.333333 0.375
Changing the column value1 and the like to your headers should work similarly.
It's easier to work per column:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T
result:
0 1 2
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
The following seemed to work fine for me:
In [39]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols] = result[cols].apply(lambda row: row / row.sum(), axis=1)
result
Out[39]:
0 1 2 3 4 5 6 \
user_id
2 0.864827 0.059749 0.023540 0.018503 0.022280 0.004806 0.000797
4 0.837285 0.018345 0.049453 0.025258 0.052732 0.002437 0.004077
16 0.912269 0.046174 0.017810 0.011214 0.011214 0.000660 0.000000
50 0.754286 0.137143 0.064762 0.009524 0.034286 0.000000 0.000000
51 0.401868 0.120099 0.041265 0.085403 0.286491 0.032437 0.001232
7 8 9 10 11 12 13 \
user_id
2 0.000000 0.000154 0.000077 0.001079 0.001439 0.002364 0.000385
4 0.000000 0.005406 0.001019 0.003456 0.000266 0.000177 0.000089
16 0.000000 0.000000 0.000000 0.000660 0.000000 0.000000 0.000000
50 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
51 0.000513 0.012113 0.005235 0.003285 0.000924 0.006364 0.002772
group
user_id
2 l-1
4 l-2
16 l-2
50 l-3
51 l-4
OK scratch the above, the following will be much faster:
result[cols] = result[cols].div(result[cols].sum(axis=1), axis=0)
And just to prove the result is the same:
In [47]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(), axis=1))
Out[47]:
True
And that it's faster:
In [48]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0)
%timeit result[cols].apply(lambda row: row / row.sum(), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)
result.iloc[:,:-1] selects all rows and every column except the last one (the group column).
result.iloc[:,:-1].sum(axis=1) sums across each row because of axis=1 (the default, axis=0, sums down each column).
.div(..., axis=0) divides along the index, i.e. each row by its own sum; axis=0 is needed because div aligns on columns by default (axis=1).
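If the group column should be kept next to the normalized values, one possible follow-up (names as above):
normalized = result.iloc[:, :-1].div(result.iloc[:, :-1].sum(axis=1), axis=0)
normalized['group'] = result['group']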
