Add to values of a DataFrame using coordinates - python

I have a dataframe a:
Out[68]:
p0_4 p5_7 p8_9 p10_14 p15 p16_17 p18_19 p20_24 p25_29 \
0 1360.0 921.0 676.0 1839.0 336.0 668.0 622.0 1190.0 1399.0
1 308.0 197.0 187.0 411.0 67.0 153.0 172.0 336.0 385.0
2 76.0 59.0 40.0 72.0 16.0 36.0 20.0 56.0 82.0
3 765.0 608.0 409.0 1077.0 220.0 359.0 342.0 873.0 911.0
4 1304.0 906.0 660.0 1921.0 375.0 725.0 645.0 1362.0 1474.0
5 195.0 135.0 78.0 262.0 44.0 97.0 100.0 265.0 229.0
6 1036.0 965.0 701.0 1802.0 335.0 701.0 662.0 1321.0 1102.0
7 5072.0 3798.0 2865.0 7334.0 1399.0 2732.0 2603.0 4976.0 4575.0
8 1360.0 962.0 722.0 1758.0 357.0 710.0 713.0 1761.0 1660.0
9 743.0 508.0 369.0 1118.0 286.0 615.0 429.0 738.0 885.0
10 1459.0 1015.0 679.0 1732.0 337.0 746.0 677.0 1493.0 1546.0
11 828.0 519.0 415.0 1057.0 190.0 439.0 379.0 788.0 1024.0
12 1042.0 690.0 503.0 1204.0 219.0 451.0 465.0 1193.0 1406.0
p30_44 p45_59 p60_64 p65_74 p75_84 p85_89 p90plus
0 4776.0 8315.0 2736.0 5463.0 2819.0 738.0 451.0
1 1004.0 2456.0 988.0 2007.0 1139.0 313.0 153.0
2 291.0 529.0 187.0 332.0 108.0 31.0 10.0
3 2807.0 5505.0 2060.0 4104.0 2129.0 516.0 252.0
4 4524.0 9406.0 3034.0 6003.0 3366.0 840.0 471.0
5 806.0 1490.0 606.0 1288.0 664.0 185.0 108.0
6 4127.0 8311.0 2911.0 6111.0 3525.0 1029.0 707.0
7 16917.0 27547.0 8145.0 15950.0 9510.0 2696.0 1714.0
8 5692.0 9380.0 3288.0 6458.0 3830.0 1050.0 577.0
9 2749.0 5696.0 2014.0 4165.0 2352.0 603.0 288.0
10 4676.0 7654.0 2502.0 5077.0 3004.0 754.0 461.0
11 2799.0 4880.0 1875.0 3951.0 2294.0 551.0 361.0
12 3288.0 5661.0 1974.0 4007.0 2343.0 623.0 303.0
and a series d:
Out[70]:
2 p45_59
10 p45_59
11 p45_59
Is there a simple way to add 1 to the values in a whose index and column labels appear in d?
I have tried:
a[d] += 1
However, this adds 1 to every value in the column, not just the values at indices 2, 10 and 11.
Thanks in advance.

You might want to try this:
a.loc[list(d.index), list(d.values)] += 1
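Note that .loc with two lists selects the full cross product of the row and column labels, so this works here only because every entry of d points to the same column. If d mapped different rows to different columns, a pointwise update would be needed; a minimal sketch, assuming d is the Series above:
# Increment each (row, column) pair individually instead of the
# cross product of all listed rows and columns.
for idx, col in d.items():
    a.loc[idx, col] += 1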

Related

How can I iterate over a pandas dataframe so I can divide specific values based on a condition?

I have a dataframe like below:
0 1 2 ... 62 63 64
795 89.0 92.0 89.0 ... 74.0 64.0 4.0
575 80.0 75.0 78.0 ... 70.0 68.0 3.0
1119 2694.0 2437.0 2227.0 ... 4004.0 4010.0 6.0
777 90.0 88.0 88.0 ... 71.0 67.0 4.0
506 82.0 73.0 77.0 ... 69.0 64.0 2.0
... ... ... ... ... ... ... ...
65 84.0 77.0 78.0 ... 78.0 80.0 0.0
1368 4021.0 3999.0 4064.0 ... 1.0 4094.0 8.0
1036 80.0 80.0 79.0 ... 73.0 66.0 5.0
1391 3894.0 3915.0 3973.0 ... 4.0 4090.0 8.0
345 81.0 74.0 75.0 ... 80.0 75.0 1.0
I want to divide all elements over 1000 in this dataframe by 100. So 4021.0 becomes 40.21, et cetera.
I've tried something like below:
for cols in df:
    for rows in df[cols]:
        print(df[cols][rows])
I get index out of bound errors. I'm just not sure how to properly iterate the way I'm looking for.
Loops are slow here, so it is better to use a vectorized solution: select the values greater than 1000 and divide:
df[df.gt(1000)] = df.div(100)
Or using DataFrame.mask:
df = df.mask(df.gt(1000), df.div(100))
print(df)
0 1 2 62 63 64
795 89.00 92.00 89.00 74.00 64.00 4.0
575 80.00 75.00 78.00 70.00 68.00 3.0
1119 26.94 24.37 22.27 40.04 40.10 6.0
777 90.00 88.00 88.00 71.00 67.00 4.0
506 82.00 73.00 77.00 69.00 64.00 2.0
65 84.00 77.00 78.00 78.00 80.00 0.0
1368 40.21 39.99 40.64 1.00 40.94 8.0
1036 80.00 80.00 79.00 73.00 66.00 5.0
1391 38.94 39.15 39.73 4.00 40.90 8.0
345 81.00 74.00 75.00 80.00 75.00 1.0
You can use the applymap function with a custom function:
def mapper_function(x):
    if x > 1000:
        x = x / 100
    return x

df = df.applymap(mapper_function)
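For larger frames, a NumPy alternative avoids calling a Python function per element; a minimal sketch, assuming the same df:
import numpy as np
import pandas as pd

# np.where evaluates the condition element-wise in compiled code,
# dividing only the entries above 1000 and keeping the rest.
df = pd.DataFrame(np.where(df > 1000, df / 100, df),
                  index=df.index, columns=df.columns)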

Create a well-structured pandas dataframe from a Series

I have pandas data (a Series, shown below) from 2018 to 2020. I want to structure it as follows:
Month | 2018 | 2019
Jan 115 73
Feb 112 63
....
up to December.
How can I do this with pandas syntax?
Date
2018-01-01 115.0
2018-02-01 112.0
2018-03-01 104.5
2018-04-01 91.1
2018-05-01 85.5
2018-06-01 76.5
2018-07-01 86.5
2018-08-01 77.9
2018-09-01 65.0
2018-10-01 71.0
2018-11-01 76.0
2018-12-01 72.5
2019-01-01 73.0
2019-02-01 63.0
2019-03-01 63.0
2019-04-01 61.0
2019-05-01 58.3
2019-06-01 59.0
2019-07-01 67.0
2019-08-01 64.0
2019-09-01 59.9
2019-10-01 70.4
2019-11-01 78.9
2019-12-01 75.0
2020-01-01 73.9
Name: Close, dtype: float64
This is more like a pivot, but done with crosstab:
pd.crosstab(s.index.strftime('%b'), s.index.year, s.values, aggfunc='sum')
Out[87]:
col_0 2018 2019 2020
row_0
Apr 91.1 61.0 NaN
Aug 77.9 64.0 NaN
Dec 72.5 75.0 NaN
Feb 112.0 63.0 NaN
Jan 115.0 73.0 73.9
Jul 86.5 67.0 NaN
Jun 76.5 59.0 NaN
Mar 104.5 63.0 NaN
May 85.5 58.3 NaN
Nov 76.0 78.9 NaN
Oct 71.0 70.4 NaN
Sep 65.0 59.9 NaN
You can use groupby and unstack:
(s.groupby([s.index.month, s.index.year]).first().unstack()
   .rename_axis(columns='Year', index='Month'))
Output:
Year 2018 2019 2020
Month
1 115.0 73.0 73.9
2 112.0 63.0 NaN
3 104.5 63.0 NaN
4 91.1 61.0 NaN
5 85.5 58.3 NaN
6 76.5 59.0 NaN
7 86.5 67.0 NaN
8 77.9 64.0 NaN
9 65.0 59.9 NaN
10 71.0 70.4 NaN
11 76.0 78.9 NaN
12 72.5 75.0 NaN
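Note that the crosstab output sorts the month abbreviations alphabetically, while the groupby version keeps numeric months. To get abbreviations in calendar order, one option (a sketch, assuming the Series s and the groupby result above):
import calendar

out = (s.groupby([s.index.month, s.index.year]).first().unstack()
         .rename_axis(columns='Year', index='Month'))
# Month numbers 1-12 are already in calendar order, so mapping them
# to 'Jan'..'Dec' keeps the rows correctly sorted.
out.index = out.index.map(lambda m: calendar.month_abbr[m])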

Taking the mean value of N last days, including NaNs

I have this data frame:
ID Date X 123_Var 456_Var 789_Var
A 16-07-19 3 777.0 250.0 810.0
A 17-07-19 9 637.0 121.0 529.0
A 20-07-19 2 295.0 272.0 490.0
A 21-07-19 3 778.0 600.0 544.0
A 22-07-19 6 741.0 792.0 907.0
A 25-07-19 6 435.0 416.0 820.0
A 26-07-19 8 590.0 455.0 342.0
A 27-07-19 6 763.0 476.0 753.0
A 02-08-19 6 717.0 211.0 454.0
A 03-08-19 6 152.0 442.0 475.0
A 05-08-19 6 564.0 340.0 302.0
A 07-08-19 6 105.0 929.0 633.0
A 08-08-19 6 948.0 366.0 586.0
B 07-08-19 4 509.0 690.0 406.0
B 08-08-19 2 413.0 725.0 414.0
B 12-08-19 2 170.0 702.0 912.0
B 13-08-19 3 851.0 616.0 477.0
B 14-08-19 9 475.0 447.0 555.0
B 15-08-19 1 412.0 403.0 708.0
B 17-08-19 2 299.0 537.0 321.0
B 18-08-19 4 310.0 119.0 125.0
C 16-07-19 3 777.0 250.0 810.0
C 17-07-19 9 637.0 121.0 529.0
C 20-07-19 2 NaN NaN NaN
C 21-07-19 3 NaN NaN NaN
C 22-07-19 6 741.0 792.0 907.0
C 25-07-19 6 NaN NaN NaN
C 26-07-19 8 590.0 455.0 342.0
C 27-07-19 6 763.0 476.0 753.0
C 02-08-19 6 717.0 211.0 454.0
C 03-08-19 6 NaN NaN NaN
C 05-08-19 6 564.0 340.0 302.0
C 07-08-19 6 NaN NaN NaN
C 08-08-19 6 948.0 366.0 586.0
I want to show the mean value of the last n days (per the Date column), excluding the current day's value.
I'm using this code (what should I do to fix this?):
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 4
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
df_ = df.set_index('ID', append=True).swaplevel(1,0)
df1 = df.groupby('ID').rolling(window=f'{n+1}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n+1}D')[cols].mean()
df3 = (df1.mul(df2)
          .sub(df_[cols])
          .div(df1[cols].sub(1))
          .add_suffix(f'_{n}'))
df4 = df_.join(df3)
Expected result:
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
A 16-07-19 3 777.0 250.0 810.0 NaN NaN NaN
A 17-07-19 9 637.0 121.0 529.0 777.000000 250.000000 810.0
A 20-07-19 2 295.0 272.0 490.0 707.000000 185.500000 669.5
A 21-07-19 3 778.0 600.0 544.0 466.000000 196.500000 509.5
A 22-07-19 6 741.0 792.0 907.0 536.500000 436.000000 517.0
A 25-07-19 6 435.0 416.0 820.0 759.500000 696.000000 725.5
A 26-07-19 8 590.0 455.0 342.0 588.000000 604.000000 863.5
A 27-07-19 6 763.0 476.0 753.0 512.500000 435.500000 581.0
A 02-08-19 6 717.0 211.0 454.0 NaN NaN NaN
A 03-08-19 6 152.0 442.0 475.0 717.000000 211.000000 454.0
A 05-08-19 6 564.0 340.0 302.0 434.500000 326.500000 464.5
A 07-08-19 6 105.0 929.0 633.0 358.000000 391.000000 388.5
A 08-08-19 6 948.0 366.0 586.0 334.500000 634.500000 467.5
B 07-08-19 4 509.0 690.0 406.0 NaN NaN NaN
B 08-08-19 2 413.0 725.0 414.0 509.000000 690.000000 406.0
B 12-08-19 2 170.0 702.0 912.0 413.000000 725.000000 414.0
B 13-08-19 3 851.0 616.0 477.0 291.500000 713.500000 663.0
B 14-08-19 9 475.0 447.0 555.0 510.500000 659.000000 694.5
B 15-08-19 1 412.0 403.0 708.0 498.666667 588.333333 648.0
B 17-08-19 2 299.0 537.0 321.0 579.333333 488.666667 580.0
B 18-08-19 4 310.0 119.0 125.0 395.333333 462.333333 528.0
C 16-07-19 3 777.0 250.0 810.0 NaN NaN NaN
C 17-07-19 9 637.0 121.0 529.0 777.000000 250.000000 810.0
C 20-07-19 2 NaN NaN NaN 707.000000 185.500000 669.5
C 21-07-19 3 NaN NaN NaN 637.000000 121.000000 529.0
C 22-07-19 6 741.0 792.0 907.0 NaN NaN NaN
C 25-07-19 6 NaN NaN NaN 741.000000 792.000000 907.0
C 26-07-19 8 590.0 455.0 342.0 741.000000 792.000000 907.0
C 27-07-19 6 763.0 476.0 753.0 590.000000 455.000000 342.0
C 02-08-19 6 717.0 211.0 454.0 NaN NaN NaN
C 03-08-19 6 NaN NaN NaN 717.000000 211.000000 454.0
C 05-08-19 6 564.0 340.0 302.0 717.000000 211.000000 454.0
C 07-08-19 6 NaN NaN NaN 564.000000 340.000000 302.0
C 08-08-19 6 948.0 366.0 586.0 564.000000 340.000000 302.0
The numbers after the decimal point don't matter.
These threads might help:
Taking the mean value of N last days
Taking the min value of N last days
Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = (df.groupby('ID')[['Date', '123_Var', '456_Var', '789_Var']]
         .rolling('4D', on='Date', closed='left').mean())
dfx = (df.set_index(['ID', 'Date'])
         .join(df1.reset_index().set_index(['ID', 'Date']), rsuffix='_4')
         .reset_index()
         .drop('level_1', axis=1))
print(dfx.to_string())
print(dfx.to_string())
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
0 A 2019-07-16 3 777.0 250.0 810.0 NaN NaN NaN
1 A 2019-07-17 9 637.0 121.0 529.0 777.000000 250.000000 810.0
2 A 2019-07-20 2 295.0 272.0 490.0 707.000000 185.500000 669.5
3 A 2019-07-21 3 778.0 600.0 544.0 466.000000 196.500000 509.5
4 A 2019-07-22 6 741.0 792.0 907.0 536.500000 436.000000 517.0
5 A 2019-07-25 6 435.0 416.0 820.0 759.500000 696.000000 725.5
6 A 2019-07-26 8 590.0 455.0 342.0 588.000000 604.000000 863.5
7 A 2019-07-27 6 763.0 476.0 753.0 512.500000 435.500000 581.0
8 A 2019-08-02 6 717.0 211.0 454.0 NaN NaN NaN
9 A 2019-08-03 6 152.0 442.0 475.0 717.000000 211.000000 454.0
10 A 2019-08-05 6 564.0 340.0 302.0 434.500000 326.500000 464.5
11 A 2019-08-07 6 105.0 929.0 633.0 358.000000 391.000000 388.5
12 A 2019-08-08 6 948.0 366.0 586.0 334.500000 634.500000 467.5
13 B 2019-08-07 4 509.0 690.0 406.0 NaN NaN NaN
14 B 2019-08-08 2 413.0 725.0 414.0 509.000000 690.000000 406.0
15 B 2019-08-12 2 170.0 702.0 912.0 413.000000 725.000000 414.0
16 B 2019-08-13 3 851.0 616.0 477.0 170.000000 702.000000 912.0
17 B 2019-08-14 9 475.0 447.0 555.0 510.500000 659.000000 694.5
18 B 2019-08-15 1 412.0 403.0 708.0 498.666667 588.333333 648.0
19 B 2019-08-17 2 299.0 537.0 321.0 579.333333 488.666667 580.0
20 B 2019-08-18 4 310.0 119.0 125.0 395.333333 462.333333 528.0
21 C 2019-07-16 3 777.0 250.0 810.0 NaN NaN NaN
22 C 2019-07-17 9 637.0 121.0 529.0 777.000000 250.000000 810.0
23 C 2019-07-20 2 NaN NaN NaN 707.000000 185.500000 669.5
24 C 2019-07-21 3 NaN NaN NaN 637.000000 121.000000 529.0
25 C 2019-07-22 6 741.0 792.0 907.0 NaN NaN NaN
26 C 2019-07-25 6 NaN NaN NaN 741.000000 792.000000 907.0
27 C 2019-07-26 8 590.0 455.0 342.0 741.000000 792.000000 907.0
28 C 2019-07-27 6 763.0 476.0 753.0 590.000000 455.000000 342.0
29 C 2019-08-02 6 717.0 211.0 454.0 NaN NaN NaN
30 C 2019-08-03 6 NaN NaN NaN 717.000000 211.000000 454.0
31 C 2019-08-05 6 564.0 340.0 302.0 717.000000 211.000000 454.0
32 C 2019-08-07 6 NaN NaN NaN 564.000000 340.000000 302.0
33 C 2019-08-08 6 948.0 366.0 586.0 564.000000 340.000000 302.0
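The key detail is closed='left', which drops the row at the right edge of each window so the current day is excluded from its own mean. A minimal sketch of that behaviour on a toy Series:
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0],
              index=pd.to_datetime(['2019-07-01', '2019-07-02', '2019-07-03']))
# With closed='left' the window is [t-4D, t), so the mean on
# 2019-07-03 uses only the two preceding days: (1 + 2) / 2 = 1.5.
print(s.rolling('4D', closed='left').mean())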

Rename specific columns with numeric headers to str+number

I originally have r CSV files.
I created one dataframe with 9 columns, r of which have numbers as headers.
I would like to target only those columns and rename them to 'Apple' plus a running number from 1 to r.
Example:
I have 3 csv files.
The current 3 targeted columns in my dataframe are:
0 1 2
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
I would like:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
Thank you
IIUC, you can initialise an itertools.count object and reset the columns in a list comprehension.
from itertools import count
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
This will also work very well if the digits are not contiguous, but you want them to be renamed with a contiguous suffix:
print(df)
1 Col1 5 Col2 500
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
print(df)
Apple1 Col1 Apple2 Col2 Apple3
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
You can use rename with a column mapper (rename_axis with a mapper is deprecated in newer pandas):
df.rename(columns=lambda x: 'Apple{}'.format(int(x) + 1) if str(x).isdigit() else x)
Out[9]:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
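An explicit mapping dict is another option and lets you inspect the renaming before applying it; a minimal sketch, assuming df is the frame above:
# Build {old: new} for digit-like headers only, numbering them in order.
digit_cols = [c for c in df.columns if str(c).isdigit()]
mapping = {c: 'Apple{}'.format(i) for i, c in enumerate(digit_cols, 1)}
df = df.rename(columns=mapping)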

Python: Create a new column of date from an existing column of date by subtracting consecutive rows [duplicate]

This question already has answers here:
Adding a column thats result of difference in consecutive rows in pandas
(4 answers)
Closed 5 years ago.
Code:
import pandas as pd
df = pd.read_csv('xyz.csv', usecols=['transaction_date', 'amount'])
df = pd.concat(g for _, g in df.groupby("amount") if len(g) > 3)
df = df.reset_index(drop=True)
print(df)
Output:
transaction_date amount
0 2016-06-02 50.0
1 2016-06-02 50.0
2 2016-06-02 50.0
3 2016-06-02 50.0
4 2016-06-02 50.0
5 2016-06-02 50.0
6 2016-07-04 50.0
7 2016-07-04 50.0
8 2016-09-29 225.0
9 2016-10-29 225.0
10 2016-11-29 225.0
11 2016-12-30 225.0
12 2017-01-30 225.0
13 2016-05-16 1000.0
14 2016-05-20 1000.0
I need to add another column next to the amount column that gives the difference between consecutive rows of transaction_date,
e.g.
transaction_date amount delta(days)
0 2016-06-02 50.0 -
1 2016-06-02 50.0 0
2 2016-06-02 50.0 0
3 2016-06-02 50.0 0
4 2016-06-02 50.0 0
5 2016-06-02 50.0 0
6 2016-07-04 50.0 32
7 2016-07-04 50.0 .
8 2016-09-29 225.0 .
9 2016-10-29 225.0 .
10 2016-11-29 225.0
There are probably better methods, but you can use pandas.Series.shift; note that shift(-1) computes the difference to the next row, so each delta lands one row above where the expected output places it:
>>> df.transaction_date.shift(-1) - df.transaction_date
0 0 days
1 0 days
2 0 days
3 0 days
4 0 days
5 32 days
6 0 days
7 87 days
8 30 days
9 31 days
10 31 days
11 31 days
12 -259 days
13 4 days
14 NaT
I think you need diff + dt.days:
df['delta(days)'] = df['transaction_date'].diff().dt.days
print(df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 -259.0
14 2016-05-20 1000.0 4.0
But if you need to compute it per group, add groupby:
df['delta(days)'] = df.groupby('amount')['transaction_date'].diff().dt.days
print(df)
transaction_date amount delta(days)
0 2016-06-02 50.0 NaN
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 NaN
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
To get the exact output you've requested (sorting optional), use shift to get the timedelta, then dt.days to get an integer:
df.transaction_date = pd.to_datetime(df.transaction_date)
df.sort_values('transaction_date', inplace=True)
df['delta(days)'] = (df['transaction_date'] - df['transaction_date'].shift(1)).dt.days
Output:
transaction_date amount delta(days)
13 2016-05-16 1000.0 NaN
14 2016-05-20 1000.0 4.0
0 2016-06-02 50.0 13.0
1 2016-06-02 50.0 0.0
2 2016-06-02 50.0 0.0
3 2016-06-02 50.0 0.0
4 2016-06-02 50.0 0.0
5 2016-06-02 50.0 0.0
6 2016-07-04 50.0 32.0
7 2016-07-04 50.0 0.0
8 2016-09-29 225.0 87.0
9 2016-10-29 225.0 30.0
10 2016-11-29 225.0 31.0
11 2016-12-30 225.0 31.0
12 2017-01-30 225.0 31.0
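Note that diff().dt.days only works once transaction_date has a datetime dtype; the read_csv call in the question loads it as strings. A minimal sketch of the full pipeline, assuming the question's xyz.csv:
import pandas as pd

# parse_dates makes transaction_date a datetime64 column,
# so .diff().dt.days works directly afterwards.
df = pd.read_csv('xyz.csv', usecols=['transaction_date', 'amount'],
                 parse_dates=['transaction_date'])
df = pd.concat(g for _, g in df.groupby('amount') if len(g) > 3)
df = df.reset_index(drop=True)
df['delta(days)'] = df.groupby('amount')['transaction_date'].diff().dt.days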
