Hi everyone, I want to calculate the sum of the Violent_type count for each year. For example, the total count of Violent_type for year 2013 is 18728 + 121662 + 1035. But I don't know how to select the data when there is a MultiIndex. Any advice would be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
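As an aside, since the question also mentions not knowing how to select data when there is a MultiIndex: .xs and .loc work on the index levels. A small sketch using the same df as above:
# All rows for year 2013, across every district
print(df.xs(2013, level='year'))
# Only district '001' in 2013 (the remaining level, Violent_type, stays in the index)
print(df.loc[('001', 2013)])
# Sum of the counts for that district/year slice
print(df.loc[('001', 2013), 'count'].sum())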
I have a data frame that contains daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory year column. I wanted to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on the 1st of April and ends on the 31st of March. So I used the following code to generate the regulatory week number and month number:
import numpy as np
import pandas as pd

df['Week'] = np.where(df['date'].dt.isocalendar().week > 13, df['date'].dt.isocalendar().week - 13, df['date'].dt.isocalendar().week + 39)
df['month'] = df['date'].dt.strftime('%b')  # abbreviated month name, e.g. 'Apr'
months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
df['month'] = pd.Categorical(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x) + 1)
After creating the two columns mentioned above, my data frame looks as follows:
RY month Week Value 1 Value 2 Value 3 Value 4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns grouped by Week. So I used the following command to calculate the mean:
mean_df = df.groupby('Week')[['Value1', 'Value2', 'Value3', 'Value4']].mean().reset_index()
The new dataframe looks as follows:
Week Value 1 Value 2 Value 3 Value 4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to insert the month name and month number from the first data frame into the new data frame. I thought of merging the two data frames on 'Week', but I found that the same week number is assigned to two different months in the first data frame. For example, week 5 is assigned to both April and May.
Ideally, a week number should be assigned to only one month. I am not sure whether I am calculating the week number in the right manner. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week does not overlap two months would be appreciated.
Presumably, week 5 contains some days in April and some in May. So it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign the month in which the first day of the week falls?
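Here is a minimal sketch of that idea, assuming the same df with a datetime 'date' column and the 'Week' column from the question (the Value1-Value4 names are taken from the groupby call above):
import pandas as pd
# Label each row with the month in which the first day (Monday) of its ISO week falls,
# so every week number maps to exactly one month.
week_start = df['date'] - pd.to_timedelta(df['date'].dt.weekday, unit='D')
df['week_month'] = week_start.dt.strftime('%b')
# Carry the single month label along when averaging by week.
mean_df = (df.groupby('Week')
             .agg({'Value1': 'mean', 'Value2': 'mean',
                   'Value3': 'mean', 'Value4': 'mean',
                   'week_month': 'first'})
             .reset_index())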
Below is the basic data I'm provided with every month. There are many department-related files I get, and the job gets very monotonous and repetitive.
Month,year,sales,
January,2017,34400,
February,2017,35530,
March,2017,34920,
April,2017,35950,
May,2017,36230,
June,2017,36820,
July,2017,34590,
August,2017,36500,
September,2017,36600,
October,2017,37140,
November,2017,36790,
December,2017,43500,
January,2018,34900,
February,2018,37700,
March,2018,37900,
April,2018,38100,
May,2018,37800,
June,2018,38500,
July,2018,39400,
August,2018,39700,
September,2018,39980,
October,2018,40600,
November,2018,39100,
December,2018,46600,
January,2019,42500,
I've tried functions like value_counts() (which sadly only gives a summary) to achieve this output, and failed. (See the desired output below.)
I need to autofill the 3rd and 4th columns (I attempted this with fillna=True/False).
The third column just tells whether the month is a profit or a loss compared to the previous month (e.g. if April is greater than March, then it is Profit).
The fourth column shows the current run of P/L, i.e. 2 months or 5 months of profit (or loss) in a row. (I mean continuously, as it results in certain awards/recognition for teams.)
The fifth column is the maximum sales achieved in the last 'n' months.
They only allow Apache OpenOffice for our job, hence no Excel, but IT has given us permission to install Python.
The solution in this link does not help me, as it groups by two columns. The columns in my output are interdependent.
import pandas as pd

df = pd.read_csv("Test_1.csv")
df['comparative_position'] = df['sales'].diff()  # difference from the previous month
df.loc[df['comparative_position'] > 0.0, 'comparative_position'] = "Profit"
df.loc[df['comparative_position'] < 0.0, 'comparative_position'] = "Loss"
Month,Year,Sales,comparative_position,Months_in_P(or)L,Highest_in_12Months
January,2016,34400,NaN,NaN,NaN
February,2016,35530,Profit,1,NaN
March,2016,34920,Loss,1,NaN
April,2016,35950,Profit,1,NaN
May,2016,36230,Profit,2,NaN
June,2016,36820,Profit,3,NaN
July,2016,34590,Loss,1,NaN
August,2016,36500,Profit,1,NaN
September,2016,36600,Profit,2,NaN
October,2016,37140,Profit,3,NaN
November,2016,36790,Loss,1,NaN
December,2016,43500,Profit,1,43500
January,2017,34900,Loss,1,43500
February,2017,37700,Profit,1,43500
March,2017,37900,Profit,2,43500
April,2017,38100,Profit,3,43500
May,2017,37800,Loss,1,43500
June,2017,38500,Profit,1,43500
July,2017,39400,Profit,2,43500
August,2017,39700,Profit,3,43500
September,2017,39980,Profit,4,43500
October,2017,40600,Profit,5,43500
November,2017,39100,Loss,1,43500
December,2017,46600,Profit,1,46600
January,2018,42500,Loss,1,46600
AFAIU this should work for you:
# Get difference from previous as True / False
df['P/L'] = df.sales > df.sales.shift()
# Add column counting 'streaks' of P or L
df['streak'] = df['P/L'].groupby(df['P/L'].ne(df['P/L'].shift()).cumsum()).cumcount()
# map True/False to string of Profit/Loss
df['P/L'] = df['P/L'].map({True:'Profit', False:'Loss'})
# max of last n months where n is 12, as in your example, you can change it to any int
df['12_max'] = df.sales.rolling(12).max()
Output:
Month year sales P/L streak 12_max
0 January 2017 34400 False 0 NaN
1 February 2017 35530 True 0 NaN
2 March 2017 34920 False 0 NaN
3 April 2017 35950 True 0 NaN
4 May 2017 36230 True 1 NaN
5 June 2017 36820 True 2 NaN
6 July 2017 34590 False 0 NaN
7 August 2017 36500 True 0 NaN
8 September 2017 36600 True 1 NaN
9 October 2017 37140 True 2 NaN
10 November 2017 36790 False 0 NaN
11 December 2017 43500 True 0 43500.0
12 January 2018 34900 False 0 43500.0
13 February 2018 37700 True 0 43500.0
14 March 2018 37900 True 1 43500.0
15 April 2018 38100 True 2 43500.0
16 May 2018 37800 False 0 43500.0
17 June 2018 38500 True 0 43500.0
18 July 2018 39400 True 1 43500.0
19 August 2018 39700 True 2 43500.0
20 September 2018 39980 True 3 43500.0
21 October 2018 40600 True 4 43500.0
22 November 2018 39100 False 0 43500.0
23 December 2018 46600 True 0 46600.0
24 January 2019 42500 False 0 46600.0
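If you want to match the requested columns more closely (a streak that starts at 1 and the column names from your desired output), a small variation on the above, assuming the same df, could be:
# Count the streak from 1 within each run of consecutive Profit/Loss months
df['Months_in_P(or)L'] = (df['P/L']
                          .groupby(df['P/L'].ne(df['P/L'].shift()).cumsum())
                          .cumcount() + 1)
# Rolling 12-month maximum of sales (NaN until 12 months are available)
df['Highest_in_12Months'] = df['sales'].rolling(12).max()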
I'm trying to create new dataframe columns from the contents of an existing column. It's easier to explain with an example. I would like to convert this:
. Yr Month Class Cost
1 2015 1 L 19.2361
2 2015 1 M 29.4723
3 2015 1 S 48.5980
4 2015 1 T 169.7630
5 2015 2 L 19.1506
6 2015 2 M 30.0886
7 2015 2 S 49.3765
8 2015 2 T 167.0000
9 2015 3 L 19.3465
10 2015 3 M 29.1991
11 2015 3 S 46.2580
12 2015 3 T 157.7916
13 2015 4 L 18.3165
14 2015 4 M 28.2314
15 2015 4 S 44.5844
16 2015 4 T 162.3241
17 2015 5 L 17.4556
18 2015 5 M 27.0434
19 2015 5 S 42.8841
20 2015 5 T 159.3457
21 2015 6 L 16.5343
22 2015 6 M 24.9853
23 2015 6 S 40.5612
24 2015 6 T 153.4902
...into the following so that I can plot 4 separate lines [L, M, S, T]:
. Yr Month L M S T
1 2015 1 19.2361 29.4723 48.5980 169.7630
2 2015 2 19.1506 30.0886 49.3765 167.0000
3 2015 3 19.3465 29.1991 46.2580 157.7916
4 2015 4 18.3165 28.2314 44.5844 162.3241
5 2015 5 17.4556 27.0434 42.8841 159.3457
6 2015 6 16.5343 24.9853 40.5612 153.4902
I was able to work through it in what feels like a very clumsy way: filtering the dataframe on the 'Class' column and then doing 3 separate merges.
import pandas as pd

list_class = ['L', 'M', 'S', 'T']
year = 'Yr'
month = 'Month'
df_class = pd.DataFrame()
df_class1 = pd.DataFrame()
df_class2 = pd.DataFrame()
df_class1 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[0]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[1]],
                     left_on=[month, year], right_on=[month, year])
df_class2 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[2]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[3]],
                     left_on=[month, year], right_on=[month, year])
df_class = pd.merge(df_class1, df_class2, left_on=[month, year],
                    right_on=[month, year]).groupby([year, month]).mean().plot(figsize=(15, 8))
There must be a more efficient way. Feels like it should be done with groupby, but I couldn't nail it down.
You can first give df a MultiIndex and then unstack the Class level, which will give you what you want. Suppose df is the original dataframe shown at the very beginning of your post.
df.set_index(['Yr', 'Month', 'Class'])['Cost'].unstack('Class')
Out[29]:
Class            L        M        S         T
Yr   Month
2015 1     19.2361  29.4723  48.5980  169.7630
     2     19.1506  30.0886  49.3765  167.0000
     3     19.3465  29.1991  46.2580  157.7916
     4     18.3165  28.2314  44.5844  162.3241
     5     17.4556  27.0434  42.8841  159.3457
     6     16.5343  24.9853  40.5612  153.4902
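To get the four separate lines the question asks for, you can plot the unstacked frame directly. A minimal sketch using the same df (the figsize mirrors the one in your merge-based attempt):
import matplotlib.pyplot as plt
# One column per Class after unstacking; each column is plotted as a line
wide = df.set_index(['Yr', 'Month'])['Cost'].unstack('Class')
wide.plot(figsize=(15, 8))
plt.show()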