I am trying to create a new list using data from a pandas DataFrame. The DataFrame in question has a column of Dates as well as a column for Units Sold, as seen below:
Peep = Xsku[['new_date', 'cum_sum']]
Peep.head(15)
Out[159]:
new_date cum_sum
18 2011-01-17 214
1173 2011-01-24 343
2328 2011-01-31 407 #Save Entry in List
3483 2011-02-07 71
4638 2011-02-14 159
5793 2011-02-21 294
6948 2011-02-28 425 #Save Entry in List
8103 2011-03-07 113
9258 2011-03-14 249
10413 2011-03-21 347
11568 2011-03-28 463 #Save Entry in List
12723 2011-04-04 99
13878 2011-04-11 186
15033 2011-04-18 291
16188 2011-04-25 385
I am trying to make a new list that contains the maximum 'cum_sum' before the number is reset (i.e. becomes smaller). For example, in the first four entries above, the cum_sum reaches 407 and then drops back down to 71. I am thus trying to save the number 407 as well as the corresponding 'new_date' (2011-01-31 in this example), and to do this for every reset.
My final list will thus contain all the maximum 'cum_sum' values before each reset.
For example, it will look as follows:
(First Three Expected Values)
MyList
Out[]:
new_date cum_sum
2011-01-31 407
2011-02-28 425
2011-03-28 463
...
I have been trying to do something with a for loop, but I continually run into problems:
MyList = []  # My empty list
for i in range(len(Peep['new_date'])):
    if Peep.iloc[i, 1] > Peep.iloc[i + 1, 1]:
        MyList.append(Peep.iloc[i, 1])
Can anyone help me in this regard?
Use .diff and filter, like:
In [17]: df[df['cum_sum'].diff(-1).ge(0)]
Out[17]:
new_date cum_sum
2 2011-01-31 407
6 2011-02-28 425
10 2011-03-28 463
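For completeness, a minimal self-contained sketch of the same idea (the frame here is a hypothetical reconstruction of the first rows of Peep, rebuilt with a fresh 0-based index, which is why the output above shows rows 2, 6 and 10). It also collects the (date, value) pairs into a plain list, as the question asks:

import pandas as pd

# Hypothetical reconstruction of the first rows of Peep shown above.
peep = pd.DataFrame({
    'new_date': pd.to_datetime(['2011-01-17', '2011-01-24', '2011-01-31',
                                '2011-02-07', '2011-02-14', '2011-02-21',
                                '2011-02-28', '2011-03-07']),
    'cum_sum': [214, 343, 407, 71, 159, 294, 425, 113],
})

# diff(-1) computes cum_sum[i] - cum_sum[i+1]; a non-negative value means
# the running total stops growing at the next row, i.e. it is about to reset.
peaks = peep[peep['cum_sum'].diff(-1).ge(0)]

# Collect (new_date, cum_sum) pairs into a plain Python list.
my_list = list(peaks.itertuples(index=False, name=None))
# [(Timestamp('2011-01-31 00:00:00'), 407), (Timestamp('2011-02-28 00:00:00'), 425)]

Note that the very last row is never flagged (its diff is NaN), so if the series ends mid-run you may want to append the final row explicitly.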
I'm quite new to Python and working with data frames, so this might be a very simple problem.
I successfully imported some measurement data (1-minute resolution) and did some calculations on it. I now want to redo some of the processing on a 15-minute basis (not averaging), for which I extracted every row at :00, :15, :30 and :45 from the original data frame.
df_interval = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 15) | (df['DateTime'].dt.minute == 30) | (df['DateTime'].dt.minute == 45)]
This seems to work fine. Now I want to recalculate the concentration every 15 minutes based on what the instrument is doing internally, which is a simple formula.
So what I tried is:
for i in df_interval.index:
    if np.isnan(df_interval.ATN[i]) == False and np.isnan(df_interval.ATN[i+1]) == False:
        df_15min = (0.785 *((df_interval.ATN[i+1]-df_interval.ATN[i])/100))/(df_interval.Flow[i]*(1-0.07)*10.8*(1-df_interval.K[i]*df_interval.ATN[i])*15)
However, I end up with a KeyError: 226, and I don't understand why.
Update:
Here is the data, and in the last column (df_15min) the result that I want to get:
     ATN          Flow  K        df_15min
150               3647  0.00994
165               3634  0.00996
180               3634  0.00995
195               3621  0.00995
210               3615  0.00994
225  1.703678939  3754  0.00994  3.75E-08
240  4.356519267  3741  0.00994  3.84E-08
255  6.997422571  3741  0.00994  3.94E-08
270  9.627710046  3736  0.00995  4.02E-08
285  12.23379251  3728  0.01007  3.89E-08
300  14.67175418  3727  0.01026  3.76E-08
315  16.9583747   3714  0.01043  3.73E-08
330  19.1497249   3714  0.01061  3.96E-08
345  21.39628083  3709  0.01079  3.87E-08
360  23.51512717  3701  0.01086  4.02E-08
375  25.63995721  3700  0.01083  3.90E-08
390  27.63886191  3688  0.0108   3.47E-08
405  29.36343728  3688  0.01076  3.68E-08
420  31.14291069  3677  0.01072  3.90E-08
I do a lot of things in Igor, so that is how I would do it there (unfortunately for me, it has to be in Python this time):
variable i
For (i=0; i<numpnts(ATN)-1; i+=1)
    df_15min[i] = (0.785 *((ATN[i+1]-ATN[i])/100))/(Flow[i]*(1-0.07)*10.8*(1-K[i]*ATN[i])*15)
endfor
Any help would be appreciated, thanks!
You can write the same operation as vectorized code. Just use the whole columns and shift(-1) to get the "next" row.
df['df_15min'] = (0.785 *((df['ATN'].shift(-1)-df['ATN'])/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
Or using diff:
df['df_15min'] = (0.785 *((-df['ATN'].diff(-1))/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
output:
ATN Flow K df_15min
index
150 NaN 3647 0.00994 NaN
165 NaN 3634 0.00996 NaN
180 NaN 3634 0.00995 NaN
195 NaN 3621 0.00995 NaN
210 NaN 3615 0.00994 NaN
225 1.703679 3754 0.00994 3.745468e-08
240 4.356519 3741 0.00994 3.844700e-08
255 6.997423 3741 0.00994 3.937279e-08
270 9.627710 3736 0.00995 4.019633e-08
285 12.233793 3728 0.01007 3.886148e-08
300 14.671754 3727 0.01026 3.763219e-08
315 16.958375 3714 0.01043 3.734876e-08
330 19.149725 3714 0.01061 3.955360e-08
345 21.396281 3709 0.01079 3.870011e-08
360 23.515127 3701 0.01086 4.017342e-08
375 25.639957 3700 0.01083 3.897022e-08
390 27.638862 3688 0.01080 3.473242e-08
405 29.363437 3688 0.01076 3.675232e-08
420 31.142911 3677 0.01072 NaN
Note that your loop indexes by label, not by position: after filtering, df_interval keeps the original index labels, which are no longer consecutive (225 is followed by 240, for example). So df_interval.ATN[i+1] looks up the label 226, which does not exist, and that is exactly the KeyError: 226 you are seeing.
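If you do want to keep an explicit loop, iterate by position rather than by label. A minimal sketch (names taken from the question; it assumes the three columns are numeric):

import numpy as np

# Plain arrays, so i and i + 1 are positions, not index labels.
atn = df_interval['ATN'].to_numpy()
flow = df_interval['Flow'].to_numpy()
k = df_interval['K'].to_numpy()

result = np.full(len(atn), np.nan)
for i in range(len(atn) - 1):
    if not np.isnan(atn[i]) and not np.isnan(atn[i + 1]):
        result[i] = (0.785 * ((atn[i + 1] - atn[i]) / 100)) / (
            flow[i] * (1 - 0.07) * 10.8 * (1 - k[i] * atn[i]) * 15)

# assign returns a new frame, sidestepping chained-assignment warnings.
df_interval = df_interval.assign(df_15min=result)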
I have a pandas DataFrame I am trying to sort, which contains an int column (encoded_target) that I sort like so:
some_set.encoded_target = train_set.encoded_target.astype(int) # last but one column
some_set.sort_values(by='encoded_target', ascending=True)
print(some_set)
and this gives me:
1953 61c4930b42ca426eb8dfaf7314899d08__11_115_3... 61c4930b42ca426eb8dfaf7314899d08__115 134 61c4930b42ca426eb8dfaf7314899d08
1623 3659cfea02b44543812e13f0d7fb7147__105_105_4... 3659cfea02b44543812e13f0d7fb7147__105 63 3659cfea02b44543812e13f0d7fb7147
241 bd67717fe59e4fa8bb5307a663016eb3__13_13_3_p... bd67717fe59e4fa8bb5307a663016eb3__13 290 bd67717fe59e4fa8bb5307a663016eb3
1573 9fdfabfad9974d6cac5b588ff2d9e47a__194__194_2... 9fdfabfad9974d6cac5b588ff2d9e47a__194 238 9fdfabfad9974d6cac5b588ff2d9e47a
602 0a64aee93755481cb9f5162373c776f8__182__182_1... 0a64aee93755481cb9f5162373c776f8__182 13 0a64aee93755481cb9f5162373c776f8
... ... ... ... ...
1779 7b19321376b842a2aece02cd458fb043__186__186_3... 7b19321376b842a2aece02cd458fb043__186 187 7b19321376b842a2aece02cd458fb043
2910 64bff78431914373a78c8f547d985b7d__141__141_2... 64bff78431914373a78c8f547d985b7d__141 142 64bff78431914373a78c8f547d985b7d
1377 2410de3f2fee45cdab25b61428f282bd__93__93_3_p... 2410de3f2fee45cdab25b61428f282bd__93 39 2410de3f2fee45cdab25b61428f282bd
2533 a567db4f10c34228b5452f79b5ff08d7__43__43_1_p... a567db4f10c34228b5452f79b5ff08d7__43 247 a567db4f10c34228b5452f79b5ff08d7
2790 9430d8f375bc4888a0a61b47bc7228fd__102__102_3... 9430d8f375bc4888a0a61b47bc7228fd__102 217 9430d8f375bc4888a0a61b47bc7228fd
Clearly this is wrong: 13 must come before 134.
I have spent two hours trying to figure out what could be wrong, but I am having no luck whatsoever.
:((
Any clues would be great.
One thing to remember is to assign the result back, since sort_values does not sort in place by default:
some_set = some_set.sort_values(by='encoded_target', ascending=True)
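Or, equivalently, sort in place using pandas' inplace flag:

some_set.sort_values(by='encoded_target', ascending=True, inplace=True)
print(some_set)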
I'm given a set of the following data:
week A B C D E
1 243 857 393 621 194
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
5 712 734 308 385 303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4,:]), what should I do next? My assumption is that I need to create a mask with a single row that includes the sum of each column, but I'm not sure how I go about doing that.
I know how to do the final step (i.e. .plot(kind='bar')), I just need to know what the middle step is to obtain the sums I need.
You can select by position with iloc, then use sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the last position, while loc's label slicing is inclusive:
print (df.loc[2:4])
A B C D E
week
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
print (df.iloc[2:4])
A B C D E
week
3 946 252 453 547 436
4 560 100 864 663 949
If you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by labels (week index and column names):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week 7
A 1506
B 352
C 1317
D 1210
E 1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
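Putting it together, a self-contained sketch (a hypothetical reconstruction of the question's frame, with week kept as a regular column so weeks 3 and 4 sit at positions 2 and 3):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical reconstruction of the table from the question.
df = pd.DataFrame({
    'week': [1, 2, 3, 4, 5],
    'A': [243, 644, 946, 560, 712],
    'B': [857, 576, 252, 100, 734],
    'C': [393, 534, 453, 864, 308],
    'D': [621, 792, 547, 663, 385],
    'E': [194, 207, 436, 949, 303],
})

# Sum weeks 3 and 4 (positions 2 and 3), dropping the week column.
df.iloc[2:4, 1:].sum().plot(kind='bar')
plt.show()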
I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot().
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my Series into x and y columns. As of right now, I can access the "Wavelength" values by using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which get passed automatically when you pass graph to it. What you are missing are the window size and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding:
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
As you are using a Series instead of a DataFrame, some libraries cannot access the index to use it as a column. Use:
df = df.reset_index()
It will convert the index into an extra column that you can use in savgol_filter or anywhere else.
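Combining the two answers into one runnable sketch (assuming graph is the Series shown above; its dtype is object, so the values are cast to float before filtering):

import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# reset_index turns the Series into a DataFrame with
# columns 'Wavelength' and 'A1'.
df = graph.reset_index()
x = df['Wavelength']
y = savgol_filter(df['A1'].astype(float), window_length=7, polyorder=3)

plt.plot(x, y)
plt.show()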
I have two DataFrames that are each of the exact same dimensions, and I would like to multiply just one specific column from each of them together:
My first DataFrame is:
In [834]: patched_benchmark_df_sim
Out[834]:
build_number name cycles
0 390 adpcm 21598
1 390 aes 5441
2 390 blowfish NaN
3 390 dfadd 463
....
284 413 jpeg 766742
285 413 mips 4263
286 413 mpeg2 2021
287 413 sha 348417
[288 rows x 3 columns]
My second DataFrame is:
In [835]: patched_benchmark_df_syn
Out[835]:
build_number name fmax
0 390 adpcm 143.45
1 390 aes 309.60
2 390 blowfish NaN
3 390 dfadd 241.02
....
284 413 jpeg 197.75
285 413 mips 202.39
286 413 mpeg2 291.29
287 413 sha 243.19
[288 rows x 3 columns]
And I would like to take each element of the cycles column of patched_benchmark_df_sim, multiply it by the corresponding element of the fmax column of patched_benchmark_df_syn, and then store the result in a new DataFrame with exactly the same structure, containing the build_number and name columns, but where the last column of numerical data is called latency: the product of fmax and cycles.
So the output DataFrame has to look something like this:
build_number name latency
0 390 adpcm ## each value here has to be product of cycles and fmax and they must correspond to one another ##
......
I tried doing a straightforward patched_benchmark_df_sim * patched_benchmark_df_syn, but that did not work, as my DataFrames have the name column, which is of string type. Is there no built-in pandas method that can do this for me? How could I proceed with the multiplication to get the result I need?
Thank you very much.
The simplest thing to do is to add a new column to the df, then select the columns you want and, if you want, assign them to a new df:
In [356]:
df['latency'] = df['cycles'] * df1['fmax']
df
Out[356]:
build_number name cycles latency
0 390 adpcm 21598 3.098233e+06
1 390 aes 5441 1.684534e+06
2 390 blowfish NaN NaN
3 390 dfadd 463 1.115923e+05
284 413 jpeg 766742 1.516232e+08
285 413 mips 4263 8.627886e+05
286 413 mpeg2 2021 5.886971e+05
287 413 sha 348417 8.473153e+07
In [357]:
new_df = df[['build_number', 'name', 'latency']]
new_df
Out[357]:
build_number name latency
0 390 adpcm 3.098233e+06
1 390 aes 1.684534e+06
2 390 blowfish NaN
3 390 dfadd 1.115923e+05
284 413 jpeg 1.516232e+08
285 413 mips 8.627886e+05
286 413 mpeg2 5.886971e+05
287 413 sha 8.473153e+07
As you've found, you can't multiply DataFrames containing non-numeric columns the way you tried. The above assumes that the build_number and name columns line up row-for-row in both dfs.
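If the two frames might not be row-aligned, a safer variant (a sketch; it assumes build_number plus name uniquely identifies each row) is to merge on the key columns first and multiply inside the merged frame:

merged = patched_benchmark_df_sim.merge(
    patched_benchmark_df_syn, on=['build_number', 'name'])
merged['latency'] = merged['cycles'] * merged['fmax']
new_df = merged[['build_number', 'name', 'latency']]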