Reshaping & Selecting Pandas from Pivot - python

Problem
I have the following data frame (note values are just to show format):
>>> print df
Country Public Private
Date
2013-01-17 BE 3389
2013-01-17 DK 4532 681
2013-02-21 DE 2453 1752
2013-02-21 IE 5143
2013-02-21 ES 8633 353
2013-03-21 FR 262
2013-03-21 LT 358
I would like to pivot it to show the following format:
Country Country1 Country2
Private Public Private Public
Date
2013-01-17 681 353 262 5143
2013-02-21 149 176 124 1757
2013-03-21 149 176 124 1757
Generate Problem
This will generate the problem
import pandas as pd
data =[['2013-01-17', 'BE',1000,3389],
['2013-01-17', 'IE',5823, 681],
['2013-01-17', 'FR',1000,1752],
['2013-02-17', 'IE',1000,5143],
['2013-02-17', 'FR',1000, 353],
['2013-03-17', 'FR',1000, 262],
['2013-03-17', 'BE',1000, 358]]
df = pd.DataFrame(data,columns=['Date','Country','Public','Private']).set_index('Date')
Attempts
The best I can manage is getting Country and the Data Description the wrong way round:
>>> print df.pivot(index=df.index,columns='Country').fillna('')
Public Private
Country AT BE DE DK
Date
2013-01-17 1000 1000 1000 1000
2013-02-21 1000 1000 1000 1000
2013-03-21 1000 1000 1000 1000

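You can use swaplevel to swap the two column levels, then sort the columns so that each country's Private and Public columns sit next to each other. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.swaplevel.html
In current pandas, sortlevel has been removed in favour of sort_index, and pivot takes column labels for index, so Date is reset to a regular column first:
df.reset_index().pivot(index='Date', columns='Country').fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1)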
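The pivot-then-swap approach can be sketched end to end on the sample data from the question. This is a minimal sketch, assuming a recent pandas release (where `sort_index(axis=1)` is used to order the column levels) and assuming `Date` is turned back into a column before pivoting; the names `wide` and `result` are only illustrative.

```python
import pandas as pd

data = [['2013-01-17', 'BE', 1000, 3389],
        ['2013-01-17', 'IE', 5823, 681],
        ['2013-01-17', 'FR', 1000, 1752],
        ['2013-02-17', 'IE', 1000, 5143],
        ['2013-02-17', 'FR', 1000, 353],
        ['2013-03-17', 'FR', 1000, 262],
        ['2013-03-17', 'BE', 1000, 358]]
df = pd.DataFrame(data, columns=['Date', 'Country', 'Public', 'Private']).set_index('Date')

# Pivot with Date as a regular column; the columns become (value, Country).
wide = df.reset_index().pivot(index='Date', columns='Country')

# Swap the two column levels so Country is on top, then sort the columns
# so that each country's Private/Public pair is adjacent.
result = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(result)
```

Missing country/date combinations show up as NaN; append `.fillna('')` only for display, since it turns the numeric columns into object columns.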

Related

For loop with non-consecutive indices

I'm quite new to Python and working with data frames, so this might be a very simple problem.
I successfully imported some measurement data (1 minute resolution) and did some calculations on them. I want to recalculate some data processing on a 15 minute basis (not average), for which I extracted every row at :00, :15, :30 and :45 from the original data frame.
df_interval = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 15) | (df['DateTime'].dt.minute == 30) | (df['DateTime'].dt.minute == 45)]
This seems to work fine. Now I want to recalculate the concentration every 15 minutes based on what the instrument does internally, which is a simple formula.
So what I tried is:
for i in df_interval.index:
    if np.isnan(df_interval.ATN[i]) == False and np.isnan(df_interval.ATN[i+1]) == False:
        df_15min = (0.785 *((df_interval.ATN[i+1]-df_interval.ATN[i])/100))/(df_interval.Flow[i]*(1-0.07)*10.8*(1-df_interval.K[i]*df_interval.ATN[i])*15)
However, I end up with KeyError: 226, and I don't understand why.
Update:
Here is the data and in the last column (df_15min) also the result that I want to get:
index  ATN          Flow  K        df_15min
150                 3647  0.00994
165                 3634  0.00996
180                 3634  0.00995
195                 3621  0.00995
210                 3615  0.00994
225    1.703678939  3754  0.00994  3.75E-08
240    4.356519267  3741  0.00994  3.84E-08
255    6.997422571  3741  0.00994  3.94E-08
270    9.627710046  3736  0.00995  4.02E-08
285    12.23379251  3728  0.01007  3.89E-08
300    14.67175418  3727  0.01026  3.76E-08
315    16.9583747   3714  0.01043  3.73E-08
330    19.1497249   3714  0.01061  3.96E-08
345    21.39628083  3709  0.01079  3.87E-08
360    23.51512717  3701  0.01086  4.02E-08
375    25.63995721  3700  0.01083  3.90E-08
390    27.63886191  3688  0.0108   3.47E-08
405    29.36343728  3688  0.01076  3.68E-08
420    31.14291069  3677  0.01072  3.90E-08
I do a lot of things in Igor, so that is how I would do it there (unfortunately for me, it has to be in python this time):
variable i
For (i=0; i<numpnts(ATN)-1; i+=1)
df_15min[i] = (0.785 *((ATN[i+1]-ATN[i])/100))/(Flow[i]*(1-0.07)*10.8*(1-K[i]*ATN[i])*15)
endfor
Any help would be appreciated, thanks!
You can write exactly the same operation as vectorized code. Just use the whole columns and shift(-1) to get the "next" row.
df['df_15min'] = (0.785 *((df['ATN'].shift(-1)-df['ATN'])/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
Or using diff:
df['df_15min'] = (0.785 *((-df['ATN'].diff(-1))/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
output:
ATN Flow K df_15min
index
150 NaN 3647 0.00994 NaN
165 NaN 3634 0.00996 NaN
180 NaN 3634 0.00995 NaN
195 NaN 3621 0.00995 NaN
210 NaN 3615 0.00994 NaN
225 1.703679 3754 0.00994 3.745468e-08
240 4.356519 3741 0.00994 3.844700e-08
255 6.997423 3741 0.00994 3.937279e-08
270 9.627710 3736 0.00995 4.019633e-08
285 12.233793 3728 0.01007 3.886148e-08
300 14.671754 3727 0.01026 3.763219e-08
315 16.958375 3714 0.01043 3.734876e-08
330 19.149725 3714 0.01061 3.955360e-08
345 21.396281 3709 0.01079 3.870011e-08
360 23.515127 3701 0.01086 4.017342e-08
375 25.639957 3700 0.01083 3.897022e-08
390 27.638862 3688 0.01080 3.473242e-08
405 29.363437 3688 0.01076 3.675232e-08
420 31.142911 3677 0.01072 NaN
Your if condition checks bc_interval.row1[i+1] for nan and then you access df_interval.row1[i+1]. Looks like you wanted to check df_interval.row1[i+1] instead.
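The KeyError in the question comes from label-based lookup: after filtering, the index keeps its original, non-consecutive labels, so `i+1` (for example 226) is not a label that exists. The sketch below illustrates this on a hypothetical four-row miniature of `df_interval` (the values are rounded stand-ins, not the real measurements) and shows how `shift(-1)` pairs each row with the next one by position instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_interval: the index holds the original,
# non-consecutive row labels that survive the 15-minute filter.
df_interval = pd.DataFrame(
    {"ATN": [np.nan, 1.70, 4.36, 7.00],
     "Flow": [3615, 3754, 3741, 3741],
     "K": [0.00994, 0.00994, 0.00994, 0.00994]},
    index=[210, 225, 240, 255],
)

# Label-based lookup fails because 226 is not a label in the index.
try:
    df_interval.ATN[226]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# shift(-1) aligns every row with the following row by position, not by label.
next_atn = df_interval["ATN"].shift(-1)
df_interval["df_15min"] = (0.785 * ((next_atn - df_interval["ATN"]) / 100)) / (
    df_interval["Flow"] * (1 - 0.07) * 10.8
    * (1 - df_interval["K"] * df_interval["ATN"]) * 15
)
print(lookup_failed)
print(df_interval)
```

Rows where either the current or the next ATN is missing come out as NaN, which replaces the explicit `np.isnan` checks in the loop.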

Python - Adding grouped mode as additional column in original dataset

So I have data similar to this:
import pandas as pd
df = pd.DataFrame({'Order ID':[555,556,557,558,559,560,561,562,563,564,565,566],
'State':["MA","MA","MA","MA","MA","MA","CT","CT","CT","CT","CT","CT"],
'County':["Essex","Essex","Essex","Worcester","Worcester","Worcester","Bristol","Bristol","Bristol","Hartford","Hartford","Hartford"],
'AP':[50,50,75,100,100,125,150,150,175,200,200,225]})
but I need to add a column that shows the mode of AP grouped by State and County. I can get the mode this way:
(df.groupby(['State', 'County']).AP.agg(Mode = (lambda x: x.value_counts().index[0])).reset_index().round(0))
I'm just not sure how I can get that data added to the original data so that it looks like this:
Order ID  State  County     AP   Mode
555       MA     Essex      50   50
556       MA     Essex      50   50
557       MA     Essex      75   50
558       MA     Worcester  100  100
559       MA     Worcester  100  100
560       MA     Worcester  125  100
561       CT     Bristol    150  150
562       CT     Bristol    150  150
563       CT     Bristol    175  150
564       CT     Hartford   200  200
565       CT     Hartford   200  200
566       CT     Hartford   225  200
Use GroupBy.transform for new column:
df['Mode'] = (df.groupby(['State', 'County']).AP
.transform(lambda x: x.value_counts().index[0]))
Or Series.mode:
df['Mode'] = df.groupby(['State', 'County']).AP.transform(lambda x: x.mode().iat[0])
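As a quick check, the `transform` approach can be run on the data from the question. This is a minimal sketch; note that when several values tie for most frequent, `mode()` returns all of them and `iat[0]` simply takes the first.

```python
import pandas as pd

df = pd.DataFrame({
    'Order ID': [555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566],
    'State': ["MA"] * 6 + ["CT"] * 6,
    'County': ["Essex"] * 3 + ["Worcester"] * 3 + ["Bristol"] * 3 + ["Hartford"] * 3,
    'AP': [50, 50, 75, 100, 100, 125, 150, 150, 175, 200, 200, 225],
})

# transform returns one value per original row, so the result lines up
# with df and can be assigned directly as a new column.
df['Mode'] = (df.groupby(['State', 'County'])['AP']
                .transform(lambda x: x.mode().iat[0]))
print(df)
```

Unlike `agg`, which returns one row per group, `transform` broadcasts the group result back to every row, so no merge is needed.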

How to compare two values at a specific location in a loop, and append data in a range of values in Pandas Dataframe

I have a dataframe, from where I extracted some sample data:
Time Val
0 70000 -322
1 70500 -439
2 71000 -528
3 71500 -606
4 72000 -642
5 72500 -663
6 73000 -620
7 73500 -561
8 74000 -592
9 74500 -614
10 75000 -630
11 75500 -719
12 80000 -613
13 80500 -127
14 81000 -235
15 81500 -186
16 82000 -82
17 82500 836
18 83000 1137
183 70000 -106
184 70500 -117
185 71000 -626
186 71500 -810
187 72000 -822
188 72500 -676
189 73000 -639
190 73500 -664
191 74000 -708
192 74500 -515
193 75000 -61
194 75500 -121
195 80000 -145
196 80500 -57
197 81000 -133
198 81500 101
199 82000 235
200 82500 585
201 83000 550
366 70000 18
367 70500 138
368 71000 22
369 71500 -68
370 72000 -146
371 72500 -163
372 73000 -251
373 73500 -230
374 74000 -218
375 74500 -137
376 75000 -126
Now I would like to compare the value from 'Val' at time 73000 with the value [i-3].
If the value is less, then append the continuous values to the list until Time has reached 80000.
I wrote this loop, but the problem is that it compares 'Val' with the value at [i-3] for ALL rows between 73000 and 80000. I want the comparison to happen ONLY at 73000 and, if the condition is true, write the data to the list (until Time reaches 80000).
box = []
for i in df.index:
    if df.Time[i] >= 73000 and df.Time[i] <= 80000 and df.Val[i] < df.Val[i-3]:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
How could I change the code in order to achieve this?
You need to remember the result of the comparison in another variable, and reset it whenever you encounter a time value outside your desired interval. The code would look like this.
box = []
writeToList = False
for i in df.index:
    if df.Time[i] < 73000 or df.Time[i] > 80000:
        writeToList = False
    if df.Time[i] == 73000 and df.Val[i] < df.Val[i-3]:
        writeToList = True
    if writeToList and df.Time[i] >= 73000 and df.Time[i] <= 80000:
        box.append(
            {
                'Time': df.Time[i],
                'newVAL': df.Val[i],
            }
        )
box = pd.DataFrame(box, columns=['Time', 'newVAL'])
Hope this helps.
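To see the flag logic in action, here is a small self-contained run. The frame below is hypothetical (two short "days" with a default integer index, so `i - 3` is always a valid label once Time reaches 73000); the first day satisfies the condition at 73000 and the second does not.

```python
import pandas as pd

# Hypothetical data: two days of readings with a default RangeIndex.
df = pd.DataFrame({
    "Time": [71500, 72000, 72500, 73000, 73500, 80000, 80500,
             71500, 72000, 72500, 73000, 73500, 80000, 80500],
    "Val":  [-606, -642, -663, -620, -561, -613, -127,
             -100, -120, -130, -50, -210, -220, -50],
})

rows = []
write_to_list = False
for i in df.index:
    t, v = df.Time[i], df.Val[i]
    if t < 73000 or t > 80000:
        write_to_list = False      # outside the window: reset the flag
    if t == 73000 and v < df.Val[i - 3]:
        write_to_list = True       # the comparison happens only at 73000
    if write_to_list and 73000 <= t <= 80000:
        rows.append({"Time": t, "newVAL": v})

box = pd.DataFrame(rows, columns=["Time", "newVAL"])
print(box)
```

Only the first day's rows from 73000 to 80000 end up in `box`, because the flag is never set on the second day.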

I want to compare values in a dataframe column and report the index for the value that satisfy a conditional argument?

Unnamed: 4 GDP in billions of chained 2009 dollars.1
214 2000q1 12359.1
215 2000q2 12592.5
216 2000q3 12607.7
217 2000q4 12679.3
218 2001q1 12643.3
219 2001q2 12710.3
220 2001q3 12670.1
221 2001q4 12705.3
222 2002q1 12822.3
223 2002q2 12893.0
224 2002q3 12955.8
225 2002q4 12964.0
226 2003q1 13031.2
227 2003q2 13152.1
228 2003q3 13372.4
229 2003q4 13528.7
230 2004q1 13606.5
231 2004q2 13706.2
232 2004q3 13830.8
233 2004q4 13950.4
234 2005q1 14099.1
235 2005q2 14172.7
236 2005q3 14291.8
237 2005q4 14373.4
238 2006q1 14546.1
239 2006q2 14589.6
240 2006q3 14602.6
241 2006q4 14716.9
242 2007q1 14726.0
243 2007q2 14838.7
... ... ...
250 2009q1 14375.0
251 2009q2 14355.6
252 2009q3 14402.5
253 2009q4 14541.9
254 2010q1 14604.8
255 2010q2 14745.9
256 2010q3 14845.5
257 2010q4 14939.0
258 2011q1 14881.3
259 2011q2 14989.6
260 2011q3 15021.1
261 2011q4 15190.3
262 2012q1 15291.0
263 2012q2 15362.4
264 2012q3 15380.8
265 2012q4 15384.3
266 2013q1 15491.9
267 2013q2 15521.6
268 2013q3 15641.3
269 2013q4 15793.9
270 2014q1 15747.0
271 2014q2 15900.8
272 2014q3 16094.5
273 2014q4 16186.7
274 2015q1 16269.0
275 2015q2 16374.2
276 2015q3 16454.9
277 2015q4 16490.7
278 2016q1 16525.0
279 2016q2 16583.1
I have the above dataframe. I want to compare the values in the column GDP in billions of chained 2009 dollars.1 and report the index and value of the row whose value is less than the two values above it, consecutively. I am using the following code, but I am not getting the expected result:
datan = pd.read_excel('gdplev.xls', skiprows = 5)
datan.drop(datan.iloc[0:230, 0:4], inplace = True, axis = 1)
datan = datan[214:]
datan = datan.drop(['GDP in billions of current dollars.1', 'Unnamed: 7'], axis = 1)
datan
for item in datan['GDP in billions of chained 2009 dollars.1']:
    if item > item+1 and item+1 > item+2:
        print(item+2)
Please help
I suggest the following:
# First I reproduce a DataFrame similar to yours
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"quarter" : pd.date_range("2000q1", freq="Q", periods = 10),
"gdp": np.random.rand(10)*10000})
df["quarter"] = pd.Series(df["quarter"].dt.year).astype("str") + "q" + pd.Series(df["quarter"].dt.quarter).astype("str")
# Then I create two columns that are the lags of gdp
df["gdpN_1"] = df["gdp"].shift()
df["gdpN_2"] = df["gdpN_1"].shift()
# I create a flag ("top") that is True when gdp is below gdp of the previous quarter and of the quarter before that
df["top"] = (df["gdp"] < df["gdpN_1"]) & (df["gdp"] < df["gdpN_2"])
# I only select rows for which top is True
new_df = df.loc[df["top"], ["quarter", "gdp"]]
And the result for new_df is :
quarter gdp
2 2000q3 2268.514536
5 2001q2 4231.064601
8 2002q1 4809.319015
9 2002q2 3921.175182
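The answer above flags a quarter whose GDP is below both of the two preceding quarters. If the intent is instead a strict two-quarter decline (each quarter below the one immediately before it), one possible variant chains the comparisons. This is a sketch on hypothetical values chosen only to illustrate the condition; `prev1`, `prev2` and `declining_twice` are illustrative names.

```python
import pandas as pd

# Hypothetical GDP values, not taken from gdplev.xls.
df = pd.DataFrame({
    "quarter": ["2000q1", "2000q2", "2000q3", "2000q4",
                "2001q1", "2001q2", "2001q3", "2001q4"],
    "gdp": [100.0, 105.0, 103.0, 101.0, 104.0, 102.0, 99.0, 98.0],
})

prev1 = df["gdp"].shift(1)   # value one quarter earlier
prev2 = df["gdp"].shift(2)   # value two quarters earlier

# True where GDP fell in this quarter and also fell in the previous quarter.
declining_twice = (df["gdp"] < prev1) & (prev1 < prev2)

result = df.loc[declining_twice, ["quarter", "gdp"]]
print(result)
```

Comparisons against the NaN values produced by `shift` evaluate to False, so the first two rows are never selected.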

Selecting Column from pandas Series

I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot().
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my series into x and y columns. As of right now, I can access the "Wavelength" column by using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which are passed automatically when you pass graph to it. What you are missing are the window length and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
Because you are using a Series instead of a DataFrame, some libraries cannot access the index as a column. Use:
df = df.reset_index()
This converts the index into an extra column that you can use with savgol_filter or any other function.
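Putting the pieces together, a Series can be split into x (its index) and y (its values) and smoothed as follows. This is a minimal sketch on a hypothetical noisy series, not the data from the question; the window length of 7 and polynomial order of 3 are the values used in the answer above and would need tuning for real data.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Hypothetical noisy series indexed by wavelength.
wavelength = np.arange(450, 655, 5)
rng = np.random.default_rng(0)
values = (1000 * np.exp(-((wavelength - 550) / 40.0) ** 2)
          + rng.normal(0, 50, wavelength.size))
graph = pd.Series(values, index=pd.Index(wavelength, name="Wavelength"), name="A1")

x = graph.index.to_numpy()        # the "x column" is the index
y = graph.to_numpy(dtype=float)   # the "y column" is the values, cast to float

# Smooth y and keep the original index so the result plots against wavelength.
smoothed = pd.Series(savgol_filter(y, window_length=7, polyorder=3),
                     index=graph.index, name="A1_smooth")
print(smoothed.head())
```

The explicit cast to float matters when the Series has dtype object, as in the question's output.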
