How to Change the Structure of a Pandas DataFrame in Python? - python

I have a Pandas DataFrame in the format below (see Current DataFrame), but I want to change its structure to look like the Desired DataFrame below. The top row of headers holds longitudes and the first column holds latitudes.
Current DataFrame:
       E0    E1    E2    E3    E4
LAT
89   0.01  0.01  0.02  0.01  0.00
88   0.01  0.00  0.00  0.01  0.00
87   0.00  0.02  0.01  0.02  0.01
86   0.02  0.00  0.03  0.02  0.00
85   0.00  0.00  0.00  0.01  0.03
Code to build it:
import pandas as pd

df = pd.DataFrame({
    'LAT': [89, 88, 87, 86, 85],
    'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
    'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
    'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
    'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
    'E4': [0.0, 0.0, 0.01, 0.0, 0.03]
}).set_index('LAT')
Desired DataFrame:
LAT LON R
89 0 0.01
89 1 0.01
89 2 0.02
89 3 0.01
89 4 0.00
88 0 0.01
88 1 0.00
88 2 0.00
88 3 0.01
88 4 0.00
87 0 0.00
87 1 0.02
87 2 0.01
87 3 0.02
87 4 0.01
86 0 0.02
86 1 0.00
86 2 0.03
86 3 0.02
86 4 0.00
85 0 0.00
85 1 0.00
85 2 0.00
85 3 0.01
85 4 0.03

Try with stack + str.extract:
new_df = (
    df.stack()
      .reset_index(name='R')
      .rename(columns={'level_1': 'LON'})
)
new_df['LON'] = new_df['LON'].str.extract(r'(\d+$)').astype(int)
Or with pd.wide_to_long + reindex:
new_df = df.reset_index()
new_df = (
    pd.wide_to_long(new_df, stubnames='E', i='LAT', j='LON')
      .reindex(new_df['LAT'], level=0)
      .rename(columns={'E': 'R'})
      .reset_index()
)
new_df:
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03

You could solve it with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index=None,
                names_to='LON',
                values_to='R',
                names_pattern=r".(.)",
                sort_by_appearance=True,
                ignore_index=False).reset_index()
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
Here we are only interested in the numbers at the end of the column names - we extract them by passing a regular expression to names_pattern.
You can avoid pyjanitor altogether by using melt and rename:
(df.rename(columns=lambda col: col[-1])
   .melt(var_name='LON', value_name='R', ignore_index=False)
)
LON R
LAT
89 0 0.01
88 0 0.01
87 0 0.00
86 0 0.02
85 0 0.00
89 1 0.01
88 1 0.00
87 1 0.02
86 1 0.00
85 1 0.00
89 2 0.02
88 2 0.00
87 2 0.01
86 2 0.03
85 2 0.00
89 3 0.01
88 3 0.01
87 3 0.02
86 3 0.02
85 3 0.01
89 4 0.00
88 4 0.00
87 4 0.01
86 4 0.00
85 4 0.03
Note that melt stacks the frame one column at a time, so the rows come out grouped by LON rather than by LAT; chain .sort_index(ascending=False, kind='mergesort') if you need the LAT grouping of the desired output.

Another approach with pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['E'], i='LAT', j='LON').reset_index().sort_values(by=['LAT', 'LON'])
LAT LON E
4 85 0 0.00
9 85 1 0.00
14 85 2 0.00
19 85 3 0.01
24 85 4 0.03
3 86 0 0.02
8 86 1 0.00
13 86 2 0.03
18 86 3 0.02
23 86 4 0.00
2 87 0 0.00
7 87 1 0.02
12 87 2 0.01
17 87 3 0.02
22 87 4 0.01
1 88 0 0.01
6 88 1 0.00
11 88 2 0.00
16 88 3 0.01
21 88 4 0.00
0 89 0 0.01
5 89 1 0.01
10 89 2 0.02
15 89 3 0.01
20 89 4 0.00

Quick and dirty.
Pair your LAT with your LON in a list of tuples:
[
    (89.0, 0.01),
    (89.1, 0.01),
    (89.2, 0.02)
]
I'm sure someone can break down a way to organize it like you want... but from what I know you need a unique ID data point for most data in a query structure.
OR:
If you aren't putting this back into a db, then maybe you can use a dict, something like this:
{'89': {'0': 0.01,
        '1': 0.01,
        '2': 0.02, ...
       }
}
You can then get the data with:
dpoint = data['89']['0']
assert dpoint == 0.01  # True
dpoint = data['89']['2']
assert dpoint == 0.02  # True
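For what it's worth, that nested dict can be produced directly from the question's frame with DataFrame.to_dict; a minimal sketch (the trailing-digit rename mirrors the other answers):

```python
import pandas as pd

df = pd.DataFrame({
    'LAT': [89, 88, 87, 86, 85],
    'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
    'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
    'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
    'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
    'E4': [0.0, 0.0, 0.01, 0.0, 0.03],
}).set_index('LAT')

# keep only the trailing digit of each column name, then emit
# one inner dict per LAT value: {89: {'0': 0.01, ...}, ...}
data = df.rename(columns=lambda c: c[-1]).to_dict(orient='index')
```

The outer keys come out as the (integer) index values rather than strings, but the lookups work the same way, e.g. data[89]['0'].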

Related

Selecting options strike from Options Chain in python from NSE website

Output
CALL OI CALL CHNG OI CALL LTP STRIKE PRICE PUT OI PUT CHNG OI PUT LTP
0 1 0 1685.00 16100 36668 17505 0.90
1 0 0 0.00 16150 2110 678 1.05
2 0 0 0.00 16200 8381 3731 1.15
3 0 0 0.00 16250 219 24 1.00
4 0 0 0.00 16300 5573 791 1.05
.. ... ... ... ... ... ... ...
67 199 111 0.75 19450 0 0 0.00
68 42219 4957 0.65 19500 23 -1 1632.05
69 489 373 0.80 19550 0 0 0.00
70 8104 3463 0.60 19600 0 0 0.00
71 6923 1894 0.65 19650 0 0 0.00
[72 rows x 7 columns]
Result of optionchain:
Empty DataFrame
Columns: [CALL OI, CALL CHNG OI, CALL LTP, STRIKE PRICE, PUT OI, PUT CHNG OI, PUT LTP]
Index: []
Problem: not able to get the row data for a specific strike.
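No filtering code is shown, but one common cause of an empty result with scraped NSE data is that STRIKE PRICE arrives as text, so comparing it with an integer matches nothing. A hypothetical sketch of the failure and the fix (the column names follow the output above; the values beyond the first row are made up):

```python
import pandas as pd

# hypothetical miniature option chain; STRIKE PRICE scraped as strings
oc = pd.DataFrame({
    'STRIKE PRICE': ['16100', '16150', '16200'],
    'CALL LTP': [1685.00, 0.00, 0.00],
})

# comparing a string column with an int matches no rows -> empty DataFrame
before = oc[oc['STRIKE PRICE'] == 16100]

# convert the column to numbers first, then filter
oc['STRIKE PRICE'] = pd.to_numeric(oc['STRIKE PRICE'])
row = oc[oc['STRIKE PRICE'] == 16100]
```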

Sum rows where index steps is not bigger than 1. Pandas Python

I have foot sensor data and I want to calculate the Std of the swing times.
The dataframe looks like this:
Time Force
83 0.83 80
84 0.84 60
85 0.85 40
86 0.86 20
87 0.87 0
88 0.88 0
89 0.89 20
90 0.90 40
91 0.91 60
92 0.92 40
93 0.93 0
94 0.94 0
95 0.95 0
96 0.96 20
So to get the times for when the force ==0, I did:
df[(df['Force']==0)]
Resulting in:
Time Force
87 0.87 0
88 0.88 0
93 0.93 0
94 0.94 0
95 0.95 0
Now I want to sum the Time per swing:
swing 1 = index 87 + 88, swing 2 = index 93 + 94 + 95
How can I achieve this? How can I sum the rows where the index step is not bigger than 1?
(Imagine I have thousands of rows to sum.)
I tried complicated loops like:
swing_durations = []
start = []
start.append(0)
swings_left = swing_times_left.reset_index(drop=True)
for subject in swings_left[['filename']]:
    i = 1
    for time in swings_left['Time'][1:-1]:
        j = i - 1
        k = swings_left.where(swings_left['Time'].loc[i] - swings_left['Time'].loc[j] > 0.01)
        if k == True:
            start.append(time)
            swing_durations.append(swings_left[['Time']].loc[j] - start[j])
        i = i + 1
    totalswingtime_l['filename' == subject]['Variance'] = swing_durations.std()
resulting in an error
Thanks for the help!
A solution is to create an ID for each group of consecutive 0s; that is what (df.Force.shift() != df.Force).cumsum() does.
Afterwards you keep only the groups containing 0s, using np.where.
In [83]: df["swing_id"] = np.where(df.Force==0, (df.Force.shift()!=(df.Force)).cumsum(),np.nan)
...: df
Out[83]:
Time Force swing_id
0 0.83 80 NaN
1 0.84 60 NaN
2 0.85 40 NaN
3 0.86 20 NaN
4 0.87 0 5.0
5 0.88 0 5.0
6 0.89 20 NaN
7 0.90 40 NaN
8 0.91 60 NaN
9 0.92 40 NaN
10 0.93 0 10.0
11 0.94 0 10.0
12 0.95 0 10.0
13 0.96 20 NaN
In [84]: df.groupby("swing_id")["Time"].sum()
Out[84]:
swing_id
5.0 1.75
10.0 2.82
Name: Time, dtype: float64
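Building on that grouping, the std of swing times the question ultimately asks for might be computed like this; a sketch that assumes the 0.01 s sampling step visible in the Time column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time':  [0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89,
              0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96],
    'Force': [80, 60, 40, 20, 0, 0, 20, 40, 60, 40, 0, 0, 0, 20],
})

# label each run of consecutive zero-force samples
df['swing_id'] = np.where(df.Force == 0,
                          (df.Force.shift() != df.Force).cumsum(),
                          np.nan)

# swing duration = last time - first time + one sampling step
durations = df.groupby('swing_id')['Time'].agg(
    lambda s: s.iloc[-1] - s.iloc[0] + 0.01)
swing_std = durations.std()
```

Summing the raw Time values (as above) mixes timestamps into the result; measuring last-minus-first per group gives durations you can feed to .std() directly.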

pandas.Series.drop_duplicates for picking a single value from a subpart of a column

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index A B
0 0.00 0.00
1 0.00 0.00
29 0.50 105.00
36 0.80 167.00
37 0.80 167.00
42 1.00 209.00
44 0.50 105.00
45 0.50 105.00
46 0.50 105.00
50 0.00 0.00
51 0.00 0.00
52 0.00 0.00
53 0.00 0.00
When I use:
df.drop_duplicates(subset=['A'], keep='last')
I get:
Index A B
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Which makes sense, that's what the function does. However, what I actually would like to achieve is something like:
Index A B
1 0.00 0.00
29 0.50 105.00
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Basically, from each consecutive run in column A ((0.00, 0.00), (0.80, 0.80), etc.), I want to pick the last value.
It is also important that the values in column A stay in the order 0; 0.5; 0.8; 1; 0.5; 0 and do not get mixed.
Compare each value with the next one, using Series.ne with Series.shift(-1), and filter by boolean indexing:
df1 = df[df['A'].ne(df['A'].shift(-1))]
print (df1)
A B
Index
1 0.0 0.0
29 0.5 105.0
37 0.8 167.0
42 1.0 209.0
46 0.5 105.0
53 0.0 0.0
Details:
print (df['A'].ne(df['A'].shift(-1)))
Index
0 False
1 True
29 True
36 False
37 True
42 True
44 False
45 False
46 True
50 False
51 False
52 False
53 True
Name: A, dtype: bool
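An equivalent way to express "last row of each consecutive run", if you prefer an explicit group label (a cumulative sum over the same ne/shift change points), might be:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [0.0, 0.0, 0.5, 0.8, 0.8, 1.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
     'B': [0.0, 0.0, 105.0, 167.0, 167.0, 209.0, 105.0, 105.0, 105.0,
           0.0, 0.0, 0.0, 0.0]},
    index=[0, 1, 29, 36, 37, 42, 44, 45, 46, 50, 51, 52, 53])

# label each run of equal consecutive values, then keep each run's last row
run_id = df['A'].ne(df['A'].shift()).cumsum()
df1 = df.groupby(run_id).tail(1)
```

groupby(...).tail(1) preserves the original row order and index, which is exactly the requirement that the runs not get mixed.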

Error: list indices must be integers, not Series

I have grabbed a column from a pandas data frame and made a list from the rows.
If I print the first two values of the list, I get this output:
print dfList[0]
print dfList[1]
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
65 1.00
66 1.00
67 1.00
68 1.00
69 1.00
70 1.00
71 0.55
72 0.00
73 0.39
74 0.51
75 0.70
76 0.83
77 0.87
78 0.85
79 0.53
...
126 0.83
127 0.83
128 0.83
129 0.71
130 0.26
131 0.11
132 0.00
133 0.00
134 0.50
135 1.00
136 1.00
137 0.59
138 0.59
139 0.59
140 1.00
Name: 9, Length: 76, dtype: float64
When I try to iterate over the list with a for loop, I get an error message:
for i in dfList:
    print dfList[i]
Traceback (most recent call last):
File "./windows.py", line 84, in <module>
print dfList[i]
TypeError: list indices must be integers, not Series
If I write my code as:
for i in dfList:
    print i
I get the correct output:
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
...
...
Name: 9, Length: 108, dtype: float64
507919 0.00
507920 0.83
507921 1.00
507922 1.00
507923 0.46
507924 0.83
507925 1.00
507926 1.00
507927 1.00
507928 1.00
507929 1.00
507930 1.00
507931 1.00
507932 1.00
507933 1.00
...
508216 1
508217 1
508218 1
508219 1
508220 1
508221 1
508222 1
508223 1
508224 1
508225 1
508226 1
508227 1
508228 1
508229 1
508230 1
Name: 9, Length: 312, dtype: float64
I do not know why this happens.
I ultimately want to iterate through the list for 5 consecutive "windows" and calculate their means and put these means in a list.
In Python, when you write for i in dfList, i is the element itself, not the index, so you need print i, not print dfList[i].
Either of the following two options is fine:
for i in dfList:
    print i

for i in range(len(dfList)):
    print dfList[i]
The first is more Pythonic and elegant, unless you need the index.
Edit
As jwilner suggested you can also do:
for i, element in enumerate(dfList):
    print i, element
Here i is the index and element is dfList[i].
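For the end goal mentioned in the question (means over 5 consecutive windows), a minimal Python 3 sketch, with made-up values standing in for one Series of dfList:

```python
# made-up values standing in for one entry of dfList
values = [0.0, 0.11, 0.84, 1.0, 0.27, 0.52, 0.55, 0.92, 0.0, 0.4]

window = 5
# mean of each non-overlapping block of `window` consecutive values
means = [sum(values[i:i + window]) / window
         for i in range(0, len(values), window)]
```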

indexing a pandas DataFrame

I have a Multindex DataFrame with the following structure:
0 1 2 ref
A B
21 45 0.01 0.56 0.23 0.02
22 45 0.30 0.88 0.53 0.87
23 46 0.45 0.23 0.90 0.23
What I want to do with it is:
From columns 0-2, choose the value closest to column 'ref'; the expected result would be:
closest
A B
21 45 0.01
22 45 0.88
23 46 0.23
Reconstructing your DataFrame:
In [1]: index = pd.MultiIndex.from_tuples(list(zip([21, 22, 23], [45, 45, 46])), names=['A', 'B'])
In [2]: df = pd.DataFrame({0: [0.01, 0.30, 0.45],
   ...:                    1: [0.56, 0.88, 0.23],
   ...:                    2: [0.23, 0.53, 0.90],
   ...:                    'ref': [0.02, 0.87, 0.23]}, index=index)
In [3]: df
Out[3]:
0 1 2 ref
A B
21 45 0.01 0.56 0.23 0.02
22 45 0.30 0.88 0.53 0.87
23 46 0.45 0.23 0.90 0.23
I would first get the absolute distance of columns 0, 1 and 2 from ref:
In [4]: dist = df[[0,1,2]].sub(df['ref'], axis=0).apply(np.abs)
In [5]: dist
Out[5]:
0 1 2
A B
21 45 0.01 0.54 0.21
22 45 0.57 0.01 0.34
23 46 0.22 0.00 0.67
With dist in hand, you can determine the column holding the minimum value in each row using DataFrame.idxmin:
In [5]: idx = dist.idxmin(axis=1)
In [5]: idx
Out[5]:
A B
21 45 0
22 45 1
23 46 1
To now generate your new closest column, you simply need to use idx to index df:
In [6]: df['closest'] = idx.index.map(lambda x: df.loc[x][idx.loc[x]])
In [7]: df
Out[7]:
0 1 2 ref closest
A B
21 45 0.01 0.56 0.23 0.02 0.01
22 45 0.30 0.88 0.53 0.87 0.88
23 46 0.45 0.23 0.90 0.23 0.23
For the last step, there might be a more elegant way to do it but I'm relatively new to Pandas and that's the best I can think of right now.
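For what it's worth, on current pandas the last step can be vectorized without a per-row map; a sketch rebuilding the same frame:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([(21, 45), (22, 45), (23, 46)],
                                  names=['A', 'B'])
df = pd.DataFrame({0: [0.01, 0.30, 0.45],
                   1: [0.56, 0.88, 0.23],
                   2: [0.23, 0.53, 0.90],
                   'ref': [0.02, 0.87, 0.23]}, index=index)

# per-row absolute distance to 'ref', then the label of the nearest column
dist = df[[0, 1, 2]].sub(df['ref'], axis=0).abs()
nearest = dist.idxmin(axis=1)

# pick one value per row from the underlying array
pos = dist.columns.get_indexer(nearest)
df['closest'] = df[[0, 1, 2]].to_numpy()[np.arange(len(df)), pos]
```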
