pandas indexing error with seaborn - python

I have a pandas DataFrame that I am trying to plot as a tsplot with seaborn, and I am getting duplicate-index errors. My question is twofold. First, when I look at the data there are no duplicate indexes (a sample df is given below):
BOLD campaign time type
0 2.735430 6041148 3 Default
1 1.356943 6041148 3 None
2 NaN 6041148 3 Vertical
3 9.550452 6013866 6 Default
4 1.000000 6013866 6 None
5 NaN 6013866 6 Vertical
6 322.675089 6086810 8 Default
7 1.508849 6086810 8 None
8 773.393385 6086810 8 Vertical
9 43.396084 6046619 10 Default
10 26.124405 6046619 10 None
11 NaN 6046619 10 Vertical
12 103.955111 6065909 10 Default
13 1.000000 6065909 10 None
14 NaN 6065909 10 Vertical
15 9.744664 6013866 9 Default
16 9.031970 6013866 9 None
17 NaN 6013866 9 Vertical
18 10.980322 6065742 8 Default
19 0.803821 6065742 8 None
However, when I go to plot this dataframe with
sns.tsplot(df, time="time", unit="campaign", condition="type", value="BOLD")
I get
ValueError: Index contains duplicate entries, cannot reshape
Can someone explain why this is? I do not see any duplicate entries (or indices). I also tried using drop_duplicates and got the same result.
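A likely explanation, assuming tsplot pivots the long-form data into a units-by-time grid for each condition: the duplicates the error complains about are repeated (campaign, time) pairs within a type, not duplicates in the DataFrame's own index. A quick check of the full frame for such collisions:
dupes = df[df.duplicated(subset=["campaign", "time", "type"], keep=False)]
print(dupes)
Note that drop_duplicates with no arguments only drops rows that are identical in every column (including BOLD), which would explain why it does not help here.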


Plot time series and color by column names

I have a dataset with the following [structure][1]:
At a high level, it is time series data. I want to plot this time series with a unique color for each column, which will make the transitions clearer to the viewer. The column names/labels change from one dataset to another, so I need to assign colors to the y values based on the labels present in each dataset. I am trying to decide how to do this in a scalable manner.
Sample data ->
;(1275, 51) PCell Tput Avg (kbps) (Average);(1275, 95) PCell Tput Avg (kbps) (Average);(56640, 125) PCell Tput Avg (kbps) (Average);Time Stamp
0;;;79821.1;2021-04-29 23:01:53.624
1;;;79288.3;2021-04-29 23:01:53.652
2;;;77629.2;2021-04-29 23:01:53.682
3;;;78980.3;2021-04-29 23:01:53.695
4;;;77953.4;2021-04-29 23:01:53.723
5;;;;2021-04-29 23:01:53.748
6;;75558.7;;2021-04-29 23:01:53.751
7;;73955.5;;2021-04-29 23:01:53.780
8;;73689.8;;2021-04-29 23:01:53.808
9;;74819.8;;2021-04-29 23:01:53.839
10;10000;;;2021-04-29 23:01:53.848
11;68499;;;2021-04-29 23:01:53.867
[1]: https://i.stack.imgur.com/YM2P6.png
As long as each dataframe has the datetime column labeled as 'Time Stamp', you should just be able to do this for each one:
import matplotlib.pyplot as plt
df = ...  # get df (load your data here)
df.plot.line(x = 'Time Stamp',grid=True)
plt.show()
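If the raw data still looks like the semicolon-delimited sample above, a minimal loading sketch (the file name 'data.csv' is a placeholder) might be:
import pandas as pd

# ';' as separator, the first unnamed column as the row index,
# and 'Time Stamp' parsed as datetimes, matching the sample layout
df = pd.read_csv('data.csv', sep=';', index_col=0, parse_dates=['Time Stamp'])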
Example df:
>>> df
A B C D Time Stamp
0 6 9 7 8 2018-01-01 00:00:00.000000000
1 7 3 8 6 2018-05-16 18:04:05.267114496
2 4 1 4 0 2018-09-29 12:08:10.534228992
3 1 2 5 8 2019-02-12 06:12:15.801343744
4 6 7 9 3 2019-06-28 00:16:21.068458240
5 8 5 9 9 2019-11-10 18:20:26.335572736
6 3 0 8 2 2020-03-25 12:24:31.602687232
7 8 0 0 9 2020-08-08 06:28:36.869801984
8 0 9 7 8 2020-12-22 00:32:42.136916480
9 7 8 9 2 2021-05-06 18:36:47.404030976
Resulting plot:

Iterating through rows of a dataframe to fill column in python

I have the following dataframe (called data_coh):
roi mag phase coherence
0 1 0.699883 0.0555903 NaN
1 2 0.640482 0.1053 NaN
2 3 0.477865 1.14926 NaN
3 4 0.128119 2.28403 NaN
4 5 0.563046 2.53091 NaN
5 6 0.58869 0.94647 NaN
6 7 0.428383 1.13915 NaN
7 8 0.164036 1.95959 NaN
8 9 0.27912 3.07456 NaN
9 10 0.244237 2.78111 NaN
10 11 0.696592 2.61011 NaN
11 12 0.237346 3.01836 NaN
For every row, I want to calculate its coherence value as follows (note that I want
to use the imaginary unit j):
import math
import cmath
for roin, val in enumerate(data_coh):
data_coh.loc[roin,'coherence'] = mag*math.cos(phase) + mag*math.sin(phase)*j
First of all, it is not able to perform the computation (which is calculating a complex number from a magnitude and a phase); j here is meant to be the imaginary unit (from cmath). But in addition, even when j is left out, the allocation to the rows is not done correctly. Why is that, and how can it be corrected?
No need to iterate or to import math or cmath; just pandas and numpy:
import pandas as pd
import numpy as np
df['coherence'] = df['mag'] * (np.cos(df['phase']) + 1j*np.sin(df['phase']))
# Result
df
roi mag phase coherence
0 1 0.699883 0.05559 0.698802+0.038887j
1 2 0.640482 0.10530 0.636934+0.067318j
2 3 0.477865 1.14926 0.195525+0.436033j
3 4 0.128119 2.28403 -0.083826+0.096890j
4 5 0.563046 2.53091 -0.461279+0.322866j
5 6 0.588690 0.94647 0.344119+0.477638j
6 7 0.428383 1.13915 0.179221+0.389091j
7 8 0.164036 1.95959 -0.062182+0.151794j
8 9 0.279120 3.07456 -0.278493+0.018696j
9 10 0.244237 2.78111 -0.228539+0.086149j
10 11 0.696592 2.61011 -0.600502+0.353041j
11 12 0.237346 3.01836 -0.235546+0.029175j
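Since cos(phase) + 1j*sin(phase) equals exp(1j*phase) by Euler's formula, the polar-to-complex conversion can equivalently be written as a single call:
df['coherence'] = df['mag'] * np.exp(1j * df['phase'])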

Weighted Means for columns in Pandas DataFrame including Nan

I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame, with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable call for ordinary (unweighted) means would be
df.mean()
Notice the df has NaN in the columns and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method; it's driving me nuts that I don't see it.
I would appreciate any help.
You could use masked arrays, after first dropping the rows where the Value column has NaN values:
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
     ...:     np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
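A pandas-only alternative, as a minimal sketch (weighted_mean is a hypothetical helper, not part of pandas): mask out the rows where either the value or its weight is missing, then average column by column.
import numpy as np
import pandas as pd

def weighted_mean(col, weights):
    # keep only rows where both the value and its weight are present
    mask = col.notna() & weights.notna()
    return np.average(col[mask], weights=weights[mask])

df.drop(columns='Value').apply(weighted_mean, weights=df['Value'])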

Line chart in matplotlib with a double axis (strings on the axis)

I am trying to create a chart using python from data in an Excel sheet. The data looks like this:
Location Values
Trial 1 Edge 12
M-2 13
Center 14
M-4 15
M-5 12
Top 13
Trial 2 Edge 10
N-2 11
Center 11
N-4 12
N-5 13
Top 14
Trial 3 Edge 15
R-2 13
Center 12
R-4 11
R-5 10
Top 3
I want my graph to look like this:
Chart-1
The chart should have the Location column values on the X-axis, i.e., string objects. This can be done easily (by using/creating Location as an array):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
datalink=('/Users/Maxwell/Desktop/W1.xlsx')
df=pd.read_excel(datalink,skiprows=2)
x1=df.loc[:,['Location']]
x2=df.loc[:,['Values']]
x3=np.linspace(1,len(x2),num=len(x2),endpoint=True)
vals=['Location','Edge','M-2','Center','M-4','M-5','Top','Edge','N-2','Center','N-4','N-5','Top','Edge','R-2']
plt.figure(figsize=(12,8),dpi=300)
plt.subplot(1,1,1)
plt.xticks(x3,vals)
plt.plot(x3,x2)
plt.show()
But I also want to show Trial-1, Trial-2, ... on the X-axis. Up to now I have been using Excel to generate the chart, but I have a lot of similar data and want to use python to automate the task.
With your Excel sheet that has the data laid out as above, you can use matplotlib to create the plot you wanted. It is not straightforward, but it can be done. See below:
EDIT: earlier I suggested factorplot, but it is not applicable because your location values for each trial are not constant.
df = pd.read_excel(r'test_data.xlsx', header = 1, parse_cols = "D:F",
names = ['Trial', 'Location', 'Values'])
'''
Trial Location Values
0 Trial 1 Edge 12
1 NaN M-2 13
2 NaN Center 14
3 NaN M-4 15
4 NaN M-5 12
5 NaN Top 13
6 Trial 2 Edge 10
7 NaN N-2 11
8 NaN Center 11
9 NaN N-4 12
10 NaN N-5 13
11 NaN Top 14
12 Trial 3 Edge 15
13 NaN R-2 13
14 NaN Center 12
15 NaN R-4 11
16 NaN R-5 10
17 NaN Top 3
'''
# this will replace the nan with corresponding trial number for each set of trials
df = df.fillna(method = 'ffill')
'''
Trial Location Values
0 Trial 1 Edge 12
1 Trial 1 M-2 13
2 Trial 1 Center 14
3 Trial 1 M-4 15
4 Trial 1 M-5 12
5 Trial 1 Top 13
6 Trial 2 Edge 10
7 Trial 2 N-2 11
8 Trial 2 Center 11
9 Trial 2 N-4 12
10 Trial 2 N-5 13
11 Trial 2 Top 14
12 Trial 3 Edge 15
13 Trial 3 R-2 13
14 Trial 3 Center 12
15 Trial 3 R-4 11
16 Trial 3 R-5 10
17 Trial 3 Top 3
'''
from matplotlib import rcParams
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
rcParams.update({'font.size': 10})
f, ax1 = plt.subplots(1, figsize = (10,3))
ax1.plot(list(df.Location.index), df['Values'],'o-')
ax1.set_xticks(list(df.Location.index))
ax1.set_xticklabels(df.Location, rotation=90 )
ax1.yaxis.set_label_text("Values")
# create a secondary axis
ax2 = ax1.twiny()
# hide all the spines that we dont need
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
pos1 = ax2.get_position() # get the original position
pos2 = [pos1.x0 + 0, pos1.y0 -0.2, pos1.width , pos1.height ] # create a new position by offseting it
ax2.xaxis.set_ticks_position('bottom')
ax2.set_position(pos2) # set a new position
trials_ticks = 1.0 * df.Trial.value_counts().cumsum()/ (len(df.Trial)) # create a series object for ticks for each trial group
trials_ticks_positions = [0]+list(trials_ticks) # add a additional zero. this will make tick at zero.
trials_labels_offset = 0.5 * df.Trial.value_counts()/ (len(df.Trial)) # create an offset for the tick label, we want the tick label to between ticks
trials_label_positions = trials_ticks - trials_labels_offset # create the position of tick labels
# set the ticks and ticks labels
ax2.set_xticks(trials_ticks_positions)
ax2.xaxis.set_major_formatter(ticker.NullFormatter())
ax2.xaxis.set_minor_locator(ticker.FixedLocator(list(trials_label_positions)))
ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(list(trials_label_positions.index)))
ax2.tick_params(axis='x', length = 10,width = 1)
plt.show()
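To summarize the trick: ax2 is a twin of the main x-axis with every spine hidden and its position shifted below the plot. Its major ticks sit at the cumulative trial boundaries and are given a NullFormatter so they only draw the separators, while the minor ticks placed midway between the boundaries carry the 'Trial 1', 'Trial 2', 'Trial 3' labels.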
results in

Change value if consecutive number of certain condition is achieved in Pandas

I would like to change the values of certain DataFrame entries, but only if a certain condition is met by at most n consecutive entries.
Example:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40
Which gives me this DF:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 -1.045209
4 40.000000 0.598657 -1.268399
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 1.744822
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 1.416020
10 -1.337494 -40.000000 -1.195780
11 -0.703669 -40.000000 0.657519
12 -40.000000 -0.288235 -0.840145
13 -1.084869 -0.298030 -1.592004
14 -0.617568 -1.046210 -0.531523
Now, if I do
a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')
I get
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 NaN 0.598657 NaN
5 NaN 0.442297 -0.016363
6 NaN -0.316817 NaN
7 NaN 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
which is fair. However, I would like to replace values with NaN only if the condition is met by at most 2 consecutive entries (so I can interpolate over them later). For example, I want the result to be
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
Apparently there's no ready-to-use method to do this. The solution I found that most closely resembles my problem was this one, but I couldn't make it work for me.
Any ideas?
See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum(), which breaks the data into contiguous runs of the same value.
In [23]: cond = abs(df) > abs(df.std())
In [24]: for c in df.columns:
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     fill = (df.groupby(grouper)[c].transform('size') <= 2) & cond[c]
    ...:     df.loc[fill, c] = np.nan
In [25]: df
Out[25]:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.
The cond[c] != cond[c].shift(1) compares the current row's condition to the previous row's. This has the effect of 'marking' with True each row where a new run of values begins.
The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally, the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0).
Now that you have groups of contiguous numbers that meet your condition, the next step performs a groupby to count how many values are in each group (transform('size')).
Finally, a new boolean condition is used to assign missing values to those groups with 2 or fewer values meeting the condition.
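To see the grouper in action on a toy boolean series (hypothetical values, not taken from the DataFrame above):
import pandas as pd

cond = pd.Series([False, True, True, False, True, True, True])
runs = (cond != cond.shift(1)).cumsum()
# runs:    1 2 2 3 4 4 4   -> numbers each contiguous run
grouper = runs * cond
# grouper: 0 2 2 0 4 4 4   -> every False run collapses into group 0
Group 0 collects every row where the condition is False, which is why the fill mask is also intersected with cond[c]; otherwise a short group 0 would be blanked out as well.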
