Iterating through rows of a dataframe to fill a column in Python

I have the following dataframe (called data_coh):
roi mag phase coherence
0 1 0.699883 0.0555903 NaN
1 2 0.640482 0.1053 NaN
2 3 0.477865 1.14926 NaN
3 4 0.128119 2.28403 NaN
4 5 0.563046 2.53091 NaN
5 6 0.58869 0.94647 NaN
6 7 0.428383 1.13915 NaN
7 8 0.164036 1.95959 NaN
8 9 0.27912 3.07456 NaN
9 10 0.244237 2.78111 NaN
10 11 0.696592 2.61011 NaN
11 12 0.237346 3.01836 NaN
For every row, I want to calculate its coherence value as follows (note that I want
to use the imaginary unit j):
import math
import cmath
for roin, val in enumerate(data_coh):
    data_coh.loc[roin,'coherence'] = mag*math.cos(phase) + mag*math.sin(phase)*j
First of all, it is not able to perform the computation, which is building a complex number from magnitude and phase, with j as the imaginary unit (which I thought cmath provides).
But in addition, even when j is left out, the values are not assigned to the rows correctly. Why is that, and how can it be corrected?

No need to iterate or to import math or cmath, just pandas and numpy. (Your loop fails because iterating over a DataFrame yields its column names, not its rows, so mag and phase are never defined; and in Python the imaginary unit is written 1j, not j.)
import pandas as pd
import numpy as np
df['coherence'] = df['mag'] * (np.cos(df['phase']) + 1j*np.sin(df['phase']))
# Result
df
roi mag phase coherence
0 1 0.699883 0.05559 0.698802+0.038887j
1 2 0.640482 0.10530 0.636934+0.067318j
2 3 0.477865 1.14926 0.195525+0.436033j
3 4 0.128119 2.28403 -0.083826+0.096890j
4 5 0.563046 2.53091 -0.461279+0.322866j
5 6 0.588690 0.94647 0.344119+0.477638j
6 7 0.428383 1.13915 0.179221+0.389091j
7 8 0.164036 1.95959 -0.062182+0.151793j
8 9 0.279120 3.07456 -0.278493+0.018696j
9 10 0.244237 2.78111 -0.228539+0.086149j
10 11 0.696592 2.61011 -0.600502+0.353041j
11 12 0.237346 3.01836 -0.235546+0.029175j
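Equivalently, since mag*cos(phase) + 1j*mag*sin(phase) is just the polar-to-rectangular conversion, NumPy's complex exponential gives the same column in one step. A minimal sketch on a two-row stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mag': [0.699883, 0.477865], 'phase': [0.05559, 1.14926]})
# mag * e**(1j*phase) == mag*cos(phase) + 1j*mag*sin(phase)
df['coherence'] = df['mag'] * np.exp(1j * df['phase'])
```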


Shift Dataframe and filling up NaN

I want to create a DataFrame which includes the hourly heat demand of n consumers (here n=5) for 10 hours.
--> DataFrame called "Village", n columns (each representing a consumer) and 10 rows (10 hours)
All consumers follow the same demand profile, the only difference being that it is shifted by a random number of hours. The random number follows a normal distribution.
I managed to create a list of discrete numbers that follow a normal distribution, and I managed to create a DataFrame with n columns where the same demand profile gets shifted by that random number.
The problem I can't solve is that NaN appears instead of filling up the shifted hours with the values that were cut off because of the shift.
Example: if the demand profile gets shifted by 1 hour (like consumer 5, for example), "NaN" appears as the demand in the first hour. Instead of "NaN" I would like the value of the 10th hour of the original demand profile to appear (4755.005240). So instead of shifting the values of the demand profile, I kind of want to "rotate" them.
heat_demand
0 1896.107462
1 1964.878199
2 2072.946499
3 2397.151402
4 3340.292937
5 4912.195496
6 6159.893152
7 5649.024821
8 5157.805271
9 4755.005240
Consumer 1 Consumer 2 Consumer 3 Consumer 4 Consumer 5
0 1896.107462 NaN 1964.878199 NaN NaN
1 1964.878199 NaN 2072.946499 NaN 1896.107462
2 2072.946499 NaN 2397.151402 NaN 1964.878199
3 2397.151402 1896.107462 3340.292937 1896.107462 2072.946499
4 3340.292937 1964.878199 4912.195496 1964.878199 2397.151402
5 4912.195496 2072.946499 6159.893152 2072.946499 3340.292937
6 6159.893152 2397.151402 5649.024821 2397.151402 4912.195496
7 5649.024821 3340.292937 5157.805271 3340.292937 6159.893152
8 5157.805271 4912.195496 4755.005240 4912.195496 5649.024821
9 4755.005240 6159.893152 NaN 6159.893152 5157.805271
Could someone maybe give me a hint how to solve that problem? Thanks a lot already in advance and kind regards
Luise
import numpy as np
import pandas as pd
import os

path = os.path.dirname(os.path.abspath(__file__))

# Create a list with discrete numbers following a normal distribution
n = 5
timeshift_1h = np.random.normal(loc=0.1085, scale=1.43825, size=n)
timeshift_1h = np.round(timeshift_1h).astype(int)
print("Time Shift in h:", timeshift_1h)

# Read the Standard Load Profile
cols = ["heat_demand"]
df_StandardLoadProfile = pd.read_excel(os.path.join(path, '10_h_example.xlsx'), usecols=cols)
print(df_StandardLoadProfile)

# Create a df for n consumers, whose demand equals a shifted StandardLoadProfile.
# It is shifted by a random amount of hours, taken from the list timeshift_1h
list_consumers = list(range(1, n+1))
Village = pd.DataFrame()
for i in list_consumers:
    a = timeshift_1h[i-1]
    name = "Consumer {}".format(i)
    Village[name] = df_StandardLoadProfile.shift(a)
print(Village)
There's a very nice numpy function for that use case, namely np.roll (see the documentation). It takes an array and shifts it by the number of steps specified with shift.
For your example, this could look like the following:
import pandas as pd
import numpy as np
df = pd.read_csv("demand.csv")
df['Consumer 1'] = np.roll(df["heat_demand"], shift=1)
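Putting that together for all n consumers, the loop from the question only needs its shift call swapped for np.roll. A sketch with a stand-in profile in place of the Excel file:

```python
import numpy as np
import pandas as pd

# Stand-in for the profile read from 10_h_example.xlsx
df_StandardLoadProfile = pd.DataFrame({'heat_demand': np.arange(10, dtype=float)})

n = 5
timeshift_1h = np.round(np.random.normal(loc=0.1085, scale=1.43825, size=n)).astype(int)

Village = pd.DataFrame()
for i, shift in enumerate(timeshift_1h, start=1):
    # np.roll wraps the shifted-out values around instead of leaving NaN
    Village["Consumer {}".format(i)] = np.roll(df_StandardLoadProfile['heat_demand'].to_numpy(), shift)
```

Because the values wrap around, every column is a rotation of the original profile and no NaN ever appears.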
You could also fill the NaN values from the reversed column:
df = pd.DataFrame(np.arange(10))
df
# 0
#0 0
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
df[0].shift(3).fillna(pd.Series(reversed(df[0])))
#0 9.0
#1 8.0
#2 7.0
#3 0.0
#4 1.0
#5 2.0
#6 3.0
#7 4.0
#8 5.0
#9 6.0

Plot time series and color by column names

I have a dataset with the following [structure][1] -
On a high level it is a time series data. I want to plot this time series data and have a unique color for each column. This will enable me to show the transitions better to the viewer. The column names/labels change from one data set to another. That means I need to create colors for the y value based on labels present in each dataset. I am trying to decide how to do this in a scalable manner.
Sample data ->
;(1275, 51) PCell Tput Avg (kbps) (Average);(1275, 95) PCell Tput Avg (kbps) (Average);(56640, 125) PCell Tput Avg (kbps) (Average);Time Stamp
0;;;79821.1;2021-04-29 23:01:53.624
1;;;79288.3;2021-04-29 23:01:53.652
2;;;77629.2;2021-04-29 23:01:53.682
3;;;78980.3;2021-04-29 23:01:53.695
4;;;77953.4;2021-04-29 23:01:53.723
5;;;;2021-04-29 23:01:53.748
6;;75558.7;;2021-04-29 23:01:53.751
7;;73955.5;;2021-04-29 23:01:53.780
8;;73689.8;;2021-04-29 23:01:53.808
9;;74819.8;;2021-04-29 23:01:53.839
10;10000;;;2021-04-29 23:01:53.848
11;68499;;;2021-04-29 23:01:53.867
[1]: https://i.stack.imgur.com/YM2P6.png
As long as each dataframe has the datetime column labeled as 'Time Stamp', you should just be able to do this for each one:
import matplotlib.pyplot as plt
df = ...  # load your DataFrame here
df.plot.line(x = 'Time Stamp',grid=True)
plt.show()
Example df:
>>> df
A B C D Time Stamp
0 6 9 7 8 2018-01-01 00:00:00.000000000
1 7 3 8 6 2018-05-16 18:04:05.267114496
2 4 1 4 0 2018-09-29 12:08:10.534228992
3 1 2 5 8 2019-02-12 06:12:15.801343744
4 6 7 9 3 2019-06-28 00:16:21.068458240
5 8 5 9 9 2019-11-10 18:20:26.335572736
6 3 0 8 2 2020-03-25 12:24:31.602687232
7 8 0 0 9 2020-08-08 06:28:36.869801984
8 0 9 7 8 2020-12-22 00:32:42.136916480
9 7 8 9 2 2021-05-06 18:36:47.404030976
Resulting plot:
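For the semicolon-separated sample in the question, the same one-liner works once the file is read with sep=';' and the timestamps parsed. A sketch with the data inlined and the long column names shortened to hypothetical names for readability:

```python
import io
import pandas as pd

raw = """;tput_a;tput_b;Time Stamp
0;;79821.1;2021-04-29 23:01:53.624
1;75558.7;;2021-04-29 23:01:53.751
2;68499;;2021-04-29 23:01:53.867
"""
df = pd.read_csv(io.StringIO(raw), sep=';', index_col=0, parse_dates=['Time Stamp'])
# df.plot.line(x='Time Stamp', grid=True)  # one line (and color) per value column
```

Since the column names change between datasets, letting df.plot.line pick up whatever columns are present keeps the approach scalable.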

Weighted Means for columns in Pandas DataFrame including Nan

I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame, with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable operation for ordinary (unweighted) means would be
df.mean()
Notice the df has Nan in the columns and "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays. We first drop the rows where the Value column has NaN.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
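If you'd rather stay in plain pandas, the same per-column weighted mean falls out of zeroing the weights wherever the data is NaN. A sketch on a small stand-in frame (not the question's data), assuming rows with NaN in Value have already been dropped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan],
    'B': [10.0, np.nan, 30.0],
    'Value': [1.0, 3.0, 6.0],
})

w = df['Value']
cols = df.drop(columns='Value')
# Numerator: NaN terms drop out via skipna; denominator: only count
# the weights that actually paired with a non-NaN value.
wmeans = cols.mul(w, axis=0).sum() / cols.notna().mul(w, axis=0).sum()
```

For column A this gives (1*1 + 2*3) / (1 + 3) = 1.75, the same as np.average on the dropna'd column.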

Change value if consecutive number of certain condition is achieved in Pandas

I would like to change the value of certain DataFrame entries, but only if a certain condition is met n consecutive times.
Example:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40
Which gives me this DF:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 -1.045209
4 40.000000 0.598657 -1.268399
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 1.744822
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 1.416020
10 -1.337494 -40.000000 -1.195780
11 -0.703669 -40.000000 0.657519
12 -40.000000 -0.288235 -0.840145
13 -1.084869 -0.298030 -1.592004
14 -0.617568 -1.046210 -0.531523
Now, if I do
a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')
I get
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 NaN 0.598657 NaN
5 NaN 0.442297 -0.016363
6 NaN -0.316817 NaN
7 NaN 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
which is fair. However, I would like only to replace the values with NaN if these conditions were met by a maximum of 2 consecutive entries (so I can interpolate later). For example, I wanted the result to be
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
Apparently there's no ready-to-use method to do this. The solution I found that closest resembles my problem was this one, but I couldn't make it work for me.
Any ideas?
See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum() which breaks the data into contiguous runs of the same value.
In [23]: cond = abs(df) > abs(df.std())
In [24]: for c in df.columns:
...: grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
...: fill = (df.groupby(grouper)[c].transform('size') <= 2)
...: df.loc[fill, c] = np.nan
In [25]: df
Out[25]:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.
The cond[c] != cond[c].shift(1) compares the current row's condition to the previous row's. This has the effect of 'marking' each position where a run of values begins with True.
The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0).
So now you have groups of contiguous numbers that meet your condition; the next step performs a groupby to count how many values are in each group (transform('size')).
Finally, a new boolean condition is used to assign missing values to those groups with 2 or fewer values meeting the condition.
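The same mechanics, stripped down to a one-column toy Series (a sketch, not the question's data):

```python
import pandas as pd

s = pd.Series([1, 40, 40, 1, 40, 40, 40, 1])
cond = s > 10
# Number each contiguous run; multiplying by cond sends all False rows to group 0
run_id = (cond != cond.shift()).cumsum() * cond
run_len = s.groupby(run_id).transform('size')
# Blank out only the runs of length <= 2 that met the condition
out = s.mask(cond & (run_len <= 2))
```

Here only the first pair of 40s becomes NaN; the run of three consecutive 40s survives.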

pandas indexing error with seaborn

I have a pandas dataframe that I am trying to plot as a tsplot with seaborn, and I am getting duplicate-index errors. My question is twofold. First off, when I look at the data there are no duplicate indexes (a sample df is given below):
BOLD campaign time type
0 2.735430 6041148 3 Default
1 1.356943 6041148 3 None
2 NaN 6041148 3 Vertical
3 9.550452 6013866 6 Default
4 1.000000 6013866 6 None
5 NaN 6013866 6 Vertical
6 322.675089 6086810 8 Default
7 1.508849 6086810 8 None
8 773.393385 6086810 8 Vertical
9 43.396084 6046619 10 Default
10 26.124405 6046619 10 None
11 NaN 6046619 10 Vertical
12 103.955111 6065909 10 Default
13 1.000000 6065909 10 None
14 NaN 6065909 10 Vertical
15 9.744664 6013866 9 Default
16 9.031970 6013866 9 None
17 NaN 6013866 9 Vertical
18 10.980322 6065742 8 Default
19 0.803821 6065742 8 None
However when I go to plot this dataframe with
sns.tsplot(df, time="time", unit="campaign", condition="type", value="BOLD")
I get
ValueError: Index contains duplicate entries, cannot reshape
Can someone explain why this is? I do not see any duplicate entries (or indices). I also tried using drop_duplicates and got the same result.
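For what it's worth, seaborn's (long-deprecated) tsplot pivots the data into a unit-by-time matrix for each condition, so the duplicates it complains about are repeated (campaign, time) pairs within a type, not repeated index labels. A quick way to surface the offending rows (a sketch assuming the column names above; the collision in this toy frame is deliberate):

```python
import pandas as pd

df = pd.DataFrame({
    'BOLD': [2.735430, 1.356943, 9.550452, 1.000000],
    'campaign': [6041148, 6041148, 6013866, 6013866],
    'time': [3, 3, 6, 6],
    'type': ['Default', 'Default', 'Default', 'None'],
})
# Rows that would collide in the (campaign, time) pivot for a given type
dups = df[df.duplicated(subset=['campaign', 'time', 'type'], keep=False)]
```

Aggregating such rows (e.g. with groupby(...).mean()) before plotting avoids the reshape error.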
