I'm trying to plot RINEX (GPS) data and am very new to Pandas, numpy. Here is a snippet of my code:
#Plotting of the data
pr1 = sat['P1']
pr2 = sat['P2']
calc_pr = pr1 - (((COEFF_3)**2) * pr2)
plt.plot(calc_pr,label='calc_pr')
where "sat" is a Dataframe as follows:
sat:
Panel: <class 'pandas.core.panel.Panel'>
Dimensions: 32 (items) x 2880 (major_axis) x 7 (minor_axis)
Items axis: G01 to G32
Major_axis axis: 0.0 to 23.9916666667
Minor_axis axis: L1 to S2
where each Item (G01, G02, etc) corresponds to:
(G01)
DataFrame: L1 L2 P1 P2 C1 \
0.000000 669486.833 530073.330 24568752.516 24568762.572 24568751.442
0.008333 786184.519 621006.551 24590960.634 24590970.218 24590958.374
0.016667 902916.181 711966.252 24613174.234 24613180.219 24613173.065
0.025000 1019689.006 802958.016 24635396.428 24635402.410 24635395.627
The first column (which I assume is the major axis; I built it with epoch_time = int((hour * 3600) + (minute * 60) + second)) gives the time. These are 30-second intervals over 24 hours, and were originally epoch numbers (0 to 2880). The first few epochs of "calc_pr" are shown below:
Series: 0.000000 26529.507524
0.008333 31432.322196
0.016667 36336.563310
0.025000 41242.536096
0.033333 46149.208022
0.041667 51057.059006
0.050000 55965.873639
0.058333 60875.510720
0.066667 65785.965112
0.075000 70697.114838
However, when plotting these with plt.plot(calc_pr, label='calc_pr'), the x-axis is displayed in epochs instead of time in hours. I've tried various ways of manipulating "calc_pr" so that the times are displayed rather than epoch numbers, but so far to no avail. Could someone indicate where/how I can do this?
Looks like I solved this myself in the end. I realised I just need to plot the "index", like so:
plt.plot(sat.index, calc_pr, label='calc_pr')
I love it when I solve my own problems myself. It means I'm getting even more awesome.
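For reference, a minimal sketch of the final approach (untested; "sat" and COEFF_3 are assumed to be defined as in the snippet above):
import matplotlib.pyplot as plt

# 'sat' and COEFF_3 are assumed to exist as in the snippet above
pr1 = sat['P1']
pr2 = sat['P2']
calc_pr = pr1 - ((COEFF_3 ** 2) * pr2)

# Plot against the index (time of day in hours) instead of positional epoch numbers
plt.plot(calc_pr.index, calc_pr, label='calc_pr')
plt.xlabel('Time (hours)')
plt.legend()
plt.show()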
Here is my large panel dataset:
Date        x1   x2   x3
2017-07-20  50   60   Kevin
2017-07-21  51   80   Kevin
2016-05-23  100  200  Cathy
2016-04-20  20   20   Cathy
2019-01-02  50   60   Leo
This dataset contains billions of rows. What I would like to do is calculate the 1-day difference, in percentage terms, for x1 and x2. Denote by t and t+1 the times representing today and tomorrow; I would like to calculate (x1_{t+1} - x2_t) / x2_t.
First I used the quickest way in terms of writing:
I created a nested list containing the target values for each group of x3:
nested_list = []
flatten_list = []
for group in df.x3.unique():
    df_ = df[df.x3 == group]
    nested_list.append((df_.x1.shift(-1) / df_.x2) / df_.x2)
for lst in nested_list:
    for i in lst:
        flatten_list.append(i)
df["target"] = flatten_list
However, this method would literally take a year to run, which is not feasible.
I also tried the native pandas groupby method in the hope of getting something that actually runs, but it did not seem to work:
def target_calculation(x):
    target = (x.x1.shift(-1) - x.x2) / x.x2
    return target

df["target"] = df.groupby("x3")[["x1", "x2"]].apply(target_calculation)
How can I calculate this without using a for loop, or possibly vectorize the whole process?
You could groupby + shift "x1" and subtract "x2" from it:
df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']
Output:
Date x1 x2 x3 target
0 2017-07-20 50 60 Kevin -0.15
1 2017-07-21 51 80 Kevin NaN
2 2016-05-23 100 200 Cathy -0.90
3 2016-04-20 20 20 Cathy NaN
4 2019-01-02 50 60 Leo NaN
Note that
(df.groupby('x3')['x1'].shift(-1) / df['x2']) / df['x2']
produces output equivalent to flatten_list, but I suspect that is a typo rather than your true desired output.
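For reference, a self-contained sketch that rebuilds the sample frame from the question and reproduces the output above:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Date': ['2017-07-20', '2017-07-21', '2016-05-23', '2016-04-20', '2019-01-02'],
    'x1': [50, 51, 100, 20, 50],
    'x2': [60, 80, 200, 20, 60],
    'x3': ['Kevin', 'Kevin', 'Cathy', 'Cathy', 'Leo'],
})

# (x1_{t+1} - x2_t) / x2_t within each x3 group
df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']
print(df)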
I have a dataframe like the one below:
df_sales:
ProductCode Weekly_Units_Sold Is_Promo
Date
2015-01-11 1 49.0 No
2015-01-11 2 35.0 No
2015-01-11 3 33.0 No
2015-01-11 4 40.0 No
2015-01-11 5 53.0 No
... ... ... ...
2015-07-26 313 93.0 No
2015-07-26 314 4.0 No
2015-07-26 315 1.0 No
2015-07-26 316 5.0 No
2015-07-26 317 2.0 No
I want to observe the promo-time effect on each ProductCode with sns.factorplot, using code like the one below:
sns.factorplot(data=df_sales,
               x='Is_Promo',
               y='Weekly_Units_Sold',
               hue='ProductCode');
It works, but the result looks very confusing and overlapped because all 317 products are plotted in one figure (https://i.stack.imgur.com/fgrjV.png).
When I split the dataframe with this code:
df_sales = df_sales.query('1<=ProductCode<=10')
the readability is much better:
https://i.stack.imgur.com/NTQev.png
So I wanted to draw subplots by splitting the data into ranges of 10 ProductCodes (e.g. the first subplot covers ProductCodes [1-10], the second [11-20], ..., then [291-300], [301-310], and [311-317]).
My failed tries:
g=sns.FacetGrid(df_sales,col='ProductCode')
g.map(sns.factorplot,'Is_Promo','Weekly_Units_Sold')
sns.factorplot(data=df_sales,
               x='Is_Promo',
               y='Weekly_Units_Sold',
               hue='ProductCode');
I also tried not splitting into ranges of 10 ProductCodes and instead creating a subplot for each individual ProductCode, but that gave me an image size error.
So how can I create subplots of sns.factorplot split by ProductCode range to get more readable results?
Thanks
You need to create a new column with a unique value for each group of products. A simple way of doing that is using pd.cut()
Nproducts = 100
Ngroups = 10
df1 = pd.DataFrame({'ProductCode': np.arange(Nproducts),
                    'Weekly_Units_Sold': np.random.random(size=Nproducts),
                    'Is_Promo': 'No'})
df2 = pd.DataFrame({'ProductCode': np.arange(Nproducts),
                    'Weekly_Units_Sold': np.random.random(size=Nproducts),
                    'Is_Promo': 'Yes'})
df = pd.concat([df1, df2])
df['ProductGroup'] = pd.cut(df['ProductCode'], Ngroups, labels=False)
After that, you can facet based on ProductGroup, and plot whatever relationship you want for each group.
g = sns.FacetGrid(data=df, col='ProductGroup', col_wrap=3, hue='ProductCode')
g.map(sns.pointplot, 'Is_Promo', 'Weekly_Units_Sold', order=['No','Yes'])
Note that this uses seaborn v0.10.0; factorplot() was renamed to catplot() in v0.9, so you may have to adjust for version differences.
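On seaborn 0.9+, something along these lines with catplot should give a similar grid (an untested sketch, reusing the ProductGroup column created above):
import seaborn as sns

# Assumes df with the ProductGroup column built with pd.cut() above
g = sns.catplot(data=df, x='Is_Promo', y='Weekly_Units_Sold',
                hue='ProductCode', col='ProductGroup',
                col_wrap=3, kind='point', order=['No', 'Yes'])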
EDIT: to create a legend, I had to modify the code a bit to move the hue parameter out of the FacetGrid:
g = sns.FacetGrid(data=df, col='ProductGroup', col_wrap=3)
g.map_dataframe(sns.pointplot, 'Is_Promo', 'Weekly_Units_Sold', order=['Yes','No'], hue='ProductCode')
for ax in g.axes.ravel():
    ax.legend(loc=1, bbox_to_anchor=(1.1, 1))
I have a pandas dataframe of variable number of columns. I'd like to numerically integrate each column of the dataframe so that I can evaluate the definite integral from row 0 to row 'n'. I have a function that works on an 1D array, but is there a better way to do this in a pandas dataframe so that I don't have to iterate over columns and cells? I was thinking of some way of using applymap, but I can't see how to make it work.
This is the function that works on a 1D array:
def findB(x, y):
    y_int = np.zeros(y.size)
    y_int_min = np.zeros(y.size)
    y_int_max = np.zeros(y.size)
    end = y.size - 1
    y_int[0] = (y[1] + y[0]) / 2 * (x[1] - x[0])
    for i in range(1, end, 1):
        j = i + 1
        y_int[i] = (y[j] + y[i]) / 2 * (x[j] - x[i]) + y_int[i-1]
    return y_int
I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:
B_df = y_df.applymap(integrator)
EDIT:
Starting dataframe dB_df:
Sample1 1 dB Sample1 2 dB Sample1 3 dB Sample1 4 dB Sample1 5 dB Sample1 6 dB
0 2.472389 6.524537 0.306852 -6.209527 -6.531123 -4.901795
1 6.982619 -0.534953 -7.537024 8.301643 7.744730 7.962163
2 -8.038405 -8.888681 6.856490 -0.052084 0.018511 -4.117407
3 0.040788 5.622489 3.522841 -8.170495 -7.707704 -6.313693
4 8.512173 1.896649 -8.831261 6.889746 6.960343 8.236696
5 -6.234313 -9.908385 4.934738 1.595130 3.116842 -2.078000
6 -1.998620 3.818398 5.444592 -7.503763 -8.727408 -8.117782
7 7.884663 3.818398 -8.046873 6.223019 4.646397 6.667921
8 -5.332267 -9.163214 1.993285 2.144201 4.646397 0.000627
9 -2.783008 2.288842 5.836786 -8.013618 -7.825365 -8.470759
Ending dataframe B_df:
Sample1 1 B Sample1 2 B Sample1 3 B Sample1 4 B Sample1 5 B Sample1 6 B
0 0.000038 0.000024 -0.000029 0.000008 0.000005 0.000012
1 0.000034 -0.000014 -0.000032 0.000041 0.000036 0.000028
2 0.000002 -0.000027 0.000010 0.000008 0.000005 -0.000014
3 0.000036 0.000003 -0.000011 0.000003 0.000002 -0.000006
4 0.000045 -0.000029 -0.000027 0.000037 0.000042 0.000018
5 0.000012 -0.000053 0.000015 0.000014 0.000020 -0.000023
6 0.000036 -0.000023 0.000004 0.000009 0.000004 -0.000028
7 0.000046 -0.000044 -0.000020 0.000042 0.000041 -0.000002
8 0.000013 -0.000071 0.000011 0.000019 0.000028 -0.000036
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
In the above example,
(x[j]-x[i]) = 0.000008
First of all, you can achieve a similar result using vectorized operations. Each element of the integration is just the mean of the current and next y value scaled by the corresponding difference in x. The final integral is just the cumulative sum of these elements. You can achieve the same result by doing something like
def findB(x, y):
    """
    x : pandas.Series
    y : pandas.DataFrame
    """
    mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2
    delta_x = x.shift(-1)[:-1] - x[:-1]
    scaled_int = mean_y.multiply(delta_x, axis='index')
    cumulative_int = scaled_int.cumsum(axis='index')
    return cumulative_int.shift(1).fillna(0)
Here DataFrame.shift and Series.shift are used to match the indices of the "next" elements to the current. You have to use DataFrame.multiply rather than the * operator to ensure that the proper axis is used ('index' vs 'column'). Finally, DataFrame.cumsum provides the final integration step. DataFrame.fillna ensures that you have a first row of zeros as you did in the original solution. The advantage of using all the native pandas functions is that you can pass in a dataframe with any number of columns and have it operate on all of them simultaneously.
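As a usage sketch (untested; x here is a hypothetical Series of evenly spaced values using the 0.000008 step from the example):
import numpy as np
import pandas as pd

# Hypothetical x values: evenly spaced with the 0.000008 step quoted above
x = pd.Series(np.arange(len(dB_df)) * 0.000008, index=dB_df.index)

# dB_df is the starting dataframe from the question
B_df = findB(x, dB_df)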
Are you really looking for the numeric values of the integral? Maybe you just need a picture? Then it is easier to use pyplot.
import matplotlib.pyplot as plt
# Introduce a column *bin* holding left limits of our bins.
df['bin'] = pd.cut(df['volume2'], 50).apply(lambda bin: bin.left)
# Group by bins and calculate *f*.
g = df[['bin', 'universe']].groupby('bin').sum()
# Plot the function using cumulative=True.
plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
plt.show()
I have a pandas dataframe with dates and strings similar to this:
Start End Note Item
2016-10-22 2016-11-05 Z A
2017-02-11 2017-02-25 W B
I need to expand/transform it into the table below, filling in weeks (W-SAT) between the Start and End columns and forward filling the data in Note and Item:
Start Note Item
2016-10-22 Z A
2016-10-29 Z A
2016-11-05 Z A
2017-02-11 W B
2017-02-18 W B
2017-02-25 W B
What's the best way to do this with pandas? Some sort of multi-index apply?
You can iterate over each row and create a new dataframe and then concatenate them together
pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
Start Note Item
0 2016-10-22 Z A
1 2016-10-29 Z A
2 2016-11-05 Z A
3 2017-02-11 W B
4 2017-02-18 W B
5 2017-02-25 W B
You don't need iteration at all.
df_start_end = df.melt(id_vars=['Note','Item'],value_name='date')
df = df_start_end.groupby('Note').apply(lambda x: x.set_index('date').resample('W').pad()).drop(columns=['Note','variable']).reset_index()
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)
    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]
    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())
    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])
    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')
    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']
    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)
    # don't modify dataframe in place:
    del dataframe['_dt_diff']
    return data_expanded
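A hypothetical call on the sample frame from the question might look like this (untested sketch; the 'W' time unit and argument values are assumptions):
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
    'End': pd.to_datetime(['2016-11-05', '2017-02-25']),
    'Note': ['Z', 'W'],
    'Item': ['A', 'B'],
})

# Expand in 1-week steps, keeping the end date
expanded = date_expander(df, start_dt_colname='Start', end_dt_colname='End',
                         time_unit='W', new_colname='Start', end_inclusive=True)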
So I recently spent a bit of time trying to figure out an efficient pandas-based approach to this issue (which is very trivial with data.table in R) and wanted to share the approach I came up with here:
df.set_index("Note").apply(
lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1
).explode()
Note: using .values makes a big difference in performance!
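As a possible follow-up (an untested tweak, not part of the original answer, assuming df and pandas as above), you can also keep Item and get back a tidy frame:
# Keep Note and Item in the index, then explode and reset to columns
expanded = (df.set_index(["Note", "Item"])
              .apply(lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1)
              .explode()
              .rename("Start")
              .reset_index())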
There are quite a few solutions here already and I wanted to compare the speed for different numbers of rows and periods - see the results (in seconds) below:
- n_rows is the number of initial rows and n_periods is the number of periods per row, i.e. the window size: the combinations below always result in 1 million rows when expanded
- the other columns are named after the posters of the solutions
- note I made a slight tweak to Gen's approach whereby, after pd.melt(), I do df.set_index("date").groupby("Note").resample("W-SAT").ffill() - I labelled this Gen2 and it seems to perform slightly better and gives the same result (a small sketch of this appears after this list)
- each n_rows, n_periods combination was run 10 times and the results were then averaged
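Here is a rough sketch of that Gen2 variant (untested; df is the frame from the question and pandas is assumed to be imported):
# Melt Start/End into a single date column, as in Gen's answer
df_melted = df.melt(id_vars=['Note', 'Item'], value_name='date')

# Gen2 tweak: resample to weekly (W-SAT) within each Note group and forward-fill
gen2 = (df_melted
        .set_index('date')
        .groupby('Note')
        .resample('W-SAT')
        .ffill())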
Anyway, jwdink's solution looks like the winner when there are many rows and few periods, whereas my solution seems to do better on the other end of the spectrum, though it is only marginally ahead of the others as the number of rows decreases:
n_rows  n_periods  jwdink  TedPetrou    Gen  Gen2  robbie
   250       4000    6.63       0.33   0.64  0.45    0.28
   500       2000    3.21       0.65   1.18  0.81    0.34
  1000       1000    1.57       1.28   2.30  1.60    0.48
  2000        500    0.83       2.57   4.68  3.24    0.71
  5000        200    0.40       6.10  13.26  9.59    1.43
If you want to run your own tests on this, my code is available in my GitHub repo - note I created a DateExpander class object that wraps all the functions to make it easier to scale the simulation.
Also, for reference, I used a 2-core STANDARD_DS11_V2 Azure VM - only for about 10 minutes, so this is literally me giving my 2 cents on the issue!
I have time series data similar to this:
val
2015-10-15 7.85
2015-10-16 8
2015-10-19 8.18
2015-10-20 5.39
2015-10-21 2.38
2015-10-22 1.98
2015-10-23 9.25
2015-10-26 14.29
2015-10-27 15.52
2015-10-28 15.93
2015-10-29 15.79
2015-10-30 13.83
How can I find the slope between adjacent rows (e.g. 8 and 7.85) of the val variable and print it in a different column, in R or Python?
I know the formula for a slope, but the problem is how to take the difference of the x (date) values in time series data
(here x is the date and y is val).
If by slope you mean (value(y) - value(x)) / (y - x), then you will have one fewer slope value than rows in your data frame, so it is a little awkward to show it in the same data frame.
In R, this would be my answer:
slope <- numeric(length = nrow(df))
for(i in 2:nrow(df)){
    slope[i-1] <- (df[i-1, "val"] - df[i, "val"]) / as.numeric(df[i-1, 1] - df[i, 1])
}
slope[nrow(df)] <- NA
df$slope <- slope
Edit (answering your edit):
In R, dates are a class of data (like integers, numerics, or characters).
For example I can define a vector of dates:
x<-as.Date(c("2015-10-15","2015-10-16"))
print( x )
[1] "2015-10-15" "2015-10-16"
And the difference of 2 dates returns:
x[2]-x[1]
Time difference of 1 days
As you mentioned, you cannot divide by a date:
2/(x[2]-x[1])
Error in `/.difftime`(2, (x[2] - x[1])) :
second argument cannot be "difftime"
That is why I used as.numeric, which forces the vector to be a numeric value (in days):
2/as.numeric(x[2]-x[1])
[1] 2
To prove that it works:
as.numeric(as.Date("2016-10-15")-as.Date("2015-10-16"))
[1] 365
(2016 being a leap year, this is correct!)
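Since the question also mentions Python, a rough pandas equivalent could look like this (an untested sketch, assuming the dates form the index; note it stores each slope with the later of the two rows):
import pandas as pd

# First few rows of the sample data from the question
df = pd.DataFrame(
    {'val': [7.85, 8.00, 8.18, 5.39]},
    index=pd.to_datetime(['2015-10-15', '2015-10-16', '2015-10-19', '2015-10-20']),
)
dy = df['val'].diff()                     # change in val between adjacent rows
dx = df.index.to_series().diff().dt.days  # change in date, in days
df['slope'] = dy / dx                     # NaN in the first row, where there is no previous value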