Differences in one column based on differences in another, pandas - python

How can I perform the below manipulation with pandas?
I have this dataframe:
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this:
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudo code:
If the day does not immediately follow the previous day => Fun_Cum is NULL;
Else Fun_Cum = (weight of day) - (weight of day-1)
Thank you

This is one way using pd.Series.diff and pd.Series.shift. You can take the difference between consecutive datetime elements and access the pd.Series.dt.days attribute.
df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
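For reference, a minimal, self-contained version of the same approach (a sketch; it assumes the question's dates are day-first strings that still need parsing):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'weight': [43, 30, 45, 25, 35, 39],
    'Date': ['09/03/2018 08:48:48', '10/03/2018 23:28:48',
             '12/03/2018 04:21:44', '17/03/2018 00:23:32',
             '18/03/2018 04:49:01', '19/03/2018 20:14:37'],
})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['dateDay'] = df['Date'].dt.normalize()  # midnight of each date

df['Fun_Cum'] = df['weight'].diff()
# Blank out the diff wherever the gap to the previous row is not exactly one day
df.loc[df['dateDay'].diff().dt.days != 1, 'Fun_Cum'] = np.nan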

#import pandas as pd
#from datetime import datetime
#to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
#df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The commented-out part above is only needed if you are reading from a file; otherwise all you need is .shift():
a = df
b = df.shift()
# where() keeps the difference only for consecutive days and leaves NaN otherwise;
# multiplying by the boolean mask would give 0 rather than NULL on the gap rows
df["Fun_Cum"] = (a.weight - b.weight).where((a.dateDay - b.dateDay).dt.days == 1)

Related

How to groupby average and extract the last item in a dataframe

I want to extract the closing balance for each week across different dates from the dataframe below:
Date Week Balance
2017-02-12 6 50000.46
2017-02-12 6 49531.46
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2017-08-12 32 21561.1
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
2018-08-12 32 21561.1
Expected output is:
Date Week Balance
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2018-02-07 6 39.53
2018-08-12 32 21561.1
I tried to use the .last() method of groupby, but I get multiple rows for the same week:
weekly = df.groupby(["Transaction Date",'Week']).last().Balance
weekly
Date Week Balance
2017-02-12 6 48108.46
2017-03-12 10 46802.46
2017-04-12 15 39588.46
2017-05-12 19 21558.96
2018-02-03 5 24699.73
2018-02-04 5 103.20
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
You can use shift to compare each row's Week with the next row's and keep only the last row of each consecutive run:
df.loc[df['Week'] != df['Week'].shift(-1)]
Output:
| | Date | Week | Balance |
|---:|:-----------|-------:|----------:|
| 2 | 2017-02-12 | 6 | 48108.46 |
| 3 | 2017-05-12 | 19 | 21558.96 |
| 4 | 2017-08-12 | 32 | 21561.10 |
| 7 | 2018-02-07 | 6 | 39.53 |
| 8 | 2018-08-12 | 32 | 21561.10 |
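A minimal, self-contained sketch of the same idea (assuming the frame is already ordered by date, as in the question):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-02-12', '2017-02-12', '2017-02-12', '2017-05-12', '2017-08-12',
             '2018-02-05', '2018-02-06', '2018-02-07', '2018-08-12'],
    'Week': [6, 6, 6, 19, 32, 6, 6, 6, 32],
    'Balance': [50000.46, 49531.46, 48108.46, 21558.96, 21561.10,
                2816.20, 78.53, 39.53, 21561.10],
})

# A row is the closing balance of its run if the next row has a different Week
weekly = df.loc[df['Week'] != df['Week'].shift(-1)]
print(weekly)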

Python: How to transform a column data csv to row data

I have a csv file that looks like this:
Signal Channel
-60 1
-40 6
-70 11
-80 3
-80 4
-66 1
-60 11
-50 6
I want to create a new csv file using those data in matrix form:
channel 1 | channel 2 | channel 3 | channel 4 | channel 5 | channel 6 | channel 7 | channel 8 | channel 9 | channel 10 | channel 11
-60 | | -80 | -80 | | -40 | | | | | -70
-66 | | | | | -50 | | | | | -60
But I don't know how to do this.
I think you can manage with this; just pass the args you want to the to_csv function to make it display the way you want:
import pandas as pd
data={"Signal":[-60,-40,-70,-80,-80,-66,-60,-50],
"Channel":[1,6,11,3,4,1,11,6]}
df=pd.DataFrame(data)
df['count']=df.groupby('Signal')['Signal'].cumcount()
pivot=pd.pivot_table(df,values=["Signal"],columns=["Channel"],index=['count'])
pivot=pivot.add_prefix('Channel_')
pivot.to_csv("test.csv",index=False)
You can use df.pivot() from pandas (see here). First, cumcount is used to determine the row position of each value within its channel.
from io import StringIO
import pandas as pd
csv = """
Signal,Channel
-60,1
-40,6
-70,11
-80,3
-80,4
-66,1
-60,11
-50,6"""
df = pd.read_csv(StringIO(csv))
df['index'] = df.groupby('Channel').cumcount()
df.pivot(index='index', columns="Channel", values="Signal")
gives you
Channel 1 3 4 6 11
index
0 -60.0 -80.0 -80.0 -40.0 -70.0
1 -66.0 NaN NaN -50.0 -60.0
This answer is adapted from @unutbu's answer here
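Since the goal is a new CSV with one column per channel (including channels that never occur), a small follow-up sketch; the 1-11 channel range and the output filename are assumptions taken from the example:
out = df.pivot(index='index', columns='Channel', values='Signal')
out = out.reindex(columns=range(1, 12))              # add the empty channels 2, 5, 7, ...
out.columns = [f'channel {c}' for c in out.columns]  # match the desired header names
out.to_csv('channels.csv', index=False)              # NaNs are written as empty cells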

Plots shifting in heatmaps in Seaborn Facetgrid

Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as a function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[df.Sub == 6].loc[df.Layer == '3 - H']
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors out (returns '10000'), so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(spam, row='Wafer', col='Feature', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                  cmap="plasma", annot=True, annot_kws={"size": 20})
This yields a grid of heatmaps, but the axes have been automatically adjusted to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but it just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And when not using the cropped-down dataset (i.e. df instead of spam), the code generates a FacetGrid in which plots featuring missing values at extreme (edge) coordinate positions shift within their axes - here all apparently to the upper left. Sub #5, layer 3 - H should instead show blanks in the places where there are NaNs.
Why is the FacetGrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You may create a list of unique column and row labels and reindex the pivot table with them. That way every facet's pivot shares the same axes, so the heatmaps no longer shift when edge positions are entirely NaN.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex_axis(cols, axis=1).reindex_axis(rows, axis=0)
as seen in this answer.
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# build some data resembling the question's layout
r = np.repeat([0, -2, 2, -1, 1, -3], 2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2 + [48]*4 + [24]*4 + [0]*3)
col = np.concatenate((c, -c[::-1]))
df = pd.DataFrame({"Result": np.random.rand(26),
                   "Layer": list("AB")*13,
                   "Row": row, "Col": col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.loc[10:11, "Result"] = np.nan
df1.loc[20:, "Result"] = np.nan
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.loc[0:2, "Result"] = np.nan
df = pd.concat([df1, df2])

cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)

def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1], values=args[2])
    # give every facet the same axes so nothing shifts
    d = d.reindex(columns=cols, index=rows)
    sns.heatmap(d, **kwargs)

grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1)
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
                   cmap="plasma", annot=True)
plt.show()

Column values and column header iteration calculation

I have an excel sheet setup as below:
avgdegf | 50 | 55 | 60 | 65 | 70 | 75 | 80
76 |
68 |
39 |
Note: the cells under 50, 55, 60, 65, 70, 75, and 80 are empty.
What I am trying to achieve is filling in these cells based on the column headers: if the avgdegf value is greater than a column's header number, put (avgdegf - header number) in that cell, otherwise put 0. For example:
avgdegf | 50 | 55 | 60 | 65 | 70 | 75 | 80
76 | 26 | 21 | 16 | 11 | 6 | 1 | 0
68 | 18 | 13 | 8 | 3 | 0 | 0 | 0
39 | 0 | 0 | 0 | 0 | 0 | 0 | 0
The above is what I expect to get, but instead I just get:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
What am I doing wrong and how can I fix this? Thanks!
Here is a chunk of my code below:
df_avgdegf = df["avgdegf"]
x=50
for x in range(50, 81):
if df_avgdegf > x:
df[x]= (df_avgdegf)-x
else:
df[x]=0
df.head()
df_cdd = df[x]
df_cdd = pd.DataFrame(df_cdd)
writer = ExcelWriter('thecddhddtestque.xlsx')
df.to_excel(writer,'Sheet1',index=False)
writer.save()
x += 1
I've assumed your data is in a CSV file. The same principles apply if you are using the Excel reader.
data.csv:
avgdegf,50,55,60,65,70,75,80
76,,,,,,,
68,,,,,,,
39,,,,,,,
get your data into a dataframe:
df = pd.read_csv('data.csv')
so your df will look like this:
avgdegf 50 55 60 65 70 75 80
0 76 nan nan nan nan nan nan nan
1 68 nan nan nan nan nan nan nan
2 39 nan nan nan nan nan nan nan
the next steps with this code will do the trick:
# get the numerical column headers into the first data row
df.iloc[0, 1:] = df.columns[1:]
df = df.fillna(method='ffill')
df = df.astype(np.float64)  # cast type for the next steps
df.iloc[:, 1:] = df.iloc[:, 1:].sub(df['avgdegf'], axis='index')  # 1.) see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
df.iloc[:, 1:] = df.iloc[:, 1:].applymap(lambda x: 0 if x > 0 else x)  # 2.) set positive values to zero
df.iloc[:, 1:] = df.iloc[:, 1:].applymap(np.abs)  # 3.) since we subtracted in reverse, take np.abs()
df.set_index('avgdegf', inplace=True)
which produces:
50 55 60 65 70 75 80
avgdegf
76 26 21 16 11 6 1 0
68 18 13 8 3 0 0 0
39 0 0 0 0 0 0 0
This is perhaps cleaner syntax and demonstrates numpy's cool "broadcasting". Ultimately it gives the same result as the other answer.
df = pd.read_csv('data.csv')
df.fillna(1, inplace=True)
print(df.head())
df = df.astype(int)

b = df.iloc[:, 1:].values              # the placeholder ones in the value columns
a = df.columns[1:].values.astype(int)  # the numeric column headers

print(a.shape)
print(b.shape)
print(a * b)
print(df['avgdegf'].values)
print(df['avgdegf'].values[:, np.newaxis])

# broadcast: header values minus each row's avgdegf
method1 = (a * b) - df['avgdegf'].values[:, np.newaxis]
# or
method2 = ((a * b).T - df['avgdegf'].values).T
df.iloc[:, 1:] = method1
# df.iloc[:, 1:] = df.iloc[:, 1:].applymap(lambda x: np.abs(0) if x > 0 else np.abs(x))
# OR
df.iloc[:, 1:] = df.iloc[:, 1:].clip(upper=0).abs()
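For what it's worth, the same broadcasting idea can be written more directly in current pandas/numpy. A minimal sketch, assuming the headers are the integers 50 through 80 in steps of 5:
import numpy as np
import pandas as pd

df = pd.DataFrame({'avgdegf': [76, 68, 39]})
headers = np.arange(50, 85, 5)  # 50, 55, ..., 80

# each row's avgdegf minus every header, floored at 0
values = (df['avgdegf'].to_numpy()[:, None] - headers).clip(min=0)
result = pd.concat([df, pd.DataFrame(values, columns=headers)], axis=1)
print(result)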

How to include strings with Pandas resample

I have a data series with an irregular datetime column as my index, a numeric value, and three columns that each indicate whether a safety mechanism is activated to block the numeric value. Example:
DateTime Safe1 Safe2 Safe3 Measurement
1/8/2013 6:06 N Y N
1/8/2013 6:23 N Y N
1/8/2013 6:40 N N N 28
1/8/2013 6:57 N N N 31
I need to resample the data using Pandas in order to create clean half-hour interval data, taking the mean of values where any exist. Of course, this removes the three safety string columns.
However, I would like to include a column that indicates Y if any combination of the safety mechanisms are activated for the entire half-hour interval.
How do I get a string column in the resampled data showing Y whenever a Y was present amongst the three safety mechanism columns in the raw data, even for intervals with no values in Measurement?
Desired Output based upon above:
DateTime Safe1 Measurement
1/8/2013 6:00 Y
1/8/2013 6:30 N 29.5
I don't think it's possible to do what you want with the resample function alone, as there's not much customisation you can do. We have to use a time grouper (pd.Grouper; pd.TimeGrouper in older pandas) with a groupby operation.
First, create the data:
import pandas as pd
index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']
data = {'Safe1' : ['N', 'N', 'N', 'N'],
'Safe2': ['Y', 'Y', 'N', 'N'],
'Safe3': ['N', 'N', 'N', 'N'],
'Measurement': [0,0,28,31]}
df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df
output:
Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00 0 N Y N
2013-01-08 06:23:00 0 N Y N
2013-01-08 06:40:00 28 N N N
2013-01-08 06:57:00 31 N N N
Then let's add a helper column, called Safe, which is a concatenation of the Safe1-Safe3 columns. If there's at least one Y in the Safe column, we'll know that a safety mechanism was activated.
df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)
output:
Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00 0 N Y N NYN
2013-01-08 06:23:00 0 N Y N NYN
2013-01-08 06:40:00 28 N N N NNN
2013-01-08 06:57:00 31 N N N NNN
Finally, we define a custom function that returns Y if there is at least one Y among the strings passed to it.
That custom function is applied to the Safe column after grouping into 30-minute intervals:
def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

df.groupby(pd.Grouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})
output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
Here's an answer using pandas' built-in resample function.
First combine the 3 Safe values into a single column:
df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3
Turn the 3-letter strings into a 0-1 variable:
df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)
Write a custom resampling function for the 'Safes' column:
def f(x):
    if sum(x) > 0:
        return 'Y'
    else:
        return 'N'
Finally, resample:
df.resample('30T').Safe.apply(f).to_frame('Safe').join(df.resample('30T').Measurement.mean())
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
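Both aggregations can also be done in a single agg call; a small sketch using the f defined above ('30min' is just the newer spelling of the '30T' frequency alias):
out = df.resample('30min').agg({'Safe': f, 'Measurement': 'mean'})
print(out)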
I manually resample the dates (easy if it is just rounding). Here is an example:
from datetime import datetime, timedelta
from random import randint, seed

import pandas as pd
from tabulate import tabulate

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe", headers="keys"))

seed(42)
people = ['tom', 'dick', 'harry']
avg_score = [90, 50, 10]
date_times = [n for n in pd.date_range(datetime.now() - timedelta(days=2),
                                       datetime.now(), freq='5 min').values]
scale = 1 + int(len(date_times) / len(people))
score = [randint(i, 100) * i / 10000 for i in avg_score * scale]
df = pd.DataFrame.from_records(list(zip(date_times, people * scale, score)),
                               columns=['When', 'Who', 'Status'])
# Create records where tom should score ~90%, dick ~50% and poor harry only ~10%
# Tom should score well
df_to_md(df[df.Who == 'tom'].head())
The table is in Markdown format - just to ease my cut and paste....
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 |
Harry scores badly
df_to_md(df[df.Who=='harry'].head())
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 2 | 2019-06-18 14:17:17.457124 | harry | 0.013 |
| 5 | 2019-06-18 14:32:17.457124 | harry | 0.038 |
| 8 | 2019-06-18 14:47:17.457124 | harry | 0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry | 0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry | 0.064 |
Let's get the average per hour per person.
def round_to_hour(t):
    # Rounds to the nearest hour by adding an hour if minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))
And generate a new column using this method:
df['WhenRounded'] = df.When.apply(round_to_hour)
df_to_md(df[df.Who=='tom'].head())
This should be tom's data - showing original and rounded.
| | When | Who | Status | WhenRounded |
|---:|:---------------------------|:------|---------:|:--------------------|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 | 2019-06-18 14:00:00 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 | 2019-06-18 14:00:00 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 | 2019-06-18 15:00:00 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 | 2019-06-18 15:00:00 |
We can now "resample" by grouping and aggregating: group by the rounded date and the person (a datetime and a str), taking in this case the mean value, though other aggregations are also available.
df_resampled = df.groupby(['WhenRounded', 'Who']).agg({'Status': 'mean'}).reset_index()
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who=='tom'].head())
| | WhenRounded | Who | Status |
|---:|:--------------------|:------|---------:|
| 2 | 2019-06-18 14:00:00 | tom | 0.873 |
| 5 | 2019-06-18 15:00:00 | tom | 0.83925 |
| 8 | 2019-06-18 16:00:00 | tom | 0.86175 |
| 11 | 2019-06-18 17:00:00 | tom | 0.84375 |
| 14 | 2019-06-18 18:00:00 | tom | 0.8505 |
Let's check the mean for tom @ 14:00 (the two readings that rounded down to 14:00):
print("Check tom 14:00 .87300 ... {:6.5f}".format((.900 + .846) / 2))
Check tom 14:00 .87300 ... 0.87300
Hope this assists
