Create multiple dataframe columns containing calculated values from an existing column - python

I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100-[5*12.0])*0.2 = 8.0, and December for Sonic, (100-[5*3.0])*0.2 = 17.0 My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is for if only one month was in consideration:
for w in range(1,len(sega_df.index)):
sega_df['Weighted'] = (100 - 5*sega_df)*0.2
sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.

One vectorised approach is to drown to numpy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)

Related

a dynamic dataframe range

I am trying to loop through a dataframe creating a dynamic ranges that are limited to the last 6 months of every row index.
Because I am looking back 6 months, I start from the first index row that has a date >= the first date in row index 0 of the dataframe. The condition which I have managed to create is shown below:
for i in df.index:
if datetime.strptime(df['date'][i], '%Y-%m-%d %H:%M:%S') >= (datetime.strptime(df['date'].iloc[0], '%Y-%m-%d %H:%M:%S') + dateutil.relativedelta.relativedelta(months=6)):
However, this merely creates ranges that grow in size incorporating, all data that is indexed after
the first index row that has a date >= the first date in row index 0 of the dataframe.
How can I limit the condition statement to only the last 6 months of each row index?
I'm not sure what exactly you want to do once you have your "dynamic ranges".
You can obtain a list of intervals (t - 6mo, t) for each t in your DatetimeIndex):
intervals = [(t - pd.DateOffset(months=6), t) for t in df.index]
But doing selection operations in a big for-loop might be slow.
Instead, you might be interested in Pandas's rolling operations. It can even use a date offset (as long as it is fixed-frequency) instead of a fixed-sized int window width. However, "6 months" is a non-fixed frequency, and as such the regular rolling won't accept it.
Still, if you are ok with an approximation, say "182 days", then the following might work well.
# setup
n = 10
df = pd.DataFrame(
{'a': np.arange(n), 'b': np.ones(n)},
index=pd.date_range('2019-01-01', freq='M', periods=n))
# example: sum
df.rolling('182D', min_periods=0).sum()
# out:
a b
2019-01-31 0.0 1.0
2019-02-28 1.0 2.0
2019-03-31 3.0 3.0
2019-04-30 6.0 4.0
2019-05-31 10.0 5.0
2019-06-30 15.0 6.0
2019-07-31 21.0 7.0
2019-08-31 27.0 6.0
2019-09-30 33.0 6.0
2019-10-31 39.0 6.0
If you want to be strict on the 6 months windows, you can implement your own pandas.api.indexers.BaseIndexer and use that as arg of rolling.

Calculating date difference for pandas dataframe rows with changing baseline dates

Hi I am using the date difference as a machine learning feature, analyzing how the weight of a patient changed over time.
I successfully tested a method to do that as shown below, but the question is how to extend this to a dataframe where I have to see date difference for each patient as shown in the figure above. The encircled column is what im aiming to get. So basically the baseline date from which the date difference is calculated changes every time for a new patient name so that we can track the weight progress over time for that patient! Thanks
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this, (but im not sure how to do this exactly):
def f(row):
# some logic here
return val
df['Datediff'] = df.apply(f, axis=1)
You can use transform with first
df['Datediff'] = df['Date'] - df1.groupby('Name')['Date'].transform('first')
Another solution can be using cumsum
df['Datediff'] = df.groupby('Name')['Date'].apply(lambda x:x.diff().cumsum().fillna(0))
df["Datediff"] = df.groupby("Name")["Date"].diff().fillna(0)/ np.timedelta64(1, 'D')
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64

Python - multiplying dataframes of different size

I have two dataframes:
df1 - is a pivot table that has totals for both columns and rows, both with default names "All"
df2 - a df I created manually by specifying values and using the same index and column names as are used in the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals return NaNs since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
[50, 60, 70, 180],
[90, 10, 10, 110],
[150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)
table2 = np.matrix([[1.0, 2.0, 3.0],
[5.0, 6.0, 7.0],
[2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)
df3 = df1*df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:
df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
columns = ['CR Range'], aggfunc = [np.sum], margins = True)
I do need to keep the total as I'm using them in other calculations.
I'm sure there is a workaround but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe an issue is something completely different.
Thank you for reading. I realize it's a very long post..
IIUC,
My Preferred Approach
you can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (multiplicative identity) to preserve the value from the dataframe in which the value is not missing.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
I really like Pir 's approach , and here is mine :-)
df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350
#Wen, #piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:
Converted the pivot table to records and then back to dataframe using solution from this thread: pandas pivot table to data frame .
Cleaned up the column headers using solution from the same thread above: pandas pivot table to data frame .
Set my first column as the index following suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used approach suggested by #Wen because I like that it preserves the structure.

Aligning 2 python lists according to 2 other list

I have two arrays namely nlxTTL and ttlState. Both the arrays comprise of repeating pattern of 0's and 1's indicating input voltage which can be HIGH(1) or LOW(0) and are recorded from same source which sends a TTL pulse(HIGH and LOW) with 1second pulse width.
But due to some logging mistake, some drops happen in ttlState list i.e. it doesn't log a repeating sequence of 0 and 1's and ends up dropping values.
The good part is I also log timestamp for each TTL input received for both the lists. Inter TTL event timestamp difference clearly shows that the TTL event has missed one of the pulses.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see the nlxTime and ttlTime clearly are different from each other. How can then using these timestamps I can align all 4 lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question, one way to think about this problem is that you really have two datasets... An nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from StringIO import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN values because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join. This will only keep records for which you have both an nlx and ttl response at that timestamp.

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data that I want to compute a 12 months moving average for. Data for for every month of January is missing, however (NaN), so I am using
pd.rolling_mean(data["variable"]), 12, center=True)
but it just gives me all NaN values.
Is there a simple way that I can ignore the NaN values? I understand that in practice this would become a 11-month moving average.
The dataframe has other variables which have January data, so I don't want to just throw out the January columns and do an 11 month moving average.
There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.
df=pd.DataFrame({ 'month' : [10,11,12,1,2,3],
'temp' : [65,50,45,np.nan,40,43] }).set_index('month')
You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in #user394430's answer.)
df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3'] = df['temp'].rolling( 3,center=True,min_periods=1).mean()
Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).
df['update'] = df['rollmean3']
df['update'].update( df['temp'] ) # note: this is an inplace operation
There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.
df['ffill'] = df['temp'].ffill() # previous month
df['bfill'] = df['temp'].bfill() # next month
df['interp'] = df['temp'].interpolate() # mean of prev/next
In this case, interpolate() defaults to simple linear interpretation, but you have several other intepolation options also. See documentation on pandas interpolate for more info. Or this statck overflow question:
Interpolation on DataFrame in pandas
Here is the sample data with all the results:
temp rollmean12 rollmean3 update ffill bfill interp
month
10 65.0 48.6 57.500000 65.0 65.0 65.0 65.0
11 50.0 48.6 53.333333 50.0 50.0 50.0 50.0
12 45.0 48.6 47.500000 45.0 45.0 45.0 45.0
1 NaN 48.6 42.500000 42.5 45.0 40.0 42.5
2 40.0 48.6 41.500000 40.0 40.0 40.0 40.0
3 43.0 48.6 41.500000 43.0 43.0 43.0 43.0
In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
The real key is having min_periods=1. Also, as of version 18, the proper calling is with a Rolling object. Therefore, your code should be
data["variable"].rolling(min_periods=1, center=True, window=12).mean().

Categories