ValueError: cannot reindex from a duplicate axis (pandas, Python)

So I have an array of time series that are generated based on a fund_id:
def get_adj_nav(self, fund_id):
    df_nav = read_frame(
        super(__class__, self).filter(fund__id=fund_id, nav__gt=0).exclude(fund__account_class=0).order_by(
            'valuation_period_end_date'), coerce_float=True,
        fieldnames=['income_payable', 'valuation_period_end_date', 'nav', 'outstanding_shares_par'],
        index_col='valuation_period_end_date')
    df_dvd, skip = self.get_dvd(fund_id=fund_id)
    df_nav_adj = calculate_adjusted_prices(
        df_nav.join(df_dvd).fillna(0).rename_axis({'payout_per_share': 'dividend'}, axis=1), column='nav')
    return df_nav_adj
def json_total_return_table(request, fund_account_id):
    ts_list = []
    for fund_id in Fund.objects.get_fund_series(fund_account_id=fund_account_id):
        if NAV.objects.filter(fund__id=fund_id, income_payable__lt=0).exists():
            ts = NAV.objects.get_adj_nav(fund_id)['adj_nav']
            ts.name = Fund.objects.get(id=fund_id).account_class_description
            ts_list.append(ts.copy())
            print(ts)
    df_adj_nav = pd.concat(ts_list, axis=1)  # ====> Throws the error
    cols_to_datetime(df_adj_nav, 'index')
    df_adj_nav = ffn.core.calc_stats(df_adj_nav.dropna()).to_csv(sep=',')
Here is an example of what one of the time series looks like:
valuation_period_end_date
2013-09-03 17.234000
2013-09-04 17.277000
2013-09-05 17.363000
2013-09-06 17.326900
2013-09-09 17.400800
2013-09-10 17.473000
2013-09-11 17.486800
2013-09-12 17.371600
....
Name: CLASS I, Length: 984, dtype: float64
Another time series:
valuation_period_end_date
2013-09-03 17.564700
2013-09-04 17.608500
2013-09-05 17.696100
2013-09-06 17.659300
2013-09-09 17.734700
2013-09-10 17.808300
2013-09-11 17.823100
2013-09-12 17.704900
....
Name: CLASS F, Length: 984, dtype: float64
The lengths differ from one time series to the next, and I am wondering whether that is the reason for the error I am getting: cannot reindex from a duplicate axis. I am new to pandas, so any advice would be appreciated.
Thanks
EDIT: Also the indexes aren't supposed to be unique.
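To make the failure concrete, here is a minimal sketch (with made-up numbers) of how concatenating series whose indexes contain duplicate dates triggers this error; the exact message varies slightly between pandas versions:
import pandas as pd

# Two NAV-style series; the first has a duplicated date in its index,
# as happens when valuation_period_end_date is not unique.
s1 = pd.Series([17.234, 17.250, 17.277],
               index=pd.to_datetime(['2013-09-03', '2013-09-03', '2013-09-04']),
               name='CLASS I')
s2 = pd.Series([17.5647, 17.6085],
               index=pd.to_datetime(['2013-09-03', '2013-09-04']),
               name='CLASS F')

# axis=1 concatenation has to align (reindex) both series on a common
# index, which fails because s1's index contains duplicates.
pd.concat([s1, s2], axis=1)
# ValueError: cannot reindex from a duplicate axis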

Perhaps something like this would work. I've added the fund_id to the dataframe and reindexed it to the valuation_period_end_date and fund_id.
# Replaces only the `ts = ...` line (the fourth line above the error).
ts = (
    NAV.objects.get_adj_nav(fund_id)['adj_nav']
    .to_frame()
    .assign(fund_id=fund_id)
    .reset_index()
    .set_index(['valuation_period_end_date', 'fund_id'])
)
Then concatenate with axis=0, group on the date and fund_id (assuming there is only one value per date and fund_id, you can take the first), and unstack fund_id to pivot it into columns:
df_adj_nav = (
    pd.concat(ts_list, axis=0)
    .groupby(['valuation_period_end_date', 'fund_id'])
    .first()
    .unstack('fund_id')
)
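As a self-contained illustration of the pattern (with dummy frames standing in for the real ts_list entries), duplicate dates no longer break the concatenation because each (date, fund_id) pair is grouped before unstacking:
import pandas as pd

# Hypothetical stand-ins for the per-fund frames built above.
a = (pd.DataFrame({'valuation_period_end_date': ['2013-09-03', '2013-09-03', '2013-09-04'],
                   'adj_nav': [17.234, 17.250, 17.277]})
     .assign(fund_id='CLASS I')
     .set_index(['valuation_period_end_date', 'fund_id']))
b = (pd.DataFrame({'valuation_period_end_date': ['2013-09-03', '2013-09-04'],
                   'adj_nav': [17.5647, 17.6085]})
     .assign(fund_id='CLASS F')
     .set_index(['valuation_period_end_date', 'fund_id']))

df_adj_nav = (
    pd.concat([a, b], axis=0)
    .groupby(['valuation_period_end_date', 'fund_id'])
    .first()
    .unstack('fund_id')
)
print(df_adj_nav)  # one 'adj_nav' column per fund_id, indexed by date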


Using a newly assigned column in a `groupby` statement? (method chaining with Pandas)

I'm an R (dplyr) user who's learning how to clean data using pandas. I am practicing with the wind turbines dataset, and I would like to return a data frame with the count of manufacturers per year in British Columbia since the year 2000.
The chunk below returns the error NameError: name 'year' is not defined. Is there a way to pipe a newly generated column, year in this case, into a groupby statement within one chain?
import pandas as pd

wind_raw = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-27/wind-turbine.csv"
)
(
    wind_raw
    .loc[:, ['province_territory', 'manufacturer', 'commissioning_date']]
    .assign(year=wind_raw.commissioning_date.str.replace(r'(\d{4})(\/\d{4})*', r'\1'))
    .assign(year=lambda row: pd.to_datetime(row.year))
    .query('province_territory == "British Columbia" and year >= 2000')
    .groupby(wind_raw.manufacturer, year)
    .size()
)
You almost got it; you only need to change the groupby parameters:
(
    wind_raw
    .loc[:, ['province_territory', 'manufacturer', 'commissioning_date']]
    .assign(year=wind_raw.commissioning_date.str.replace(r'(\d{4})(\/\d{4})*', r'\1'))
    .assign(year=lambda row: pd.to_datetime(row.year))
    .query('province_territory == "British Columbia" and year >= 2000')
    .groupby(["manufacturer", "year"])
    .size()
)
Output
manufacturer  year
Enercon       2009-01-01    34
              2019-01-01     4
GE            2017-01-01    61
Leitwind      2010-01-01     1
Senvion       2017-01-01    10
Vestas        2011-01-01    48
              2012-01-01    79
              2014-01-01    55
Also, there are a couple of things that can be simplified:
(
    wind_raw[['province_territory', 'manufacturer']]
    .assign(year=wind_raw.commissioning_date.str.extract(r"(\d{4})").astype(int))
    .query('province_territory == "British Columbia" and year >= 2000')
    .groupby(["manufacturer", "year"])
    .size()
)
To directly answer your question, you can specify your column selection in the groupby method as strings: .groupby(["manufacturer", "year"]) (we pass a list to group by multiple columns).
And if you're interested, here's an extra tidbit:
.assign(year = lambda row: pd.to_datetime(row.year))
Passing a function to the assign method does not pass each row to the function. It actually passes the dataframe in its current form.
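A quick way to convince yourself of that, using a tiny hypothetical frame:
import pandas as pd

df = pd.DataFrame({"year": ["2009", "2017"]})

# The lambda receives the whole intermediate DataFrame, so `d.year` is a
# full column here, not one row's value.
out = df.assign(year=lambda d: pd.to_datetime(d.year))
print(out.dtypes)  # year    datetime64[ns]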
Moreover, you can use some fun method chaining even within the assign method:
.assign(
    year=(wind_raw.commissioning_date
          .str.replace(r'(\d{4})(\/\d{4})*', r'\1')
          # use the pipe method to pass the regex-extracted column to `pd.to_datetime`
          .pipe(pd.to_datetime))
)

Python pandas: multiply 2 columns of 2 dataframes with different datetime index

I have two dataframes, UsdBrlDSlice and indexesM. The first is on a daily basis and has an index in yyyy-mm-dd format; the second is on a monthly basis and has an index in yyyy-mm format.
Example of UsdBrlDSlice:
USDBRL
date
1994-01-03 331.2200
1994-01-04 336.4900
1994-01-05 341.8300
1994-01-06 347.2350
1994-01-07 352.7300
...
2020-10-05 5.6299
2020-10-06 5.5205
2020-10-07 5.6018
2020-10-08 5.6200
2020-10-09 5.5393
I need to insert a new column in UsdBrlDSlice, multiplying its USDBRL value by a specific column indexesM['c'], but matching the correct month of both indexes.
Something like Excel's VLOOKUP followed by a multiplication. Thanks.
I solved it by 1) creating a new y-m column in the first dataframe, and then 2) applying the map() function:
UsdBrlDSlice['y-m'] = UsdBrlDSlice.index.to_period('M')
UsdBrlDSlice['new col'] = UsdBrlDSlice['USDBRL'] * UsdBrlDSlice['y-m'].map(indexesM.set_index(indexesM.index)['c'])
# Alternative: merge the monthly values onto the daily frame by month number.
UsdBrlDSliceTmp = UsdBrlDSlice.copy()
UsdBrlDSliceTmp['date_col'] = UsdBrlDSliceTmp.index.values
indexesMTmp = indexesM.copy()
indexesMTmp['date_col'] = indexesMTmp.index.values
UsdBrlDSliceTmp['month'] = UsdBrlDSliceTmp['date_col'].apply(lambda x: x.month)
indexesMTmp['month'] = indexesMTmp['date_col'].apply(lambda x: x.month)
UsdBrlDSliceTmp = UsdBrlDSliceTmp.merge(indexesMTmp, on='month', how='left')
UsdBrlDSliceTmp['target'] = UsdBrlDSliceTmp['USDBRL'] * UsdBrlDSliceTmp['c']
# use .values: the merged frame has a fresh RangeIndex, so aligning on the
# original date index would produce NaN
UsdBrlDSlice['new_col'] = UsdBrlDSliceTmp['target'].values
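One caveat about the merge-on-month approach above: since UsdBrlDSlice spans many years, matching on the month number alone would pair, say, October 1994 with whatever October rows exist in indexesM. If indexesM also covers multiple years, matching on a year-month period (as in the map() solution) avoids that; a minimal sketch with made-up frames:
import pandas as pd

# Hypothetical stand-ins for UsdBrlDSlice (daily) and indexesM (monthly)
daily = pd.DataFrame({'USDBRL': [5.6299, 5.5205]},
                     index=pd.to_datetime(['2020-10-05', '2020-10-06']))
monthly = pd.DataFrame({'c': [1.10]},
                       index=pd.PeriodIndex(['2020-10'], freq='M'))

# Look up the monthly factor by each row's year-month period, then multiply.
daily['y-m'] = daily.index.to_period('M')
daily['new_col'] = daily['USDBRL'] * daily['y-m'].map(monthly['c'])
print(daily)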

Generating monthly means for all columns without initializing a list for each column?

I have time series data and I want to generate the mean for each month, for each column. I have successfully done so, but only by creating a list for each column, which wouldn't be feasible for thousands of columns.
How can I adapt my code to auto-populate the column names and values into a dataframe with thousands of columns?
For context, this data has 20 observations per hour for 12 months.
Original data:
timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
Output:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
0 106.734147 16518.428734 16518.428734 7630.187992 45.992215
1 115.099825 18222.911023 18222.911023 9954.252911 47.334477
2 111.555504 19090.607211 19090.607211 9283.845649 48.939581
3 102.408996 18399.719852 18399.719852 7778.897037 48.130057
4 118.371951 20245.378742 20245.378742 9024.424210 64.796939
5 127.580516 21859.212675 21859.212675 9595.477455 70.952311
6 134.159082 22349.853561 22349.853561 10305.252112 75.195480
7 137.990638 21122.233427 21122.233427 10024.709142 74.755469
8 144.958318 18633.290818 18633.290818 11193.381098 66.776627
9 122.406489 20258.135923 20258.135923 10504.604420 61.793355
10 104.817850 18762.070668 18762.070668 9361.052983 51.802615
11 106.589672 20049.809554 20049.809554 9158.685383 51.611633
Successful code:
# separate data into months
v = list(range(1, 13))
data_month = []
for i in v:
    data_month.append(data[data.index.month == i])

# average per month for each sensor
mean_56TI1164 = []
mean_56FI1281 = []
mean_56TI1281 = []
mean_52FC1043 = []
mean_57TI1501 = []
for i in range(0, 12):
    mean_56TI1164.append(data_month[i]['56TI1164'].mean())
    mean_56FI1281.append(data_month[i]['56FI1281'].mean())
    mean_56TI1281.append(data_month[i]['56FI1281'].mean())
    mean_52FC1043.append(data_month[i]['52FC1043'].mean())
    mean_57TI1501.append(data_month[i]['57TI1501'].mean())

mean_df = {'56TI1164': mean_56TI1164, '56FI1281': mean_56FI1281, '56TI1281': mean_56TI1281,
           '52FC1043': mean_52FC1043, '57TI1501': mean_57TI1501}
mean_df = pd.DataFrame(mean_df, columns=['56TI1164', '56FI1281', '56TI1281', '52FC1043', '57TI1501'])
mean_df
Unsuccessful attempt to condense:
col = list(data.columns)
mean_df = pd.DataFrame()
for i in range(0, 12):
    for j in col:
        mean_df[j].append(data_month[i][j].mean())
mean_df
As suggested by G. Anderson, you can use groupby as in this example:
import io
import pandas as pd
from datetime import datetime

csv = """timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 120.0000 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 124.0000 17967.33 39.11348 9348.223 48.94284"""

# The following lines read your data into a pandas dataframe;
# it may help if your data comes in the form you wrote in the question
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
data = pd.read_csv(io.StringIO(csv), sep=r'\s+(?!\d\d:\d\d:\d\d)',
                   date_parser=dateparse, index_col=0, engine='python')

# Here is where your data is grouped by month and the mean is calculated
data.groupby(pd.Grouper(freq='M')).mean()

# If you have missing months, use this instead:
# data.groupby(pd.Grouper(freq='M')).mean().dropna()
Result of data.groupby(pd.Grouper(freq='M')).mean().dropna() will be:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
timestamp
2016-12-31 117.96730 17876.270 39.100740 9302.815 49.239630
2017-01-31 117.94445 17732.405 39.110535 9335.498 48.961015
2018-01-31 122.00000 17967.330 39.113480 9348.223 48.942840
Please note that I used data.groupby(pd.Grouper(freq='M')).mean().dropna() to get rid of the NaN rows for the missing months (I added some data for January 2018, skipping everything in between).
Also note that the somewhat convoluted read_csv call uses a regular expression as a separator: \s+ means one or more whitespace characters, while (?!\d\d:\d\d:\d\d) means "do not split on this whitespace if it is followed by something like 23:55:00".
Lastly, engine='python' avoids the warnings read_csv() emits when a regular expression separator is used.
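For what it's worth, when the timestamps already form the index, resampling gives the same monthly means; a short equivalent (recent pandas versions prefer the 'ME' month-end alias over 'M'):
# Equivalent to the groupby(pd.Grouper(freq='M')) call above.
monthly_means = data.resample('M').mean().dropna()
print(monthly_means)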

Pandas - Lower of Dates when not null

I have a dataframe with many timestamps. What I'm trying to do is get the lower (earlier) of two dates, but only if both columns are not null. For example:
Internal Review Imported Date Lower Date
1 2/9/2018 19:44
2 2/15/2018 1:20 2/13/2018 2:18 2/13/2018 2:18
3 2/7/2018 23:17 2/12/2018 9:34 2/7/2018 23:17
4 2/12/2018 9:25
5 2/1/2018 20:57 2/12/2018 9:24 2/1/2018 20:57
If I want the lower of Internal Review and Imported Date, rows one and four should not return any value, while the other rows should return the lower of the two dates because both columns contain dates. I know .min(axis=1) will return a date, but the columns can be null, which is the problem.
I tried copying something similar to here:
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end)
    result[~mask] = np.nan
    return result
and tried
def GetLowestDays(col1, col2, df):
    df = df.copy()
    Start = col1.copy().notnull()
    End = col2.copy().notnull()
    Col3 = [Start, End].min(axis=1)
    return col3
But I simply get "AttributeError: 'list' object has no attribute 'min'".
The following code should do the trick:
df['Lower Date'] = df[df['Internal Review'].notnull() & df['Imported Date'].notnull()][['Internal Review', 'Imported Date']].min(axis=1)
The new column is filled with the minimum only where both dates are present.
Nicolas
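As a quick check, here is the same idea on data resembling the question's table (the placement of the lone dates in rows 1 and 4 is assumed for illustration):
import pandas as pd

df = pd.DataFrame({
    'Internal Review': pd.to_datetime(['2/9/2018 19:44', '2/15/2018 1:20',
                                       '2/7/2018 23:17', None, '2/1/2018 20:57']),
    'Imported Date':   pd.to_datetime([None, '2/13/2018 2:18', '2/12/2018 9:34',
                                       '2/12/2018 9:25', '2/12/2018 9:24']),
})

# Rows where either column is NaT are excluded by the boolean mask, so
# their 'Lower Date' stays NaT after the aligned assignment.
both = df['Internal Review'].notnull() & df['Imported Date'].notnull()
df['Lower Date'] = df[both][['Internal Review', 'Imported Date']].min(axis=1)
print(df)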

Taking certain column values from one row in a Pandas Dataframe and adding them into another dataframe

I would like to copy certain column values from a specific row in my dataframe df to another dataframe called bestdf.
Here I create an empty dataframe (called bestdf):
new_columns = ['DATE', 'PRICE1', 'PRICE2']
bestdf = pd.DataFrame(columns = new_columns)
bestdf.set_index(['DATE'])
I've located a certain row of df and assigned it to a variable last_time:
last_time = df.iloc[-1]
print last_time
gives me
DATETIME PRC
2016-10-03 00:07:39.295000 335.82
I then want to take the 2016-10-03 from the DATETIME column and put that into the DATE column of my other dataframe (bestdf).
I also want to take the PRC and put it into the PRICE1 column of my empty dataframe. I want bestdf to look like this:
DATE PRICE1 PRICE2
2016-10-03 335.82
Here is what I've got so far:
sample_date = str(last_time).split()
best_price = sample_date[2]
sample_date = sample_date[0]
bestdf['DATE'] = sample_date
bestdf['PRICE1'] = best_price
This doesn't seem to work though. FYI, I also want to put this into a loop (where last_time will change and each new set of values will be written to a new row). I'm just currently trying to get the functionality correct.
Please help!
Thanks
There are multiple ways to do what you are looking to do.
You can also break your problem down into multiple pieces; that way you can apply a separate step to solve each one.
Here is an example:
import pandas as pd

data = [{'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 335.29},
        {'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 33.9},
        {'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 10.9}]
df = pd.DataFrame.from_dict(data, orient='columns')
df
output:
DATETIME PRC
0 2016-10-03 00:07:39.295000 335.29
1 2016-10-03 00:07:39.295000 33.90
2 2016-10-03 00:07:39.295000 10.90
code continued:
bestdf = df[df['PRC'] > 15].copy()
# we filter data from original df and make a copy
bestdf.columns = ['DATE','PRICE1']
# we change columns as we need
bestdf['PRICE2'] = None
bestdf
output:
DATE PRICE1 PRICE2
0 2016-10-03 00:07:39.295000 335.29 None
1 2016-10-03 00:07:39.295000 33.90 None
code continued:
bestdf['DATE'] = bestdf['DATE'].apply(lambda value: value.split(' ')[0])
# we change column format based on how we need it to be
bestdf
output:
DATE PRICE1 PRICE2
0 2016-10-03 335.29 None
1 2016-10-03 33.90 None
We can also do the same thing with datetime objects; it doesn't have to be a string.
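As a small sketch of that last remark, instead of the string split above you could parse the timestamps and keep only the date part:
# Alternative to the .apply(split) step: parse the timestamps and keep
# only the date part (datetime.date objects).
bestdf['DATE'] = pd.to_datetime(bestdf['DATE']).dt.date
bestdf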
