Merge dataframe resulting in Series - python

I am working with the Texas Hospital Discharge Dataset and I am trying to determine the top 100 most frequent Principal Surgery Procedures over a period of 4 years.
To do this I need to go through each quarter of each year and count the procedures, but when I try to merge different quarters the result is a Series, not a DataFrame.
import pandas as pd

top_procedures = None
for year in range(6, 10):
    for quarter in range(1, 5):
        quarter_data = pd.read_table(
            filepath_or_buffer="/path/to/texas/data/PUDF_base"
            + str(quarter) + "q200" + str(year) + "_tab.txt",
        )
        # Exclude the two special THCIC_ID values
        quarter_data = quarter_data[quarter_data["THCIC_ID"] != 999999]
        quarter_data = quarter_data[quarter_data["THCIC_ID"] != 999998]
        quarter_procedures = quarter_data["PRINC_SURG_PROC_CODE"].value_counts()
        quarter_procedures = pd.DataFrame(
            {"PRINC_SURG_PROC_CODE": quarter_procedures.index,
             "count": quarter_procedures.values})
        top_procedures = quarter_procedures if (top_procedures is None) else \
            top_procedures.merge(
                right=quarter_procedures,
                how="outer",
                on="PRINC_SURG_PROC_CODE"
            ).set_index(
                ["PRINC_SURG_PROC_CODE"]
            ).sum(
                axis=1
            )
Could you please tell me what I am doing wrong? From the documentation it looks like merge should return a DataFrame.
Cheers,
Dan

The merge will indeed return a DataFrame, but in your code you then sum along axis=1 (all values in one row) after merging, which gives you a Series, since the values from all columns are collapsed into one final column.
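If you want to keep accumulating a DataFrame, one option (a rough sketch, not tested against the Texas data) is to turn the summed Series back into a frame before the next iteration:
merged = top_procedures.merge(quarter_procedures, how="outer",
                              on="PRINC_SURG_PROC_CODE")
summed = merged.set_index("PRINC_SURG_PROC_CODE").sum(axis=1)
# to_frame() converts the Series back into a DataFrame, and reset_index()
# restores PRINC_SURG_PROC_CODE as a column so the next merge still works
top_procedures = summed.to_frame(name="count").reset_index()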
Hope that helps.

Related

Dataframe insert sum of other dataframe

I have 2 DataFrames:
df_hist: daily data of share values
df_buy_data: dates when shares were bought
I want to add the share holdings to df_hist for each date, calculated from df_buy_data depending on the date. In my version I have to iterate over the DataFrame, which works, but I guess that is not so nice...
import pandas as pd

hist_data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
             'Value': [23, 22, 21, 24]}
df_hist = pd.DataFrame(hist_data)
buy_data = {'Date': ['2022-01-01', '2022-01-04'],
            'Ticker': ['Index1', 'Index1'],
            'NumberOfShares': [15, 29]}
df_buy_data = pd.DataFrame(buy_data)

for i, historical_row in df_hist.iterrows():
    # All shares bought on or before this row's date
    ticker_count = df_buy_data.loc[df_buy_data['Date'] <= historical_row['Date']]\
        .groupby('Ticker').sum()['NumberOfShares']
    if len(ticker_count) > 0:
        df_hist.at[i, 'Index1_NumberOfShares'] = ticker_count.item()
    else:
        df_hist.at[i, 'Index1_NumberOfShares'] = 0
df_hist
How can I improve this?
Thanks for the help!
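A possible vectorized alternative (an untested sketch, assuming the single ticker from the example; merge_asof requires sorted datetime keys) is to pre-compute cumulative holdings and join them with an as-of merge:
df_hist['Date'] = pd.to_datetime(df_hist['Date'])
df_buy_data['Date'] = pd.to_datetime(df_buy_data['Date'])
buys = df_buy_data.sort_values('Date')
buys['CumShares'] = buys['NumberOfShares'].cumsum()  # holdings after each buy
# The backward as-of merge picks the last buy with Date <= each history date,
# matching the <= filter in the loop above
df_hist = pd.merge_asof(df_hist.sort_values('Date'),
                        buys[['Date', 'CumShares']], on='Date')
df_hist['Index1_NumberOfShares'] = df_hist['CumShares'].fillna(0)
df_hist = df_hist.drop(columns='CumShares')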

Pandas capitalization of compound interest

I am writing an emulation of a bank deposit account in pandas.
I got stuck with compound interest (the result of reinvesting interest, so that interest in the next period is earned on the principal sum plus previously accumulated interest).
So far I have the following code:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from datetime import datetime

# Create a date range
start = '21/11/2017'
now = datetime.now()
date_rng = pd.date_range(start=start, end=now, freq='d')
# Create an example data frame with the timestamp data
df = pd.DataFrame(date_rng, columns=['Date'])
# Add column (LastDayOfMonth) - shows the last day of the current month
df['LastDayOfMonth'] = pd.to_datetime(df['Date']) + MonthEnd(0)
# Add columns for Debit, Credit, Total, Description
df['Debit'] = 0
df['Credit'] = 0
df['Total'] = 0
df['Description'] = ''
# Set "IsItLastDay" (no loop needed; the comparison is vectorized)
df['IsItLastDay'] = (df['LastDayOfMonth'] == df['Date'])
# Add the transaction of the first deposit
df.loc[df.Date == '2017-11-21', ['Debit', 'Description']] = 10000, "First deposit"
# Calculate the principal sum (the sum of all deposits minus all withdrawals
# plus all compound interest)
df['Total'] = (df.Debit - df.Credit).cumsum()
# Calculate interest per day and cumulative interest
# 11% is the interest rate per year
df['InterestPerDay'] = (df['Total'] * 0.11) / 365
df['InterestCumulative'] = ((df['Total'] * 0.11) / 365).cumsum()
# Change the order of columns
df = df[['Date', 'LastDayOfMonth', 'IsItLastDay', 'InterestPerDay',
         'InterestCumulative', 'Debit', 'Credit', 'Total', 'Description']]
df.to_excel("results.xlsx")
The output file looks fine, but I need the following:
The "InterestCumulative" column should be added to the "Total" column on the last day of each month (compounding the interest).
At the beginning of each month the "InterestCumulative" column should be cleared (because the interest was added to the principal sum).
How can I do this?
You're going to need to loop, as your total changes depending on previous rows, which then affects the later rows. As a result your current interest calculations are wrong.
total = 0
cumulative_interest = 0
total_per_day = []
interest_per_day = []
cumulative_per_day = []
for day in df.itertuples():
    total += day.Debit - day.Credit
    interest = total * 0.11 / 365
    cumulative_interest += interest
    if day.IsItLastDay:
        total += cumulative_interest
    total_per_day.append(total)
    interest_per_day.append(interest)
    cumulative_per_day.append(cumulative_interest)
    if day.IsItLastDay:
        cumulative_interest = 0
df.Total = total_per_day
df.InterestPerDay = interest_per_day
df.InterestCumulative = cumulative_per_day
This is unfortunately a lot more confusing-looking, but that's what happens when values depend on previous values. Depending on your exact requirements there may be nice ways to simplify this using math, but otherwise this is what you've got.
I've written this directly into Stack Overflow so it may not be perfect.
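On the "simplify using math" point: if the balance only ever changes through the monthly capitalization itself (a single initial deposit, no other cash flows), the month-end totals have a closed form. A hedged sketch (the trailing, still-open month would need special handling):
# Count the accrual days in each month, then compound once per month end.
days = df.groupby('LastDayOfMonth').size()      # accrual days per month
factors = 1 + days.to_numpy() * 0.11 / 365      # monthly growth factors
month_end_totals = 10000 * factors.cumprod()    # 10000 = the first deposit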

Matching names between two columns of two dataframes and adding new columns to one - long computing time

I have two dataframes:
df1 -> DataFrame of all German cities, their names, and more data.
df2 -> DataFrame of all German cities and their longitude and latitude.
I wrote a function that searches for a city name in both dataframes and returns the longitude and latitude:
# df_cities here refers to df2 from the description above
def ret_longlat(city_name):
    if sum(df_cities["city"] == city_name) > 0:
        long = df_cities["lon"][df_cities["city"] == city_name].iloc[0]
        lat = df_cities["lat"][df_cities["city"] == city_name].iloc[0]
    else:
        long = 0
        lat = 0
    return long, lat
In the next step I apply this function to all city names of df1 and save the results in new columns:
df_result["long"] = df_result["city_names"].apply(lambda x: ret_longlat(x)[0])
df_result["lat"] = df_result["city_names"].apply(lambda x: ret_longlat(x)[1])
This whole process takes relatively long (I'd say 5 minutes for 12162 rows).
Is there a way to improve the code?
Example Data:
df1
city
1 stadtA
2 stadtB
3 stadtu
4 stadty
5 stadtX
df2
city lat lon
14 stadtD 50.611879 12.135526
24 stadtA 48.698890 9.842890
25 stadtC 52.947222 12.849444
26 stadtB 52.867370 12.813750
27 stadtY 52.985000 12.854444
This is a merge problem. You can perform a left merge and then fill missing values (note that df2's coordinate columns are named lon and lat):
res = pd.merge(df1.rename(columns={'city_names': 'city'}),
               df2[['city', 'lon', 'lat']].drop_duplicates('city'),
               how='left', on='city')
res[['lon', 'lat']] = res[['lon', 'lat']].fillna(0)
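For reference, a self-contained sketch with the example data from the question (column names assumed from the sample frames):
import pandas as pd

df1 = pd.DataFrame({'city': ['stadtA', 'stadtB', 'stadtu', 'stadty', 'stadtX']})
df2 = pd.DataFrame({'city': ['stadtD', 'stadtA', 'stadtC', 'stadtB', 'stadtY'],
                    'lat': [50.611879, 48.698890, 52.947222, 52.867370, 52.985000],
                    'lon': [12.135526, 9.842890, 12.849444, 12.813750, 12.854444]})

res = pd.merge(df1, df2[['city', 'lon', 'lat']].drop_duplicates('city'),
               how='left', on='city')
res[['lon', 'lat']] = res[['lon', 'lat']].fillna(0)
# stadtA and stadtB receive coordinates; the unmatched cities get 0.
print(res)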

calculating slope on a rolling basis in pandas df python

I have a dataframe:
CAT ^GSPC
Date
2012-01-06 80.435059 1277.810059
2012-01-09 81.560600 1280.699951
2012-01-10 83.962914 1292.079956
....
2017-09-16 144.56653 2230.567646
and I want to find the slope of the stock against the S&P index over the last 63 days, for each period. I have tried:
x = 0
temp_dct = {}
for date in df.index:
    x += 1
    # The original max(...) call here discarded its result; presumably the
    # intent was to cap x so the 63-row window stays inside the frame:
    x = min(x, len(df.index) - 64)
    temp_dct[str(date)] = np.polyfit(df['^GSPC'][0 + x:63 + x].values,
                                     df['CAT'][0 + x:63 + x].values,
                                     1)[0]
However, I feel this is very "unpythonic", and I've had trouble integrating rolling/shift functions into this.
My expected output is a column called "Beta" that holds the slope of the S&P (x values) against the stock (y values) for all available dates.
# This will operate on a Series
def polyf(seri):
    return np.polyfit(seri.index.values, seri.values, 1)[0]

# You can store the original index in a column in case you need to reset
# back to it after fitting
df.index = df['^GSPC']
df['slope'] = df['CAT'].rolling(63, min_periods=2).apply(polyf, raw=False)
After running this, there will be a new column storing the fitting result.
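As a side note, a possible alternative that avoids a Python-level fit per window (a sketch using the original column layout, before the reindexing above): the OLS slope of y on x equals the rolling covariance divided by the rolling variance of x.
df['Beta'] = (df['CAT'].rolling(63, min_periods=2).cov(df['^GSPC'])
              / df['^GSPC'].rolling(63, min_periods=2).var())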

Pandas row analysis for consecutive dates

Following a "chain" of rows and counting the consecutive months from a CSV file.
Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID START-DATE END-DATE CANCEL_FLAG OLD_CON_ID
123456 2015-05-30 2016-05-30 0 8788
123457 2014-03-20 2015-03-20 0 12000
123458 2009-12-20 2010-12-20 0 NaN
...
I want to count the number of consecutive months a Contract chain goes for.
Example: take the START-DATE from the contract at the "front" of the chain (oldest contract) and the END-DATE from the end of the chain (newest contract). The oldest contract is defined as either the one just before a cancelled contract in a chain, or the one that has no OLD_CON_ID value.
Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. The desired output is how many months the contract chain goes back until a gap (i.e. the customer didn't have a contract for a period of time). If there is nothing in that column, that is the first contract in the chain.
CANCEL_FLAG should also cut the chain, because a value of 1 designates that the contract was cancelled.
Current code counts the number of active contracts for each year by filtering the dataframe like so:
df_contract = df_contract[
    (df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
    (df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) &
    (df_contract['CANCEL_FLAG'] == 0)
]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:
file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
Ideal output is something like:
FIRST_CONTRACT_ID CHAIN_LENGTH CON_MONTHS
1234567 5 60
1500001 1 4
800 10 180
Those data points would then be graphed.
EDIT2: CSV file changed, might be easier now. Question updated.
Not sure if I totally understand your requirement, but does something like this work?
df_contract['TOTAL_YEARS'] = ((df_contract['END_DATE'] - df_contract['START_DATE'])
                              / np.timedelta64(1, 'Y'))
df_contract.loc[(df_contract['CANCEL_FLAG'] == 1)
                & (df_contract['TOTAL_YEARS'] > 1), 'TOTAL_YEARS'] = 1
After a lot of trial and error I got it working!
This finds the time difference between the first and last contracts in the chain and finds the length of the chain.
Not the cleanest code by far, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID', test, 'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID': 'PID', 'CONTRACT_ID': 'CID'},
                inplace=True)
df_test = df_policy[['CONTRACT_ID', 'END_DATE']]
df_test.rename(columns={'CONTRACT_ID': 'CID', 'END_DATE': 'PED'}, inplace=True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID': 'PPID', 'CID': 'PID'}, inplace=True)
df_merge1 = pd.merge(df_short, df_copy2,
                     how='left',
                     on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace=True)
df_merge1.rename(columns={'START_DATE_x': '1_EFF', 'START_DATE_y': '2_EFF'},
                 inplace=True)
The copy, merge, fillna, and rename code is repeated for 5 merged dataframes then:
df_merged = pd.merge(df_merge5, df_test,
                     how='right',
                     on=['CID'])
df_merged['TOTAL_MONTHS'] = ((df_merged['PED'] - df_merged['6_EFF'])
                             / np.timedelta64(1, 'M'))
df_merged4 = df_merged[df_merged['PED'] >= pd.to_datetime('2015-07-06')]
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(
    ['PED', '1_EFF', '2_EFF', '3_EFF', '4_EFF', '5_EFF'],
    axis=1).apply(lambda row: len(pd.unique(row)), axis=1) - 3
Hopefully my code is understood and will help someone in the future.
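As a possible cleaner alternative (a hypothetical sketch, not the approach above; column names as in the question, dates already parsed with pd.to_datetime), each chain can be walked back explicitly from its newest contract:
import pandas as pd

def chain_stats(df):
    rows = df.set_index('CONTRACT_ID')
    referenced = set(df['OLD_CON_ID'].dropna())
    records = []
    for cid in rows.index:
        if cid in referenced:
            continue  # not a chain tail: a newer contract points back here
        length, cur = 1, cid
        while pd.notna(rows.at[cur, 'OLD_CON_ID']):
            prev = rows.at[cur, 'OLD_CON_ID']
            # Cut the chain at a gap: a missing or cancelled predecessor
            if prev not in rows.index or rows.at[prev, 'CANCEL_FLAG'] == 1:
                break
            cur, length = prev, length + 1
        months = (rows.at[cid, 'END_DATE']
                  - rows.at[cur, 'START_DATE']) / pd.Timedelta(days=30.44)
        records.append((cur, length, round(months)))
    return pd.DataFrame(records, columns=['FIRST_CONTRACT_ID',
                                          'CHAIN_LENGTH', 'CON_MONTHS'])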
