Great, thanks for your help! This is my code now. I just need to fire once, so one iteration. Is this the solution? Column B leaves one empty cell and then adds '2', probably because it looks at the index:
import pandas as pd
from pathlib import Path
data_folder = Path(PATH)
file_to_open = data_folder / "excelbestand.xlsx"
df = pd.read_excel(file_to_open)
data_x = 4
data_y = 2
df.loc[df.index.max()+1, ['A']] = data_x
df.loc[df.index.max()+1,['B']] = data_y
df.to_excel(file_to_open, index = False)
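The empty cell most likely comes from evaluating df.index.max()+1 twice: the first assignment already adds a row, so the second call returns a label one higher. A minimal sketch of one way around it, computing the new label once (append_row and the sample data are made up for illustration, they are not from the post):

```python
import pandas as pd

def append_row(df, values):
    """Return a copy of df with one extra row; `values` maps column name -> value."""
    out = df.copy()
    # Work out the new row label once, before any assignment enlarges the frame
    new_label = out.index.max() + 1 if len(out) else 0
    for col, val in values.items():
        out.loc[new_label, col] = val
    return out

# Hypothetical data standing in for the Excel sheet
df = pd.DataFrame({"A": [1, 2, 3], "B": [5, 6, 7]})
result = append_row(df, {"A": 4, "B": 2})
print(result)
```

Both values land in the same new row, so column B no longer skips a cell.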
First of all, it looks like you need to read the Excel file into a pandas dataframe first:
import pandas as pd
df = pd.read_excel("name_your_file.xlsx", sheet_name='name_your_sheet')
num_iterations = 10  # Number of times you want to perform the action
i = 0
while i < num_iterations:
    data_x = df['A'].count()
    # Append the count of column "A" to the end of column "A"
    df.loc[df.index.max() + 1, 'A'] = data_x
    i += 1
df.iloc is very powerful for this kind of positional access: df.iloc[row number, column number]:
columns: A, B, C, D; column numbers: 0, 1, 2, 3
import pandas as pd
from pathlib import Path
data_folder = Path(PATH)
file_to_open = data_folder / "excelbestand.xlsx"
df = pd.read_excel(file_to_open)
num_iterations = 1
data_x = 4
i = 0
##### This loop appends data_x to the end of column "A" (the first column)
while i < num_iterations:
    df.loc[df.index.max() + 1, df.columns[0]] = data_x
    i += 1
########### Code for adding to the last cell of column B (the second column)
i2 = 0
while i2 < num_iterations:
    df.loc[df.index.max() + 1, df.columns[1]] = data_x
    i2 += 1
df.to_excel(file_to_open, index = False)
My current code is extremely slow because of the nested for loops. I would like to speed it up, and I assume the solution is vectorization with pandas or NumPy, but I do not know how to convert my current code to that form.
I have created an example below.
import pandas as pd
import numpy as np
balance = 10000
raw_data = [[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]]
raw_df = pd.DataFrame(raw_data, columns=['D','O','H','L','C'])
history_data = [[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]]
history_df = pd.DataFrame(history_data, columns=['TY','ST','OP','CL','SL'])
for n in raw_df.index:
    for p in history_df.index:
        if history_df.loc[p, 'ST'] == 1 and history_df.loc[p, 'TY'] == 1 and history_df.loc[p, 'SL'] >= raw_df.loc[n, 'L']:
            history_df.loc[p, 'CL'] = raw_df.loc[n, 'L']
            history_df.loc[p, 'ST'] = 0
            balance = balance + 20
    if raw_df.loc[n, 'C'] > 4:
        new_row = pd.DataFrame([{'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}])
        history_df = pd.concat([history_df, new_row], ignore_index=True)
Check out this example and see if it helps:
import numpy as np
import pandas as pd
# Build a boolean mask that checks history_df against raw_df row by row.
# Note: this pairs row i of history_df with row i of raw_df (both frames need the same index);
# it does not compare every history row with every raw row like the nested loops do.
mask = ((history_df['ST'] == 1) & (history_df['TY'] == 1) & (history_df['SL'] >= raw_df['L'])).to_numpy()
history_df.loc[mask, 'CL'] = raw_df.loc[mask, 'L']
history_df.loc[mask, 'ST'] = 0
# Calculate the balance change
balance_change = 20 * int(mask.sum())
balance += balance_change
# Append one new row to history_df for each raw_df row where C > 4
n_new = int((raw_df['C'] > 4).sum())
new_rows = pd.DataFrame({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}, index=range(n_new))
history_df = pd.concat([history_df, new_rows], ignore_index=True)
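If you need the same result as the nested loops (every history row checked against every raw row, in order), a broadcast comparison may be closer. This is only a sketch under two assumptions read off the example: the C > 4 check runs once per raw_df row, and the appended rows always have TY = 0, so they can never be closed later. close_and_append is a made-up name.

```python
import numpy as np
import pandas as pd

def close_and_append(raw_df, history_df, balance):
    hist = history_df.copy()
    # Rows that are open (ST == 1) and of type 1 can be closed
    eligible = ((hist['ST'] == 1) & (hist['TY'] == 1)).to_numpy()
    sl = hist['SL'].to_numpy(dtype=float)
    low = raw_df['L'].to_numpy(dtype=float)
    # hits[p, n] is True when history row p would be closed by raw row n
    hits = sl[:, None] >= low[None, :]
    closed = eligible & hits.any(axis=1)
    # The loop closes a row at the first raw row that qualifies
    first_hit = hits.argmax(axis=1)
    hist.loc[closed, 'CL'] = low[first_hit[closed]]
    hist.loc[closed, 'ST'] = 0
    balance += 20 * int(closed.sum())
    # One new open row per raw row with C > 4
    n_new = int((raw_df['C'] > 4).sum())
    if n_new:
        new_rows = pd.DataFrame({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9},
                                index=range(n_new))
        hist = pd.concat([hist, new_rows], ignore_index=True)
    return hist, balance

# Example data from the question
raw_df = pd.DataFrame([[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]],
                      columns=['D','O','H','L','C'])
history_df = pd.DataFrame([[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]],
                          columns=['TY','ST','OP','CL','SL'])
new_history, new_balance = close_and_append(raw_df, history_df, 10000)
print(new_history)
print(new_balance)
```

The comparison matrix has len(history_df) x len(raw_df) entries, so for very large frames memory becomes the limit rather than loop speed.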
So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe that has more than 30 rows into several dataframes of 30 rows each. For example, my dataframe df0 has 156 rows, so I would separate it into several dataframes like this:
if len(df0) > 30:
    df0_A = df0[0:30]
    df0_B = df0[30:60]
    df0_C = df0[60:90]
    df0_D = df0[90:120]
    df0_E = df0[120:150]
    df0_F = df0[150:180]
else:
    df0 = df0
The problem with this code is that I need to repeat every following step for each piece, like this:
df0= pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)
df0= df0.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_A = df0_A.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_B = df0_B.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_C = df0_C.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_D = df0_D.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_E = df0_E.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_F = df0_F.to_string(header=False,
index=False,
index_names=False).split('\n')
df0= [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes and need to split each of them into five dataframes. Then I would need to write the same code 50 times!
I'm quite new to Python. Can anyone help me simplify this code, maybe with a simple for loop? Thanks.
You could probably automate it a little bit more, but this should be enough!
import numpy as np
import pandas as pd
df0 = pd.DataFrame({'Test': np.random.randint(100000, 999999, size=180)})
# Stays empty when df0 has 30 rows or fewer
df_dict = {}
if len(df0) > 30:
    x = 0
    y = 30
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df0_{df_letter}'
        # Take the next 30 rows and turn them into a list of row strings
        df_dict[df_name] = df0[x:y].to_string(header=False, index=False, index_names=False).split('\n')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 30
        y += 30
for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
Assuming you have one column for identification,
import numpy as np
import pandas as pd
def split_df(idf, idcol, nsize):
    g = idf.groupby(idcol)
    # Compute the size for each value of identification column
    size = g.size()
    dflist = []
    for _id, _idcount in size.items():
        # Only ids with more than `nsize` rows are split and returned
        if _idcount > nsize:
            # print(_id, ' = ', _idcount)
            idx = idf[idf[idcol].eq(_id)].index
            # print(idx)
            # lets split the array into near-equal parts of at most `nsize`
            # e.g. [1,2,3,4,5] with nsize = 2 will split into ([1,2], [3,4], [5])
            ilist = np.array_split(idx.to_numpy(), int(np.ceil(idx.shape[0] / nsize)))
            dflist += ilist
    return [idf.loc[idx].copy(deep=True) for idx in dflist]
df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1,3), 10).reshape(10, -1), np.random.rand(10,3))), columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
split_df(df, 'id', 2)
This is a great problem, you can use this (data is the DataFrame here):
# Create subsets of size 30 for the DataFrame
subsets = list(range(0, len(data), 30))
# Create start cutoffs for subsets of the DataFrame
start_cutoff = subsets
# Create end cutoffs for subsets of the DataFrame
end_cutoff = subsets[1:] + [len(data)]
# Zip the start cutoffs and end cutoffs into a List of Cutoffs
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing Splitted Dataframes
list_dfs = [data.iloc[cutoff[0]: cutoff[-1]] for cutoff in cutoffs]
# Convert each DataFrame to a list of row strings
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
# One list of comma-separated rows per 30-row chunk
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access the converted chunks by:
print(final_df_list[0])
print(final_df_list[1])
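Since the question mentions ten dataframes, the same idea can be wrapped in a function and applied to a dict of frames. A rough sketch: split_to_csv_rows and the sample frames are made up, and it uses to_csv instead of to_string plus split, which is a swapped-in technique (it keeps values that contain spaces in one field):

```python
import numpy as np
import pandas as pd

def split_to_csv_rows(df, chunk_size=30):
    """Split df into chunks of at most chunk_size rows; return each chunk as a list of CSV row strings."""
    chunks = [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]
    return [chunk.to_csv(header=False, index=False).splitlines() for chunk in chunks]

# Hypothetical stand-ins for df0, df1, ...
frames = {
    "df0": pd.DataFrame({"Test": np.arange(156)}),
    "df1": pd.DataFrame({"Test": np.arange(20)}),
}
result = {name: split_to_csv_rows(frame) for name, frame in frames.items()}
print(len(result["df0"]), len(result["df0"][-1]))
```

A frame with 30 rows or fewer simply comes back as a single chunk, so no separate if/else is needed.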
I tried to follow what was suggested here, but only a few columns seem to be affected when I try either of the following functions:
def get_col_widths(dataframe):
    # First we find the maximum length of the index column
    idx_max = max([len(str(s)) for s in dataframe.index.values] + [len(str(dataframe.index.name))])
    # Then, we concatenate this to the max of the lengths of column name and its values for each column, left to right
    return [idx_max] + [max([len(str(s)) for s in dataframe[col].values] + [len(col)]) for col in dataframe.columns]
for i, width in enumerate(get_col_widths(df)):
    worksheet.set_column(i, i, width)
def set_column_width(df):
    length_list = [len(x) for x in df.columns]
    for i, width in enumerate(length_list):
        worksheet.set_column(i, i, width)
I may not be using the functions above correctly, so here is my entire code:
import openpyxl
import sql_utils as sql
from datetime import date
from datetime import timedelta
from dateutil.relativedelta import relativedelta
from pandas.tseries.holiday import USFederalHolidayCalendar
import pyodbc
import pandas as pd
import datetime
import numpy as np
import xlsxwriter
from openpyxl import load_workbook
import sys
import win32com.client as win32
from win32com.client import Dispatch
cal = USFederalHolidayCalendar()
# pylint: disable=no-member
cnxn = pyodbc.connect(sql.connection_string)
# Set dates for the queries
cutoff_date = datetime.date(2019, 6, 30)
output_file = ".\\pa-jpm-reporting\\output\\jpm_morgans_report.xlsx"
if len(sys.argv) > 1:
    cutoff_date = datetime.datetime.strptime(sys.argv[1], '%Y-%m-%d').date()
    output_file = sys.argv[2]
writer = pd.ExcelWriter(output_file)
def get_first_business_day_of_month(start_date, end_date):
    return [
        get_business_day(d).date()
        for d in pd.date_range(start_date, end_date, freq="BMS")
    ]
def get_business_day(date):
    while date.isoweekday() > 5 or date in cal.holidays():
        date += timedelta(days=1)
    return date
# Get as_of_date and archive_date
archive_date = get_first_business_day_of_month(cutoff_date, cutoff_date + relativedelta(days=+10))[0]
as_of_date = datetime.date(archive_date.year, 1, 1)
# Pull date
def get_sql(cutoff_date, archive_date, as_of_date):
    return sql.jpm_rate_query.format(cutoff_date, archive_date, as_of_date)
def get_sql2(cutoff_date, archive_date, as_of_date):
    return sql.jpm_query.format(cutoff_date, archive_date, as_of_date)
# Get data into a pandas dataframe
def get_dataframe(cutoff_date, archive_date, as_of_date):
    cnxn.execute(get_sql(cutoff_date, archive_date, as_of_date))
    data = pd.read_sql(get_sql2(cutoff_date, archive_date, as_of_date), cnxn)
    return data
df = get_dataframe(cutoff_date, archive_date, as_of_date)
df['ModEffectiveDate'] = pd.to_datetime(df['ModEffectiveDate'])
df['StepDate1'] = pd.to_datetime(df['StepDate1'])
# Fix step date
def date_check(date1, date2):
    if date1 > date2:
        return 'X'
    else:
        return ' '
df['Step date prior to mod'] = df.apply(lambda x: date_check(x['ModEffectiveDate'], x['StepDate1']), axis=1)
# Fix step check
cols = df.filter(regex='StepRate').columns
df['Later Step Rate not higher than previous'] = ' '
for i, col in enumerate(cols):
    if i <= len(cols) - 2:
        df['Later Step Rate not higher than previous'] = np.where(df[col] > df[cols[i+1]],'X',df['Later Step Rate not higher than previous'])
    else:
        break
# Format the Dates
df['CutoffDate'].astype('datetime64[ns]')
df['CutoffDate'] = pd.to_datetime(df['CutoffDate']).dt.strftime("%m/%d/%Y")
df['ModEffectiveDate'].astype('datetime64[ns]')
df['ModEffectiveDate'] = pd.to_datetime(df['ModEffectiveDate']).dt.strftime("%m/%d/%Y")
df['StepDate1'].astype('datetime64[ns]')
df['StepDate1'] = pd.to_datetime(df['StepDate1']).dt.strftime("%m/%d/%Y")
df = df.replace('NaT', '', regex=True)
# clean up (remove) zero step rates that follow the last true step
cols = df.filter(regex='StepRate').columns
for i, col in enumerate(cols):
    df[col] = df[col].replace(0, '', regex=True)
# remove repeated loan number shown immediately before the step dates
df = df.drop(['loanid'], axis=1)
# Add color to column
def highlight(s):
    same = s == df['CutoffDate']
    return ['background-color: lightblue' if x else '' for x in same]
df.style.highlight_null(null_color='green')
df.to_excel(writer, index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Freeze 1st row
worksheet.freeze_panes(1,0)
# Place dollar amounts where relevant
num_format = workbook.add_format({'num_format': '#,##0.00'})
# Find the columns (numbers) with float64 type and set the number format
nametypemap = df.dtypes.apply(lambda x: x.name).to_dict()
for i, (k, v) in enumerate(nametypemap.items()):
    # print("index: {}, key: {}, value: {}".format(i, k, v))
    if v == 'float64':
        worksheet.set_column(i, i, 12, num_format)
def get_col_widths(dataframe):
    # First we find the maximum length of the index column
    idx_max = max([len(str(s)) for s in dataframe.index.values] + [len(str(dataframe.index.name))])
    # Then, we concatenate this to the max of the lengths of column name and its values for each column, left to right
    return [idx_max] + [max([len(str(s)) for s in dataframe[col].values] + [len(col)]) for col in dataframe.columns]
for i, width in enumerate(get_col_widths(df)):
    worksheet.set_column(i, i, width)
def set_column_width(df):
    length_list = [len(x) for x in df.columns]
    for i, width in enumerate(length_list):
        worksheet.set_column(i, i, width)
set_column_width(df)
writer.save()
Please let me know if there is a way of doing this. I am also having a hard time trying to highlight a column in the dataframe. But I will make another separate post for that.
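Two things in the code above probably explain why only some columns look right. get_col_widths puts the index width first, but the sheet is written with index=False, so every width is applied one column too far to the right. After that, set_column_width(df) runs and resets each column to the length of its header only. A minimal sketch that measures the data columns alone (column_widths, apply_column_widths and the padding value are made up; worksheet is assumed to be the XlsxWriter worksheet from writer.sheets):

```python
import pandas as pd

def column_widths(df, padding=2):
    """Width per data column: the longer of the header and the longest cell text, plus padding."""
    widths = []
    for col in df.columns:
        longest_value = df[col].astype(str).str.len().max() if len(df) else 0
        widths.append(max(int(longest_value), len(str(col))) + padding)
    return widths

def apply_column_widths(worksheet, df, padding=2):
    # No index offset here: this matches df.to_excel(writer, index=False)
    for i, width in enumerate(column_widths(df, padding)):
        worksheet.set_column(i, i, width)

# Hypothetical data, just to show the computed widths
sample = pd.DataFrame({"name": ["a", "bbbbbbbb"], "x": [1, 22]})
print(column_widths(sample))
```

As far as I know, a later set_column call on the same column replaces the earlier one including its format, so the number format for the float64 columns may need to be passed again in the same call.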
Good afternoon everyone,
I am currently writing a thesis on the KMV model in python. I took inspiration from the code here to solve the non-linear equations. Here is the link to the CSV file used to create the dataframe. And this is the code I have so far:
Importation of the required modules
from datetime import datetime
import pandas as pd
import numpy as np
import scipy.optimize as sco
from scipy.stats import norm
df = pd.DataFrame()
df = pd.read_csv("AREX.csv", sep=';', engine = "python", decimal=',')
Functions to prepare the file for the model to run
def clean():
    # df.rename(columns ={"Date": "Date"}, inplace = True)
    # df["Date"] = pd.to_datetime(df['Date'])
    df.set_index("Date", inplace = True)
    df['AREX.O']=df['AREX.O'].astype(float)
    df.drop(['Total Short Term debt'], axis =1, inplace = True)
    return df
def preparation():
    df['e']=df['AREX.O']*df['Share Outstanding']
    df['Short Term Debt']=df['Debt']-df['Total Long term Debt']
    df['f']=df['Short Term Debt']+df['Total Long term Debt']*0.5
    df['log_ret'] = np.log(df['AREX.O']) - np.log(df['AREX.O'].shift(1))
    # df['stdev']=df['log_ret'].rolling(252).std()*m.sqrt(252)
    return df
Algorithm used to solve for a and sigma_a.
I only tried to adapt the code to my dataframe here
def algo1():
    # formatting the values as required
    df["f"] = df["f"].astype(float)
    df["e"] = df["e"].astype(float)
    # computation of the key input variable for the model
    df['a'] = df['f'].add(df["e"])
    # defining a function for the Black-Scholes equation
    def bseqn(a, debug=False):
        d1 = (np.log(a/f) + (r + 0.5*sigma_a**2)*T)/(sigma_a*np.sqrt(T))
        d2 = d1 - sigma_a*np.sqrt(T)
        y1 = e - (a*norm.cdf(d1) - np.exp(-r*T)*f*norm.cdf(d2))
        if debug:
            print("d1 = {:.6f}".format(d1))
            print("d2 = {:.6f}".format(d2))
            print("Error = {:.6f}".format(y1))
        return y1
    # Solving the model
    time_horizon=[1]
    timesteps = range(1, len(df))
    results = np.empty((df.shape[0],len(time_horizon)))
    # looping to solve for each row
    for i, years in enumerate(time_horizon):
        T = 1
        results[:,i] = df.loc[:,'a']
        for i_t, t in enumerate(timesteps):
            a = results[t-10:t,i]
            ra =np.log(a/np.roll(a,1))
            sigma_a = np.nanstd(ra) #gives initial value of sigma_a
            if i_t == 0:
                subset_timesteps = range(t-1, t+1)
                print(subset_timesteps)
            else:
                subset_timesteps = [t]
            n_its = 0
            while n_its < 10:
                n_its += 1
                for t_sub in subset_timesteps:
                    r = df.iloc[t_sub]['r']
                    f = df.iloc[t_sub]['f']
                    e = df.iloc[t_sub]['e']
                    sol = sco.fsolve(bseqn, results[t_sub,i]) #if I replace newton with fsolve the code works properly
                    results[t_sub,i] = sol[0] # stores the new values of a
                # Update sigma_a based on new values of a
                last_sigma_a = sigma_a
                a = results[t-10:t,i]
                ra = np.log(a/np.roll(a,1))
                sigma_a = np.nanstd(ra) #new val of sigma
                diff = last_sigma_a - sigma_a
                if abs(diff) < 1e-3:
                    df.loc[t_sub,'sigma_a'] = sigma_a
                    break
                else:
                    pass
    return df
Run function
def run():
    clean()
    preparation()
    algo1()
    print(df)
    print(list(df))
    # main_df = df.to_csv("AREX_D.csv")
The output should write the values of sigma_a into the newly created sigma_a column, but instead rows get added: instead of 1500 rows I end up with 3000 rows, most of them NaN. I do not understand which part of the code does that...
I suspect it to come from these lines:
diff = last_sigma_a - sigma_a
if abs(diff) < 1e-3:
    df.loc[t_sub,'sigma_a'] = sigma_a
    break
Does anyone have any insight into what is happening?
Thank you very much!
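A likely cause, judging only from the code shown: clean() makes "Date" the index, so the row labels are dates, while t_sub is an integer position. df.loc looks rows up by label, and assigning to a label that does not exist adds a new row instead of failing. The toy frame below is invented to show the effect and one possible fix (translating the position into a label first):

```python
import pandas as pd

# Toy frame with a date-string index, standing in for the AREX data
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "sigma_a": [float("nan")] * 3},
    index=pd.Index(["2019-01-01", "2019-01-02", "2019-01-03"], name="Date"),
)

# Integer used as a label: 1 is not in the index, so a new row is appended
by_label = df.copy()
by_label.loc[1, "sigma_a"] = 0.25

# Position converted to its label first: the existing row is updated
by_position = df.copy()
by_position.loc[by_position.index[1], "sigma_a"] = 0.25

print(len(by_label), len(by_position))
```

In algo1 that would mean writing to df.loc[df.index[t_sub], 'sigma_a'], assuming t_sub is meant as a row position, which is how df.iloc[t_sub] uses it a few lines earlier.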
I want to make a pandas dataframe with two columns: read_id and score.
I am using the following code:
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta","fasta"):
    reads_array.append(x)
columns = ["read_id","score"]
df = pd.DataFrame(columns = columns)
df = df.fillna(0)
for x in reads_array:
    alignments=pairwise2.align.globalms("ACTTGAT",str(x.seq),2,-1,-.5,-.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2),reverse = True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df?
At the top make sure you have
import numpy as np
Then replace the code you shared with
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)
# One row per read: empty string for read_id, 0.0 for score
df = pd.DataFrame({"read_id": [""] * len(reads_array), "score": np.zeros(len(reads_array))})
for index, x in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df.loc[index, 'read_id'] = read_id
    df.loc[index, 'score'] = score
Your original code had two main problems:
1) Your dataframe had 0 rows
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, all cells in that column get set to that value
df['read_id'] and df['score'] are Series. So if you want to iterate over reads_array, calculate some value, and assign it to df's columns, try the following:
for i, x in enumerate(reads_array):
    ...
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score
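Another option is to skip cell-by-cell assignment: collect one record per read in a plain list and build the dataframe once at the end. A small sketch with stand-ins: reads is a made-up list of (name, sequence) pairs in place of the SeqIO records, and toy_score is a placeholder for the best pairwise2 alignment score.

```python
import pandas as pd

def build_score_table(reads, score_fn):
    """Build a read_id/score dataframe from (read_id, sequence) pairs."""
    records = [{"read_id": read_id, "score": score_fn(seq)} for read_id, seq in reads]
    return pd.DataFrame(records, columns=["read_id", "score"])

def toy_score(seq, ref="ACTTGAT"):
    # Placeholder scoring: number of matching positions, NOT a real alignment score
    return sum(a == b for a, b in zip(ref, seq))

# Hypothetical reads standing in for the parsed FASTA records
reads = [("read1", "ACTTGAT"), ("read2", "ACTTGTT"), ("read3", "GGGG")]
df = build_score_table(reads, toy_score)
print(df)
```

With the real data, reads would come from something like (x.name, str(x.seq)) for each parsed record, and score_fn would wrap the globalms call from the question.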