I have created a function that manipulates a couple of datasets and outputs a merged DataFrame. I have passed an array of variables in a loop, which outputs a merged DataFrame for each one - now I want all the results appended in a single DataFrame:
Function:
`
def backtest(ticker, data):
    fin = si.get_data(ticker)
    fin.index.rename('date', inplace=True)
    fin = fin.reset_index(level=0)
    fin = fin.drop(columns=['high', 'low', 'volume'])
    fin['intraday_ch_usd'] = fin['close'] - fin['open']
    fin['intraday_pct_ch'] = fin['intraday_ch_usd'] / fin['open'] * 100
    fin['3d_pr'] = fin['close'].shift(-3)
    fin['3d_del'] = fin['3d_pr'] - fin['open']
    fin['3d_pct_ch'] = fin['3d_del'] / fin['open'] * 100
    data = data[data['awardee_parent_ticker_symbol'].notna()]
    data = data.rename(columns={'date_of_news_dispatch': 'date', 'awardee_parent_ticker_symbol': 'ticker'})
    data["date"] = pd.to_datetime(data["date"])
    data = data.merge(fin, on=['date','ticker'])
    data = pd.DataFrame(data=data)
`
Loop:
`
output=pd.DataFrame()
for ticker in tickers:
    try:
        backtest(ticker, data)
    except:
        pass
    output=output.append(data,ignore_index=True)
output
`
I can't figure out how to append the results into a single DataFrame.
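Long story short, you should not dynamically append rows to a DataFrame: every append copies the whole frame, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. To achieve the result you want, collect the per-ticker results in a list and then call pd.concat once to create a DataFrame from it. Note that backtest has to end with `return data`; as written it returns None, so there is nothing to collect. Something like:
output = []
for ticker in tickers:
    try:
        # backtest must end with `return data` for this to work
        output.append(backtest(ticker, data))
    except Exception:
        pass
output = pd.concat(output, ignore_index=True)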
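To make the collect-then-concat pattern concrete, here is a minimal runnable sketch. The `backtest` below is a hypothetical stand-in (it only filters rows by ticker and raises when nothing matches), and the sample data and ticker names are made up for illustration; the real function would do the download and merge.

```python
import pandas as pd


def backtest(ticker, data):
    """Hypothetical stand-in for the real backtest: return the rows for one ticker."""
    result = data[data["ticker"] == ticker]
    if result.empty:
        raise ValueError(f"no rows for {ticker}")
    return result


# Made-up sample data, only for illustration.
data = pd.DataFrame({"ticker": ["AAA", "BBB", "AAA"], "value": [1, 2, 3]})
tickers = ["AAA", "BBB", "ZZZ"]

frames = []
for ticker in tickers:
    try:
        frames.append(backtest(ticker, data))
    except ValueError:
        # Skip tickers that fail instead of aborting the whole run.
        continue

# One concat at the end instead of growing a DataFrame inside the loop.
output = pd.concat(frames, ignore_index=True)
```

Catching a specific exception rather than using a bare `except:` keeps unrelated errors, such as typos in column names, visible.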
I have written code to retrieve JSON data from a URL. It works fine: I give it a start and end date, and it loops through the date range and appends everything to a DataFrame.
The columns are populated with the sensors from the JSON data and their corresponding values, so the column names look like sensor_1. Sometimes new sensors appear and old ones are switched off and no longer deliver data, so the number of columns often changes. In that case my code just adds new columns.
What I want instead of new columns is a new header row in the ongoing DataFrame.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
start_date = datetime.datetime(2023,1,1,0,0)
end_date = datetime.datetime(2023,1,6,0,0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        datetime = pd.to_datetime(np.array(j['result']['data'])[:,0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:,1:]
        df_new = pd.DataFrame(data, index=datetime, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do, you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ ]]: the inner list selects several columns at once.
Then use .dropna() on each result to drop the rows that contain NaN.
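A small sketch of that split, assuming (as in the question's loop) that the timestamps are the index rather than a column. The sample values are abbreviated from the question; writing each block with `to_csv` is one possible way to get a separate header per block, not something the original answer prescribes.

```python
import numpy as np
import pandas as pd

# Abbreviated sample: two old sensors, one new sensor, timestamps as the index.
df = pd.DataFrame(
    {
        "sensor_1": [23.2, 13.2, np.nan, np.nan],
        "sensor_2": [43.5, 33.5, np.nan, np.nan],
        "new_sensor_8": [np.nan, np.nan, 75.0, 23.0],
    },
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-05"]),
)
df.index.name = "datetime"

# Select each sensor group with a list of column names (double brackets),
# then drop the rows in which every selected sensor is NaN.
df1 = df[["sensor_1", "sensor_2"]].dropna(how="all")
df2 = df[["new_sensor_8"]].dropna(how="all")

# to_csv() without a path returns a string, so the two blocks can be
# concatenated into one text in which each block carries its own header.
csv_text = df1.to_csv(sep=";") + df2.to_csv(sep=";")
```

Using `how="all"` keeps a row as long as at least one sensor in the group reported a value; plain `.dropna()` would also drop rows with a single missing reading.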
I have been able to get the calculation to work but now I am having trouble appending the results back into the data frame e3. You can see from the picture that the values are printing out.
brand_list = list(e3["Brand Name"])
product_segment_list = list(e3['Product Segment'])
# Create a list of tuples: data
data = list(zip(brand_list, product_segment_list))
for i in data:
    step1 = e3.loc[(e3['Brand Name']==i[0]) & (e3['Product Segment']==i[1])]
    Delta_Price = (step1['Price'].diff(1).div(step1['Price'].shift(1),axis=0).mul(100.0))
    print(Delta_Price)
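It's easier to use groupby. In each iteration, r holds just the rows of e3 that belong to one (Brand Name, Product Segment) group, and i is the key of that group.
new_df = []
for i, r in e3.groupby(['Brand Name', 'Product Segment']):
    r = r.copy()  # work on a copy so the new column is not written to a view of e3
    price_num = r["Price"].diff(1).values
    price_den = r["Price"].shift(1).values
    r['Price Delta'] = price_num / price_den * 100.0  # percent change, as in the question
    new_df.append(r)
# stack the groups on top of each other (rows), not side by side
e3_ = pd.concat(new_df, axis=0)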
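The same calculation can also be written without an explicit loop, which puts the result straight back into e3. This is a sketch on made-up sample data, using the column names from the question; `diff` and `shift` are applied per group, so the first row of each group gets NaN.

```python
import pandas as pd

# Made-up sample data using the question's column names.
e3 = pd.DataFrame(
    {
        "Brand Name": ["A", "A", "B", "B"],
        "Product Segment": ["x", "x", "y", "y"],
        "Price": [100.0, 110.0, 50.0, 40.0],
    }
)

# Group once, then compute the per-group difference and the per-group previous price.
grouped_price = e3.groupby(["Brand Name", "Product Segment"])["Price"]
e3["Price Delta"] = grouped_price.diff() / grouped_price.shift() * 100.0
```

Because the result is aligned on the original index, no concat step is needed afterwards.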
I am new to pandas and have a question about returning a DataFrame from a function. I have a function that creates three new DataFrames based on the parameters given to it, and it should return only the DataFrames that are non-empty. How do I do that?
My code:
def df_r(df,colname,t1):
    t1_df = pd.DataFrame()
    t2_df = pd.DataFrame()
    t3_df = pd.DataFrame()
    if t1 :
        for colname in df:
            some code
            some code
            t1_df = some data
    if t2 :
        for colname in df:
            some code
            some code
            t2_df = some data
    if t3 :
        for colname in df:
            some code
            some code
            t3_df = some data
    list = [t1_df,t2_df,t3_df]
Now it should return only t1_df, because only the t1 parameter was given. So I have put all three into a list:
list = [t1_df,t2_df,t3_df]
How do I check which DataFrame is non-empty and return it?
Just check the empty attribute of each DataFrame, e.g.:
df = pd.DataFrame()
if df.empty:
    print("DataFrame is empty")
Output:
DataFrame is empty
df.empty is True if the DataFrame is empty and False otherwise.
This works even if column names are present but there are no rows of data.
So, to answer your specific case:
frames = [t1_df, t2_df, t3_df]
for df in frames:
    if not df.empty:
        return df
This assumes that only one of the DataFrames is non-empty. The same check written out explicitly:
if not t1_df.empty:
    return t1_df
elif not t2_df.empty:
    return t2_df
else:
    return t3_df
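If more than one DataFrame can be non-empty, returning a list of all non-empty ones may be more convenient. A minimal sketch; the helper name `non_empty_frames` and the sample frames are assumptions made for illustration.

```python
import pandas as pd


def non_empty_frames(*frames):
    """Return only the DataFrames whose .empty attribute is False."""
    return [frame for frame in frames if not frame.empty]


t1_df = pd.DataFrame({"a": [1, 2]})
t2_df = pd.DataFrame()
t3_df = pd.DataFrame(columns=["b"])  # has a column but no rows, so it still counts as empty

result = non_empty_frames(t1_df, t2_df, t3_df)
```

The caller can then check `len(result)` or loop over the list instead of guessing which frame came back.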
I know my code isn't close to right, but I am trying to loop through a list of CSVs, line by line, to create a new CSV where each line lists all the CSVs that met a condition. The first column in every CSV is "date"; I want to list the names of all CSVs where data["entry"] > 3 on that date, with date still being the first column.
Update: for each CSV, I want to build a list of the dates on which the condition was met, and on those dates append file_name to the matching row(s) of the new CSV.
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
complete_string = ' is complete'
listdrs_confirmation = [ x + complete_string for x in listdrs]
#print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)
    ########################################
    ####function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        ##print(listdr)
        # Convert date to timestamp and make index
        data.index = data["date"].apply(lambda x: pd.Timestamp(x))
        data.drop("date", axis=1, inplace=True)
        return data
    ##create new table and append data
    data = data[data.Entry > 3]
    for date in data.date:
        new_table[date].append(file_path)
new_table_data = data.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
print(new_table_data)
I would do something like this. You need to modify the following snippet according to your needs.
import pandas as pd
from glob import glob
from collections import defaultdict
# create and save some random data
df1 = pd.DataFrame({'date':[1,2,3], 'entry':[4,3,2]})
df2 = pd.DataFrame({'date':[1,2,3], 'entry':[1,2,4]})
df3 = pd.DataFrame({'date':[1,2,3], 'entry':[3,1,5]})
df1.to_csv('table1.csv')
df2.to_csv('table2.csv')
df3.to_csv('table3.csv')
# read all the csv
tables = glob('*.csv')
new_table = defaultdict(list)
# create new table
for table in tables:
    df = pd.read_csv(table)
    df = df[df.entry > 2]
    for date in df.date:
        new_table[date].append(table)
new_table_df = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
print (new_table_df)
   date            table names
0     1  table3.csv,table1.csv
1     2             table1.csv
2     3  table2.csv,table3.csv
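The same table can be built with a single groupby instead of a dictionary of lists. This is a sketch under the assumption that the frames are already loaded; the dictionary below stands in for the CSV files (in practice each value would come from `pd.read_csv`), and the file names are the ones from the snippet above.

```python
import pandas as pd

# Stand-ins for the CSV files; in practice each frame would come from pd.read_csv.
tables = {
    "table1.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [4, 3, 2]}),
    "table2.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [1, 2, 4]}),
    "table3.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [3, 1, 5]}),
}

# Tag every row with the table it came from and stack everything into one frame.
combined = pd.concat(
    [df.assign(table=name) for name, df in tables.items()],
    ignore_index=True,
)

# Keep the rows that meet the condition, then join the table names per date.
matches = combined[combined["entry"] > 2]
new_table_df = (
    matches.groupby("date")["table"]
    .agg(",".join)
    .reset_index(name="table names")
)
```

Here the order of names within a row follows the order in which the frames were concatenated, so it does not depend on the file system order that `glob` returns.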
Had some issues with the other code, here is the final solution I was able to come up with.
if 'Entry' in data:
    ##create new table and append data
    data = data[data.Entry > 3]
    if 'date' in data:
        for date in data.date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.date == date].Entry)]}))
        new_table
    elif 'Date' in data:
        for date in data.Date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.Date == date].Entry)]}))
# sorted(new_table, key=lambda x: x[0])
def find_max(tbl):
    new_table_data = {}
    for date in sorted(tbl.keys()):
        merged_dt = pd.concat(tbl[date])
        max_entry_v = max(list(merged_dt.Entry))
        tbl_names = list(merged_dt[merged_dt.Entry == max_entry_v].FileName)
        new_table_data[date] = tbl_names
    return new_table_data
new_table_data = find_max(tbl=new_table)
#df = pd.DataFrame(new_table, columns =['date', 'tickers'])
#df.to_csv(input_path, index = False, header = True)
# find_max(new_table)
# new_table_data = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())],
# columns=['date', 'table names'])
print(new_table_data)
I have read data in chunks over a pyodbc connection using something like this:
import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Details")
sql = "SELECT * from TABLES;"
df1 = pd.read_sql(sql,conn,chunksize=10)
Now I want to read all these chunks into one single spark dataframe using something like:
i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1
The problem is that when I call df2.count() I get 10, which means only the i == 0 case is taking effect. Is this a bug in unionAll, or am I doing something wrong?
The documentation for .unionAll() states that it returns a new dataframe so you'd have to assign back to the df2 DataFrame:
i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1
Furthermore you can instead use enumerate() to avoid having to manage the i variable yourself:
for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
Furthermore the documentation for .unionAll() states that .unionAll() is deprecated and now you should use .union() which acts like UNION ALL in SQL:
for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))
Edit:
Furthermore I'll stop saying furthermore but not before I say furthermore: as @zero323 says, let's not use .union() in a loop. Let's instead do something like:
def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))
df_all = unionAll(*df_list)  # unpack the list: unionAll takes the DataFrames as separate arguments
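If the full result fits in driver memory, another option is to combine the pandas chunks first and create the Spark DataFrame only once. The sketch below shows just the pandas side; the in-memory SQLite database is an assumption standing in for the pyodbc connection, and the table contents are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a hypothetical stand-in for the pyodbc connection.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(25)}).to_sql("tables", conn, index=False)

# With chunksize set, read_sql returns an iterator of DataFrames.
chunks = pd.read_sql("SELECT * FROM tables;", conn, chunksize=10)

# pd.concat accepts any iterable of DataFrames, so the iterator can be passed directly.
combined = pd.concat(chunks, ignore_index=True)
conn.close()

# A single sqlContext.createDataFrame(combined) call would then replace the per-chunk unions.
```

This gives up the memory benefit of chunked reading, so it only makes sense when the chunking was not needed to keep the data within memory limits.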