How to edit a dataframe row by row while iterating? - python
So I am using a script that reads a CSV into a data frame and then scrapes price data using the tickers from that data frame. The original data frame has the following columns; note there is NO 'Price' column:
df.columns = ['Ticker TV', 'Ticker YF', 'TV Name', 'Sector', 'Industry', 'URLTV']
I've printed the first couple of rows of my "updated" data frame below:
        Ticker TV  Ticker YF  ...  URLTV    Price
1            100D     100D.L  ...    URL      NaN
2            1GIS     1GIS.L  ...    URL      NaN
3            1MCS     1MCS.L  ...    URL      NaN
...           ...        ...  ...    ...      ...
2442          ZYT      ZYT.L  ...    URL      NaN
100D.L        NaN        NaN  ...    NaN  9272.50
1GIS.L        NaN        NaN  ...    NaN  8838.50
1MCS.L        NaN        NaN  ...    NaN  5364.00
As you can see, it's not working as intended. I would like to create a new column named Price and attach each price to the correct ticker, so 100D.L should get 9272.50; then, when the script iterates to the next ticker, it should add the next price value to 1GIS, and so forth.
tickerList = df['Ticker YF']

for tick in tickerList:
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    df.at[tick, 'Price'] = currentPriceData
Assign the price using the apply method:
df['Price'] = df['Ticker YF'].apply(lambda x: str(priceData(getSummary(x))))
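For context on why the original loop grew new rows: df.at[tick, 'Price'] uses the ticker string as a row label, but the frame is indexed by integers, so pandas appends a new row per ticker instead of updating the existing one. A minimal loop-based sketch that keeps the question's structure (getSummary and priceData are the question's own scraping helpers):

# iterate over (index label, ticker) pairs so writes target existing rows
for idx, tick in df['Ticker YF'].items():
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    df.at[idx, 'Price'] = float(currentPriceData)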
tick is just the value from your 'Ticker YF' column, so you can use enumerate to get the index as well. And if you want to access the previous price to add them up, you can then just use the previous index (idx - 1):
tickerList = df['Ticker YF']

for idx, tick in enumerate(tickerList):
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    if idx != 0:
        # idx + 1 assumes df's index starts at 1, as in the question's printout,
        # so row idx holds the running total written in the previous pass
        df.at[idx + 1, 'Price'] = float(currentPriceData) + float(df.at[idx, 'Price'])
    else:
        df.at[idx + 1, 'Price'] = float(currentPriceData)
A more "elegant" idea could be something like:
df["Single_Price"]=df["Ticker YF"].apply(lambda x: priceData(getSummary(x)))
to get the single prices. Then create the next column with the accumulated prices:
df["Price"]=df["Ticker"].apply(lambda x: df["Single_Price"][df["Ticker"]<x["Ticker"]].sum())
this will add up every Single_Price (df["Single_Price"]) from every row whose ticker sorts before the current row's ticker x (df["Ticker YF"] < x) and creates a new column Price in your dataframe.
After that you can simply delete the single prices if you don't need them with:
del df["Single_Price"]
Related
Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date
I have two dataframes, one with earnings dates and a code for before market/after market, and the other with daily OHLC data.

First dataframe df:

       earnDate  anncTod
103  2015-11-18     0900
104  2016-02-24     0900
105  2016-05-18     0900
...         ...      ...
128  2022-03-01     0900
129  2022-05-18     0900
130  2022-08-17     0900

Second dataframe af:

   Datetime      Open      High       Low     Close   Volume
 2005-01-03   36.3458   36.6770   35.5522   35.6833  3343500
        ...       ...       ...       ...       ...      ...
 2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977

I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on the anncTod value, I want to find the close price of the previous day (if it is '0900') or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close moves, which will be stored in new columns on df.

I'm not sure how to search for matching values and fetch the value from that row but a different column. I'm trying to do this with df.iloc and a for loop. Here's the full code:

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ## col #2
df['Overnight'] = ''   ## col #3
df['Intraday'] = ''    ## col #4

for date in df['earnDate']:
    if df.iloc[date, 1] == '0900':
        priorday = af.loc[af.index.get_loc(date) - 1, 0]
        priorclose = af.loc[priorday, 4]
        open = af.loc[date, 1]
        close = af.loc[date, 4]
        df.iloc[date, 2] = close / priorclose
        df.iloc[date, 3] = open / priorclose
        df.iloc[date, 4] = close / open
    else:
        print('afternoon')

I get an error:

    if df.iloc[date, 1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Converting the date columns to integers creates another error. Is there a better way I should go about doing this?

Ideal output would look like this (made-up numbers, abbreviated output):

   earnDate  anncTod  Total Move  Overnight Move  Intraday Move
 2015-11-18     0900          9%            7.2%           1.8%

but would include all the dates given in the first dataframe.

UPDATE

I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is the updated and simplified code (all else remains the same):

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''  ## col #2
df['Overnight'] = ''   ## col #3
df['Intraday'] = ''    ## col #4

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date, 'Open'])  ## this is the line generating the error
    else:
        print('afternoon')

I now get KeyError: '2015-11-18'
To use loc to access a certain row, the label you search for must be in the index. Specifically, that means you'll need to set the date column as the index. Ex:

import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})

df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
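Applied to the question's code, the same fix would go on af, since that is the frame being probed with a date label. A minimal sketch, assuming af['Datetime'] holds the same calendar dates as df['earnDate']:

## index af by date so af.loc[date, 'Open'] can find the row
af = af.set_index(pd.DatetimeIndex(af['Datetime']))
print(af.loc['2015-11-18', 'Open'])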
RSI in Spyder using data in Excel
So I have an Excel file containing data on a specific stock. My file contains about 2 months of data; it monitors the Open price, Close price, High price, Low price and Volume of trades in 5-minute intervals, so there are about 3000 rows in my file. I want to calculate the RSI (or EMA if it's easier) of a stock daily. I'm making a summary table that collects the daily data, so it converts my table of 3000+ rows into a table with only about 60 rows (each row represents one day).

Essentially I want some sort of code that groups the data by date and then calculates the RSI as a single value for that day. RSI is given by 100 - (100 / (1 + RS)), where RS = average gain of up periods / average loss of down periods.

Note: my file uses 'Datetime', so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to look at just the date and ignore the time, I guess.

Some code to maybe help understand what I'm looking for. Here I'm reading my file; I want the code to take it, group the data by date and then calculate the RSI of each day using the formula I wrote above:

dat = pd.read_csv('AMD_5m.csv', index_col='Datetime',
                  parse_dates=['Datetime'],
                  date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)

# create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=dates, columns=cols)

# loop backtest by dates

This is the code I used to fill out the other columns in my summary table; I'll put my SMA (simple moving average) function below it.

for d in dates:
    this_dat = dat.loc[dat.index.date == d]
    # find the number of observations on date d
    summary_table.loc[d]['Num. Obs.'] = this_dat.shape[0]
    # get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)
    # find the number of trades on date d
    summary_table.loc[d]['Num. Trade'] = np.sum(np.diff(signals) == 1)
    # find the PnL for 100 shares
    shares = 100
    PnL = -shares * np.sum(this_dat['Close'].values[1:] * np.diff(signals))
    if np.sum(np.diff(signals)) > 0:  # close position at market close
        PnL += shares * this_dat['Close'].values[-1]
    summary_table.loc[d]['PnL'] = PnL
    # find the win ratio
    ind_in = np.where(np.diff(signals) == 1)[0] + 1
    ind_out = np.where(np.diff(signals) == -1)[0] + 1
    num_win = np.sum((this_dat['Close'].values[ind_out] - this_dat['Close'].values[ind_in]) > 0)
    if summary_table.loc[d]['Num. Trade'] != 0:
        summary_table.loc[d]['Win. Ratio'] = 1. * num_win / summary_table.loc[d]['Num. Trade']

This is my function for calculating the Simple Moving Average. I was told to try and adapt it for RSI or for EMA (Exponential Moving Average). Apparently adapting it for EMA isn't too troublesome, but I can't figure it out.
def SMA(p, window=10, signal_type='buy only'):
    # input: price "p", look-back window "window",
    # signal_type = 'buy only' (default) -- gives long signals,
    #               'sell only' -- gives sell signals,
    #               'both'      -- gives both long and short signals
    # returns a list of signals: 1 for a long position and -1 for a short position
    signals = np.zeros(len(p))
    if len(p) < window:  # no signal without sufficient data
        return signals
    sma = list(np.zeros(window) + np.nan)  # the first few prices do not give technical indicator values
    sma += [np.average(p[k:k + window]) for k in np.arange(len(p) - window)]
    for i in np.arange(len(p) - 1):
        if np.isnan(sma[i]):
            continue  # skip the open market time window
        if sma[i] < p[i] and (signal_type == 'buy only' or signal_type == 'both'):
            signals[i] = 1
        elif sma[i] > p[i] and (signal_type == 'sell only' or signal_type == 'both'):
            signals[i] = -1
    return signals
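Since the question also asks about EMA: pandas already ships an exponentially weighted window, so a daily EMA of the close does not need a hand-rolled adaptation of SMA. A minimal sketch, assuming dat is the 5-minute frame loaded above and a 10-period span:

# per-day exponential moving average of the close
ema = (dat.groupby(dat.index.date)['Close']
          .transform(lambda s: s.ewm(span=10, adjust=False).mean()))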
I have two solutions to this. One is to loop through each group and add the relevant data to summary_table; the other is to calculate the whole series and set the RSI column to it.

I first recreated the data:

import yfinance
import pandas as pd

# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours
df = yfinance.download("AAPL", period="60d", interval="5m")
# copied it and read it back as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")

df.head()
#                  Datetime        Open        High         Low       Close   Adj Close   Volume
# 0  2022-03-03 09:30-05:00  168.470001  168.910004  167.970001  168.199905  168.199905  5374241
# 1  2022-03-03 09:35-05:00  168.199997  168.289993  167.550003  168.129898  168.129898  1936734
# 2  2022-03-03 09:40-05:00  168.119995  168.250000  167.740005  167.770004  167.770004  1198687
# 3  2022-03-03 09:45-05:00  167.770004  168.339996  167.589996  167.718094  167.718094  2128957
# 4  2022-03-03 09:50-05:00  167.729996  167.970001  167.619995  167.710007  167.710007   968410

Then I formatted the data and created the summary_table:

df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date

# calculate percentage change from open to close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1

# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI']  # add additional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)

Option 1:

# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gains greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gains less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to the relevant cell of summary_table
    summary_table["RSI"].loc[grp] = 100 - (100 / (1 + (average_gain / average_loss)))

Option 2:

# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(rsi_calc)

Output (same for each):

summary_table.head()
#            Num. Obs. Num. Trade  PnL Win. Ratio           RSI
# 2022-03-03       NaN        NaN  NaN        NaN   -981.214015
# 2022-03-04       NaN        NaN  NaN        NaN    501.950956
# 2022-03-07       NaN        NaN  NaN        NaN   -228.379066
# 2022-03-08       NaN        NaN  NaN        NaN  -2304.451654
# 2022-03-09       NaN        NaN  NaN        NaN   -689.824739
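One caveat worth noting: the RSI values above fall outside the usual 0-100 range because the average loss is negative as computed, which makes RS negative. The conventional definition uses the magnitude of the average loss. A one-line adjustment to the sketch above, if you want bounded values:

def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = abs(series[series < 0].mean())  # magnitude keeps RSI within [0, 100]
    return 100 - (100 / (1 + (avg_gain / avg_loss)))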
How to get values from a dict into a new column, based on values in a column
I have a dictionary that contains all of the information for company ticker : sector, for example 'AAPL': 'Technology'. I have a CSV file that looks like this:

ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000

I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector. Code:

for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])

The code is giving me problems. For example, the first sector is healthcare, and I get this error:

ValueError: Healthcare is an unknown string function

Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to select values from a dict by using a column value as the key, because .map is the method specifically implemented for this operation. .map will return NaN if the ticker is not in the dict.

.apply can be used, but .map should be preferred:

df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))

.get will return None if the ticker isn't in the dict.

import pandas as pd

# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'],
                   'dimension': ['ARQ', 'ARQ', 'ARQ'],
                   'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'],
                   'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})

# in your case, load the data from the file
df = pd.read_csv('file.csv')

# display(df)
  ticker dimension calendardate     datekey
0   AAPL       ARQ   1999-12-31  2000-03-15
1   AAPL       ARQ   2000-03-31  2000-06-12
2   AAPL       ARQ   2000-06-30  2000-09-01

# dict of sectors
company_dict = {'AAPL': 'tech'}

# insert the sector column using map, at a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))

# display(df)
  ticker sector dimension calendardate     datekey
0   AAPL   tech       ARQ   1999-12-31  2000-03-15
1   AAPL   tech       ARQ   2000-03-31  2000-06-12
2   AAPL   tech       ARQ   2000-06-30  2000-09-01

# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
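If some tickers are missing from the dict and you'd rather see a placeholder than NaN, a small follow-up sketch (the 'Unknown' label is just an example):

df['sector'] = df['ticker'].map(company_dict).fillna('Unknown')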
# faster than a for loop
temp = sf1.ticker.map(lambda x: company_dic[str(x)])
sf1['sector'] = temp

You can pass na_action='ignore' if you have NaNs in the ticker column.
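For reference, na_action is an argument of Series.map itself; a sketch of that variant (the lambda is then never called with NaN, which would otherwise raise a KeyError on str(nan)):

sf1['sector'] = sf1.ticker.map(lambda x: company_dic[str(x)], na_action='ignore')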
Efficient iteration over pandas dataframe rows to calculate values for new data frame
I'm trying to create a dataframe where the columns relate to the IDs of sold items and the row indices are the IDs of the customers who bought those items. The cells should show how much each customer bought of each item.

To get this information I read a CSV file containing a row for every transaction made by customers. The file is parsed into the frame_ variable. I retrieve the customer and article IDs using the unique() function on the corresponding columns and use them to create a new dataframe with those IDs as column headers and row indices:

with open(f"{file_path}") as file:
    frame_ = pd.read_csv(file, sep="\t", header=None)

customer_ids = list(frame_[customer_index].unique())
item_ids = list(frame_[item_index].unique())

frame = pd.DataFrame.from_dict(
    dict.fromkeys(item_ids, dict.fromkeys(customer_ids, 0)))

For the next step I want to iterate over frame_ to check every row for 3 values: the customer ID, the item ID, and the amount of sold items. The amount should be added to the current value at frame.at[customer_id, item_id]:

for index, row in frame_.iterrows():
    customer = row[customer_index]
    item = row[item_index]
    amount = abs(float(row[2]))
    frame.at[customer, item] += amount

This part is especially slow due to me using iterrows(). I looked through some questions, but because I don't quite know what exactly I'm looking for, I couldn't find any solution on how to perform my task more efficiently. Thank you for your time and any suggestions you can offer.

Edit: the original file and the frame_ dataframe contain around ~2.5 million rows.

Edit 2: added an excerpt from frame_; "..." are other columns not relevant for this part. The column headers are actually 0-8; "ID", "amount", "itemID" and "customerID" were added for readability:

ID  ...  amount  ...  ...  itemID  ...  customerID  ...
1   ...  -5.0    ...  ...  1258    ...  805214      ...
2   ...  -10.0   ...  ...  3658    ...  798125      ...
3   ...  -7.5    ...  ...  2056    ...  589012      ...

Edit 3: expected output would look something like this:

        1258  3658  2056
805214   5.0     0     0
798125     0  10.0     0
589012     0     0   7.5
Start by preparing another column with the absolute values of the amounts (though I do not fully understand why you need abs and float - aren't your amounts already positive and numeric?):

import numpy as np
frame_["amount1"] = np.abs(frame_["amount"].astype(float))

Then aggregate by the customer and item indexes:

frame = frame_.groupby(["customerID", "itemID"])["amount1"].sum()

No explicit iteration needed. You can convert the result to a "wide" format if you want:

frame.unstack().fillna(0)
#itemID      1258  2056  3658
#customerID
#589012       0.0   7.5   0.0
#798125       0.0   0.0  10.0
#805214       5.0   0.0   0.0
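An equivalent one-step route to the wide format is pivot_table; a minimal sketch under the same column names as above:

# sum amounts per customer/item pair and fill missing pairs with 0
frame = frame_.pivot_table(index="customerID", columns="itemID",
                           values="amount1", aggfunc="sum", fill_value=0)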
Extracting data from a column in a data frame in Python
I want to extract the "A"(s) from this column. After doing that, I want to be able to print the data from the other columns associated with "A" in the same row. However, my code printed this instead:

outputs:
UniqueCarrier       NaN
CancellationCode    NaN
Name: CancellationCode, dtype: object
None

The column CancellationCode looks like this:

CancellationCode:
NaN
A
NaN
B
NaN

I want to get it to print in a data frame format with the filtered rows and columns. Here is my code:

cancellation_reason = (flight_data_finalcopy["CancellationCode"] == "A")
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
Try this:

cancellation_reason = flight_data_finalcopy[flight_data_finalcopy["CancellationCode"] == "A"]
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))
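The same filter and column selection can also be done in one step with .loc; a sketch assuming the same column names:

cancellation_reasons_filtered = flight_data_finalcopy.loc[
    flight_data_finalcopy["CancellationCode"] == "A",
    ["UniqueCarrier", "AirlineID", "Origin"]]
display(cancellation_reasons_filtered)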