I am trying to create a DataFrame using a for loop. It runs, but the output is not correct: each cell of the DataFrame contains the data for all symbols. How can I fix it?
Here is the code:
from pandas_datareader import data
import datetime
from math import exp, sqrt
import pandas as pd

records = []
test = ['AAPL', 'AAL']
for i in test:
    stock_price = data.DataReader(test,
                                  start='2021-01-01',
                                  end='2021-04-01',
                                  data_source='yahoo')['Adj Close'][-100:]
    stock_volume = data.DataReader(test,
                                   start='2021-01-01',
                                   end='2021-04-01',
                                   data_source='yahoo')['Volume'][-100:]
    returns = stock_price.pct_change()
    ((1 + returns).cumprod() - 1)
    records.append({
        'underlyingSymbol': i,
        'last_price': stock_price.iloc[-1],
        '15d_highest': stock_price.iloc[-15:].max(),
        '15d_lowest': stock_price.iloc[-15:].min(),
    })
df = pd.DataFrame(records)
df
Since you're looping over symbols, you should change data.DataReader(test... to data.DataReader(i... (otherwise it reads data for both of them on every iteration):
for i in test:
    stock_price = data.DataReader(i,
                                  start='2021-01-01',
                                  end='2021-04-01',
                                  data_source='yahoo')['Adj Close'][-100:]
    stock_volume = data.DataReader(i,
                                   start='2021-01-01',
                                   end='2021-04-01',
                                   data_source='yahoo')['Volume'][-100:]
    ...
Output:
underlyingSymbol last_price 15d_highest 15d_lowest 15d_volume \
0 AAPL 123.000000 125.57 119.900002 92403800.0
1 AAL 23.860001 25.17 21.809999 93746800.0
30d_returns 15d_returns 7d_returns volatility
0 -0.047342 0.018057 0.024240 0.325800
1 0.266432 0.030475 0.092192 0.571564
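For what it's worth, a sketch of an alternative that avoids hitting the data source once per symbol (assuming the yahoo source is still reachable from pandas_datareader): fetch all tickers in one call, which returns one column per symbol under 'Adj Close', then slice out each column inside the loop.
import pandas as pd
from pandas_datareader import data

test = ['AAPL', 'AAL']

# One request for all tickers; the result has one column per symbol.
prices = data.DataReader(test, start='2021-01-01', end='2021-04-01',
                         data_source='yahoo')['Adj Close'][-100:]

records = []
for symbol in test:
    stock_price = prices[symbol]
    records.append({
        'underlyingSymbol': symbol,
        'last_price': stock_price.iloc[-1],
        '15d_highest': stock_price.iloc[-15:].max(),
        '15d_lowest': stock_price.iloc[-15:].min(),
    })

df = pd.DataFrame(records)
print(df)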
Essentially I have a CSV file which has an OFFENCE_CODE column and a column of dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occurring offence codes within the OFFENCE_CODE column; however, I need to do this between two dates from the OFFENCE_MONTH column.
import numpy as np
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))
You can use pandas.Series.between:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
m = df['OFFENCE_MONTH'].between(input_date1, input_date2)
df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)
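For illustration, a small self-contained run of the above (the data here is made up and stands in for penalty_data_set.csv):
import pandas as pd

# Tiny made-up stand-in for penalty_data_set.csv
df = pd.DataFrame({
    'OFFENCE_MONTH': ['2012-10-01', '2012-12-01', '2013-05-01', '2013-12-01'],
    'OFFENCE_CODE':  ['A1', 'B2', 'B2', 'C3'],
})
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])

m = df['OFFENCE_MONTH'].between(pd.to_datetime('2012/11/1'),
                                pd.to_datetime('2013/11/1'))
print(df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10))
# B2    2   <- only B2 falls inside the window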
You can do this if the filter is per month number:
import pandas as pd

# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH': [1, 1, 1, 2, 3, 4, 4, 5, 6, 12],
     'OFFENCE_CODE': ['a', 'a', 'b', 'd', 'r', 'e', 'f', 'g', 'h', 'a']}
df = pd.DataFrame(d)
print(df)

# make a filter (example here: months 1 to 4)
df_filter = df.loc[(df['OFFENCE_MONTH'] >= 1) & (df['OFFENCE_MONTH'] < 5)]
print(df_filter)

# count the filtered offence codes
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))
example result:
a 2
b 1
d 1
r 1
e 1
f 1
First you need to convert the dates in the OFFENCE_MONTH column to datetime:
from datetime import datetime

input_date1 = datetime.strptime("2012-11-01", "%Y-%m-%d")
input_date2 = datetime.strptime("2013-11-01", "%Y-%m-%d")
# strptime works on a single string; for the whole column use pd.to_datetime
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'], format="%Y-%m-%d")
Then select rows based on your conditions (note: use & rather than and for element-wise boolean logic on Series):
rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) & (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))
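Put together, a minimal runnable version (the data is invented; column names follow the question):
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'OFFENCE_MONTH': ['2012-12-01', '2013-01-01', '2014-01-01'],
    'OFFENCE_CODE':  ['a', 'a', 'b'],
})
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'], format="%Y-%m-%d")

input_date1 = datetime.strptime("2012-11-01", "%Y-%m-%d")
input_date2 = datetime.strptime("2013-11-01", "%Y-%m-%d")

# & (not `and`) combines the two boolean Series element-wise
rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) &
             (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))
# a    2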
The problem here is to test whether all stocks are integrated of order 1, I(1), and then search for cointegrated pairs. So far I'm just testing whether they're I(1) using the ADF test, but it is not working correctly.
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib as ptl
import statsmodels.tsa.stattools as ts

# List of the 50 companies with the largest % weight in IBOVESPA at 05-02-2022
Stocks = ["VALE3", "PETR4", "ITUB4", "BBDC4", "PETR3", "B3SA3", "ABEV3", "JBSS3", "BBAS3", "WEGE3",
          "ITSA4", "HAPV3", "SUZB3", "RENT3", "GGBR4", "BPAC11", "RDOR3", "EQTL3", "CSAN3", "VBBR3",
          "LREN3", "BBDC3", "RADL3", "PRIO3", "VIVT3", "RAIL3", "ENEV3", "BBSE3", "KLBN11", "TOTS3",
          "CMIG4", "NTCO3", "HYPE3", "SBSP3", "BRFS3", "ELET3", "AMER3", "UGPA3", "MGLU3", "CCRO3",
          "CSNA3", "ASAI3", "ENGI11", "SANB11", "TIMS3", "CPLE6", "EGIE3", "BRKM5", "EMBR3", "ELET6"]

Stocks_SA = []
for tickers in Stocks:
    new_i = f'{tickers}.SA'
    Stocks_SA.append(new_i)

def download_data(List):
    data = pd.DataFrame()
    names = []
    for i in List:
        df = pd.DataFrame(yf.download(i, start="2020-04-30", end="2021-04-30"))
        df = df.dropna()
        df["Adj Close"] = np.log(df["Adj Close"])
        df2 = df.iloc[:, 4]
        data = pd.concat([data, df2], axis=1)
        names.append(i)
    data.columns = names
    return data

s_data = download_data(Stocks_SA)

def Testing_ADF(data):  # test whether all stocks are integrated of order one, I(1)
    names = data.columns.values.tolist()
    I_one = []
    for n in names:
        series = data[n]
        result_adf = ts.adfuller(series)
        if result_adf[1] > 0.05:
            I_one.append(n)
    return I_one

I_one_list = Testing_ADF(s_data)
I_one_list
When I run Testing_ADF(s_data) I get MissingDataError: exog contains inf or nans, but if I run just this code it works perfectly:
df = pd.DataFrame(yf.download("VALE3.SA",start = "2020-04-30", end = "2021-04-30"))
#df2 = ts.adfuller(df)
df["Adj Close"] = np.log(df["Adj Close"])
df2 = df.iloc[:,4]
df2.dropna()
adfuller = ts.adfuller(df2)
adfuller
So why does it work in one case and not the other, and how can I fix it?
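For what it's worth, a plausible explanation: pd.concat aligns the per-ticker series on their date indexes, so any ticker with missing trading days leaves NaN rows in s_data, and adfuller raises MissingDataError on them; a single downloaded series has no such alignment gaps. A minimal sketch of a fix under that assumption: drop NaNs per column before testing.
import statsmodels.tsa.stattools as ts

def Testing_ADF(data):
    # Sketch: run the ADF test per column, dropping the NaNs that
    # pd.concat's index alignment may have introduced.
    I_one = []
    for n in data.columns:
        series = data[n].dropna()      # remove alignment gaps
        result_adf = ts.adfuller(series)
        if result_adf[1] > 0.05:       # cannot reject a unit root
            I_one.append(n)
    return I_one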
I am trying to fetch 4 consecutive rows at a time from a dataframe and store them in a list, stopping when the row index reaches the end of the frame.
Method 1:
dfList = []
for index, row in df.iterrows():
    tempDF = df.iloc[row[i]:row[i+3]].reset_index(drop=True)
    dfList.append(tempDF)
error: TypeError: can only concatenate str (not "int") to str
Method 2:
dfList = []
for i in df:
    tempDF = df.iloc[row[i]:row[i+3]].reset_index(drop=True)
    dfList.append(tempDF)

dfList = []
for index, row in df.iterrows():
    tempDF = df[(df.loc[(row[i], row[i+3]))]].reset_index(drop=True)
    dfList.append(tempDF)
Nothing is working.
Here is how the original dataframe is built:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date, timedelta, datetime
import time
from google.colab import files
import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance fetches the GME data directly from the Yahoo Finance website
import fix_yahoo_finance as yf
yf.pdr_override()

# input values
symbol = 'GME'  # GME ticker symbol
start_date = '2018-01-01'
end_date = '2020-12-31'

# Read data from the website
df = yf.download(symbol, start_date, end_date)

# View data-related information
print(df.tail())
print(len(df))
print(df.info())  # check whether there are any null values
print(df.index)   # type of index

df.reset_index(inplace=True)
df.head()
Date Open High Low Close Adj Close Volume
0 2018-01-02 17.959999 18.290001 17.780001 18.260000 15.953856 2832700
1 2018-01-03 18.290001 18.370001 17.920000 18.200001 15.901433 3789200
2 2018-01-04 18.200001 18.379999 17.959999 18.320000 16.006279 2781300
3 2018-01-05 18.379999 18.730000 18.219999 18.680000 16.320812 3019000
4 2018-01-08 18.799999 19.400000 18.799999 19.230000 16.801352 3668400
I want output like: a list of DataFrames, each holding 4 consecutive rows of the table above.
Hey, I think you need to pass index instead of row[i] in your iloc parameters:
dfList = []
for index, row in df.iterrows():
    # As long as there are 3 more rows to concatenate
    if index < df.shape[0] - 3:
        tempDF = df.iloc[index:index + 4].reset_index(drop=True)
        dfList.append(tempDF)
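If you actually want non-overlapping 4-row chunks rather than a sliding window at every index, a sketch without iterrows (assuming the default integer index):
# Step through the frame 4 rows at a time; the last chunk may be shorter.
dfList = [df.iloc[k:k + 4].reset_index(drop=True)
          for k in range(0, len(df), 4)]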
I want to write an if condition for concatenating strings.
I.e., only if a cell contains a specific format of text do you concatenate; otherwise leave it as is.
example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month-year); else leave the bill number as it is.
You can create a function which does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems you have datetime values in bill_date whereas I used strings, so I had to convert the strings to datetime to show how to work with this. It now needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'). Pandas displays datetimes differently for different locales, so I couldn't use str(x) here: it gives 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])

def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']

df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
bill_number bill_date
0 CM2/0000/09-19 15/09/19
1 CM2/0000 15/09/19
2 CM3/0000/09-19 15/09/19
3 CM3/0000 15/09/19
The second idea is to create a mask:
mask = df['bill_number'].str.endswith('/')
and later use it to assign all matching rows at once:
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask, 'bill_number'] instead of [mask]['bill_number'] to correctly assign values, but the right side doesn't need it.
import pandas as pd

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])

mask = df['bill_number'].str.endswith('/')

#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')

df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')

print(df)
The third idea is to use numpy.where():
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])

df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])

print(df)
Maybe this will work for you. It would be nice to have a data sample like @Mike67 was requesting, but based on your information this is what I came up with. It's bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd

dat = {'num': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
       'date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']}
df = pd.DataFrame(dat)

df['date'] = df['date'].map(lambda x: str(x)[3:])   # drop the day part
df['date'] = df['date'].str.replace('/', '-')       # 09/19 -> 09-19

# Append the month-year only where the num value ends with '/'
df.loc[df['num'].str.endswith('/'), 'num'] = df['num'] + df['date']
print(df)
Results:
num date
0 CM2/0000/09-19 09-19
1 CM2/0000 09-19
2 CM3/0000/09-19 09-19
3 CM3/0000 09-19
I'm new to Python/pandas, so please don't judge. :)
I have a DF with stock data (i.e., Date, Close Value, ...).
Now I want to see if a given Close value will hit a target value (e.g. Close+50€, Close-50€).
I wrote a nested loop that checks every close value against the subsequent close values of the same day:
def calc_zv(_df, _distance):
    _df['ZV_C'] = 0
    _df['ZV_P'] = 0
    for i in range(0, len(_df)):
        _date = _df.iloc[i].get('Date')
        target_put = _df.iloc[i].get('Close') - _distance
        target_call = _df.iloc[i].get('Close') + _distance
        for x in range(i, len(_df) - 1):
            a = _df.iloc[x + 1].get('Close')
            _date2 = _df.iloc[x + 1].get('Date')
            if target_call <= a and _date == _date2:
                _df.ix[i, 'ZV_C'] = 1
                break
            elif target_put >= a and _date == _date2:
                _df.ix[i, 'ZV_P'] = 1
                break
            elif _date != _date2:
                break
This works fine, but I wonder if there is a "better" (faster, more pandas-like) solution?
Thanks and best wishes.
M.
EDIT: Hi again, here is a sample data generator:
import numpy as np
import pandas as pd
from PX.indicator_macros import calc_zv
import datetime

abc = datetime.datetime.now()
print(abc)

df2 = pd.DataFrame({'DateTime': pd.Timestamp('20130102'),
                    'Close': pd.Series(np.random.randn(5000))})
#print(df2.to_string())
calc_zv(df2, 2)
#print(df2.to_string())

abc = datetime.datetime.now()
print(abc)
For 5000 rows I need approx. 10 s, and I have stock data for 3 years (in 15-minute intervals), which takes some minutes.
Cheers
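For what it's worth, a sketch of a faster variant, assuming Date and Close columns as in calc_zv: keep the same forward scan, but run it per day on NumPy arrays instead of row-wise .iloc calls, which removes most of the pandas overhead.
import numpy as np
import pandas as pd

def calc_zv_fast(df, distance):
    # Same idea as calc_zv: within each day, flag whether the call or the
    # put target is reached first by any later close of that day.
    zv_c = pd.Series(0, index=df.index)
    zv_p = pd.Series(0, index=df.index)
    for _, day in df.groupby('Date', sort=False):
        close = day['Close'].to_numpy()
        for j, label in enumerate(day.index):
            later = close[j + 1:]
            hit_c = np.flatnonzero(later >= close[j] + distance)
            hit_p = np.flatnonzero(later <= close[j] - distance)
            first_c = hit_c[0] if hit_c.size else np.inf
            first_p = hit_p[0] if hit_p.size else np.inf
            if np.isfinite(first_c) and first_c <= first_p:
                zv_c.loc[label] = 1    # call target reached first
            elif np.isfinite(first_p):
                zv_p.loc[label] = 1    # put target reached first
    df['ZV_C'] = zv_c
    df['ZV_P'] = zv_p
This also avoids the deprecated .ix indexer used in the original (modern pandas only supports .loc/.iloc).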