Looking for some help on a small project. I am trying to learn Python and I'm totally lost on a problem, so let me explain.
I have a CSV file that contains Apple share prices. So far I can import it into Python using the csv module, but I need to analyse the data, generate monthly averages, and determine the best and worst six months. My CSV columns are Date and Price.
Help is much appreciated.
"Date","Open","High","Low","Close","Volume","Adj Close"
"2012-11-14",660.66,662.18,650.5,652.55,1668400,652.55
"2012-11-13",663,667.6,658.23,659.05,1594200,659.05
"2012-11-12",663.75,669.8,660.87,665.9,1405900,665.9
"2012-11-09",654.65,668.34,650.3,663.03,3114100,663.03
"2012-11-08",670.2,671.49,651.23,652.29,2597000,652.29
"2012-11-07",675,678.23,666.49,667.12,2232300,667.12
"2012-11-06",685.48,686.5,677.55,681.72,1582800,681.72
"2012-11-05",684.5,686.86,675.56,682.96,1635900,682.96
"2012-11-02",694.79,695.55,687.37,687.92,2324400,687.92
"2012-11-01",679.5,690.9,678.72,687.59,2050100,687.59
"2012-10-31",679.86,681,675,680.3,1537000,680.3
"2012-10-26",676.5,683.03,671.2,675.15,1950800,675.15
"2012-10-25",680,682,673.51,677.76,2401100,677.76
"2012-10-24",686.8,687,675.27,677.3,2496500,677.3
etc...
With pandas this would be
In [28]: df = pd.read_csv('my_data.csv', parse_dates=True, index_col=0, sep=',')
In [29]: df
Out[29]:
Open High Low Close Volume Adj Close
Date
2012-11-14 660.66 662.18 650.50 652.55 1668400 652.55
2012-11-13 663.00 667.60 658.23 659.05 1594200 659.05
2012-11-12 663.75 669.80 660.87 665.90 1405900 665.90
2012-11-09 654.65 668.34 650.30 663.03 3114100 663.03
2012-11-08 670.20 671.49 651.23 652.29 2597000 652.29
2012-11-07 675.00 678.23 666.49 667.12 2232300 667.12
2012-11-06 685.48 686.50 677.55 681.72 1582800 681.72
2012-11-05 684.50 686.86 675.56 682.96 1635900 682.96
2012-11-02 694.79 695.55 687.37 687.92 2324400 687.92
2012-11-01 679.50 690.90 678.72 687.59 2050100 687.59
2012-10-31 679.86 681.00 675.00 680.30 1537000 680.30
2012-10-26 676.50 683.03 671.20 675.15 1950800 675.15
2012-10-25 680.00 682.00 673.51 677.76 2401100 677.76
2012-10-24 686.80 687.00 675.27 677.30 2496500 677.30
In [30]: monthly = df.resample('1M')
In [31]: monthly
Out[31]:
Open High Low Close Volume Adj Close
Date
2012-10-31 680.790 683.2575 673.745 677.6275 2096350 677.6275
2012-11-30 673.153 677.7450 665.682 670.0130 2020510 670.0130
You can then sort by the column you want:
In [33]: monthly.sort('Close')
Out[33]:
Open High Low Close Volume Adj Close
Date
2012-11-30 673.153 677.7450 665.682 670.0130 2020510 670.0130
2012-10-31 680.790 683.2575 673.745 677.6275 2096350 677.6275
You can even fetch the data from Yahoo Finance (note: pandas.io.data has since been removed from pandas; its successor is the separate pandas-datareader package):
In [37]: from pandas.io import data as pddata
In [40]: df = pddata.DataReader('AAPL', data_source='yahoo', start='2012-01-01')
In [41]: df.resample('1M').sort('Close')
Out[41]:
Open High Low Close Volume Adj Close
Date
2012-01-31 428.760000 431.008500 425.810500 428.578000 12249740.000000 424.804500
2012-02-29 494.803000 500.849000 491.437500 497.571000 20300990.000000 493.191000
2012-11-30 560.365385 566.118462 548.523846 555.789231 24861884.615385 554.970769
2012-05-31 565.785000 572.141364 558.397273 564.673182 18029781.818182 559.702273
2012-06-30 574.660952 578.889048 569.213333 574.562381 13360247.619048 569.504762
2012-03-31 576.858182 582.064545 570.245909 577.507727 25299250.000000 572.424545
2012-07-31 599.610000 604.920952 594.680476 601.068095 15152466.666667 595.776667
2012-04-30 609.607500 615.487500 598.650000 606.003000 27855340.000000 600.668500
2012-10-31 638.667143 643.650476 628.213810 634.714286 20651071.428571 631.828571
2012-08-31 641.527826 646.655217 637.138261 642.696087 12851252.173913 639.090870
2012-09-30 682.118421 687.007895 676.095263 681.568421 17291363.157895 678.470526
After you have read the rows and saved the [month, mean_price] pairs in a list, you can sort the list:
import operator
values_list.sort(key=operator.itemgetter(1))
This will sort the values by price. To get the top n values:
print(values_list[-n:])
Or the bottom n:
print(values_list[:n])
I am trying to download stock data from yfinance. I have a ticker list called ticker_list_1 with a bunch of symbols. The structure is:
array(['A', 'AAP', 'AAPL', 'ABB', 'ABBV', 'ABC', 'ABEV', 'ABMD', 'ABNB',....
Now I am using the following code to download the data:
i = 0
xrz_data = {}
for i, len in enumerate(ticker_list_1):
    data = yf.download(ticker_list_1[i], start="2019-01-01", end="2022-10-20")
    xrz_data[i] = data
    i = i + 1
The problem I am having is that, once downloaded, the data is saved as xrz_data. I can access a specific dataset through xrz_data[i] with the corresponding number, but I want to be able to access the data through xrz_data[tickername].
How can I achieve this?
I have tried using xrz_data[i].values = data, which gives me the error KeyError: 0
EDIT:
Here is the current output of the for loop:
{0: Open High Low Close Adj Close \
Date
2018-12-31 66.339996 67.480003 66.339996 67.459999 65.660202
2019-01-02 66.500000 66.570000 65.300003 65.690002 63.937435
2019-01-03 65.529999 65.779999 62.000000 63.270000 61.582008
2019-01-04 64.089996 65.949997 64.089996 65.459999 63.713589
2019-01-07 65.639999 67.430000 65.610001 66.849998 65.066483
... ... ... ... ... ...
2022-10-13 123.000000 128.830002 122.349998 127.900002 127.900002
2022-10-14 129.000000 130.220001 125.470001 125.699997 125.699997
2022-10-17 127.379997 131.089996 127.379997 130.559998 130.559998
2022-10-18 133.919998 134.679993 131.199997 132.300003 132.300003
2022-10-19 130.110001 130.270004 127.239998 128.960007 128.960007
[959 rows x 6 columns],
1: Open High Low Close Adj Close \
Date
2018-12-31 156.050003 157.679993 154.990005 157.460007 149.844055
2019-01-02 156.160004 159.919998 153.820007 157.919998 150.281815
........
My desired output would be:
AAP: Open High Low Close Adj Close \
Date
2018-12-31 156.050003 157.679993 154.990005 157.460007 149.844055
2019-01-02 156.160004 159.919998 153.820007 157.919998 150.281815
........
I was able to figure it out. With the following code I am able to download the data and access specific tickers through e.g. xrz_data['AAPL']:
xrz_data = {}
for ticker in ticker_list_1:
    data = yf.download(ticker, start="2019-01-01", end="2022-10-20")
    xrz_data[ticker] = data
I would like to pull historical data from yfinance for a specific list of stocks. I want to store each stock in a separate dataframe (each stock with its own df).
I can download them to multiple CSVs with the code below, but I couldn't find a way to store them in different dataframes (without having to download them to CSV).
import yfinance
stocks = ['TSLA','MSFT','NIO','AAPL','AMD','ADBE','ALGN','AMZN','AMGN','AEP','ADI','ANSS','AMAT','ASML','TEAM','ADSK']
for i in stocks:
    df = yfinance.download(i, start='2015-01-01', end='2021-09-12')
    df.to_csv(i + '.csv')
I want my end result to be a dataframe called "TSLA" for TSLA historical data, and another one called "MSFT" for MSFT data... and so on.
I tried:
stock = ['TSLA','MSFT','NIO','AAPL','AMD']
df_ = {}
for i in stock:
    df = yfinance.download(i, start='2015-01-01', end='2021-09-12')
    df_["{}".format(i)] = df
And I have to call each dataframe by key to get it, like df_["TSLA"], but this is not what I want. I need a dataframe called just TSLA that has the TSLA data, and so on. Is there a way to do it?
You don't need to download the data multiple times. You just have to split the whole download with groupby and create the variables dynamically with locals():
stocks = ['TSLA', 'MSFT', 'NIO', 'AAPL', 'AMD', 'ADBE', 'ALGN', 'AMZN',
'AMGN', 'AEP', 'ADI', 'ANSS', 'AMAT', 'ASML', 'TEAM', 'ADSK']
data = yfinance.download(stocks, start='2015-01-01', end='2021-09-12')
for stock, df in data.groupby(level=1, axis=1):
    locals()[stock] = df.droplevel(level=1, axis=1)
    df.to_csv(f'{stock}.csv')
Output:
>>> TSLA
Adj Close Close High Low Open Volume
Date
2014-12-31 44.481998 44.481998 45.136002 44.450001 44.618000 11487500
2015-01-02 43.862000 43.862000 44.650002 42.652000 44.574001 23822000
2015-01-05 42.018002 42.018002 43.299999 41.431999 42.910000 26842500
2015-01-06 42.256001 42.256001 42.840000 40.841999 42.012001 31309500
2015-01-07 42.189999 42.189999 42.956001 41.956001 42.669998 14842000
... ... ... ... ... ... ...
2021-09-03 733.570007 733.570007 734.000000 724.200012 732.250000 15246100
2021-09-07 752.919983 752.919983 760.200012 739.260010 740.000000 20039800
2021-09-08 753.869995 753.869995 764.450012 740.770020 761.580017 18793000
2021-09-09 754.859985 754.859985 762.099976 751.630005 753.409973 14077700
2021-09-10 736.270020 736.270020 762.609985 734.520020 759.599976 15114300
[1686 rows x 6 columns]
>>> ANSS
Adj Close Close High Low Open Volume
Date
2014-12-31 82.000000 82.000000 83.480003 81.910004 83.080002 304600
2015-01-02 81.639999 81.639999 82.629997 81.019997 82.089996 282600
2015-01-05 80.860001 80.860001 82.070000 80.779999 81.290001 321500
2015-01-06 79.260002 79.260002 81.139999 78.760002 81.000000 344300
2015-01-07 79.709999 79.709999 80.900002 78.959999 79.919998 233300
... ... ... ... ... ... ...
2021-09-03 368.380005 368.380005 371.570007 366.079987 366.079987 293000
2021-09-07 372.070007 372.070007 372.410004 364.950012 369.609985 249500
2021-09-08 372.529999 372.529999 375.820007 369.880005 371.079987 325800
2021-09-09 371.970001 371.970001 375.799988 371.320007 372.519989 194900
2021-09-10 373.609985 373.609985 377.260010 372.470001 374.540009 278800
[1686 rows x 6 columns]
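As an aside, yfinance can also hand you ticker-keyed columns directly: yf.download(stocks, group_by='ticker') returns a frame whose columns are a MultiIndex of (ticker, field), so data['TSLA'] selects one ticker. The sketch below builds an equivalent frame locally (with made-up numbers) so it runs without a network call:

```python
import numpy as np
import pandas as pd

# Locally built frame with the same (ticker, field) column MultiIndex that
# yf.download(..., group_by="ticker") produces; the values are invented.
idx = pd.date_range("2015-01-02", periods=3, name="Date")
cols = pd.MultiIndex.from_product([["TSLA", "MSFT"], ["Open", "Close"]])
data = pd.DataFrame(np.arange(12.0).reshape(3, 4), index=idx, columns=cols)

# Selecting the top level yields a plain per-ticker frame.
tsla = data["TSLA"]
print(tsla.columns.tolist())  # ['Open', 'Close']
```
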
You can create a global or local variable like
globals()["TSLA"] = "some value"
print(TSLA)
locals()["TSLA"] = "some value"
print(TSLA)
but frankly it is a waste of time. It is much more useful to keep it as a dictionary.
With a dictionary you can use a for-loop to run some code on all the dataframes.
You can also select dataframes by name, etc.
Examples:
df_max = {}
for name, df in df_.items():
    df_max[name] = df.max()

name = input("What to display: ")
df_[name].plot()
I am scraping data from Yahoo and trying to pull a certain value from its data frame, using a value that is in another file.
I have managed to scrape the data and show it as a data frame. The thing is, I am trying to extract a certain value from the data using another df.
This is the file I got:
df_earnings=pd.read_excel(r"C:Earnings to Update.xlsx",index_col=2)
stock_symbols = df_earnings.index
output:
Date E Time Company Name
Stock Symbol
CALM 2019-04-01 Before The Open Cal-Maine Foods
CTRA 2019-04-01 Before The Open Contura Energy
NVGS 2019-04-01 Before The Open Navigator Holdings
ANGO 2019-04-02 Before The Open AngioDynamics
LW 2019-04-02 Before The Open Lamb Weston
then I download the csv for each stock with the data from yahoo finance:
driver.get(f'https://finance.yahoo.com/quote/{stock_symbol}/history?period1=0&period2=2597263000&interval=1d&filter=history&frequency=1d')
output:
Open High Low ... Adj Close Volume Stock Name
Date ...
1996-12-12 1.81250 1.8125 1.68750 ... 0.743409 1984400 CALM
1996-12-13 1.71875 1.8125 1.65625 ... 0.777510 996800 CALM
1996-12-16 1.81250 1.8125 1.71875 ... 0.750229 122000 CALM
1996-12-17 1.75000 1.8125 1.75000 ... 0.774094 239200 CALM
1996-12-18 1.81250 1.8125 1.75000 ... 0.791151 216400 CALM
My problem is that I don't know how to find the date from my data frame and extract it from the downloaded file.
Now, I don't want to insert a manual date like this:
df = pd.DataFrame.from_csv(file_path)
df['Stock Name'] = stock_symbol
print(df.head())
df = df.reset_index()
print(df.loc[df['Date'] == '2019-04-01'])
output:
Date Open High ... Adj Close Volume Stock Name
5610 2019-04-01 46.700001 47.0 ... 42.987827 846900 CALM
I want a condition that will run over my data frame for each stock and pull the date needed:
print(df.loc[df['Date'] == the date that is next to the symbol that i just downloaded the file for])
I suppose you could make use of a variable to hold the date.
for sy in stock_symbols:
    # The value from the 'Date' column in df_earnings
    dt = df_earnings.loc[df_earnings.index == sy, 'Date'][sy]

    # From the second block of your code relating to the 'manual' date
    df = pd.DataFrame.from_csv(file_path)
    df['Stock Name'] = sy
    df = df.reset_index()
    print(df.loc[df['Date'] == dt])
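A runnable sketch of the same idea with mocked frames (the symbols and prices are invented for illustration; pd.DataFrame.from_csv from the question has since been removed from pandas, so the sketch builds the price frame directly):

```python
import pandas as pd

# Mocked stand-in for df_earnings: one earnings date per stock symbol.
df_earnings = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-04-01", "2019-04-02"])},
    index=pd.Index(["CALM", "ANGO"], name="Stock Symbol"),
)

# Mocked stand-in for one downloaded price file.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-03-29", "2019-04-01", "2019-04-02"]),
    "Close": [46.0, 46.7, 47.1],
})

sy = "CALM"
df["Stock Name"] = sy

# Look up the earnings date for this symbol, then filter the prices on it.
dt = df_earnings.loc[sy, "Date"]
row = df.loc[df["Date"] == dt]
print(row)
```
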
My DataFrame is:
Date Open High Low Close Adj Close Volume
5932 2016-08-18 218.339996 218.899994 218.210007 218.860001 207.483215 52989300
5933 2016-08-19 218.309998 218.750000 217.740005 218.539993 207.179825 75443000
5934 2016-08-22 218.259995 218.800003 217.830002 218.529999 207.170364 61368800
5935 2016-08-23 219.250000 219.600006 218.899994 218.970001 207.587479 53399200
5936 2016-08-24 218.800003 218.910004 217.360001 217.850006 206.525711 71728900
5937 2016-08-25 217.399994 218.190002 217.220001 217.699997 206.383514 69224800
5938 2016-08-26 217.919998 219.119995 216.250000 217.289993 205.994827 122506300
5939 2016-08-29 217.440002 218.669998 217.399994 218.360001 207.009201 68606100
5940 2016-08-30 218.259995 218.589996 217.350006 218.000000 206.667908 58114500
5941 2016-08-31 217.610001 217.750000 216.470001 217.380005 206.080124 85269500
5942 2016-09-01 217.369995 217.729996 216.029999 217.389999 206.089645 97844200
5943 2016-09-02 218.389999 218.869995 217.699997 218.369995 207.018692 79293900
5944 2016-09-06 218.699997 219.119995 217.860001 219.029999 207.644394 56702100
5945 2016-09-07 218.839996 219.220001 218.300003 219.009995 207.625412 76554900
5946 2016-09-08 218.619995 218.940002 218.149994 218.509995 207.151398 73011600
5947 2016-09-09 216.970001 217.029999 213.250000 213.279999 202.193268 221589100
5948 2016-09-12 212.389999 216.809998 212.309998 216.339996 205.094223 168110900
5949 2016-09-13 214.839996 215.149994 212.500000 213.229996 202.145859 182828800
5950 2016-09-14 213.289993 214.699997 212.500000 213.149994 202.070023 134185500
5951 2016-09-15 212.960007 215.729996 212.750000 215.279999 204.089294 134427900
5952 2016-09-16 213.479996 213.690002 212.570007 213.369995 203.300430 155236400
Currently, I'm doing this:
state['open_price'] = lookback.Open.iloc[-1:].get_values()[0]
for ind, row in lookback.reset_index().iterrows():
    if ind < self.LOOKBACK_DAYS:
        state['close_' + str(self.LOOKBACK_DAYS - ind)] = row.Close
        state['open_' + str(self.LOOKBACK_DAYS - ind)] = row.Open
        state['volume_' + str(self.LOOKBACK_DAYS - ind)] = row.Volume
But this is exceedingly slow. Is there some more vectorized way to do this?
I am trying to convert this to:
cash 1.000000e+05
num_shares 0.000000e+00
cost_basis 0.000000e+00
open_price 1.316900e+02
close_20 1.301100e+02
open_20 1.302600e+02
volume_20 4.670420e+07
close_19 1.302100e+02
open_19 1.299900e+02
volume_19 4.320920e+07
close_18 1.300200e+02
open_18 1.300300e+02
volume_18 3.252300e+07
close_17 1.292200e+02
open_17 1.299300e+02
volume_17 8.207990e+07
close_16 1.300300e+02
open_16 1.294100e+02
volume_16 6.150570e+07
close_15 1.298000e+02
open_15 1.301100e+02
volume_15 7.057170e+07
close_14 1.298300e+02
open_14 1.300200e+02
volume_14 6.292560e+07
close_13 1.297300e+02
open_13 1.300700e+02
volume_13 6.162470e+07
close_12 1.305600e+02
open_12 1.297300e+02
...
close_10 1.308700e+02
open_10 1.308500e+02
volume_10 5.790620e+07
close_9 1.295400e+02
open_9 1.310600e+02
volume_9 8.018090e+07
close_8 1.297400e+02
open_8 1.297400e+02
volume_8 4.149650e+07
close_7 1.286400e+02
open_7 1.298500e+02
volume_7 7.279940e+07
close_6 1.288800e+02
open_6 1.287700e+02
volume_6 4.303370e+07
close_5 1.287100e+02
open_5 1.285900e+02
volume_5 5.105180e+07
close_4 1.286600e+02
open_4 1.288300e+02
volume_4 6.416770e+07
close_3 1.307000e+02
open_3 1.289300e+02
volume_3 9.253180e+07
close_2 1.309500e+02
open_2 1.307500e+02
volume_2 8.726900e+07
close_1 1.311300e+02
open_1 1.310000e+02
volume_1 8.600550e+07
Length: 64, dtype: float64
One way is to cheat and use the underlying arrays via .values.
I'll add the steps I took to create an equivalent example as well:
import pandas as pd
from itertools import product
initial = ['cash', 'num_shares', 'somethingsomething']
initial_series = pd.Series([1, 2, 3], index = initial)
print(initial_series)
#Output:
cash 1
num_shares 2
somethingsomething 3
dtype: int64
Okay, just some values at the start of your series in output, mocked for the example.
df = pd.read_clipboard(sep='\s\s+') #pure magic
print(df.head())
#Output:
Date Open ... Adj Close Volume
5932 2016-08-18 218.339996 ... 207.483215 52989300
5933 2016-08-19 218.309998 ... 207.179825 75443000
5934 2016-08-22 218.259995 ... 207.170364 61368800
5935 2016-08-23 219.250000 ... 207.587479 53399200
5936 2016-08-24 218.800003 ... 206.525711 71728900
[5 rows x 7 columns]
df is now essentially the dataframe you provided in the example. The clipboard trick comes from here and is a good read for pandas MCVEs.
to_select = ['Close', 'Open', 'Volume']
SOMELOOKBACK = 6000 #mocked
final_index = [f"{name}_{index}" for index, name in product((SOMELOOKBACK - df.index), to_select)]
This prepares the indexes and looks something like this
['Close_68',
'Open_68',
'Volume_68',
'Close_67',
'Open_67',
'Volume_67',
...
]
Now, just select the relevant columns from the dataframe, use .values to get a 2D array, then flatten to get the final series.
final_series = pd.Series(df[to_select].values.flatten(), index = final_index)
result = initial_series.append(final_series)
#Output:
cash 1.000000e+00
num_shares 2.000000e+00
somethingsomething 3.000000e+00
Close_68 2.188600e+02
Open_68 2.183400e+02
Volume_68 5.298930e+07
Close_67 2.185400e+02
Open_67 2.183100e+02
Volume_67 7.544300e+07
Close_66 2.185300e+02
Open_66 2.182600e+02
Volume_66 6.136880e+07
...
Close_48 2.133700e+02
Open_48 2.134800e+02
Volume_48 1.552364e+08
Length: 66, dtype: float64
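One caveat for current pandas: Series.append was deprecated and then removed in pandas 2.0; the same concatenation is now spelled with pd.concat:

```python
import pandas as pd

initial_series = pd.Series([1, 2, 3],
                           index=["cash", "num_shares", "somethingsomething"])
final_series = pd.Series([218.86, 52989300.0],
                         index=["Close_68", "Volume_68"])

# pd.concat replaces the removed Series.append and yields the same
# combined series as the session above.
result = pd.concat([initial_series, final_series])
print(result)
```
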
The following code takes too much running time (more than 5 min).
Are there any good ways to reduce the running time?
data.head() # more than 10 years of data; total iterations are around 4,500,000
Open High Low Close Volume Adj Close \
Date
2012-07-02 125500.0 126500.0 124000.0 125000.0 118500 104996.59
2012-07-03 126500.0 130000.0 125500.0 129500.0 239400 108776.47
2012-07-04 130000.0 132500.0 128500.0 131000.0 180800 110036.43
2012-07-05 129500.0 131000.0 127500.0 128500.0 118600 107936.50
2012-07-06 128500.0 129000.0 126000.0 127000.0 149000 106676.54
My Code is
import pandas as pd
import numpy as np
from pandas.io.data import DataReader
import matplotlib.pylab as plt
from datetime import datetime
def DataReading(code):
    start = datetime(2012,7,1)
    end = pd.to_datetime('today')
    data = DataReader(code, 'yahoo', start=start, end=end)
    data = data[data["Volume"] != 0]
    return data
data['Cut_Off'] = 0
Cut_Pct = 0.85

for i in range(len(data['Open'])):
    if i == 0:
        pass
    for j in range(0, i):
        if data['Close'][j]/data['Close'][i-1] <= Cut_Pct:
            data['Cut_Off'][j] = 1
            data['Cut_Off'][i] = 1
        else:
            pass
The above code takes more than 5 min.
Of course, there are "elif" branches that follow (I didn't write them above); I just tested the code shown.
Are there any good ways to reduce its running time?
Additional: the buying list is
Open High Low Close Volume Adj Close \
Date
2012-07-02 125500.0 126500.0 124000.0 125000.0 118500 104996.59
2012-07-03 126500.0 130000.0 125500.0 129500.0 239400 108776.47
2012-07-04 130000.0 132500.0 128500.0 131000.0 180800 110036.43
2012-07-05 129500.0 131000.0 127500.0 128500.0 118600 107936.50
2012-07-06 128500.0 129000.0 126000.0 127000.0 149000 106676.54
2012-07-09 127000.0 133000.0 126500.0 131500.0 207500 110456.41
2012-07-10 131500.0 135000.0 130500.0 133000.0 240800 111716.37
2012-07-11 133500.0 136500.0 132500.0 136500.0 223800 114656.28
For example, I bought 10 shares on 2012-07-02 at 125,500, and as time goes by,
if the close price drops under 85% of the buying price (125,500), then I
will sell the 10 shares at 85% of the buying price.
To reduce running time, I also made a buying list (I didn't show it here),
but it also takes more than 2 min using a for loop.
Rather than iterating over the 4.5MM rows in your data, use pandas' built-in indexing features. I've re-written the loop at the end of your code as below:
data.loc[data.Close/data.Close.shift(1) <= Cut_Pct,'Cut_Off'] = 1
.loc locates rows that meet the criteria in the first argument. .shift shifts the rows up or down depending on the argument passed.
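On a tiny mocked frame (values invented) the vectorized line behaves like this; the flag marks rows whose Close is at or below 85% of the previous day's Close:

```python
import pandas as pd

data = pd.DataFrame({"Close": [100.0, 99.0, 80.0, 81.0]})
Cut_Pct = 0.85

data["Cut_Off"] = 0
# Close / Close.shift(1) is the day-over-day ratio; the first row's ratio is
# NaN, and NaN <= Cut_Pct is False, so the first row is never flagged.
data.loc[data.Close / data.Close.shift(1) <= Cut_Pct, "Cut_Off"] = 1
print(data["Cut_Off"].tolist())  # [0, 0, 1, 0]
```
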