So I am gathering data for the S&P 500 from a csv file. My question is: how would I create one large dataframe with 500 columns, one per ticker, holding all of the prices? The code is currently:
import os
import pandas as pd
import pandas_datareader as web
import datetime as dt
from datetime import date
import numpy as np

def get_data():
    start = dt.datetime(2020, 5, 30)
    end = dt.datetime.now()
    tickers = pd.read_csv(os.path.expanduser("/Users/benitocano/Downloads/copyOfSandP500.csv"),
                          delimiter=',', names=['Symbol', 'Name', 'Sector'])
    for i in tickers['Symbol'][:5]:
        df = web.DataReader(i, 'yahoo', start, end)
        # keep only Adj Close for each ticker
        df.drop(['High', 'Low', 'Open', 'Close', 'Volume'], axis=1, inplace=True)

get_data()
So as the code stands right now it is just going to create 500 individual dataframes, and I am asking how to combine them into one large dataframe. Thanks!
EDIT:
The CSV file link is:
https://datahub.io/core/s-and-p-500-companies
I have tried adding this to the above code:
for stock in data:
    series = pd.Series(stock['Adj Close'])
    df = pd.DataFrame()
    df[ticker] = series
    print(df)
Though the output is only one column like so:
ADM
Date
2020-06-01 38.574604
2020-06-02 39.348278
2020-06-03 40.181465
2020-06-04 40.806358
2020-06-05 42.175167
... ...
2020-11-05 47.910000
2020-11-06 48.270000
2020-11-09 49.290001
2020-11-10 50.150002
2020-11-11 50.090000
Why is it printing only one column, rather than the rest of them?
The answer depends on the structure of the dataframes that your current code produces. Since the code depends on files on your local drive, we cannot run it, so it is hard to be specific here. In general there are many options; among the most common are:
Put the dfs into a list and use pandas.concat(..., axis=1) on that list to concatenate the dfs column by column.
Merge (merge or join) your dfs on the Date column that I assume each df has.
A sketch of the first option follows below.
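For example, a minimal sketch of the first option, using the yahoo reader from your code (the three-ticker list here is a hypothetical stand-in for the 500 symbols from your csv). Note that in your attempt, df = pd.DataFrame() inside the loop recreates an empty frame on every iteration, which is why only the last ticker's column survives:

import pandas as pd
import pandas_datareader as web
import datetime as dt

start = dt.datetime(2020, 5, 30)
end = dt.datetime.now()
tickers = ['ADM', 'AAPL', 'MSFT']  # hypothetical sample; take yours from the csv

frames = []
for ticker in tickers:
    data = web.DataReader(ticker, 'yahoo', start, end)
    # keep only Adj Close, and name the column after the ticker
    frames.append(data['Adj Close'].rename(ticker))

# one concat at the end aligns all series on their shared Date index
df = pd.concat(frames, axis=1)
print(df)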
Excuse rookie errors, I'm pretty new to Python and even newer to using pandas.
The script so far counts the rows of the csv file below (generated by a different script), and that value is stored correctly as totalDailyListings. There is some formatting with the date because I want to get to a point where this script counts the day before's listings.
The current structure of CSV file that rows are being counted from:
Address          Price
123 Main St      £305,000
55 Chance Way    £200,000
import pandas as pd
from datetime import date, timedelta

oneday = timedelta(days=1)
yesterdayDate = date.today() - oneday
filename = 'ballymena-' + str(yesterdayDate) + '.csv'

results = pd.read_csv(filename)
totalDailyListings = len(results)
# calculates the correct number of rows in csv

listings = pd.DataFrame(
    [yesterdayDate, totalDailyListings], columns=['Date', 'Total'])
listings.to_csv('listings-count.csv')
I'm struggling, however, to create a new CSV file with just two columns, "Date" and "Total", that would allow the CSV file to be updated (rows added) daily for long-term monitoring. Can anyone offer some tips to point me back in the right direction?
Desired CSV format:
Date         Total
2022-10-04   305
2022-10-05   200
The important part is the structure of the resulting dataframe:
df = pd.DataFrame({'Date':['2022-10-04', '2022-10-05'], 'Total': [305, 200]})
df.to_csv('listing-count-default_index.csv')
df.set_index('Date', inplace=True)
df.to_csv('listing-count.csv')
This should give you just two columns.
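For the daily monitoring part, a minimal sketch of one common pattern (assuming the script runs once per day and listings-count.csv lives in the working directory): append a single row on each run, and write the header only if the file does not exist yet.

import os
import pandas as pd
from datetime import date, timedelta

yesterdayDate = date.today() - timedelta(days=1)
results = pd.read_csv('ballymena-' + str(yesterdayDate) + '.csv')

# a list of lists gives one row with two columns, Date and Total
row = pd.DataFrame([[yesterdayDate, len(results)]], columns=['Date', 'Total'])

outfile = 'listings-count.csv'
# mode='a' appends; the header is written only on the very first run
row.to_csv(outfile, mode='a', header=not os.path.exists(outfile), index=False)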
I have some CSV files with exactly the same structure of stock quotes (timeframe is one day):
date,open,high,low,close
2001-10-15 00:00:00 UTC,56.11,59.8,55.0,57.9
2001-10-22 00:00:00 UTC,57.9,63.63,56.88,62.18
I want to merge them all into one DataFrame with only close price columns for each stock. The problem is that different files have different history depths (they start from different dates in different years). I want to align them all by date in one DataFrame.
I'm trying to run the following code, but I get nonsense in the resulting df:
files = ['FB', 'MSFT', 'GM', 'IBM']
stock_d = {}
for file in files:  # reading all files into one dictionary
    stock_d[file] = pd.read_csv(file + '.csv', parse_dates=['date'])

date_column = pd.Series()  # the column with all dates from all CSVs
for stock in stock_d:
    date_column = date_column.append(stock_d[stock]['date'])
date_column = date_column.drop_duplicates().sort_values(ignore_index=True)  # keeping only unique values, then sorting by date

df = pd.DataFrame(date_column, columns=['date'])  # creating the final DataFrame
for stock in stock_d:
    stock_df = stock_d[stock]  # this is one of the CSV files, for example FB.csv
    df[stock] = [stock_df.iloc[stock_df.index[stock_df['date'] == date]]['close'] for date in date_column]  # for each date in date_column, add the close price to the resulting df, or None if the date is not found
print(df.tail())  # something strange here - Series objects in every column
The idea is first to extract all dates from each file, then to distribute close prices among according columns and dates. But obviously I'm doing something wrong.
Can you help me please?
If I understand you correctly, what you are looking for is the pivot operation:
files = ['FB', 'MSFT', 'GM', 'IBM']

df = []  # this is a list, not a dictionary
for file in files:
    # You only care about date and closing price,
    # so only keep those 2 columns to save memory
    tmp = pd.read_csv(file + '.csv', parse_dates=['date'], usecols=['date', 'close']).assign(symbol=file)
    df.append(tmp)

# A single `concat` is faster than sequential `append`s
df = pd.concat(df).pivot(index='date', columns='symbol', values='close')
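pivot also takes care of the alignment you were building by hand: the resulting index is the union of all dates across the files, and a symbol with no quote on a given date simply gets NaN in its column, so the manual date_column union is unnecessary. (Passing values='close' keeps the column labels as plain symbols instead of a ('close', symbol) MultiIndex.)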
I have a situation where my data in stored in S3 by date. So in a bucket named mydata, the folder 2020-01-01 contains one csv, then folder 2020-01-02 contains another csv, and so on. I want to write a function in which the user inputs a start date and end date, and the function reads all csv files between those dates and concatenates them into a single dataframe. One way I can do this is below, but this seems clunky and slow. Is there a better way?
# Load libraries
import pandas as pd
import dask.dataframe as dd

# Define bucket, and start and end dates
bucket = 'mydata'
start_date = '2019-07-09'
end_date = '2019-07-12'

def read_data(bucket, start_date, end_date):
    # Initialize list of dataframes
    dfs = []
    # Get range of dates from which to read data
    time_range = pd.date_range(start=start_date, end=end_date, freq='D')
    # Read data for each date and append to dfs
    for dte in time_range:
        d = str(dte).split(' ')[0]
        df = dd.read_csv('s3://{}/{}/*.csv'.format(bucket, d)).compute()
        dfs.append(df)
    # Concatenate dfs into one df
    merged_df = pd.concat(dfs, axis=0)
    return merged_df
This is how I would do it:
1. Create a list of pathlib file objects.
2. Create a dataframe and use a regex to extract the date from each file name.
3. Take start and end date inputs and filter the dataframe by them.
4. Concatenate the files selected by that filter, as shown below.
Naturally, you should add some error handling for your datetime inputs.
import pandas as pd
from pathlib import Path
path = r'\tmp\s3\bucket\files'  # raw string, so \t, \b and \f are not treated as escape sequences
files = list(Path(path).glob('*.csv'))
df = pd.DataFrame({'files': files,
                   'stem': [f.stem for f in files]})
df['date'] = pd.to_datetime(df['stem'].str.extract(r'(\d{4}-\d{2}-\d{2})')[0])
print(df)
files stem date
0 \tmp\s3\bucket\files\file-2020-04-15 file-2020-04-15 2020-04-15
1 \tmp\s3\bucket\files\file-2020-04-16 file-2020-04-16 2020-04-16
2 \tmp\s3\bucket\files\file-2020-04-17 file-2020-04-17 2020-04-17
3 \tmp\s3\bucket\files\file-2020-04-18 file-2020-04-18 2020-04-18
4 \tmp\s3\bucket\files\file-2020-04-19 file-2020-04-19 2020-04-19
start_date = input('Enter your Start Date')
end_date = input('Enter your End Date')
Assuming these dates are '2020-04-15' and '2020-04-16', we can return a list of files based on that range and concatenate them into one DataFrame.
import dask.dataframe as dd  # as in your question; plain pd.read_csv also works for local files

file_date_slice = df.set_index('date').loc[start_date:end_date]['files'].tolist()
concat_df = pd.concat([dd.read_csv(f).compute() for f in file_date_slice])
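As a side note, a shorter sketch of the original S3 approach (assuming the bucket layout from your question): dask's read_csv accepts a list of paths, so you can hand it one glob per day and call compute() a single time instead of once per day.

import pandas as pd
import dask.dataframe as dd

bucket = 'mydata'  # bucket name from the question
start_date = '2019-07-09'
end_date = '2019-07-12'

# one glob per day in the range, all passed to a single read_csv call
paths = ['s3://{}/{}/*.csv'.format(bucket, d.date())
         for d in pd.date_range(start_date, end_date, freq='D')]
merged_df = dd.read_csv(paths).compute()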
I am trying to read multiple csv stock price files, all of which have the following columns: Date, Time, Open, High, Low, Close. The code is:
import copy
import pandas as pd

tickers = ['gmk', 'yandex', 'sberbank']
ohlc_intraday = {}
ohlc_intraday['gmk'] = pd.read_csv("gmk_15min.csv", parse_dates=["<DATE>"], dayfirst=True)
ohlc_intraday['yandex'] = pd.read_csv("yndx_15min.csv", parse_dates=["<DATE>"], dayfirst=True)
ohlc_intraday['sberbank'] = pd.read_csv("sber_15min.csv", parse_dates=["<DATE>"], dayfirst=True)

df = copy.deepcopy(ohlc_intraday)
for i in range(len(tickers)):
    df[tickers[i]] = df[tickers[i]].iloc[:, 2:]
    df[tickers[i]].columns = ['Date', 'Time', 'Open', 'High', 'Low', 'Adj Close', 'Volume']
    df[tickers[i]]['Time'] = [x + ':00' for x in df['Time']]
However, I am then faced with KeyError: 'Time'. It seems the columns are not keys.
Is it possible to read or convert this into a DataFrame format where the keys are the stock tickers (gmk, yandex, sberbank) and the column names, so I can easily extract a value using the following code:
ohlc_intraday['sberbank']['Date'][1]
What you could do is create a DataFrame that has a column that specifies the market.
import pandas as pd

markets = ["gmk", "yandex", "sberbank"]
files = ["gmk_15min.csv", "yndx_15min.csv", "sber_15min.csv"]

dfs = [pd.read_csv(file, parse_dates=["<DATE>"], dayfirst=True)
       for file in files]

# add a market column to each df
for market, df in zip(markets, dfs):
    df['market'] = market

# concatenate into one dataframe
df = pd.concat(dfs)
Then access what you want in this manner
df[df['market'] == 'yandex']['Date'].iloc[1]
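If you would rather keep the dict-of-DataFrames access pattern from your question, a small sketch (assuming the combined df built above) rebuilds it with groupby:

# one DataFrame per ticker, keyed by the market column
ohlc_intraday = {market: g.drop(columns='market')
                 for market, g in df.groupby('market')}
ohlc_intraday['sberbank']['Date'].iloc[1]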
I am trying to pull data from Yahoo! Finance for analysis and am having trouble when I want to read from a CSV file instead of downloading from Yahoo! every time I run the program.
import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime

def get(tickers, startdate, enddate):
    def data(ticker):
        return pdr.get_data_yahoo(ticker, start=startdate, end=enddate)
    datas = map(data, tickers)
    return pd.concat(datas, keys=tickers, names=['Ticker', 'Date'])

tickers = ['AAPL', 'MSFT', 'GOOG']
all_data = get(tickers, datetime.datetime(2006, 10, 1), datetime.datetime(2018, 1, 7))
all_data.to_csv('data/alldata.csv')

# Open file
all_data_csv = pd.read_csv('data/alldata.csv', header=0, index_col='Date', parse_dates=True)

daily_close = all_data[['Adj Close']].reset_index().pivot('Date', 'Ticker', 'Adj Close')
I'm having problems with the daily_close line. The code above works because it uses all_data, which comes directly from the web. How do I alter the bottom line of code so that the data is pulled from my csv file instead? I have tried daily_close = all_data_csv[['Adj Close']].reset_index().pivot('Date', 'Ticker', 'Adj Close'), however this results in a KeyError on 'Ticker'.
The csv data has all of the tickers in its first column.
Your current code for all_data_csv will not work as it did for all_data. This is a consequence of the fact that all_data contains a MultiIndex with all the information needed to carry out the pivot.
However, in the case of all_data_csv, the only index is Date. So, we'd need to do a little extra in order to get this to work.
1. First, reset the Date index.
2. Select only the columns you need: ['Date', 'Ticker', 'Adj Close'].
3. Now, pivot on these columns.
c = ['Date', 'Ticker', 'Adj Close']
# recent pandas versions require keyword arguments for pivot
daily_close = all_data_csv.reset_index('Date')[c].pivot(index=c[0], columns=c[1], values=c[2])
daily_close.head()
Ticker AAPL GOOG MSFT
Date
2006-10-02 9.586717 199.422943 20.971155
2006-10-03 9.486828 200.714539 20.978823
2006-10-04 9.653308 206.506866 21.415722
2006-10-05 9.582876 204.574448 21.400393
2006-10-06 9.504756 208.891357 21.362070
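An alternative sketch of the same idea (assuming the csv was written by all_data.to_csv above): restore the MultiIndex at read time, and the original pivot then works on all_data_csv just as it did on all_data.

# read Ticker and Date back as a MultiIndex instead of a flat Date index
all_data_csv = pd.read_csv('data/alldata.csv',
                           index_col=['Ticker', 'Date'], parse_dates=['Date'])
daily_close = (all_data_csv[['Adj Close']]
               .reset_index()
               .pivot(index='Date', columns='Ticker', values='Adj Close'))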