Reading multiple csv files into single DataFrame - python

I am trying to read multiple csv stock price files, all of which have the following columns: Date, Time, Open, High, Low, Close. The code is:
import copy
import pandas as pd
tickers=['gmk','yandex','sberbank']
ohlc_intraday={}
ohlc_intraday['gmk']=pd.read_csv("gmk_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
ohlc_intraday['yandex']=pd.read_csv("yndx_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
ohlc_intraday['sberbank']=pd.read_csv("sber_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
df=copy.deepcopy(ohlc_intraday)
for i in range(len(tickers)):
    df[tickers[i]] = df[tickers[i]].iloc[:, 2:]
    df[tickers[i]].columns = ['Date','Time',"Open", "High", "Low", "Adj Close", "Volume"]
    df[tickers[i]]['Time']=[x+':00' for x in df['Time']]
However, I then get KeyError: 'Time'. It seems the columns are not keys.
Is it possible to read the files, or convert the result, into a DataFrame-like structure keyed by stock ticker (gmk, yandex, sberbank) and column name, so that I can easily extract a value with code like the following?
ohlc_intraday['sberbank']['Date'][1]

What you could do is create a DataFrame that has a column that specifies the market.
import pandas as pd
markets = ["gmk", "yandex", "sberbank"]
files = ["gmk_15min.csv", "yndx_15min.csv", "sber_15min.csv"]
dfs = [pd.read_csv(file, parse_dates=["<DATE>"], dayfirst=True)
       for file in files]
# add a market column to each df
for market, df in zip(markets, dfs):
    df['market'] = market
# concatenate into one dataframe
df = pd.concat(dfs)
Then access what you want in this manner (assuming the columns have been renamed as in the question):
df[df['market'] == 'yandex']['Date'].iloc[1]
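As for the KeyError itself: df in the question is a dictionary of DataFrames, so df['Time'] looks for a ticker called 'Time' rather than a column. Here is a minimal sketch of both layouts, using made-up in-memory CSV text in place of the real files (the column names and values are assumptions, not the real data):

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real CSV files
csv_text = {
    "gmk": "Date,Time,Close\n01/02/2021,10:00,100.0\n01/02/2021,10:15,101.0\n",
    "yandex": "Date,Time,Close\n01/02/2021,10:00,50.0\n01/02/2021,10:15,51.0\n",
}

# Layout 1: a dict of DataFrames keyed by ticker
frames = {
    ticker: pd.read_csv(io.StringIO(text), parse_dates=["Date"], dayfirst=True)
    for ticker, text in csv_text.items()
}
for ticker in frames:
    # Index the dict by ticker first, then the DataFrame by column
    frames[ticker]["Time"] = frames[ticker]["Time"] + ":00"

# Layout 2: one DataFrame with a 'market' column
combined = pd.concat(
    [frame.assign(market=ticker) for ticker, frame in frames.items()],
    ignore_index=True,
)
second_yandex_time = combined.loc[combined["market"] == "yandex", "Time"].iloc[1]
```

With the first layout, frames['yandex']['Date'][1] works as the question wanted; the second layout keeps everything in one table and filters by market.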

Related

Stock historical data with Investpy

I'm new to python and I want to get the historical data of some stocks. I'm trying to use investpy, but it seems that it can only get one stock at a time.
Is this correct?
If so, how can I merge those single data into one dataframe?
I tried to run something like this, but got only one column (and without the company's name). yfinance doesn't work in my case.
import investpy as inv
stocks = ["WEGE3", "JHSF3"]
dfs = list()
for stock in stocks:
    df = inv.get_stock_historical_data(stock=stock, country="Brazil", from_date="01/01/2020", to_date="01/01/2021")["Close"]
    dfs.append(df)
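import investpy as inv
import pandas as pd
stocks = ["WEGE3", "JHSF3"]
# collect one "Close" series per stock, keyed by ticker
closes = {}
for stock in stocks:
    closes[stock] = inv.get_stock_historical_data(stock=stock, country="Brazil", from_date="01/01/2020", to_date="01/01/2021")["Close"]
# DataFrame.append was removed in pandas 2.0, so concatenate column by column instead;
# the dict keys become the column names
dfs = pd.concat(closes, axis=1)
dfs.head()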
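To see what the column-wise combination produces without calling investpy, here is a small sketch with made-up closing prices (the numbers are invented; only the tickers come from the question):

```python
import pandas as pd

# Made-up closing prices standing in for the investpy results
index = pd.to_datetime(["2020-01-02", "2020-01-03"])
closes = {
    "WEGE3": pd.Series([35.0, 35.5], index=index, name="Close"),
    "JHSF3": pd.Series([7.0, 7.2], index=index, name="Close"),
}

# Dict keys become the column names; rows are aligned on the date index
prices = pd.concat(closes, axis=1)
```

Each ticker ends up as its own column, so the company name is no longer lost.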

How to put stock prices from csv file into one single dataframe

I am gathering data for the S&P 500 companies listed in a csv file. My question is: how would I create one large dataframe that has 500 columns, with all of the prices? The code is currently:
import os
import pandas as pd
import pandas_datareader as web
import datetime as dt
from datetime import date
import numpy as np
def get_data():
    start = dt.datetime(2020, 5, 30)
    end = dt.datetime.now()
    csv_file = pd.read_csv(os.path.expanduser("/Users/benitocano/Downloads/copyOfSandP500.csv"), delimiter = ',')
    tickers = pd.read_csv("/Users/benitocano/Downloads/copyOfSandP500.csv", delimiter=',', names = ['Symbol', 'Name', 'Sector'])
    for i in tickers['Symbol'][:5]:
        df = web.DataReader(i, 'yahoo', start, end)
        df.drop(['High', 'Low', 'Open', 'Close', 'Volume'], axis=1, inplace=True)
get_data()
As the code shows, right now it is just going to create 500 individual dataframes, so I am asking how to make them into one large dataframe. Thanks!
EDIT:
The CSV file link is:
https://datahub.io/core/s-and-p-500-companies
I have tried adding this to the above code:
for stock in data:
    series = pd.Series(stock['Adj Close'])
    df = pd.DataFrame()
    df[ticker] = series
    print(df)
Though the output is only one column like so:
ADM
Date
2020-06-01 38.574604
2020-06-02 39.348278
2020-06-03 40.181465
2020-06-04 40.806358
2020-06-05 42.175167
... ...
2020-11-05 47.910000
2020-11-06 48.270000
2020-11-09 49.290001
2020-11-10 50.150002
2020-11-11 50.090000
Why is it printing only one column, rather than the rest of them?
The answer depends on the structure of the dataframes that your current code produces. As the code depends on files on your local drive, we cannot run it, so it is hard to be specific here. In general there are many options; the most common, I would say, are:
Put the dfs into a list and use pandas.concat(..., axis=1) on that list to concatenate them column by column.
Merge (merge or join) your dfs on the Date column, which I assume each df has.
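Both options can be sketched with made-up frames (the prices and the second ticker are invented for illustration). Note also that if df = pd.DataFrame() sits inside the loop, as it appears to in the attempt above, a fresh frame is created on every iteration, which would explain seeing only one column.

```python
from functools import reduce
import pandas as pd

# Made-up per-ticker frames; real ones would come from the data source
adm = pd.DataFrame({"Date": pd.to_datetime(["2020-06-01", "2020-06-02"]),
                    "ADM": [38.57, 39.35]})
aapl = pd.DataFrame({"Date": pd.to_datetime(["2020-06-02", "2020-06-03"]),
                     "AAPL": [80.8, 81.3]})
frames = [adm, aapl]

# Option 1: concatenate column by column, aligned on a Date index
wide_concat = pd.concat([f.set_index("Date") for f in frames], axis=1).sort_index()

# Option 2: merge successively on the Date column (outer join keeps every date)
wide_merge = reduce(
    lambda left, right: left.merge(right, on="Date", how="outer"), frames
)
wide_merge = wide_merge.sort_values("Date").reset_index(drop=True)
```

In both results, dates missing for a ticker show up as NaN in that ticker's column.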

How to merge some CSV files into one DataFrame?

I have some CSV files with exactly the same structure of stock quotes (timeframe is one day):
date,open,high,low,close
2001-10-15 00:00:00 UTC,56.11,59.8,55.0,57.9
2001-10-22 00:00:00 UTC,57.9,63.63,56.88,62.18
I want to merge them all into one DataFrame with only the close price column for each stock. The problem is that different files have different history depth (they start from different dates in different years). I want to align them all by date in one DataFrame.
I'm trying to run the following code, but I get nonsense in the resulting df:
files = ['FB', 'MSFT', 'GM', 'IBM']
stock_d = {}
for file in files: #reading all files into one dictionary:
    stock_d[file] = pd.read_csv(file + '.csv', parse_dates=['date'])
date_column = pd.Series() #the column with all dates from all CSV
for stock in stock_d:
    date_column = date_column.append(stock_d[stock]['date'])
date_column = date_column.drop_duplicates().sort_values(ignore_index=True) #keeping only unique values, then sorting by date
df = pd.DataFrame(date_column, columns=['date']) #creating final DataFrame
for stock in stock_d:
    stock_df = stock_d[stock] #this is one of CSV files, for example FB.csv
    df[stock] = [stock_df.iloc[stock_df.index[stock_df['date'] == date]]['close'] for date in date_column] #for each date in date_column adding close price to resulting DF, or should be None if date not found
print(df.tail()) #something strange here - Series objects in every column
The idea is first to extract all dates from each file, then to distribute close prices among according columns and dates. But obviously I'm doing something wrong.
Can you help me please?
If I understand you correctly, what you are looking for is the pivot operation:
import pandas as pd
files = ['FB', 'MSFT', 'GM', 'IBM']
df = [] # this is a list, not a dictionary
for file in files:
    # You only care about date and closing price
    # so only keep those 2 columns to save memory
    tmp = pd.read_csv(file + '.csv', parse_dates=['date'], usecols=['date', 'close']).assign(symbol=file)
    df.append(tmp)
# A single `concat` is faster than sequential `append`s
# values='close' gives plain symbol columns instead of a ('close', symbol) MultiIndex
df = pd.concat(df).pivot(index='date', columns='symbol', values='close')
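Here is a small sketch of what the pivot does when the files have different history depth. The quotes below are made up; only the column names come from the question.

```python
import pandas as pd

# Made-up quotes with different history depth per symbol
fb = pd.DataFrame({"date": pd.to_datetime(["2012-05-21", "2012-05-28"]),
                   "close": [31.9, 27.7]})
ibm = pd.DataFrame({"date": pd.to_datetime(["2012-05-14", "2012-05-21", "2012-05-28"]),
                    "close": [195.9, 194.3, 189.1]})

# Long format: one row per (date, symbol) pair
long_df = pd.concat([fb.assign(symbol="FB"), ibm.assign(symbol="IBM")])

# Wide format: one row per date, one column per symbol; missing dates become NaN
wide = long_df.pivot(index="date", columns="symbol", values="close")
```

The row for 2012-05-14 has a value for IBM and NaN for FB, which is the date alignment the question asks for.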

python pandas exported csv format different from imported issue

I have a strange issue with the pandas.read_csv function. I exported my dataframe to a csv, but when I re-import the same csv, the imported data does not work when I try to merge (the merge shows all the data on the left and none of the data I tried to merge it with). If I use the original data from before it was exported to the csv, it works completely fine (the merge is perfect).
import datetime
import pandas as pd
df = df.values_list('id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date','updated','timestamp', flat=False)
#inserting the collected data into a dataframe for manipulation
df = pd.DataFrame(list(df))
#giving the dataframe column names
df.columns = ['id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date','updated','timestamp']
df = df[['id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date']]
#rename required columns
df.rename(columns={'uniquecount':'Unique Views','nonuniquecount':'Views','msgcount':'Messages','ordercount':'Orders'}, inplace=True)
print(df)
print(df.dtypes)
# exporting df out to a csv
# df.to_csv('test.csv', header=True)
# importing the df back from a csv
df = pd.read_csv('test.csv', index_col=0)
print(df)
print(df.dtypes)
#insert dates
numdays = 14
base = datetime.datetime.today().date()
date_list = [base - datetime.timedelta(days=x) for x in range(0, numdays)]
dates = pd.DataFrame(date_list)
dates.columns = ['date']
#merge the complete dates with the dataframe
df = pd.merge(dates ,df , on=['date'] , how='left')
# print(df)
I have checked and compared the dataframes: they look exactly the same before the export and after importing from the csv (I printed the output twice, once before the export and once after). I have also checked that the data types are all the same.
I need to export the csv to work in an external environment because I can't attach my local database.
Attached is a copy of the command-line output, which shows that both dataframes look identical.
Below is a sample of my exported csv:
,id,teacher_id,Unique Views,Views,Messages,Orders,date
0,47,31,1,6,0,0,2017-05-09
1,56,31,1,9,0,0,2017-05-10
2,67,31,2,11,0,0,2017-05-14
3,71,31,3,15,0,0,2017-05-15
4,79,31,3,17,0,0,2017-06-12
5,83,31,3,18,0,1,2017-06-18
Does anyone have any idea on this strange issue?
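Before calling merge, try converting both date columns with to_datetime first. After the round trip through CSV, df.date holds plain strings, while dates.date holds datetime.date objects; both can show up as dtype object, so the frames look the same, but the values never compare equal.
df.date = pd.to_datetime(df.date)
dates.date = pd.to_datetime(dates.date)
#merge the complete dates with the dataframe
df = pd.merge(dates ,df , on=['date'] , how='left')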
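The mismatch can be reproduced with a short sketch. The two-row CSV text and the fixed base date below are made up so that the example is repeatable; they are not the asker's data.

```python
import datetime
import io
import pandas as pd

# A CSV round trip leaves the dates as plain strings
csv_text = "id,date\n47,2017-05-09\n56,2017-05-10\n"
df = pd.read_csv(io.StringIO(csv_text))

# The date list holds datetime.date objects, as in the question
base = datetime.date(2017, 5, 10)
dates = pd.DataFrame({"date": [base - datetime.timedelta(days=x) for x in range(3)]})

# Strings never equal datetime.date objects, so nothing matches
unmatched = pd.merge(dates, df, on="date", how="left")

# Converting both key columns to datetime64 makes the keys comparable
df["date"] = pd.to_datetime(df["date"])
dates["date"] = pd.to_datetime(dates["date"])
matched = pd.merge(dates, df, on="date", how="left")
```

Before the conversion every id in the merged frame is NaN; afterwards the two dates present in the CSV pick up their ids. Passing parse_dates=['date'] to read_csv would be another way to get real dates on the CSV side.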

Python: outputting lists to excel

For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (match Fama & French factors with a sample of event dates). However, when I try to export it to Excel I can't seem to get the correct output, i.e. it doesn't contain column headings such as the dates and the names of the Fama & French factors, or the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
                        infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
                 infer_datetime_format=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date']-pd.DateOffset(months=60)
    end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df,df,left_index=True,right_index=True)
m.columns = ['Start','End']
ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range= (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Writing a dataframe to Excel or csv should be possible with:
ff_five.to_excel('Filename.xlsx')
Use to_csv instead of to_excel if you want a csv file.
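Note that ff_factors in the question is a list of DataFrames, and pd.DataFrame(data=ff_factors) does not stack them into one table. One possible approach, sketched here with a made-up factor table (the 'Mkt-RF' column and all values are assumptions), is to concatenate the slices and label each with its event number before writing:

```python
import pandas as pd

# Made-up factor table; real data would come from the Fama & French file
ff_five = pd.DataFrame({
    "Date": pd.to_datetime(["2012-10-31", "2012-11-30", "2012-12-31"]),
    "Mkt-RF": [0.5, 0.8, 1.2],
})

# One (start, end] window per event, as in the question
windows = [
    (pd.Timestamp("2012-10-01"), pd.Timestamp("2012-11-30")),
    (pd.Timestamp("2012-11-01"), pd.Timestamp("2012-12-31")),
]
ff_factors = [
    ff_five.loc[(ff_five["Date"] > start) & (ff_five["Date"] <= end)]
    for start, end in windows
]

# Stack the per-event slices; the 'event' level records which window a row came from
combined = pd.concat(
    ff_factors, keys=list(range(len(ff_factors))), names=["event", "row"]
)

# Writing needs an Excel engine such as openpyxl or xlsxwriter installed:
# combined.to_excel("estimation_data.xlsx", sheet_name="Sheet1")
```

The combined frame keeps the original column headings, so they would appear in the spreadsheet along with the event label.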
OK, I tried to interpret what you are trying to do, although it is not very clear. If I am interpreting it correctly, you are trying to create some additional columns based on other data. Instead of creating separate lists, you could add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see if this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
# the window starts 60 months before each event date and ends on it
df['Start'] = df['Date'] - pd.DateOffset(months=60)
df = df.rename(columns={'Date': 'End'})
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
# row-by-row check (aligned on the index) that the factor date falls in (Start, End]
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
