How to merge some CSV files into one DataFrame? - python

I have some CSV files with exactly the same structure of stock quotes (timeframe is one day):
date,open,high,low,close
2001-10-15 00:00:00 UTC,56.11,59.8,55.0,57.9
2001-10-22 00:00:00 UTC,57.9,63.63,56.88,62.18
I want to merge them all into one DataFrame with only the close price column for each stock. The problem is that different files have different history depths (they start from different dates in different years). I want to align them all by date in one DataFrame.
I'm trying to run the following code, but I get nonsense in the resulting df:
files = ['FB', 'MSFT', 'GM', 'IBM']
stock_d = {}
for file in files:  # reading all files into one dictionary
    stock_d[file] = pd.read_csv(file + '.csv', parse_dates=['date'])
date_column = pd.Series()  # the column with all dates from all CSV files
for stock in stock_d:
    date_column = date_column.append(stock_d[stock]['date'])
date_column = date_column.drop_duplicates().sort_values(ignore_index=True)  # keeping only unique values, then sorting by date
df = pd.DataFrame(date_column, columns=['date'])  # creating final DataFrame
for stock in stock_d:
    stock_df = stock_d[stock]  # this is one of the CSV files, for example FB.csv
    df[stock] = [stock_df.iloc[stock_df.index[stock_df['date'] == date]]['close'] for date in date_column]  # for each date in date_column adding close price to resulting DF, or should be None if date not found
print(df.tail())  # something strange here - Series objects in every column
The idea is to first extract all dates from each file, then distribute the close prices into the corresponding columns and dates. But obviously I'm doing something wrong.
Can you help me, please?

If I understand you correctly, what you are looking for is the pivot operation:
import pandas as pd
files = ['FB', 'MSFT', 'GM', 'IBM']
frames = []  # this is a list, not a dictionary
for file in files:
    # You only care about the date and closing price,
    # so keep only those 2 columns to save memory
    tmp = pd.read_csv(file + '.csv', parse_dates=['date'], usecols=['date', 'close']).assign(symbol=file)
    frames.append(tmp)
# A single `concat` is faster than sequential `append`s
# values='close' gives one flat column per symbol instead of a two-level header
df = pd.concat(frames).pivot(index='date', columns='symbol', values='close')
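To see how the pivot aligns files with different history depths, here is a minimal sketch that replaces the real files with two small in-memory CSV strings. The prices and the `csv_by_symbol` dictionary are made up for illustration; only the column layout comes from the question. Passing `values='close'` is my choice to get one plain column per symbol.

```python
import io

import pandas as pd

# Hypothetical stand-ins for FB.csv and MSFT.csv (made-up prices).
# FB starts one week later than MSFT, to mimic different history depths.
csv_by_symbol = {
    "FB": (
        "date,open,high,low,close\n"
        "2001-10-22 00:00:00 UTC,30.0,31.5,29.5,31.0\n"
    ),
    "MSFT": (
        "date,open,high,low,close\n"
        "2001-10-15 00:00:00 UTC,56.11,59.8,55.0,57.9\n"
        "2001-10-22 00:00:00 UTC,57.9,63.63,56.88,62.18\n"
    ),
}

frames = []
for symbol, text in csv_by_symbol.items():
    tmp = pd.read_csv(io.StringIO(text), parse_dates=["date"],
                      usecols=["date", "close"])
    frames.append(tmp.assign(symbol=symbol))

# One row per date, one column per symbol; dates missing from a file become NaN
closes = pd.concat(frames).pivot(index="date", columns="symbol", values="close")
print(closes)
```

The row for the earlier date should hold a missing value in the FB column, because that file has no quote for it.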

Related

Populating script results into CSV file in correct format using pandas

Excuse rookie errors, I'm pretty new to Python and even newer to using pandas.
So far the script counts the rows of the CSV file below (generated by a different script) and stores that value correctly as totalDailyListings. There is some date formatting because I want the script to count the previous day's listings.
The current structure of the CSV file whose rows are being counted:
Address          Price
123 Main St      £305,000
55 Chance Way    £200,000
from datetime import date, timedelta
oneday = timedelta(days=1)
yesterdayDate = date.today() - oneday
filename = 'ballymena-' + str(yesterdayDate) + '.csv'
results = pd.read_csv(filename)
totalDailyListings = (len(results))
# calculates the correct number of rows in csv
listings = pd.DataFrame(
[yesterdayDate, totalDailyListings], columns=['Date', 'Total'])
listings.to_csv('listings-count.csv')
However, I'm struggling to create a new CSV file with just two columns, "Date" and "Total", that can be updated (rows added) daily for long-term monitoring. Can anyone offer some tips to point me back in the right direction?
Desired CSV format:
Date          Total
2022-10-04    305
2022-10-05    200
The important part is the structure of the resulting dataframe:
df = pd.DataFrame({'Date': ['2022-10-04', '2022-10-05'], 'Total': [305, 200]})
# With the default index, the CSV gets an extra unnamed index column
df.to_csv('listing-count-default_index.csv')
# With Date as the index, the CSV has just the two columns
df.set_index('Date', inplace=True)
df.to_csv('listing-count.csv')
This should give you just two columns.
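For the daily update part of the question, one possible approach is to append a single row to the log file each day and write the header only when the file does not exist yet. This is a sketch, not the only way to do it: the helper name `append_daily_total` is hypothetical, the totals are made up, and the demo writes to a temporary directory rather than to the real `listings-count.csv`.

```python
import os
import tempfile
from datetime import date, timedelta

import pandas as pd


def append_daily_total(log_path, day, total):
    """Append one Date/Total row; write the header only when creating the file."""
    row = pd.DataFrame({"Date": [day.isoformat()], "Total": [total]})
    is_new_file = not os.path.exists(log_path)
    # mode="a" appends instead of overwriting; index=False keeps just two columns
    row.to_csv(log_path, mode="a", header=is_new_file, index=False)


# Demo in a temporary directory with made-up totals for two consecutive days
log_path = os.path.join(tempfile.mkdtemp(), "listings-count.csv")
yesterday = date.today() - timedelta(days=1)
append_daily_total(log_path, yesterday - timedelta(days=1), 305)
append_daily_total(log_path, yesterday, 200)

log = pd.read_csv(log_path)
print(log)
```

In the original script, the call would take `yesterdayDate` and `totalDailyListings` as the last two arguments.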

Pandas - need to do multiple split operations on filename to create df with Ticker and Exchange for import to SQL

I have the following file:
H:\EOD_DATA_RECENT\DOWNLOADS\SPLIT_ADJUSTED\STC.V_2021-11-08.csv
where the part of the name before the first period is the ticker symbol (STC) and the part after the period but before the underscore is the exchange (V).
The columns in this file are only Date, Open, High, Low, Close, Volume.
I am able to use the commands below (in particular the line with the split operation) to create another column called "Ticker" by splitting the filename. I can then create total_df at the end and eventually push it to SQL, and the new column is there with the proper ticker.
I also need to split out the exchange in the same way. To do that, I have to somehow specify the part of the filename after the first period and before the underscore. I'm not sure whether both operations can be done in the same line, or whether I need a different dataframe for the other split operation.
Thanks a lot.
Here is the code snippet:
import pandas as pd
import glob, os
files = glob.glob("H:\\EOD_DATA_RECENT\\DOWNLOADS\\SPLIT_ADJUSTED\\*.csv")
df = pd.concat([pd.read_csv(fp).assign(Ticker=os.path.basename(fp).split('.')[0])
                for fp in files])
Ticker = df['Ticker']
#Exchange = df['Exchange']
Date = df['Date']
Open = df['Open']
High = df['High']
Low = df['Low']
Close = df['Close']
Volume = df['Volume']
total_df = pd.concat([Ticker, Date, Open, High, Low, Close, Volume],
                     axis=1, keys=['Ticker','Date','Open','High','Low','Close','Volume'])
Assuming filenames like CELZ.US_2021-11-10.csv, you can do both splits in the same assign call and create each column by:
Ticker: splitting at the first period in the filename and taking the first piece
Exchange: taking the piece after that period and splitting it at the underscore
df = pd.concat([
    pd.read_csv(fp).assign(
        Ticker=os.path.basename(fp).split('.')[0],
        Exchange=os.path.basename(fp).split('.')[1].split('_')[0]
    ) for fp in files
])
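The two splits can also be checked in isolation on a bare filename. The helper below is a hypothetical illustration (the name `split_ticker_exchange` is not from the answer) and uses `str.partition`, which always splits at the first occurrence of the separator.

```python
def split_ticker_exchange(filename):
    """Split 'TICKER.EXCHANGE_date.csv' into (ticker, exchange)."""
    ticker, _, rest = filename.partition(".")  # text before the first period
    exchange, _, _ = rest.partition("_")       # text before the first underscore
    return ticker, exchange


parsed = split_ticker_exchange("STC.V_2021-11-08.csv")
print(parsed)  # ('STC', 'V')
print(split_ticker_exchange("CELZ.US_2021-11-10.csv"))  # ('CELZ', 'US')
```

This assumes every filename follows the same TICKER.EXCHANGE_date.csv pattern; a ticker that itself contains a period would need different handling.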

How to put stock prices from csv file into one single dataframe

I am gathering data for the S&P 500 from a CSV file. My question is: how would I create one large dataframe that has 500 columns with all of the prices? The code is currently:
import os
import pandas as pd
import pandas_datareader as web
import datetime as dt
from datetime import date
import numpy as np

def get_data():
    start = dt.datetime(2020, 5, 30)
    end = dt.datetime.now()
    csv_file = pd.read_csv(os.path.expanduser("/Users/benitocano/Downloads/copyOfSandP500.csv"), delimiter = ',')
    tickers = pd.read_csv("/Users/benitocano/Downloads/copyOfSandP500.csv", delimiter=',', names = ['Symbol', 'Name', 'Sector'])
    for i in tickers['Symbol'][:5]:
        df = web.DataReader(i, 'yahoo', start, end)
        df.drop(['High', 'Low', 'Open', 'Close', 'Volume'], axis=1, inplace=True)

get_data()
As the code shows, right now it is just going to create 500 individual dataframes, so I am asking how to make them into one large dataframe. Thanks!
EDIT:
The CSV file link is:
https://datahub.io/core/s-and-p-500-companies
I have tried adding this to the above code:
for stock in data:
    series = pd.Series(stock['Adj Close'])
    df = pd.DataFrame()
    df[ticker] = series
print(df)
Though the output is only one column like so:
                  ADM
Date
2020-06-01  38.574604
2020-06-02  39.348278
2020-06-03  40.181465
2020-06-04  40.806358
2020-06-05  42.175167
...               ...
2020-11-05  47.910000
2020-11-06  48.270000
2020-11-09  49.290001
2020-11-10  50.150002
2020-11-11  50.090000
Why is it printing only one column, rather than all of them?
The answer depends on the structure of the dataframes that your current code produces. Since the code depends on files on your local drive, we cannot run it, so it is hard to be specific here. In general there are many options; among the most common, I would say, are:
Put the dfs into a list and use pandas.concat(..., axis=1) on that list to concatenate them column by column.
Merge (merge or join) your dfs on the Date column that I assume each df has.
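The first option can be sketched with made-up data standing in for the DataReader results. The `frames_by_ticker` dictionary and the second ticker are assumptions for illustration only. Note that in the attempted loop, `df = pd.DataFrame()` is executed on every iteration, so each pass starts from an empty frame, which would explain why only one column survives.

```python
import pandas as pd

dates = pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-03"])

# Made-up frames standing in for what DataReader returns per ticker
frames_by_ticker = {
    "ADM": pd.DataFrame({"Adj Close": [38.5, 39.3, 40.1]}, index=dates),
    "MMM": pd.DataFrame({"Adj Close": [150.0, 151.2]}, index=dates[1:]),
}

# Take one price Series per ticker and name it after the ticker
columns = [frame["Adj Close"].rename(ticker)
           for ticker, frame in frames_by_ticker.items()]

# Concatenate once, column by column; rows are aligned on the date index
prices = pd.concat(columns, axis=1)
print(prices)
```

Dates that a ticker does not cover come out as NaN in that ticker's column.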

Reading multiple csv files into single DataFrame

I am trying to read multiple CSV stock price files, all of which have the following columns: Date, Time, Open, High, Low, Close. The code is:
import pandas as pd
tickers=['gmk','yandex','sberbank']
ohlc_intraday={}
ohlc_intraday['gmk']=pd.read_csv("gmk_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
ohlc_intraday['yandex']=pd.read_csv("yndx_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
ohlc_intraday['sberbank']=pd.read_csv("sber_15min.csv",parse_dates=["<DATE>"],dayfirst=True)
df=copy.deepcopy(ohlc_intraday)
for i in range(len(tickers)):
    df[tickers[i]] = df[tickers[i]].iloc[:, 2:]
    df[tickers[i]].columns = ['Date','Time',"Open", "High", "Low", "Adj Close", "Volume"]
    df[tickers[i]]['Time']=[x+':00' for x in df['Time']]
However, I then get KeyError: 'Time'. It seems the columns are not keys.
Is it possible to read the files into, or convert them to, a DataFrame format keyed by stock ticker (gmk, yandex, sberbank) and column name, so that I can easily extract a value using the following code?
ohlc_intraday['sberbank']['Date'][1]
What you could do is create a single DataFrame with a column that identifies the market.
import pandas as pd
markets = ["gmk", "yandex", "sberbank"]
files = ["gmk_15min.csv", "yndx_15min.csv", "sber_15min.csv"]
dfs = [pd.read_csv(file, parse_dates=["<DATE>"], dayfirst=True)
       for file in files]
# add a market column to each df
for market, df in zip(markets, dfs):
    df['market'] = market
# concatenate into one dataframe
df = pd.concat(dfs)
Then access what you want in this manner (the date column is still named '<DATE>' here unless you rename it as in your loop):
df[df['market'] == 'yandex']['<DATE>'].iloc[1]
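If access by ticker first and column second is the main goal, another option is to pass the dictionary straight to `pd.concat`, which uses the dictionary keys as an outer index level. The sketch below uses small made-up frames in place of the real files, and the level names "market" and "row" are my own choice.

```python
import pandas as pd

# Made-up frames standing in for the per-ticker CSV files
ohlc_intraday = {
    "gmk": pd.DataFrame({
        "Date": ["01/02/2021", "01/02/2021"],
        "Time": ["10:00", "10:15"],
        "Close": [100.0, 101.0],
    }),
    "sberbank": pd.DataFrame({
        "Date": ["01/02/2021", "02/02/2021"],
        "Time": ["10:00", "10:00"],
        "Close": [270.0, 272.5],
    }),
}

# Dictionary keys become the outer level of a two-level row index
combined = pd.concat(ohlc_intraday, names=["market", "row"])

# Select one ticker, then a column, then a position
second_sber_date = combined.loc["sberbank"]["Date"].iloc[1]
print(second_sber_date)
```

Column-wide operations such as appending ':00' to every Time value can then be done once on `combined` rather than once per ticker.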

Python: outputting lists to excel

For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (match Fama & French factors with a sample of event dates). However, when I try to export it to Excel I can't seem to get the correct output, i.e. it doesn't contain column headings such as the dates and the names of the Fama & French factors, or the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
infer_datetime_format=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date'] - pd.DateOffset(months=60)
    end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df, df, left_index=True, right_index=True)
m.columns = ['Start', 'End']
ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range = (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Writing a dataframe to CSV or Excel can be done with
ff_five.to_excel('Filename.xlsx')
Use to_csv (with a .csv filename) instead if you want a CSV file.
OK, I tried to interpret what you are trying to do, although it is not very clear. If I understand correctly, you are trying to create some additional columns based on other data. Instead of creating separate lists, you could add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see whether this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
df = pd.DataFrame()
df['End'] = pd.to_datetime(["2012-12-01", "2012-12-30"])  # event dates
df['Start'] = df['End'] - pd.DateOffset(months=60)  # 60 months before each event
# Row-by-row check: does ff_five's date fall inside the same row's window?
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
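If the goal is to keep the factor rows themselves for every event, with proper column headings, one option is to collect the per-event windows in a dictionary and concatenate them into a single DataFrame before exporting. This is only a sketch under assumptions: the factor values, the event dates and the `Mkt-RF` column are made up, and the window is shortened to 3 months (the question uses 60) so the example stays small.

```python
import pandas as pd

# Made-up monthly factor data standing in for the Fama & French file
ff_five = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31",
                            "2012-04-30", "2012-05-31", "2012-06-30"]),
    "Mkt-RF": [0.5, 0.4, 0.3, -0.1, -0.6, 0.4],
})

events = pd.to_datetime(["2012-04-15", "2012-06-15"])  # made-up event dates
window_months = 3  # the question uses 60

windows = {}
for event in events:
    start = event - pd.DateOffset(months=window_months)
    in_window = (ff_five["Date"] > start) & (ff_five["Date"] <= event)
    windows[event.strftime("%Y-%m-%d")] = ff_five.loc[in_window]

# One long table: the dictionary keys become an "Event" column
estimation = (
    pd.concat(windows, names=["Event", "Row"])
    .reset_index(level="Row", drop=True)
    .reset_index()
)
print(estimation)
```

Because `estimation` is an ordinary DataFrame, `estimation.to_excel('estimation_data.xlsx', index=False)` should then write the Event, Date and factor headings along with the rows, provided an Excel writer engine such as openpyxl or XlsxWriter is installed.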
