Reading multiple CSV files in Spark and making a DataFrame - python

I am using the following code to read multiple CSV files, convert each one to a pandas df, and concatenate them into a single pandas df, which is finally converted back into a Spark DataFrame. I want to skip the pandas conversion entirely and end up with a Spark DataFrame directly.
File Paths
abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=1/*.csv
abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=2/*.csv
......
Code
import pandas as pd
from pyspark.sql.utils import AnalysisException

pdf_list = []  # renamed from `list`, which shadows the built-in
for month in range(1, 3):
    for day in range(1, 31):
        for hour in range(0, 24):
            file_location = ("abfss://xxxxxx/abc/year=2021/month=" + str(month)
                             + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv")
            try:
                spark_df = spark.read.format("csv").option("header", "true").load(file_location)
                pandas_df = spark_df.toPandas()
                pdf_list.append(pandas_df)
            except AnalysisException as e:
                print(e)

final_pandas_df = pd.concat(pdf_list)
df = spark.createDataFrame(final_pandas_df)

You can load all the files and apply a filter on the partitioning columns:
df = spark.read.format("csv").option("header", "true").load("abfss://xxxxxx/abc/").filter(
    'year = 2021 and month between 1 and 2 and dayofmonth between 1 and 30 and hour between 0 and 23'
)
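If you would rather not list the whole container, you can also pass glob paths for just the partitions you need. A minimal sketch, assuming the same abfss layout as in the question; the basePath option (a standard Spark reader option) keeps year/month/dayofmonth/hour available as columns:

# Sketch: list only the January-February partitions instead of the whole
# container. "xxxxxx" stands in for the real storage path, as in the question.
paths = [
    f"abfss://xxxxxx/abc/year=2021/month={m}/dayofmonth=*/hour=*/*.csv"
    for m in range(1, 3)
]
df = (spark.read.format("csv")
      .option("header", "true")
      .option("basePath", "abfss://xxxxxx/abc/")  # keep partition columns
      .load(paths))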

Related

How to merge some CSV files into one DataFrame?

I have some CSV files with exactly the same structure of stock quotes (timeframe is one day):
date,open,high,low,close
2001-10-15 00:00:00 UTC,56.11,59.8,55.0,57.9
2001-10-22 00:00:00 UTC,57.9,63.63,56.88,62.18
I want to merge them all into one DataFrame with only the close price column for each stock. The problem is that different files have different history depths (they start from different dates in different years). I want to align them all by date in one DataFrame.
I'm trying to run the following code, but I get nonsense in the resulting df:
import pandas as pd

files = ['FB', 'MSFT', 'GM', 'IBM']
stock_d = {}
for file in files:  # reading all files into one dictionary
    stock_d[file] = pd.read_csv(file + '.csv', parse_dates=['date'])

date_column = pd.Series()  # the column with all dates from all CSVs
for stock in stock_d:
    date_column = date_column.append(stock_d[stock]['date'])
date_column = date_column.drop_duplicates().sort_values(ignore_index=True)  # keep only unique values, then sort by date

df = pd.DataFrame(date_column, columns=['date'])  # creating the final DataFrame
for stock in stock_d:
    stock_df = stock_d[stock]  # one of the CSV files, for example FB.csv
    # for each date in date_column, add the close price to the resulting df,
    # or None if the date is not found
    df[stock] = [stock_df.iloc[stock_df.index[stock_df['date'] == date]]['close'] for date in date_column]

print(df.tail())  # something strange here - Series objects in every column
The idea is first to extract all dates from all the files, then to distribute the close prices into the corresponding columns and dates. But obviously I'm doing something wrong.
Can you help me, please?
If I understand you correctly, what you are looking for is the pivot operation:
files = ['FB', 'MSFT', 'GM', 'IBM']
df = []  # this is a list, not a dictionary
for file in files:
    # You only care about date and closing price,
    # so keep just those 2 columns to save memory
    tmp = pd.read_csv(file + '.csv', parse_dates=['date'], usecols=['date', 'close']).assign(symbol=file)
    df.append(tmp)

# A single `concat` is faster than sequential `append`s
df = pd.concat(df).pivot(index='date', columns='symbol')
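A small refinement, if you want flat column names: without values=, pivot keeps every remaining column and produces a ('close', symbol) MultiIndex in the columns. Passing values='close' gives one plain close-price column per symbol, i.e. the last line becomes:

# values= makes each symbol a flat 'close' column instead of a MultiIndex level
df = pd.concat(df).pivot(index='date', columns='symbol', values='close')
print(df.tail())  # one column per ticker, rows aligned on date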

Reading multiple CSV files into a single DataFrame

I am trying to read multiple CSV stock price files, all of which have the following columns: Date, Time, Open, High, Low, Close. The code is:
import copy
import pandas as pd

tickers = ['gmk', 'yandex', 'sberbank']
ohlc_intraday = {}
ohlc_intraday['gmk'] = pd.read_csv("gmk_15min.csv", parse_dates=["<DATE>"], dayfirst=True)
ohlc_intraday['yandex'] = pd.read_csv("yndx_15min.csv", parse_dates=["<DATE>"], dayfirst=True)
ohlc_intraday['sberbank'] = pd.read_csv("sber_15min.csv", parse_dates=["<DATE>"], dayfirst=True)

df = copy.deepcopy(ohlc_intraday)
for i in range(len(tickers)):
    df[tickers[i]] = df[tickers[i]].iloc[:, 2:]
    df[tickers[i]].columns = ['Date', 'Time', "Open", "High", "Low", "Adj Close", "Volume"]
    df[tickers[i]]['Time'] = [x + ':00' for x in df['Time']]
However, I am then faced with KeyError: 'Time'. It seems the columns are not keys.
Is it possible to read or convert it into a DataFrame format where the keys are the stock tickers (gmk, yandex, sberbank) and the column names, so that I can easily extract a value using code like the following?
ohlc_intraday['sberbank']['Date'][1]
What you could do is create a DataFrame that has a column that specifies the market.
import pandas as pd

tickers = ["gmk", "yandex", "sberbank"]
files = ["gmk_15min.csv", "yndx_15min.csv", "sber_15min.csv"]

dfs = [pd.read_csv(file, parse_dates=["<DATE>"], dayfirst=True)
       for file in files]

# add a market column to each df
for ticker, df in zip(tickers, dfs):
    df['market'] = ticker

# concatenate into one dataframe
df = pd.concat(dfs)
Then access what you want in this manner:
df[df['market'] == 'yandex']['Date'].iloc[1]
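Alternatively, if you'd rather keep the dict-style access from the question (ohlc_intraday['sberbank']['Date'][1]), a sketch using concat's keys= parameter (reusing tickers and dfs from above) builds a ticker-keyed hierarchical index instead of a column:

# keys= labels each frame with its ticker, giving a (ticker, row) MultiIndex
combined = pd.concat(dfs, keys=tickers, names=["ticker", "row"])
# second row's date for one ticker; "<DATE>" is the raw column name here
combined.loc["sberbank", "<DATE>"].iloc[1]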

Python: How to convert a collections.OrderedDict to a DataFrame

I have the following task:
1) I have an Excel file with a few spreadsheets. From these spreadsheets I need the information in columns "A:CU", rows 41-51.
2) Then I need to collect the information from columns "A:CU", rows 41-51 of all the spreadsheets (they have the same structure) and combine it into a single dataset.
3) There should be a column that indicates which spreadsheet the data was collected from.
I did the following:
import pandas as pd

file = 'January2020.xlsx'

# getting info from spreadsheets C(1), C(2) and so on
days = range(1, 32)
sheets = []
for day in days:
    sheets.append('C(' + str(day) + ')')

# importing data
all_sales = pd.read_excel(file, header=None, skiprows=41, usecols="A:CU",
                          sheet_name=sheets, skipfooter=10)
Now I have a collections.OrderedDict and am struggling to put it into a DataFrame.
What I need is a single DataFrame combining all the sheets, with a column indicating which sheet each row came from.
Try pd.concat:
df = pd.concat(all_sales, ignore_index=True)
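If you also need requirement 3 (a column saying which spreadsheet each row came from), note that concat accepts the dict directly and uses the sheet names as the outer index level. A sketch, assuming all_sales is the dict returned by read_excel above:

# Sheet names become the outer index level; reset_index turns that
# level into a regular "sheet" column.
df = (pd.concat(all_sales, names=["sheet", "row"])
        .reset_index(level="sheet")
        .reset_index(drop=True))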
I used this code and it worked:
file = 'January2020.xlsx'
days = range(1, 32)
all_df = []
for day in days:
    sheet_name = "C(" + str(day) + ")"
    all_sales = pd.read_excel(file, header=None, skiprows=41, usecols="A:CU",
                              sheet_name=sheet_name, skipfooter=10)
    all_sales["Date"] = sheet_name
    all_df.append(all_sales)
df_final = pd.concat(all_df)

python pandas exported csv format different from imported issue

I have a strange issue with the pandas.read_csv function. I exported my dataframe to a CSV, but when I re-import that same CSV, the data no longer merges correctly (the merge keeps all the rows on the left and matches none of the rows I try to merge with). If I use the original data from before it was exported to the CSV, it works completely fine (the merge is perfect).
# `df` starts out as a Django-style queryset; values_list pulls the raw rows
df = df.values_list('id', 'teacher_id', 'uniquecount', 'nonuniquecount', 'msgcount',
                    'ordercount', 'date', 'updated', 'timestamp', flat=False)
# inserting the collected data into a dataframe for manipulation
df = pd.DataFrame(list(df))
# giving the dataframe column names
df.columns = ['id', 'teacher_id', 'uniquecount', 'nonuniquecount', 'msgcount',
              'ordercount', 'date', 'updated', 'timestamp']
df = df[['id', 'teacher_id', 'uniquecount', 'nonuniquecount', 'msgcount', 'ordercount', 'date']]
# rename required columns
df.rename(columns={'uniquecount': 'Unique Views', 'nonuniquecount': 'Views',
                   'msgcount': 'Messages', 'ordercount': 'Orders'}, inplace=True)
print(df)
print(df.dtypes)

# exporting df out to a csv
# df.to_csv('test.csv', header=True)

# importing the df back from a csv
df = pd.read_csv('test.csv', index_col=0)
print(df)
print(df.dtypes)

# insert dates
numdays = 14
base = datetime.datetime.today().date()
date_list = [base - datetime.timedelta(days=x) for x in range(0, numdays)]
dates = pd.DataFrame(date_list)
dates.columns = ['date']

# merge the complete dates with the dataframe
df = pd.merge(dates, df, on=['date'], how='left')
# print(df)
I have checked and compared that the dataframes look exactly the same before export and after importing from the csv (I printed the output twice, once before export and once after). I have also checked that the dtypes are all the same.
I need to export the csv to work with an external environment because I can't attach my local database.
Attached is a copy of the command-line output, which shows that both dataframes are exactly the same.
Attached below is a sample of my exported csv:
,id,teacher_id,Unique Views,Views,Messages,Orders,date
0,47,31,1,6,0,0,2017-05-09
1,56,31,1,9,0,0,2017-05-10
2,67,31,2,11,0,0,2017-05-14
3,71,31,3,15,0,0,2017-05-15
4,79,31,3,17,0,0,2017-06-12
5,83,31,3,18,0,1,2017-06-18
Does anyone have any idea on this strange issue?
Before calling merge, try converting both date columns using to_datetime first, as suggested in this related answer:
df.date = pd.to_datetime(df.date)
dates.date = pd.to_datetime(dates.date)
# merge the complete dates with the dataframe
df = pd.merge(dates, df, on=['date'], how='left')
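Equivalently, the dates can be parsed at read time so the column comes back as datetime64 rather than plain strings. A sketch, assuming the same test.csv and date_list as in the question:

# parse_dates restores the datetime dtype that to_csv flattened to text,
# so the merge keys on both sides have matching types.
df = pd.read_csv('test.csv', index_col=0, parse_dates=['date'])
dates = pd.DataFrame({'date': pd.to_datetime(date_list)})
df = pd.merge(dates, df, on='date', how='left')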

Python: outputting lists to excel

For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (matching Fama & French factors with a sample of event dates). However, when I try to export the result to Excel I can't seem to get the correct output, i.e. it doesn't contain the column headings, such as the dates and the names of the Fama & French factors, or the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd

# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
                        infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
                 infer_datetime_format=True)

# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

# Creating an empty placeholder
end_date = []

# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date'] - pd.DateOffset(months=60)
    end_date.append(end_da)

end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df, df, left_index=True, right_index=True)
m.columns = ['Start', 'End']

ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range = (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel:
ff_factors_df = pd.DataFrame(data=ff_factors)

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()
Outputting a dataframe to CSV or Excel should be possible with:
ff_five.to_excel('Filename.xls')
Use to_csv instead if you want a CSV.
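Since ff_factors in the question is a list of DataFrames (one per event window), a sketch that would keep the headers intact is to concatenate the list before writing, rather than wrapping it in pd.DataFrame:

# Combine the per-event frames into one table, then write a single sheet.
combined = pd.concat(ff_factors, ignore_index=True)
combined.to_excel('estimation_data.xlsx', sheet_name='Sheet1', index=False)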
OK, I tried to interpret what you were trying to do without it being very clear. If I'm interpreting it correctly, you are trying to create some additional columns based on other data. Instead of creating separate lists, you could just add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see if this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
df['End'] = df['Date'] - pd.DateOffset(months=60)
df.columns = ['Start', 'End']
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
