Reading Multiple Spreadsheets - File Paths From Central Spreadsheet - python

I am attempting to load multiple (hundreds) of spreadsheets into one dataframe. The problem is these spreadsheets are located in different folders/paths. I am hoping to iterate through a central spreadsheet that lists all of the specific paths (each spreadsheet contains a tab named "Test" that I am hoping to pull, this tab has the same structure/layout across all spreadsheets) but am having some issues.
I have listed everything that might be helpful below, any insight would be greatly appreciated!
Existing Code Problems:
I receive a TypeError: cannot concatenate object of type class numpy.ndarray specific to the row where I am using concat(df.values) below
I would like to add a column that lists the "Identifier" value for each spreadsheet in the aggregated dataframe (so that I can group by specific company later on)
Current Code:
df_0 = pd.read_excel(r'PATH TO CENTRAL SPREADSHEET')
list_of_paths = df_0['Path'].tolist()
all_data = pd.DataFrame()
for itr in range(len(list_of_paths)):
df = pd.read_excel(list_of_paths[itr], sheet_name="Test", ignore_index=True)
cdf = pd.concat(df.values)
all_data = all_data.append(cdf,ignore_index=True)
Central Spreadsheet:
Identifier Path
AAPL PATH TO UNDERLYING AAPL FILE
GOOG PATH TO UNDERLYING GOOG FILE
Example of Underlying File ("Test" tab) Structure
Metric 2018 2017
Revenue 2mm 3mm
Expense 1mm 2mm
Desired Output
Metric Ticker 2018 2017
Revenue AAPL 2mm 3mm
Revenue GOOG 5mm 8mm
Expense AAPL 1mm 2mm
Expense GOOG 4mm 6mm

Doing it in steps:
Aim: load the spreadsheets into a list of df's
df_0 = pd.read_excel(r'PATH TO CENTRAL SPREADSHEET')
dict_of_paths = {}
for i,j in df_0.iterrows():
dict_of_paths[j['Identifier']] = j['Path']
df_list = []
for key in dict_of_paths.keys():
df = pd.read_excel(dict_of_paths[key], sheet_name="Test")
df['ticker'] = key
df_list.append(df)
Now all the df's are in the df_list
mdf = pd.concat(df_list,ignore_index=True)
As long as the columns are the same. This should work.

Related

How to get values from a dict into a new column, based on values in column

I have a dictionary that contains all of the information for company ticker : sector. For example 'AAPL':'Technology'.
I have a CSV file that looks like this:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000
I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector.
Code:
for ticker in company_dic:
sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])
The code is giving me problems.
For example, the first sector is healthcare, I get this error:
ValueError: Healthcare is an unknown string function
Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply to select values from a dict, by using a column value as a key, because .map is the method specifically implemented for this operation.
.map will return NaN if the ticker is not in the dict.
.apply can be used, but .map should be used
df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))
.get will return None if the ticker isn't in the dict.
import pandas as pd
# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'dimension': ['ARQ', 'ARQ', 'ARQ'], 'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'], 'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})
# in your case, load the data from the file
df = pd.read_csv('file.csv')
# display(df)
ticker dimension calendardate datekey
0 AAPL ARQ 1999-12-31 2000-03-15
1 AAPL ARQ 2000-03-31 2000-06-12
2 AAPL ARQ 2000-06-30 2000-09-01
# dict of sectors
company_dict = {'AAPL': 'tech'}
# insert the sector column using map, into a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))
# display(df)
ticker sector dimension calendardate datekey
0 AAPL tech ARQ 1999-12-31 2000-03-15
1 AAPL tech ARQ 2000-03-31 2000-06-12
2 AAPL tech ARQ 2000-06-30 2000-09-01
# write the updated data back to the csv file
df.to_csv('file.csv', index=Fales)
temp = sf1.ticker.map(lambda x: company_dic[str(x)]) (#faster than for loop)
sf1['sector'] = temp
You can pass na_action='ignore' if you have NAN's in tickers column

Import multiple excel files, create a column and get values from excel file's name

I need to upload multiple excel files - each one has a name of starting date. Eg. "20190114".
Then I need to append them in one DataFrame.
For this, I use the following code:
all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
In fact, I do not need all data, but filtered by multiple columns.
Then, I would like to create an additional column ('from') with values of file name (which is "date") for each respective file.
Example:
Data from the excel file, named '20190101'
Data from the excel file, named '20190115'
The final dataframe must have values in 'price' column not equal to '0' and in code column - with code='r' (I do not know if it's possible to export this data already filtered, avoiding exporting huge volume of data?) and then I need to add a column 'from' with the respective date coming from file's name:
like this:
dataframes for trial:
import pandas as pd
df1 = pd.DataFrame({'id':['id_1', 'id_2','id_3', 'id_4','id_5'],
'price':[0,12.5,17.5,24.5,7.5],
'code':['r','r','r','c','r'] })
df2 = pd.DataFrame({'id':['id_1', 'id_2','id_3', 'id_4','id_5'],
'price':[7.5,24.5,0,149.5,7.5],
'code':['r','r','r','c','r'] })
IIUC, you can filter necessary rows ,then concat, for file name you can use os.path.split() and access the filename with string slicing:
l=[]
for f in glob.glob('C:\\path\\*.xlsx'):
df=pd.read_excel(f)
df['from']=os.path.split(f)[1][:-5]
l.append(df[(df['code'].eq('r')&df['price'].ne(0))])
pd.concat(l,ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115

How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row include the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE,group in title_grouped:
print(TITLE)
print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
Decompose the data by the ROLE_TYPE
Prepare the data for merge (rename columns and drop unnecessary columns)
Merge everything back into one DataFrame
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecesary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecesary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)

Split a very large panda dataframe in smaller ones efficiently knowing the large one is ordered

I'm reading in a very large (15M lines) csv file into a panda dataframe. I then want to split it in smaller ones (ultimately creating smaller csv files, or a panda panel...).
I have working code but it's VERY slow. I believe it's not taking advantage of the fact that my dataframe is 'ordered'.
The df looks like:
ticker date open high low
0 AAPL 1999-11-18 45.50 50.0000 40.0000
1 AAPL 1999-11-19 42.94 43.0000 39.8100
2 AAPL 1999-11-22 41.31 44.0000 40.0600
...
1000 MSFT 1999-11-18 45.50 50.0000 40.0000
1001 MSFT 1999-11-19 42.94 43.0000 39.8100
1002 MSFT 1999-11-22 41.31 44.0000 40.0600
...
7663 IBM 1999-11-18 45.50 50.0000 40.0000
7664 IBM 1999-11-19 42.94 43.0000 39.8100
7665 IBM 1999-11-22 41.31 44.0000 40.0600
I want to take all rows where symbol=='AAPL', and make a dataframe with it. Then all rows where symbol=='MSFT', and so on. The number of rows for each symbol is NOT the same, and the code has to adapt. I might load in a new 'large' csv where everything is different.
This is what I came up with:
#Read database
alldata = pd.read_csv('./alldata.csv')
#get a list of all unique ticker present in the database
alltickers = alldata.iloc[:,0].unique();
#write data of each ticker in its own csv file
for ticker in alltickers:
print('Creating csv for '+ticker)
#get data for current ticker
tickerdata = alldata.loc[alldata['ticker'] == ticker]
#remove column with ticker symbol (will be the file name) and reindex as
#we're grabbing from somwhere in a large dataframe
tickerdata = tickerdata.iloc[:,1:13].reset_index(drop=True)
#write csv
tickerdata.to_csv('./split/'+ticker+'.csv')
this takes forever to run. I thought it was the file I/O, but I commented the write csv part in the for loop, and I see that this line is the problem:
tickerdata = alldata.loc[alldata['ticker'] == ticker]
I wonder if pandas is looking in the WHOLE dataframe every single time. I do know that the dataframe is in order of ticker. Is there a way to leverage that?
Thank you very much!
Dave
Easiest way to do this is to create a dictionary of the dataframes using a dictionary comprehension and pandas groupby
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}
dodf['IBM']
ticker date open high low
7663 IBM 1999-11-18 45.50 50.0 40.00
7664 IBM 1999-11-19 42.94 43.0 39.81
7665 IBM 1999-11-22 41.31 44.0 40.06
It makes sense that creating a boolean index of length 15 million, and doing it repeatedly, is going to take a little while. Honestly, for splitting the file into subfiles, I think Pandas is the wrong tool for the job. I'd just use a simple loop to iterate over the lines in the input file, writing them to the appropriate output file as they come. This doesn't even have to load the whole file at once, so it will be fairly fast.
import itertools as it
tickers = set()
with open('./alldata.csv') as f:
headers = next(f)
for ticker, lines in it.groupby(f, lambda s: s.split(',', 1)[0]):
with open('./split/{}.csv'.format(ticker), 'a') as w:
if ticker not in tickers:
w.writelines([headers])
tickers.add(ticker)
w.writelines(lines)
Then you can load each individual split file using pd.read_csv() and turn that into its own DataFrame.
If you know that the file is ordered by ticker, then you can skip everything involving the set tickers (which tracks which tickers have already been encountered). But that's a fairly cheap check.
Probably, the best approach is to use groupby. Suppose:
>>> df
ticker v1 v2
0 A 6 0.655625
1 A 2 0.573070
2 A 7 0.549985
3 B 32 0.155053
4 B 10 0.438095
5 B 26 0.310344
6 C 23 0.558831
7 C 15 0.930617
8 C 32 0.276483
Then group:
>>> grouped = df.groupby('ticker', as_index=False)
Finally, iterate over your groups:
>>> for g, df_g in grouped:
... print('creating csv for ', g)
... print(df_g.to_csv())
...
creating csv for A
,ticker,v1,v2
0,A,6,0.6556248347252436
1,A,2,0.5730698850517599
2,A,7,0.5499849530664374
creating csv for B
,ticker,v1,v2
3,B,32,0.15505313728451087
4,B,10,0.43809490694469133
5,B,26,0.31034386153099336
creating csv for C
,ticker,v1,v2
6,C,23,0.5588311692150466
7,C,15,0.930617426953476
8,C,32,0.2764826801584902
Of course, here I am printing a csv, but you can do whatever you want.
Using groupby is great, but it does not take advantage of the fact that the data is presorted and so will likely have more overhead compared to a solution that does. For a large dataset, this could be a noticeable slowdown.
Here is a method which is optimized for the sorted case:
import pandas as pd
import numpy as np
alldata = pd.read_csv("tickers.csv")
tickers = np.array(alldata.ticker)
# use numpy to compute change points, should
# be super fast and yield performance boost over groupby:
change_points = np.where(
tickers[1:] != tickers[:-1])[0].tolist()
# add last point in as well to get last ticker block
change_points += [tickers.size - 1]
prev_idx = 0
for idx in change_points:
ticker = alldata.ticker[idx]
print('Creating csv for ' + ticker)
# get data for current ticker
tickerdata = alldata.iloc[prev_idx: idx + 1]
tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
tickerdata.to_csv('./split/' + ticker + '.csv')
prev_idx = idx + 1

python pandas - yahoo stock data reads into standalone df but not a panel item (which = df)

for some reason, I can get data into a standalone dataframe, but not a dataframe (item) within a panel.
df = pd.Dataframe()
df = pandas.io.data.DataReader('AAPL', 'yahoo', '2015-09-23', '2015-09-30')
pan = pd.Panel()
pan['AAPL'] = pandas.io.data.DataReader('AAPL', 'yahoo', '2015-09-23', '2015-09-30')
Running df yields the below stock data, but pan['AAPL'] returns an empty dataframe.
Date Open High Low Close Volume Adj Close
9/23/2015 113.629997 114.720001 113.300003 114.32 35645700 114.32
9/24/2015 113.25 115.5 112.370003 115 49810600 115
9/25/2015 116.440002 116.690002 114.019997 114.709999 55842200 114.709999
9/28/2015 113.849998 114.57 112.440002 112.440002 51723900 112.440002
9/29/2015 112.830002 113.510002 107.860001 109.059998 73135900 109.059998
9/30/2015 110.169998 111.540001 108.730003 110.300003 66105000 110.300003
Shouldn't a standalone df and a df which is an item inside a panel have the same possibilities ?
Update:
I am able to get the data into a dict with keys = a list of stocks, but i get the feeling that perf will slow down more quickly than if the data were housed in a panel.

Categories