I have a dictionary that maps company ticker to sector; for example, 'AAPL': 'Technology'.
I have a CSV file that looks like this:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000
I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector.
Code:
for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])
The code is giving me problems.
For example, the first sector is Healthcare, and I get this error:
ValueError: Healthcare is an unknown string function
Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to look up values in a dict using a column value as the key; .map is the method implemented specifically for this operation.
.map will return NaN if the ticker is not in the dict.
.apply can also be used, with dict.get, though .map is the better fit here:
df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))
.get will return None if the ticker isn't in the dict.
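To illustrate the difference, here is a minimal sketch using a hypothetical ticker 'XYZ' that is missing from the dict:

```python
import pandas as pd

company_dict = {'AAPL': 'Technology'}
df = pd.DataFrame({'ticker': ['AAPL', 'XYZ']})

# .map leaves NaN for tickers missing from the dict
mapped = df['ticker'].map(company_dict)

# .apply with dict.get leaves None for missing tickers instead
applied = df['ticker'].apply(lambda x: company_dict.get(x))
```

Both produce 'Technology' for 'AAPL'; they differ only in the missing-key placeholder.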
import pandas as pd
# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'dimension': ['ARQ', 'ARQ', 'ARQ'], 'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'], 'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})
# in your case, load the data from the file instead:
# df = pd.read_csv('file.csv')
# display(df)
ticker dimension calendardate datekey
0 AAPL ARQ 1999-12-31 2000-03-15
1 AAPL ARQ 2000-03-31 2000-06-12
2 AAPL ARQ 2000-06-30 2000-09-01
# dict of sectors
company_dict = {'AAPL': 'tech'}
# insert the sector column using map, into a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))
# display(df)
ticker sector dimension calendardate datekey
0 AAPL tech ARQ 1999-12-31 2000-03-15
1 AAPL tech ARQ 2000-03-31 2000-06-12
2 AAPL tech ARQ 2000-06-30 2000-09-01
# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
temp = sf1.ticker.map(lambda x: company_dic[str(x)])  # faster than a for loop
sf1['sector'] = temp
You can pass na_action='ignore' if you have NaNs in the ticker column.
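A small sketch of why na_action matters with a callable like the lambda above (assuming a NaN somewhere in the ticker column):

```python
import numpy as np
import pandas as pd

company_dic = {'AAPL': 'Technology'}
s = pd.Series(['AAPL', np.nan])

# without na_action='ignore', the lambda would receive the NaN itself,
# and company_dic[str(x)] would raise KeyError: 'nan';
# with it, NaN values are skipped and propagate through unchanged
sectors = s.map(lambda x: company_dic[str(x)], na_action='ignore')
```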
I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row includes the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
1. Decompose the data by ROLE_TYPE.
2. Prepare the data for the merge (rename columns and drop unnecessary columns).
3. Merge everything back into one DataFrame.
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
I'm reading a very large (15M-line) csv file into a pandas DataFrame. I then want to split it into smaller ones (ultimately creating smaller csv files, or a pandas Panel...).
I have working code, but it's VERY slow. I believe it's not taking advantage of the fact that my DataFrame is 'ordered'.
The df looks like:
ticker date open high low
0 AAPL 1999-11-18 45.50 50.0000 40.0000
1 AAPL 1999-11-19 42.94 43.0000 39.8100
2 AAPL 1999-11-22 41.31 44.0000 40.0600
...
1000 MSFT 1999-11-18 45.50 50.0000 40.0000
1001 MSFT 1999-11-19 42.94 43.0000 39.8100
1002 MSFT 1999-11-22 41.31 44.0000 40.0600
...
7663 IBM 1999-11-18 45.50 50.0000 40.0000
7664 IBM 1999-11-19 42.94 43.0000 39.8100
7665 IBM 1999-11-22 41.31 44.0000 40.0600
I want to take all rows where ticker == 'AAPL' and make a DataFrame from them, then all rows where ticker == 'MSFT', and so on. The number of rows for each ticker is NOT the same, and the code has to adapt; I might load a new 'large' csv where everything is different.
This is what I came up with:
#Read database
alldata = pd.read_csv('./alldata.csv')
#get a list of all unique tickers present in the database
alltickers = alldata.iloc[:, 0].unique()
#write data of each ticker to its own csv file
for ticker in alltickers:
    print('Creating csv for ' + ticker)
    #get data for current ticker
    tickerdata = alldata.loc[alldata['ticker'] == ticker]
    #remove column with ticker symbol (will be the file name) and reindex,
    #since we're grabbing from somewhere in a large dataframe
    tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
    #write csv
    tickerdata.to_csv('./split/' + ticker + '.csv')
This takes forever to run. I thought it was the file I/O, but I commented out the write-csv part of the loop and found that this line is the problem:
tickerdata = alldata.loc[alldata['ticker'] == ticker]
I wonder if pandas is looking in the WHOLE dataframe every single time. I do know that the dataframe is in order of ticker. Is there a way to leverage that?
Thank you very much!
Dave
Easiest way to do this is to create a dictionary of the dataframes using a dictionary comprehension and pandas groupby
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}
dodf['IBM']
ticker date open high low
7663 IBM 1999-11-18 45.50 50.0 40.00
7664 IBM 1999-11-19 42.94 43.0 39.81
7665 IBM 1999-11-22 41.31 44.0 40.06
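From that dictionary, writing each per-ticker DataFrame to its own csv is a short loop. A sketch with stand-in data (the output directory here is a temp dir standing in for './split/'):

```python
import os
import tempfile

import pandas as pd

# small stand-in for the real 15M-row data
alldata = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT'],
    'date': ['1999-11-18', '1999-11-19', '1999-11-18'],
    'open': [45.50, 42.94, 45.50],
})

# dictionary of per-ticker DataFrames via groupby
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}

outdir = tempfile.mkdtemp()  # stands in for './split/'
for ticker, sub_df in dodf.items():
    # drop the ticker column, since it becomes the file name
    sub_df.drop(columns='ticker').to_csv(
        os.path.join(outdir, ticker + '.csv'), index=False)
```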
It makes sense that creating a boolean index of length 15 million, and doing it repeatedly, is going to take a little while. Honestly, for splitting the file into subfiles, I think Pandas is the wrong tool for the job. I'd just use a simple loop to iterate over the lines in the input file, writing them to the appropriate output file as they come. This doesn't even have to load the whole file at once, so it will be fairly fast.
import itertools as it

tickers = set()
with open('./alldata.csv') as f:
    headers = next(f)
    for ticker, lines in it.groupby(f, lambda s: s.split(',', 1)[0]):
        with open('./split/{}.csv'.format(ticker), 'a') as w:
            if ticker not in tickers:
                w.write(headers)
                tickers.add(ticker)
            w.writelines(lines)
Then you can load each individual split file using pd.read_csv() and turn that into its own DataFrame.
If you know that the file is ordered by ticker, then you can skip everything involving the set tickers (which tracks which tickers have already been encountered). But that's a fairly cheap check.
Probably the best approach is to use groupby. Suppose:
>>> df
ticker v1 v2
0 A 6 0.655625
1 A 2 0.573070
2 A 7 0.549985
3 B 32 0.155053
4 B 10 0.438095
5 B 26 0.310344
6 C 23 0.558831
7 C 15 0.930617
8 C 32 0.276483
Then group:
>>> grouped = df.groupby('ticker', as_index=False)
Finally, iterate over your groups:
>>> for g, df_g in grouped:
... print('creating csv for ', g)
... print(df_g.to_csv())
...
creating csv for A
,ticker,v1,v2
0,A,6,0.6556248347252436
1,A,2,0.5730698850517599
2,A,7,0.5499849530664374
creating csv for B
,ticker,v1,v2
3,B,32,0.15505313728451087
4,B,10,0.43809490694469133
5,B,26,0.31034386153099336
creating csv for C
,ticker,v1,v2
6,C,23,0.5588311692150466
7,C,15,0.930617426953476
8,C,32,0.2764826801584902
Of course, here I am printing a csv, but you can do whatever you want.
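For example, to write each group to its own file rather than printing it (the output directory here is hypothetical):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'ticker': ['A', 'A', 'B'], 'v1': [6, 2, 32]})

outdir = tempfile.mkdtemp()  # stands in for a real output directory
for g, df_g in df.groupby('ticker', as_index=False):
    # one csv per group, named after the ticker
    df_g.to_csv(os.path.join(outdir, g + '.csv'), index=False)
```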
Using groupby is great, but it does not take advantage of the fact that the data is presorted, so it will likely have more overhead than a solution that does. For a large dataset, this could be a noticeable slowdown.
Here is a method which is optimized for the sorted case:
import pandas as pd
import numpy as np

alldata = pd.read_csv("tickers.csv")
tickers = np.array(alldata.ticker)

# use numpy to compute change points; this should be super fast
# and yield a performance boost over groupby
change_points = np.where(tickers[1:] != tickers[:-1])[0].tolist()
# add the last index as well, to capture the final ticker block
change_points += [tickers.size - 1]

prev_idx = 0
for idx in change_points:
    ticker = alldata.ticker[idx]
    print('Creating csv for ' + ticker)
    # get data for current ticker
    tickerdata = alldata.iloc[prev_idx: idx + 1]
    tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
    tickerdata.to_csv('./split/' + ticker + '.csv')
    prev_idx = idx + 1