Create different dataframes from 1 excel file using selected columns

I have a large data frame with dates and then stocks at the top with columns of price data.
Header 1  Header 2  Header 3  Header 4
========  ========  ========  ========
Date      Stock 1   Stock 2   Stock 3
1/2/2001  2.77      6.00      11.00
1/3/2001  2.89      6.08      11.10
1/4/2001  2.86      6.33      11.97
1/5/2001  2.80      6.58      12.40
What I want to do is make multiple dataframes from this one file with the date and the stock price of each stock. So essentially in this example you would have 4 dataframes (the file has more than 1000 of them so this is just a sample). So the dataframes would be:
DF1 = Date and Stock 1
DF2 = Date and Stock 2
DF3 = Date and Stock 3
DF4 = Date and Stock 4
I am then going to take each dataframe and add more columns to each of them once they are created.
I was reading through previous questions and came up with usecols, but I can't seem to get the syntax right. Can someone help me out? Also, if there is a better way to do this, please advise. Since I have more than 1000 stocks, speed is important when running through the file.
This is what I have so far, but I am not sure I am heading down the most efficient path. It gives the following error (among others, it seems):
    ValueError: The elements of 'usecols' must either be all strings or all integers
df2 = pd.read_csv('file.csv')
# read in Excel file to get column headers
for i in df2:
    a = 0
    # always want to have 1st (date column) as 1st column in DF
    d = pd.read_csv('file.csv', usecols=[a, i])
    # Read in file with proper columns, will always be first column
    # and add column 1, next loop cols 0,2, next loop 0,3, etc.
    dataf[i] = pd.DataFrame(d)  # actually create DataFrame
It also seems to be inefficient to have to read in the file each time. Maybe there is a way to read in file once and then create the dataframes. Any help would be appreciated.

Consider reading the file once and building a list of integer pairings ([0,1], [0,2], [0,3], etc.) to slice the master dataframe by column position. Then append each slice to a list. A single container of similarly structured elements is preferable to thousands of separate dataframes flooding your global environment.
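import pandas as pd

masterdf = pd.read_csv("DataFile.csv")
# Parse the first (date) column; the sample dates are month/day/year
date_col = masterdf.columns[0]
masterdf[date_col] = pd.to_datetime(masterdf[date_col], format='%m/%d/%Y')
colpairs = [[0, i] for i in range(1, len(masterdf.columns))]
dfList = []
for cols in colpairs:
    # Integer pairs select by position, so use .iloc; .copy() gives each
    # frame its own data so new columns can be added later without warnings
    dfList.append(masterdf.iloc[:, cols].copy())
print(len(dfList))
print(dfList[0].head())
print(dfList[1].head())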
Alternatively, consider a dictionary of dataframes with stock names as keys for a container, where colpairs carry string literal pairings as opposed to integers:
colpairs = [['Date', masterdf.columns[i]] for i in range(1, len(masterdf.columns))]
dfDict = {}
for cols in colpairs:
    dfDict[cols[1]] = masterdf[cols]
print(len(dfDict))
print(dfDict['Stock 1'].head())
print(dfDict['Stock 2'].head())
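As a side note on the error in the question: iterating over a dataframe yields its column labels (strings), so usecols=[a, i] mixes the integer 0 with a string, which read_csv rejects. The sketch below reproduces that on the sample data (loaded from an in-memory string, which is an assumption made only to keep the example self-contained) and then builds the per-stock frames from a single read.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file
csv_text = """Date,Stock 1,Stock 2,Stock 3
1/2/2001,2.77,6.00,11.00
1/3/2001,2.89,6.08,11.10
1/4/2001,2.86,6.33,11.97
1/5/2001,2.80,6.58,12.40
"""

# Mixing an integer position with a string label triggers the ValueError
try:
    pd.read_csv(io.StringIO(csv_text), usecols=[0, "Stock 1"])
    mixed_usecols_failed = False
except ValueError as err:
    mixed_usecols_failed = True
    print("ValueError:", err)

# Read once, then slice in memory: one two-column frame per stock
master = pd.read_csv(io.StringIO(csv_text))
frames = {name: master[["Date", name]].copy() for name in master.columns[1:]}
print(list(frames))
print(frames["Stock 2"])
```

Passing either all positions (usecols=[0, 1]) or all labels (usecols=['Date', 'Stock 1']) avoids the error, but slicing an already loaded frame avoids rereading the file for each of the 1000+ columns.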

Related

Speed up operations over Python Pandas dataframes

I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.
The code has three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieve the prodcode from each row, and then use it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
   patid   eventdate  consid  prodcode  issueseq
0  20001  21/04/2005    2728        85         0
1  25001  21/10/2000    3939        40         0
2  25001  21/02/2001    3950        37         0
drugUseDF
   index  prodcode  ...      use1  use2
0    171       479  ...  diabetes   NaN
1    172      9105  ...  diabetes   NaN
2    173      5174  ...  diabetes   NaN
tempDrugUse
  use1
0  NaN
1  NaN
2  NaN
This is the code:
dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
    # predefine dataframe where we will store the results to be the same length as the main data dataframe.
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])
    # go through each row of the main data dataframe.
    for ind in range(len(tempDF)):
        # retrieve the prodcode from the *ind* row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]
        # get the corresponding value from the use1 column matching the prodcode column
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
        # update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr
    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.

When combining data from two dataframes, you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should avoid looping over the rows and use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
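A minimal sketch of that idea, using fabricated rows shaped like the question's tables (the prodcode and use1 values are invented for illustration). A left merge keeps one output row per tempDF row as long as prodcode is unique in drugUseDF, and sort_values can restore chronological order once the date strings are parsed.

```python
import pandas as pd

# Fabricated sample rows; the real data in the question is private
tempDF = pd.DataFrame({
    "patid": [20001, 25001, 25001],
    "eventdate": ["21/04/2005", "21/02/2001", "21/10/2000"],
    "prodcode": [85, 37, 40],
})
drugUseDF = pd.DataFrame({
    "prodcode": [37, 40, 85],
    "use1": ["gout", "hypertension", "diabetes"],
})

# Left merge: every tempDF row is kept, in its original order
merged = tempDF.merge(drugUseDF[["prodcode", "use1"]], on="prodcode", how="left")

# Parse the day/month/year strings, then order chronologically
merged["eventdate"] = pd.to_datetime(merged["eventdate"], format="%d/%m/%Y")
ordered = merged.sort_values("eventdate").reset_index(drop=True)
print(ordered)
```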

If I understand you correctly, you want to map the prodcode from both tables. You can do this via pd.merge (please note the example in the code below differs from your data):
import pandas as pd

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
# Left join: keeps every row of tempDF and adds the matching use1
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
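If only the single use1 column is needed, a lookup with Series.map is another option. This is a sketch rather than the answer's method: it assumes each prodcode should map to one use1 value (duplicates are dropped first), and the sample values are made up. The result has the same index and length as tempDF, so row order is untouched.

```python
import pandas as pd

# Made-up sample data; prodcode 999 has no match on purpose
tempDF = pd.DataFrame({"patid": [20001, 25001, 25001],
                       "prodcode": [85, 40, 999]})
drugUseDF = pd.DataFrame({"prodcode": [40, 85],
                          "use1": ["hypertension", "diabetes"]})

# Build a prodcode -> use1 lookup; map() needs a unique index
lookup = drugUseDF.drop_duplicates("prodcode").set_index("prodcode")["use1"]

# One vectorised lookup replaces the row-by-row loop
tempDrugUse = tempDF["prodcode"].map(lookup).to_frame("use1")
print(tempDrugUse)
```

Unmatched product codes come back as NaN instead of raising, unlike the .values[0] access in the original loop.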

Reshape a long column csv file using pandas to get a proper dataframe table

I have data in a csv file in a single column which I want to convert into a table with column headers. The input file is of the type:
df1 = pd.DataFrame(['CompA', '$200', '$450', '10.3x', '50.0%',
                    'CompB', '$300', '$50', '13.2x', '40.0%',
                    'CompC', '$100', '$150', '2.8x', '13.5%',
                    'CompD', '$150', '$250', '3.8x', '53.2%'
                    ])
I want to convert this into a table dataframe with the headers
column_names = ['Company name','Revenues','Gross Profit','P/E Multiple','Operating Margin']
so that each company (four in the example above: CompA, CompB, CompC and CompD) has its data in its own row.
I tried the following, but it is highly inelegant and involves manually counting the data. It also only adds the header names as a second column and still does not produce a table:
arr1 = column_names*4
df1[1] = arr1
And then when I tried to pivot it, it was not putting the Revenues, and Gross Profit etc in one row, but creating a separate row for each. This is what I did:
df2 = df1.pivot(columns=1,values=0)
How do I fix this?

You can reshape the values in your dataframe using the column_names:
pd.DataFrame(df1.to_numpy().reshape(-1, len(column_names)), columns=column_names)
Out:
  Company name Revenues Gross Profit P/E Multiple Operating Margin
0        CompA     $200         $450        10.3x            50.0%
1        CompB     $300          $50        13.2x            40.0%
2        CompC     $100         $150         2.8x            13.5%
3        CompD     $150         $250         3.8x            53.2%

You are almost correct. Pivot can work this way, however, it requires three things, the values to be pivoted, the column to pivot on, and an index.
I don't see the need to count manually here.
# Get number of entities in long list
n_entities = len(df1) // len(column_names)
# Generate n repetitions of column_names and assign to df1 for pivot
df1['col_name'] = column_names * n_entities
# Generate and assign an index column (one value per entity)
index_vals = []
for i in range(n_entities):
    index_vals.extend([str(i)] * len(column_names))
df1['index_val'] = index_vals
df1.pivot(index='index_val', columns='col_name', values=0)
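The same pivot can be sketched without the explicit loop by deriving both helper columns from the row position. This is a variation on the approach above, not the answerer's code; the names pos, row and field are introduced here for illustration, and it assumes the long column contains complete groups of len(column_names) values.

```python
import numpy as np
import pandas as pd

column_names = ['Company name', 'Revenues', 'Gross Profit',
                'P/E Multiple', 'Operating Margin']
df1 = pd.DataFrame(['CompA', '$200', '$450', '10.3x', '50.0%',
                    'CompB', '$300', '$50', '13.2x', '40.0%',
                    'CompC', '$100', '$150', '2.8x', '13.5%',
                    'CompD', '$150', '$250', '3.8x', '53.2%'])

k = len(column_names)
pos = np.arange(len(df1))

# Integer division gives the entity (row) number, modulo gives the field
long_df = df1.assign(row=pos // k, field=np.array(column_names)[pos % k])

# pivot sorts the new columns alphabetically, so restore the header order
table = long_df.pivot(index='row', columns='field', values=0)[column_names]
print(table)
```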

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great, but since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I iterate over this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. The biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
         date category
0  2015-01-07       f2
1  2015-01-26       f2
2  2015-01-26       f2
3  2015-04-08       f2
4  2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
               '2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
               '2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
               '2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
               '2011-11-18'],
              dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1

Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import numpy as np
import pandas as pd
df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and "date_range" is actually generated from two columns (per OP's comment), we filter with a mask on those columns instead:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we go back to df1 and merge on the common dates while keeping all the rows from df1. This creates NaN for the subset values where they didn't match df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If a df1["date"] matches more than one date range, you'll have to drop the extra rows first; otherwise the duplicates break the row alignment this assignment relies on.

You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1['date'] in the merged dataframe that are not covered by df2['date_range']. Unfortunately, I need more information about the content of df1['date'] and df2['date_range'] to write code here that does exactly that.
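A rough sketch of the join-then-filter idea, without a Python loop. It assumes each date range in df2 can be expressed as two hypothetical columns, startdate and enddate (the first and last entry of each date_range), and that the first matching range wins when several overlap. All sample values are invented.

```python
import pandas as pd

# Invented sample data shaped like the question's frames
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-07", "2015-01-26",
                            "2015-04-08", "2015-06-01"]),
    "category": ["f2", "f2", "f1", "f2"],
})
df2 = pd.DataFrame({
    "category": ["f1", "f2", "f2"],
    "startdate": pd.to_datetime(["2015-04-01", "2015-01-01", "2015-01-20"]),
    "enddate": pd.to_datetime(["2015-04-30", "2015-01-15", "2015-01-31"]),
    "values": ["01", "02", "2.1"],
})

# Step 1: join on category, carrying df1's row labels along as "index"
joined = df1.reset_index().merge(df2, on="category", how="left")

# Step 2: keep only the pairs whose date falls inside the range
in_range = joined["date"].between(joined["startdate"], joined["enddate"])
matches = joined.loc[in_range].drop_duplicates("index").set_index("index")

# Step 3: write the matched values back, aligned on df1's row labels
df1["values"] = matches["values"].reindex(df1.index)
print(df1)
```

Rows of df1 with no matching range end up with NaN in values. The intermediate join holds one row per (df1 row, df2 range) pair within a category, so memory use grows with the number of ranges per category.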

Add multiple columns to multiple data frames

I have a number of small dataframes, each with a date and the stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different and the numbers in each stock column are always different. So when you call all_dfs[1], this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX  Date      MMM US equity
0    1/3/2000  47.19
1    1/4/2000  45.31
2    1/5/2000  46.63
3    1/6/2000  50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it returns a syntax error for the all_dfs[i] line.
for i in range (len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:,0].pct_change(1) * 100
So I have 2 problems that I cannot figure out:
1. I don't know how to add columns to every dataframe in the loop. So I would have to do something like all_dfs[i].['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
2. The second part after the =, which is df['Stock Name 1'], keeps changing (in this example it is called MMM US Equity, but next time it would be the column header of the second dataframe, so it could be IBM US Equity). Each dataframe has a different column name, so I don't know how to refer to it properly in the loop.
I am new to python/pandas, so if I am thinking about this the wrong way, let me know if there is a better solution.

Consider iterating over the length of all_dfs to reference each element by its index. For the first new column, use .iloc (the .ix indexer was removed in pandas 1.0) to select the stock column by position. Position 2 (third column) below assumes the frames carry a leading IDX column; use 1 if they hold only Date and the stock column:
for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()  # independent copy, avoids SettingWithCopyWarning
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
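To sidestep the changing column name entirely, one option is a small helper that looks the stock column up from the frame itself. The sketch below assumes the stock price is always the last column of each frame; the function name add_signal_columns, its threshold parameter, and the IBM prices are made up for illustration.

```python
import pandas as pd

# Two small frames shaped like the question's all_dfs (IBM prices invented)
all_dfs = [
    pd.DataFrame({"Date": ["1/3/2000", "1/4/2000", "1/5/2000"],
                  "IBM US equity": [100.0, 104.0, 103.0]}),
    pd.DataFrame({"Date": ["1/3/2000", "1/4/2000", "1/5/2000", "1/6/2000"],
                  "MMM US equity": [47.19, 45.31, 46.63, 50.38]}),
]


def add_signal_columns(df, threshold=3):
    """Return a copy of df with the daily % change and a 0/1 signal."""
    out = df.copy()
    stock_col = out.columns[-1]  # assumption: the stock is the last column
    out["P_CHG1D"] = out[stock_col].pct_change(1) * 100
    # Comparing against the threshold gives booleans; astype(int) gives 1/0
    out["PCHG_SIG"] = (out["P_CHG1D"] > threshold).astype(int)
    return out


all_dfs = [add_signal_columns(df) for df in all_dfs]
print(all_dfs[1])
```

The first row of P_CHG1D is NaN because there is no previous price, and NaN > 3 evaluates to False, so its signal is 0.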

how to unstack a pandas dataframe with two sets of variables

I have a table that looks like this. Read from a CSV file, so no levels, no fancy indices, etc.
ID  date1      amount1  date2      amount2
x   15/1/2015  100      15/1/2016  80
The actual file I have goes up to date5 and amount5.
How can I convert it to:
ID  date       amount
x   15/1/2015  100
x   15/1/2016  80
If I only had one variable, I would use pandas.melt(), but with two variables I really don't know how to do it quickly.
I could do it manually by exporting to an in-memory sqlite3 database and doing a union. Doing unions in pandas is more annoying because, unlike SQL, it requires all field names to be the same. So in pandas I'd have to create a temporary dataframe for date1 and amount1, rename the fields to just date and amount, do the same for all the other events, and only then call pandas.concat.
Any suggestions? Thanks!

Here is one way:
>>> pandas.concat(
...     [pandas.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date'),
...      pandas.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')
...     ],
...     axis=1
... ).drop('variable', axis=1)
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
The idea is to do two melts, one for each set of columns, then concat them. This assumes that the two kinds of columns are in alternating order, so that the columns[1::2] and columns[2::2] select them correctly. If not, you'd have to modify that part of it to choose the columns you want.
You can also do it with the little-known lreshape:
>>> pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]})
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
However, lreshape is not really documented and it's not clear if it's supposed to be used.
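Another option worth considering is pandas.wide_to_long, which is built for column names made of a stub plus a numeric suffix (date1, amount1, date2, ...). The sketch below is an alternative to the two approaches above, not part of them; it assumes ID uniquely identifies each row, and the second sample row and the name event for the suffix column are invented.

```python
import pandas as pd

# Sample frame shaped like the question's table (row "y" is made up)
x = pd.DataFrame({
    "ID": ["x", "y"],
    "date1": ["15/1/2015", "1/2/2015"],
    "amount1": [100, 50],
    "date2": ["15/1/2016", "1/2/2016"],
    "amount2": [80, 60],
})

# stubnames are the shared prefixes; the numeric suffix becomes "event"
long_df = (
    pd.wide_to_long(x, stubnames=["date", "amount"], i="ID", j="event")
    .reset_index()
    .sort_values(["ID", "event"])
    .reset_index(drop=True)
)
print(long_df)
```

Unlike the slicing with columns[1::2], this does not depend on the date and amount columns alternating in the file.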

If I assume that the columns always repeat, a simple trick provides the solution you want.
The trick lies in making a list of lists of the columns that go together, then looping over the main list appending as necessary. It does involve a call to pd.DataFrame() each time the loop runs. I am kind of pressed for time right now to find a way to avoid that. But it does work like you would expect it to, and for a small file, you should not have any problems (that is, run time).
In [1]: columns = [['date1', 'amount1'], ['date2', 'amount2'], ...]
In [2]: # DataFrame.append was removed in pandas 2.0, so collect the
        # pieces in a list and concatenate them once at the end
        frames = []
        for cols in columns:
            frames.append(pd.DataFrame(df.loc[:, cols].values,
                                       columns=['date', 'amount']))
        df_clean = pd.concat(frames, ignore_index=True)
        df_clean
Out[2]:
        date amount
0  15/1/2015    100
1  15/1/2016     80
The neat thing about this is that it only runs over the DataFrame once, picking all the rows under the columns it is looping over. So if you have 5 column pairs, with 'n' rows under it, the loop will only run 5 times. For each run, it will append all 'n' rows below the columns to the clean DataFrame to give you a consistent result. You can then eliminate any NaN values and sort by date, or do whatever you want to do with the clean DF.
What do you think, does this beat creating an in-memory sqlite3 database?
