What is the fastest way to build a DataFrame piece by piece? - python

I am downloading price data from Bloomberg and want to build a DataFrame in the fastest and least memory-intensive way. Let's say I submit a data request to Bloomberg through Python for the price data of all current S&P 500 stocks from 1-1-2000 to 1-1-2013. Data is returned by ticker and then by date and value, one at a time. My current method is to create one list for the dates and another list for the prices, and to append a date and price to each list as they are read from the Bloomberg data request response. Then, when all the dates and prices are read for a particular ticker, I create a DataFrame for that ticker using
ticker_df = pd.DataFrame(price_list, index=dates_list, columns=[ticker], dtype=float)
I do this for each ticker, appending each ticker DataFrame to a list (df_list.append(ticker_df)) after each ticker's data is read. When all the ticker DataFrames are made, I combine the individual DataFrames into one DataFrame:
lg_index = []
for num in range(len(df_list)):
    if len(lg_index) < len(df_list[num].index):
        lg_index = df_list[num].index  # Use the largest index for creating the result_df

result_df = pd.DataFrame(index=lg_index)
for num in range(len(df_list)):
    result_df[df_list[num].columns[0]] = df_list[num]
The reason I do it this way is that the indexes for each ticker are not identical (if a stock only IPO'd last year, etc.).
I'm guessing there must be a better way to accomplish what I'm doing here, using less memory and running faster; I just can't think of it. Thanks!

I'm not 100% sure what you're after, but you can concat a list of DataFrames:
pd.concat(df_list)
For example:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]])
In [12]: pd.concat([df, df, df])
Out[12]:
   0  1
0  1  2
1  3  4
0  1  2
1  3  4
0  1  2
1  3  4
In [13]: pd.concat([df, df, df], axis=1)
Out[13]:
   0  1  0  1  0  1
0  1  2  1  2  1  2
1  3  4  3  4  3  4
or do an outer merge/join:
In [14]: df1 = pd.DataFrame([[1, 2]], columns=[0, 2])
In [15]: df.merge(df1, how='outer') # do several of these
Out[15]:
   0  1    2
0  1  2    2
1  3  4  NaN
See the merge, join, concatenate section of the docs.
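Applied to the ticker case in the question, a minimal sketch (the bloomberg_responses iterable below is hypothetical, standing in for however the per-ticker dates and prices are read) would build one Series per ticker and concatenate once along the columns; concat with axis=1 aligns on the union of all the indexes, so tickers with shorter histories simply get NaN before their first date:
import pandas as pd

ticker_series = []
# bloomberg_responses is a hypothetical placeholder yielding (ticker, dates_list, price_list)
for ticker, dates_list, price_list in bloomberg_responses:
    ticker_series.append(pd.Series(price_list, index=dates_list, name=ticker, dtype=float))

# One concat at the end; axis=1 performs an outer join on the date index
result_df = pd.concat(ticker_series, axis=1)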

Related

How to save multiple values on different rows as a variable or list in a CSV using Python Pandas

I'm currently trying to iterate through a dataframe/CSV and compare the dates of rows that share the same ID. If the dates differ, or are a certain time-frame apart, I want to put a '1' in another column (not shown) to mark that ID and its row(s).
I'm looking to save the DATE values as variables and compare them against the other DATE variables with the same ID. If the dates are a set amount of time apart, I'll put a 1 in another column on the same row.
ID  DATE
1   11/11/2011
1   11/11/2011
2   5/05/2011
2   20/06/2011
3   2/04/2011
3   10/08/2011
4   8/12/2011
4   1/02/2012
4   12/03/2012
For this post, I'm mainly looking to save the multiple values as variables or a list. I'm hoping to figure out the rest once this roadblock has been removed.
Here's what I have currently, but I don't think it'll be much help. It iterates through and converts the date strings to dates, which is what I want to happen AFTER getting a list of all the dates with the same ID value.
import pandas as pd
from datetime import date

filename = 'TestData.csv'
df = pd.read_csv(filename)
print(df.iloc[0, 1])

x = 0
for i in df.iloc:
    # Split the dd/mm/yyyy string and rebuild it as a date object
    FixDate = df.iloc[x, 1]
    d1, m1, y1 = FixDate.split('/')
    d1 = int(d1)
    m1 = int(m1)
    y1 = int(y1)
    finaldate = date(y1, m1, d1)
    print(finaldate)
    x = x + 1
Any help is appreciated, thank you!
In pandas, for performance it is best to avoid loops. If you need a new column that tests whether all DATE values are the same per group, use GroupBy.transform with DataFrameGroupBy.nunique and then compare the values to 1:
df = pd.read_csv(filename)
df['test'] = df.groupby('ID')['DATE'].transform('nunique').eq(1).astype(int)
print (df)
   ID        DATE  test
0   1  11/11/2011     1
1   1  11/11/2011     1
2   2   5/05/2011     0
3   2  20/06/2011     0
4   3   2/04/2011     0
5   3  10/08/2011     0
6   4   8/12/2011     0
7   4   1/02/2012     0
8   4  12/03/2012     0
If you need to filter the matched rows:
mask = df.groupby('ID')['DATE'].transform('nunique').eq(1)
df1 = df[mask]
print (df1)
   ID        DATE
0   1  11/11/2011
1   1  11/11/2011
In the last step, convert the values to a list:
IDlist = df1['ID'].tolist()
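If the eventual goal is the time-frame comparison described in the question, a minimal sketch along the same lines (the 30-day threshold is just an assumed example) would parse the dates, optionally collect them per ID as lists, and flag rows whose ID has dates spanning more than the threshold:
import pandas as pd

df = pd.read_csv('TestData.csv')
# Parse the day-first date strings into real datetimes
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)

# The dates per ID as lists, if lists are what you need
dates_per_id = df.groupby('ID')['DATE'].apply(list)

# Flag rows whose ID has dates more than an assumed 30 days apart
threshold = pd.Timedelta(days=30)
span = df.groupby('ID')['DATE'].transform(lambda s: s.max() - s.min())
df['flag'] = (span > threshold).astype(int)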

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, say, add three (month_num) more dates to each case and remove the original ones.
original dataframe:
   Case       Date
0     1 2010-01-01
1     2 2011-04-01
2     3 2012-08-01
after populating dates:
   Case       Date
0     1 2010-02-01
1     1 2010-03-01
2     1 2010-04-01
3     2 2011-05-01
4     2 2011-06-01
5     2 2011-07-01
6     3 2012-09-01
7     3 2012-10-01
8     3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, using a for loop to loop over Case and month_num, and adding rows to the new dataframe.
import pandas as pd

data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns=['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num become large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once.
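A minimal sketch of that pattern, keeping your loop logic but deferring the concatenation to a single call:
pieces = []
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c].copy()
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        pieces.append(temp)

# One concat instead of one per iteration
df_new = pd.concat(pieces, ignore_index=True)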
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 results from a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting dates by one day.

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count each of the two numbers from df1 separately in df2. The correct answer looks like:
No  Amount
1   3
2   2
Instead of:
No  Amount
1   5
2   5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
    df2[df2['a'].isin(df1['a'].unique())]['a']
    .value_counts()
    .reset_index()
)
result.columns = ['No', 'Amount']
>>> result
   No  Amount
0   1       3
1   2       2
In pandas 0.21.0+ you can use set_axis to rename columns as a chained method. Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
    .squeeze()\
    .value_counts()\
    .reset_index()\
    .set_axis(['No', 'Amount'], axis=1, inplace=False)
Output:
   No  Amount
0   1       3
1   2       2
You can simply take the value_counts of the second df and map it onto the first df, i.e.
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output:
   No  Amount
0   1       3
1   2       2

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1             df2
Out[69]:        Out[70]:
   A  B            A  B
0  2  a         0  5  q
1  1  s         1  6  w
2  3  d         2  3  e
3  4  f         3  1  r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
   A  B
0  2  a    <--- belongs to df1
0  5  q    <--- belongs to df2
1  1  s    <--- belongs to df1
1  6  w    <--- belongs to df2
2  3  d    <--- belongs to df1
2  3  e    <--- belongs to df2
3  4  f    <--- belongs to df1
3  1  r    <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd

df1 = pd.DataFrame({'A': [2, 1, 3, 4], 'B': ['a', 's', 'd', 'f']})
df2 = pd.DataFrame({'A': [5, 6, 3, 1], 'B': ['q', 'w', 'e', 'r']})
dfff = pd.DataFrame()
for i in range(0, 4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows and I cannot revert the process (even by using .T).
Question: Can you please suggest me a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation about how to convert a column back to row?
I'm unable to comment on the accepted answer, but note that the sort operation is not stable by default, so you must choose a stable sorting algorithm:
pd.concat([df1, df2]).sort_index(kind='merge')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
   A  B
0  2  a
0  5  q
1  1  s
1  6  w
2  3  d
2  3  e
3  4  f
3  1  r
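On the optional part of the question: df1.iloc[i] returns a Series, which is why concat appears to stack it as a column. One way to turn such a Series back into a one-row DataFrame, as a small sketch:
# .to_frame() makes a one-column DataFrame; .T transposes it into one row
row = df1.iloc[0]          # Series with index ['A', 'B']
row_df = row.to_frame().T  # one-row DataFrame with columns ['A', 'B']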

How to merge two data frames based on nearest date

I want to merge two data frames based on two columns: "Code" and "Date". It is straightforward to merge data frames based on "Code"; however, in the case of "Date" it becomes tricky: there is no exact match between the Dates in df1 and df2, so I want to select the closest Dates. How can I do this?
df = df1[column_names1].merge(df2[column_names2], on='Code')
I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:
add a column to df1 with the closest date from the appropriate group in df2
call a standard merge on these
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors code for this sort of thing.
I've put together one approach to that solution that should scale relatively well.
First we can generate some simple data:
import pandas as pd
import numpy as np

dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)

i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1': rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2': rand.rand(5)})
Let's check these out:
>>> df1
   Code       Date      val1
0     0 2015-01-16  0.975852
1     0 2015-01-31  0.516300
2     1 2015-04-06  0.322956
3     1 2015-05-09  0.795186
4     1 2015-06-08  0.270832
>>> df2
   Code       Date      val2
0     1 2015-02-03  0.184334
1     1 2015-04-13  0.080873
2     0 2015-05-02  0.428314
3     1 2015-06-26  0.688500
4     0 2015-06-30  0.058194
Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, match, groupname):
    # Restrict the candidates to the rows of df2 with the same Code
    match = match[match[groupname] == group.name]
    nbrs = NearestNeighbors(1).fit(match['Date'].values[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values[:, None])
    group['Date1'] = group['Date']
    group['Date'] = match['Date'].values[ind.ravel()]
    return group

df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')
>>> df1_mod
   Code       Date      val1      Date1
0     0 2015-05-02  0.975852 2015-01-16
1     0 2015-05-02  0.516300 2015-01-31
2     1 2015-04-13  0.322956 2015-04-06
3     1 2015-04-13  0.795186 2015-05-09
4     1 2015-06-26  0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
   Code       Date      val1      Date1      val2
0     0 2015-05-02  0.975852 2015-01-16  0.428314
1     0 2015-05-02  0.516300 2015-01-31  0.428314
2     1 2015-04-13  0.322956 2015-04-06  0.080873
3     1 2015-04-13  0.795186 2015-05-09  0.080873
4     1 2015-06-26  0.270832 2015-06-08  0.688500
Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
Here's an alternative solution:
Merge on Code.
Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.
Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.
Code:
df = df1.reset_index()[column_names1].merge(df2[column_names2], on='Code')
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df.sort_values('DateDiff').groupby('index').first().reset_index()
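For reference, newer pandas versions also provide pd.merge_asof, which can do a nearest-date match per Code in one call; a minimal sketch, assuming both frames are sorted by Date:
import pandas as pd

# merge_asof requires both frames to be sorted on the merge key
df1_sorted = df1.sort_values('Date')
df2_sorted = df2.sort_values('Date')

nearest = pd.merge_asof(df1_sorted, df2_sorted,
                        on='Date', by='Code',
                        direction='nearest')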
