I have two dfs:
df_1 = pd.DataFrame([[5,6]], columns=['Jan15','Feb15'])
Jan15 Feb15
0 5 6
df_2 = pd.DataFrame([[8,3]], columns=['Jan16','Feb16'])
Jan16 Feb16
0 8 3
Is there a way to sum both frames in order to come out with:
   Index  Jan  Feb
0      0   13    9
You'll need concat and then a groupby on the column headers.
pd.concat([df_1, df_2], axis=1).groupby(by=lambda x: x[:3], axis=1).sum()
Feb Jan
0 9 13
This works on the assumption that your column names all have the format MTHxx.
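On newer pandas (2.x), where groupby(..., axis=1) is deprecated, the same grouped sum can be written against the transpose instead; a minimal sketch of that variant:

import pandas as pd

df_1 = pd.DataFrame([[5, 6]], columns=['Jan15', 'Feb15'])
df_2 = pd.DataFrame([[8, 3]], columns=['Jan16', 'Feb16'])

combined = pd.concat([df_1, df_2], axis=1)
combined.columns = combined.columns.str[:3]   # 'Jan15' -> 'Jan', etc.
# Transpose so the month labels become the row index, group-sum, transpose back.
result = combined.T.groupby(level=0).sum().T
print(result)
#    Feb  Jan
# 0    9   13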
Use add() on the underlying array of df_2, so that pandas does not try to align the differing column labels (plain df_1.add(df_2) would give all-NaN columns):
df_1.add(df_2.values).reset_index()
   index  Jan15  Feb15
0      0     13      9
To strip the year digits from the column names at the same time:
pd.DataFrame(df_1.values + df_2.values,
             columns=df_1.columns.str.replace(r"\d", "", regex=True)).reset_index()
This is a dummy way of doing this; @coldspeed's and @andrew_reece's answers are the best here:
new1 = df_1.copy()
new1.columns = [i[:-2] for i in df_1.columns]
new2 = df_2.copy()
new2.columns = [i[:-2] for i in df_2.columns]
final_df = new1 + new2
final_df['Index'] = final_df.index
print(final_df)
Output:
Jan Feb Index
0 13 9 0
I am looking to increase the speed of an operation within pandas and I have learned that it is generally best to do so via using vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column and a city column
df2 = another (considerably larger) table with a date-time column and a city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am just beginning to learn pandas vectorized operations and would like some help figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need the rows in your df2 that have a city value present in df1, and where the difference between the times is greater than 8 (i.e., at least 9 for whole-number hours).
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we need to check the difference between those times and keep the rows where it is greater than 8, using numpy.where() to flag them so we can filter later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, 'Retain', 'Remove')
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag from the final output:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
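If you want the loop's exact behaviour on the real data, the same merge idea works with a Timedelta comparison. A sketch assuming the column names 'date-time' and 'city' from the question (the explicit False default is my addition; the original loop left non-matching rows unset):

import pandas as pd

# Merge df2 against df1 on city, keeping df2's row labels around.
merged = df2.reset_index().merge(df1, on='city', suffixes=('_2', '_1'))
# True where a df1 row in the same city is more than 8 hours older.
hit = merged['date-time_2'] - merged['date-time_1'] > pd.Timedelta('8 hours')

df2['result'] = False
df2.loc[merged.loc[hit, 'index'].unique(), 'result'] = True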
I have a time series DataFrame df1 with prices in a ticker column, from which a new DataFrame df2 is created by concatenating df1 with 3 other columns sharing the same DateTimeIndex, as shown:
Now I need to set up the ticker name "Equity(42950 [FB])" to become the new header and to nest the 3 other columns under it, and to have the ticker's prices replaced by the values in the "closePrice" column.
How to achieve this in Python?
Use a pd.MultiIndex:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['Equity', 'closePrice', 'mMb', 'mMv'])
# Pair the top-level label with each subcolumn so all three nest under 'Equity'.
arrays = [['Equity', 'Equity', 'Equity'], ['closePrice', 'mMb', 'mMv']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
# Drop the old 'Equity' column and reattach the remaining values under the new header.
df = pd.DataFrame(d.values[:, 1:], columns=index)
df
df
Equity
closePrice mMb mMv
0 1 2 3
1 5 6 7
2 9 10 11
3 13 14 15
4 17 18 19
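Since every subcolumn nests under the single 'Equity' label, pd.MultiIndex.from_product builds the same header a little more compactly (same d as above):

index = pd.MultiIndex.from_product([['Equity'], ['closePrice', 'mMb', 'mMv']])
df = pd.DataFrame(d.values[:, 1:], columns=index)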
Say I have a pandas.DataFrame with a MultiIndex. If I know it has two levels and that year is the first one, then to keep particular years I can do
df = df.loc[yearStart:, :]
If I know the index has only two levels, but not which one year is in, I can hack something dirty:
if df.index.names[0] == 'year':
    df = df.loc[yearStart:, :]
else:
    df = df.loc[:, yearStart:]
What if I know it is in the index, but not which level, nor how many levels the index has? If year is not in the index, but a regular column, I can do
df = df.loc[df.year >= yearStart]
Is there something similarly generic for the index?
You can use get_level_values to get a column-like view of an index level.
df = pd.DataFrame({'a': range(100)}, index=pd.MultiIndex.from_product([range(10), range(2010,2020)], names=['idx1', 'year']))
df.head()
Out[41]:
a
idx1 year
0 2010 0
2011 1
2012 2
2013 3
2014 4
df[df.index.get_level_values('year') >= 2015].head()
Out[42]:
a
idx1 year
0 2015 5
2016 6
2017 7
2018 8
2019 9
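Two related variants, for the record: get_level_values also accepts an integer level position, and query() resolves index level names just like column names:

# By position, if you know year is the second level but not its name:
df[df.index.get_level_values(1) >= 2015]

# query() looks up index level names as well as columns:
df.query('year >= 2015')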
I have a dataframe containing dates, and I would like to process the data as follows for feature engineering:
df
date
2016/1/1
2015/2/10
2016/3/25
After processing, I would like the df to look like:
date Jan Feb Mar Apr
2016/1/1 30 0 0 0 // days from 1/1 to 1/30: all 30 days fall in Jan
2015/2/10 0 19 11 0 // days from 2/10 to 3/11: 19 days in Feb, 11 in Mar
2016/3/25 0 0 7 23 // days from 3/25 to 4/23: 7 days in Mar, 23 in Apr
My idea was to get the date 30 days after df["date"] (counting the start date itself):
df["date"] + pd.Timedelta(days=29)
and then count how many of those 30 days fall in each month.
Is there any method to do this quickly?
Thanks.
Just go step by step. First you offset your original date by + pd.to_timedelta('30d'). Then create a column indicating the month only by df.date.dt.month. Then create a column with the end-of-month date for each date - some ideas for that are here: Want the last day of each month for a data frame in pandas. Finally, fill in a matrix where the columns are the 12 months, setting the values in the columns for the month and month+1.
By enriching your DataFrame one column at a time, you can easily move from your input to your desired output. There is not likely to be a magic method that does everything in a single call.
Read all about date/time functions in Pandas here: https://pandas.pydata.org/pandas-docs/stable/timeseries.html - there are a lot!
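A minimal sketch of that column-by-column enrichment, assuming the 30-day window includes the start date (the column names here are illustrative, not from the question):

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2015-02-10', '2016-03-25'])})

df['window_end'] = df['date'] + pd.Timedelta(days=29)    # last day of the 30-day window
month_end = df['date'] + pd.offsets.MonthEnd(0)          # end of each starting month
# Days of the window that fall in the starting month, capped at the window size.
df['days_in_start_month'] = ((month_end - df['date']).dt.days + 1).clip(upper=30)
df['days_in_next_month'] = 30 - df['days_in_start_month']
print(df)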
You can use a custom function with date_range and groupby with size:
date = df[['date']]
names = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

def f(x):
    a = pd.date_range(x['date'], periods=30)
    return pd.Series(a).groupby(a.month).size()

df = df.apply(f, axis=1).fillna(0).astype(int)
# the groupby keys are 1-based month numbers, so shift the name mapping by one
df = df.rename(columns={k + 1: v for k, v in enumerate(names)})
df = date.join(df)
print (df)
        date  Jan  Feb  Mar  Apr
0 2016-01-01   30    0    0    0
1 2015-02-10    0   19   11    0
2 2016-03-25    0    0    7   23
Similar solution with value_counts:
date = df[['date']]
names = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
df = (df.apply(lambda x: pd.date_range(x['date'], periods=30).month.value_counts(), axis=1)
        .fillna(0)
        .astype(int))
# value_counts is indexed by 1-based month numbers as well
df = df.rename(columns={k + 1: v for k, v in enumerate(names)})
df = date.join(df)
print (df)
Another solution:
names = ['Jan', 'Feb','Mar','Apr','May']
date = df[['date']]
df["date1"] = df["date"] + pd.Timedelta(days=29)
df = df.reset_index().melt(id_vars='index', value_name='date').set_index('date')
df = df.groupby('index').resample('D').asfreq()
df = (df.groupby([df.index.get_level_values(0), df.index.get_level_values(1).month])
        .size()
        .unstack(fill_value=0))
df = df.rename(columns = {k+1:v for k,v in enumerate(names)})
df = date.join(df)
print (df)
date Jan Feb Mar Apr
0 2016-01-01 30 0 0 0
1 2015-02-10 0 19 11 0
2 2016-03-25 0 0 7 23
I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column with the values one below another. The dataframe should then look like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat to stack them. I also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
df.T.groupby([df.columns]).cumcount(),
append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count entries within each column to use as index for the new dataframe and then unstack it back like this:
import pandas as pd
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = (pd.melt(df,var_name='column')
.assign(n = lambda x: x.groupby('column').cumcount())
.set_index(['n','column'])
.unstack())
df1.columns=df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
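For what it's worth, a third route (a sketch of my own, not from the answers above): selecting a duplicated label returns a DataFrame, so you can slice each run of duplicates and concatenate the pieces end to end, which reproduces the blocked order the question asked for:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
                  columns=['c1', 'c1', 'c2', 'c2'])

# df[name] is a DataFrame when `name` is duplicated; stack its columns end to end.
df1 = pd.DataFrame({
    name: pd.concat([df[name].iloc[:, i] for i in range(df[name].shape[1])],
                    ignore_index=True)
    for name in df.columns.unique()
})
print(df1)
#     c1    c2
# 0    1     3
# 1   31    13
# 2  115  1313
# 3    2     4
# 4   14    11
# 5  613     1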