I have the following dataframe df to process:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 211 1.13 122 2.53
2 A 242 1.22 211 1.13
3 A 245 3.87 242 1.38
4 A 311 3.13 243 4.00
5 A 312 7.11 311 2.07
6 A NaN NaN 312 7.11
7 A NaN NaN 324 1.06
As you can see, the two code columns, C1 and C2, are not aligned on the same rows: codes 122, 243 and 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.
I would like to rebuild the dataframe with the codes aligned by value, so as to obtain this:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 122 NaN 122 2.53
2 A 211 1.13 211 1.13
3 A 242 1.22 242 1.38
4 A 243 NaN 243 4.00
5 A 245 3.87 245 NaN
6 A 311 3.13 311 2.07
7 A 312 7.11 312 7.11
8 A 324 NaN 324 1.06
In order to do so, I thought of creating 2 subsets:
left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]
and I tried to merge them with merge:
left.merge(right, on=..., how=..., suffixes=...)
but I got lost in the parameters that should be used to achieve the result.
What do you think would be the best way to do it?
Appendix:
In order to create the initial dataframe, one could use:
import numpy as np
import pandas as pd

names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112,211,242,245,311,312,np.nan,np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112,122,211,242,243,311,312,324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1, 'C2': code2, 'Value_2': zone2})
You are almost there:
left.merge(right, right_on="C2", left_on="C1", how="right").fillna(0)
Output
Name_x    C1  Value_1  Name_y   C2  Value_2
A        112     2.36       A  112     3.77
0          0        0       A  122     2.53
A        211     1.13       A  211     1.13
A        242     1.22       A  242     1.38
0          0        0       A  243        4
A        311     3.13       A  311     2.07
A        312     7.11       A  312     7.11
0          0        0       A  324     1.06
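Note that how="right" keeps only the codes present in C2, so code 245 (which appears only in C1) is dropped, and fillna(0) also fills the missing Name cells with 0. If you want to keep 245 as well, an outer merge is one option (a sketch, reusing the left and right subsets from the question):
left.merge(right, left_on=["Name", "C1"], right_on=["Name", "C2"], how="outer")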
IIUC, you can perform an outer merge, then dropna the rows where both codes are missing:
(df[['Name', 'C1', 'Value_1']]
 .merge(df[['Name', 'C2', 'Value_2']],
        left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
 .dropna(subset=['C1', 'C2'], how='all')
)
output:
Name C1 Value_1 C2 Value_2
0 A 112.0 2.36 112.0 3.77
1 A 211.0 1.13 211.0 1.13
2 A 242.0 1.22 242.0 1.38
3 A 245.0 3.87 NaN NaN
4 A 311.0 3.13 311.0 2.07
5 A 312.0 7.11 312.0 7.11
8 A NaN NaN 122.0 2.53
9 A NaN NaN 243.0 4.00
10 A NaN NaN 324.0 1.06
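To get exactly the layout from the question, you can additionally coalesce the two code columns and sort. A minimal sketch, assuming the merged result above is assigned to a variable out:
out['C1'] = out['C1'].fillna(out['C2'])  # take the code from C2 where C1 is missing
out['C2'] = out['C2'].fillna(out['C1'])  # and from C1 where C2 is missing
out = out.sort_values('C1').reset_index(drop=True)
The Value_1/Value_2 columns keep their NaNs where a code has no entry on that side, which matches the desired output.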
I have the following dataframe:
Timestamp userid Prices_USD
0 2016-12-01 6.270941895 1.08
1 2016-12-01 6.609813209 1.12
2 2016-12-01 6.632094115 9.70
3 2016-12-01 6.655789772 1.08
4 2016-12-01 6.764640751 9.33
... ... ... ...
1183 2017-03-27 6.529604089 1.08
1184 2017-03-27 6.682639674 6.72
1185 2017-03-27 6.773815105 10.0
I want to calculate, for each unique userid, their monthly spending.
I've tried the following:
sales_per_user.set_index('Timestamp', inplace=True)
sales_per_user.index = pd.to_datetime(sales_per_user.index)
m = sales_per_user.index.month
monthly_avg = sales_per_user.groupby(['userid', m]).Prices_USD.mean().to_frame()
But the resulting dataframe is this:
userid Timestamp Prices_USD
3.43964843 12 10.91
3.885813375 1 10.91
2 10.91
12 21.82
However, the Timestamp column doesn't come out as desired: it shows only the month number, without the year. Ideally I would like:
userid Timestamp Prices_USD
3.43964843 2016-12 10.91
3.885813375 2017-01 10.91
2017-02 10.91
2016-12 21.82
How do I fix that?
Try:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
res = df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD'].sum()
print(res)
Output
userid Timestamp
6.270942 2016-12 1.08
6.529604 2017-03 1.08
6.609813 2016-12 1.12
6.632094 2016-12 9.70
6.655790 2016-12 1.08
6.682640 2017-03 6.72
6.764641 2016-12 9.33
6.773815 2017-03 10.00
Name: Prices_USD, dtype: float64
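If you want the average monthly spend instead of the total (as in your mean attempt), the same groupby works with mean, and reset_index turns the MultiIndex Series into a regular DataFrame (a sketch based on the code above):
res = df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD'].mean()
res = res.reset_index()  # columns: userid, Timestamp, Prices_USD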
I have the following dataframe with 22 columns:
ID S0 S1 S2 .....
ABC 10.4 5.58
ABC 12.6
ABC 8.45
LMN 5.6
LMN 8.7
I have to ffill() the values based on groups. Intended result:
ID SS RR S2 ...
ABC 10.4 5.58
ABC 12.6 10.4 5.58
ABC 12.6 10.4 8.45
LMN 5.6
LMN 8.7 5.6
I am using the following code to get the S0, S1, ... values:
df[['Resistance', 'cumcountR']].pivot(columns='cumcountR').droplevel(0, axis=1).add_prefix('R').drop(columns='R-1.0').ffill()
Any help will be appreciated. Thanks!
Try with GroupBy.ffill:
out = df.groupby('ID').ffill()
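A minimal sketch on data shaped like the example (the exact layout in the question is ambiguous, so the column names S0/S1 here are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN'],
    'S0': [10.4, np.nan, np.nan, 5.6, np.nan],
    'S1': [5.58, 12.6, np.nan, np.nan, 8.7],
})
out = df.groupby('ID').ffill()  # forward-fills within each ID group; the ID column itself is excluded
out.insert(0, 'ID', df['ID'])   # reattach the grouping column if you need it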
I would like to create a new DataFrame and add a bunch of stock data for each date.
I am declaring a DataFrame with a MultiIndex: date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
    universe500 = get_universe(date)  # returns stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')  # returns pd.DataFrame
        df.iloc[(date, security), :] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
# constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open': 1, 'high': 2, 'low': 3, 'close': 4}, index=multi)
print(to_add)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
# now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close values), then loc would work in place of the iloc you used:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
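On efficiency: growing a DataFrame row by row, or concatenating inside the loop, re-copies the data on every step; collecting the pieces in a list and concatenating once is usually much faster. A sketch, assuming get_universe and get_prices behave as in the question and each get_prices call returns a single-row DataFrame:
frames = []
for date in dates:
    universe500 = get_universe(date)
    for security in universe500:
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')
        prices.index = pd.MultiIndex.from_tuples([(date, security)], names=('date', 'stock'))
        frames.append(prices)
df = pd.concat(frames)  # one concat at the end instead of many in the loop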
Here is my df1, which has 365 values, one for each day of the year:
Date Data_Value
0 01-01 33.3
1 01-02 30.6
2 01-03 31.1
3 01-04 30.0
4 01-05 30.0
5 01-06 31.1
6 01-07 31.1
And here is my df2, which has 7262 rows of data, with several rows for each date of the year:
mnth_day Data_Value
0 10-07 30.0
1 10-18 21.1
2 02-25 14.4
3 03-22 28.9
4 02-25 28.9
5 09-03 24.4
6 11-19 25.0
I need to check on how many days the 'Data_Value' in df2 exceeds the 'Data_Value' in df1 for the same date. How can I do that?
My expected output will look like:
Date Data_Value
0 01-01 33.3
1 01-02 30.6
2 01-03 31.1
3 01-04 30.0
4 01-05 30.0
5 01-06 31.1
6 01-07 31.1
Here, the 'Date' column holds the dates on which the 'Data_Value' in df2 exceeds the 'Data_Value' of df1, and the 'Data_Value' column holds that largest 'Data_Value'.
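One possible approach (a sketch, not a definitive answer; column names taken from the question): merge the two frames on the month-day key, keep the rows where df2's value exceeds df1's, then take the maximum per date:
merged = df2.merge(df1, left_on='mnth_day', right_on='Date', suffixes=('_df2', '_df1'))
exceed = merged[merged['Data_Value_df2'] > merged['Data_Value_df1']]
result = (exceed.groupby('Date')['Data_Value_df2'].max()
          .reset_index()
          .rename(columns={'Data_Value_df2': 'Data_Value'}))
print(len(result))  # number of distinct dates on which df2 exceeds df1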