I have the following dataframe df to process:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 211 1.13 122 2.53
2 A 242 1.22 211 1.13
3 A 245 3.87 242 1.38
4 A 311 3.13 243 4.00
5 A 312 7.11 311 2.07
6 A NaN NaN 312 7.11
7 A NaN NaN 324 1.06
As you can see, the 2 columns of "codes", C1 and C2, are not aligned on the same levels: codes 122, 243, 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.
I would like to reconstruct a file where the codes are aligned according to their value, so as to obtain this:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 122 NaN 122 2.53
2 A 211 1.13 211 1.13
3 A 242 1.22 242 1.38
4 A 243 NaN 243 4.00
5 A 245 3.87 245 NaN
6 A 311 3.13 311 2.07
7 A 312 7.11 312 7.11
8 A 324 NaN 324 1.06
In order to do so, I thought of creating 2 subsets:
left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]
and I tried to merge them, experimenting with the parameters of merge:
left.merge(right, on=..., how=..., suffixes=...)
but I got lost in the parameters that should be used to achieve the result.
What do you think would be the best way to do it?
Appendix:
In order to create the initial dataframe, one could use:
import numpy as np
import pandas as pd

names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112,211,242,245,311,312,np.nan,np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112,122,211,242,243,311,312,324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1, 'C2': code2, 'Value_2': zone2})
You are almost there:
left.merge(right, right_on = "C2", left_on = "C1", how="right").fillna(0)
Output:

Name_x   C1 Value_1 Name_y   C2 Value_2
     A  112    2.36      A  112    3.77
     0    0    0.00      A  122    2.53
     A  211    1.13      A  211    1.13
     A  242    1.22      A  242    1.38
     0    0    0.00      A  243    4.00
     A  311    3.13      A  311    2.07
     A  312    7.11      A  312    7.11
     0    0    0.00      A  324    1.06

Note that with how="right", code 245 (present only in C1) is dropped from the result; the outer merge in the next answer keeps it.
IIUC, you can perform an outer merge, then dropna on the missing values:
(df[['Name', 'C1', 'Value_1']]
.merge(df[['Name', 'C2', 'Value_2']],
left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
.dropna(subset=['C1', 'C2'], how='all')
)
output:
Name C1 Value_1 C2 Value_2
0 A 112.0 2.36 112.0 3.77
1 A 211.0 1.13 211.0 1.13
2 A 242.0 1.22 242.0 1.38
3 A 245.0 3.87 NaN NaN
4 A 311.0 3.13 311.0 2.07
5 A 312.0 7.11 312.0 7.11
8 A NaN NaN 122.0 2.53
9 A NaN NaN 243.0 4.00
10 A NaN NaN 324.0 1.06
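To match the exact target layout (both code columns filled and the rows ordered by code), one could additionally coalesce the two code columns and sort; a minimal sketch building on the outer merge above, with out as a hypothetical name for the intermediate result:

out = (df[['Name', 'C1', 'Value_1']]
       .merge(df[['Name', 'C2', 'Value_2']],
              left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
       .dropna(subset=['C1', 'C2'], how='all'))

# fill each code column from the other, then order the rows by code
out['C1'] = out['C1'].fillna(out['C2'])
out['C2'] = out['C2'].fillna(out['C1'])
out = out.sort_values('C1').reset_index(drop=True)

The codes display as floats (112.0, 122.0, ...) because the NaN rows force a float dtype.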
I've got trouble removing the index column in pandas after a groupby and unstack on a DataFrame.
My original DataFrame looks like this:
import pandas as pd

example = pd.DataFrame({
    'date': ['2016-12', '2016-12', '2017-01', '2017-01', '2017-02', '2017-02', '2017-02'],
    'customer': [123, 456, 123, 456, 123, 456, 456],
    'sales': [10.5, 25.2, 6.8, 23.4, 29.5, 23.5, 10.4],
})
example.head(10)
output:
      date  customer  sales
0  2016-12       123   10.5
1  2016-12       456   25.2
2  2017-01       123    6.8
3  2017-01       456   23.4
4  2017-02       123   29.5
5  2017-02       456   23.5
6  2017-02       456   10.4
Note that it's possible to have multiple sales for one customer per month (like in rows 5 and 6).
My aim is to convert the DataFrame into an aggregated DataFrame like this:
customer  2016-12  2017-01  2017-02
     123     10.5      6.8     29.5
     456     25.2     23.4     33.9
My solution so far:
example = example[['date', 'customer', 'sales']].groupby(['date', 'customer']).sum().unstack('date')
example.head(10)
output:
           sales
date     2016-12 2017-01 2017-02
customer
123         10.5     6.8    29.5
456         25.2    23.4    33.9
example = example['sales'].reset_index(level=[0])
example.head(10)
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
At this point I'm unable to remove the "date" column:
example.reset_index(drop = True)
example.head()
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
It just stays the same. Have you got any ideas?
An alternative to your solution; the key is just to add rename_axis(columns=None), as "date" is the name of the columns axis:
(example[["date", "customer", "sales"]]
.groupby(["date", "customer"])
.sum()
.unstack("date")
.droplevel(0, axis="columns")
.rename_axis(columns=None)
.reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
Why not directly go with pivot_table?
(example
.pivot_table('sales', index='customer', columns="date", aggfunc='sum')
.rename_axis(columns=None).reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns with object data type:
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where works along the row axis, hence the error. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
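For illustration, both options side by side; a minimal sketch assuming diamonds is loaded as above:

# boolean mask over the columns axis, selected with .loc
obj_cols = diamonds.loc[:, diamonds.dtypes == object]

# equivalent built-in helper
obj_cols = diamonds.select_dtypes(include='object')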
From your post (badly formatted) I recreated the diamonds DataFrame, getting a result like the one below:
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a bool vector, containing the answer to the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
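For what it's worth, plain boolean indexing on the columns Index is a shorter equivalent:

# an Index accepts a boolean mask directly, keeping only the True positions
diamonds[diamonds.columns[x]]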
I have 3 dataframes/series:
>>> max_q.head()
Out[16]:
130 1.00
143 2.00
146 2.00
324 10.00
327 6.00
dtype: float64
>>> min_q.head()
Out[17]:
130 8.00
143 6.00
146 4.00
324 8.00
327 8.00
dtype: float64
>>> dirx.head()
Out[18]:
side
130 B
143 S
146 S
324 B
327 S
I want to create a new dataframe which equals max_q when dirx == 'S' and min_q otherwise.
Is there a simple way to do this? The following have failed so far:
1. np.where(dirx=='S', max_q, min_q)
2. rp = pd.DataFrame(max_q, columns = ['id'])
rp.loc[dirx=='B', 'id'] = min_q
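A minimal sketch, assuming all three objects share the same index: dirx is a DataFrame, so dirx == 'S' yields a two-dimensional mask that does not align with the Series; comparing its 'side' column instead gives an aligned boolean Series that Series.where can use:

# compare the 'side' column to get an index-aligned boolean Series
mask = dirx['side'] == 'S'

# keep max_q where side is 'S', fall back to min_q everywhere else
result = max_q.where(mask, min_q)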
I have a dataframe as follows:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 NaN 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 NaN
07-JAN-17 104 NaN
What is the best way, to fill the missing values, using last available ones?
Following is the intended result:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 111 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 211
07-JAN-17 104 211
Use the ffill function, which is the same as fillna with method='ffill':
df = df.ffill()
print (df)
A B
zDate
01-JAN-17 100.0 200.0
02-JAN-17 111.0 203.0
03-JAN-17 111.0 202.0
04-JAN-17 109.0 205.0
05-JAN-17 101.0 211.0
06-JAN-17 105.0 211.0
07-JAN-17 104.0 211.0
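As a usage note, ffill fills each column independently; if a gap should only be carried forward a bounded number of rows, the limit parameter caps it. A minimal sketch:

# carry the last valid value at most one row forward per column
df = df.ffill(limit=1)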
I have a Pandas Series as follow :
2014-05-24 23:59:49 1.3
2014-05-24 23:59:50 2.17
2014-05-24 23:59:50 1.28
2014-05-24 23:59:51 1.30
2014-05-24 23:59:51 2.17
2014-05-24 23:59:53 2.17
2014-05-24 23:59:58 2.17
Name: api_id, Length: 483677
I'm trying to count the frequency of each id per day.
For now I'm doing this :
count = {}
for x in apis.unique():
count[x] = apis[apis == x].resample('D','count')
count_df = pd.DataFrame(count)
That gives me what I want which is :
... 2.13 2.17 2.4 2.6 2.7 3.5(user) 3.9 4.2 5.1 5.6
timestamp ...
2014-05-22 ... 391 49962 3727 161 2 444 113 90 1398 90
2014-05-23 ... 450 49918 3861 187 1 450 170 90 629 90
2014-05-24 ... 396 46359 3603 172 3 513 171 89 622 90
But is there a way to do so without the for loop ?
You can use the value_counts function for this (docs), applied after a groupby. This is similar to the resample('D') you did, but resample expects an aggregated output, so we have to use the more general groupby in this case. With a small example:
In [16]: s = pd.Series([1,1,2,2,1,2,5,6,2,5,4,1], index=pd.date_range('2012-01-01', periods=12, freq='8H'))
In [17]: counts = s.groupby(pd.Grouper(freq='D')).value_counts()
In [18]: counts
Out[18]:
2012-01-01 1 2
2 1
2012-01-02 2 2
1 1
2012-01-03 2 1
6 1
5 1
2012-01-04 1 1
5 1
4 1
dtype: int64
To get this in the desired format, you can just unstack this (move the second level row indices to the columns):
In [19]: counts.unstack()
Out[19]:
1 2 4 5 6
2012-01-01 2 1 NaN NaN NaN
2012-01-02 1 2 NaN NaN NaN
2012-01-03 NaN 1 NaN 1 1
2012-01-04 1 NaN 1 1 NaN
Note: for the use of groupby(pd.Grouper(freq='D')) you need pandas 0.14. If you have an older version, you can use groupby(pd.TimeGrouper(freq='D')) to obtain exactly the same result. This is also similar to doing groupby(s.index.date) (with the difference that you then have datetime.date objects in the index).
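As a side note, pd.crosstab can build the same day-by-value count table in one call; a minimal sketch with the same toy s as above (floor('D') assumes a reasonably recent pandas):

In [20]: pd.crosstab(s.index.floor('D'), s)  # missing day/value pairs come out as 0 instead of NaN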