I have an N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing NaN values with zeros:
B[np.isnan(B)]=0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the original order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article
order_art=A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would otherwise prevent reindex from working properly:
B=A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace the NaN values with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
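One slightly more compact variant, as a sketch (assuming the value column is named Binaire as in the sample above, and that _Article is unique in A, so A['_Article'] already gives the desired column order):
import pandas as pd

# Small frame mirroring the sample in the question
A = pd.DataFrame({'_Segment': [550, 550, 551],
                  '_Article': [5568226, 5612047, 5909228],
                  'Binaire':  [1, 1, 1]})

# Pivot, fill the missing cells with 0, and restore the original _Article order in one chain
B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
       .fillna(0)
       .reindex(columns=A['_Article'].unique()))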
I need to assign to a new column the value 1 or 0 depending on what the other columns contain.
I have around 30 columns with binary values (1 or 0), but also other variables with numeric, continuous values (e.g. 200). I would like to avoid writing a logical condition with many ORs, so I was wondering if there is an easy and fast way to do it.
For example, creating a list with the names of the columns and assigning 1 to the new column if there is at least one value of 1 across those columns for the corresponding row.
Example:
a1 b1 d4 ....
1 0 1
0 0 1
0 0 0
...
Expected:
a1 b1 d4 .... New
1 0 1 1
0 0 1 1
0 0 0 0
...
Many thanks for your help
Here is a simple solution:
import pandas as pd

df = pd.DataFrame({'a1': [1, 0, 0, 1], 'b1': [0, 0, 0, 1], 'd4': [1, 1, 0, 0], 'num': [12, -2, 0, 3]})
df['New'] = df[['a1', 'b1', 'd4']].any(axis=1).astype(int)
df
a1 b1 d4 num New
0 1 0 1 12 1
1 0 0 1 -2 1
2 0 0 0 0 0
3 1 1 0 3 1
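If you do not want to type the 30 column names by hand, here is a rough sketch for building that list programmatically (my assumption: a column counts as binary if it only ever holds 0 or 1):
# Collect the columns whose values are a subset of {0, 1}
binary_cols = [c for c in df.columns if set(df[c].dropna().unique()) <= {0, 1}]
df['New'] = df[binary_cols].any(axis=1).astype(int)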
I removed some duplicate rows with the following command:
columns = XY.columns[:-1].tolist()
XY1 = XY.drop_duplicates(subset=columns, keep='first')
The result is below:
Combined Series shape : (100, 4)
Combined Series: 1 222 223 0
0 0 0 0 1998.850000
1 0 0 0 0.947361
2 0 0 0 0.947361
3 0 0 0 0.947361
4 0 0 0 0.947361
Now the columns are labelled 1, 222, 223, 0 (the 0 label at the end comes from a concat with another df!). I want the columns to be
relabelled from index 0 onwards. How do I do that?
So first create a dictionary with the mapping you want:
import numpy as np

trafo_dict = {x: y for x, y in zip([1, 222, 223, 0], np.linspace(0, 3, 4))}
Then you need to rename columns. This can be done with pd.DataFrame.rename:
XY1 = XY1.rename(columns=trafo_dict)
Edit: If you want it in a more general fashion use:
np.linspace(0, XY1.shape[1] - 1, XY1.shape[1])
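Putting that general version together, as a sketch (my assumption: you simply want the existing column order mapped onto 0, 1, 2, ...; note that np.linspace yields float labels, so use range(XY1.shape[1]) instead if you prefer integers):
import numpy as np

# Map whatever the current labels are to 0 .. n-1, preserving column order
trafo_dict = {old: new for old, new in zip(XY1.columns, np.linspace(0, XY1.shape[1] - 1, XY1.shape[1]))}
XY1 = XY1.rename(columns=trafo_dict)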
I have this dataset called 'event'
id event_type_1 event_type_2 event_type_3
234 0 1 0
234 1 0 0
345 0 0 0
and I want to produce this
id event_type_1 event_type_2 event_type_3
234 1 1 0
345 0 0 0
I tried using
event.groupby('id').sum()
but that just produced
id event_type_1 event_type_2 event_type_3
1 1 1 0
2 0 0 0
The id has been replaced with an incremental value starting at '1'. Why? And how do I get my desired result?
Use as_index=False parameter:
In [163]: event.groupby('id', as_index=False).sum()
Out[163]:
id event_type_1 event_type_2 event_type_3
0 234 1 1 0
1 345 0 0 0
From the docs:
as_index : boolean, default True
For aggregated output, return object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively
“SQL-style” grouped output
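An equivalent alternative, if you prefer to keep the default groupby behaviour and flatten the index afterwards, is to call reset_index on the result:
event.groupby('id').sum().reset_index()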
I have a df that looks like:
a b c d
0 0 0 0 0
1 0 0 0 0
2 1 292 0 0
3 0 500 1 406
4 1 335 0 0
I would like to find the sum of column b where a=1 for that row. So in my example I would want rows 2 and 4 added (just column b), but not row 3. If it makes any difference, there are only 0s and 1s. Thanks for any help!
You need to use .loc
>>> df.loc[df.a==1, 'b'].sum()
627
You can review the docs here for indexing and selecting data.
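Since column a only ever holds 0s and 1s, a multiply-and-sum one-liner gives the same result (just an alternative sketch, not what the .loc answer above relies on):
(df['a'] * df['b']).sum()   # 292 + 335 = 627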
I have a sample data such as:
User_ID bucket brand_name
0 100 3_6months A
1 7543 6_9months A
2 100 9_12months A
3 7543 3_6months B
4 7542 first_3months C
Now I want to reshape this data to one row per userid such that my output data looks like:
User_ID A_first_3months A_3_6months a6_9months A_last_9_12months B_3_6months B6_9month (so on)
100 0 1 2 1
7543 2 0 1 1
7542 0 0 1 0
So here I basically want to pivot on the two columns named bucket and brand_name and aggregate the result into one row per user. I know about the pandas crosstab, pivot and stack functions, but I am not able to judge the right way to do this, as we have three columns. Any help would be highly appreciated. Note that the entries can be greater than one, as we are looking for the total count of brands in a particular bucket for each user.
You could combine the brand and the bucket into a new column, and then apply crosstab to it:
df['brand_bucket'] = df['brand_name'] + '_' + df['bucket']
pd.crosstab(index=[df['User_ID']], columns=[df['brand_bucket']])
yields
brand_bucket A_3_6months A_9_12months B_3_6months B_6_9months \
User_ID
100 1 1 0 0
7542 0 0 0 0
7543 0 0 1 1
brand_bucket C_last_3months
User_ID
100 0
7542 1
7543 0
Or, you could pass two columns to crosstab and obtain a DataFrame with a MultiIndex:
pd.crosstab(index=[df['User_ID']], columns=[df['bucket'], df['brand_name']])
yields
bucket 3_6months 6_9months 9_12months last_3months
brand_name A B B A C
User_ID
100 1 0 0 1 0
7542 0 0 0 0 1
7543 0 1 1 0 0
I like the latter better because it preserves more of the structure of the data.
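For completeness, a groupby-based sketch that should produce the same counts (an alternative I am adding, not part of the original answer):
# Count rows per (User_ID, brand_name, bucket) and spread the last two levels into columns
counts = (df.groupby(['User_ID', 'brand_name', 'bucket']).size()
            .unstack(['brand_name', 'bucket'], fill_value=0))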