I have almost zero experience with Python, but I'm trying to learn it. I have a Pandas dataframe that came with some dummy columns. I want to convert them back to a single column, but I simply can't figure out a way to do that. Is there any way to do that?
I have this:
ID   var_1  var_2  var_3  var_4
231      1      0      0      0
220      0      1      0      0
303      0      0      1      0
324      0      0      0      1
I need to transform it into this:
ID   var
231    1
220    2
303    3
324    4
Assuming these really are one-hot encodings, use argmax along axis 1 (i.e., across the dummy columns):
pd.DataFrame({'ID' : df['ID'], 'var' : df.iloc[:, 1:].values.argmax(axis=1) + 1})
    ID  var
0  231    1
1  220    2
2  303    3
3  324    4
However, if "ID" is part of the index, use this instead (again adding 1 to map column positions to labels):
pd.DataFrame({'ID' : df.index, 'var' : df.values.argmax(axis=1) + 1})
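For context, a minimal runnable sketch of the argmax approach, rebuilding the sample frame from the question:

import pandas as pd

df = pd.DataFrame({'ID': [231, 220, 303, 324],
                   'var_1': [1, 0, 0, 0],
                   'var_2': [0, 1, 0, 0],
                   'var_3': [0, 0, 1, 0],
                   'var_4': [0, 0, 0, 1]})

# argmax(axis=1) gives the position of the (first) 1 in each row;
# adding 1 maps positions 0..3 to the labels 1..4
out = pd.DataFrame({'ID': df['ID'],
                    'var': df.iloc[:, 1:].values.argmax(axis=1) + 1})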
Another option is wide_to_long:
s = pd.wide_to_long(df, ['var'], i='ID', j='Var', sep='_')
s[s['var'] == 1].reset_index().drop(columns='var')
Out[593]:
    ID  Var
0  231    1
1  220    2
2  303    3
3  324    4
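Here wide_to_long melts the var_* columns into rows keyed by the numeric suffix (the new Var index level), so keeping only the rows where var equals 1 leaves exactly one row per ID, with Var holding the suffix of the hot column.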
I have a pandas dataframe:
id   colA  colB  colC
194     1     0     1
194     1     1     0
194     2     1     3
195     1     1     2
195     0     1     0
197     1     1     2
I would like to calculate the occurrence of each value, grouped by id. In my case, the expected result is:
id   countOfValue0  countOfValue1  countOfValue2  countOfValue3
194              2              3              1              1
195              1              2              1              0
197              0              1              1              0
If a value appears more than once in the same row, it is counted only once for that row (this is why, for id=194, the count of value 1 is 3).
I thought about separating the data into 3 dataframes grouped by id and each column, something like df.groupby(['id', 'colA']), but I can't find a proper way to calculate those counts based on id. There is probably a more efficient way of doing this.
Try:
res = (df.set_index("id", append=True).stack()   # long form: one value per (row, id, column)
         .reset_index(level=0)                   # keep the original row number as a column
         .reset_index(level=1, drop=True)        # drop the column-name level; index is now id
         .drop_duplicates()                      # distinct (row number, value) pairs
         .assign(_dummy=1)
         .rename(columns={0: "countOfValue"})
         .pivot_table(index="id", columns="countOfValue",
                      values="_dummy", aggfunc="sum")
         .fillna(0).astype(int))
res = res.add_prefix("countOfValue")
res.columns.name = None   # clear the leftover columns name from the pivot
Outputs:
     countOfValue0  ...  countOfValue3
id                  ...
194              2  ...              1
195              1  ...              0
197              0  ...              0
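A possible alternative, sketched with melt, drop_duplicates, and crosstab (assuming the sample frame from the question):

import pandas as pd

df = pd.DataFrame({"id":   [194, 194, 194, 195, 195, 197],
                   "colA": [1, 1, 2, 1, 0, 1],
                   "colB": [0, 1, 1, 1, 1, 1],
                   "colC": [1, 0, 3, 2, 0, 2]})

# melt to long form, keeping the row number so that a value
# repeated within one row is counted only once
long_df = df.reset_index().melt(id_vars=["index", "id"], value_name="value")
long_df = long_df.drop_duplicates(subset=["index", "value"])
res = pd.crosstab(long_df["id"], long_df["value"]).add_prefix("countOfValue")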
I have a dataframe like this:
df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val":  ["111", "111", "222", "222"],
                   "qwe":  ["", "11", "", "12"]})
It gives:
  flag qwe  val
0    1      111
1    0  11  111
2    1      222
3    0  12  222
Then I filter the first dataframe like this:
dff = df.loc[df["flag"]=="1"]
Originally:
dff.loc["qwe"] = "123"
Edited (setting all rows in column "qwe" to "123"):
dff["qwe"] = "123"
Now I need to merge/join df and dff in such a way that I get:
  flag  qwe  val
0    1  123  111
1    0   11  111
2    1  123  222
3    0   12  222
Changes in 'qwe' should be taken from dff only where the value in df is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
  flag_x qwe_x val_x flag_y qwe_y val_y
0      1        111      1        111
1      0    11  111    NaN   NaN  NaN
2      1        222      1        222
3      0    12  222    NaN   NaN  NaN
So after that I need to drop flag_y and val_y, rename the _x columns, and manually merge qwe_x and qwe_y. Is there any way to make this easier?
pd.merge has an on argument that you can use to join columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need a merge at all. You can produce the same result using df.loc to conditionally assign a value (the sample data holds empty strings rather than NaN, so compare against "" instead of using isnull()):
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
After the edited changes, this code works for me:
c1 = dff.combine_first(df)
It produces:
  flag  qwe  val
0    1  123  111
1    0   11  111
2    1  123  222
3    0   12  222
This is exactly what I was looking for.
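For completeness, a minimal end-to-end sketch of the combine_first approach, assuming the edited filter from above:

import pandas as pd

df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val":  ["111", "111", "222", "222"],
                   "qwe":  ["", "11", "", "12"]})

dff = df.loc[df["flag"] == "1"].copy()  # .copy() avoids a SettingWithCopyWarning
dff["qwe"] = "123"

# combine_first aligns on the index: rows present in dff take priority,
# and the remaining rows are filled in from df
c1 = dff.combine_first(df)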
I have this dataset called 'event'
id   event_type_1  event_type_2  event_type_3
234             0             1             0
234             1             0             0
345             0             0             0
and I want to produce this
id   event_type_1  event_type_2  event_type_3
234             1             1             0
345             0             0             0
I tried using
event.groupby('id').sum()
but that just produced
id  event_type_1  event_type_2  event_type_3
1              1             1             0
2              0             0             0
The id has been replaced with an incremental value starting at '1'. Why? And how do I get my desired result?
Use the as_index=False parameter:
In [163]: event.groupby('id', as_index=False).sum()
Out[163]:
    id  event_type_1  event_type_2  event_type_3
0  234             1             1             0
1  345             0             0             0
From the docs:
as_index : boolean, default True
For aggregated output, return object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively
“SQL-style” grouped output
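Equivalently, you can aggregate first and then move the group labels back into a column with reset_index:

event.groupby('id').sum().reset_index()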
I have a Pandas DataFrame as shown below. What I'm trying to do is partition (or group) by BlockID and LineID and then, within each group ordered by WordID, use the current WordStartX minus the previous (WordStartX + WordWidth) to derive another column, e.g., WordDistance, indicating the distance between this word and the previous one.
This post, Row operations within a group of a pandas dataframe, is very helpful, but in my case multiple columns are involved (WordStartX and WordWidth).
   BlockID  LineID  WordID  WordStartX  WordWidth  WordDistance
0        0       0       0         275        150  0
1        0       0       1         431         96  431-(275+150)=6
2        0       0       2         642         90  642-(431+96)=115
3        0       0       3         746        104  746-(642+90)=14
4        1       0       0         273         69  ...
5        1       0       1         352        151  ...
6        1       0       2         510         92
7        1       0       3         647         90
8        1       0       4         752        105
The diff() and shift() functions are usually helpful for calculations that refer to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
.apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift()).fillna(0).values)
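A variant without apply, as a sketch assuming the same frame: shift both columns within each group and subtract, so each row sees the previous word's end position.

g = df.groupby(['BlockID', 'LineID'])
# previous word's end = previous WordStartX + previous WordWidth
df['WordDistance'] = (df['WordStartX']
                      - (g['WordStartX'].shift() + g['WordWidth'].shift())).fillna(0)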
I have an N x 3 DataFrame called A that looks like this:
   _Segment  _Article  Binaire
0       550   5568226        1
1       550   5612047        1
2       550   5909228        1
3       550   5924375        1
4       550   5924456        1
5       550   6096557        1
...
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (NaN) with zeros:
B[np.isnan(B)]=0
and get:
         Binaire                                                 \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550            0       0       0       0       0       0       0
551            0       0       0       0       0       0       0
552            0       0       0       0       0       0       0
553            1       1       1       0       0       0       1
554            0       0       0       1       0       1       0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the order of the column _Article:
order_art = A['_Article']
In the pivot, add the values argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace the NaNs with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
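One slightly more compact variant, as a sketch: since _Article is unique, A['_Article'] is itself the original column order, so the pivot, reorder, and fill can be chained:

B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
      .reindex(columns=A['_Article'])
      .fillna(0))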