Turn Pandas multi-index into columns - python

I have a dataframe like this:
     action_type         value
0 0  link_click              1
  1  mobile_app_install      5
  2  video_view            181
  3  omni_view_content       2
1 0  post_reaction          32
  1  link_click            124
  2  mobile_app_install    190
  3  video_view           6162
  4  omni_custom          2420
  5  omni_activate_app    4525
2 0  comment                 1
  1  link_click             53
  2  post_reaction          23
  3  video_view           2246
  4  mobile_app_install     87
  5  omni_view_content      24
  6  post_engagement      2323
  7  page_engagement      2323
I want to transpose it so that the action_type values become columns:

It looks like you can try:
(df.set_index('action_type', append=True)
.reset_index(level=1, drop=True)['value']
.unstack('action_type')
)
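To see the snippet in action, the question's frame can be rebuilt from a small made-up subset of the values shown above; a minimal sketch:

```python
import pandas as pd

# Rebuild a small piece of the question's two-level index (made-up subset)
idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)])
df = pd.DataFrame(
    {"action_type": ["link_click", "mobile_app_install", "video_view",
                     "post_reaction", "link_click"],
     "value": [1, 5, 181, 32, 124]},
    index=idx,
)

out = (df.set_index("action_type", append=True)     # index: (outer, inner, action_type)
         .reset_index(level=1, drop=True)["value"]  # drop the per-group counter level
         .unstack("action_type"))                   # action_type values become columns
print(out)
```

Rows that lack a given action_type get NaN in the resulting column.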


Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c whose values count sequentially within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How to do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea comes from an old recipe in the Python docs.
The walrus operator (`:=`, an assignment expression) requires Python 3.8+; on older versions you can do instead:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
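The trick works because subtracting the positional index leaves a value that is constant within each consecutive run, so it can serve directly as a group key. A quick check on the question's data:

```python
import numpy as np
import pandas as pd

# The values from the question's column a
a = pd.Series([1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205])

# Subtracting the positional index leaves a constant per consecutive run
s = a - np.arange(len(a))
print(s.tolist())  # [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]

# ... so grouping by it yields the running count within each run
c = s.groupby(s).cumcount() + 1
print(c.tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3]
```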
Another approach: flag rows that continue a consecutive run, use cumsum to build a running count, and then subtract the count carried over from earlier groups.
a = df['a'].add(1).shift(1).eq(df['a'])
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3

python replace string in a specific dataframe column

I would like to replace any string in a dataframe column that starts with "chaud" by the string 'Chaudière'. I would also like the first and last name after each "Chaudière" to disappear, to anonymize NameDevice.
My data frame is called df1 and the column name is NameDevice.
I have tried this:
df1.loc[df1['NameDevice'].str.startswith('chaud'), 'NameDevice'] = df1['NameDevice'].str.replace("chaud", "Chaudière")
I check with df1.head(), and it returns:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation UuidAttributeDevice IdBox IsUpdateDevice
0 119 48 00001 Chaudière Maud Ferrand 4 NaN 4 0
1 120 48 00002 Chaudière Yvan Martinod 6 NaN 6 0
2 121 48 00006 Chaudière Anne-Sophie Premereur 7 NaN 7 0
3 122 48 00005 Chaudière Denis Fauser 8 NaN 8 0
4 123 48 00004 Chaudière Elariak Djilali 3 NaN 3 0
You can do the matching by calling str.lower first and then str.startswith; then just split on the spaces and take the first entry to anonymise the data:
In [14]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.split().str[0]
df
Out[14]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
Another method is to use str.extract so it only takes Chaud...:
In [27]:
df.loc[df['NameDevice'].str.lower().str.startswith('chaud'), 'NameDevice'] = df['NameDevice'].str.extract(r'(Chaud\w+)', expand=False)
df
Out[27]:
IdDevice IdDeviceType SerialDevice NameDevice IdLocation \
0 119 48 1 Chaudière 4
1 120 48 2 Chaudière 6
2 121 48 6 Chaudière 7
3 122 48 5 Chaudière 8
4 123 48 4 Chaudière 3
UuidAttributeDevice IdBox IsUpdateDevice
0 NaN 4 0
1 NaN 6 0
2 NaN 7 0
3 NaN 8 0
4 NaN 3 0
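Both snippets above assign through a boolean mask, relying on pandas index alignment. A single regex replace can do the case-insensitive match and the anonymisation in one step; a sketch with made-up names (not the asker's real data):

```python
import pandas as pd

# Hypothetical sample values for the NameDevice column
df = pd.DataFrame({"NameDevice": ["Chaudière Maud Ferrand",
                                  "chaudiere Yvan Martinod",
                                  "Radiateur Foo Bar"]})

# (?i) makes the match case-insensitive; any value starting with "chaud"
# is replaced wholesale by the single word "Chaudière"
df["NameDevice"] = df["NameDevice"].str.replace(
    r"(?i)^chaud\S*.*", "Chaudière", regex=True
)
print(df["NameDevice"].tolist())  # ['Chaudière', 'Chaudière', 'Radiateur Foo Bar']
```

Values not starting with "chaud" are left untouched, so no mask is needed.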

how to sort a column and group them on pandas?

I am new to pandas. I am trying to sort a column and group the rows by their numbers.
df = pd.read_csv("12Patients150526 mutations-ORIGINAL.txt", sep="\t", header=0)
samp=df["SAMPLE"]
samp
Out[3]:
0 11
1 2
2 9
3 1
4 8
5 2
6 1
7 3
8 10
9 4
10 5
..
53157 12
53158 3
53159 2
53160 10
53161 2
53162 3
53163 4
53164 11
53165 12
53166 11
Name: SAMPLE, dtype: int64
#sorting
grp=df.sort(samp)
This code does not work. Can somebody help me with my problem, please?
How can I sort the values and group them by their numbers?
To sort df based on a particular column, use df.sort_values() and pass the column name as a parameter (the old df.sort() method has since been removed from pandas).
import pandas as pd
import numpy as np
# data
# ===========================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,10,1000), columns=['SAMPLE'])
df
SAMPLE
0 6
1 1
2 4
3 4
4 8
5 4
6 6
7 3
.. ...
992 3
993 2
994 1
995 2
996 7
997 4
998 5
999 4
[1000 rows x 1 columns]
# sort
# ======================
df.sort_values('SAMPLE')
SAMPLE
310 1
710 1
935 1
463 1
462 1
136 1
141 1
144 1
.. ...
174 9
392 9
386 9
382 9
178 9
772 9
890 9
307 9
[1000 rows x 1 columns]
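To also group the sorted rows by their numbers, chain a `groupby` after the sort; a minimal sketch on a toy version of the SAMPLE column:

```python
import pandas as pd

# Toy version of the SAMPLE column
df = pd.DataFrame({"SAMPLE": [11, 2, 9, 1, 2, 1, 3]})

# sort_values is the current replacement for the removed df.sort;
# groupby(...).size() then counts how often each number occurs
counts = df.sort_values("SAMPLE").groupby("SAMPLE").size()
print(counts.to_dict())  # {1: 2, 2: 2, 3: 1, 9: 1, 11: 1}
```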

I have a csv file and would like to sort it by multiple columns as integers rather than strings, as itemgetter does

x y xm ym mdb wat tile
43460 2095 623424.9213 -3891371.696 1 10324 38
43010 2599 638544.9213 -3877871.696 1 1 38
35871 2702 641634.9213 -3663701.696 1 0 50
43451 3296 659454.9213 -3891101.696 1 5951 38
40081 3330 660474.9213 -3790001.696 1 0 38
39084 3796 674454.9213 -3760091.696 1 0 50
6910 34119 1584144.921 -2794871.696 1 0 128
7040 29565 1447524.921 -2798771.696 1 0 127
7452 27335 1380624.921 -2811131.696 1 0 127
7976 34974 1609794.921 -2826851.696 1 0 128
pandas makes this easy.
import pandas as pd
# Read in the csv file
df = pd.read_csv('input_file.csv')
>>> df
x y xm ym mdb wat tile
0 43460 2095 623424.9213 -3891371.696 1 10324 38
1 43010 2599 638544.9213 -3877871.696 1 1 38
2 35871 2702 641634.9213 -3663701.696 1 0 50
3 43451 3296 659454.9213 -3891101.696 1 5951 38
4 40081 3330 660474.9213 -3790001.696 1 0 38
5 39084 3796 674454.9213 -3760091.696 1 0 50
6 6910 34119 1584144.9210 -2794871.696 1 0 128
7 7040 29565 1447524.9210 -2798771.696 1 0 127
8 7452 27335 1380624.9210 -2811131.696 1 0 127
9 7976 34974 1609794.9210 -2826851.696 1 0 128
# Do the actual sort. I've chosen x and y for sorting here, arbitrarily as an example
df_sorted = df.sort_values(['x', 'y'], ascending=[True, False])  # sorts column x ascending and y descending
>>> df_sorted
x y xm ym mdb wat tile
6 6910 34119 1584144.9210 -2794871.696 1 0 128
7 7040 29565 1447524.9210 -2798771.696 1 0 127
8 7452 27335 1380624.9210 -2811131.696 1 0 127
9 7976 34974 1609794.9210 -2826851.696 1 0 128
2 35871 2702 641634.9213 -3663701.696 1 0 50
5 39084 3796 674454.9213 -3760091.696 1 0 50
4 40081 3330 660474.9213 -3790001.696 1 0 38
1 43010 2599 638544.9213 -3877871.696 1 1 38
3 43451 3296 659454.9213 -3891101.696 1 5951 38
0 43460 2095 623424.9213 -3891371.696 1 10324 38
# Write output to csv if you want
df_sorted.to_csv('output_csv.csv', index=False)
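For comparison, the pure-Python route the title alludes to: `operator.itemgetter` compares the CSV fields as strings, so without pandas the fix is simply to convert to int inside the sort key. A sketch with made-up rows:

```python
from operator import itemgetter

# Made-up rows as read from a csv: every field is a string
rows = [["43460", "2095"], ["7040", "29565"], ["6910", "34119"]]

# itemgetter alone compares lexicographically: "43460" < "6910" < "7040"
lex = sorted(rows, key=itemgetter(0))

# converting inside the key compares numerically instead
num = sorted(rows, key=lambda r: int(r[0]))
print([r[0] for r in num])  # ['6910', '7040', '43460']
```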

Python Pandas operate on row

Hi, my dataframe looks like:
Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264
Simply, I need to add another column called '_id' as the concatenation of Store, Dept, Date, like "1_1_2010-02-05". I assumed I could do it with df['id'] = df['Store'] +'' +df['Dept'] +'_'+df['Date'], but it turned out not to work.
Similarly, I also need to add a new column as the log of sales. I tried df['logSales'] = math.log(df['Sales']); again, it did not work.
You can first convert it to strings (the integer columns) before concatenating with +:
In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
For the log, you had better use the numpy function, which is vectorized (math.log only works on a single scalar value):
In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from the vectorized functions (which operate on the full column), but it will once your real dataframe becomes larger.
Apart from that, I think the vectorized solution also has simpler syntax.
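When many columns need to be joined, a vectorized alternative to chaining `+` is to cast all the parts at once and join per row; a sketch on a trimmed-down version of the question's frame:

```python
import pandas as pd

# Trimmed-down version of the question's data
df = pd.DataFrame({"Store": [1, 1], "Dept": [1, 1],
                   "Date": ["2010-02-05", "2010-02-12"]})

# Cast every part to string, then join each row on "_"
df["id"] = df[["Store", "Dept", "Date"]].astype(str).agg("_".join, axis=1)
print(df["id"].tolist())  # ['1_1_2010-02-05', '1_1_2010-02-12']
```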
In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
Store Dept Date Sales
0 1 1 2010-02-05 245
1 1 1 2010-02-12 449
2 1 1 2010-02-19 455
3 1 1 2010-02-26 154
4 1 1 2010-03-05 29
5 1 1 2010-03-12 239
6 1 1 2010-03-19 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise, you need to convert store and dept to strings in order to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
[7 rows x 5 columns]
In [155]:
import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
[7 rows x 6 columns]
