Python Pandas operate on row - python

Hi, my dataframe looks like:
Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264
Simply, I need to add another column called '_id' as a concatenation of Store, Dept and Date, like "1_1_2010-02-05". I assumed I could do it with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turned out not to work.
Similarly, I also need to add a new column as the log of Sales. I tried df['logSales'] = math.log(df['Sales']), but again it did not work.

You can first convert the integer columns to strings before concatenating with +:
In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
For the log, you'd better use the numpy function, which is vectorized (math.log only works on single scalar values):
In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from the vectorized functions (which work on the full column), but it will once your real dataframe becomes larger.
Apart from that, I think the above solution also has the simpler syntax.
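If you prefer a single method call, a roughly equivalent sketch (not part of the original answer) uses Series.str.cat, which accepts a list of other columns and a separator:
# same id column built with str.cat; sep is inserted between all parts
df['id'] = df['Store'].astype(str).str.cat([df['Dept'].astype(str), df['Date']], sep='_')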

In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
Store Dept Date Sales
0 1 1 2010-02-05 245
1 1 1 2010-02-12 449
2 1 1 2010-02-19 455
3 1 1 2010-02-26 154
4 1 1 2010-03-05 29
5 1 1 2010-03-12 239
6 1 1 2010-03-19 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; Store and Dept need to be converted to strings in order to build the new id
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
[7 rows x 5 columns]
In [155]:
import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
[7 rows x 6 columns]
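To check the performance note above, a rough sketch (using a made-up, larger dataframe rather than the question's data) could compare the two approaches in IPython:
import numpy as np
import pandas as pd

# hypothetical larger frame just for timing; the scalar 'Date' is broadcast to every row
big = pd.DataFrame({'Store': np.random.randint(1, 50, 100000),
                    'Dept': np.random.randint(1, 100, 100000),
                    'Date': '2010-02-05'})
%timeit big['Store'].astype(str) + '_' + big['Dept'].astype(str) + '_' + big['Date']
%timeit big.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)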

Related

Rule based sorting for each subset in a pandas dataframe

I have a sorting rule for a column of a dataframe I am working on. The rule is that the positions of two consecutive rows are swapped 50% of the time if the ratio of their values in a specific column is within a defined range. Here is the code:
import random

def randomized_sort(df):
    """
    :param df: dataframe
    :return: sorted dataframe based on the condition
    """
    length = len(df) if len(df) % 2 == 0 else len(df) - 1
    for i in range(0, length, 2):
        if random.random() < 0.5:
            if 0.7 < df.iloc[i, :].weight / df.iloc[i + 1, :].weight < 1.3:
                a, b = df.iloc[i, :].copy(), df.iloc[i + 1, :].copy()
                df.iloc[i, :], df.iloc[i + 1, :] = b, a
    return df
However, I have a new dataframe in which I have to perform this operation within each subset/group. Please see the data below. The above operation needs to be done for each subset grouped by the order column.
How can this be done?
From your question it is not clear what you mean by subset/group.
Assuming, you want to treat each unique value in the order column as its own subset/group, you could simply filter your DataFrame for a given order value and process it with your method.
Afterwards, you can then concatenate all your individual DataFrames back together.
Example with dummy DataFrame:
df = pd.DataFrame()
number_of_rows = 20
df["order"]=[random.randint(0,3) for x in range(number_of_rows)]
df["weight"]=[random.randint(300,900) for x in range(number_of_rows)]
df.sort_values(by="order",inplace=True)
index  order  weight
0      0      629
1      0      842
3      0      326
5      0      533
6      0      621
17     1      772
11     1      333
10     1      399
18     1      369
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
14     2      411
15     2      862
16     2      888
9      2      526
12     3      345
13     3      430
Now filter the DataFrame for subset with order value of 1:
df[df["order"]==1]
index  order  weight
17     1      772
11     1      333
10     1      399
18     1      369
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
And then run your method with this subset DataFrame:
subset_df = df[df["order"]==1].copy()
sorted_df = randomized_sort(subset_df)
sorted_df
index  order  weight
17     1      772
11     1      333
10     1      369
18     1      399
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
Now, do this in a loop for every subset:
ordered_subsets = sorted(df.order.unique())
overall_sorted_df = pd.DataFrame()
for order_value in ordered_subsets:
    subset_df = df[df["order"] == order_value].copy()
    sorted_df = randomized_sort(subset_df)
    overall_sorted_df = pd.concat([overall_sorted_df, sorted_df])
overall_sorted_df
index  order  weight
0      0      842
1      0      629
3      0      326
5      0      533
6      0      621
17     1      772
11     1      333
10     1      399
18     1      369
19     1      414
7      1      380
4      1      640
2      1      800
8      1      670
14     2      411
15     2      862
16     2      888
9      2      526
12     3      345
13     3      430
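For completeness, a more compact sketch (assuming the same df and randomized_sort as above) lets groupby do the splitting and concatenation in one call:
# group_keys=False keeps the original index instead of adding 'order' as an extra level
overall_sorted_df = df.groupby("order", group_keys=False).apply(randomized_sort)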
Hope that helps!

Python-How to bin positive and negative values to get counts for time series plot

I'm trying to recreate a time-series plot similar to the one below (not including the 'HLMA Flashes' data)
This is what my datafile looks like; the polarity is in the "charge" column. I used pandas to load the file and set up the table in a Jupyter notebook. The value of the charge does not matter, only whether it is positive or negative.
Once I get the counts of total/negative/positive values, I know how to plot them against time, but I'm not sure how to approach the binning to get the counts (or whatever is needed) to make the time series. Preferably I need this in 5-minute bins during the timeframe of my dataframe (0000-0700 UTC). Apologies if this question is worded poorly, but any leads would be appreciated.
Link to .txt file: https://drive.google.com/file/d/13XEc74LO3cZQhylAdSfhLeUn7GFgtiKT/view?usp=sharing
Here's a way to do what I believe you are asking:
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime(df.agg(lambda x: f"{x['Date']} {x['Time']}", axis=1)),
    'Neg': df.Charge < 0,
    'Pos': df.Charge > 0,
    'Tot': [1] * len(df)})
df2['minutes'] = (df2.Datetime.dt.hour * 60 + df2.Datetime.dt.minute) // 5 * 5
df3 = df2[['minutes','Neg','Pos','Tot']].groupby('minutes').sum()
Output:
Neg Pos Tot
minutes
45 0 1 1
55 0 1 1
65 0 2 2
85 0 2 2
90 0 2 2
95 0 1 1
100 0 3 3
105 1 4 5
110 2 11 13
115 0 10 10
120 0 6 6
125 1 13 14
130 3 70 73
135 2 20 22
140 1 5 6
165 0 2 2
170 3 1 4
175 2 5 7
180 2 12 14
185 3 26 29
190 1 11 12
195 0 4 4
200 1 14 15
205 1 4 5
210 0 1 1
215 0 1 1
220 0 1 1
225 3 0 3
230 1 5 6
235 0 4 4
240 1 2 3
245 0 3 3
260 0 1 1
265 0 1 1
Explanation:
create a 'Datetime' column from the 'Date' and 'Time' columns using to_datetime()
create 'Neg' and 'Pos' columns based on the sign of 'Charge', and create a 'Tot' column equal to 1 for each row
create a 'minutes' column to bin the rows into 5-minute intervals
use groupby() and sum() to aggregate 'Neg', 'Pos' and 'Tot' for each interval with at least one row.
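If you prefer real timestamps as the bin labels instead of minutes since midnight, a roughly equivalent sketch (assuming the same df with 'Date', 'Time' and 'Charge' columns; df4 is just a hypothetical name) bins directly with dt.floor:
# build a timestamp per row, then group the counts by 5-minute floor of that timestamp
ts = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Time'].astype(str))
df4 = pd.DataFrame({'Neg': df.Charge < 0,
                    'Pos': df.Charge > 0,
                    'Tot': 1}).groupby(ts.dt.floor('5min')).sum()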

Turn Pandas multi-index into columns

I have a similar dataframe:
action_type value
0 0 link_click 1
1 mobile_app_install 5
2 video_view 181
3 omni_view_content 2
1 0 post_reaction 32
1 link_click 124
2 mobile_app_install 190
3 video_view 6162
4 omni_custom 2420
5 omni_activate_app 4525
2 0 comment 1
1 link_click 53
2 post_reaction 23
3 video_view 2246
4 mobile_app_install 87
5 omni_view_content 24
6 post_engagement 2323
7 page_engagement 2323
I want to transpose it so that each action_type becomes its own column:
It looks like you can try:
(df.set_index('action_type', append=True)
.reset_index(level=1, drop=True)['value']
.unstack('action_type')
)
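As a self-contained check, a small sketch (with made-up values in the shape of the question's frame: a two-level index plus 'action_type' and 'value' columns) shows what the chain produces:
import pandas as pd

# hypothetical reproduction of the question's frame
idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame({'action_type': ['link_click', 'video_view', 'link_click', 'post_reaction'],
                   'value': [1, 181, 124, 32]}, index=idx)

out = (df.set_index('action_type', append=True)
         .reset_index(level=1, drop=True)['value']
         .unstack('action_type'))
# out has one row per outer index value and one column per action_type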

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2, where I need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
Alternative: cast the boolean mask to integer, which converts True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
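For what it's worth, a merge-based sketch (assuming df1 and df2 as above) produces the same flag without map:
# sum per c_id, attach it to df1 via the 'code' column, then compare
sums = df2.groupby('c_id', as_index=False)['sum_column'].sum()
merged = df1.merge(sums, left_on='code', right_on='c_id', how='left')
df1['seq'] = (merged['amount'] > merged['sum_column']).astype(int).to_numpy()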

how to sort a column and group them on pandas?

I am new to pandas. I am trying to sort a column and group the rows by their numbers.
df = pd.read_csv("12Patients150526 mutations-ORIGINAL.txt", sep="\t", header=0)
samp=df["SAMPLE"]
samp
Out[3]:
0 11
1 2
2 9
3 1
4 8
5 2
6 1
7 3
8 10
9 4
10 5
..
53157 12
53158 3
53159 2
53160 10
53161 2
53162 3
53163 4
53164 11
53165 12
53166 11
Name: SAMPLE, dtype: int64
#sorting
grp=df.sort(samp)
This code does not work. Can somebody help me with my problem, please?
How can I sort and group them by their numbers?
To sort a df based on a particular column, use df.sort() and pass the column name as a parameter.
import pandas as pd
import numpy as np
# data
# ===========================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,10,1000), columns=['SAMPLE'])
df
SAMPLE
0 6
1 1
2 4
3 4
4 8
5 4
6 6
7 3
.. ...
992 3
993 2
994 1
995 2
996 7
997 4
998 5
999 4
[1000 rows x 1 columns]
# sort
# ======================
df.sort('SAMPLE')
SAMPLE
310 1
710 1
935 1
463 1
462 1
136 1
141 1
144 1
.. ...
174 9
392 9
386 9
382 9
178 9
772 9
890 9
307 9
[1000 rows x 1 columns]
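A note in case you are on a newer pandas: DataFrame.sort was deprecated and later removed, so the equivalent call today is sort_values. A minimal sketch with the same df, plus a plain groupby for the "group them by their numbers" part:
# replaces df.sort('SAMPLE') in current pandas versions
df_sorted = df.sort_values('SAMPLE')
# e.g. number of rows per SAMPLE value
counts = df.groupby('SAMPLE').size()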
