I have a Target Table with two types of categories: stationID and Month. I need to standardise the Temperature values of that table against the values of another Reference Table (by matching the stationID). What would be the best way to do that with pandas?
For example:
Reference Table: it contains mean and standard deviation reference values for unique stations
stationID | Temp_mean | Temp_std | ...
----------+-----------+----------+
A         | 30.0      | 3.4      |
B         | 31.1      | 4.5      |
C         | 24.5      | 0.2      |
...
Target Table: it contains the raw data for each station and month
stationID | Mon | Temperature | ...
----------+-----+-------------+
A         | 1   | 30.1        |
A         | 2   | 31.2        |
A         | 3   | 24.0        |
B         | 1   | 30.3        |
C         | 2   | 20.4        |
C         | 1   | 24.3        |
C         | 2   | 25.4        |
...
So, from the temperature values in the Target table, I need to subtract the mean and divide by the standard deviation of the reference table.
What I have so far is the code below:
df['Temperature_Stdized'] = df.groupby(['stationID','Mon'])['Temperature'].transform(lambda x: (x - x.mean()) / x.std())
But, instead of using the mean and std from "x", I would like to use the values from the Reference Table, by matching the stationID values.
Any help is appreciated. Thanks.
Considering your Reference Table to be ref and your Target Table to be tar, you could do:
tar['Temperature'] = (ref.merge(tar, on='stationID')
                         .eval('(Temperature - Temp_mean) / Temp_std'))
stationID Mon Temperature
0 A 1 0.029412
1 A 2 0.352941
2 A 3 -1.764706
3 B 1 -0.177778
4 C 2 -20.500000
5 C 1 -1.000000
6 C 2 4.500000
Details
The first step is a merge of both dataframes on stationID:
x = ref.merge(tar, on = 'stationID')
print(x)
stationID Temp_mean Temp_std Mon Temperature
0 A 30.0 3.4 1 30.1
1 A 30.0 3.4 2 31.2
2 A 30.0 3.4 3 24.0
3 B 31.1 4.5 1 30.3
4 C 24.5 0.2 2 20.4
5 C 24.5 0.2 1 24.3
6 C 24.5 0.2 2 25.4
and then eval with the following expression to normalise each row:
x.eval('(Temperature - Temp_mean) / Temp_std')
0 0.029412
1 0.352941
2 -1.764706
3 -0.177778
4 -20.500000
5 -1.000000
6 4.500000
dtype: float64
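If you prefer to leave the Target Table's row order and columns untouched, a minimal alternative sketch (assuming ref and tar hold the example data above, and writing the result to a new Temperature_Stdized column) maps the reference statistics onto tar by stationID:
import pandas as pd

ref = pd.DataFrame({'stationID': ['A', 'B', 'C'],
                    'Temp_mean': [30.0, 31.1, 24.5],
                    'Temp_std': [3.4, 4.5, 0.2]})
tar = pd.DataFrame({'stationID': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
                    'Mon': [1, 2, 3, 1, 2, 1, 2],
                    'Temperature': [30.1, 31.2, 24.0, 30.3, 20.4, 24.3, 25.4]})

stats = ref.set_index('stationID')
mean = tar['stationID'].map(stats['Temp_mean'])   # per-row reference mean
std = tar['stationID'].map(stats['Temp_std'])     # per-row reference std
tar['Temperature_Stdized'] = (tar['Temperature'] - mean) / std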
Related
Let's say I have a dataframe like the one below:
+------+------+------+-------------+
| A | B | C | devisor_col |
+------+------+------+-------------+
| 2 | 4 | 10 | 2 |
| 3 | 3 | 9 | 3 |
| 10 | 25 | 40 | 10 |
+------+------+------+-------------+
What would be the best way to apply a formula using the values from devisor_col? Note that I have thousands of columns and rows.
the result should be like this:
+------+------+------+-------------+
|  A   |  B   |  C   | devisor_col |
+------+------+------+-------------+
|  1   |  2   |  5   |      2      |
|  1   |  1   |  3   |      3      |
|  1   | 2.5  |  4   |     10      |
+------+------+------+-------------+
I tried using applymap, but I don't know why I can't apply it to all columns.
modResult = my_df.applymap(lambda x: x / x["devisor_col"])
IIUC, use pandas.DataFrame.divide on axis=0:
modResult = pd.concat(
    [my_df,
     my_df.filter(like="Col")                    # selecting the columns to divide
          .divide(my_df["devisor_col"], axis=0)
          .add_suffix("_div")],
    axis=1,
)
# Output :
print(modResult)
Col1 Col2 Col3 devisor_col Col1_div Col2_div Col3_div
0 2 4 10 2 1.0 2.0 5.0
1 3 3 9 3 1.0 1.0 3.0
2 10 25 40 10 1.0 2.5 4.0
If you need only the result of the divide, use this:
modResult = my_df.filter(like="Col").divide(my_df["devisor_col"], axis=0)
print(modResult)
Col1 Col2 Col3
0 1.0 2.0 5.0
1 1.0 1.0 3.0
2 1.0 2.5 4.0
Or if you want to overwrite the old columns, use pandas.DataFrame.join:
modResult = (
    my_df.filter(like="Col")
         .divide(my_df["devisor_col"], axis=0)
         .join(my_df["devisor_col"])
)
Col1 Col2 Col3 devisor_col
0 1.0 2.0 5.0 2
1 1.0 1.0 3.0 3
2 1.0 2.5 4.0 10
You can replace my_df.filter(like="Col") with my_df.loc[:, my_df.columns!="devisor_col"].
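Since the question's frame actually has columns A, B and C rather than Col1..Col3, here is a short sketch of the same idea with that boolean column selection (my_df is assumed to hold the question's data):
import pandas as pd

my_df = pd.DataFrame({"A": [2, 3, 10], "B": [4, 3, 25], "C": [10, 9, 40],
                      "devisor_col": [2, 3, 10]})

cols = my_df.columns != "devisor_col"            # every column except the divisor
modResult = (my_df.loc[:, cols]
                  .divide(my_df["devisor_col"], axis=0)
                  .join(my_df["devisor_col"]))
print(modResult)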
You can try using .loc
df = pd.DataFrame([[1,2,3,1],[2,3,4,5],[4,5,6,7]], columns=['col1', 'col2', 'col3', 'divisor'])
df.loc[:, df.columns != 'divisor'] = df.loc[:, df.columns != 'divisor'].divide(df['divisor'], axis=0)
Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create a column d containing [1, 2, 3], i.e. the first non-NaN value in each row.
There can be an arbitrary amount of columns (though it's going to be <30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
Try this:
df.fillna(method='bfill', axis=1).iloc[:, 0]
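Note that in recent pandas versions the method= argument of fillna is deprecated; an equivalent (assuming the same df) is:
df.bfill(axis=1).iloc[:, 0]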
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a
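If what is wanted is the value of the first non-NaN column per row (rather than the minimum), a small sketch (assuming the df from the repro above) turns the column labels from idxmin into a positional lookup:
import numpy as np

first_col = df.isna().idxmin(axis=1)              # first non-NaN column label per row
positions = df.columns.get_indexer(first_col)     # label -> column position
df['d'] = df.to_numpy()[np.arange(len(df)), positions]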
I have a dataframe with subjects and dates for a certain measurement. For each subject I want to determine whether the date in each row corresponds to the first (1), second (2), third (3)... unique date value for that subject.
To clarify, this is what I am looking for:
| subject | date       | order |
| A       | 01.01.2020 | 1     |
| A       | 01.01.2020 | 1     |
| A       | 02.01.2020 | 2     |
| B       | 01.01.2020 | 1     |
| B       | 02.01.2020 | 2     |
| B       | 02.01.2020 | 2     |
I thought about something like the code below, but the for loop is not admissible inside the apply function:
df['order']=df.groupby(['subject']).apply(lambda x: i if x['date']=value for i, value in enumerate(x['date'].unique()))
Is there a straightforward way to do this?
Use factorize in GroupBy.transform:
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
subject date order order1
0 A 01.01.2020 1 1
1 A 01.01.2020 1 1
2 A 02.01.2020 2 2
3 B 01.01.2020 1 1
4 B 02.01.2020 2 2
5 B 02.01.2020 2 2
Or you can use GroupBy.rank, but it is necessary to convert the date column to datetimes:
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order2
0 A 2020-01-01 1 1.0
1 A 2020-01-01 1 1.0
2 A 2020-02-01 2 2.0
3 B 2020-01-01 1 1.0
4 B 2020-02-01 2 2.0
5 B 2020-02-01 2 2.0
The difference between the two solutions appears if the order of the datetimes is changed:
print (df)
subject date order (disregarding temporal order of date)
0 A 2020-01-01 1
1 A 2020-03-01 2 <- changed datetime for sample
2 A 2020-02-01 3
3 B 2020-01-01 1
4 B 2020-02-01 2
5 B 2020-02-01 2
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order1 order2
0 A 2020-01-01 1 1 1.0
1 A 2020-03-01 1 2 3.0
2 A 2020-02-01 2 3 2.0
3 B 2020-01-01 1 1 1.0
4 B 2020-02-01 2 2 2.0
5 B 2020-02-01 2 2 2.0
In summary: use the first method if the temporal order of the dates does not need to be reflected in the output, or the second method if it does.
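A self-contained sketch of both methods, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({'subject': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'date': ['01.01.2020', '01.01.2020', '02.01.2020',
                            '01.01.2020', '02.01.2020', '02.01.2020']})

# first method: order of first appearance per subject
df['order1'] = df.groupby('subject')['date'].transform(lambda x: pd.factorize(x)[0]) + 1

# second method: temporal order per subject (needs real datetimes)
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
df['order2'] = df.groupby('subject')['date'].rank(method='dense')
print(df)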
I am having a hard time figuring out how to get "rolling weights" based off of one of my columns, then factor these weights onto another column.
I've tried groupby.rolling.apply (function) on my data but the main problem is just conceptualizing how I'm going to take a running/rolling average of the column I'm going to turn into weights, and then factor this "window" of weights onto another column that isn't rolled.
I'm also purposely setting min_periods to 1, so you'll notice that the first two rows of each group in the final output column "rwavg" mirror the original.
W is the rolling column to derive the weights from.
B is the column to apply the rolled weights to.
Grouping is only done on column a.
df is already sorted by a and yr.
def wavg(w, x):
    return (x * w).sum() / w.sum()

n = df.groupby(['a1'])[['w']].rolling(window=3, min_periods=1).apply(lambda x: wavg(df['w'], df['b']))
Input:
id | yr | a | b | w
---------------------------------
0 | 1990 | a1 | 50 | 3000
1 | 1991 | a1 | 40 | 2000
2 | 1992 | a1 | 10 | 1000
3 | 1993 | a1 | 20 | 8000
4 | 1990 | b1 | 10 | 500
5 | 1991 | b1 | 20 | 1000
6 | 1992 | b1 | 30 | 500
7 | 1993 | b1 | 40 | 4000
Desired output:
id | yr   | a  | b  | rwavg
---------------------------------
0  | 1990 | a1 | 50 | 50
1  | 1991 | a1 | 40 | 40
2  | 1992 | a1 | 10 | 39.96
3  | 1993 | a1 | 20 | 22.72
4  | 1990 | b1 | 10 | 10
5  | 1991 | b1 | 20 | 20
6  | 1992 | b1 | 30 | 20
7  | 1993 | b1 | 40 | 35.45
apply with rolling usually has some weird behavior; instead, precompute b*w and take the ratio of two rolling sums:
df['Weight'] = df.b * df.w                                  # numerator terms b*w
g = df.groupby(['a']).rolling(window=3, min_periods=1)
g['Weight'].sum() / g['w'].sum()                            # rolling weighted average
df['rwavg'] = (g['Weight'].sum() / g['w'].sum()).values
Out[277]:
a
a1 0 50.000000
1 46.000000
2 40.000000
3 22.727273
b1 4 10.000000
5 16.666667
6 20.000000
7 35.454545
dtype: float64
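Assigning with .values relies on df being sorted by a and yr, as the question states. A slightly more defensive variant of the same idea (shown as a sketch with a hypothetical reconstruction of the input; the id column is just the row index and is omitted) aligns the result back by the original row index instead:
import pandas as pd

df = pd.DataFrame({'yr': [1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993],
                   'a':  ['a1'] * 4 + ['b1'] * 4,
                   'b':  [50, 40, 10, 20, 10, 20, 30, 40],
                   'w':  [3000, 2000, 1000, 8000, 500, 1000, 500, 4000]})

df['Weight'] = df['b'] * df['w']
g = df.groupby('a').rolling(window=3, min_periods=1)
rwavg = g['Weight'].sum() / g['w'].sum()        # indexed by ('a', original row index)
df['rwavg'] = rwavg.droplevel('a')              # drop the group level to align by row index
print(df)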
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
To calculate the cumulative occurrences of values in id, but only within a specified time window, for example 4 months, i.e. occurrences more than 4 months old no longer count.
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1.
Focusing on id = B, we see that the 2nd occurrence of B is 2 months after the first, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
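# hypothetical construction of the test frame above (not in the original),
# so that the preprocessing and solution below can be run as-is
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-03-01', '2017-03-01', '2017-03-01', '2017-05-01', '2017-05-01',
             '2017-07-01', '2017-07-01', '2017-08-01', '2017-09-01', '2017-09-01'],
    'id': ['A', 'B', 'C', 'B', 'D', 'A', 'D', 'C', 'B', 'B'],
    'cum_count_desired': [1, 1, 1, 2, 1, 2, 2, 1, 2, 3],
})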
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    # count the rows in the trailing 4-month window ending at each row's date
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    # adjust for IDs that occur more than once on the same day (see below)
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    y = y + adjust.to_numpy()
    return y
df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach I use will overcount each of the same-day IDs: when the list comprehension reaches a date on which the ID shows up more than once, the date-based slice immediately grabs all of the same-day values. The fix:
1. Group the current DataFrame by date.
2. Rank each row in each date group.
3. Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
4. Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a pd.Grouper, which bins the dates into fixed 4-month periods (calendar bins rather than a trailing 4-month window):
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
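cumcount is zero-based; if you want counts starting at 1 as in the desired output, add 1:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount() + 1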
We can also make use of row-wise .apply, working on a sliced df. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta
import pandas as pd

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(
    df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))],
    current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I considered whether it is feasible to use .rolling, but months are not a fixed period, so it might not work.