Divide row values in a pandas DataFrame by a specific column - python

I have a dataframe which looks like this:
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
What I need to do is calculate new rows based on a calculation applied to each existing row:
DESIRED OUTPUT
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
Quartz_v 5.06 8.69 8.84 10.69 2.648
Plagioclase_v ...
So basically I need to do the following steps:
1) Define the new row, for example Quartz_v,
2) and then perform the following calculation: Quartz_v = each column value of Quartz divided by the Density value of Quartz.
I have already loaded the data as two dataframes, the density and mineral ones, and merged them, so that each mineral has the correct density in front of it.

Use
DataFrame.div with axis=0 to perform division,
rename to rename the index, and
append to concatenate the result to the original (you can also use pd.concat instead).
d = df['Density']
result = df.append(df.div(d, axis=0).assign(Density=d).rename(lambda x: x+'_v'))
result
1 2 3 4 Density
Mineral
Quartz 13.400000 23.000000 23.400000 28.300000 2.648
Plagioclase 5.200000 8.200000 8.500000 11.700000 2.620
K-feldspar 2.300000 2.400000 2.600000 3.100000 2.750
Quartz_v 5.060423 8.685801 8.836858 10.687311 2.648
Plagioclase_v 1.984733 3.129771 3.244275 4.465649 2.620
K-feldspar_v 0.836364 0.872727 0.945455 1.127273 2.750
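Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on those versions the same result can be built with pd.concat, as mentioned above. A minimal sketch of that variant:
d = df['Density']
scaled = df.div(d, axis=0).assign(Density=d).rename(lambda x: x + '_v')
result = pd.concat([df, scaled])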

Related

how to pass percentile inputs to `describe` when using `groupby` and `agg` in PANDAS

How can I combine describe with custom percentiles and sum (or any other function) using agg?
To get percentiles and other statistics for columns with groupby, one can do:
df.groupby('A')['revenue'].describe(percentiles=[0.95])
If I want sum I can do the following, but I have no idea how to pass the percentiles argument to the agg method. :(
df.groupby('A',dropna=False)['Revenue'].agg(['sum','describe'])
You can try this:
df = pd.DataFrame({'A':range(10), 'B':range(10,20)})
pd.concat([df.describe(percentiles=[.2, .4, .6, .8]), df.sum().to_frame('sum').T], axis=0).T
Output:
count mean std min 20% 40% 50% 60% 80% max sum
A 10.0 4.5 3.02765 0.0 1.8 3.6 4.5 5.4 7.2 9.0 45.0
B 10.0 14.5 3.02765 10.0 11.8 13.6 14.5 15.4 17.2 19.0 145.0
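The snippet above skips the groupby from the question; a sketch of the same idea per group, assuming the grouping column 'A' and value column 'Revenue' from the question:
g = df.groupby('A', dropna=False)['Revenue']
out = pd.concat([g.describe(percentiles=[0.95]), g.sum().rename('sum')], axis=1)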

Python: Rolling Minimum by date interval

Thanks for taking the time to read this question.
I am using time series data which is reported weekly. I am trying to calculate the minimum value over the previous 3 years for each row, which I have done using the code below. Since the data is reported weekly, for each row that is the minimum of the 156 rows before it. The column Spec_Min holds that 3-year minimum for each row.
However, halfway through my data it begins to be reported twice a month, but I still need the minimum over 3 years, which is therefore no longer 156 rows back. I was wondering if there is a simpler way of doing this?
Perhaps doing it via date rather than rows, but I am not sure how to do that.
df1['Spec_Min']=df1['Spec_NET'].rolling(156).min()
df1
Date Spec_NET Hed_NET Spec_Min
1995-10-31 9.0 -13.5 -49.7
1995-11-07 11.9 -23.5 -49.7
1995-11-14 9.8 -19.4 -49.7
1995-11-21 9.7 -25.4 -49.7
1995-11-28 10.4 -20.3 -49.7
1995-12-05 1.6 -15.3 -49.7
1995-12-12 -17.0 14.2 -49.7
1995-12-19 -16.6 15.2 -49.7
1995-12-26 4.7 -15.2 -49.7
1996-01-02 5.3 -22.7 -49.7
1996-01-16 7.3 -21.0 -49.7
1996-01-23 1.3 -20.4 -49.7
Pandas lets you use a datetime-aware rolling window. You'll need to express the window in terms of the number of days (365 * 3 for 3 years).
Using your provided sample DataFrame:
df['Spec_Min'] = df.rolling(f'{365 * 3}D', on='Date')['Spec_NET'].min()
print(df)
Date Spec_NET Hed_NET Spec_Min
0 1995-10-31 9.0 -13.5 9.0
1 1995-11-07 11.9 -23.5 9.0
2 1995-11-14 9.8 -19.4 9.0
3 1995-11-21 9.7 -25.4 9.0
4 1995-11-28 10.4 -20.3 9.0
5 1995-12-05 1.6 -15.3 1.6
6 1995-12-12 -17.0 14.2 -17.0
7 1995-12-19 -16.6 15.2 -17.0
8 1995-12-26 4.7 -15.2 -17.0
9 1996-01-02 5.3 -22.7 -17.0
10 1996-01-16 7.3 -21.0 -17.0
11 1996-01-23 1.3 -20.4 -17.0
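One assumption worth flagging: time-based rolling requires the Date column to be an actual datetime dtype and sorted; if it was read in as strings, convert it first, for example:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df['Spec_Min'] = df.rolling(f'{365 * 3}D', on='Date')['Spec_NET'].min()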
Try something like this:
(if your index is already a DatetimeIndex, skip the first two lines)
df.set_index('Date',inplace = True,drop = True)
df.index = pd.to_datetime(df.index)
# resample your dataframe in weekly frequency, and interpolate missing values
conformed = df.resample('W-MON').mean().interpolate(method = 'nearest')
n_weeks = 156 # the length of the rolling window in weeks (3 years = 156 weeks)
result = conformed.rolling(n_weeks).min()
Note that you mention wanting the minimum of each row, but it seems like you are calculating the rolling minimum of each column...

Using pandas dataframe columns and cells to feed into a request

I have a dataframe like this:
userName _2643698_1 _2643699_1 _2643700_1 _2643701_1 _2643702_1
_test2 5.0 4.8 3.75 3.6 2.2
_test3 4.0 5.0 4.40 5.0 5.0
_test4 5.0 4.4 5.00 5.0 4.0
Three unique users, 5 unique columns that correspond to the users, and a unique score per column/per user.
I need to feed this data into a patch request with this logic:
Per username, update each key (column title) with the score for that user.
Example:
patch = change_data(userName, colId, score)
The goal is to update the data for all three users, each having a score in the same 5 columns (column headers like _2643698_1), with the score the user has in that column.
The real dataset I'm wrestling with has 78 users and 14 unique columns with scores for each user.
I have been playing around with a lot of options suggested on the web to get this logic working as efficiently as possible, and any suggestions would be greatly appreciated.
Thank you.
Use melt()
new_df = pd.melt(df,
                 id_vars='userName',
                 var_name='colId',
                 value_vars=[c for c in df.columns if c != 'userName'])
So new_df looks like this
userName colId value
0 _test2 _2643698_1 5.00
1 _test3 _2643698_1 4.00
2 _test4 _2643698_1 5.00
3 _test2 _2643699_1 4.80
4 _test3 _2643699_1 5.00
5 _test4 _2643699_1 4.40
6 _test2 _2643700_1 3.75
7 _test3 _2643700_1 4.40
8 _test4 _2643700_1 5.00
9 _test2 _2643701_1 3.60
10 _test3 _2643701_1 5.00
11 _test4 _2643701_1 5.00
12 _test2 _2643702_1 2.20
13 _test3 _2643702_1 5.00
14 _test4 _2643702_1 4.00
Then you can iterate over new_df and call change_data on each row
for row in new_df.itertuples(index=False):
    patch = change_data(row.userName, row.colId, row.value)
    # do something with patch
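If change_data ultimately issues an HTTP PATCH, a minimal hypothetical sketch of such a helper (the endpoint URL, payload shape, and function body are assumptions, not part of the original question):
import requests

def change_data(user_name, col_id, score):
    # Hypothetical endpoint and payload; adjust to the real API.
    url = f"https://example.com/api/users/{user_name}/columns/{col_id}"
    resp = requests.patch(url, json={"score": score})
    resp.raise_for_status()
    return resp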

Pandas data manipulation - multiple measurements per line to one per line [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 4 years ago.
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
    pd.concat([row]*3, ignore_index=True)
Thank you!
It's a wide-to-long problem:
pd.wide_to_long(df, 'Meas', i=['Location','Nominal'], j='drop').reset_index().drop(columns='drop')
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1
Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
['Meas1', 'Meas2', 'Meas3'],
value_name = 'Meas')
.drop('variable', axis=1)
.sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
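Both results leave Meas as the last column; since the desired output has the measured value first, a final reorder (a small sketch) would be:
new_df = new_df[['Meas', 'Location', 'Nominal']]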

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understanding this well because I was expecting a string of numbers but I get just one result:
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling'; if you set window to something, the window type will automatically be rolling.
Source.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
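The grouping/fitting step itself appears to have dropped out of the answer; a sketch of what it presumably looked like for column B, using scipy.stats.linregress as described above, which should reproduce the slopes listed below:
from scipy.stats import linregress
slopes = df_dbl.groupby(df_dbl.index).apply(lambda g: linregress(g['date'], g['B'])[0])
print(slopes)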
1 0.4
2 2.1
3 1.2
4 1.7
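As an aside for readers on current pandas: pd.ols has since been removed, so a rough sketch of the rolling-slope idea with numpy.polyfit over a fixed-length window (window length 2 to match the question, and assuming evenly spaced observations) might look like:
import numpy as np
k = 2 # window length
x = np.arange(k)
# For each column, fit y ~ x over each window and keep only the slope.
slopes = df.rolling(k).apply(lambda w: np.polyfit(x, w, 1)[0], raw=True)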
