How to sum all rows from multiple columns - python

I want to perform several operations that are repeated across several columns, but I can't manage to do it with a list comprehension or a loop.
The dataframe I have is concern_polls, and I want to rescale the percentages and compute the total amounts.
text very somewhat \
0 How concerned are you that the coronavirus wil... 19.00 33.00
1 How concerned are you that the coronavirus wil... 26.00 32.00
2 Taking into consideration both your risk of co... 13.00 26.00
3 How concerned are you that the coronavirus wil... 23.00 32.00
4 How concerned are you that you or someone in y... 11.00 24.00
.. ... ... ...
625 How worried are you personally about experienc... 33.09 36.55
626 How do you feel about the possibility that you... 30.00 31.00
627 Are you concerned, or not concerned about your... 34.00 35.00
628 Are you personally afraid of contracting the C... 28.00 32.00
629 Taking into consideration both your risk of co... 22.00 40.00
not_very not_at_all url
0 23.00 11.00 https://morningconsult.com/wp-content/uploads/...
1 25.00 7.00 https://morningconsult.com/wp-content/uploads/...
2 43.00 18.00 https://d25d2506sfb94s.cloudfront.net/cumulus_...
3 24.00 9.00 https://morningconsult.com/wp-content/uploads/...
4 33.00 20.00 https://projects.fivethirtyeight.com/polls/202...
.. ... ... ...
625 14.92 12.78 https://docs.google.com/spreadsheets/d/1cIEEkz...
626 14.00 16.00 https://www.washingtonpost.com/context/jan-10-...
627 19.00 12.00 https://drive.google.com/file/d/1H3uFRD7X0Qttk...
628 16.00 15.00 https://leger360.com/wp-content/uploads/2021/0...
629 21.00 16.00 https://docs.cdn.yougov.com/4k61xul7y7/econTab...
[630 rows x 15 columns]
The variables very, somewhat, not_very and not_at_all are percentages of the column sample_size, which is not shown in the sample above. The percentages don't always add up to 100%, so I want to rescale them.
To do this, I take the following steps: I calculate the row-wise sum of the four columns into a variable sums; I compute the rescaled percentage for each column (this step could stay a plain variable rather than becoming a new column in the df); and I calculate the final amounts.
The code I have so far is this:
sums = concern_polls['very'] + concern_polls['somewhat'] + concern_polls['not_very'] + concern_polls['not_at_all']
concern_polls['Very'] = concern_polls['very'] / sums * 100
concern_polls['Somewhat'] = concern_polls['somewhat'] / sums * 100
concern_polls['Not_very'] = concern_polls['not_very'] / sums * 100
concern_polls['Not_at_all'] = concern_polls['not_at_all'] / sums * 100
concern_polls['Total_Very'] = concern_polls['Very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Somewhat'] = concern_polls['Somewhat'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_very'] = concern_polls['Not_very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_at_all'] = concern_polls['Not_at_all'] / 100 * concern_polls['sample_size']
I have tried to write this as a list comprehension, but I can't get it to work.
Could someone make a suggestion?
The problem I run into is that I want to sum the rows of several columns and then do the same repetitive operations on those columns, but they are not all of the columns in the df.
Thank you.

df[newcolumn] = df.apply(lambda row : function(row), axis=1)
is your friend here, I think.
axis=1 means the function is applied row by row.
As an example :
concern_polls['Very'] = concern_polls.apply(lambda row: row['very'] / sums * 100, axis=1)
And if you want sums to be the total of each of those df columns it'll be
sums = concern_polls[['very', 'somewhat', 'not_very', 'not_at_all']].sum().sum()
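That said, if the goal is simply to avoid writing the same line once per column, a plain loop over the column names (using the row-wise sums from the question) also works. A minimal sketch, reusing the question's own column names:
cols = ['very', 'somewhat', 'not_very', 'not_at_all']
sums = concern_polls[cols].sum(axis=1)  # row-wise total of the four columns

for c in cols:
    rescaled = concern_polls[c] / sums * 100             # e.g. 'very' -> 'Very'
    concern_polls[c.capitalize()] = rescaled
    concern_polls['Total_' + c.capitalize()] = rescaled / 100 * concern_polls['sample_size']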

Related

multiplying column of the file by exponential function

I'm struggling with multiplying one column of a file by an exponential function,
so my equation is
y = 10.43^(-x/3.0678) + 0.654
The values in the column are my x in the equation. So far I've only been able to multiply by scalars; I couldn't make it work with the exponential function.
the file looks like this
8.09
5.7
5.1713
4.74
4.41
4.14
3.29
3.16
2.85
2.52
2.25
2.027
1.7
1.509
0.76
0.3
0.1
So after the calculations, my Y should get these values
8.7 0.655294908
8.09 0.656064021
5.7 0.6668238549
5.1713 0.6732091509
4.74 0.6807096436
4.41 0.6883719253
4.14 0.6962497391
3.29 0.734902438
3.16 0.7433536016
2.85 0.7672424605
2.52 0.7997286905
2.25 0.8331287249
2.027 0.8664148415
1.7 0.926724933
1.509 0.9695896976
0.76 1.213417197
0.3 1.449100509
0.1 1.580418766
So far this code is working for me, but it's far from what I want:
import pandas as pd

col_list = ["Position"]
df = pd.read_csv("force.dat", usecols=col_list)
print(df)
A = df["Position"]
X = -A / 3.0678 + 0.654  # only the linear part; the exponential is where I'm stuck
print(X)
If I understand it correctly you just want to apply a function to a column in a pandas dataframe, right? If so, you can define the function:
def foo(x):
    y = 10.43 ** (-x / 3.0678) + 0.654
    return y
and apply it to that column to create a new column in df. If A is the column with the x values, then y will be
df['y'] = A.apply(foo)
Now print(df) should give you the example result in your question.
You can do it in one line:
>>> df['y'] = 10.43 ** (- df['x']/3.0678)+0.654
>>> print(df)
x y
0 8.0900 0.656064
1 5.7000 0.666824
2 5.1713 0.673209
3 4.7400 0.680710
4 4.4100 0.688372
5 4.1400 0.696250
6 3.2900 0.734902
7 3.1600 0.743354
8 2.8500 0.767242
9 2.5200 0.799729
10 2.2500 0.833129
11 2.0270 0.866415
12 1.7000 0.926725
13 1.5090 0.969590
14 0.7600 1.213417
15 0.3000 1.449101
16 0.1000 1.580419

Mean of values in some columns with Pandas/Numpy

I've just started with Pandas and Numpy a couple of months ago and I've learned already quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I have created an excel sheet that calculates some figures to be used for re-ordering inventory. To practice and maybe actually use it, I'd wanted to give it a try to replicate the functionality in Python. Later I might want to add some more sophisticated calculations with the help of Scikit-learn.
So far I've managed to load a csv with sales figures from our ERP into a dataframe, calculate mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to the specific columns. The csv does also contain for example product codes and leadtimes and these should not be used for the average and std calculations. Not sure yet also how to merge this subset back with the original dataframe.
The reason why I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far is working fine (though it can probably be done much more easily/efficiently):
import numpy as np
import timeit
import csv
import pandas as pd
sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Get the number of columns and subtract 2 for code and leadtime
cols = df.shape[1] - 2
# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]
# Add columns for std dev and average
df_subset = df_subset.assign(mean=df_subset.mean(axis=1),
                             stddev=df_subset.std(axis=1, ddof=0))
# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                             maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However for my next calculation I'm stuck again:
I want to calculate the average over values from the "month" columns and only the values that match the condition >= minSD and <= maxSD
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but this doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks
EDIT: With help I ended up using this to get further with my program
import numpy as np
import timeit
import csv
import pandas as pd
# sd will determine if range will be SD1 or SD2
sd = 1
# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Function to calculate the mean of the values within the range between minSD and maxSD
def CalcMeanSD(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Define the month/data columns and set them to floatvalues
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
# Add columns for stddev and mean. Based on these values set new range between minSD and maxSD
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# Add column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
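An alternative that avoids apply entirely is to mask the month values outside [minSD, maxSD] with where and take the row mean. A sketch, assuming the months_cols, minSD and maxSD columns created above:
# keep only month values inside [minSD, maxSD]; everything else becomes NaN,
# which mean(axis=1) skips by default
vals = df[months_cols]
in_range = vals.ge(df['minSD'], axis=0) & vals.le(df['maxSD'], axis=0)
df['avgwithSD'] = vals.where(in_range).mean(axis=1)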

Find shared sub-ranges defined by start and endpoints in pandas dataframe

I need to combine two dataframes that contain information about train track sections: while the "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the Line defined by start point and end point on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I can manage basic calculations between rows and columns, I'm at my wits' end with this problem; the approach of merging and sorting the two dataframes and calculating the respective differences between start-/endpoints didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
def find_interval(df, interval):
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
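For reference, a minimal reconstruction of the question's sample data (only the rows shown) that makes the steps above runnable end to end; the values are taken from the question:
import pandas as pd

df1 = pd.DataFrame({'Line': [100] * 5,
                    'startpoint': [2.506, 2.809, 2.924, 4.065, 4.210],
                    'endpoint': [2.809, 2.924, 4.065, 4.210, 4.224],
                    'Attribute_A': ['B-70', 'B-91', 'B-84', 'B-70', 'B-91']})
df2 = pd.DataFrame({'Line': [100] * 5,
                    'startpoint': [2.5, 2.6, 2.7, 2.8, 2.9],
                    'endpoint': [2.6, 2.7, 2.8, 2.9, 3.0],
                    'Attribute_B': [140, 158, 131, 124, 178]})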

Selecting min values in SQL table when column < 0, and max values when column > 0

I have a SQL table like this:
Ticker Return Shares
AGJ 2.20 1265
ATA 1.78 698
ARS 9.78 10939
ARE -7.51 -26389
AIM 0.91 1758
ABT 10.02 -5893
AC -5.73 -2548
ATD 6.51 7850
AP 1.98 256
ALA -9.58 8524
So essentially, a table of stocks I've longed/shorted.
I want to find the top 4 best performers in this table, so the shorts (shares < 0) who have the lowest return, and the longs (shares > 0) who have the highest return.
Essentially, returning this:
Ticker Return Shares
ARS 9.78 10939
ARE -7.51 -26389
AC -5.73 -2548
ATD 6.51 7850
How would I be able to write the query that lets me do this?
Or, if it's easier, if there are any pandas functions that would do the same thing if I turned this table into a pandas dataframe.
Something like this:
select top (4) t.*
from t
order by (case when shares < 0 then - [return] else [return] end) desc;
Pandas solution:
In [134]: df.loc[(np.sign(df.Shares)*df.Return).nlargest(4).index]
Out[134]:
Ticker Return Shares
2 ARS 9.78 10939
3 ARE -7.51 -26389
7 ATD 6.51 7850
6 AC -5.73 -2548
Explanation:
In [137]: (np.sign(df.Shares)*df.Return)
Out[137]:
0 2.20
1 1.78
2 9.78
3 7.51
4 0.91
5 -10.02
6 5.73
7 6.51
8 1.98
9 -9.58
dtype: float64
In [138]: (np.sign(df.Shares)*df.Return).nlargest(4)
Out[138]:
2 9.78
3 7.51
7 6.51
6 5.73
dtype: float64
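Equivalently, nlargest can be called on the dataframe itself after materializing the signed return as a helper column. A sketch, with perf as a hypothetical column name:
import numpy as np

# signed return: flips the sign for shorts so the best performers sort highest
top4 = (df.assign(perf=np.sign(df.Shares) * df.Return)
          .nlargest(4, 'perf')
          .drop(columns='perf'))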

Displaying unique name with total of column value in a group with additional variables in python

I'm learning Python and thought working on a project might be the best way to learn it. I have about 200,000 rows of data in which the data shows list of medication for the patient. Here's a sample of the data.
PTID PTNAME MME DRNAME DRUGNAME SPLY STR QTY FACTOR
1 PATIENT, A 2700 DR, A OXYCODONE HCL 15 MG 30 15 120 1.5
1 PATIENT, A 2700 DR, B OXYCODONE HCL 15 MG 30 15 120 1.5
2 PATIENT, B 4050 DR, C MORPHINE SULFATE ER 15 MG 30 15 270 1
2 PATIENT, B 4050 DR, C MORPHINE SULFATE ER 15 MG 30 15 270 1
2 PATIENT, B 840 DR, A OXYCODONE-ACETAMINOPHE 10MG-32 14 10 56 1.5
2 PATIENT, B 1350 DR, C OXYCODONE-ACETAMINOPHE 5 MG-32 15 5 180 1.5
3 PATIENT, C 1350 DR, C OXYCODONE-ACETAMINOPHE 5 MG-32 15 5 180 1.5
3 PATIENT, C 1800 DR, D OXYCODONE-ACETAMINOPHE 10MG-32 30 10 120 1.5
I've been thinking about this a lot and have tried many approaches, but none of my code produces any results or makes sense. Honestly, I don't even know where to begin. A little help would be highly appreciated.
What I want to do is consolidate the data for each patient and calculate the total MME per patient. The DRUGNAME should show the one with the higher MME. In other words, the dataframe should have only one row per patient.
One thing I did try is
groupby_ptname = semp.groupby('PTNAME').apply(lambda x: x.MME.sum())
which shows unique patient names with the total MME, but I'm not sure how to add other variables to this new dataframe.
IIUC you can do it this way:
In [62]: df.sort_values('MME').groupby('PTNAME').agg({'MME':'sum', 'DRUGNAME':'last'})
Out[62]:
DRUGNAME MME
PTNAME
PATIENT, A OXYCODONE HCL 15 MG 5400
PATIENT, B MORPHINE SULFATE ER 15 MG 10290
PATIENT, C OXYCODONE-ACETAMINOPHE 10MG-32 3150
or with .reset_index():
In [64]: df.sort_values('MME').groupby('PTNAME').agg({'MME':'sum', 'DRUGNAME':'last'}).reset_index()
Out[64]:
PTNAME DRUGNAME MME
0 PATIENT, A OXYCODONE HCL 15 MG 5400
1 PATIENT, B MORPHINE SULFATE ER 15 MG 10290
2 PATIENT, C OXYCODONE-ACETAMINOPHE 10MG-32 3150
UPDATE: more fun with agg() function
In [84]: agg_funcs = {
...: 'MME':{'MME_max':'last',
...: 'MME_total':'sum'},
...: 'DRUGNAME':{'DRUGNAME_max_MME':'last'}
...: }
...:
...: rslt = (df.sort_values('MME')
...: .groupby('PTNAME')
...: .agg(agg_funcs)
...: .reset_index()
...: )
...: rslt.columns = [tup[1] if tup[1] else tup[0] for tup in rslt.columns]
...:
In [85]: rslt
Out[85]:
PTNAME MME_total MME_max DRUGNAME_max_MME
0 PATIENT, A 5400 2700 OXYCODONE HCL 15 MG
1 PATIENT, B 10290 4050 MORPHINE SULFATE ER 15 MG
2 PATIENT, C 3150 1800 OXYCODONE-ACETAMINOPHE 10MG-32
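Note that the nested-dict form of agg() used above was deprecated in later pandas releases; on pandas 0.25+ the same result can be written with named aggregation. A sketch, using the question's semp dataframe:
rslt = (semp.sort_values('MME')
            .groupby('PTNAME', as_index=False)
            .agg(MME_total=('MME', 'sum'),
                 MME_max=('MME', 'last'),
                 DRUGNAME_max_MME=('DRUGNAME', 'last')))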
Have another look at the documentation for the pandas groupby methods.
Here's something that could work for you:
#first get the total MME for each patient and drug combination
total_mme=semp.groupby(['PTNAME','DRUGNAME'])['MME'].sum()
#this will be a series object with index corresponding to PTNAME and DRUGNAME and values corresponding to the total MME
#now get the indices corresponding to the drug with the max MME total
max_drug_indices=total_mme.groupby(level='PTNAME').idxmax()
#index the total MME with these indices
out=total_mme[max_drug_indices]
