As the title suggests, I am trying to create confidence intervals from a table with a ton of NaN values. Here is an example of what I am working with:
Attendence%  2016-10  2016-11  2017-01  2017-02  2017-03  2017-04  ...
Name
Karl             nan      0.2      0.4      0.5      0.2      1.0
Alice            1.0      0.7      0.6      nan      nan      nan
Ryan             nan      nan      1.0      0.1      0.9      0.2
Don              nan      0.5      nan      0.2      nan      nan
Becca            nan      0.2      0.6      0.0      nan      nan
For reference, in my actual dataframe there are far more NaNs than values; they represent months where the person did not need to show up, so replacing them with 0 would skew the result.
Every time I try applying a confidence interval to each name, it returns NaN for the mean as well as for both interval bounds:
Karl (nan, nan, nan)
Alice (nan, nan, nan)
Ryan (nan, nan, nan)
Don (nan, nan, nan)
Becca (nan, nan, nan)
Is there a way to filter out the NaNs so the formula is applied without taking the NaN values into account? This is what I have been doing so far, unstacked being the table shown above:
def mean_confidence_interval(unstacked, confidence=0.9):
    a = 1.0 * np.array(unstacked)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h

answer = unstacked.apply(mean_confidence_interval)
answer
Use np.nanmean instead of np.mean: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html
And replace scipy.stats.sem(a) with scipy.stats.sem(a, nan_policy='omit'):
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html
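Putting both changes together, here is a minimal sketch of the adjusted function. Note that n should also count only the non-NaN observations, otherwise the t quantile uses too many degrees of freedom; and assuming the names are the row index, the function should be applied with axis=1 so each person's row is passed in.

import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.9):
    a = 1.0 * np.array(data)
    n = np.count_nonzero(~np.isnan(a))          # only the non-NaN observations
    m = np.nanmean(a)                           # mean ignoring NaNs
    se = scipy.stats.sem(a, nan_policy='omit')  # standard error ignoring NaNs
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h

answer = unstacked.apply(mean_confidence_interval, axis=1)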
I am trying to build a function to use in a df.apply() that references 1) other rows, and 2) another DatetimeIndex.
dt_index = DatetimeIndex(['2022-09-16', '2022-12-16', '2023-03-10', '2023-06-16',
'2023-09-15', '2023-12-15', '2024-03-15', '2024-06-14'],
dtype='datetime64[ns]', freq=None)
Regarding the main df:
df.index = DatetimeIndex(['2022-08-30', '2022-08-31', '2022-09-01', '2022-09-02',
'2022-09-03', '2022-09-04', '2022-09-05', '2022-09-06',
'2022-09-07', '2022-09-08',
...
'2024-08-20', '2024-08-21', '2024-08-22', '2024-08-23',
'2024-08-24', '2024-08-25', '2024-08-26', '2024-08-27',
'2024-08-28', '2024-08-29'],
dtype='datetime64[ns]', name='index', length=731, freq=None)
df = 3M 1Y 2Y
2022-08-30 1.00 1.00 1.00 1.000000
2022-08-31 2.50 2.50 2.50 2.500000
2022-09-01 3.50 3.50 3.50 3.500000
2022-09-02 5.50 5.50 5.50 5.833333
2022-09-03 5.65 5.65 5.65 5.983333
... ... ... ... ...
2024-08-25 630.75 615.75 599.75 607.750000
2024-08-26 631.75 616.75 600.75 608.750000
2024-08-27 632.75 617.75 601.75 609.750000
2024-08-28 633.75 618.75 602.75 610.750000
2024-08-29 634.75 619.75 603.75 611.750000
My goal is to use a function that:
1) for each index value x in df, finds the closest two values in dt_index (I have this below), and
2) then, in df, returns (x - id_low) / (id_high - id_low).
def transform(x, dt_index):
    id_low = dt_index.iloc[dt_index.get_loc(x, method='ffill')]
    id_high = dt_index.iloc[dt_index.get_loc(x, method='bfill')]
It's part 2 that I don't know how to write, as it references rows in df other than the one the function is being applied to.
Any help appreciated!
After fixing the inaccuracies in your code (an Index has no .iloc, and get_loc no longer accepts a method argument, so get_indexer is used instead), you can simply reference your dataframe df inside the function:
def transform(x, dt_index):
    id_low = dt_index[dt_index.get_indexer([x.name], method='ffill')][0]
    id_high = dt_index[dt_index.get_indexer([x.name], method='bfill')][0]
    return (x - df.loc[id_low]) / (df.loc[id_high] - df.loc[id_low])

df.transform(transform, dt_index=dt_index, axis=1)
Example:
df = pd.DataFrame(np.arange(24).reshape(6, 4))
dt_index = pd.Index([0,2,5])
# Result:
0 1 2 3
0 NaN NaN NaN NaN
1 0.500000 0.500000 0.500000 0.500000
2 NaN NaN NaN NaN
3 0.333333 0.333333 0.333333 0.333333
4 0.666667 0.666667 0.666667 0.666667
5 NaN NaN NaN NaN
Note: the NaN values are due to the mathematically undefined result 0/0, which occurs when id_low == id_high == x.name.
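If you would rather have those rows read as 0 (the row's date is itself a node of dt_index, so the interpolation fraction can be taken as 0), one possible follow-up, assuming the setup above, is:

result = df.transform(transform, dt_index=dt_index, axis=1)
# rows whose index value is itself in dt_index produced 0/0; optionally treat them as 0
result.loc[result.index.isin(dt_index)] = 0.0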
I have a dataframe with this info:
I need to find a formula that calculates, for each of the 4 months of 2023, the real variation of column A against the same month of 2022. For example, in the case of 2023-04, the calculation is:
x = 140 (value of 2022-04) * 1.66 (accumulated inflation from 2022-04 to 2023-04)
x = 232.27
Real variation 2023-04 = (150 (value of 2023-04) - x) / x
Real variation 2023-04 = -0.35
The value 1.66, the accumulated inflation from 2022-04 to 2023-04, comes from this calculation: starting from 1 in 2022-04, for every month until 2023-04 apply the formula previous row value * (1 + inflation column value). In the case of 2023-04, 1.66 is the last value of that sequence (the accumulated inflation over the 12 months): 1, 1.06, 1.09, 1.15, 1.19, 1.28, 1.35, 1.39, 1.46, 1.58, 1.64, 1.66.
Thanks
Your data is really bad: you have missing values, and ColumnB is in [%], I think.
Here is my suggestion.
Dataframe:
Time columnA columnB
0 2022-01-31 100 0.3
1 2022-02-28 120 0.5
2 2022-03-31 150 0.4
3 2022-04-30 140 0.7
Code for your calculations:
df['vals'] = np.nan
df.loc[3, 'vals'] = 1              # 2022-04 is the base month (index 3 in the sample data)
k = 1
arr = []
for i in df['columnB'].loc[4:].values:
    k = k * (1 + i / 10)           # columnB divided by 10 to match the numbers in the question
    arr.append(k)
df.loc[4:, 'vals'] = arr

df['Month'] = df['Time'].dt.month
df['Year'] = df['Time'].dt.year

year = 2023
for month in range(1, 13):
    now = df.loc[(df['Month'] == month) & (df['Year'] == year)]
    prev = df.loc[(df['Month'] == month) & (df['Year'] == year - 1)]
    if now.empty or prev.empty:
        continue                   # month not present in both years
    x = now['vals'].values[0] * prev['columnA'].values[0]
    print(f'{year}-{month}', (now['columnA'].values[0] - x) / x)
Output would be:
2023-4 -0.354193
The code could perhaps be optimized, but I am not sure if your input is correct.
Cheers
Here is a completely vectorized solution using pure pandas (in other words: it is fast).
It is relatively straightforward once the DataFrame has the proper index. Also, your "INFLATION" value is in undefined units; to match your example, I have to divide it by 10 (so it is neither a fraction nor a percentage).
Step 1: sample data as a reproducible example
df = pd.DataFrame({
'TIME': ['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06', '2022-07', '2022-08',
'2022-09', '2022-10', '2022-11', '2023-01', '2023-02', '2023-03', '2023-04'],
'A': [100, 120, 150, 140, 180, 200, 100, 120, 150, 140, 180, 200, 100, 120, 150],
'INFLATION': [0.3, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.8, 0.4, 0.1],
})
Step 2: Your calculation
Convert the time column into a PeriodIndex:
df = df.assign(TIME=df['TIME'].apply(pd.Period)).set_index('TIME')
Solution
ci = (1 + df['INFLATION']/10).cumprod() # cumulative inflation
ci12 = ci / ci.shift(12, 'M') # 12-month variation
x = df['A'].shift(12, 'M') * ci12
real_var = (df['A'] - x) / x # your "real variation"
# putting it all together in a new df
res = df.assign(ci12=ci12, x=x, real_var=real_var)
>>> res
A INFLATION ci12 x real_var
TIME
2022-01 100 0.3 NaN NaN NaN
2022-02 120 0.5 NaN NaN NaN
2022-03 150 0.4 NaN NaN NaN
2022-04 140 0.7 NaN NaN NaN
2022-05 180 0.6 NaN NaN NaN
2022-06 200 0.3 NaN NaN NaN
2022-07 100 0.5 NaN NaN NaN
2022-08 120 0.4 NaN NaN NaN
2022-09 150 0.7 NaN NaN NaN
2022-10 140 0.6 NaN NaN NaN
2022-11 180 0.3 NaN NaN NaN
2023-01 200 0.5 1.708788 170.878849 0.170420
2023-02 100 0.8 1.757611 210.913323 -0.525872
2023-03 120 0.4 1.757611 263.641653 -0.544837
2023-04 150 0.1 1.659053 232.267475 -0.354193
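To pull out just the four 2023 months asked about, a short follow-up (assuming the res frame above; partial string indexing works on the monthly PeriodIndex):

print(res.loc['2023', 'real_var'])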
I have the following data frame:
Cluster OPS(4) mean(ln) std(ln)
0 5-894 5-894a 2.203 0.775
1 5-894 5-894b 2.203 0.775
2 5-894 5-894c 2.203 0.775
3 5-894 5-894d 2.203 0.775
4 5-894 5-894e 2.203 0.775
For each surgery type (in column OPS(4)) I would like to generate 10,000 scenarios, which should be stored in another data frame.
I know that I can create scenarios with:
num_reps = 10_000  # 10.000 would be the float 10.0 in Python
scenarios = np.ceil(np.random.lognormal(mean, std, num_reps))
And the new data frame should look like this, with 10,000 scenarios in each column:
scen_per_surg = pd.DataFrame(index=range(num_reps), columns=merged_information['OPS(4)'])
OPS(4) 5-894a 5-894b 5-894c 5-894d 5-894e
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
...
Unfortunately, I don't know how to iterate over the rows of the first data frame to create the scenarios.
Can somebody help me?
Best regards
Create some data to experiment with:
import pandas as pd
df = pd.DataFrame(data=[
[ '5-894' , '5-894a' , 2.0 , 0.70],
[ '5-894' , '5-894b' , 2.1 , 0.71],
[ '5-894' , '5-894c' , 2.2 , 0.72],
[ '5-894' , '5-894d' , 2.3 , 0.73],
[ '5-894' , '5-894e' , 2.4 , 0.74] ], columns =['Cluster', 'OPS(4)', 'mean(ln)', 'std(ln)'])
print(df)
Create an empty dataframe:
new_df = pd.DataFrame()
Define a function that will be applied to each row of the original df; it generates the required random values and assigns them to a column of the new df:
import numpy as np
def geb_scenarios(row):
    # print(row)
    col, mean, std = row[1:]  # OPS(4), mean(ln), std(ln)
    new_df[col] = np.ceil(np.random.lognormal(mean, std, 10))  # 10 draws here to keep the demo small
Apply the function:
df.apply(geb_scenarios, axis=1)
print(new_df)
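As a side note, the same result can be built without apply; a minimal sketch, assuming the df above and the 10,000 scenarios from the question:

import numpy as np
import pandas as pd

num_reps = 10_000
rng = np.random.default_rng()

# one column of num_reps rounded-up lognormal draws per surgery type
scen_per_surg = pd.DataFrame({
    row['OPS(4)']: np.ceil(rng.lognormal(row['mean(ln)'], row['std(ln)'], num_reps))
    for _, row in df.iterrows()
})
print(scen_per_surg.shape)  # (10000, 5) for the five surgery types above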
I want to extract a 12° x 12° region from lat/long/conductivity grids and calculate the mean conductivity values in this region. I can successfully apply masks to the lat/long grids, but somehow the same process is not working for the conductivity grid.
I've tried masking with for loops and am now using the numpy.ma.masked_where function. I can successfully plot the masked results (i.e. I can see that the region is extracted when I plot global maps), but the calculated mean conductivity values correspond to the non-masked data.
Here is a simple example of what I want to do:
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
x = ma.masked_outside(x, xm-3, xm+3)
y = ma.masked_outside(x, ym-3, ym+3)
x = np.ma.filled(x.astype(float), np.nan)
y = np.ma.filled(y.astype(float), np.nan)
x, y = np.meshgrid(x, y)
z = 2*x + 3*y
z = np.ma.masked_where(np.ma.getmask(x), z)
plt.pcolor(x, y, z)
plt.colorbar()
print('Maximum z:', np.nanmax(z))
print('Minimum z:', np.nanmin(z))
print('Mean z:', np.nanmean(z))
My code is:
def Observatory_Cond_Plot(filename, ndcfile, obslon, obslat, obsname, date):
    files = np.array(sorted(glob.glob(filename)))  # sort txt files containing the 2-D conductivity arrays
    filenames = ['January', 'February', 'March', 'April', 'May', 'June',
                 'July', 'August', 'September', 'October', 'November', 'December']  # used for naming output plots and files
    for i, fx in zip(filenames, files):
        ndcdata = Dataset(ndcfile)                  # load netcdf file
        lat = ndcdata.variables['latitude'][:]      # import latitude data
        long = ndcdata.variables['longitude'][:]    # import longitude data
        cond = np.genfromtxt(fx)
        cond, long = shiftgrid(180., cond, long, start=False)

        # Mask lat and long arrays and fill masks with nan values
        lat = ma.masked_outside(lat, obslat-12, obslat+12)
        long = ma.masked_outside(long, obslon-12, obslon+12)
        lat = np.ma.filled(lat.astype(float), np.nan)
        long = np.ma.filled(long.astype(float), np.nan)
        longrid, latgrid = np.meshgrid(long, lat)
        cond = np.ma.masked_where(np.ma.getmask(longrid), cond)
        cond = np.ma.filled(cond.astype(float), np.nan)

        condmean = np.nanmean(cond)
        print('Mean Conductivity is:', condmean)
        print('Minimum conductivity is:', np.nanmin(cond))
        print('Maximum conductivity is:', np.nanmax(cond))
After that, the rest of the code just plots the data.
My results are:
Mean Conductivity is: 3.5241649673154587
Minimum conductivity is: 0.497494528344129
Maximum conductivity is: 5.997825822915771
However, from my maps it is clear that the conductivity in this region should not be lower than 3.2 S/m. Also, printing the lat, long and cond grids:
long:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
lat:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
cond:
[[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[2.86749432 2.86743283 2.86746221 ... 2.87797247 2.87265508 2.87239185]
...
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]]
And it seems like the mask is not working properly.
The problem is that the call to np.ma.filled de-masks the long variable (it returns a plain ndarray with the masked entries filled in). Also, np.meshgrid does not preserve the masks.
You could save the masks directly after creation and also build the meshgrid from the masks. I adapted your example accordingly. What can be seen is that all versions of the numpy mean take the mask into account. I had to adapt the upper limit (changed to +2), because otherwise the means would have been equal.
import numpy as np

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
# Note: changed limits
x = np.ma.masked_outside(x, xm-3, xm+2)
y = np.ma.masked_outside(x, ym-3, ym+2)
xmask = np.ma.getmask(x)
ymask = np.ma.getmask(y)
x, y = np.meshgrid(x, y)
xmask, ymask = np.meshgrid(xmask, ymask)
z = 2*x + 3*y
z1 = np.ma.masked_where(np.ma.getmask(x), z)
z2 = np.ma.masked_where(xmask | ymask, z)
print(z1)
print(z2)
print('Type z1, z2:', type(z1), type(z2))
print('Maximum z1, z2:', np.nanmax(z1), np.nanmax(z2))
print('Minimum z1, z2:', np.nanmin(z1), np.nanmin(z2))
print('Mean z1, z2:', np.mean(z1), np.mean(z2) )
print('nan Mean z1, z2:', np.nanmean(z1), np.nanmean(z2) )
print('masked Mean z1, z2:', z1.mean(), z2.mean())
Beware that any kind of simple mean calculation (summing and dividing by the total), such as np.mean, will not give you the correct answer if you are averaging over a lat-lon grid, since the cell area changes as you move towards the poles. You need to take a weighted average, weighting by cos(lat).
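For reference, a minimal sketch of such a cos(lat)-weighted, NaN-aware mean in numpy (assuming a 1-D lat array in degrees and a 2-D cond grid of shape (nlat, nlon), as in your code):

import numpy as np

# area weights on a regular lat-lon grid are proportional to cos(latitude)
w = np.cos(np.deg2rad(lat))                    # shape (nlat,)
w2d = np.broadcast_to(w[:, None], cond.shape)  # broadcast along longitude

valid = ~np.isnan(cond)                        # ignore the NaN-filled (masked) cells
weighted_mean = np.nansum(cond * w2d) / w2d[valid].sum()
print('Area-weighted mean conductivity:', weighted_mean)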
As you say you have the data in netCDF format, I hope you will permit me to suggest an alternative solution from the command line using the climate data operators (cdo) utility (on Ubuntu you can install it with sudo apt install cdo).
To extract the region of interest:
cdo sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc
Then you can work out the correct weighted mean with:
cdo fldmean infile.nc outfile.nc
You can pipe the two together like this:
cdo fldmean -sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc
I am trying to skip some rows that have incorrect values in them.
Here is the data when I read it in from a file without using the skiprows argument:
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
I want to skip rows 2194593, 4100689, and 5732515. I would expect not to see those rows in the table that I have read in.
>> df = pd.read_csv(file,sep='|',low_memory=False,
usecols= cols_to_use,
skiprows=[2194593,4100689,5732515])
Yet when I print it again, those rows are still there.
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
Here is the data:
{'PersonIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'PersonTypeCde': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'UnitIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 2.0},
'\ufeffMstrRecNbrTxt': {2194593: 'P',
2194594: '300146901',
4100689: 'DAT',
4100690: '300170330',
5732515: 'DA',
5732516: '300174170'}}
What am I doing wrong?
My end goal is to get rid of the NaN values in my dataframe so that the data can be read in as integers and not as floats (because it makes it difficult to join this table to other non-float tables).
Working example... hope this helps!
from io import StringIO
import pandas as pd
import numpy as np
txt = """index,col1,col2
0,a,b
1,c,d
2,e,f
3,g,h
4,i,j
5,k,l
6,m,n
7,o,p
8,q,r
9,s,t
10,u,v
11,w,x
12,y,z"""
indices_to_skip = np.array([2, 6, 11])
# I offset `indices_to_skip` by one in order to account for header
df = pd.read_csv(StringIO(txt), index_col=0, skiprows=indices_to_skip + 1)
print(df)
col1 col2
index
0 a b
1 c d
3 g h
4 i j
5 k l
7 o p
8 q r
9 s t
10 u v
12 y z
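Applied to the file in the question, that would mean passing skiprows=[2194594, 4100690, 5732516] (the displayed labels plus one for the header line), assuming the default RangeIndex. Alternatively, since the end goal is just to drop the NaN rows and read the IDs as integers, a sketch that filters after reading (using the column names from the question) is:

# drop the rows that have NaN in the numeric columns, then cast them to integers
num_cols = ['UnitIDNmb', 'PersonIDNmb', 'PersonTypeCde']
clean = df.dropna(subset=num_cols).astype({c: 'int64' for c in num_cols})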