I have a dataframe where one column looks like this:
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data:
df = pd.DataFrame([0.647, 0.88, 0, 0.73, '1.7 - 2.1', '1.2 - 1.5', 2.5,
                   np.nan, '1.5 - 1.9', '1.3 - 1.5', 0.4, '1.7 - 2.9'],
                  columns=['Average Weight (Kg)'])
I would like to take the average of the range entries and replace them in the dataframe, e.g. 1.7 - 2.1 will be replaced by 1.9. The following code doesn't work and raises TypeError: 'float' object is not iterable:
np.where(df['Average Weight (Kg)'].str.contains('-'),
         df['Average Weight (Kg)'].str.split('-').apply(lambda x: statistics.mean(list(map(float, x)))),
         df['Average Weight (Kg)'])
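For reference, .str methods return NaN for the rows that are already floats, so the apply sees a bare float and map(float, x) raises that TypeError. A minimal sketch of one way to patch that exact approach, assuming you first cast everything to string (reusing the df built above):
import statistics

import numpy as np

s = df['Average Weight (Kg)'].astype(str)   # every entry becomes a string, so .str methods never see a float
means = s.str.split('-').apply(lambda parts: statistics.mean(map(float, parts)))
df['Average Weight (Kg)'] = np.where(s.str.contains('-'),
                                     means,
                                     df['Average Weight (Kg)']).astype(float)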
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = (df['Average Weight (Kg)'].astype(str)
                             .str.split(r'\s-\s')
                             .explode()
                             .astype(float)
                             .groupby(level=0)
                             .mean())
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Edit: slight change to avoid creating a new column.
You could go for something like this (I renamed your column to avg because it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg, dtype: float64
I have a dataframe like below.
time speed
0 1 0.20
1 2 0.40
2 3 2.00
3 4 3.00
4 5 0.40
5 6 0.43
6 7 6.00
I would like to find the first occurrence of a number (in the 'speed' column) that is closest to an input value I enter.
For example :
input value = 0.43
Expected Output :
Speed : 0.40 & corresponding Time : 2
The speed column should not be sorted for this problem.
I tried the below, but I'm not getting the expected output.
Any help on this would be appreciated.
absolute closest
You can compute the absolute difference to your reference and get the idxmin:
speed_input = 0.43
df.loc[abs(df['speed']-speed_input).idxmin()]
output:
time 6.00
speed 0.43
Name: 5, dtype: float64
first closest with threshold:
i = 0.43
thresh = 0.03
df.loc[abs(df['speed']-i).le(thresh).idxmax()]
output:
time 2.0
speed 0.4
Name: 1, dtype: float64
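One caveat (my note, not from the answer above): idxmax on an all-False mask returns the first label, so if nothing is within the threshold you silently get row 0. A small guard, assuming the i and thresh defined above:
mask = (df['speed'] - i).abs().le(thresh)
result = df.loc[mask.idxmax()] if mask.any() else None   # None when no speed is within thresh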
One idea is to round both values:
df.loc[(df['speed'].round(1) - round(speed_input, 1)).abs().idxmin()]
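# -> row 1: time 2.0, speed 0.4, the first occurrence once both values are rounded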
I have a dataset like below
data = {'ReportingDate':['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
'2013/6/28','2013/6/28',
'2013/6/28','2013/6/28','2013/6/28'],
'MarketCap':[' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
'AUM':[3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
'weight':[' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)
df
This is just a sample of an 8,000-row dataset.
ReportingDate starts from 2013/5/31 to 2015/10/30.
It includes data for every month in that period, but only the last day of each month.
The first row of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight*AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:
a= (1-df["2013-5"].iloc[1:]['weight'].sum())
b= a* AUM
df.iloc[1,0]=b
df.iloc[1,2]=a
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
# If whitespaces are indeed whitespaces, not nan
df = df.replace("\s+", np.nan, regex=True)
# If not already datatime series
df.index = pd.to_datetime(df.index)
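# where weight is NaN it becomes 1, and the group sum skips NaN, so s equals the missing weight on that row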
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform(sum)
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: This assumes that dates are always only the last day of the month, so that grouping by date is equivalent to grouping by year-month. If not, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform(sum)
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
I have a pandas dataframe which looks like
Temperature_lim Factor
0 32 0.95
1 34 1.00
2 36 1.06
3 38 1.10
4 40 1.15
I need to extract the factor value for any given temperature: if my current temperature is 31, the factor is 0.95; if it is 33, the factor is 1.00; if it is 38.5, the factor is 1.15. So, given my current temperature, I would like to know the factor for that temperature.
I can do this using multiple if/else statements, but is there a more effective way to do it by creating bins/intervals in pandas or Python?
Thank you
Use cut, prepending -np.inf to the values of the Temperature_lim column as bin edges, and fill values above the last edge with the last Factor value:
df1 = pd.DataFrame({'Temp':[31,33,38.5, 40, 41]})
b = [-np.inf] + df['Temperature_lim'].tolist()
lab = df['Factor']
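# with right=False the bins are [-inf, 32), [32, 34), ..., [38, 40); temps >= 40 fall outside and get the last Factor via fillna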
df1['new'] = pd.cut(df1['Temp'], bins=b, labels=lab, right=False).fillna(lab.iat[-1])
print (df1)
Temp new
0 31.0 0.95
1 33.0 1.00
2 38.5 1.15
3 40.0 1.15
4 41.0 1.15
I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understanding this well, because I was expecting a string of numbers but I get just one result:
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling'; if you set window to something, the window type will automatically be set to rolling.
Source.
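Note (my addition): pd.ols has since been removed from pandas. A hedged sketch of a modern rolling-slope equivalent, using the integer column from the edit and the fact that the OLS slope over a window is cov(x, y) / var(x):
import pandas as pd

df = pd.DataFrame({'B': [1.9, 2.3, 4.4, 5.6, 7.3],
                   'A': [3.2, 1.3, 5.6, 9.4, 10.4]})
df['i'] = range(len(df))

window = 3
# rolling OLS slope of each column against the integer index: cov(y, x) / var(x)
rolling_slopes = pd.DataFrame({
    col: df[col].rolling(window).cov(df['i']) / df['i'].rolling(window).var()
    for col in ['B', 'A']
})
print(rolling_slopes)   # the last slope for A (2.4) matches the x coefficient in the summary above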
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
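The grouped fit itself wasn't shown above; a sketch of what I believe that step looks like (group on the duplicated index, keep only the slope of B vs. date), which yields the slopes listed below:
from scipy.stats import linregress

slopes = df_dbl.groupby(level=0).apply(lambda g: linregress(g['date'], g['B'])[0])   # [0] is the slope
print(slopes)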
1 0.4
2 2.1
3 1.2
4 1.7
I'm having some trouble sorting and then resetting my Index in Pandas:
dfm = dfm.sort_values('delt', ascending=False)
dfm = dfm.reindex(index=range(1,len(dfm)))
The dataframe comes back unsorted after I reindex. My ultimate goal is to have a sorted dataframe with index numbers from 1 to len(dfm), so if there's a better way to do that, I wouldn't mind.
Thanks!
Instead of reindexing, just change the actual index:
dfm.index = range(1,len(dfm) + 1)
That won't change the order, just the index.
I think you're misunderstanding what reindex does. It uses the passed index to select values along the axis passed, then fills with NaN wherever your passed index doesn't match up with the current index. What you're interested in is just setting the index to something else:
In [12]: df = DataFrame(randn(10, 2), columns=['a', 'delt'])
In [13]: df
Out[13]:
a delt
0 0.222 -0.964
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
In [14]: df.reindex(index=arange(1, len(df) + 1))
Out[14]:
a delt
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
10 NaN NaN
In [16]: df.index = arange(1, len(df) + 1)
In [17]: df
Out[17]:
a delt
1 0.222 -0.964
2 0.038 -0.367
3 0.293 1.349
4 0.604 -0.855
5 -0.455 -0.594
6 0.795 0.013
7 -0.080 -0.235
8 0.671 1.405
9 0.436 0.415
10 0.840 1.174
Remember, if you want len(df) to be in the index you have to add 1 to the endpoint since Python doesn't include endpoints when constructing ranges.