dict1 = {'0-10': -0.04,
'10-20': -0.01,
'20-30': -0.03,
'30-40': -0.04,
'40-50': -0.02,
'50-60': 0.01,
'60-70': 0.05,
'70-80': 0.01,
'80-90': 0.09,
'90-100': 0.04}
stat = pd.DataFrame()
for x, y in dict1.items():
    stat[x] = y
I'm trying to write the dict values to my DataFrame, with the keys as the column names. But my output is this:
Empty DataFrame
Columns: [0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100]
Index: []
Tried it multiple times. No syntax errors. What am I missing? Thanks.
Try this:
df = pd.DataFrame(dict1, index=[0])
or
df = pd.DataFrame([dict1])
print(df)
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100
0  -0.04  -0.01  -0.03  -0.04  -0.02   0.01   0.05   0.01   0.09    0.04
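As for why the loop version came back empty: assigning a scalar to a column of an empty DataFrame adds the column but no rows, since the scalar is broadcast over the (empty) index. Wrapping each value in a one-element list makes the loop work too (a sketch):

stat = pd.DataFrame()
for x, y in dict1.items():
    stat[x] = [y]  # a one-element list gives the column one row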
Related: the original question and answer, for more context, can be found here.
Hello, I am working with a dataframe that looks like the following:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
I am assigning a year to "given" and then checking how many years until >= 70%. The "given" value maps to the lower year if it is less than 75% of the way from that year's value to the next year's. I map the "given" column to a "year" column using the following lines:
# each year's cutoff: its value plus 75% of the gap to the next year
thresholds = df + df.diff(-1, axis=1).abs() * 0.75
# True where 'Given' falls below a year's cutoff
below_75 = (df['Given'].to_numpy()[:, None] - thresholds.to_numpy()) < 0
# the year 'Given' maps to: the column with the smallest cutoff still above it
min_year = thresholds.where(below_75).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)
# years from the mapped year to the first year above 70%
min_year = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float) - min_year
This works perfectly in most cases, except when "given" maps to a value that is already above 70%. In that case, a row such as the following
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.69 0.24 0.5 0.61 0.7 0.74
maps to "Year 4"; the code then checks the next column (Year 5), sees that it is above 0.7, and outputs "1" (since it thinks there is 1 year until a value > 70%).
But since the row is mapped to "Year 4", which is already above 70%, I would like it to output "done" instead. I feel like this is an extremely easy fix, but I am at a loss.
All help appreciated.
Quick summary:
Essentially I am trying to map the "given" value to a "year" column. If the "given" value is <= 3/4 of the way to the next year, it maps to the lower year (e.g. if year 3 is 10%, year 4 is 20%, and "given" is 17%, it maps to year 3, since 17% < 17.5%). Then it calculates how many years until > 70%.
Most of the code was already solved in an earlier question; I am just trying to work on the part that assigns a specific year.
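A possible fix, as an untested sketch against the code above (mapped_year, at_mapped, years_to_70, and result are names introduced here): keep the mapped year under its own name instead of overwriting it, then replace the year count with "done" wherever the row's value at its mapped year is already at the threshold.

import numpy as np

# keep the mapped year label instead of overwriting it
mapped_year = thresholds.where(below_75).drop(columns=['Given']).idxmin(axis=1)
years_to_70 = (df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1)
                 .str.replace('Year ', '').astype(float)
               - mapped_year.str.replace('Year ', '').astype(float))
# each row's value at its own mapped year, via positional lookup
at_mapped = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(mapped_year)]
# >= so a mapped year sitting exactly at 70% also counts as done (adjust as needed)
result = years_to_70.astype(object).mask(at_mapped >= 0.7, 'done')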
Pretty new to Python and pandas. I have 15,000 values in a column of my DataFrame, like this:
   col1     col2
0     5  0.05964
1    19  0.00325
2    31  0.02250
3    12  0.03325
4    14  0.00525
I want to get as output a result like this:
0.00 to 0.01 = 55 values,
0.01 to 0.02 = 365 values,
0.02 to 0.03 = 5464 values, etc., from 0.00 to 1.00.
I'm a bit lost with groupby, value_counts, etc.
Thanks for the help!
IIUC, use pd.cut:
import numpy as np

out = df.groupby(pd.cut(df['col2'], np.linspace(0, 1, 101)))['col1'].sum()
print(out)
# Output
col2
(0.0, 0.01] 33
(0.01, 0.02] 0
(0.02, 0.03] 31
(0.03, 0.04] 12
(0.04, 0.05] 0
..
(0.95, 0.96] 0
(0.96, 0.97] 0
(0.97, 0.98] 0
(0.98, 0.99] 0
(0.99, 1.0] 0
Name: col1, Length: 100, dtype: int64
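If the goal is the number of rows in each interval rather than the sum of col1, counting the binned values directly is one option (a sketch under the same setup):

# count rows per 0.01-wide bin instead of summing col1
counts = pd.cut(df['col2'], np.linspace(0, 1, 101)).value_counts().sort_index()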
data
data = [['john', 0.20, 0.0, 0.4, 0.40],['katty', 0.0, 1.0, 0.0, 0.0],['kent', 0.0, 0.51, 0.49, 0.0]]
df = pd.DataFrame(data, columns=['name','fruit', 'vegetable', 'softdrinks', 'icecream'])
df = df.set_index('name')
df.head()
desired outcome
data = [['john', 0.20, 0.0, 0.4, 0.40,'softdrinks','icecream'],['katty', 0.0, 1.0, 0.0, 0.0,'vegetable','NaN'],['kent', 0.0, 0.51, 0.49, 0.0,'vegetable','softdrinks']]
df = pd.DataFrame(data, columns=['name','fruit', 'vegetable', 'softdrinks', 'icecream', 'max_no1', 'max_no2'])
df = df.set_index('name')
df.head()
I tried idxmax, but it only returns the column name of the highest value; I need the column name of the second-highest value in each row. How can I achieve this?
Thanks a lot.
First set 0 to missing values with DataFrame.mask, then reshape with DataFrame.stack and take the top 2 per row with SeriesGroupBy.nlargest; last, DataFrame.join the data reshaped by DataFrame.pivot back to the original:
df1 = df.mask(df == 0).stack().groupby(level=0, group_keys=False).nlargest(2).reset_index()
df1 = df1.assign(a=df1.groupby('name').cumcount().add(1))
df = df.join(df1.pivot(index='name', columns='a', values='level_1').add_prefix('max_no'))
print (df)
fruit vegetable softdrinks icecream max_no1 max_no2
name
john 0.2 0.00 0.40 0.4 softdrinks icecream
katty 0.0 1.00 0.00 0.0 vegetable NaN
kent 0.0 0.51 0.49 0.0 vegetable softdrinks
Or, the solution from the comments with DataFrame.idxmax, again setting missing values, this time by a broadcast comparison in numpy:
df1 = df.mask(df == 0)
df['max_no1'] = df1.idxmax(axis=1)
m = df1.columns.to_numpy() == df['max_no1'].to_numpy()[:, None]
#pandas below 0.24
#m = df1.columns.values == df['max_no1'].values[:, None]
df1 = df1.mask(m)
df['max_no2'] = df1.idxmax(axis=1)
print (df)
fruit vegetable softdrinks icecream max_no1 max_no2
name
john 0.2 0.00 0.40 0.4 softdrinks icecream
katty 0.0 1.00 0.00 0.0 vegetable NaN
kent 0.0 0.51 0.49 0.0 vegetable softdrinks
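A shorter (though typically slower) alternative, sketched against the original numeric columns: take the top-2 labels row by row with Series.nlargest; a row with fewer than two non-zero entries gets NaN in its second slot automatically.

cols = ['fruit', 'vegetable', 'softdrinks', 'icecream']
# nlargest drops NaN, so katty's second label comes back as NaN
top2 = df[cols].mask(df[cols] == 0).apply(lambda r: pd.Series(r.nlargest(2).index), axis=1)
df[['max_no1', 'max_no2']] = top2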
I have a dictionary like this:
{'6DEC19': 0.61, '13DEC19': 0.58, '27DEC19': 0.63, '31JAN20': 0.66, '27MAR20': 0.69, '26JUN20': 0.71}
I'm very simply trying to turn this into a DataFrame with the columns being 6DEC19, 13DEC19, etc., and the index set to the current date and hour, for which I would use pd.Timestamp.now().floor('60min').
With the resulting df looking like this:
6DEC19 13DEC19 27DEC19 31JAN20 27MAR20 26JUN20
2019-12-04 20:00:00 0.61 0.58 0.63 0.66 0.69 0.71
My first step would just be to turn the dict into a DataFrame, and as far as I'm concerned this code should do the trick:
df = pd.DataFrame.from_dict(dict)
But I get this error message: ValueError: If using all scalar values, you must pass an index.
I really have no idea what the problem is here. Any suggestions would be great, and if anyone can fit the problem of changing the index into the bargain, so much the better. Cheers.
As the error message says, you need to specify an index, so you can do the following:
import pandas as pd
d = {'6DEC19': 0.61, '13DEC19': 0.58, '27DEC19': 0.63, '31JAN20': 0.66, '27MAR20': 0.69, '26JUN20': 0.71}
df = pd.DataFrame(d, index=[pd.Timestamp.now().floor('60min')])
print(df)
Output
6DEC19 13DEC19 27DEC19 31JAN20 27MAR20 26JUN20
2019-12-04 17:00:00 0.61 0.58 0.63 0.66 0.69 0.71
try this:
import pandas as pd
a = {'6DEC19': [0.61], '13DEC19': [0.58], '27DEC19': [0.63], '31JAN20': [0.66], '27MAR20': [0.69], '26JUN20': [0.71]}
df = pd.DataFrame.from_dict(a)
print(df)
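This variant leaves the default RangeIndex; to also get the timestamp index asked for in the question, it can be assigned afterwards (a sketch):

df.index = [pd.Timestamp.now().floor('60min')]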
try this
newDF = pd.DataFrame(yourDictionary.items())
(note this yields a two-column key/value layout, one row per key, rather than one column per key)
I'd like to winsorize several columns of data in a pandas DataFrame. Each column has some NaNs, which affect the winsorization, so they need to be removed. The only way I know how to remove them is across all of the data at once, rather than column-by-column.
MWE:
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan
Here is the baseline distribution:
df.quantile([0, 0.01, 0.5, 0.99, 1])
output:
one two three four
0.00 2.336618e-10 2.294259e-07 0.002437 2.305353
0.01 9.862626e-07 9.742568e-04 0.975807 1003.814520
0.50 4.975859e-05 4.981049e-02 50.290946 50374.548980
0.99 9.897463e-05 9.898590e-02 98.978263 98991.438985
1.00 9.999983e-05 9.999966e-02 99.996793 99999.437779
This is how I'm winsorizing:
def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])
wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which gives this:
Out[356]:
one two three four
0.00 0.000001 0.001060 1.536882 1003.820149
0.01 0.000001 0.001060 1.536882 1003.820149
0.25 0.000025 0.024975 25.200378 25099.994780
0.50 0.000050 0.049810 50.290946 50374.548980
0.75 0.000075 0.074842 74.794537 75217.343920
0.99 0.000099 0.098986 98.978263 98991.436957
1.00 0.000100 0.100000 99.996793 98991.436957
Column four is correct because it has no NaNs, but the others are incorrect: the 99th percentile and max should be equal. (Most likely winsorize does not skip NaNs, so they occupy the top sorted positions and the upper cutoff lands in the wrong place.) The observation counts are identical for both:
In [357]: df.count()
Out[357]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
In [358]: wins.count()
Out[358]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
This is how I can 'solve' it, but at the cost of losing a lot of my data:
wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
Out[360]:
one two three four
0.00 9.686203e-07 0.000928 0.965702 1005.209503
0.01 9.686203e-07 0.000928 0.965702 1005.209503
0.25 2.486052e-05 0.024829 25.204032 25210.837443
0.50 4.980946e-05 0.049894 50.299004 50622.227179
0.75 7.492750e-05 0.075059 74.837900 75299.906415
0.99 9.895563e-05 0.099014 98.972310 99014.311761
1.00 9.895563e-05 0.099014 98.972310 99014.311761
In [361]: wins2.count()
Out[361]:
one 51700
two 51700
three 51700
four 51700
dtype: int64
How can I winsorize the data, by column, that is not NaN, while maintaining the data shape (i.e. not removing rows)?
As often happens, simply creating the MWE helped clarify things. I need to use clip() in combination with quantile(), as below:
df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
one two three four
0.00 9.862626e-07 0.000974 0.975807 1003.814520
0.01 9.862666e-07 0.000974 0.975816 1003.820092
0.25 2.485043e-05 0.024975 25.200378 25099.994780
0.50 4.975859e-05 0.049810 50.290946 50374.548980
0.75 7.486737e-05 0.074842 74.794537 75217.343920
0.99 9.897462e-05 0.098986 98.978245 98991.436977
1.00 9.897463e-05 0.098986 98.978263 98991.438985
In [384]: df2.count()
Out[384]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
The numbers differ from the earlier attempts because each column keeps all of its non-missing (non-NaN) data.
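For reuse, the same idea fits in a small helper (a sketch; the function name is mine). It works because DataFrame.quantile skips NaNs when computing the cutoffs and clip leaves NaNs in place, so each column is winsorized against its own non-missing values:

def winsorize_columns(frame, lower=0.01, upper=0.99):
    # per-column winsorization that tolerates NaN: quantile() skips missing
    # values when computing the cutoffs, and clip() leaves NaN untouched
    return frame.clip(lower=frame.quantile(lower),
                      upper=frame.quantile(upper), axis=1)

df2 = winsorize_columns(df)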