If I write this code:
train['id_03'].value_counts(dropna=False, normalize=True).head()
I get:
NaN 0.887689233582822
0.0 0.108211128797372
1.0 0.001461374335354
3.0 0.001131168083449
2.0 0.000712906831036
Name: id_03, dtype: float64
If I change to dropna=True,
I get
0.0 0.963497
1.0 0.013012
3.0 0.010072
2.0 0.006348
5.0 0.001643
Name: id_03, dtype: float64
I think the key is that you specified normalize=True. According to the documentation: "If True then the object returned will contain the relative frequencies of the unique values."
Before you remove the NaNs, their counts are included when computing the relative frequencies; after you remove them, the denominator of the relative frequencies changes, hence the values change.
You are normalising the result. The count for NaN is very large with respect to the others, hence the other indexes come out as very small numbers.
If you look at the relative ratio between indexes 1 and 2, you'll see that it is the same in both results.
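To see this numerically, here is a minimal sketch on a made-up series (not the asker's data):
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan, np.nan, 0.0, 0.0, 1.0])

# NaN counted: denominator is 6
print(s.value_counts(dropna=False, normalize=True))
# NaN    0.500000
# 0.0    0.333333
# 1.0    0.166667

# NaN dropped: denominator shrinks to 3, but 0.0 : 1.0 stays 2 : 1
print(s.value_counts(dropna=True, normalize=True))
# 0.0    0.666667
# 1.0    0.333333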
Let's take an example.
Suppose we have a dataframe with a column named "f1":
f1 : {2, 4, NaN, 1, NaN, 15}
When we apply mean imputation to it, we write code like this:
dataframe['f1'].fillna(dataframe['f1'].mean())
My doubt is this: when it computes the mean of f1 in dataframe['f1'].mean(), I know that it excludes the NaN values from the summation (the numerator) because they can't be added, but is NaN included or excluded in the denominator when dividing by the total number of values?
Is the mean computed like this:
mean(f1) = (2+4+1+15)/6 (NaN included in the total number of values)
or this way:
mean(f1) = (2+4+1+15)/4 (NaN excluded from the total number of values)
Also, please explain why.
Thanks in advance.
pd.Series.mean calculates the mean only over the non-NaN values, so for the data above the mean is (2+4+1+15)/4 = 5.5, where 4 is the number of non-NaN values. This is the default behavior. If you want the denominator to include all rows of the Series, you can fillna(0) before calling mean():
Calling mean() directly:
df['f1'].fillna(df['f1'].mean())
0 2.0
1 4.0
2 5.5 <------
3 1.0
4 5.5 <------
5 15.0
Name: f1, dtype: float64
Calling mean() after fillna(0):
df['f1'].fillna(df['f1'].fillna(0).mean())
0 2.000000
1 4.000000
2 3.666667 <------
3 1.000000
4 3.666667 <------
5 15.000000
Name: f1, dtype: float64
According to the official documentation of pandas.DataFrame.mean, the "skipna" parameter excludes NA/null values. If they were excluded from the numerator but not the denominator, this would be explicitly mentioned in the documentation. You can prove it to yourself by performing a simple experiment with a dummy dataframe such as the one you exemplified in the question.
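A minimal version of that experiment could look like this (using the series from the question):
import pandas as pd
import numpy as np

s = pd.Series([2, 4, np.nan, 1, np.nan, 15], name="f1")

print(s.mean())             # 5.5 -> (2+4+1+15)/4, NaN excluded on both sides
print(s.sum() / s.count())  # 5.5, since count() ignores NaN
print(s.sum() / len(s))     # ~3.67 -> what (2+4+1+15)/6 would give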
The reason NA/null values should be excluded from the denominator is statistical correctness. The mean is the sum of the numbers divided by how many there are. If a value cannot be added to the sum, it is pointless to count it in the denominator. Counting it there amounts to treating the NA/null value as 0; however, the value is not 0, it is unknown, unobserved, hidden, etc.
If you know the nature of the distribution in practice, you can interpolate or fill NA/null values accordingly and then take the mean of all values. For instance, if you realize that the feature in question behaves linearly, you can interpolate the missing values with the "linear" method.
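As a minimal sketch of that idea (whether "linear" is appropriate depends on your data):
import pandas as pd
import numpy as np

s = pd.Series([2, 4, np.nan, 1, np.nan, 15])

# Fill each gap assuming values change linearly between observations,
# then take the mean over all six positions
filled = s.interpolate(method="linear")
print(filled.tolist())  # [2.0, 4.0, 2.5, 1.0, 8.0, 15.0]
print(filled.mean())    # 5.4166...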
A pandas Series called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as the values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
As the wind speed values are NaN more often than not, I intended to interpolate considering the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there's a method called "index" which interpolates based on the index values, but the results don't make sense compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like the result when using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to use "index" properly as the interpolation method, since it should be the most accurate here: the pressure levels mark the "distance" between the wind speed values.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels could turn out so counterintuitive, and how I could make them more sound.
Thanks to @ALollz in the first comment underneath my question, I figured out where the issue lay:
It was just that my dataframe had two index levels: the outer holds unique measurement timestamps, the inner is a standard range index.
I should have looked at each subset associated with a unique timestamp separately.
Within these subsets, the interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")
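An equivalent, loop-free sketch of the same idea groups on the outer index level and writes the result back (assuming the same placeholder column name):
# Interpolate each timestamp's subset separately via groupby
df["column of interest"] = df.groupby(level=0)["column of interest"].transform(
    lambda s: s.interpolate(method="linear", limit=1, limit_direction="both")
)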
The following code updates the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated dataframe grp1 contains the number of sold items. I would like to subtract dataframe grp1 from dataframe dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna: first of all, the datatypes change from int to float, and not only the NaNs but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but that did not change anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after comment: the indices are of dtype object.
Your indices are string objects. You need to convert them to numeric. Note that sort_index returns a new frame, so assign the result. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
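One sketch of "the rest" after that conversion (the astype(int) casts back from the float upcast the NaNs introduced):
# With aligned numeric indices, the join now matches rows
dr_update = dr.join(grp1).fillna(0).astype(int)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]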
You can filter the old stock dataframe dr so that it matches the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that you have matching index to the sold dataframe.
# Restrict just for menge_im_lager. Then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
    dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you first want the non-matching indices to be in your final dataset, and you want your final dataset to contain integers. You can use an 'outer' join and cast with astype(int).
So, at the join, you can do it this way:
dr.join(grp1, how='outer').fillna(0).astype(int)
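If you would rather keep the missing values visible instead of filling them with 0, the nullable integer dtype is an alternative (a sketch; requires pandas >= 0.24):
# 'Int64' (capital I) stores integers plus pd.NA without upcasting to float
dr.join(grp1, how='outer').astype('Int64')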
I have a Dataframe with the location of some customers (so I have a column with Customer_id and others with Lat and Lon) and I am trying to interpolate the NaN's according to each customer.
For example, if I interpolate with the nearest approach here (I made up the values here):
Customer_id Lat Lon
A 1 1
A NaN NaN
A 2 2
B NaN NaN
B 4 4
I would like the NaN for B to be 4 and not 2.
I have tried this
series.groupby('Customer_id').apply(lambda group: group.interpolate(method='nearest', limit_direction='both'))
The number of NaNs goes down from 9003 to 94, but I don't understand why it still leaves some missing values.
I checked, and these 94 missing values correspond to records from customers that were already being interpolated. For example:
Customer_id Lat
0. A 1
1. A NaN
2. A NaN
3. A NaN
4. A NaN
It would interpolate correctly up to some point (let's say it interpolates rows 1, 2 and 3 correctly) and then leave row 4 as NaN.
I have tried setting a limit in interpolate greater than the maximum number of records per customer, but it still doesn't work out. I don't know where my mistake is; can somebody help?
(I don't know if it's relevant to mention, but I fabricated my own NaNs for this. This is the code I used: Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe. I think the problem isn't there, but since I'm very confused as to where the issue actually is, I'll just leave it here.)
When you interpolate with nearest, it can only fill in-between missing values. (You'll notice this because you get an error when there's only one non-null value, as in your example.) The remaining null values are "edges", which are taken care of with .bfill().ffill() to match the nearest logic. This is also the appropriate way to "interpolate" when there is only one non-missing value.
def my_interp(x):
    if x.notnull().sum() > 1:
        return x.interpolate(method='nearest').ffill().bfill()
    else:
        return x.ffill().bfill()

df.groupby('Customer_id').transform(my_interp)
# Lat Lon
#0 1.0 1.0
#1 1.0 1.0
#2 2.0 2.0
#3 4.0 4.0
#4 4.0 4.0
Let's say I have a dataframe with dates as index. Each row contains information about a certain event on that date. The problem is that there could be more than one event on said date.
This is an example DataFrame, df2:
one two
1/2 1.0 1.0
1/2 1.0 1.0
1/4 3.0 3.0
1/5 NaN 4.0
I want to add missing dates to the dataframe, and I used to be able to do it with .loc. Now .loc raises the following warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
This is my code (it works but raises the warning):
# I want to add any missing date- in this example, 1/3.
df2.loc[["1/2","1/3","1/4","1/5"]]
one two
1/2 1.0 1.0
1/2 1.0 1.0
1/3 NaN NaN
1/4 3.0 3.0
1/5 NaN 4.0
I've tried using reindex as it suggests, but my index contains duplicated values so it doesn't work:
#This doesn't work
df2.reindex(["1/2","1/3","1/4","1/5"])
ValueError: cannot reindex from a duplicate axis
What can I do to replace the old loc?
One way is a join:
df2.join(pd.DataFrame(index=["1/2","1/3","1/4","1/5"]), how='outer')
Out[193]:
one two
1/2 1.0 1.0
1/2 1.0 1.0
1/3 NaN NaN
1/4 3.0 3.0
1/5 NaN 4.0
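Another way, if you prefer to avoid the join, is to concat only the genuinely missing labels and sort (a sketch; it assumes these string dates sort correctly lexicographically):
import numpy as np
import pandas as pd

wanted = ["1/2", "1/3", "1/4", "1/5"]
missing = [d for d in wanted if d not in df2.index]

# Append all-NaN rows for the missing dates, then restore order
df2 = pd.concat([df2, pd.DataFrame(np.nan, index=missing, columns=df2.columns)])
df2 = df2.sort_index()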