Incorrect mean from pandas DataFrame - python

So here's an interesting thing:
Using python 2.7:
I've got a dataframe of about 5,100 entries, each with a number (melting point) in a column titled 'Tm'. Using the code:
self.sort_df[['Tm']].mean(axis=0)
I get a mean of:
Tm 92.969204
dtype: float64
This doesn't make sense because no entry has a Tm greater than 83.
Does .mean() not work for this many values? I've tried paring down the dataset and it seems to work for ~1,000 entries, but considering I have a full dataset of 150,000 to run at once, I'd like to know if I need to find a different way to calculate the mean.

A more readable syntax would be:
sort_df['Tm'].mean()
Try sort_df['Tm'].value_counts() or sort_df['Tm'].max() to see what values are present. Some unexpected values must have crept in.
The .mean() method gives an accurate result regardless of the size of the dataset.
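The checks suggested above can be sketched as follows. The data here is made up for illustration (a few plausible melting points plus one stray value), and the ceiling of 83 is simply the limit quoted in the question.

```python
import pandas as pd

# Hypothetical data: plausible melting points plus one stray sentinel value.
sort_df = pd.DataFrame({"Tm": [55.0, 61.5, 70.2, 83.0, 9999.0]})

tm = sort_df["Tm"]
print(tm.describe())    # count, mean, min and max at a glance
print(tm.nlargest(3))   # the largest values, with their row labels

# Rows above the expected ceiling (83 is the limit quoted in the question).
outliers = sort_df[tm > 83]
print(outliers)

# Mean with the suspicious rows excluded, for comparison.
clean_mean = tm[tm <= 83].mean()
print(clean_mean)
```

If the inflated mean disappears once the outlying rows are excluded, the problem is in the data rather than in .mean().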

Related

Pandas - value changes when adding a new column to a dataframe

I'm trying to add a new column to a dataframe using the following code.
labels = df1['labels']
df2['labels'] = labels
However, in the later part of my program, I found that there might be something wrong with the assignment. So, I checked it using
labels.equals(other=df2['labels'])
and I got False. (I added this line immediately after the assignment.)
I also tried to:
print out part of labels and df2, and it turns out that some rows are indeed different;
check the max and min values of both series, and they are different;
check the number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differ a lot;
test with a smaller amount of data, which works totally fine.
My dataframe is rather large (40 million+ rows), so I cannot print everything out and check the values. Does anyone have any idea what might lead to this kind of problem? Or are there any suggestions for further tests?
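One possible cause, offered only as an assumption since the question does not show the indexes of df1 and df2: assigning a Series to a DataFrame column aligns on index labels, not on row position. The small frames below are hypothetical and only illustrate that behavior.

```python
import pandas as pd

# Hypothetical frames whose indexes differ (e.g. df2 was filtered or re-sorted).
df1 = pd.DataFrame({"labels": [10, 20, 30]}, index=[0, 1, 2])
df2 = pd.DataFrame({"x": ["a", "b", "c"]}, index=[2, 1, 0])

labels = df1["labels"]
df2["labels"] = labels                 # aligned on index labels, not position
same = labels.equals(df2["labels"])    # False: same values, different row order

# Assigning the raw values ignores the index and copies by position instead.
df2["labels_by_position"] = labels.to_numpy()
print(df2)
```

Comparing df1.index with df2.index (for example with df1.index.equals(df2.index)) would be a cheap further test on the full data.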

How to replace only zeroes with some conditions on dataframe

I have searched in many places but can neither come up with my own logic nor find a solution online.
Problem
I have a student performance dataset. While performing EDA, I came across a small problem: why do students with zero 'absences' have zeroes in their final grades?
It is practically impossible for a student to be present the whole year and still get a zero in their finals.
So I decided to filter out all the rows with zeroes in those two columns using
dataset[(dataset['G3']==0)&(dataset['absences']==0)]
but this returned a dataframe.
So I tried
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']]
which returned the two columns for the rows satisfying the condition. What I want is to replace the zeroes in the 'G3' and 'absences' columns with their respective column means, without disturbing the rest of the dataframe.
I tried to replace them with
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']].replace(0,np.mean[dataset[['G3','absences']]])
which threw the error
'function' object is not subscriptable
I have tried many things but still can't get past this problem. Any solution would help.
Thanks in advance.
In case you want to replace the zeroes with the mean of the non-zero values in each column, then you can use
import numpy as np
import pandas as pd
# Random sample data standing in for the real dataset
dataset = pd.DataFrame({'G3': np.random.randint(0,3,100),
                        'absences': np.random.randint(0,3,100)})
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3', 'absences']] = [dataset.loc[(dataset['G3']!=0)]['G3'].mean(), dataset.loc[(dataset['absences']!=0)]['absences'].mean()]
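The same idea can be written with a named mask, which may be easier to read. This is a minimal sketch on a small made-up sample, not the real student data. The columns are cast to float first (an added assumption) so that fractional means can be stored. The error in the question came from writing np.mean[...] with square brackets instead of calling it with parentheses.

```python
import pandas as pd

# Hypothetical sample; real data would come from the student dataset.
dataset = pd.DataFrame({
    "G3": [0, 10, 12, 0, 14],
    "absences": [0, 4, 0, 2, 6],
}).astype(float)  # float columns so that fractional means can be stored

# Rows where both the final grade and the absences are zero.
both_zero = (dataset["G3"] == 0) & (dataset["absences"] == 0)

# Per-column means computed over the non-zero entries only.
g3_mean = dataset.loc[dataset["G3"] != 0, "G3"].mean()
abs_mean = dataset.loc[dataset["absences"] != 0, "absences"].mean()

# Assign in place; rows outside the mask are left untouched.
dataset.loc[both_zero, ["G3", "absences"]] = [g3_mean, abs_mean]
print(dataset)
```

Rows where only one of the two columns is zero are left as they are, which matches the condition used in the question.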

I want to subtract one column from another in pandas, but I keep getting a copy error. Is there a better way to do this operation?

I have a data frame TB_greater_2018 that has 3 columns: country, e_inc_100k_2000 and e_inc_100k_2018. I would like to subtract e_inc_100k_2000 from e_inc_100k_2018, use the resulting values to create a new column of differences, and then sort by the countries with the largest difference. My current code is:
case_increase_per_100k = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018["case_increase_per_100k"] = case_increase_per_100k
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
When I run this, I get a SettingWithCopyWarning. Is there a way to do this without getting this warning? Or, more generally, a better way of accomplishing the task?
You can do
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
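It looks like the warning comes from computing the difference and assigning it as a column in separate operations, although to be honest I'm not clear why that would be. Note that pandas usually raises SettingWithCopyWarning when the frame being modified was itself created by slicing or filtering another DataFrame, so how TB_greater_2018 was built may matter more than how the column is assigned.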
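If TB_greater_2018 was created by filtering a larger table (an assumption, since the question does not show that step), taking an explicit copy when building it is the usual way to avoid the warning. The table tb and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical source table; TB_greater_2018 is assumed to be a filtered subset.
tb = pd.DataFrame({
    "country": ["A", "B", "C"],
    "e_inc_100k_2000": [50.0, 80.0, 30.0],
    "e_inc_100k_2018": [70.0, 60.0, 90.0],
})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous.
TB_greater_2018 = tb[tb["e_inc_100k_2018"] > tb["e_inc_100k_2000"]].copy()

TB_greater_2018["case_increase_per_100k"] = (
    TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
)
top = TB_greater_2018.sort_values("case_increase_per_100k", ascending=False)
print(top.head())
```

The original table tb is not modified; only the copied subset gains the new column.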

Data presentation difference in python

Hopefully a fairly simple answer to my issue.
When I run the following code:
print (data_1.iloc[1])
I get a nice, vertical presentation of the data, with each column value header, and its value presented on separate rows. This is very useful when looking at 2 sets of data, and trying to find discrepancies.
However, when I write the code as:
print (data_1.loc[data_1["Name"].isin(["John"])])
I get all the information arrayed across the screen, with the column header in 1 row, and the values in another row.
My question is:
Is there any way of using the second code, and getting the same vertical presentation of the data?
The difference is that data_1.iloc[1] returns a pandas Series whereas data_1.loc[data_1["Name"].isin(["John"])] returns a DataFrame. Pandas has different representations for these two data types (i.e. they print differently).
The reason iloc[1] gives you a Series is that you indexed it with a scalar. If you do data_1.iloc[[1]] you'll see you get a DataFrame instead. Conversely, data_1["Name"].isin(["John"]) returns a boolean Series, and indexing .loc with it always yields a DataFrame, even when only one row matches. If you want a Series instead, you can take the first matching row by position:
print(data_1.loc[data_1["Name"].isin(["John"])].iloc[0])
but only if you're sure you're getting exactly one row back; otherwise this shows only the first match, and it raises an IndexError if there is none.
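A small sketch of the Series versus DataFrame difference, using made-up data in place of data_1. Transposing the filtered frame with .T is an additional option not mentioned above; it also prints vertically and works when several rows match.

```python
import pandas as pd

# Hypothetical data standing in for data_1.
data_1 = pd.DataFrame({
    "Name": ["Anna", "John", "Maria"],
    "Age": [31, 45, 28],
    "City": ["Oslo", "Leeds", "Porto"],
})

# Boolean-mask selection returns a DataFrame, which prints horizontally.
matches = data_1.loc[data_1["Name"].isin(["John"])]

# Taking one row by position returns a Series: one field per line.
first_match = matches.iloc[0]
print(first_match)

# Transposed DataFrame: also vertical, one column per matching row.
print(matches.T)
```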

Ta-lib evaluation order for building series

I'm building indicator series based on market prices using ta-lib. I wrote a couple of implementations of the same concept, and every one of them shows the same issue: to obtain a correct series of values I must reverse the input series and then reverse the resulting series. The Python code that calls the ta-lib library through a convenient wrapper is:
rsi1 = np.asarray(run_example(function_name,
                              arguments,
                              30,
                              weeklyNoFlatOpen[0],
                              weeklyNoFlatHigh[0],
                              weeklyNoFlatLow[0],
                              weeklyNoFlatClose[0],
                              weeklyNoFlatVolume[0][::-1]))
rsi2 = np.asarray(run_example(function_name,
                              arguments,
                              30,
                              weeklyNoFlatOpen[0][::-1],
                              weeklyNoFlatHigh[0][::-1],
                              weeklyNoFlatLow[0][::-1],
                              weeklyNoFlatClose[0][::-1],
                              weeklyNoFlatVolume[0][::-1]))[::-1]
The graphs of both series were posted as an image that is not reproduced here (the indicator is actually an SMA).
The green line is clearly computed in reverse order (from sample n down to 0) and the red one in the expected order. To get the red line I must reverse both the input series and the output series.
The code for this test was linked from the original post.
Has anybody observed the same behavior?
I found what was wrong with my approach to the problem. The simple answer is that the MA indicator puts the first valid value at position zero of the results array, so the result series starts from zero and has N fewer samples than the input series (where N is the period value in this case). The idea of a reversed computation was completely wrong.
The plots posted as proof are not reproduced here. After adding 30 zeros at the beginning of the result and removing the last ones, the indicator fits over the input series nicely.
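The alignment issue described in this answer can be illustrated without ta-lib. The sketch below uses plain NumPy and two hypothetical helpers, sma_valid and align_to_input; it pads with NaN rather than zeros so the undefined leading samples are not plotted as real values, and it derives the pad length from the array lengths because the exact offset depends on the wrapper being used.

```python
import numpy as np

def sma_valid(values, period):
    """Simple moving average that returns only the fully defined windows."""
    kernel = np.ones(period) / period
    return np.convolve(values, kernel, mode="valid")

def align_to_input(result, input_length):
    """Left-pad a shorter result with NaN so it has the same length as the input."""
    pad = input_length - len(result)
    return np.concatenate([np.full(pad, np.nan), result])

# Hypothetical closing prices and period.
close = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
period = 3

raw = sma_valid(close, period)              # shorter than the input
aligned = align_to_input(raw, len(close))   # same length as the input
print(raw)
print(aligned)
```

Plotting raw directly against close would shift the average to the left, which can look like a series computed in the wrong order; aligned lines up each average with the last sample of its window.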
