Pandas - Retrieve Value from df.loc - python

Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult retrieves a data format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1] all to no avail. Before I do something awful like converting it to a string and strip it out, I thought I'd ask here.

A single-element Series needs .item() to retrieve its scalar value.
print(aresult.item())
1
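For illustration, here is a minimal sketch of the lookup, assuming a comma-separated file containing the two rows from the question (the inline csv_text is a stand-in for predictions.csv). Indexing with aresult[0] fails because a Series is indexed by label, and the only label present is the original row label of the match, not 0.

```python
import io

import pandas as pd

# Stand-in for predictions.csv (assumed comma-separated, two rows from the question).
csv_text = "prediction,id\n0,991960332793155071\n1,593960337793155072\n"
predictionsfile = pd.read_csv(io.StringIO(csv_text))

idtest = 593960337793155072

# Select the matching rows and the 'prediction' column in one .loc call.
aresult = predictionsfile.loc[predictionsfile["id"] == idtest, "prediction"]

# The Series keeps the original row label, so label 0 is not present here.
labels = aresult.index.tolist()

# .item() returns the scalar when the Series has exactly one element.
value = aresult.item()

# .iloc[0] is positional and returns the first match regardless of its label.
first = aresult.iloc[0]

print(value, first, labels)
```

Note that .item() raises a ValueError unless the Series holds exactly one element, so .iloc[0] may be the safer choice if an id could match several rows.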

Related

Python in enumerate not giving expected output

I'm having an issue with the output from enumerate. It is adding parentheses and commas to the data. I'm trying to use the list for a comparison loop. Can anyone tell me why the special characters are added, making the values look like tuples? I'm going crazy here trying to finish this, but this bug is causing issues.
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
#NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)
df.isnull().sum()/df.count()*100
df.dtypes
# Apply value_counts() on column LaunchSite
df[['LaunchSite']].value_counts()
# Apply value_counts on Orbit column
df[['Orbit']].value_counts()
#landing_outcomes = values on Outcome column
landing_outcomes = df[['Outcome']].value_counts()
print(landing_outcomes)
#following causes data issue
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
#following also causes an issue to the data
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = []
for value in df['Outcome'].items():
    if value in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)
df['Class']=landing_class
df[['Class']].head(8)
df.head(5)
df["Class"].mean()
The issue I'm having is
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
is changing my data and giving an output of
0 ('True ASDS',)
1 ('None None',)
2 ('True RTLS',)
3 ('False ASDS',)
4 ('True Ocean',)
5 ('False Ocean',)
6 ('None ASDS',)
7 ('False RTLS',)
additionally, when I run
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
my output is
{('False ASDS',),
('False Ocean',),
('False RTLS',),
('None ASDS',),
('None None',)}
I do not understand why my data return is far from expected and how to correct it.
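df[['Outcome']].value_counts() is indexed by tuples, so each key is a one-element tuple. Try unpacking it:
for i, (outcome,) in enumerate(landing_outcomes.keys()):
    print(i, outcome)
Or take the first element:
for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome[0])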
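A possible alternative that avoids the tuples altogether: selecting the column with single brackets returns a Series, and Series.value_counts() is indexed by the plain values. The sketch below uses a tiny made-up Outcome column rather than the real dataset, and the bad_outcomes set is written out by hand purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Outcome column of the real dataset.
df = pd.DataFrame(
    {"Outcome": ["True ASDS", "None None", "True ASDS", "False ASDS"]}
)

# Double brackets select a DataFrame; its value_counts() is keyed by tuples.
tuple_keys = list(df[["Outcome"]].value_counts().keys())

# Single brackets select a Series; its value_counts() is keyed by the values.
landing_outcomes = df["Outcome"].value_counts()
plain_keys = list(landing_outcomes.keys())

for i, outcome in enumerate(plain_keys):
    print(i, outcome)

# With plain string keys, the class column can be built without a loop:
# 0 if the outcome is in bad_outcomes, 1 otherwise.
bad_outcomes = {"None None", "False ASDS"}
df["Class"] = (~df["Outcome"].isin(bad_outcomes)).astype(int)
```

Because the keys are now ordinary strings, membership tests against bad_outcomes compare like with like, which the tuple keys in the question would not.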

How to avoid ValueError: could not convert string to float: '?'

This is ML code and I am beginner.
X and y are class and feature matrix
print(X.shape)
X.dtypes
output:
Age int64
Sex int64
chest pain type int64
Trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca object
thal object
dtype: object
from sklearn.feature_selection import SelectKBest, f_classif
#Using ANOVA to create the new dataset with only best three selected features
X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y) #<-------- get error
X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
print("The dataset with best three selected features after using ANOVA:")
print(X_new_anova.head())
kmeans_anova = KMeans(n_clusters = 3).fit(X_new_anova)
labels_anova = kmeans_anova.labels_
#Counting the number of the labels in each cluster and saving the data into clustering_classes
clustering_classes_anova = {
    0: [0,0,0,0,0],
    1: [0,0,0,0,0],
    2: [0,0,0,0,0]
}
for i in range(len(y)):
    clustering_classes_anova[labels_anova[i]][y[i]] += 1
###Finding the most appeared label in each cluster and computing the purity score
purity_score_anova = (max(clustering_classes_anova[0])+max(clustering_classes_anova[1])+max(clustering_classes_anova[2]))/len(y)
print(f"Purity score of the new data after using ANOVA {round(purity_score_anova*100, 2)}%")
This is the error I got:
#Using ANOVA to create the new dataset with only best three selected features
----> 4 X_new_anova = SelectKBest(f_classif, k=3).fit_transform(X,y)
5 X_new_anova = pd.DataFrame(X_new_anova, columns = ["Age", "Trestbps","chol"])
6 print("The dataset with best three selected features after using ANOVA:")
ValueError: could not convert string to float: '?'
I don't know what the "?" means.
Could you please tell me how to avoid this error?
The '?' is a literal string somewhere in your data file that cannot be converted to a float. Your dtypes output shows ca and thal as object rather than a numeric type, so the '?' is most likely in one of those two columns. I would guess whoever made the file put a ? wherever a value was missing. Check the file to find the affected rows.
You can delete a row using:
# 3 is a placeholder for the label of whatever row holds the '?';
# if row 40 holds it, pass labels=40 instead.
df = df.drop(labels=3, axis=0)
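If there are many such rows, finding them by hand gets tedious. Here is a sketch of one alternative, assuming the '?' marks a missing value: coerce the columns to numeric so that every '?' becomes NaN, then drop the affected rows from both X and y. The small frame below is made-up data that only mimics the dtypes in the question.

```python
import pandas as pd

# Made-up data: 'ca' and 'thal' are object columns containing '?' placeholders.
X = pd.DataFrame(
    {"Age": [63, 67, 41], "ca": ["0", "?", "2"], "thal": ["3", "7", "?"]}
)
y = pd.Series([0, 1, 0])

# Convert every column to numeric; unparseable entries such as '?' become NaN.
X_numeric = X.apply(pd.to_numeric, errors="coerce")

# Boolean mask of rows with no missing values.
keep = X_numeric.notna().all(axis=1)

# Filter X and y with the same mask so they stay aligned.
X_clean = X_numeric[keep]
y_clean = y[keep]

print(X_clean.dtypes)
print(len(X_clean), len(y_clean))
```

If the data is loaded with pandas.read_csv, passing na_values='?' at load time has a similar effect on the affected columns. Whether dropping or imputing the missing rows is appropriate depends on the dataset.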

How to use one dataframe's index to reindex another one in pandas

I am so sorry that I truly don't know what title I should use. But here is my question
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database, what I exactly want to do is to obtain the correlation of open price and volume in 4 days for each stock (The first column consists of codes of different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only simplified example, the real data should be extended to more than 1000 different stocks.)
My attempt is to create a dataframe and run a loop, assigning the results to that dataframe. The problem is that the index of the created dataframe is not what I want, so when I tried to append the correlation column, the bug occurred. (Please ignore the correlation values, which I made up just to give an example.)
r = pd.DataFrame(index = range(6),columns = ['c'])
for i in range(6):
    r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open,Stocks_Volume], axis = 1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe is indexed by integers rather than by stock code, but I don't know how to fix it. Can anyone help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
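As a possible alternative to the loop, DataFrame.corrwith with axis=1 computes the correlation between matching rows of two frames and returns a Series that is already indexed by stock code. A minimal sketch using the first three rows from the question:

```python
import pandas as pd

cols = ["d-1", "d-2", "d-3", "d-4"]
codes = ["000001.HR", "000002.ZH", "000004.HD"]

# First three rows of each table from the question.
Stocks_Open = pd.DataFrame(
    [
        [1817.670960, 1808.937405, 1796.928768, 1804.570628],
        [4867.910878, 4652.713598, 4652.713598, 4634.904168],
        [92.046474, 92.209029, 89.526880, 96.435445],
    ],
    index=codes,
    columns=cols,
)
Stocks_Volume = pd.DataFrame(
    [
        [324234, 345345, 657546, 234234],
        [4867343, 465234, 4652598, 4634168],
        [9246474, 929029, 826880, 965445],
    ],
    index=codes,
    columns=cols,
)

# Row-wise correlation: one value per stock, indexed by stock code.
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1)

# The Series aligns on the stock-code index when assigned as a new column.
Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days["corr"] = corr

print(corr)
```

Since the result carries the stock codes as its index, no integer-indexed helper frame is needed, and the same call should scale to the 1000+ stocks mentioned in the question.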

Parsing Data, Excel to Python

So I have an Excel (.csv) data file that looks like:
Frequency Frequency error
0.00575678 17
0.315 2
0.003536329 13
0.00481 1
0.004040379 4
where the second column is the error in the first data column, e.g. the value of the first entry is 0.00575678 +/- 0.0000000017 and the second is 0.315 +/- 0.002. Is there a way to parse the data using Python so that I get two data arrays, the first being the frequency and the second the frequency error, where the first entry in the second array is in the format 0.0000000017? If this were a small data file I'd do it manually, but it has a few thousand entries, so that's not really an option. Thanks
Maybe not the fastest, but looks close.
sample = """\
0.00575678,17
0.315,2
0.003536329,13
0.00481,1
0.004040379,4"""
for line in sample.splitlines():
    value,errordigits = line.split(',')
    error = ''.join(c if c in '0.' else '0' for c in value)[:-1]
    error += errordigits
    print("%s,%s" % (value,error))
prints:
0.00575678,0.000000017
0.315,0.002
0.003536329,0.0000000013
0.00481,0.00001
0.004040379,0.000000004
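The loop above always drops exactly one trailing character before appending the error digits, so a two-digit error ends up extending one decimal place past the last digit of the value. If the error digits are instead meant to line up with the last digits of the value (the usual convention for this shorthand, but an assumption here, since the question's own example is ambiguous), a sketch with the standard decimal module could look like this. The name error_from_digits is a hypothetical helper, not part of any library.

```python
from decimal import Decimal


def error_from_digits(value: str, errordigits: str) -> Decimal:
    """Scale the error digits to the decimal place of the value's last digit.

    Assumes the shorthand 0.00575678 with error 17 means +/- 0.00000017.
    """
    # Exponent of the last written digit, e.g. -8 for '0.00575678'.
    exponent = Decimal(value).as_tuple().exponent
    return Decimal(errordigits) * Decimal(10) ** exponent


sample = """\
0.00575678,17
0.315,2
0.003536329,13
0.00481,1
0.004040379,4"""

for line in sample.splitlines():
    value, errordigits = line.split(",")
    error = error_from_digits(value, errordigits)
    # The 'f' format avoids scientific notation for very small numbers.
    print("%s,%s" % (value, format(error, "f")))
```

Working with the strings through Decimal keeps the trailing digits exactly as written, which a float round trip would not guarantee.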
I found pandas useful for getting data from a CSV file.
df = pandas.read_csv("YOURFILE.csv")
df = pandas.DataFrame(data = df, columns=['COLUMNNAME1','COLUMNNAME2'])
y = df.COLUMNNAME1.values
x = df.COLUMNNAME2.values

Write a dataframe with only integer values

The title of this question might not be appropriate...
So let's suppose I have the following input.csv :
Division,id,name
1,3870,name1
1,4537,name2
1,5690,name3
I need to do some processing based on the id column, which fetches data like this:
>>> get_data(3870)
[{"matchId": 42, "comment": "Awesome match"}, {"matchId": 43, "comment": "StackOverflow is quite good"}]
My objective is to output a csv that is a join between the first one, and the related data retrieved through get_data :
Division,id,name,matchId,comment
1,3870,name1,42,Awesome match
1,3870,name1,43,StackOverflow is quite good
1,4537,name2,90,Random value
1,4537,name2,91,Still a random value
1,5690,name3,10,Guess what it is
1,5690,name3,11,A random value
However, for some reason, the integer data are converted to floats in the process:
Division,id,name,matchId,comment
1.0,3870.0,name1,42.0,Awesome match
1.0,3870.0,name1,43.0,StackOverflow is quite good
1.0,4537.0,name2,90.0,Random value
1.0,4537.0,name2,91.0,Still a random value
1.0,5690.0,name3,10.0,Guess what it is
1.0,5690.0,name3,11.0,A random value
Here is a short version of my code. I think I missed something...
input_df = pd.read_csv(INPUT_FILE)
output_df = pd.DataFrame()
for index, row in input_df.iterrows():
    matches = get_data(row)
    rdict = dict(row)
    for m in matches:
        m.update(rdict)
        output_df = output_df.append(m, ignore_index=True)
# FIXME: this was an attempt to solve the problem
output_df["id"] = output_df["id"].astype(int)
output_df["matchId"] = output_df["matchId"].astype(int)
output_df.to_csv(OUTPUT_FILE, index=False)
How can I convert every float column to integer?
The first solution is to add the parameter float_format='%.0f' to to_csv:
print(output_df.to_csv(index=False, float_format='%.0f'))
Division,comment,id,matchId,name
1,StackOverflow is quite good,3870,43,name1
1,StackOverflow is quite good,4537,43,name2
1,StackOverflow is quite good,5690,43,name3
A second possible solution is to apply a function convert_to_int instead of astype:
print(output_df)
Division comment id matchId name
0 1 StackOverflow is quite good 3870 43 name1
1 1 StackOverflow is quite good 4537 43 name2
2 1 StackOverflow is quite good 5690 43 name3
print(output_df.dtypes)
Division float64
comment object
id float64
matchId float64
name object
dtype: object
def convert_to_int(x):
    # Convert a column to int; leave it unchanged if it cannot be converted.
    try:
        return x.astype(int)
    except (ValueError, TypeError):
        return x
output_df = output_df.apply(convert_to_int)
print(output_df)
Division comment id matchId name
0 1 StackOverflow is quite good 3870 43 name1
1 1 StackOverflow is quite good 4537 43 name2
2 1 StackOverflow is quite good 5690 43 name3
print(output_df.dtypes)
Division int32
comment object
id int32
matchId int32
name object
dtype: object
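Another option is to avoid the float conversion in the first place. Here is a sketch, assuming the floats come from growing output_df one row at a time: collect the merged rows in a plain list and build the DataFrame once at the end, so pandas can infer integer dtypes from complete columns. The get_data below is a hypothetical stub, not the asker's function, and the inline CSV stands in for INPUT_FILE. DataFrame.append was also removed in pandas 2.0, so a pattern like this is needed there anyway.

```python
import io

import pandas as pd


def get_data(player_id):
    # Hypothetical stub standing in for the question's get_data().
    base = int(player_id) % 100
    return [
        {"matchId": base, "comment": "first"},
        {"matchId": base + 1, "comment": "second"},
    ]


# Stand-in for INPUT_FILE, using two rows from the question.
input_df = pd.read_csv(
    io.StringIO("Division,id,name\n1,3870,name1\n1,4537,name2\n")
)

rows = []
for record in input_df.to_dict("records"):
    for match in get_data(record["id"]):
        # Merge the input row with each match into one flat dict.
        rows.append({**record, **match})

# Building the frame once lets pandas infer integer dtypes for whole columns.
output_df = pd.DataFrame(rows)
csv_text = output_df.to_csv(index=False)
print(csv_text)
```

Building the list first also tends to be faster than repeated appends, since the DataFrame is not copied on every iteration.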
