I'm having an issue with the output of the enumerate function. It is adding parentheses and commas to the data, so the items look like tuples. I'm trying to use the list for a comparison loop. Can anyone tell me why these special characters resembling tuples are added? I'm going crazy trying to finish this, but this bug is blocking me.
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
#NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)
df.isnull().sum()/df.count()*100
df.dtypes
# Apply value_counts() on column LaunchSite
df[['LaunchSite']].value_counts()
# Apply value_counts on Orbit column
df[['Orbit']].value_counts()
#landing_outcomes = values on Outcome column
landing_outcomes = df[['Outcome']].value_counts()
print(landing_outcomes)
#following causes data issue
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
#following also causes an issue to the data
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = []
for value in df['Outcome'].items():
    if value in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)
df['Class']=landing_class
df[['Class']].head(8)
df.head(5)
df["Class"].mean()
The issue I'm having is
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
is changing my data and giving an output of
0 ('True ASDS',)
1 ('None None',)
2 ('True RTLS',)
3 ('False ASDS',)
4 ('True Ocean',)
5 ('False Ocean',)
6 ('None ASDS',)
7 ('False RTLS',)
additionally, when I run
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
my output is
{('False ASDS',),
('False Ocean',),
('False RTLS',),
('None ASDS',),
('None None',)}
I do not understand why the output is so far from what I expected, or how to correct it.
The extra parentheses appear because you selected the column with double brackets: df[['Outcome']] is a DataFrame, so value_counts() returns a Series whose keys are one-element tuples rather than plain strings. You can unpack the tuple directly in the loop. Try this

for i, (outcome,) in enumerate(landing_outcomes.keys()):
    print(i, outcome)

Or

for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome[0])
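Alternatively, select the column with single brackets so you are working with a Series instead of a DataFrame; value_counts() is then keyed by plain strings and no unpacking is needed. A minimal sketch of that approach:

# Single brackets give a Series, so value_counts() is keyed by plain strings.
landing_outcomes = df['Outcome'].value_counts()

for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome)          # e.g. 0 True ASDS

# Positional selection of the bad outcomes then also yields plain strings.
bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])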
Related
I'm a building-energy simulation modeller with an Excel question about enabling automated large-scale simulations using parameter samples (generated via Monte Carlo). My question concerns saving these samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say I have the following Excel file with 4 parameters (a, b, c, d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Does anybody have an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question in Excel. Solutions in Python are also welcome if that's easier for you.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to make this as simple as possible. I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install an Excel reader package such as xlrd or openpyxl, in addition to pandas)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive approach would be to loop over all items. Use string formatting to put the strings together the way you need them:
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open(). I'll name the files by a running counter, but this solution will overwrite older text files created by earlier runs of the script. You might want to add something unique, like the date and time or the name of the file you read, or increment the file name across runs of the script.
All together we get:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    file = open('test_{:03}.txt'.format(file_count), "w")
    file.write(s)
    file.close()
    file_count += 1
Note that this is probably not the most elegant way, and there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive version that you can tweak yourself easily. A more compact sketch follows below.
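For completeness, here is a more compact variant using DataFrame.iterrows(); it makes the same assumption about the input path and writes one file per row named 1.txt, 2.txt, and so on, matching the naming in the question:

import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
for i, (_, row) in enumerate(df.iterrows(), start=1):
    # Build the "column=value;" lines for this row and write them to <i>.txt.
    lines = "".join("{}={};\n".format(col, val) for col, val in row.items())
    with open('{}.txt'.format(i), 'w') as f:
        f.write(lines)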
I got this to work in Excel. You can expand the ranges of the variables x, y and z to match your situation and use the LastRow and LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long
    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x
    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file, then this Python script will do what you want.

with open('data.csv') as file:
    data_list = [line.rstrip('\n').split(',') for line in file]

# Row 0 holds the headers; write one .txt file per data row.
for row_number in range(1, len(data_list)):
    output_file_name = str(row_number) + '.txt'
    with open(output_file_name, 'w') as out_file:
        for col in range(len(data_list[row_number])):
            output_string = data_list[0][col] + '=' + data_list[row_number][col] + ';\n'
            out_file.write(output_string)
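One caveat: a plain split(',') breaks if any cell itself contains a comma (for example quoted text exported from Excel). If that can happen, the csv module handles the quoting for you; a minimal sketch assuming the same data.csv layout:

import csv

with open('data.csv', newline='') as f:
    rows = list(csv.reader(f))   # handles quoted fields that contain commas

header = rows[0]
for i, row in enumerate(rows[1:], start=1):
    with open('{}.txt'.format(i), 'w') as out_file:
        out_file.writelines('{}={};\n'.format(col, val) for col, val in zip(header, row))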
I have the following data frames
import pandas as pd
df_occurencies = pd.DataFrame({'day':[1,2,3,4,5],
'occ':[['frog','wasp','bee'],
['frog','whale','barley','orchid'],
['orchid','barley','frog'],
['orchid','whale','frog'],
['orchid','barley','tulip']]})
df_kingdoms = pd.DataFrame({'item':['frog','wasp','bee',
'whale','barley','orchid',
'tulip'],
'kingdom':['animalia','animalia','animalia',
'animalia','plantae','plantae',
'plantae']})
I need to add another column classifying the observations in the occ column based on the values in df_kingdoms.
The lists mix items from both kingdoms, so the desired outcome would look like this:
day occ desired_result
0 1 [frog, wasp, bee] "animals"
1 2 [frog, whale, barley, orchid] "animals and plants"
2 3 [orchid, barley, frog] "mostly plants"
3 4 [orchid, whale, frog] "mostly animals"
4 5 [orchid, barley, tulip] "plants"
I know that there are many ways to solve this; I've unsuccessfully tried a custom function with lots of .locs that I think is not even worth posting. I need to perform this on large datasets, so faster is better.
This should do:
dic_kd = {i: j for i, j in zip(df_kingdoms['item'], df_kingdoms['kingdom'])}
desired_output = []
for I in df_occurencies.occ:
    list_aux = [dic_kd[i] for i in I]
    if (list_aux.count('animalia') != 0) and (list_aux.count('plantae') == 0):
        desired_output.append('animals')
    elif (list_aux.count('animalia') == 0) and (list_aux.count('plantae') != 0):
        desired_output.append('plants')
    elif list_aux.count('animalia') > list_aux.count('plantae'):
        desired_output.append('mostly animals')
    elif list_aux.count('animalia') < list_aux.count('plantae'):
        desired_output.append('mostly plants')
    else:
        desired_output.append('animals and plants')
df_occurencies['desired output'] = desired_output
Tell me if you don't understand anything and I'll help you
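If you prefer to keep the classification logic in one small helper and let pandas drive the loop, an equivalent variant applies it with Series.apply; it is roughly the same speed, but arguably easier to tweak. The helper name label_row is my own:

from collections import Counter

def label_row(occ, mapping):
    # Count how many items in this row belong to each kingdom.
    counts = Counter(mapping[item] for item in occ)
    animals, plants = counts['animalia'], counts['plantae']
    if plants == 0:
        return 'animals'
    if animals == 0:
        return 'plants'
    if animals > plants:
        return 'mostly animals'
    if plants > animals:
        return 'mostly plants'
    return 'animals and plants'

mapping = dict(zip(df_kingdoms['item'], df_kingdoms['kingdom']))
df_occurencies['desired output'] = df_occurencies['occ'].apply(label_row, args=(mapping,))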
Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult returns a data format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1] all to no avail. Before I do something awful like converting it to a string and strip it out, I thought I'd ask here.
A one-element Series requires .item() to retrieve its scalar value.
print aresult.item()
1
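If the lookup could ever match more than one row, .item() raises an error, so it can be safer to take the first match explicitly; both of the following are standard pandas accessors:

aresult = result['prediction']
print aresult.iloc[0]      # first matching row, by position
print aresult.values[0]    # via the underlying NumPy array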
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain values (such as Prize_Pool) results in Python treating these entries as strings. I need to convert them to floats in order to make certain calculations. I've used Python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is now numerical:
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
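As a side note, in more recent pandas versions the take_last argument was replaced by the keep parameter, so the equivalent call would be:

d.drop_duplicates('Contest_Date_EST', keep='first')  # or keep='last' for the last occurrence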
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print "\nKey: " + key
            i = 1
            for value in rows[key]:
                print "\nValue {index} : {value}".format(index = i, value = value)
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'rU') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = { key.strip(): key for key in reader.fieldnames }
            # Format each row into the final string
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
        return groupedRows

    def normalizeRow(self, row):
        row[1] = float(row[1].replace(',','')) # "Prize_Pool"
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
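One thing to keep in mind is that itertools.groupby() only groups consecutive rows that share a key, so if the CSV is not already ordered by Contest_Date_EST you would want to sort the rows first, for example by changing the grouping step to:

rows_by_date = sorted(reader, key=lambda x: x["Contest_Date_EST"])
for k, g in itertools.groupby(rows_by_date, lambda x: x["Contest_Date_EST"]):
    groupedRows[k] = [self.normalizeRow(v.values()) for v in g]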
I have a series of space-delimited data files in x y format as below for a dummy data set, where y represents independent sample population means for value x.
File1.dat
1 15.99
2 17.34
3 16.50
4 18.12
File2.dat
1 10.11
2 12.76
3 14.10
4 19.46
File3.dat
1 13.13
2 12.14
3 14.99
4 17.42
I am trying to compute the standard error of the mean (SEM) line-by-line to get an idea of the spread of the data for each value of x. As an example using the first line of each file (x = 1), a solution would first compute the SEM of sample population means 15.99, 10.11, and 13.13 and print the solution in format:
x1 SEMx1
...and so on, iterating for every line across the three files.
At the moment, I envisage a solution to be something along the lines of:
Read in the data using something like numpy, perhaps specifying only the line of interest for the current iteration. e.g.
import numpy as np
data1 = np.loadtxt('File1.dat')
data2 = np.loadtxt('File2.dat')
data3 = np.loadtxt('File3.dat')
Use a tool such as Scipy stats, calculate the SEM from the three sample population means extracted in step 1
Print the result to stdout
Repeat for remaining lines
While I imagine other stats packages such as R are well-suited to this task, I'd like to try and keep the solution solely contained within Python. I'm fairly new to the language, and I'm trying to get some practical knowledge in using it.
I see this as being a problem ideally suited for Scipy from what I've seen here in the forums, but haven't the vaguest idea where to start based upon the documentation.
NB: These files contain an equal number of lines.
First let's try to get just the columns of data that we need:
import numpy as np
filenames = map('File{}.dat'.format, range(1,4)) # ['File1.dat', ...]
data = map(np.loadtxt, filenames) # 3 arrays, each 4x2
stacked = np.vstack([arr[:, 1] for arr in data])
Now we have just the data we need in a single array:
array([[ 15.99, 17.34, 16.5 , 18.12],
[ 10.11, 12.76, 14.1 , 19.46],
[ 13.13, 12.14, 14.99, 17.42]])
Next:
import scipy.stats as ss
result = ss.sem(stacked)
This gives you:
array([ 1.69761925, 1.63979674, 0.70048396, 0.59847956])
You can now print it, write it to a file (np.savetxt), etc.
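For instance, to write the result in the x SEM format described in the question, you could pair the x column from one of the files with the SEM values and save them; a minimal sketch following on from the code above (the output name sem.txt is just an example):

# Pair each x value (first column of File1.dat) with its SEM and write them out.
x = np.loadtxt('File1.dat')[:, 0]
np.savetxt('sem.txt', np.column_stack((x, result)), fmt='%g')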
Since you mentioned R, let's try that too!
filenames = c('File1.dat', 'File2.dat', 'File3.dat')
data = lapply(filenames, read.table)
stacked = cbind(data[[1]][2], data[[2]][2], data[[3]][2])
Now you have:
V2 V2 V2
1 15.99 10.11 13.13
2 17.34 12.76 12.14
3 16.50 14.10 14.99
4 18.12 19.46 17.42
Finally:
apply(stacked, 1, sd) / sqrt(length(stacked))
Gives you:
1.6976192 1.6397967 0.7004840 0.5984796
This R solution is actually quite a bit worse in terms of performance, because it uses apply on all the rows to get the standard deviation (and apply is slow, because it invokes the target function once per row). This is because base R does not offer row-wise (nor column-wise, etc.) standard deviation. And I needed sd because base R does not offer SEM. At least you can see it gives the same results.