I am passing the dataframe below (rxn_summary_df) into a pandastable Table via results, and it shows up as expected, but I'm unable to produce most of the plots; they fail with ValueError: subplots should be a bool or an iterable.
results = Table(frame, dataframe=rxn_summary_df, showtoolbar=True, showstatusbar=True, showindex=True,width=x, height=y, align='center')
rxn_summary_df
MW Count Mass Mol % Wt %
Name
(('DETA', 1),) 103.169 900 92852.10 68.2335 38.0899
(('C181', 1),) 282.470 108 30506.76 8.1880 12.5145
(('C181', 1), ('DETA', 1)) 367.620 288 105874.56 21.8347 43.4320
(('C181', 2), ('DETA', 1)) 632.070 23 14537.61 1.7437 5.9636
[Screenshot: window display of results]
[Screenshot: plot of Name vs. Wt %]
I've confirmed with the dir function that all of these are iterable. I thought maybe the odd format of the names was doing something, but even after dropping this column and using a simple numeric index, the same error results. I just want to get all of the columns into an iterable format.
I'm at a loss. A little backstory on me: I'm a beginner/intermediate Python programmer.
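Not a confirmed fix, but a minimal sketch of one way to simplify the odd Name labels before handing the frame to the Table widget (assuming rxn_summary_df is the frame shown): flatten the tuple index into plain strings.
# Turn the tuple-of-tuples Name index into readable strings,
# e.g. (('C181', 1), ('DETA', 1)) -> "C181 x1 + DETA x1".
rxn_summary_df = rxn_summary_df.copy()
rxn_summary_df.index = [
    " + ".join(f"{name} x{count}" for name, count in key)
    for key in rxn_summary_df.index
]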
Related
I'm having an issue with the output from the enumerate function: it is adding parentheses and commas into the data. I'm trying to use the list for a comparison loop. Can anyone tell me why these special characters resembling tuples are added? I'm going crazy trying to finish this, but this bug is causing issues.
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
#NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)
df.isnull().sum()/df.count()*100
df.dtypes
# Apply value_counts() on column LaunchSite
df[['LaunchSite']].value_counts()
# Apply value_counts on Orbit column
df[['Orbit']].value_counts()
#landing_outcomes = values on Outcome column
landing_outcomes = df[['Outcome']].value_counts()
print(landing_outcomes)
# the following causes the data issue
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
# the following also causes an issue with the data
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = []
for value in df['Outcome'].items():
    if value in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)
df['Class']=landing_class
df[['Class']].head(8)
df.head(5)
df["Class"].mean()
The issue I'm having is
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)
is changing my data and giving an output of
0 ('True ASDS',)
1 ('None None',)
2 ('True RTLS',)
3 ('False ASDS',)
4 ('True Ocean',)
5 ('False Ocean',)
6 ('None ASDS',)
7 ('False RTLS',)
Additionally, when I run
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
my output is
{('False ASDS',),
('False Ocean',),
('False RTLS',),
('None ASDS',),
('None None',)}
I do not understand why the data returned is so far from what I expected, or how to correct it.
Try this
for i, (outcome,) in enumerate(landing_outcomes.keys()):
    print(i, outcome)
Or
for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome[0])
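The tuples come from the double brackets: df[['Outcome']] is a one-column DataFrame, and DataFrame.value_counts() returns a Series indexed by a MultiIndex of one-element tuples, which is exactly what your printout shows. A minimal sketch of the alternative, assuming the same dataframe: call value_counts() on the Series instead, and the keys are plain strings.
# Series.value_counts() gives a plain string index, so no tuples appear.
landing_outcomes = df['Outcome'].value_counts()
for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome)   # e.g. 0 True ASDS

bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])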
I have some code that I am trying to run and it should be fairly simple; it is just math. But whenever I try to do the math with the pandas DataFrame I'm using, I get an error. I know that it is the column labeled 'first' that is giving me issues; I have gone through and checked all the others. I have also tried a few things to convert it to a column of floating-point values so I can do the math with it, but I get errors every time. I am attaching the code along with comments about what I have tried and what errors I have been getting.
Any help would be greatly appreciated! I am very stuck on this. Thank you!
import pandas as pd

# Set constants
pi = 3.14159265359
e = 2.71828
h = 6.62607004*(10**-34) # J*s
c = 299792458 # m / s
kb = 1.380649*(10**-23) # J/K
temp3 = 3000 # K
temp10 = 10000 # K
constant = (2*pi*h*(c**2))
bb_df = pd.DataFrame({ 'wl_nm' : range(200, 1101 ,1)}) # Gets wavelength ranges I will want plotted
#bb_df.wl
bb_df['wl_m'] = (bb_df.wl_nm * (10**-9)) # Gets wavelength in meters (this one does work doing math with)
bb_df['first'] = constant/((bb_df.wl_m)**5) # This one does not work doing math with; says it's a method, not number, and cannot figure out how to change it
#bb_df['first'] = bb_df['first'].astype(float) # Tried this, but get error: TypeError: Cannot broadcast np.ndarray with operand of type <class 'method'>
#float(bb_df['first']) # Tried this, but get error: TypeError: cannot convert the series to <class 'float'>
bb_df['exponent'] = (h*c)/((bb_df.wl_m)*kb*temp3)
bb_df['denominator'] = e ** (bb_df.exponent) - 1
bb_df['second'] = 1 / bb_df.denominator
bb_df['second'] = bb_df.second + 1
bb_df['final'] = (bb_df.first) * (bb_df.second) # ERROR (because of bb_df.first)
#bb_df['test'] = float(bb_df.first) - float(bb_df.second)
#bb_df['intensity'] = (((2*pi*h*(c**2))/((bb_df.wl_m**5))(1/(e**((h*c)/((bb_df.wl_m)*kb*temp3))-1)))) # Also just tried typing out entire equation here, but this also gives an error
print(bb_df)
When I comment out all the lines that are not working, this is the dataframe I get. It is the 'first' column that it says is a method, and I have had trouble converting it to a floating-point value to do math with. I thought perhaps it was because the numbers are so small, but then I should not have had the issue when I tried typing out the entire equation at once either (the 'intensity' column attempt, which also did not work):
wl_nm wl_m first exponent denominator second
0 200 2.000000e-07 1.169304e+18 23.979614 2.595417e+10 1.000000
1 201 2.010000e-07 1.140505e+18 23.860313 2.303537e+10 1.000000
2 202 2.020000e-07 1.112552e+18 23.742192 2.046898e+10 1.000000
3 203 2.030000e-07 1.085418e+18 23.625236 1.820969e+10 1.000000
4 204 2.040000e-07 1.059074e+18 23.509426 1.621836e+10 1.000000
.. ... ... ... ... ... ...
896 1096 1.096000e-06 2.366053e+14 4.375842 7.850652e+01 1.012738
897 1097 1.097000e-06 2.355289e+14 4.371853 7.819001e+01 1.012789
898 1098 1.098000e-06 2.344583e+14 4.367871 7.787533e+01 1.012841
899 1099 1.099000e-06 2.333935e+14 4.363897 7.756247e+01 1.012893
900 1100 1.100000e-06 2.323346e+14 4.359930 7.725142e+01 1.012945
[901 rows x 6 columns]
From the documentation:
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.
You cannot use bb_df.first to mean the same thing as bb_df['first'], because there is already a .first method on the DataFrame class. That is why the error message tells you that bb_df.first is a method: it is one, pre-defined by Pandas. The 'first' column of your DataFrame contains floating-point values the entire time, and no attempt to convert already-floating-point values to floating point is relevant, because the problem is not with the column. The problem is that the code bb_df.first does not access the column.
Just use indexing consistently (bb_df['first']). Attribute access is only meant as a convenience, and it is one that runs into exactly this kind of problem.
Instead of typing bb_df.columnname, change it to bb_df['columnname']. It fixes the issue!
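As a concrete illustration (a minimal sketch using the same frame and column names as above), the failing line becomes:
# Bracket indexing always reaches the column, even when the name
# collides with an existing DataFrame method such as .first.
bb_df['final'] = bb_df['first'] * bb_df['second']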
I am sorry, I truly don't know what title I should use. But here is my question.
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database, what I exactly want to do is to obtain the correlation of open price and volume in 4 days for each stock (The first column consists of codes of different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only simplified example, the real data should be extended to more than 1000 different stocks.)
My attempt was to create a dataframe and run a loop, assigning the results to that dataframe. But here is the problem: the index of the created dataframe is not exactly what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the values of the correlations, which I concocted here just to give an example.)
r = pd.DataFrame(index = range(6), columns = ['c'])

for i in range(6):
    r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])

Correlation_in_4days = pd.concat([Stocks_Open,Stocks_Volume], axis = 1)
Correlation_in_4days['corr'] = r['c']

for i in range(6):
    Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realize that my correlation dataframe's index is an integer range and not the stock codes, but I don't know how to fix it. Is there any help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
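For completeness, a minimal sketch of the whole row-wise correlation, assuming Stocks_Open and Stocks_Volume share the same stock-code index and the same d-1 through d-4 columns:
# Build the result on the stock-code index so the labels line up.
r = pd.DataFrame(index=Stocks_Open.index, columns=['corr'])
for code in Stocks_Open.index:
    r.loc[code, 'corr'] = Stocks_Open.loc[code].corr(Stocks_Volume.loc[code])

# Or, without an explicit loop, correlate each row of one frame
# with the matching row of the other:
r = Stocks_Open.corrwith(Stocks_Volume, axis=1).to_frame('corr')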
I have a csv file that has positions and velocities of particles saved like this:
x, y, z, vx, vy, vz
-0.960, 0.870, -0.490, 962.17, -566.10, 713.40
1.450, 0.777, 2.270, -786.27, 63.31, -441.00
-3.350, -1.640, 1.313, 879.20, 637.76, -556.24
-0.504, 2.970, -0.278, 613.22, -717.32, 557.02
0.338, 0.220, 0.090, -927.18, -778.77, -443.05
...
I'm trying to read this file and save it as a pandas dataframe in a script with read_csv, but I get an error when calling any column except the first one:
AttributeError: 'DataFrame' object has no attribute 'y'
I would never get the error for the 'x' column, so I wrote a snippet to see if I could figure out where the reading error was stemming from.
import pandas as pd
data = pd.read_csv('snap.csv')
print(data)
print(data.x)
print(data.y)
The console correctly prints out
x y z vx vy vz
0 -0.960 0.870 -0.490 962.17 -566.10 713.40
1 1.450 0.777 2.270 -786.27 63.31 -441.00
2 -3.350 -1.640 1.313 879.20 637.76 -556.24
3 -0.504 2.970 -0.278 613.22 -717.32 557.02
4 0.338 0.220 0.090 -927.18 -778.77 -443.05
...
meaning it is assigning the columns the correct names. Then
0 -0.960
1 1.450
2 -3.350
3 -0.504
4 0.338
...
showing it can take one of the columns out correctly. But then it throws the error again when trying to print the second column
AttributeError: 'DataFrame' object has no attribute 'y'
I then looped through data.itertuples() to print the first row individually in order to see what that looked like, and it confirmed that the names were only being assigned to the first column and none of the others.
Pandas(Index=0, x=-0.96, _2=0.87, _3=-0.49, _4=962.17, _5=-566.1, _6=713.4)
There aren't any other problems with the data. The values all correspond to the right index. It's just that the names are not being assigned correctly and only the first column can be called by name. I tried putting single quotes around each column name, and that shows the exact same errors.
I know there are ways I might be able to work around this such as assigning the names in the read_csv function, but I'm curious as to what the issue could actually be so as to avoid having this happen again.
Try declaring column names when you create the data frame.
df = pd.DataFrame(pd.read_csv("file.csv"), columns=["x", "y", "z", "vx", "vy", "vz"])
df = pd.read_csv("snap.csv", names=["x", "y", "z", "vx", "vy", "vz"], header=0)
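A likely root cause, judging from the header shown (this is an assumption about the file): there are spaces after the commas, so pandas names the columns ' y', ' z', and so on, and only data.x resolves as an attribute. A minimal sketch of two ways to deal with that:
# Option 1: tell the parser to drop whitespace after each delimiter.
data = pd.read_csv('snap.csv', skipinitialspace=True)

# Option 2: strip the whitespace from the parsed column names afterwards.
data = pd.read_csv('snap.csv')
data.columns = data.columns.str.strip()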
I have a log file which I need to plot in Python, with the different data points as a multi-line plot, one line for each unique point. The problem is that in some samples some points are missing and new points are added in others, as shown in this example, where each line denotes a sample of n points and n is variable:
2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2),('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20), ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]
In the first sample above, 'mouse' is present but absent from the second line, and new data points like 'cod' and 'fish' are added in each sample.
So how can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help to plot this timed log file? Also, being a log file, the samples number in the thousands, so the visualization should be able to display them properly.
I'm also interested in applying multivariate hexagonal binning to this, with a different-colored hexagon for each unique column ('ball', 'mouse', etc.). scikit offers hexagonal binning, but I can't figure out how to render different colors for each hexagon based on the unique data point. Any other visualization technique would also help.
Getting the data into pandas:
import pandas as pd

rows = []
with open(logfilepath) as f:
    for line in f:
        timestamp = line.split(',')[0]
        # the data part of each line can be evaluated directly as a Python list
        # (ast.literal_eval would be a safer choice than eval for untrusted logs)
        data = eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            rows.append({'timestamp': timestamp, 'name': name, 'value': value})

df = pd.DataFrame(rows, columns=['timestamp', 'name', 'value'])

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index='timestamp', columns='name')
df2 = df2.fillna(0)
df2
Out[142]:
value
name ball cod fish keypassed mouse toy
timestamp
2015-06-20 16:42:48 2 0 0 13 1 2
2015-06-21 16:42:48 7 1 1 20 0 5
Plot the data:
import matplotlib.pyplot as plt
df2['value'].plot()
plt.show()
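Each unique name becomes its own line in that plot. If explicit colors and axis labels are wanted, a small variation (a sketch assuming the df2 built above; 'tab10' is just an arbitrary colormap choice):
# One line per unique name, colored from a named matplotlib colormap.
ax = df2['value'].plot(colormap='tab10', figsize=(10, 5))
ax.set_xlabel('timestamp')
ax.set_ylabel('count')
plt.show()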