plotting specific columns based on user input - python

I have a dataframe with the following setup:
     a   c   g    s
Ind  b   d   t    d
0   11  12  22   33
1   13  14  44  101
The goal is to receive input from the user via a GUI, save the input as a list, and compare it with the list of headers in the dataframe. If the two match, then plot the matching column (the column will be the y-axis and the index will be the x-axis).
For example, if user selects [('c','d')] then I would like the code to plot that column. Here is what I have so far.
df = pd.read_csv('foo.csv',sep=r'\s*,\s*', encoding='ascii', engine='python')
header_list = [('a','b'),('c','d'),('g','t'),('s','d')]
user_Input_list = [('c','d')]
sub_list = []
for contents in header_list:
    for contents2 in user_Input_list:
        if contents == contents2:
            ax = df.reset_index().plot(x='Ind', y=x[header_list], kind='bar')
            for p in ax.patches:
                ax.annotate('{:.2E}'.format(Decimal(str(p.get_height()))),
                            (p.get_x(),p.get_height()))
            plt.tight_layout()
            plt.show()
I think the problem lies in how I am trying to select the y-axis with y=x[header_list].
Edit
Here is the Error message I get when I run the above code.
Traceback (most recent call last):
  File "/home/JM9/PycharmProjects/subfolder/subfolder.py", line 360, in <module>
    ax = x.reset_index().plot(x='Ind', y=x[header_list], kind='bar')
  File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/plotting/_core.py", line 780, in __call__
    data = data[y].copy()
  File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2982, in __getitem__
    return self._getitem_frame(key)
  File "/home/JM9/PycharmProjects/subfolder/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 3081, in _getitem_frame
    raise ValueError("Must pass DataFrame with boolean values only")
ValueError: Must pass DataFrame with boolean values only

I'm having trouble following your example code. One thing that could help is to re-create a simpler version of what you're trying to do, without the hard-to-deal-with column names.
import random
import pandas as pd
data = {
    'a': [random.randint(0, 50) for i in range(4)],
    'b': [random.randint(0, 50) for i in range(4)],
    'c': [random.randint(0, 50) for i in range(4)]
}
df = pd.DataFrame(data)
df.index = df.index.rename('Ind')
user_input = 'b'
if user_input in df.columns:
    ax = df[user_input].plot(x='Ind', kind='bar')
Some useful takeaways for you: (a) instead of looping you can perform a simple test to see if a user's input is equal to a data frame column (if user_input in df:) and (b) you can call .plot() against individual columns.
Edit: changed 'b' to user_input
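The same membership test also works with the two-level headers from the question, since each column key is then a tuple. Below is a minimal sketch, assuming the CSV was read into a frame with MultiIndex columns; the helper names `matching_columns` and `plot_selected` and the sample `user_input_list` are made up for illustration.

```python
import pandas as pd

# Rebuild the two-level header from the question (values copied from the post).
columns = pd.MultiIndex.from_tuples([("a", "b"), ("c", "d"), ("g", "t"), ("s", "d")])
df = pd.DataFrame([[11, 12, 22, 33], [13, 14, 44, 101]], columns=columns)
df.index.name = "Ind"


def matching_columns(frame, requested):
    """Return the requested column keys that exist in the frame (hypothetical helper)."""
    return [key for key in requested if key in frame.columns]


def plot_selected(frame, requested):
    """Draw one bar chart per matching column, with the index on the x-axis."""
    import matplotlib.pyplot as plt  # imported here so the selection works without it

    for key in matching_columns(frame, requested):
        ax = frame[key].plot(kind="bar", title=str(key))
        ax.set_xlabel(frame.index.name)
        plt.tight_layout()
        plt.show()


user_input_list = [("c", "d"), ("x", "y")]  # hypothetical GUI result
selected = matching_columns(df, user_input_list)
print(selected)
```

Calling `plot_selected(df, user_input_list)` would then plot only the `('c', 'd')` column and silently skip the key that does not exist.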

Related

Removing data points above/below value in python

I have a dataframe from which I am trying to remove all the values outside the range [-500, 500]; I simply want to remove the particular column/"Index" values that exceed this limit. I have tried a lot of different things, but nothing really seems to work. I tried the code below, but then I get this error:
  File "C:\Users\Jeffs.spyder-py3\kplr006779699.py", line 30, in <module>
    data = data[data['0'] < abs(500)]
  File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\indexes\range.py", line 354, in get_loc
    raise KeyError(key)
KeyError: '0'
which I'm guessing is because the column named '0' doesn't really have a column name.
from astropy.io import ascii
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
#Data from KIC 6779699
df = ascii.read(r'G:\Downloads\kplr006779699_kasoc-ts_llc_v1-2.dat')
# print(df)
x_Julian_data = df['col1']
x_data_raw = (x_Julian_data-54000)*86400 #Julian time to seconds: 60*60*24
data = np.linspace(0, 65541, num = int(65541) , endpoint = True)
y_data_raw = df['col2'] #Relative flux ppm
for i in range(65541-2):  # Cleaning up data
    data[i+1] = y_data_raw[i+1]-.5*(y_data_raw[i]+y_data_raw[i+2])
data[0] = 0
data[65540] = 0
data = pd.DataFrame(data)
data = data[data['0'] < abs(500)]
plt.plot(x_data_raw, data)
plt.xlim([1.1E8,1.25E8])
plt.ylim([-500,500])
I can't quite get it to work, even if I try using a definition.
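Is there an easier way to approach this?
The KeyError arises because pd.DataFrame(data) labels its only column with the integer 0, not the string '0'. "data" is a numpy array (created using np.linspace), so you can filter it by value before you create the data frame. Note that abs(500) is just 500, so comparing against it only removes values above the upper limit; take the absolute value of the data instead:
data = data[np.abs(data) <= 500]
new_df = pd.DataFrame(data, columns=['a_useful_column_name'])
(while debugging, consider using a new variable name for the DataFrame)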
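One thing to watch when filtering before plotting: the x values have to be filtered with the same mask, otherwise the two arrays end up with different lengths. A minimal sketch with made-up numbers standing in for `x_data_raw` and the cleaned flux values:

```python
import numpy as np

# Hypothetical stand-ins for x_data_raw and the cleaned flux values.
x = np.arange(6, dtype=float)
y = np.array([10.0, -620.0, 480.0, 505.0, -499.0, 0.0])

limit = 500
keep = np.abs(y) <= limit  # True where -500 <= y <= 500

# Apply the same boolean mask to both arrays so they stay aligned.
x_kept = x[keep]
y_kept = y[keep]

print(x_kept, y_kept)
```

The pair `x_kept`, `y_kept` can then be passed to `plt.plot` directly.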

Trouble importing Excel fields into Python via Pandas - index out of bounds error

I'm not sure what happened: my code worked earlier today, but now it won't. I have an Excel spreadsheet of projects whose fields I want to import individually and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
cols = [0,1,2,3,4,5,6,7,8]
df = df[df.columns[cols]]
tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)
bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)
uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)
vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)
ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)
xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)
yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)
zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
Your sheet appears to have 8 columns, but cols lists 9 column positions.
An "IndexError: index out of bounds" means you are trying to access a position that lies outside the valid range.
Whenever you load a file such as test.xls, test.csv or test.xlsx with pandas, for example:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it helps to check how many columns the DataFrame has before indexing into it, especially when working with large datasets, e.g.
import pandas as pd
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This gives you the exact number of columns that were read from the Excel spreadsheet. You can then list the column positions accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
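The column-count check can also be folded into the selection itself. The following is only a sketch: the 8-column frame is a made-up stand-in for the spreadsheet, and the names `wanted`, `valid` and `missing` are illustrative.

```python
import pandas as pd

# Hypothetical 8-column frame standing in for the spreadsheet contents.
df = pd.DataFrame([[0] * 8], columns=[f"col{i}" for i in range(8)])

wanted = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # asks for a ninth column
n_cols = df.shape[1]

valid = [i for i in wanted if i < n_cols]     # positions that exist
missing = [i for i in wanted if i >= n_cols]  # positions that would raise IndexError

subset = df.iloc[:, valid]  # positional selection, safe by construction
print("columns read:", n_cols, "missing positions:", missing)
```

Reporting `missing` instead of failing makes it easier to spot when a column has disappeared from the source file.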
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
If all nine columns were read, the output will be
(X, 9) # "X" is the number of rows
If the data frame has only 8 columns, df.shape will be (X, 8), which would explain the error you are getting.
Another check is to print out the first few rows of your data frame:
print(df.head())
This will let you double-check to see if you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns, but pandas is reading in only 8 of them.

How to handle missing value for gradient crosstab using Bokeh 0.10.0?

I am new to Bokeh and am using version 0.10.0, following this example.
I am introducing missing value in the pandas df by
# Swap a real numeric value to missing
data['Jan'][0] = np.nan
After the line
data = data.set_index('Year')
When it runs, it gives an error
Traceback (most recent call last):
  File "C:\Users\KubiK\Desktop\Try2.py", line 36, in <module>
    color.append(colors[min(int(monthly_rate)-2, 8)])
ValueError: cannot convert float NaN to integer
How can we tell Bokeh to skip that missing value?
I see two possible options.
[Option 1] Do a replace on the pandas DataFrame data beforehand and handle the color assignment in the for loop:
data.replace([np.nan], -1, inplace=True)
for y in years:
    for m in months:
        month.append(m)
        year.append(y)
        monthly_rate = data[m][y]
        rate.append(monthly_rate)
        if monthly_rate == -1:
            # placeholder value: use a fixed color instead of the palette
            color.append('#FFFFFF')
        else:
            color.append(colors[min(int(monthly_rate)-2, 8)])
[Option 2] Handle the np.nan in the for loop with an if.
for y in years:
    for m in months:
        month.append(m)
        year.append(y)
        monthly_rate = data[m][y]
        if np.isnan(monthly_rate):
            rate.append(-1)
            color.append('#FFFFFF')
        else:
            rate.append(monthly_rate)
            color.append(colors[min(int(monthly_rate)-2, 8)])
Notice I am assigning the color #FFFFFF, and the value of -1 but you can change it to what you want.
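The color lookup can also be pulled into a small function, which keeps the missing-value check in one place. This is a sketch only: `color_for` and the nine-step palette are invented here (the original example defines its own `colors` list), and the lower clamp with `max(0, ...)` is an addition that the original code does not have.

```python
import math

# Hypothetical nine-step palette; the original example defines its own `colors`.
colors = [f"#{v:02x}{v:02x}ff" for v in range(240, 60, -20)]


def color_for(rate, palette, missing_color="#FFFFFF"):
    """Map a rate to a palette entry, or to missing_color for NaN/None."""
    if rate is None or math.isnan(rate):
        return missing_color
    # Same index rule as the original loop, clamped to the palette bounds.
    index = max(0, min(int(rate) - 2, len(palette) - 1))
    return palette[index]


sample_rates = [4.0, float("nan"), 50.0]
sample_colors = [color_for(r, colors) for r in sample_rates]
print(sample_colors)
```

Inside the nested loop, `color.append(color_for(monthly_rate, colors))` would then replace the if/else branch.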

Get Pandas DataFrame first column

This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0,1], [2,3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's take a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
             range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
Data frame frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
  File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
    frame[[0]]
  File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
    return self._getitem_array(key)
  File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python-3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
    indexer = labels._convert_list_indexer(objarr, kind=self.name)
  File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python-3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
    return maybe_convert_indices(indexer, len(self))
  File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python-3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
    raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but I don't understand why I can't use simple slicing with [[]]. I use WinPython Spyder 3, if that helps.
using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
             range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
if you ask to print out the 'frame' you get:
Variable 1
loc_1 0 0
loc_2 1 1
loc_3 2 8
loc_4 3 27
loc_5 4 64
loc_6 5 125
......
So the cause of your problem becomes clear: you have no column labelled 0.
First you define a list called var_vec.
Then you make a dataframe out of that list, but you specify the index values and the column name (which is usually good practice).
The numeric column labels 0, 1, ... in your first example only appear when you don't specify column names; they are labels, not positional indexers.
If you want to access columns by their position, you can:
frame[frame.columns[0]]
What happens then is that you get the list of column labels of the dataframe, pick the first one (position 0), and pass it back to the dataframe as the key.
Hope that helps you understand.
Edit:
Another (better) way would be:
frame.iloc[:,0]
where ":" stands for all rows (rows can also be addressed by position, from 0 up to the number of rows minus one).
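The difference between labels and positions can be checked on a shortened version of the frame from the question. A minimal sketch, using 5 rows instead of 100 to keep it readable:

```python
import pandas as pd

# Shortened version of the frame from the question (5 rows instead of 100).
row_names = [f"loc_{i}" for i in range(1, 6)]
frame = pd.DataFrame({"Variable": range(5)}, index=row_names)
frame[1] = [i ** 3 for i in range(5)]  # adds a column whose LABEL is the integer 1

labels = list(frame.columns)          # the labels are 'Variable' and 1
has_zero_label = 0 in frame.columns   # there is no label 0

by_label = frame[1]                          # lookup by label
first_by_position = frame.iloc[:, 0]         # lookup by position
first_via_columns = frame[frame.columns[0]]  # position -> label -> lookup

print(labels, has_zero_label)
```

So `frame[[1]]` works only because a column happens to carry the label 1, not because it is the second column.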

Pandas dataframe float index and transpose error

I'm trying to properly import data from a space delimited file into a pandas dataframe so that I can plot it properly. My data file looks like so:
Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000 ...
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008 ...
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009 ...
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008 ...
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008 ...
So the test location is originally in the rows with the columns increasing in voltage to the right to 20V.
My code to read the data file into the dataframe is:
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, index_col="Vmeas", delim_whitespace=True, header=0)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index',axis=1, inplace=True)
    make_plots(df, save)
The actual plotting is done by:
def make_plots(df, save):
    voltage = np.arange(-5, 20, 0.5)
    plt.figure(figsize=(10, 7))
    for col in df:
        plt.plot(voltage, col, legend=False)
    plt.show()
At first, pandas treated the voltages as strings, and it doesn't play nicely with float indexes either. My first attempt produced a plot of the diode current-voltage relationship that started at 0 (http://i.imgur.com/wgIZCyq.jpg). I then re-indexed, but plotting still didn't work. Now I've re-indexed the dataframe and dropped the old index column, and when I check df.head() everything looks right:
Die_numbers vfd3051 vfd3151 vfd3251 vfd3351
Voltage
0 -3.202241e-08 -3.711351e-08 -4.728576e-08 -2.184733e-08
1 -1.493095e-09 -6.580329e-09 3.594383e-09 -3.710431e-08
2 1.377107e-08 -6.581644e-09 8.683344e-09 3.595368e-09
except now I keep running into a ValueError in mpl. I think this is related to the col values being strings instead of floats which I don't understand because it was printing the currents properly before.
Admittedly, I'm new to pandas but it seems like at every turn I am stopped, by my ignorance no doubt, but it's getting tiresome. Is there a better way to do this? Perhaps I should just ignore the first row of the logfile? Can I convert from scientific notation while reading the file in? Keep plugging away?
Thanks.
df.info() is:
Int64Index: 51 entries, 0 to 50
Columns: 1092 entries, vfd3051 to vfd6824
dtypes: float64(1092)
Everything seems to load into pandas correctly but mpl doesn't like something in the data. The columns are floats, I'm not using the index of integers. If the column names were being added as my first row, the columns would be treated as str or obj type. The error is:
Traceback (most recent call last):
  File "D:\Python\el_plot_top_10\IV_plot_all.py", line 51, in <module>
    make_plots(df, save)
  File "D:\Python\el_plot_top_10\IV_plot_all.py", line 21, in make_plots
    plt.plot(voltage, col, legend=False)
  File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2987, in plot
    ret = ax.plot(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 4139, in plot
    for line in self._get_lines(*args, **kwargs):
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 319, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 278, in _plot_args
    linestyle, marker, color = _process_plot_format(tup[-1])
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 131, in _process_plot_format
    'Unrecognized character %c in format string' % c)
ValueError: Unrecognized character f in format string
I figured out how to make this work entirely in pandas. Don't indicate an index nor a header row. Transpose the dataframe and drop the index. Then, create a list out of the first row of data which will be your string titles for the columns you really wanted. Assign the column names to this list and then reassign the dataframe to a sliced dataframe eliminating the first row of string names ('vfd3021' in my case).
After that, you're good to go. The columns are float and since my voltage range is fixed, I just create a list with arange when I plot.
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, delim_whitespace=True)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index', axis=1, inplace=True)
    names = df.iloc[0].values
    df.columns = names
    df = df[1:]
    make_plots(df, save)
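A likely reading of the traceback in the question is that `for col in df` yields the column labels rather than the column data, so `plt.plot(voltage, col, ...)` receives a string such as 'vfd3051' and tries to parse it as a format string. The sketch below shows this on a tiny made-up frame and gives a `make_plots` variant that iterates over `(label, Series)` pairs instead. It drops `legend=False`, which is a pandas plotting option rather than something `plt.plot` accepts, and takes the x values as an argument (with 51 rows in the real data, `np.arange(-5, 20, 0.5)` would be one value short).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame shaped like the transposed data: one column per die.
voltage = np.linspace(-5.0, -3.5, 4)
df = pd.DataFrame(
    {
        "vfd3051": [-3.20e-08, -1.49e-09, 1.38e-08, -1.17e-08],
        "vfd3151": [-3.71e-08, -6.58e-09, -6.58e-09, -6.58e-09],
    }
)

# Iterating over a DataFrame yields the column labels, not the column data.
labels_seen = [col for col in df]
print(labels_seen)


def make_plots(frame, x_values):
    """Plot every column of `frame` against `x_values` on a single figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting

    fig, ax = plt.subplots(figsize=(10, 7))
    for name, series in frame.items():  # (label, Series) pairs
        ax.plot(x_values, series.to_numpy())
    ax.set_xlabel("Voltage")
    plt.show()
    return ax
```

`make_plots(df, voltage)` would then draw one line per die.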
As far as I can see, all your problems come from not getting your data into the correct format to begin with. Focus on importing the data first, and print what you're going to plot to check that the types are what you expect them to be.
I would advise using a different method to import the data, as the file format is not what pandas works best with (e.g. it is transposed). For example, you could use numpy.genfromtxt; an introduction is given in the NumPy documentation.
import numpy as np
from io import StringIO  # on Python 2: from StringIO import StringIO
data_file = (
"""Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008
""")
data = np.genfromtxt(StringIO(data_file), dtype=None)
print(data)
>>> array([('Vmeas', -5.0, -4.5, -4.0, -3.5),
('vfd3051', -3.2e-08, -1.49e-09, 1.38e-08, -1.17e-08),
('vfd3151', -3.71e-08, -6.58e-09, -6.58e-09, -6.58e-09),
('vfd3251', -4.73e-08, 3.59e-09, 8.68e-09, -1.68e-08),
('vfd3351', -2.18e-08, -3.71e-08, 3.6e-09, -3.2e-08)],
dtype=[('f0', 'S7'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
So now we have a numpy array of tuples, each with the row name as its first element and the data as the rest of the tuple. Most importantly, all the numbers are numbers; try to avoid having strings, because conversions are messy.
Then we could do the following to get a nice pandas DataFrame:
DataDictionary = {row[0]:list(row)[1:] for row in iter(data)}
pd.DataFrame(DataDictionary)
First we create a dictionary of the data using a Python dictionary comprehension, then pass it into the DataFrame. This results in a well-behaved dataframe with columns named by the strings "Vmeas", "vfd*" and an integer index over the data.
   Vmeas       vfd3051       vfd3151       vfd3251       vfd3351
0   -5.0 -3.200000e-08 -3.710000e-08 -4.730000e-08 -2.180000e-08
1   -4.5 -1.490000e-09 -6.580000e-09  3.590000e-09 -3.710000e-08
2   -4.0  1.380000e-08 -6.580000e-09  8.680000e-09  3.600000e-09
3   -3.5 -1.170000e-08 -6.580000e-09 -1.680000e-08 -3.200000e-08
I doubt this will fully answer your question but it is a start to getting the data correct before plotting it which I think was your problem. Try to keep it as simple as possible!
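For comparison, the same layout can probably also be reached with pandas alone by reading the first column as the index, transposing, and converting the resulting index labels to floats. This is a sketch under the assumption that the file looks like the excerpt in the question; the variable `raw` stands in for the file contents.

```python
from io import StringIO

import pandas as pd

# `raw` stands in for the contents of the data file (excerpt from the question).
raw = """Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008
"""

# Read with the die names as the index, then transpose so each die is a column.
df = pd.read_csv(StringIO(raw), sep=r"\s+", index_col=0).T

# After the transpose the index holds the header strings; convert them to floats.
df.index = df.index.astype(float)
df.index.name = "Voltage"
df.columns.name = "Die_numbers"

print(df)
```

With the voltages in the index, `df.plot(legend=False)` would use them as the x-axis without a separately built `arange`.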
