Cannot plot or use .tolist() on pd dataframe column - python

I am reading in data from a CSV and saving it to a DataFrame so I can use the columns. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is turning my data from this:
[image: before .tolist()]
To this:
[image: after .tolist()]
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!

So, the df.ch104.values.tolist() code basically turns your column into a 2D 1xN array, but what you want is a 1D array of size N.
So transpose it before you call .tolist(). Finally, index with [0] to convert the resulting 1xN array into a flat array of size N:
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid the error 'value' must be an instance of str or bytes, not a None:
df.dropna(subset=['ch104']).ch104.values.tolist()[0]

The error clearly says there are None or NaN values in your dataframe. You need to check for None and deal with them - replace with a suitable value or delete them.
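As a minimal sketch of "check for None and deal with them", the snippet below coerces a column to numbers and drops the rows that could not be parsed before building the scan index. The sample values are made up for illustration, and it assumes the column was read in as strings mixed with missing entries, which the question does not confirm.

```python
import pandas as pd

# Hypothetical stand-in for the CSV column: numeric strings mixed with
# a missing value and an unparseable entry.
df = pd.DataFrame({"ch104": ["1.5", None, "2.5", "bad", "4.0"]})

# Coerce to float; anything that cannot be parsed (None, "bad") becomes NaN.
ch104 = pd.to_numeric(df["ch104"], errors="coerce")

# Drop the NaN rows and rebuild a 1-based scan index of matching length.
clean = ch104.dropna().reset_index(drop=True)
scan = list(range(1, len(clean) + 1))
values = clean.tolist()
```

The resulting scan and values lists have equal length and contain only numbers, so they can be passed to plt.plot(scan, values).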

Related

Inconsistent behavior in dataframe while changing dtypes automatically

I'm seeing a behavior with Pandas DataFrames where attempting to assign a value with a data type that is incompatible with the existing dtype of the Series may or may not work, depending on the existing dtype of the Series. For instance, if the Series is of dtype int64 and I assign a list then it will fail with the error:
ValueError: setting an array element with a sequence.
... on the other hand if the initial dtype was bool then the assignment of a single element list will work, but only the contents of the list will be stored in the DataFrame, not the list + contents as expected/intended.
The code below shows some of this unpredictable behavior ... in this instance I try to store a DataFrame inside a DataFrame and can get it to produce a mangled result that stores only part of the nested DataFrame. If I first wrap the DataFrame in a list, Pandas will unwrap it and store it fine ... but again, not as a list of DataFrames as expected.
I can work around the issue by changing the dtype of the Series to a compatible dtype prior to the assignment but maybe someone can explain what's going on here or point to the relevant documentation. Thanks.
import pandas as pd
test_df_1 = pd.DataFrame( {'A':[1]} )
print('*** New DataFrame #1 ****************************')
print(test_df_1)
test_df_1.iloc[0,0] = 2
print('\n*** Works as Expected: [0,0]=2 ******************')
print(test_df_1)
test_df_2 = pd.DataFrame( { 'X':'Z',
'Y':[5] })
print('\n*** New DataFrame #2 ****************************')
print(test_df_2)
test_df_1.iloc[0,0] = [1.2] # this works fine
#test_df_1.iloc[0,0] = [12] # this will break
#test_df_1.iloc[0,0] = test_df_2 # this will execute but can't explain behavior
test_df_1.iloc[0,0] = [test_df_2] # Why do I have to wrap in a list?
print('\n*** Nested DataFrame ****************************')
print(test_df_1)
print('\n*** Retrieve the DataFrame **********************')
out = test_df_1.iloc[0,0]
print(type(out)) # why not a list?
print(out)
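
The workaround mentioned in the question (changing the dtype before the assignment) might look like the sketch below. It casts the column to object and uses .at for a single-cell assignment; the frame and the value [1, 2] are illustrative, and this is one possible approach rather than an explanation of the behavior described above.

```python
import pandas as pd

test_df = pd.DataFrame({"A": [1]})

# An int64 column cannot hold a list, so cast it to object first.
test_df["A"] = test_df["A"].astype(object)

# .at targets exactly one cell, so the list is stored as a single element.
test_df.at[0, "A"] = [1, 2]

out = test_df.at[0, "A"]
print(type(out))
print(out)
```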

Python iloc slice range from dictionary value

I am trying to use a dictionary value to define the slice ranges for the iloc function but I keep getting the error -- Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] . The excel sheet is built for visual information and not in any kind of real table format (not mine so I can’t change it) so I have to slice the specific ranges without column labels.
tried code - got the error
cr_dict= {'AA':'[42:43,32:65]', 'BB':'[33:34, 32:65]'}
df = my_df.iloc[cr_dict['AA']]
the results I want would be similar to
df = my_df.iloc[42:43,32:65]
I know I could change the dictionary and use the following but it looks convoluted and not as easy to read– is there a better way?
Code
cr_dict= {'AA':[42,43,32,65], 'BB':[33,34,32,65]}
df = my_df.iloc[cr_dict['AA'][0]: cr_dict['AA'][1], cr_dict['AA'][2]: cr_dict['AA'][3]]
Define your dictionaries slightly differently.
cr_dict= {'AA':[42,43]+list(range(32,65)),
'BB':[33,34]+list(range(32,65))}
Then you can slice your DataFrame like so:
>>> my_df.iloc[cr_dict["AA"], cr_dict["BB"]].sort_index()
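
Another option, sketched here under the assumption that the goal is exactly my_df.iloc[42:43, 32:65], is to store slice objects in the dictionary instead of strings. The stand-in my_df below is invented for the example; np.s_ is NumPy's helper for writing slices with the usual colon syntax.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: 50 rows x 70 unlabeled columns.
my_df = pd.DataFrame(np.arange(50 * 70).reshape(50, 70))

# Each value is a (row_slice, column_slice) tuple that iloc accepts directly.
cr_dict = {
    "AA": (slice(42, 43), slice(32, 65)),
    "BB": np.s_[33:34, 32:65],  # same kind of tuple, written with slice syntax
}

aa = my_df.iloc[cr_dict["AA"]]
bb = my_df.iloc[cr_dict["BB"]]
```

Both lookups return the same frames as writing the ranges inline, so the dictionary stays readable and no string parsing is needed.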

Sorting numpy matrix results in value error

I created a numpy matrix called my_stocks with two columns:
column 0 is made by specific objects that I defined
column 1 is made of integers
I try to sort it by column 1 with np.sort(my_stocks), but this raises a ValueError.
The full method:
def sort_stocks(self):
    my_stocks = np.empty(2)
    for data in self.datas:
        if data.buflen() > self.p.period:
            new_stock = [data, get_slope(self, data)]
            my_stocks = np.vstack([my_stocks, new_stock])
    self.stocks = np.sort(my_stocks)
It doesn't cause problems if instead of sorting the matrix, I print it. A print example is below:
[<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ec2eb0>
1.1081551020408162]
[<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ec2190>
0.20202275819418677]
[<backtrader.feeds.yahoo.YahooFinanceData object at 0x124eda610>
0.08357118119975258]
[<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ecc400>
0.5487027829313539]]
I figured out that the method was sorting my matrix by row and not by column.
This works even if it looks rather counterintuitive:
my_stocks[my_stocks[:, 1].argsort()]
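
To illustrate why the argsort line works, here is a small sketch with made-up data: strings stand in for the YahooFinanceData objects, and the slopes are rounded versions of the printed values. argsort returns the row order that sorts column 1, and indexing with that order rearranges whole rows, so each object stays paired with its slope.

```python
import numpy as np

# Hypothetical rows: [object, slope]. Strings stand in for the data feeds.
rows = [
    ["stock_a", 1.108],
    ["stock_b", 0.202],
    ["stock_c", 0.084],
    ["stock_d", 0.549],
]
my_stocks = np.array(rows, dtype=object)

# Row order that sorts column 1 ascending (cast to float for the comparison).
order = my_stocks[:, 1].astype(float).argsort()

# Fancy indexing with that order moves complete rows.
sorted_stocks = my_stocks[order]
names = sorted_stocks[:, 0].tolist()
```

Reversing the order with order[::-1] would give the steepest slope first.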

Why does pandas.to_numeric result in a list of lists?

I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name        Relative_Pressure Volume_STP
Unit                        -      ccm/g
Description              p/p0
0                    0.042691    29.3601
1                    0.078319    30.3071
2                    0.129529    31.1643
3                    0.183355    31.8513
4                    0.233435    32.3972
5                    0.280847    32.8724
However if I only want to get the values of the column Relative_Pressure I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
       [0.078319],
       [0.129529],
       [0.183355],
       [0.233435],
       [0.280847]])
Of course, I could flatten every column I want to use:
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However, this would take a lot of extra effort and would also reduce readability. How can I make sure the data is flat for the whole data frame?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can concatenate and flatten them using numpy's built-in functions, e.g.
x = data['isotherm']['Relative_Pressure'].values.flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column belonging to your MultiIndex object is with a tuple as follows:
data[('isotherm', 'Relative_Pressure')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know if you are dealing with a copy of the data or a view of the data. Please do a SO search of pandas' SettingWithCopyWarning for more details or read the docs here.
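
A small sketch of the MultiIndex effect, using a three-level column index like the one in the question. The two data rows come from the table above; the empty description for Volume_STP is an assumption. Selecting with only the first level leaves a one-column DataFrame (2D values), while selecting with the full tuple returns a Series (1D values).

```python
import pandas as pd

# Three-level column index as in the question; the "" description is assumed.
columns = pd.MultiIndex.from_arrays(
    (["Relative_Pressure", "Volume_STP"], ["-", "ccm/g"], ["p/p0", ""]),
    names=["Name", "Unit", "Description"],
)
df = pd.DataFrame([[0.042691, 29.3601], [0.078319, 30.3071]], columns=columns)

# Partial key: the remaining levels survive, so this is a DataFrame (2D).
partial = df["Relative_Pressure"].values

# Full key, one label per level: this is a Series (1D).
full = df[("Relative_Pressure", "-", "p/p0")].values

# Alternative: keep the partial key and flatten the 2D values.
flat = df["Relative_Pressure"].values.ravel()
```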

Pandas convert columns type from list to np.array

I'm trying to apply a function to a pandas DataFrame. The function requires two np.array inputs and fits them using a well-defined model.
The problem is that I can't apply this function to the selected columns, because their "rows" contain lists read from a JSON file rather than np.array objects.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
# so I tried to cast the Series before passing them to the function, in each of these ways:
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].astype(np.array)
X_col2_casted = train_df['col2'].astype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
What I'm thinking of doing now is a long procedure like this:
Starting from the uncast column Series, convert them with list(), iterate over them, apply the function to the individual np.array() elements, and append the results to a temporary list. Once done, I will convert this list into a new column. (Clearly, I don't know if it will work.)
Does anyone know how to help me?
EDIT:
I add one example to be clear:
The function assumes it receives two np.arrays as input. At the moment it gets two lists, since they are retrieved from a JSON file. The situation is this:
col1 col2 result
[1,2,3] [4,5,6] [5,7,9]
[0,0,0] [1,2,3] [1,2,3]
Clearly the function is not a sum but a custom function. For the moment, assume that this sum only works on arrays and not on lists. What should I do?
Thanks in advance
Use apply to convert each element to its equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df
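
Putting the pieces together, a sketch of the row-by-row call might look like this. my_function here is a hypothetical stand-in (an element-wise sum that rejects anything but arrays), since the real function is not shown; the sample rows are taken from the question's example table.

```python
import numpy as np
import pandas as pd

def my_function(a, b):
    """Hypothetical stand-in for the real model; only accepts ndarrays."""
    if not (isinstance(a, np.ndarray) and isinstance(b, np.ndarray)):
        raise TypeError("expected numpy arrays")
    return a + b

train_df = pd.DataFrame({
    "col1": [[1, 2, 3], [0, 0, 0]],
    "col2": [[4, 5, 6], [1, 2, 3]],
})

# Convert each pair of lists to arrays and call the function once per row.
train_df["result"] = [
    my_function(np.asarray(a), np.asarray(b))
    for a, b in zip(train_df["col1"], train_df["col2"])
]
```

This leaves col1 and col2 as lists and converts only at call time; converting the columns first, as in the answer above, works equally well.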
