I have this code to process CTD data. It works normally until I try to process the salinity column, at which point I get this error:
KeyError: "['SALINITY;PSU'] not in index"
But when I check the columns, they are all there. Here is the code:
df = pd.read_csv('/home/labdino/PycharmProjects/CTDprocessing/venv/DadosCTD_tabulacao.csv',
                 sep='\t',
                 skiprows=header,
                 )

down, up = df.split()
down = down[["TEMPERATURE;C", "SALINITY;PSU"]]

process = (down.remove_above_water()
               .remove_up_to(idx=7)
               .despike(n1=2, n2=20, block=100)
               .lp_filter()
               .press_check()
               .interpolate()
               .bindata(delta=1, method="average")
               .smooth(window_len=21, window="hanning")
               )

process.head()
Output:
Traceback (most recent call last):
File "/home/labdino/PycharmProjects/CTDprocessing/venv/CTDLab.py", line 47, in <module>
down = down[["TEMPERATURE;C", "SALINITY;PSU"]]
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3811, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6108, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6171, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['SALINITY;PSU'] not in index"
This code works with any other column, but not with salinity, and I checked the CSV file and it looks normal.
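A quick diagnostic sketch (not part of the script above, just an illustration) for printing the exact column labels pandas parsed; a trailing space or stray character around "SALINITY;PSU" would look identical when printed normally but would cause exactly this KeyError:

# Diagnostic only: show the repr of every column name in the frame
# loaded above, so hidden whitespace or odd characters become visible.
for col in df.columns:
    print(repr(col))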
I'm using a market data API and Pandas DataFrames to filter and restructure the data before storing it in a database. Before it goes into InfluxDB, I need to restructure a date column which currently looks like:
Earnings
May 15/b
Apr 09/a
What I have so far is below, but I'm getting this error:
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/OMT/Demo/scratchpad.py", line 41, in <module>
filtered['NewEarnings'] = pd.to_datetime(filtered['NewEarnings'], format='%b%d')
File "/usr/local/lib/python3.8/dist-packages/pandas/core/tools/datetimes.py", line 801, in to_datetime
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/tools/datetimes.py", line 178, in _maybe_cache
cache_dates = convert_listlike(unique_dates, format)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/tools/datetimes.py", line 460, in _convert_listlike_datetimes
raise e
File "/usr/local/lib/python3.8/dist-packages/pandas/core/tools/datetimes.py", line 423, in _convert_listlike_datetimes
result, timezones = array_strptime(
File "pandas/_libs/tslibs/strptime.pyx", line 144, in pandas._libs.tslibs.strptime.array_strptime
ValueError: time data '-' does not match format '%b%d' (match)
The code in question -
filtered = df
earningsColumn = filtered['Earnings'].squeeze()
stripped = earningsColumn.str.rstrip('. /a/b')
filtered['NewEarnings'] = stripped
filtered['NewEarnings'] = pd.to_datetime(filtered['NewEarnings'], format='%b%d')
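For reference, a minimal sketch (not from the original post) of one way to make the conversion tolerant of placeholder values such as '-': pd.to_datetime accepts errors='coerce', which turns unparseable entries into NaT instead of raising. The sample values below are made up to mirror the Earnings column, and note the space added to the format string, since the stripped values keep a space between month and day:

import pandas as pd

# Hypothetical sample mirroring the Earnings column, including the '-'
# placeholder that appears in the traceback above.
s = pd.Series(['May 15/b', 'Apr 09/a', '-'])

stripped = s.str.rstrip('. /a/b')  # same stripping step as in the post
parsed = pd.to_datetime(stripped, format='%b %d', errors='coerce')
print(parsed)  # the '-' row becomes NaT instead of raising ValueError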
I want to insert data into a DataFrame like:
df = pd.DataFrame(columns=["Date", "Title", "Artist"])
The insertion happens here:
df.insert(loc=0, column="Date", value=dateTime.group(0), allow_duplicates=True)
df.insert(loc=0, column="Title", value=title, allow_duplicates=True)
df.insert(loc=0, column="Artist", value=artist, allow_duplicates=True)
Sadly, I don't know how to handle these errors:
Traceback (most recent call last):
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 697, in _try_cast
subarr = maybe_cast_to_datetime(arr, dtype)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1067, in maybe_cast_to_datetime
value = maybe_infer_to_datetimelike(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 865, in maybe_infer_to_datetimelike
if isinstance(value, (ABCDatetimeIndex, ABCPeriodIndex,
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/generic.py", line 9, in _check
return getattr(inst, attr, '_typ') in comp
TypeError: 'in <string>' requires string as left operand, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/sashakaun/IdeaProjects/scrapyscrape/test.py", line 24, in <module>
df.insert(loc=0, column="Title", value=title, allow_duplicates=True)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in insert
self._ensure_valid_index(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3424, in _ensure_valid_index
value = Series(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/series.py", line 261, in __init__
data = sanitize_array(data, index, dtype, copy,
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 625, in sanitize_array
subarr = _try_cast(data, False, dtype, copy, raise_cast_failure)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 720, in _try_cast
subarr = np.array(arr, dtype=object, copy=copy)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/bs4/element.py", line 971, in __getitem__
return self.attrs[key]
KeyError: 0
It's my first question, so please be kind.
Thanks in advance!
The error seems to come from your value=dateTime.group(0). Can you elaborate on the structure of dateTime?
Also, df.insert() inserts a column rather than adding rows of data to a DataFrame.
You should first transform your data into a Series (or a one-row DataFrame) and then use pd.concat() to concatenate it with the original DataFrame.
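A minimal sketch of that approach, using the column names from the question (the row values here are made up purely for illustration):

import pandas as pd

df = pd.DataFrame(columns=["Date", "Title", "Artist"])

# Build each new record as a one-row DataFrame, then concatenate it onto
# the existing frame; df.insert() would add a new column instead of a row.
new_row = pd.DataFrame([{"Date": "2020-01-01", "Title": "Some Title", "Artist": "Some Artist"}])
df = pd.concat([df, new_row], ignore_index=True)
print(df)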
Below are some references:
https://kite.com/python/answers/how-to-insert-a-row-into-a-pandas-dataframe
Add one row to pandas DataFrame
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I have a dictionary whose keys are words and whose values are arrays of 300 floats.
I'm unable to use this dictionary in my PySpark UDF.
When the dictionary has 2 million keys it does not work, but when I reduce it to 200K keys it works.
This is the code for the function to be converted to a UDF:
def get_sentence_vector(sentence, dictionary_containing_word_vectors):
    cleanedSentence = list(clean_text(sentence))
    words_vector_list = np.zeros(300)  # 300-dimensional vector
    for x in cleanedSentence:
        try:
            words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
        except Exception as e:
            print("Exception caught while finding word vector from Fast text pretrained model Dictionary: ", e)
    return words_vector_list.tolist()
This is my UDF
get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))
This is how I'm calling the UDF to add the result as a column in my dataframe:
dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
And this is the stack trace for the error
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
return udf_obj(*args)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
judf = self._judf
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
broadcast = sc.broadcast(pickled_command)
File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
return Broadcast(self, value, self._pickled_broadcast_vars)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
self._path = self.dump(value, f)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
How big is your fast_text_dictionary in the 2M case? It might be too big.
Try broadcasting it first before running the UDF, e.g.:
broadcastVar = sc.broadcast(fast_text_dictionary)
Then use broadcastVar.value inside your UDF.
See the documentation for broadcast.
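A minimal sketch of how that could look end to end, assuming an existing SparkContext sc, the DataFrame df, and the get_sentence_vector function from the question (everything else is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# Broadcast the large dictionary once so executors fetch it as a broadcast
# variable instead of it being pickled into the UDF's closure.
broadcastVar = sc.broadcast(fast_text_dictionary)

# Inside the UDF, read the dictionary back through .value.
get_sentence_vector_udf = F.udf(
    lambda val: get_sentence_vector(val, broadcastVar.value),
    ArrayType(FloatType()),
)

dmp_df_with_vectors = (
    df.filter(df.item_name.isNotNull())
      .withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
)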
I am trying to create a separate Python file with the code given below. When calling the method, I pass mydata as a DataFrame with these columns:
['wage', 'educ', 'exper', 'tenure'].
import pandas as pd
import numpy as np
from prettytable import PrettyTable as pt

def LinearRegressionOLS(mydata, target_column):
    if not isinstance(mydata, pd.DataFrame):
        raise TypeError("Data must be of type Data Frame")
    if not isinstance(target_column, str):
        raise TypeError("target_column must be String")
    if target_column not in mydata.columns:
        raise KeyError("target_column doesn't exist in Data Frame")
    data = mydata.copy()
    data["one"] = np.ones(data.count()[target_column])
    column_list = ["one"]
    for i in data.columns:
        column_list.append(i)
    Y = data[target_column].as_matrix()
    data.drop(target_column, inplace=True, axis=1)
    X = data[column_list].as_matrix()
    del data
    beta = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), Y)
    predY = np.matmul(X, beta)
    total = np.matmul((Y - np.mean(Y)).T, (Y - np.mean(Y)))
    residual = np.matmul((Y - predY).T, (Y - predY))
    sigma = np.matmul((Y - predY).T, (Y - predY)) / (X.shape[0] - X.shape[1])
    omega = np.square(sigma) * np.linalg.inv(np.matmul(X.T, X))
    SE = np.sqrt(np.diag(omega))
    tstat = beta / SE
    Rsq = 1 - (residual / total)
    final = pt()
    final.add_column(" ", column_list)
    final.add_column("Coefficients", beta)
    final.add_column("Standard Error", SE)
    final.add_column("t-stat", tstat)
    print(final)
    print("Residual: ", residual)
    print("Total: ", total)
    print("Standard Error: ", sigma)
    print("R Square: ", Rsq)
After running the above code, I call the function as shown below:
>>> c
['wage', 'educ', 'exper', 'tenure']
>>> import LR_OLS as inf
>>> inf.LinearRegressionOLS(file[c],"wage")
and get this error:
Traceback (most recent call last):
File "<pyshell#182>", line 1, in <module>
inf.LinearRegressionOLS(file[c],"wage")
File "E:\python\LR_OLS.py", line 29, in LinearRegressionOLS
File "C:\Program Files\Python35\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Program Files\Python35\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Program Files\Python35\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['wage'] not in index"
Can anyone help me understand why I am getting this error? How can I resolve it?
The problem is that you still have 'wage' in column_list. To keep it from ever getting in there, adapt the loop as follows:
for i in data.columns:
    if i != 'wage':  # add this line to your code
        column_list.append(i)
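The same idea written against the function's own argument, so it works for any target column rather than only 'wage' (a hypothetical generalization, not part of the original answer):

column_list = ["one"]
for i in data.columns:
    if i != target_column:  # skip the dependent variable
        column_list.append(i)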
I have an h5 store named weather.h5. My default Python environment is 3.5.2. When I try to read this store I get TypeError: Already tz-aware, use tz_convert to convert.
I've tried both pd.read_hdf('weather.h5', 'weather_history') and pd.io.pytables.HDFStore('weather.h5')['weather_history'], but I get the error either way.
I can open the h5 in a Python 2.7 environment. Is this a bug in Python 3 / pandas?
I have the same issue. I'm using Anaconda Python 3.4.5 and 2.7.3; both use pandas 0.18.1.
Here is a reproducible example:
generate.py (to be executed with Python 2):
import pandas as pd
from pandas import HDFStore
index = pd.DatetimeIndex(['2017-06-20 06:00:06.984630-05:00', '2017-06-20 06:03:01.042616-05:00'], dtype='datetime64[ns, CST6CDT]', freq=None)
p1 = [0, 1]
p2 = [0, 2]
# Saving any of these dataframes cause issues
df1 = pd.DataFrame({"p1":p1, "p2":p2}, index=index)
df2 = pd.DataFrame({"p1":p1, "p2":p2, "i":index})
store = HDFStore("./test_issue.h5")
store['df'] = df1
#store['df'] = df2
store.close()
read_issue.py:
import pandas as pd
from pandas import HDFStore
store = HDFStore("./test_issue.h5", mode="r")
df = store['/df']
store.close()
print(df)
Running read_issue.py in Python 2 has no issues and produces this output:
                                  p1  p2
2017-06-20 11:00:06.984630-05:00   0   0
2017-06-20 11:03:01.042616-05:00   1   2
But running it in Python 3 produces an error with this traceback:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2779, in read
ax = self.read_index('axis%d' % i)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2367, in read_index
_, index = self.read_index_node(getattr(self.group, key))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2492, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/indexes/base.py", line 153, in new
result = DatetimeIndex(data, copy=copy, name=name, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tseries/index.py", line 321, in new
raise TypeError("Already tz-aware, use tz_convert "
TypeError: Already tz-aware, use tz_convert to convert.
Closing remaining open files:./test_issue.h5...done
So, there is an issue with indices. However, if you save df2 in generate.py (datetimes stored as a column rather than as the index), then reading it with read_issue.py in Python 3 produces a different error:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['/df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2788, in read
placement=items.get_indexer(blk_items))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 2518, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 90, in init
len(self.mgr_locs)))
ValueError: Wrong number of items passed 2, placement implies 1
Closing remaining open files:./test_issue.h5...done
Also, if you execute generate.py in Python 3 (saving either df1 or df2), then there is no problem executing read_issue.py in either Python 3 or Python 2.