Error processing a CSV file in Keras - Python

I'm working on building an LSTM recurrent neural network that processes a set of texts and uses them to predict the author of new texts. I have a CSV file containing a single column of long text entries, comma-separated like this:
"sample text here","more text here","extra text written here"
This goes on for a few thousand entries. I'm trying to load this so I can feed it through the Keras Tokenizer and then use it to train my model, but I'm stuck on an error originating in the first call to the Tokenizer, which kicks back with:
Traceback (most recent call last):
File "test.py", line 35, in <module>
t.fit_on_texts(X_train)
File "I:\....text.py",
line 175, in fit_on_texts
self.split)
File "I:\....text.py",
line 47, in text_to_word_sequence
text = text.translate(translate_map)
AttributeError: 'numpy.ndarray' object has no attribute 'translate'
I'm very new to Python, but as far as I can tell the issue is that the Tokenizer is expecting strings, but it's getting passed an ndarray instead. What I can't seem to manage is finding a way to pass it the correct thing, and I would really appreciate any advice. I've been working on this for a couple of days now and it's just not coming to me.
Here's the relevant section of my code:
X_train = pandas.read_csv('I:\\xTrain.csv', sep=",", header=None, error_bad_lines=False).as_matrix()
t = Tokenizer(lower=False)
t.fit_on_texts(X_train)
t.texts_to_matrix(X_train, mode='count', lower=False)
I've tried reading it in a variety of ways, including using numpy.loadtxt. The error has varied a bit with the methods, but it's always that I'm trying to feed the wrong kind of input to the Tokenizer and I can't seem to work out how to get the right kind. What am I missing here? Thanks for taking the time to read!
Update
With help from furas, I discovered that my array was two columns wide and have successfully removed the second empty column. Unfortunately, this seems to have simply changed the error I'm getting slightly. It now reads:
Traceback (most recent call last):
File "test.py", line 36, in <module>
t.fit_on_texts(X_train)
File "I:\....text.py",
line 175, in fit_on_texts
self.split)
File "I:\....text.py",
line 47, in text_to_word_sequence
text = text.translate(translate_map)
AttributeError: 'numpy.int64' object has no attribute 'translate'
The only change is that numpy.ndarray is now numpy.int64. It looks to me like this is an int array now, even though it contains strings of text, so I'm attempting to find a way to convert it into a string array.
del X_train[1]
X_train[0] = Y_train[0].apply(str)
That's the code I've tried so far. The first line strips the extra column, but the second line seems to do nothing. I'm still trying to figure out how to get this data into the proper format.
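One possible fix, as a minimal sketch: assuming the texts end up in the first column of the DataFrame (the path and column index here are just placeholders based on the snippet above), convert that column to a plain Python list of strings before calling fit_on_texts, so the Tokenizer never sees ndarrays or int64 values.
import pandas
from keras.preprocessing.text import Tokenizer

# Read the CSV without headers, keep only the text column, and force every
# entry to a Python str.
df = pandas.read_csv('I:\\xTrain.csv', sep=",", header=None)
texts = df[0].astype(str).tolist()  # plain list of strings

t = Tokenizer(lower=False)
t.fit_on_texts(texts)
matrix = t.texts_to_matrix(texts, mode='count')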

Related

How does the TrecTools library in Python produce evaluation results from evaluation metrics after passing two files to the TrecEval method?

Hello everyone, I am facing the below error while computing information retrieval evaluation metrics using trectools in Python. I am passing two files to the TrecEval method (prediction2, gold_labels1). In the gold_labels file, the first column is the tweet id and the third column is the claim id; in the prediction file, the first column is again the tweet id and the third column is the claim id. As you may notice, the claim id is different in the prediction file: it's in integer format there. Is that causing the error? Could you please also explain how the trectools library computes evaluation metrics? I'm not asking how the evaluation metrics (MAP@k, P@k) themselves are computed, but how the whole process works after giving the prediction and gold-labels files as parameters.
Error:
File "evaluate.py", line 41, in extract_metrics
print(results.get_precision(1))
File "C:\ProgramData\Anaconda3\lib\site-packages\trectools\trec_eval.py", line 670, in get_precision
merged = pd.merge(run[["query", "docid", "score"]], qrels[["query","docid","rel"]], how="left")
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py", line 74, in merge
op = _MergeOperation(
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py", line 656, in init
self._maybe_coerce_merge_keys()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py", line 1165, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Details about the Error:
I am facing this error, ValueError: You are trying to merge on int64 and object columns.
results = TrecEval(prediction, gold_labels)
results.get_metric(depth)
e.g. results.get_map(depth=1)
The error is in this line. Maybe it's due to the data that I am using, as the data looks like this:
prediction file
2019_bla_bla_bla(string format) Q0 4353(int format) 1 score label
2020_bla_bla_bla(string format) Q0 9923(int format) 1 score label
gold labels file
2019_bla_bla_bla1(string) 0 vclaim-pol-375(string) 1
2019_bla_bla_bla2(string) 0 vclaim-pol-16814(string) 1
Maybe trectools, while joining in the backend, is not able to find the same data type for the vclaim column in both data files, and that's why it's giving this error. So I just wanted to be sure: is there something wrong with the data, especially the vclaim id in both files?
tweet claim pairs file in TrecQrel format
tweet claim Prediction file in TrecRun format
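A rough sketch of one way to test that hypothesis, using the trectools classes named above (the file names are placeholders, and the run_data/qrels_data attribute names come from the trectools source), is to coerce the run's docid column to strings before evaluating:
from trectools import TrecRun, TrecQrel, TrecEval

run = TrecRun("prediction.txt")        # tweet-claim predictions in TrecRun format
qrels = TrecQrel("gold_labels.txt")    # tweet-claim pairs in TrecQrel format

# If the claim ids in the run were parsed as int64, cast them to str so the
# merge inside trectools sees matching dtypes in both frames.
run.run_data["docid"] = run.run_data["docid"].astype(str)

results = TrecEval(run, qrels)
print(results.get_map(depth=1))
print(results.get_precision(depth=1))
Note that this only removes the dtype mismatch; the scores will still be zero unless the claim ids in the run actually refer to the same claims as the vclaim ids in the qrels.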
Trectools is open source. You can read the code yourself to understand what is going on behind the curtains (https://github.com/joaopalotti/trectools). I think this can be a good exercise for you.
If your goal is figuring out what is wrong with your code, please share as much as possible. Copy and paste the smallest snippet of your code that results in this error, and also share the actual data that you are using. Don't share the data as an image, but copy and paste it here, like this:
2019_bla_bla_bla1 0 vclaim 1
2019_bla_bla_bla2 0 vclaim 1
Thanks,
J

Can I use h5py to write strings to an HDF5 file in one line, rather than looping over entries?

I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py
h5File=h5py.File('outfile.h5','w')
data=['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
However, when data gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.
I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
File "write_strings_to_HDF5_file.py", line 32, in <module>
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
dset_id.write(h5s.ALL, h5s.ALL, data)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py
h5File=h5py.File('outfile.h5','w')
sentence=['this','is','a','sentence']
data = []
for i in range(10000):
    data += sentence
print(len(data))
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word
h5File.flush()
h5File.close()
Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses that issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables. The linked answer shows performance for both h5py and PyTables.) As you can see, it takes a lot longer to write the same amount of data when writing a lot of small chunks.
To improve performance, you need to write more data each time. Since you have all the data loaded in list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data). However, it works from small lists (1/row), so it's not exactly the same. You have to take care when you create the array. You can't use NumPy's default Unicode dtype -- it isn't supported by h5py. Instead, you need dtype='S#'.
The code below shows how to convert your list of strings to a np.array() of strings. Also, I highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open due to an unexpected exit (from a crash or logic error).
Code below:
import h5py
import numpy as np
sentence=['this','is','a','sentence']
data = []
for i in range(10_000):
    data += sentence
print(len(data))
longest_word=len(max(data, key=len))
print('longest_word=',longest_word)
dt = h5py.special_dtype(vlen=str)
arr = np.array(data,dtype='S'+str(longest_word))
with h5py.File('outfile.h5','w') as h5File:
    dset = h5File.create_dataset('words',data=arr,dtype=dt)
    print(dset.shape, dset.dtype)

Error "buffer is too small for requested array" when modify a fits file with python

I'm trying to create a new FITS file from an initial template.fits.
This template.fits has a table in extension 1 with 3915 rows; my new file, instead, must have more than 50000 rows.
The part of the code is the following:
hdulist = fits.open('/Users/Martina/Desktop/Ubuntu_Condivisa/Post_Doc_IAPS/ASTRI/ASTRI_scienceTools/Astrisim_MC/template.fits')
hdu0=hdulist[0]
hdu0.writeto(out_pile+'.fits', clobber=True)
hdu1=hdulist[1]
hdu1.header['NAXIS2'] = na
hdu1.header['ONTIME'] = tsec
hdu1.header['LIVETIME'] = tsec
hdu1.writeto(out_pile+'.fits', clobber=True)
hdu1_data=hdu1.data
for j in range(na-1):
    hdu1_data[j+1][1]=j+1
    hdu1_data[j+1][3]=t[j]+0.
    hdu1_data[j+1][7]=ra[j]
    hdu1_data[j+1][8]=dec[j]
    hdu1_data[j+1][21]=enetot[j]
hdu1.writeto(out_pile+'.fits', clobber=True)
When I try to fill the new table (the last part of the code), the error is the following:
Traceback (most recent call last):
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 734, in __get__
return obj.__dict__[self._key]
KeyError: 'data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Astrisim_MC_4.py", line 340, in
hdu1_data=hdu1.data
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 736, in __get__
val = self.fget(obj)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 404, in data
data = self._get_tbdata()
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 171, in _get_tbdata
self._data_offset)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\base.py", line 478, in _get_raw_data
return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\file.py", line 279, in readarray
buffer=self._mmap)
TypeError: buffer is too small for requested array
I tried to vary the number of rows and the code works correctly up to 3969 rows.
How can I solve the problem?
Thank you very much in advance,
cheers!
Martina
Your initial problem was where you did this:
hdu1.header['NAXIS2'] = na
That's a natural thing to think you might be able to do, but you actually should not. In general, when working with astropy.io.fits, one should almost never manually mess with keywords in the FITS header that describe the structure of the data itself. This stems partly from the design of FITS itself--it mixes these structural keywords in with metadata keywords--and partly from a design issue with astropy.io.fits, in that it lets you manipulate these keywords at all, or that it doesn't more tightly tie the data to them. I wrote about this issue at more length here: https://github.com/astropy/astropy/issues/3836 but never got around to adding more explanation of this to the documentation.
Basically the way you can think about it is that when a FITS file is opened, its header is first read and parsed into a Header object containing all the header keywords. Some book-keeping is also done to keep track of how much data is in the file after the header. Then when you access the data of the HDU the header keywords are used to determine what the type and shape of the data is. So by doing something like
hdu1.header['NAXIS2'] = na
hdu1_data = hdu1.data
this isn't somehow growing the data in the file. Instead it's just confusing it into thinking there are more rows of data in the file than there actually are, hence the error "buffer is too small for requested array". The "buffer" it's referring to in this case is the rest of the data in the file, and you're requesting that it read an array that's longer than there is data in the file.
The fact that it allows you to break this at all is a bug in Astropy IMO. When the file is first opened it should save away all the correct structural keywords in the background, so that the data can still be loaded properly even if the user accidentally modifies these keywords (or perhaps the user should be completely prevented from modifying these keywords directly).
That's a long way to explain where you went wrong, but maybe it will help better understand how the library works.
As to your actual question, I think @Evert's advice is good: use the higher-level and easier-to-work-with astropy.table to create a new table that's the size you need, and then copy the existing table into the new one. You can open the FITS table directly as a Table object as well with Table.read. I think you can also copy the FITS metadata over, but I'm not sure exactly the best way to do that.
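For what it's worth, here is a rough, untested sketch of that approach (the file names and the target row count na are placeholders): read the existing table with Table.read, build a zero-filled table with the same columns but more rows, and copy the old rows in.
import numpy as np
from astropy.table import Table

old = Table.read('template.fits', hdu=1)
na = 50000

# New table with the same column names/dtypes, but na rows (zero-filled).
new = Table(np.zeros(na, dtype=old.as_array().dtype))

# Copy the existing rows into the top of the new table, column by column.
for name in old.colnames:
    new[name][:len(old)] = old[name]

new.write('new_file.fits', overwrite=True)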
One other minor comment unrelated to your main question--when working with arrays you don't have to (and in fact shouldn't) use for loops to perform vectorizable operations.
For example since this is just looping over array indices:
for j in range(na-1):
    hdu1_data[j+1][1]=j+1
    hdu1_data[j+1][3]=t[j]+0.
    hdu1_data[j+1][7]=ra[j]
    hdu1_data[j+1][8]=dec[j]
    hdu1_data[j+1][21]=enetot[j]
you can write operations like that as:
hdu1_data.field(1)[:] = np.arange(na)
hdu1_data.field(3)[:] = t + 0.
hdu1_data.field(7)[:] = ra
and so on (I'm not sure why you were doing j+1, because that skips the first row, but the point still stands). This assumes, of course, that the array being updated (hdu1_data, in this case) already has na rows. But that's why you need to grow or concatenate your array first if it's not already that size.

PyLdaVis : TypeError: cannot sort an Index object in-place, use sort_values instead

I am trying to visualize LDA topics in Python using PyLDAVis but I can't seem to get it right. My model has a vocab size of 150K words and about 16 Million tokens were taken to train it.
I am doing it outside of an iPython notebook and this is the code that I wrote to do it.
model_filename = "150k_LdaModel_topics_"+ topics +"_passes_"+passes +".model"
dictionary = gensim.corpora.Dictionary.load('LDADictSpecialRemoved150k.dict')
corpus = gensim.corpora.MmCorpus('LDACorpusSpecialRemoved150k.mm')
ldamodel = gensim.models.ldamodel.LdaModel.load(model_filename)
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.save_html(vis, "topic_viz_"+topics+"_passes_"+passes+".html")
I get the following error after 2-3 hours of running the code on a high-speed server with >30 GB of RAM. Can someone help me figure out where I am going wrong?
Traceback (most recent call last):
File "create_vis.py", line 36, in <module>
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
File "/local/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 110, in prepare
return vis_prepare(**opts)
File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 398, in prepare
token_table = _token_table(topic_info, term_topic_freq, vocab, term_frequency)
File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 267, in _token_table
term_ix.sort()
File "/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 1703, in sort
raise TypeError("cannot sort an Index object in-place, use "
TypeError: cannot sort an Index object in-place, use sort_values instead
There was a problem with the pyLDAvis code, and upon reporting the issue, it has been resolved.

PySpark serialization EOFError

I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically reducing the size of the DataFrame didn't prevent the EOF error.
Toy code and error below.
#set spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
#read in 500mb csv as DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
#get dataframe into machine learning format
r_formula = RFormula(formula = "outcome ~ .")
mldf = r_formula.fit(df).transform(df)
#fit random forest model
rf = RandomForestClassifier(numTrees = 3, maxDepth = 2)
model = rf.fit(mldf)
result = model.transform(mldf).head()
Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced prior to fitting the model (e.g. tinydf = df.sample(False, 0.00001)):
Traceback (most recent call last):
File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157,
in manager
File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61,
in worker
File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136,
in main if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545,
in read_int
raise EOFError
EOFError
The error appears to happen in the PySpark read_int function, the code for which is as follows (from the Spark site):
def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]
This would mean that when reading 4 bytes from the stream, if 0 bytes are read, an EOFError is raised. The Python docs are here.
I have faced the same issue and don't know how to debug it. It seems to cause the executor thread to get stuck and never return anything.
Have you checked to see where in your code the EOFError is arising?
My guess would be that it's coming when you attempt to define df, since that's the only place in your code where the file is actually being read.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
At every point after this line, your code is working with the variable df, not the file itself, so it would seem likely that this line is generating the error.
A simple way to test if this is the case would be to comment out the rest of your code, and/or place a line like this right after the line above.
print(df.count())
Another way would be to use a try block, like:
try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
            inferschema='true').load('myfile.csv')
except:
    print("Failed to load file into df!")
If it turns out that that line is the one generating the EOFError, then you're never getting the dataframes in the first place, so attempting to reduce them won't make a difference.
If that is the line generating the error, two possibilities come to mind:
Your code is opening one or both of the .csv files earlier on and isn't closing them prior to this line. If so, simply close them before this line.
There's something wrong with the .csv files themselves. Try loading them outside of this code and see if you can get them into memory properly in the first place, using something like csv.reader, and manipulate them in the ways you'd expect.
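A minimal sketch of that last check, assuming the 'myfile.csv' name from the question: read the file with the plain csv module and confirm that every row parses and has the expected number of fields.
import csv

n_rows = 0
with open('myfile.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            print("row", n_rows, "has", len(row), "fields, expected", len(header))

print("columns:", len(header), "data rows:", n_rows)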
