How to use dataframes within a map function in Spark?

How to use dataframes within a map function in Spark? - python

Definitions:
sampleDF is sample dataframe that has a list record for look up purposes.
sampleDS is a RDD with list of elements in it.
mappingFunction is to look up the elements of sampleDS in sampleDF and map them to 1 if they exist in sampleDF and to 0 if they don't.
I have a mapping function as follows:
def mappingFunction(element):
# The dataframe lookup!
lookupResult = sampleDF.filter(sampleDF[0] == element).collect()
if len(lookupResult) > 0:
print lookupResult
return 1
return 0
The problem:
Accessing sampleDF outside of the mapping function works perfectly fine but as soon as I use it inside the function I get the following error:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:744)
What else I tried:
I did try saving a temporary table and using the sqlContext select inside the map function and still couldn't get this to work. This is the error I get:
File "/usr/lib64/python2.6/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.6/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.6/pickle.py", line 686, in _batch_setitems
save(v)
File "/usr/lib64/python2.6/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/opt/spark/python/pyspark/cloudpickle.py", line 542, in save_reduce
save(state)
File "/usr/lib64/python2.6/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.6/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.6/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib64/python2.6/pickle.py", line 306, in save
rv = reduce(self.proto)
TypeError: 'JavaPackage' object is not callable
What I am asking for:
I have tried to simplify my problem through simple example. Any help on how to use a dataframe inside a map function is highly appreciated.

It is not possible. Spark doesn't support nested operations on distributed data structures (RDDs, DataFrames, Datasets). Even if it did executing large number of jobs wouldn't be a good idea. Given the code you've shown you probably want to convert your RDD to a DataFrame and perform join wit
(rdd.map(x => (x, )).toDF(["element"])
.join(sampleDF, sampleDF[0] == df[0])
.groupBy("element")
.agg(count("element") > 0))
On a side note printing inside map is completely useless not mention it add additional IO overhead.

Related

Why does pandas.read_csv read all of the blank values in one code but not the other?

I am writing code that requires use of the pandas.read_csv function and I like to use a test python file before I implement my code into the main python file. The part of the CSV file I am trying to read in just has data displayed in a seemingly random fashion with no real columns or headers. I just want to read the information into a dataframe so I can parse the data exactly where I know it is going to be for every file. For the test code and the main code I am using the same list of CSV files. The only difference is the test code runs in a different folder and does not sit inside a function. In my test code I have no issue extracting data from the CSV file using the read_csv function but on my main program it is giving me errors. In my test code I can easily use pd.read_csv this way:
for x in range(len(filelist)):
df = pd.read_csv(filelist[x], index_col=False, nrows=15, header=None, usecols=(0,1,2,3),
dtype={0:"string",1:"string",2:"string",3:"string"})
print(df)
The output is shown below:
Output from test code execution
However, when I try to port this over into my main code it won't function the same way. If I copy the code exactly it says there is no column 1,2, or 3. My next step was to erase the usecols and dtype variables and then it gave me the error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2
I tried adding the comma delimiter and I tried changing the engine to python. Neither worked. Eventually, I determined from the tokenizing error that the program was expecting less information on certain lines so I broke up the code so that there were two dataframes and each would hold a respective amount of columns. This finally worked. The dataframes I created were structured as shown below:
df1 = pd.read_csv(filelist[x],skiprows=range(1), index_col=False, nrows=11, header=None)
df2 = pd.read_csv(filelist[x],skiprows=range(0,13), index_col=False, nrows=2,
usecols=(0,1,2), header=None)
print(df1)
print(df2)
The output for this is shown below:
Output from main code execution
This gives me something I can work with to accomplish my task, but it was extremely frustrating working through this and I have no idea why I even needed to go through all of this. I still will have to go back through and make some final adjustments including all the calls to the variables I need from these so if I can figure out why it is not working the same in the main code it would make my life a little easier. Does anyone have any clue why I had to make these adjustments? It seems the code for my main program is not reading in empty cells or just takes the amount of spaces used for the first row it looks at and just assumes the rest should be the same. Any information would be greatly appreciated. Thank you.
I am adding the full error messages below. I made it so it only calls the first file in the list for debugging purposes. This first one is when I copy the read_csv command over exactly:
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 484, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None, usecols=(0,1,2,3),
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1231, in _make_engine
return mapping[engine](f, **self.options)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 146, in __init__
self._validate_usecols_names(
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\base_parser.py", line 913, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: [1, 2, 3]
This next error occurs after I remove usecols and dtype from the parameters.
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 483, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 817, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2
This final set of errors is given after I add the delimiter=',' and engine='python' parameters while usecols and dtypes have still been removed.
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 483, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None, delimiter=',', engine='python')
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 270, in read
alldata = self._rows_to_cols(content)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 1013, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 739, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 2 fields in line 13, saw 4

Conflict between dill and pickle while using pyspark

Currently pyspark uses 2.4.0 version as part of conda installation. pip installation allows to use a later version of pyspark which is 3.1.2. but using this version, dill library has conflicts with pickle library.
i use this for unit test for pyspark. If I import dill library in test script, or any other test which imports dill which is run along with the pyspark test using pytest, it breaks.
The error it gives the below given error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/pyspark/serializers.py", line 437, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
cp.dump(obj)
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
return Pickler.dump(self, obj)
File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/opt/conda/lib/python3.6/pickle.py", line 610, in save_reduce
save(args)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 736, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
This happens in /opt/conda/lib/python3.6/pickle.py file in save function. After persistent id and memo check it tries to get the type of the obj and if that is ‘cell’ class, it tries to get the details of it in the next line using self.dispatch.get function. On using pyspark 2.4.0 returns ‘None’ and it works well but on using pyspark 3.1.2, it returns an object and it forces the object to use save_reduce function. It is unable to save it since the cell is empty. Eg: <cell at 0x7f0729a2as66: empty>,
If we force the return value to be None for pyspark 3.1.2 installation, it works, but that needs to happen gracefully, than by hardcoding.
Anyone had this issue ? any suggestion on using which versions of dill, pickle and pyspark to use together.
here is the code that is being used
import pytest
from pyspark.sql import SparkSession
import dill # if this line is added, test does not work with pyspark-3.1.2
simpleData = [
("James", "Sales", "NY", 90000, 34, 10000),
]
schema = ["A", "B", "C", "D", "E", "F"]
#pytest.fixture(scope="session")
def start_session(request):
spark = (
SparkSession.builder.master("local[1]")
.appName("Python Spark unit test")
.getOrCreate()
)
yield spark
spark.stop()
def test_simple_rdd(start_session):
rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
assert rdd.stdev() == 2.0
This works with pyspark 2.4.0 but does not work with pyspark 3.1.2 with the above given error.
dill version - 0.3.1.1
pickle version - 4.0
python - 3.6

Apparently you aren't using dill except to import it. I assume you will be using it later...? As I mentioned in my comment, cloudpickle and dill do have some mild conflicts, and this appears to be what you are experiencing. Both serializers add logic to the pickle registry to tell python how to serialize different kinds of objects. So, if you use both dill and cloudpickle, there can be conflicts as the pickle registry is a dict -- so the order of import and etc matters.
The issue is similar to as noted here:
https://github.com/tensorflow/tfx/issues/2090
There's a few things you can try:
(1) some codes allow you to replace the serializer. So, if you are able replace cloudpickle for dill, then that may resolve the conflicts. I'm not sure this can be done with pyspark, but there is a pyspark module on serializers, so that is promising...
Set PySpark Serializer in PySpark Builder
(2) dill provides a mechanism to help mitigate some of the conflicts in the pickle registry. If you use dill.extend(False) before using cloudpickle, then dill.extend(True) before using dill, it may clear up the issue you are seeing.

How do I efficiently retrieve a non-initial slice of a Flywheel scan?

Given a large Dynamo table with lots of items, I would like to be able to start a scan and later resume iteration on it from an unrelated Python context, as if I had continued calling the next() of gen() on the scan itself.
What I am trying to avoid:
offset = 500
count = 25
scan_gen = engine.scan(AModel).gen()
for _ in range(offset):
scan_gen.next()
results = [scan_gen.next() for _ in range(count)]
Because this would require restarting the scan from the top, every single time.
I see that the DynamoDB API normally works in a cursor-like fashion with the LastEvaluatedKey property: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
Is there a way to use this to jump ahead in scan generator in Flywheel?
Failing that, is there a way to serialize the state of the generator? I have tried pickling the generator, and it causes pickle.PicklingError due to name resolution problems:
>>> with open('/tmp/dump_dynamo_result', 'wb') as out_fp:
... pickle.dump(engine.scan(AModel).gen(), out_fp, pickle.HIGHEST_PROTOCOL)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python2.7/pickle.py", line 396, in save_reduce
save(cls)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 748, in save_global
(obj, module, name))
pickle.PicklingError: Can't pickle <type 'generator'>: it's not found as __builtin__.generator

Yes, you can build the LastEvaulatedKey yourself and pass it in as the ExclusiveStartKey in another scan. It will basically be the key (hash or hash/range) of the last item that was evaluated -- i.e. the item that you want to start scanning from.
If you want to see what it looks like, do a Scan and set the limit to 2 or something small. You'll get a LastEvaluatedKey returned and you can examine it and determine how to build it yourself. Then, choose an item from your database from where you want to start your scan and create a LastEvaluatedKey out of it.

tkinter: How to serialize a treeview?

I am working on a python tkinter application that reads initial data from yaml file into a hierarchical TreeView to be edited further by the user.
To implement "save data" and "undo" functions, should I walk the treeview and reconstruct the data into a python object to be serialized (pickle)?
Or is there a python module allowing, for example, to specify the treeview and the output file to be saved on?

I doubt there's any Python module that does what you want, and even if there was, I don't think you'd want to structure your application around using it. Instead you would probably be better off decoupling things and storing the primary data in something independent of the human interface (which may or may not be graphical and might vary or otherwise be changed in the future). This is sometimes called the application "Model".
Doing so will allow you to load and save it regardless of what constitutes the current human interface. So, for example, you would then be free to use pickle if the internal Model is comprised of one or more Python objects. Alternatively you could save the data back into a yaml format file which would make loading it back in again later a cinch since the program can already do that.
Likewise, as the user edits the TreeView, equivalent changes should be made to the Model to keep the two in sync.
Take some time out from coding and familiarize yourself with the Model–View–Controller (MVC) design pattern.

Out of the box, the answer is no, you can't serialize a TreeView. dill is probably your best bet at serialization out of the box… and it fails to pickle a TreeView object.
>>> import ttk
>>> import Tkinter as tk
>>>
>>> f = tk.Frame()
>>> t = ttk.Treeview(f)
>>>
>>> import dill
>>> dill.pickles(t)
False
>>> dill.detect.errors(t)
PicklingError("Can't pickle 'tkapp' object: <tkapp object at 0x10eda75e0>",)
>>>
You might be able to figure out how to pickle a TreeView, and then add that method to the pickle registry… but, that could take some serious work on your part to chase down how things fail to pickle.
You can see what happens, it hits the __dict__ of the Tkinter.Tk object, and dies trying to pickle something.
>>> dill.detect.trace(True)
>>> dill.dumps(t)
C2: ttk.Treeview
D2: <dict object at 0x1147f5168>
C2: Tkinter.Frame
D2: <dict object at 0x1147f1050>
C2: Tkinter.Tk
D2: <dict object at 0x1148035c8>
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 194, in dumps
dump(obj, file, protocol, byref, fmode)#, strictio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 184, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 725, in save_inst
save(stuff)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 678, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 725, in save_inst
save(stuff)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 678, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 725, in save_inst
save(stuff)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.3.dev0-py2.7.egg/dill/dill.py", line 678, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 313, in save
(t.__name__, obj))
pickle.PicklingError: Can't pickle 'tkapp' object: <tkapp object at 0x10eda7648>
>>>
That something is a tkapp object.
So, if you'd like to dig further, you can use the methods in dill.detect to help you uncover exactly why it's not pickling… and try to get around it.
I'm doubtful that pickling a widget is the right way to go. You probably also don't want to go the route of pulling the state out of a treeview into a shadow class, and saving that class. The problem is that the treeview is not really built with a good separation in mind for saving state.
If you can redesign to cleanly separate the state of your application from the widgets themselves, then that's more likely to do what you want. So, when you ask How to serialize a treeview, this is really not what you are asking. You want to know how to save the state of your application.
There are packages that can do something like that very easily. I'd suggest you looking at enaml and/or traits. enaml is a declarative markup that asks you to build a class that describes how your application interface works. It forces you to separate the inner workings of the thing your are displaying from the code that's necessary top operate the user interface… and it does it in a very easy to build way -- where the state of the application is separate from the user interface wiring. Thus, an instance of the class you build contains the state of the application at any time -- regardless of whether it has a UI on it or not, or two or three UIs for that matter. It makes saving the state of the application very easy because you never have to
worry about saving the state of the UI -- the UI has no state -- it's just layout painted on top of the application. Then you won't have to worry about pickling widgets…
Check out enaml here: https://github.com/nucleic/enaml
and traits here: http://docs.enthought.com/traits

Another Q&A shows how to pickle a treeview on exit and reload it on startup:
How to save information from the program then use it to show in program again (simple programming)
The OP has information laid out thusly:
#----------TreeViewlist----------------------
Header =['Website','Username','Password','etc']
The gist of the treeview is a record of each website the OP visits, what user ID is used and the password used.
To summarize the accepted answer:
Save treeview to pickle on exit
x=[tree.item(x)['values'] for x in tree.get_children()]
filehandler = open('data.pickle', 'wb')
pickle.dump(x,filehandler)
filehandler.close()
Load pickle to build treeview on startup
items = []
try:
filehandler = open('data.pickle', 'rb')
items = pickle.load(filehandler)
filehandler.close()
except:
pass
for item in items:
tree.insert('','end',values=item)
The answer appears straight forward (to me) but if you have any questions post a comment below. If you see a flaw or bug in the code post a comment in the link above.

how to save and retrive a list from a file in Python?

I searched for a way to save variables into a file (making them persistent for other computations).
I found some solutions like: https://stackoverflow.com/a/899199/1846113
but when I implemented it on a list like:
import pickle
list = [['cccc',['asd','sdad','sdadas']],['cscc',['asd','sdad','sdadas']]]
pickle.dump(list, outfile)
It gives me this error
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 739, in save_global
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 811, in whichmodule
TypeError: unhashable type: 'list'
Anyone know what the problem is?
Or other solutions?
Edit: with solution
The problem was that I made an error creating the list.
I'll post it (so you can laugh) and avoid this stupid error:
I was creating the list by processing some elements of a list with an (ugly) function:
def process_element(doc):
processed_value = do_something(doc.pop())
return [doc.pop, processed_value]
As some of you already notice, I've made an error returning the output:
[doc.pop, processed_value]
I added a method, which is not hashable, to the list, giving me the error.
The correct version is:
def process_element(doc):
processed_value = do_something(doc.pop())
return [doc.pop(), processed_value]
Thanks.

A simpler way is to use with, for and enumerate like this :
with open('input.txt', 'r') as f, open('output.txt', 'w') as o:
for numline, line in enumerate((line.split() for line in f), start=1):
# process line elements by using line[0] line[1]...
# When you're done, you can write results in an output file like this
# (add a loop if needed):
o.write("%s %s\n" % (results[0], results[1]))
Et voilà !

For your example list, and most others, I would use JSON file format:
import json
lst = [['cccc',['asd','sdad','sdadas']],['cscc',['asd','sdad','sdadas']]]
with open('list.json', w) as jsonout:
json.dump(lst, jsonout)
# and then
with (open('list.json') as jsonin:
lst = json.load(jsonin)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use dataframes within a map function in Spark? - python

Related

Why does pandas.read_csv read all of the blank values in one code but not the other?

Conflict between dill and pickle while using pyspark

How do I efficiently retrieve a non-initial slice of a Flywheel scan?

tkinter: How to serialize a treeview?

how to save and retrive a list from a file in Python?

Categories

Resources