Reading old HDF5 stores created by pandas - python

I'm having some trouble reading an old HDF5 file that I made with pandas in Python 2.7.
At the time I was using the to_hdf method to append groups to the file (e.g. db.to_hdf('File.h5', 'groupNameA', mode='a', data_columns=True, format='table'))
Now when I open the store and get the keys of the groups I find that each one has a slash added to the name ('/groupNameA' in the example above). Attempting to access those groups with store['/groupNameA'], store.select('/groupNameA'), etc. produces TypeError: getattr(): attribute name must be string. Getting that error seems correct (slashes should not be used in these keys) but that doesn't help me get my data into a python 3 environment.
If there's a way to get around this problem in python 3, that'd be great.
Alternatively, I can still load the data in my 2.7 environment. So changing the code for writing the store so that slashes don't get added would probably solve the issue as well.
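For reference, a minimal sketch of what the Python 3 access attempt looks like (file and group names taken from the question); note that the leading slash on the keys is the normal HDFStore convention:
import pandas as pd

# Open the store read-only and list the groups (keys come back as '/groupNameA', ...)
with pd.HDFStore('File.h5', mode='r') as store:
    print(store.keys())
    # Both of these raise the TypeError described above:
    df = store['/groupNameA']
    # df = store.select('/groupNameA')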

Related

Modify flow file attributes in NiFi with Python sys.stdout?

In my pipeline I have a flow file that contains some data I'd like to add as attributes to the flow file. I know in Groovy I can add attributes to flow files, but I am less familiar with Groovy and much more comfortable with using Python to parse strings (which is what I'll need to do to extract the values of these attributes). The question is, can I achieve this in Python when I use ExecuteStreamCommand to read in a file with sys.stdin.read() and write out my file with sys.stdout.write()?
So, for example, I use the code below to extract the timestamp from my flowfile. How do I then add ts as an attribute when I'm writing out ff?
import sys
ff = sys.stdin.read()
t_split = ff.split('\t')
ts = t_split[0]
sys.stdout.write(ff)
Instead of writing the entire file back out, you can simply write the attribute value from the input FlowFile:
sys.stdout.write(ts)  # the timestamp in your case
and then set the Output Destination Attribute property of the ExecuteStreamCommand processor to the desired attribute name.
The output of the stream command will then be put into an attribute of the original FlowFile, and that FlowFile can be found in the original relationship queue.
For more details, you can refer to ExecuteStreamCommand-Properties
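Putting that together, a minimal sketch of what the script could look like under that setup (assuming, as in the question, that the timestamp is the first tab-separated field, and that Output Destination Attribute is set to a name such as 'timestamp'):
import sys

# Read the incoming flow file content from stdin
ff = sys.stdin.read()

# The timestamp is assumed to be the first tab-separated field
ts = ff.split('\t')[0]

# Write only the timestamp to stdout; with Output Destination Attribute set,
# ExecuteStreamCommand puts this output into an attribute of the original
# flow file instead of replacing its content.
sys.stdout.write(ts)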
If you're not importing any native (CPython) modules, you can try ExecuteScript with Jython rather than ExecuteStreamCommand. I have an example in Jython in an ExecuteScript cookbook. Note that you don't use stdin/stdout with ExecuteScript, instead you have to get the flow file from the session and either transfer it as-is (after you're done reading) or overwrite it (there are examples in the second part of the cookbook).

Dask read_csv fails where pandas doesn't

Trying to use dask's read_csv on a file that pandas's read_csv handles fine, like this
dd.read_csv('data/ecommerce-new.csv')
fails with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The file is a CSV of data scraped using scrapy, with two columns: one with the URL and the other with the HTML (which is stored as multi-line text quoted with " as the delimiter character). The fact that pandas actually parses it means it should be well-formed.
html,url
https://google.com,"<a href=""link"">
</a>"
Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it actually fails when trying to infer the datatypes (there's also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
Has anyone encountered this problem before? Is there a fix/workaround?
EDIT: Apparently this is a known problem with dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read it all in memory:
dd.from_pandas(pd.read_csv(input_file), chunksize=25)
This works, but at the cost of parallelism. Any other solution?
For people coming here in 2020: dd.read_csv now works directly with newlines inside quotes. It has been fixed; update to the latest version of Dask (2.18.1 and above) to get this behavior.
import dask.dataframe as dd
df = dd.read_csv('path_to_your_file.csv')
print(df.compute())
Gives,
html url
0 https://google.com \n
OR
For people who want to use an older version for some reason: as suggested by @mdurant, you may want to pass blocksize=None to dd.read_csv, which comes at the cost of parallel loading.
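For example, a minimal sketch of that fallback (file name taken from the question); blocksize=None makes dask read each file as a single partition, so quoted newlines cannot be split across blocks:
import dask.dataframe as dd

# A single partition per file: no parallel loading, but embedded newlines
# inside quoted fields can no longer fall on a block boundary.
df = dd.read_csv('data/ecommerce-new.csv', blocksize=None)
print(df.compute())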

Proper format for a file name in python

I'm importing an mp3 file using IPython (more specifically, the IPython.display.display(IPython.display.Audio(...)) command), and I wanted to know if there is a specific way you are supposed to format the file path.
The documentation says it takes the file path, so I assumed (perhaps incorrectly) that it should be something like \home\downloads\randomfile.mp3, which I used an online converter to convert into unicode. I put that in (using, of course, filename=u'unicode here'), but that didn't work, instead giving a bunch of errors. I tried reformatting it in different ways (just \downloads\randomfile.mp3, etc.), but none of them worked. For those who are curious, here are the unicode characters: \u005c\u0044\u006f\u0077\u006e\u006c\u006f\u0061\u0064\u0073\u005c\u0062\u0064\u0061\u0079\u0069\u006e\u0073\u0074\u0072\u0075\u006d\u0065\u006e\u0074\u002e\u006d\u0070\u0033 which translates to \home\Downloads\bdayinstrument.mp3, I believe.
So, am I doing something wrong? What is the correct way to format the "file path"?
Thanks!
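For what it's worth, a minimal sketch of passing the path as a plain string with forward slashes (the path is adapted from the question and assumed to exist); IPython.display.Audio accepts an ordinary local path via its filename argument, so no unicode escaping is needed:
from IPython.display import Audio, display

# Plain string path, forward slashes, no escaping
# (path adapted from the question; adjust to your actual file location).
display(Audio(filename='/home/Downloads/bdayinstrument.mp3'))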

Trouble retrieving data from kivy's jsonstore

I'm having issues retrieving data from a '.json' file if the key contains non-ascii characters.
To explain better, I want to illustrate this issue with an example.
Say I want to save data into a json file as follows:
from kivy.storage.jsonstore import JsonStore

store = JsonStore('example.json')
store.put('André Rose', type='sparkling wine', comment='favourite')
Then I want to retrieve it as follows
store.get('André Rose')
This returns an error that says:
KeyError: 'Andr\xc3\xa9'
I believe the problem is the non-ascii character " é ".
So my question is: how can I save entries like this into a json file and retrieve them without getting this key error?
"There is a bug in kivy 1.8.0 under Python 3. When you are using Kivy 1.8.0 and Python 3, URlRequest fails to convert the incoming data to JSON. If you are using this combination you'll need to add:" (Philips, Creating Apps in Kivy)
import json
data = json.loads(data.decode())
I'm not sure if this will help your particular problem, but I thought I might throw it out there.
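If the problem is with the key itself rather than with incoming network data, one thing that may be worth trying (an untested sketch, not a confirmed fix) is to use a unicode string for the key on both put and get, since the JSON round trip returns keys as unicode under Python 2:
# -*- coding: utf-8 -*-
from kivy.storage.jsonstore import JsonStore

store = JsonStore('example.json')

# Use the same unicode object as the key when storing and retrieving.
key = u'André Rose'
store.put(key, type='sparkling wine', comment='favourite')
print(store.get(key))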

python pandas.DataFrame.to_csv newline issue

Greetings, dear community.
I need to write a python pandas.DataFrame to a csv file.
I tried to use something like this:
import csv
dfPRR.to_csv(prrDumpName, index=False, quotechar="'", quoting=csv.QUOTE_ALL)
It works fine for some samples, but for other samples with long strings I run into an issue where one record breaks into 2 or 3 different lines.
what I want my output file:
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
Instead, this is what I get: the file breaks each record into two separate lines:
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
Is there an option to tell to_csv to use a specific record delimiter?
I do not see such an option in the to_csv documentation.
My goal is to create a csv that a loader program will then load.
As it is now, the loader program cannot load the file when this happens, because it is not able to tell whether a record is finished or not.
In other sample files, where the strings are not as long, the records do not break into 2 or 3 lines. That is the desired behavior.
How can I enforce this?
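One likely cause (an assumption based on the symptom, not something verified against your data) is that the long string fields themselves contain embedded newline characters, which to_csv quotes and writes verbatim; a minimal sketch of stripping those newlines from the string columns before writing:
import csv

# Replace embedded newlines in every string column so each record stays
# on a single physical line (dfPRR and prrDumpName as in the question).
for col in dfPRR.select_dtypes(include='object').columns:
    dfPRR[col] = dfPRR[col].str.replace('\r', ' ').str.replace('\n', ' ')

dfPRR.to_csv(prrDumpName, index=False, quotechar="'", quoting=csv.QUOTE_ALL)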
