get/access each chunk of dask.dataframe(df, chunksize=100) - python

I used the code below to split a dataframe using dask:
result = dd.from_pandas(df, chunksize=75)
I then use the code below to create a custom JSON file for each chunk:
for z in result:
    createjson(z)
It just didn't work! How can I access each chunk?

There may be a more native way (feels like there should be), but you can do:
for i in range(result.npartitions):
    partition = result.get_partition(i)
    # your code here

We do not know what your createjson function does, but perhaps it is covered by to_json().
Alternatively, if you really want to do something unique to each of your partitions, and it is not specific to JSON, then you will want the method map_partitions().
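For what it's worth, a minimal sketch of both routes (the toy frame and the createjson body below are assumptions; the real function is whatever you have written):

import pandas as pd
import dask.dataframe as dd

# Toy frame standing in for the asker's df (hypothetical values).
df = pd.DataFrame({"a": range(300), "b": list("xyz") * 100})
result = dd.from_pandas(df, chunksize=75)

# Built-in route: write one JSON file per partition.
result.to_json("chunk-*.json")

# Custom route: map_partitions calls the function once per pandas chunk.
def createjson(chunk):
    # Stand-in body: serialise the chunk; the real createjson may differ.
    return pd.Series([chunk.to_json(orient="records")])

parts = result.map_partitions(createjson, meta=(None, "object")).compute()
print(len(parts))  # one JSON string per partition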

Related

How to get data from object in Python

I want to get discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, which is a matrix in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized for processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id.
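For illustration, a minimal sketch of that kind of nested lookup once the JSON is parsed into a Python dict (the structure below is hypothetical and only mirrors the path named above):

import json

raw = '{"json_data": {"attributes": {"social_connections": {"discord": {"user_id": "123456789012345678"}}}}}'
obj = json.loads(raw)

# Walk the nested keys down to the value we want.
user_id = obj["json_data"]["attributes"]["social_connections"]["discord"]["user_id"]
print(user_id)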

Pyspark - reducer task iterates over values

I am working with pyspark for the first time.
I want my reducer task to iterate over the values that come back with the key from the mapper, just like in Java.
I saw there is only the option of an accumulator and not iteration - for example, in the add function add(data1, data2), data1 is the accumulator.
I want to get as my input a list of the values that belong to the key.
That's what I want to do. Does anyone know if there is a way of doing that?
Please use the reduceByKey function. In Python, it should look like
from operator import add
rdd = sc.textFile(....)
res = rdd.map(...).reduceByKey(add)
Note: Spark and MapReduce have fundamental differences, so it is suggested not to force-fit one onto the other. Spark also supports pair functions pretty nicely; look at aggregateByKey if you want something fancier.
Btw, the word count problem is discussed in depth (especially the usage of flatMap) in the Spark docs, so you may want to have a look.
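A minimal sketch of both patterns, assuming a local SparkContext and a hypothetical input path; reduceByKey folds values pairwise, while groupByKey hands you the full list of values per key (closest to a Java-style reducer input):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
lines = sc.textFile("hdfs:///path/to/input.txt")  # hypothetical path

# Word count: flatMap splits lines into words, each word is paired with a 1.
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# reduceByKey folds the values pairwise per key (add acts as the combiner).
counts = pairs.reduceByKey(add)

# groupByKey keeps every value for a key as an iterable.
grouped = pairs.groupByKey().mapValues(list)

print(counts.take(5))
print(grouped.take(5))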

How do I make a dynamically expanding array in python

OK, I have this part of the code:
def Reading_Old_File(self, Path, turn_index, SKU):
    print "Reading Old File! Turn Index = ", turn_index, "SKU= ", SKU
    lenght_of_array = 0
    array_with_data = []
    if turn_index == 1:
        reading_old_file = open(Path, 'rU')
        data = np.genfromtxt(reading_old_file, delimiter="''", dtype=None)
        for index, line_in_data in enumerate(data, start=0):
            if index < 3:
                print index, "Not Yet"
            if index >= 3:
                print ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Reading All Old Items"
                i = index - 3
                old_items_data[i] = line_in_data.split("\t")
                old_items_data[i] = [lines_old.strip() for lines_old in old_items_data]
                print old_items_data[i]
    print len(old_items_data)
So what I am doing here is reading a file. On my first turn, I want to read it all and keep all the data, so it would be something like:
old_items_data[1]=['123','dog','123','dog','123','dog']
old_items_data[2]=['124','cat','124','cat','124','cat']
old_items_data[n]=['amount of list members is equal each time']
Each line of the file should be stored in a list, so I can use it later for comparing: when turn_index is greater than 2, I'll compare each incoming line with the lines in every list (array) by iterating over all the lists.
So the question is, how do I do it, or is there a better way to compare lists?
I'm new to Python, so maybe someone could help me with this issue?
Thanks
You just need to use append.
old_items_data.append(line_in_data.split("\t"))
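Put differently, a small sketch of the loop built around append, assuming data is the array already read with np.genfromtxt as in the question:

old_items_data = []
for index, line_in_data in enumerate(data):
    if index >= 3:
        # Split the line on tabs, strip each field, and grow the list.
        old_items_data.append([field.strip() for field in line_in_data.split("\t")])

print(len(old_items_data))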
I would use the pandas package for this. It will not only be much quicker, but also simpler. Use pandas.read_table to import the data (the delimiter and row-skipping can be specified here by passing arguments to sep and skiprows). Then use pandas.DataFrame.apply to apply your function to the rows of your data.
The speed gains come from the fact that pandas is optimized to perform actions across lists like this (in a pandas DataFrame, these are called rows). This applies both to importing the data and to applying a function to every row. The simplicity gains should hopefully be clear.
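A minimal sketch of that pandas route, with the path, separator, and number of skipped rows as assumptions about the asker's file:

import pandas as pd

# Skip the first three lines and split on tabs while importing.
old_items = pd.read_table("old_file.txt", sep="\t", skiprows=3, header=None)

# apply() runs a function on every row (axis=1); here each field is stripped.
old_items = old_items.apply(lambda row: row.astype(str).str.strip(), axis=1)
print(old_items.head())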

Pandas dataframe from generator where each line is a tab-separated row

I am trying to pass a generator to the DataFrame constructor, pd.DataFrame: testdf = pd.DataFrame(test). I am unable to specify that each line is tab-delimited. The result is that I end up with a single-column dataframe where each row is the entire row of values separated by '\t'.
I've tried a couple of other ways:
pd.read_csv(test)
pandas.io.parsers.read_table(test, sep='\t')
but neither of these works, because they do not accept a generator as input.
Not too familiar with generators. Can you throw them into a list comprehension? If so, how about
pd.DataFrame([x.split('\t') for x in test])
One solution that I found would be to use a split function on the one column to break it up:
testdf_parsed = pd.DataFrame(testdf.row.str.split('\t').tolist(), )
...and that did work for me, but maybe a more elegant and simple solution exists that leverages the core capabilities of Pandas?
You might try implementing a file-like object that wraps your generator, then feeding that to read_table.
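A sketch of that wrapper idea, assuming the generator (here a stand-in named gen) yields tab-separated text lines:

import io
import pandas as pd

def gen():  # stand-in for the asker's generator
    yield "a\tb\tc"
    yield "1\t2\t3"
    yield "4\t5\t6"

class GeneratorFile(io.RawIOBase):
    """File-like object that lazily pulls lines from a generator."""
    def __init__(self, generator):
        self.generator = generator
        self.buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer from the generator until b can be served.
        while len(self.buffer) < len(b):
            try:
                self.buffer += (next(self.generator) + "\n").encode()
            except StopIteration:
                break
        chunk, self.buffer = self.buffer[:len(b)], self.buffer[len(b):]
        b[:len(chunk)] = chunk
        return len(chunk)

df = pd.read_table(io.BufferedReader(GeneratorFile(gen())), sep="\t")
print(df)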

Python sas7bdat module usage

I have to dump data from SAS datasets. I found a Python module called sas7bdat.py that says it can read SAS .sas7bdat datasets, and I think it would be simpler and more straightforward to do the project in Python rather than SAS due to the other functionality required. However, the help(sas7bdat) in interactive Python is not very useful and the only example I was able to find to dump a dataset is as follows:
import sas7bdat
from sas7bdat import *
# following line is sas dataset to convert
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
#following line is txt file to create
foo.convertFile('/support/textfiles/locked_data.txt','\t')
This doesn't do what I want because a) it uses the SAS variable names as column headers and I need it to use the variable labels, and b) it uses "nan" to denote missing numeric values where I'd rather just leave the value blank.
Can anyone point me to some useful documentation on the methods included in sas7bdat.py? I've Googled every permutation of key words that I could think of, with no luck. If not, can someone give me an example or two of using readColumnAttributes(), readColumnLabels(), and/or readColumnNames()?
Thanks, all.
As time passes, solutions become easier. I think this one is easiest if you want to work with pandas:
import pandas as pd
df = pd.read_sas('/support/sas/locked_data.sas7bdat')
Note that it is easy to get a numpy array by using df.values.
This is only a partial answer as I've found no [easy to read] concrete documentation.
You can view the source code here
This shows some basic info regarding what arguments the methods require, such as:
readColumnAttributes(self, colattr)
readColumnLabels(self, collabs, coltext, colcount)
readColumnNames(self, colname, coltext)
I think most of what you are after is stored in the "header" class returned when creating an object with SAS7BDAT. If you just print that class you'll get a lot of info, but you can also access individual class attributes. I think most of what you may be looking for would be under foo.header.cols. I suspect you would use various header attributes as parameters for the methods you mention.
Maybe something like this will get you closer?
from sas7bdat import SAS7BDAT
foo = SAS7BDAT(inFile)  # your file here...
for i in foo.header.cols:
    print '"Attributes"', i.attr
    print '"Labels"', i.label
    print '"Name"', i.name
Edit: Unrelated to this specific question, but the type() and dir() commands come in handy when trying to figure out what is going on in an unfamiliar class/library.
I know I'm late with the answer, but in case someone searches for a similar question, the best option is:
import sas7bdat
from sas7bdat import *
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
# This converts to dataframe:
ds = foo.to_data_frame()
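A sketch tying the pieces together for the original two requirements (label headers, blanks for missing values); it assumes foo.header.cols exposes .name and .label as in the earlier answer, which can vary with the sas7bdat version:

from sas7bdat import SAS7BDAT

foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
df = foo.to_data_frame()

# Swap variable names for their labels using the header info shown earlier.
# (name/label may be bytes depending on the sas7bdat version)
labels = {col.name: col.label for col in foo.header.cols}
df = df.rename(columns=labels)

# na_rep='' writes missing values as blanks instead of "nan".
df.to_csv('/support/textfiles/locked_data.txt', sep='\t', index=False, na_rep='')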
Personally I think the better approach would be to export the data using SAS then process the external file as needed using Python.
In SAS, you can do this...
libname datalib "/support/sas";
filename sasdump "/support/textfiles/locked_data.txt";
proc export
    data = datalib.locked_data
    outfile = sasdump
    dbms = tab
    label
    replace;
run;
The downside to this is that while the column labels are used rather than the variable names, the labels are enclosed in double quotes. When processing in Python, you may need to remove them programmatically if they cause a problem. I hope that helps even though it doesn't use Python as you wanted.
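If the quoted labels do get in the way, a small Python sketch of that post-processing step (the path matches the filename statement above):

path = '/support/textfiles/locked_data.txt'

with open(path) as fh:
    lines = fh.readlines()

# Only the header row of labels carries the double quotes; strip them there.
lines[0] = lines[0].replace('"', '')

with open(path, 'w') as fh:
    fh.writelines(lines)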
