Spark Statistical Functions Python

Spark Statistical Functions Python - python

I asked a question for statistical functions and got an answer but I am looking at another way to do it:
What I find weird is:
This works:
myData = dataSplit.map(lambda arr: (arr[1]))
myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).groupByKey()
stats[1] = myData2.map(lambda (Column, values): (min(values))).collect()
But when I add this function:
stats[4] = myData2.map(lambda (Column, values): (values)).variance()
It fails.
So I put some print:
myData = dataSplit.map(lambda arr: (arr[1]))
print myData.collect()
myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).groupByKey()
print myData2.map(lambda (Column, values): (values)).collect()
Printing myData:
[u'18964', u'18951', u'18950', u'18949', u'18960', u'18958', u'18956', u'19056', u'18948', u'18969', u'18961', u'18959', u'18957', u'18968', u'18966', u'18967', u'18971', u'18972', u'18353', u'18114', u'18349', u'18348', u'18347', u'18346', u'19053', u'19052', u'18305', u'18306', u'18318', u’18317']
Printing myData2:
[<pyspark.resultiterable.ResultIterable object at 0x7f3f7d3e0710>]

solved
print myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).map(lambda (column, value) : (value)).stdev()

Related

How can I initialize dictionary in Python where key is an object?

I am trying to rephrase the implementation found here. This is what I have so far:
import csv
import math
import random
training_set_ratio = 0.67
training_set = []
test_set = []
class IrisFlower:
def __init__(self, petal_length, petal_width, sepal_length, sepal_width, flower_type):
self.petal_length = petal_length
self.petal_width = petal_width
self.sepal_length = sepal_length
self.sepal_width = sepal_width
self.flower_type = flower_type
def __hash__(self) -> int:
return hash((self.petal_length, self.petal_width, self.sepal_length, self.sepal_width))
def __eq__(self, other):
return (self.petal_length, self.petal_width, self.sepal_length, self.sepal_width) \
== (other.petal_length, other.petal_width, other.sepal_length, other.sepal_width)
def load_data():
with open('dataset.csv') as csvfile:
rows = csv.reader(csvfile, delimiter=',')
for row in rows:
iris_flower = IrisFlower(float(row[0]), float(row[1]), float(row[2]), float(row[3]), row[4])
if random.random() < training_set_ratio:
training_set.append(iris_flower)
else:
test_set.append(iris_flower)
def euclidean_distance(flower_one: IrisFlower, flower_two: IrisFlower):
distance = 0.0
distance = distance + math.pow(flower_one.petal_length - flower_two.petal_length, 2)
distance = distance + math.pow(flower_one.petal_width - flower_two.petal_width, 2)
distance = distance + math.pow(flower_one.sepal_length - flower_two.sepal_length, 2)
distance = distance + math.pow(flower_one.sepal_width - flower_two.sepal_width, 2)
return distance
def get_neighbors(test_flower: IrisFlower):
distances = []
for training_flower in training_set:
dist = euclidean_distance(test_flower, training_flower)
d = dict()
d[training_flower] = dist
print(d)
return
load_data()
get_neighbors(test_set[0])
Currently, print statements in the following code block:
def get_neighbors(test_flower: IrisFlower):
distances = []
for training_flower in training_set:
dist = euclidean_distance(test_flower, training_flower)
d = dict()
d[training_flower] = dist
print(d)
return
will have outputs similar to
{<__main__.IrisFlower object at 0x107774fd0>: 0.25999999999999945}
which is ok. But I do not want to create the dictionary first, and then append the key value, as in:
d = dict()
d[training_flower] = dist
So this is what I am trying:
d = dict(training_flower = dist)
However, it does not seem like the dist method is using the instance, but rather a String, because what I see printed is as follows:
{'training_flower': 23.409999999999997}
{'training_flower': 16.689999999999998}
How do I create the dictionary by using the object as key in one statement?

In your snippet, where you write d = dict(training_flower=dist), "training_flower" is a keyword argument for dict function and not an object. It is equivalent to writing d = {'training_flower': dist}. The only way to create a dictionary with an object as a key is to use the latter syntax:
d = {training_flower: dist}

To directly create a dict with a key which is not a valid keyword, use the {} syntax like:
Code:
d = {training_flower: 'a_value'}
Test Code:
training_flower = 'a key'
d = {training_flower: 'a_value'}
print(d)
Results:
{'a key': 'a_value'}

to initialize a dictionary with an object as a key, (edit: and the string in Stephen's example is an object anyway)
class Flower:
def __repr__(self):
return 'i am flower'
flower1 = Flower()
d = {flower1: 4}
print(d)
outputs
{i am flower: 4}
this is my first post here, and I know I'm late, sorry if it's a duplicate solution. just to show it works with an object.
would upvote Stephen's answer but I can't yet.

Q: Pandas dataframe from for loop

EDIT 2, 9/1 See my answer below!
Pretty new at Python and Pandas here. I've got a script here that uses a for loop to query my database using each line in my list. That all works great, but I can't figure out how to build a data frame from the results of that loop. Any and all pointers are welcome!
#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
multi_line = multiple_replacer(key_values)
print "Querying Database..."
for line in source:
brand_url = multi_line(line)
#Run Query with cleaned list
mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
list1 = []
brands = my_query('prod', mysql_query)
print "Writing CSV..."
#Create DF and CSV
for row in brands:
list1.append({"URL":row['URL'],"Name":['Name'],"ID":['ID']})
if brands.shape == (3,0):
df1 = pd.DataFrame(data = brands, columns=['URL','Name','ID'])
output = df1.to_csv('ongoing.csv',index=False)
EDIT 8/30
Here is my edit, attempting to use zyxue's method:
#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
multi_line = multiple_replacer(key_values)
print "Querying Database..."
for line in source:
brand_url = multi_line(line)
#Run Query with cleaned list
mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
brands = my_query('prod', mysql_query)
print "Writing CSV..."
#Create DF and CSV
records = []
for row in brands:
records.append({"URL":row['URL'],"Name":['Name'],"ID":['ID']})
if brands.shape == (3,0):
records.append(dict(zip(brands, ['URL', 'Name', 'ID'])))
df1 = pd.DataFrame.from_records(records)
output = df1.to_csv('ongoing.csv', index=False)
but this only returns a blank CSV. I'm sure I'm applying it wrong.

records = []
for row in brands:
# if brands.shape == (3,0):
# records.append(dict(zip(brands, ['URL', 'Name', 'ID'])))
# update bug fix:
if row.shape == (3,0):
records.append(dict(zip(row, ['URL', 'Name', 'ID'])))
df1 = pd.DataFrame.from_records(records)
output = df1.to_csv('ongoing.csv', index=False)
# ref:
# >>> pd.DataFrame.from_records([{'a': 1, 'b':2}, {'a': 11, 'b': 22}])
# a b
# 0 1 2
# 1 11 22

Okay, I figured it out, and I thought I should post the working script. #zyxue was pretty much right.
source = open('urls.txt')
key_values = ("http://",""), ("https://",""), ("www.",""), ("\n","")
#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
multi_line = multiple_replacer(key_values)
print "Querying Database..."
records = []
for line in source:
brand_url = multi_line(line)
#Run Query with cleaned list
mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
brands = my_query('prod', mysql_query)
#Append results to dict (records)
for row in brands:
records.append({"URL":row['URL'],"Name":row['Name'],"ID":row['ID']})
#Create DataFrame
df = pd.DataFrame.from_dict(records)
#Create CSV
output = df.to_csv('ongoing.csv',index=False)
Essentially, I needed to layer the second for loop under the first and create the 'records' dictionary before the looping began. This causes an append to the dictionary for every line in 'source'. Seems like a pretty simple concept now!

Python dictionary print with parentheses

I am trying to edit this function so the values of the dictionary will not be printed in parentheses and will be iterable:
def traverse_appended(key):
reg_dict = {}
#keypath = r"SOFTWARE\\Wow6432Node\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\"
for item in traverse_reg(key):
keypath_str = str(keypath+item)
reg_dict[item] = str(get_reg("Displayversion", keypath_str)), str(get_reg("DisplayName", keypath_str))
#reg_dict[item] = get_reg("DisplayName", keypath_str)
return reg_dict
the expected output is :
{'DXM_Runtime': 'None', 'None'}
The function output:
{'DXM_Runtime': ('None', 'None')}

#Consider traverse_appended returns following dict.
#I think, converting func_dict values which are tuple into string, will help you to get expected output.
func_dict = {"DXM_Runtime":('None','None'),
"TMP_KEY":('A','B')
}
derived_dict = {}
for k,v in func_dict.viewitems():
tmp_str = ",".join(v)
derived_dict[k] = tmp_str
print derived_dict
#Output
E:\tmp_python>python tmp.py
{'DXM_Runtime': 'None,None', 'TMP_KEY': 'A,B'}
#If this doesn't help you, then please post the code for get_reg and traverse_reg function also.

python generating nested dictionary key error

I am trying to create a nested dictionary from a mysql query but I am getting a key error
result = {}
for i, q in enumerate(query):
result['data'][i]['firstName'] = q.first_name
result['data'][i]['lastName'] = q.last_name
result['data'][i]['email'] = q.email
error
KeyError: 'data'
desired result
result = {
'data': {
0: {'firstName': ''...}
1: {'firstName': ''...}
2: {'firstName': ''...}
}
}

You wanted to create a nested dictionary
result = {} will create an assignment for a flat dictionary, whose items can have any values like "string", "int", "list" or "dict"
For this flat assignment
python knows what to do for result["first"]
If you want "first" also to be another dictionary you need to tell Python by an assingment
result['first'] = {}.
otherwise, Python raises "KeyError"
I think you are looking for this :)
>>> from collections import defaultdict
>>> mydict = lambda: defaultdict(mydict)
>>> result = mydict()
>>> result['Python']['rules']['the world'] = "Yes I Agree"
>>> result['Python']['rules']['the world']
'Yes I Agree'

result = {}
result['data'] = {}
for i, q in enumerate(query):
result['data']['i'] = {}
result['data'][i]['firstName'] = q.first_name
result['data'][i]['lastName'] = q.last_name
result['data'][i]['email'] = q.email
Alternatively, you can use you own class which adds the extra dicts automatically
class AutoDict(dict):
def __missing__(self, k):
self[k] = AutoDict()
return self[k]
result = AutoDict()
for i, q in enumerate(query):
result['data'][i]['firstName'] = q.first_name
result['data'][i]['lastName'] = q.last_name
result['data'][i]['email'] = q.email

result['data'] does exist. So you cannot add data to it.
Try this out at the start:
result = {'data': []};

You have to create the key data first:
result = {}
result['data'] = {}
for i, q in enumerate(query):
result['data'][i] = {}
result['data'][i]['firstName'] = q.first_name
result['data'][i]['lastName'] = q.last_name
result['data'][i]['email'] = q.email

Learning Python: Store values in dict from stdout

How can I do the following in Python:
I have a command output that outputs this:
Datexxxx
Clientxxx
Timexxx
Datexxxx
Client2xxx
Timexxx
Datexxxx
Client3xxx
Timexxx
And I want to work this in a dict like:
Client:(date,time), Client2:(date,time) ...

After reading the data into a string subject, you could do this:
import re
d = {}
for match in re.finditer(
"""(?mx)
^Date(.*)\r?\n
Client\d*(.*)\r?\n
Time(.*)""",
subject):
d[match.group(2)] = (match.group(1), match.group(2))

How about something like:
rows = {}
thisrow = []
for line in output.split('\n'):
if line[:4].lower() == 'date':
thisrow.append(line)
elif line[:6].lower() == 'client':
thisrow.append(line)
elif line[:4].lower() == 'time':
thisrow.append(line)
elif line.strip() == '':
rows[thisrow[1]] = (thisrow[0], thisrow[2])
thisrow = []
print rows
Assumes a trailing newline, no spaces before lines, etc.

What about using a dict with tuples?
Create a dictionary and add the entries:
dict = {}
dict['Client'] = ('date1','time1')
dict['Client2'] = ('date2','time2')
Accessing the entires:
dict['Client']
>>> ('date1','time1')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Spark Statistical Functions Python - python

solved print myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).map(lambda (column, value) : (value)).stdev()

Related

How can I initialize dictionary in Python where key is an object?

Q: Pandas dataframe from for loop

Python dictionary print with parentheses

python generating nested dictionary key error

Learning Python: Store values in dict from stdout

Categories

Resources