python dictionary for incremental values

python dictionary for incremental values - python

I would like to perform the following:
Given a total number of observations (in this case the variable 'total_models'), I would like to parse this for parallel processing by a given number of python sessions ('sessions' variable and 'by' variable). I figure it best to perform this task using a dictionary.
The desired results should can be found in the 'obs_dict' object. For any given input to 'total_models', 'sessions' and 'by'. Can you assist in creating the desired output in a dictionary object? If possible, I would like to see the answer using some sort of list or dictionary comprehension.
total_models=1000000
sessions=4
by=int(total_models/sessions)
### Desired Output.
obs_dict={1:'0:250000',2:'250001:500000',3:'500001:750000',4:'750001:1000000'}

obs = {i+1: str(i*by+1)+':'+str((i+1)*by) for i in range(sessions)}
Edited, Kyle:
For the odd models it would seem wrapping a ceil the division will ensure we we do not go over the 'total_models'
total_models=1000326
sessions=5
by=math.ceil(total_models/sessions)
obs = {i+1: str(i*by+1)+':'+str(min((i+1)*by,total_models)) for i in range(sessions)}

Related

Get random number from set deprecation

I am trying to get a random n number of users from a set of unique users.
Here is what I have so far
users = set()
random_users = random.sample((users), num_of_user)
This works well but it is giving me a deprecated warning. What should I be using instead? random.choice doesn't work with sets
UPDATE
I am trying to get reactions on a post and want them to be unique which is why I used a set. Would it be better to stick with a list for this?
users = set()
for reaction in msg.reactions:
async for user in reaction.users():
users.add(user)

Convert your set to a list.
by using the list function:
random_users = random.choices(list(users),k=num_of_user)
by using * operator to unpack your set or dict:
random_users = random.choices([*users],k=num_of_user)
Solution 1. is 3 char longer than the 2., but solution 1. is more literal - to me.
It is not guaranteed that you will get the same result for the list order through different executions, python versions and platforms, therefore you may end up with different random result, despite a careful random number generator initialization. To resolve this issue, order your list.
You can also store the users in a list, and make the elements unique later with set, and then convert again to a list. For this, a common way is to convert your list to a set, and back to a list again.

FWIW, random.sample() in Python 3.9.2 says this when passed a dict:
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).
And this solution does seem to work for both set and dict inputs.

BatchDataset not subscriptable when trying to format Python dictionary as table

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
print (dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
Which gave me the error that the BatchDataset was not subscriptable
Then I tried to use the range and leng function on the dict_slices, so that i would be an integer index and not a slice
Which gave me the following error (as I understand, because the dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):

Refer here for solution. To summarize we need to use as_numpy_iterator
example = list(dict_slices.as_numpy_iterator())
example[0]['age']

BatchDataset is a tf.data.Dataset instance that has been batches by calling it's .batch(..) method. You cannot "index" a tensorflow Dataset, or call the len function on it. I suggest iterating through it like you did in the first code snippet.
However in your dataset you are using .to_dict('list'), which means that a key in your dictionary is mapped to a list as value. Basically you have "columns" for every key and not rows, is this what you want? This would make printing line-by-line (shown in the table printing example you linked) a lot more difficult, since you do not have different features in a row. Also it is different from the example in the official Tensorflow code, where a datapoint consists of multiple features, and not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1) # batch = 1 because otherwise you will get multiple dict_slice - target pairs in one iteration below!
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.

Python object and its elements

I am studying the codes written by other.
I think the owner wrote it using OOP.
when I print the results object sims
the output is something like below:
[<WorldCupSim.WorldCupSim object at 0x0000018DC471B908>, <WorldCupSim.WorldCupSim object at 0x0000018DC471B5C0>]
Number of objects in sims depends on number of iterations.
in this case, I run used two iterations.
I want to print the elements of the object sims.
It seems like I need to give more details about it.
Please advise me what are the other info to be provided.
I am confused with the codes.
Thanks
Zep

What you are seeing is a default representation of the list sims. If you do
print(sims[0])
then, if WorldCupSim.WorldCupSim has defined a string representation (which is what print() will show you) then you should see more useful stuff.

As per your code what I am seeing is sims is array of objects. You can use for loop to iterate your array. Below is the sample snippet
for o in sims:
print(o)
It will print all the objects inside sims variable

string comparison for multiple values python

I have sets of data. The first (A) is a list of equipment with sophisticated names. The second is a list of more broad equipment categories (B) - to which I have to group the first list into using string comparisons. I'm aware this won't be perfect.
For each entity in List A - I'd like to establish the levenshtein distance for each entity in List B. The record in List B with the highest score will be the group to which I'll assign that data point.
I'm very rusty in python - and am playing around with FuzzyWuzzy to get the distance between two string values. However - I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop for each - but like I said I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not Fuzzy) - I'm glad to take suggestions.

It looks like the process.extractOne function is what you're looking for. A simple use case is something like
from fuzzywuzzy import process
from collections import defaultdict
complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']
group = defaultdict(list)
for name in complicated_names:
group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that has default values for all keys.
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.

summing entries in a variable based on another variable(make unique) in python lists

i have a question as to how i can perform this task in python:-
i have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
till about 110000 entries with about 4000 different combinations of longitude-latitude
i want to count the average connections, average policy status,average of activity flag for each location
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
and i have about 195 files with ~110,000 entries each (sort of a big data problem)
my files are in .csv but im using it as .txt to easily work with it in python(not sure if this is the best idea)
im still new to python so im not really sure whats the best approach to use but i sincerely appreciate any help or guidance for this problem
thanks in advance!

No, if you have the files as .csv, threating them as text does not make sense, since python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data in a proper database, and use SQL's AVG() and GROUP BY. Python ships with bindings for most databaases. If you have none installed, consider using the sqlite module.

I'll only give you the algorithm, you would learn more by writing the actual code yourself.
Use a Dictionary, with the key as a pair of the form (longitude, latitude) and value as a list of the for [ConnectionSum,policystatusSum,ActivityFlagSum]
loop over the entries once (do count the total number of entries, N)
a. for each entry, if the location exists - add the conn, policystat and Activity value to the existing sum.
b. if the entry does not exist, then assign [0,0,0] as the value
Do 1 and 2 for all files.
After all the entries have been scanned. Loop over the dictionary and divide each element of the list [ConnectionSum,policystatusSum,ActivityFlagSum] by N to get the average values of each.

As long as your locations are restricted to being in the same files (or even close to each other in a file), all you need to do is the stream-processing paradigm. For example if you know that duplicate locations only appear in a file, read each file, calculate the averages, then close the file. As long as you let the old data float out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
...
totalResults = ...
for path in filePaths:
partialResults = processFile(path)
totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1) method of calculating averages "on-line". If for example you are averaging 5,6,7, you would do 5/1=5.0, (5.0*1+6)/2=5.5, (5.5*2+7)/3=6. At each step, you only keep track of the current average and the number of elements. This solution will yield the minimal amount of memory used (no more than the size of your final result!), and doesn't care about which order you visit elements in. It would go something like this. See http://docs.python.org/library/csv.html for what functions you'll need in the CSV module.
import csv
def allTheRecords():
for path in filePaths:
for row in csv.somehow_get_rows(path):
yield SomeStructure(row)
averages = {} # dict: keys are tuples (lat,long), values are an arbitrary
# datastructure, e.g. dict, representing {avgConn,avgPoli,avgActi,num}
for record in allTheRecords():
position = (record.lat, record.long)
currentAverage = averages.get(position, default={'avgConn':0, 'avgPoli':0, 'avgActi':0, num:0})
newAverage = {apply the math I mentioned above}
averages[position] = newAverage
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: If you knew the exactly location of every IP event to infinite precision, the average of everything would be itself. The only reason you can compress your dataset is because your latitude and longitude have finite precision. If you run into this issue if you acquire more precise data, you can choose to round to the appropriate precision. It may be reasonable to round to within 10 meters or something; see latitude and longitude. This requires just a little bit of math/geometry.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python dictionary for incremental values - python

Related

Get random number from set deprecation

BatchDataset not subscriptable when trying to format Python dictionary as table

Python object and its elements

string comparison for multiple values python

summing entries in a variable based on another variable(make unique) in python lists

Categories

Resources