I have some data in a pandas dataframe like so:
| Data            |
|-----------------|
| 10-9 8-6 100-2  |
| 1-2 3-4         |
| 55-45           |
Now my question is, using pandas, what is the best way to do the following:
Calculate the average of the first numbers before the hyphen, and the average of the numbers after the hyphen.
Subtract the second average from the first, and place the result in a new column.
For example, for the first row, the value in the new column will be: average(10, 8, 100) - average(9, 6, 2)
I am guessing I will need to use some sort of lambda function, but I am not sure how to go about it.
Any help is appreciated. Thank you!
Make a function to contain the string parsing logic:
import pandas as pd
import numpy as np
def string_handling(string):
    values = [it for it in string.strip().split(' ') if it]
    values = [v.split('-') for v in values]
    first_values = [int(v[0]) for v in values]
    second_values = [int(v[1]) for v in values]
    return pd.Series([np.mean(first_values), np.mean(second_values)])
Apply the function:
df[['first_value','second_value']] = df['Data'].apply(string_handling)
df['diff'] = df['first_value'] - df['second_value']
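If you would rather avoid a row-wise apply, a rough vectorized sketch along the same lines should also work; this is an untested alternative that leans on str.extractall (same df as above assumed):
import pandas as pd

df = pd.DataFrame({'Data': [" 10-9 8-6 100-2 ", " 1-2 3-4 ", " 55-45 "]})

# extractall pulls every number pair into a long frame (one row per pair)
pairs = df['Data'].str.extractall(r'(\d+)-(\d+)').astype(int)
means = pairs.groupby(level=0).mean()   # column 0: before the hyphen, column 1: after
df['diff'] = means[0] - means[1]
print(df[['Data', 'diff']])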
This might do the trick. split() takes care of all the whitespace, and a list comprehension walks through the tokens it produces (e.g. ['10-9', '8-6', '100-2']).
In [37]: df = DataFrame({'Data': [" 10-9 8-6 100-2 ",
" 1-2 3-4 ",
" 55-45 "]})
In [38]: def process(cell):
    ...:     avg = []
    ...:     for i in range(2):
    ...:         l = [int(x.split("-")[i]) for x in cell.split()]
    ...:         avg.append(sum(l) * 1. / len(l))
    ...:     return avg[0] - avg[1]
    ...:
In [39]: df['Data'].apply(process)
Out[39]:
0 33.666667
1 -1.000000
2 10.000000
Name: Data, dtype: float64
Hope this helps!
I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameters in order to select data from another one; e.g.
t_v = [1:10];   // a parameter vector
p_v = [20:29];  // another parameter vector
res_v(t_v > 5 & p_v < 28); // the res_v elements whose "corresponding" t_v and p_v values comply with my criteria; I can use it for analyses.
This is very direct and simple in SciLab; I have not found a way to achieve the same in Python, either "Pythonically" or as a direct translation.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1 > 1) & (par2 < 0)

# example: filter directly with a single condition
filtered_data = data[par1 > 1]
print(filtered_data)

# filtering with the two parameters
filtered_data_twice = data[bool_filter]
print(filtered_data_twice)
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
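To tie this back to the question's variables, a minimal sketch (the vectors below are assumed, since the original data is not shown):
import numpy as np

t_v = np.arange(1, 11)     # 1..10, as in the SciLab example
p_v = np.arange(20, 30)    # 20..29
res_v = np.arange(30, 40)  # placeholder result vector

print(res_v[(t_v > 5) & (p_v < 28)])   # -> [35 36 37]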
Here's my modified solution according to your last comment.
t_v = list(range(1, 10))
p_v = list(range(20, 29))
res_v = list(range(30, 39))

def first_index_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count

def first_index_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since lst was scanned in reverse,
                                     # the count has to be reversed as well

t_v_index = first_index_greater_than(5, t_v)
p_v_index = first_index_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It returns the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe whose columns are the original lists, applies the selection criteria (based on the tv and pv columns) to each row, and stores the result in a new boolean column marking the rows where the combined criteria is satisfied.
[35, 36, 37]
5 35
6 36
7 37
I am grouping columns and identifying rows that have different values for each group. For example: I can group columns A, B, C, D and delete column A because it is different (Row 2 is 2.1). Also, I can group columns E, F, G, H and delete column G because it is different (Row 0 is Blue).
  |  A  | B |  C   |   D    |   E   |   F   |   G   |   H
--|-----|---|------|--------|-------|-------|-------|-------
0 | 1.0 | 1 | 1 in | 1 inch | Red   | Red   | Blue  | Red
1 | 2.0 | 2 | 2 in | 2 inch | Green | Green | Green | Green
2 | 2.1 | 2 | 2 in | 2 inch | Blue  | Blue  | Blue  | Blue
What I have tried so far to compare values:
import difflib
text1 = '1.0'
text2 = '1 in'
text3 = '1 inch'
output = str(int(difflib.SequenceMatcher(None, text1, text2, text3).ratio()*100))
output: '28'
This does not work well to compare numbers followed by a measurement like inches or mm. I then tried spacy.load('en_core_web_sm') and that works better, but it's still not quite there. Are there any ways to compare a group of values that are similar, like 1.0, 1, 1 in, 1 inch?
For columns with only strings, you can use pandas' df.equals(), which compares two dataframes or series (columns).
#Example
df.E.equals(df.F)
You can use this function to compare many columns to a single one, which I call the main or template column; it should be the column where you have the "correct" values.
def col_compare(main_col, *to_compare):
    '''Compares each column from a list to another column
    Inputs:
        * main_col: enter the column name (e.g. 'A')
        * to_compare: enter as many column names as you want (e.g. 'B', 'C') '''
    # Columns to compare to list
    to_compare = list(to_compare)
    # List to store results
    results = []
    # Compare columns from the list with the template column
    for col in to_compare:
        if not df[main_col].equals(df[col]):
            results.append(col)
    print(f'Main Column: {main_col}')
    print(f'Compared to: {to_compare}')
    return f"The columns that have different values from {main_col} are {results}"
e.g.
`col_compare('E', 'F', 'G', 'H')`
output:
Main Column: E
Compared to: ['F', 'G', 'H']
The columns that have different values from E are ['G']
For columns A, B, C and D, where the numbers you want to compare are followed by pieces of strings, one option is to extract the numbers into new columns just for the comparison; you can drop them later.
You can create a new column like the one below for each column that mixes numbers and strings (remember to import re first):
df['C_num'] = df.C.apply(lambda x: int(re.search('[0-9]+', x).group()))
and then use the function col_compare above to run the comparison between the numeric columns.
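Putting the two ideas together for the numeric columns, a rough sketch (the df below is rebuilt from the question's table, so treat the exact values as an assumption):
import pandas as pd

df = pd.DataFrame({
    'A': ['1.0', '2.0', '2.1'],
    'B': ['1', '2', '2'],
    'C': ['1 in', '2 in', '2 in'],
    'D': ['1 inch', '2 inch', '2 inch'],
})

# extract the leading number from every cell, for comparison only
num = df.apply(lambda col: col.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float))

# columns whose numeric values differ from the reference column 'A'
print([c for c in num.columns if not num[c].equals(num['A'])])   # ['B', 'C', 'D']
Here every other column disagrees with A, which matches the observation in the question that A is the odd one out (2.1 vs 2).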
I found an answer to my question. Crystal L recommended that I use FuzzyMatch and I found it to be useful. Here is the documentation: https://www.datacamp.com/community/tutorials/fuzzy-string-python Here are a couple of things I tried:
# Function to compare length and similar characters
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Populate matrix of zeros with the indices of each character of both strings
    for i in range(1, rows):
        for k in range(1, cols):
            distance[i][0] = i
            distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s) + len(t)) - distance[row][col]) / (len(s) + len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])
Str1= '1 mm'
Str2= '1 in'
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
import Levenshtein as lev
Str1= '1 mm'
Str2= '1 in'
Distance = lev.distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
print(Ratio)
# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
Str1= '2 inches'
Str2= '1 mm'
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
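As a rough follow-up, the same ratio can be applied row-wise over a group of columns to flag the odd one out. This is only a sketch, with the E-H values rebuilt from the question's table:
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({
    'E': ['Red', 'Green', 'Blue'],
    'F': ['Red', 'Green', 'Blue'],
    'G': ['Blue', 'Green', 'Blue'],
    'H': ['Red', 'Green', 'Blue'],
})

def odd_one_out(row, cols):
    # total similarity of each cell against the others; the lowest total is the outlier
    scores = {c: sum(fuzz.ratio(str(row[c]), str(row[o])) for o in cols if o != c)
              for c in cols}
    return min(scores, key=scores.get)

df['outlier'] = df.apply(lambda r: odd_one_out(r, ['E', 'F', 'G', 'H']), axis=1)
print(df)   # row 0 is flagged as 'G'; rows where all values match pick a column arbitrarily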
I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]0, t_max]:
    if |s(t) - s*| > x:
        s* = s(t)
        print s*
Naively, this can be coded in Python as follows:
import pandas as pd
def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.iteritems():
        if s_last is None:
            s_last = value
        if value - s_last > x or s_last - value > x:
            changes += [index, value]
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
value value_shifted abs_change change_exceeds_threshold
0 0.005382 NaN NaN 0
1 0.060954 0.005382 0.055573 0
2 0.090456 0.060954 0.029502 0
3 0.603118 0.090456 0.512661 1
4 0.178681 0.603118 0.424436 1
5 0.597814 0.178681 0.419133 1
6 0.976092 0.597814 0.378278 1
7 0.660010 0.976092 0.316082 0
8 0.805768 0.660010 0.145758 0
9 0.698369 0.805768 0.107400 0
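If you only want the points where the jump is flagged, a small follow-up on the same frame:
# keep just the rows whose change from the previous value exceeds the threshold
flagged = data[data['change_exceeds_threshold'] == 1]
print(flagged[['value', 'abs_change']])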
I don't think the pseudo code can be vectorized because the next state of s* is dependent on the last state. There's a pure python solution (1 iteration):
import random
import pandas as pd
s = [random.randint(0,100) for _ in range(100)]
res = [] # record changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])

df = pd.DataFrame(res, columns=['index', 'value'])
I think there's no way to run faster than O(N) in this case.
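One practical middle ground if Cython/Numba are off the table: iterate over the Series' underlying NumPy array instead of using iteritems, which keeps the same O(N) logic but usually runs noticeably faster. A minimal sketch (the data and threshold are made up):
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))
x = 0.1

values = s.to_numpy()
ref = values[0]
change_positions = []
for i, v in enumerate(values):
    if abs(v - ref) > x:
        ref = v
        change_positions.append(i)

changes = s.iloc[change_positions]   # original index and value at each change point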
I have training data like the example below, which has all the information under a single column. The data set has over 300,000 rows.
id features label
1 name=John Matthew;age=25;1.=Post Graduate;2.=Football Player; 1
2 name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games; 1
3 name=David;age=12;1:=High School;2:=Cricketer;native=america; 2
4 name=George;age=11;1:=High School;2:=Carpenter;married=yes 2
.
.
300000 name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No 3
Now I need to convert this training data into the format below.
id name age 1 2 Interest married Smoker
1 John Matthew 25 Post Graduate Football Player Nan Nan Nan
2 Mark clark 21 Under Graduate Nan Video Games Nan Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete.
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1, len(array)):
        var_split = array[i].split(";")
        try:
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k, v in count.most_common(5):
        key = k.replace(":", "").replace(" ", "_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns=np.append(train.columns.values, columns_add))
print(train.columns)
print(train.shape)
#Adding the values for the newly created columns
for row in train.itertuples():
    dummy_dict = {}
    new_dict = {}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try:
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":", "").replace(" ", "_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k, v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there a useful function that I can apply here for efficient feature extraction?
Create two sample DataFrames meeting your criteria (in the first, all the features are the same for every data point; the second is a modification of the first, introducing different features for some data points):
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
df = pd.DataFrame()
df.loc[:num, 0]= ["name={0};age={1};feature1={2};feature2={3}"\
.format(NAMES[np.random.randint(0, len(NAMES))],\
AGES[np.random.randint(0, len(AGES))],\
FEATURES1[np.random.randint(0, len(FEATURES1))],\
FEATURES2[np.random.randint(0, len(FEATURES2))]) for i in xrange(num)]
df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]
df.rename(columns={0:"features"}, inplace=True)
print df.head(20)
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
mod_df = df
random_positions1 = random.sample(xrange(10), 5)
random_positions2 = random.sample(xrange(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}"\
.format(NAMES[np.random.randint(0, len(NAMES))],\
AGES[np.random.randint(0, len(AGES))],\
INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in xrange(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}"\
.format(NAMES[np.random.randint(0, len(NAMES))],\
AGES[np.random.randint(0, len(AGES))],\
SMOKING[np.random.randint(0, len(SMOKING))]) for i in xrange(len(random_positions2))]
print mod_df.head(20)
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    lista = y.split('=')
    value = lista[1]
    return value

def function(x):
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
new_df = df.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print 'Total time:', end - start
Total time: 1.80923295021
Edit: The one thing you need to do is edit the columns list accordingly.
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]

def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)
Suppose your data is like this:
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
"name=David;age=12;1:=High School;2:=Cricketer;",
"name=George;age=11;1:=High School;2:=Carpenter;",
'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start from this answer and try to replace all the separators (name=, ;age=, ;1:=, ;2:=) with ';'
using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, each value will be a list of features, where you can get each one by index.
Then I use 4 custom functions to get each attribute: name, age, grade, job.
Note: there may be a better way to do this using only one function; a sketch follows at the end of this answer.
def get_name(df):
    return df['features'][1]

def get_age(df):
    return df['features'][2]

def get_grade(df):
    return df['features'][3]

def get_job(df):
    return df['features'][4]
And finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
age = df.apply(get_age, axis=1),
grade = df.apply(get_grade, axis=1),
job = df.apply(get_job, axis=1))
Hope this will be quick and fast
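For the "only one function" note above, a hypothetical variant; it assumes the list layout produced by replace_feature, where the four fields sit at positions 1-4:
expanded = pd.DataFrame(df['features'].apply(lambda lst: lst[1:5]).tolist(),
                        columns=['name', 'age', 'grade', 'job'], index=df.index)
df = pd.concat([df, expanded], axis=1)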
As far as I understand your code, the poor performance comes from the fact that you create the dataframe element by element. It's better to create the whole dataframe at once with a list of dictionaries.
Let's recreate your input dataframe:
from StringIO import StringIO
data=StringIO("""id features label
1 name=John Matthew;age=25;1.=Post Graduate;2.=Football Player; 1
2 name=Mark clark;age=21;1.=Under Graduate;2.=Football Player; 1
3 name=David;age=12;1:=High School;2:=Cricketer; 2
4 name=George;age=11;1:=High School;2:=Carpenter; 2""")
df=pd.read_table(data,sep=r'\s{3,}',engine='python')
we can check:
print df
id features label
0 1 name=John Matthew;age=25;1.=Post Graduate;2.=F... 1
1 2 name=Mark clark;age=21;1.=Under Graduate;2.=Fo... 1
2 3 name=David;age=12;1:=High School;2:=Cricketer; 2
3 4 name=George;age=11;1:=High School;2:=Carpenter; 2
Now we can create the needed list of dictionaries with the following code:
feat = []
for line in df['features']:
    line = line.replace(':', '.')
    lsp = line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe:
print pd.DataFrame(feat)
1. 2. age name
0 Post Graduate Football Player 25 John Matthew
1 Under Graduate Football Player 21 Mark clark
2 High School Cricketer 12 David
3 High School Carpenter 11 George
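For reference, a compact Python 3 variant of the same idea; it assumes df['features'] holds the raw strings from the question and a reasonably recent pandas that builds a frame straight from a list of dicts:
import pandas as pd

def to_dict(s):
    # split "key=value" pairs, normalizing ':' and ignoring empty trailing fragments
    return dict(pair.split('=', 1) for pair in s.replace(':', '.').split(';') if '=' in pair)

wide = pd.DataFrame(df['features'].apply(to_dict).tolist(), index=df.index)
result = pd.concat([df.drop(columns='features'), wide], axis=1)
print(result)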
I have two dictionaries of data, from 2016 and 2017 respectively, which have the same 5 keys. I want to calculate the percentage that each key's value represents of the sum of the values in its dictionary, and then join the two percentages for each key to a label. I have managed to do so below, but my method requires a lot of for looping and seems somewhat clunky. I am looking for ways of condensing or rewriting my code to make it more efficient.
UsersPerCountry, UsersPerPlatform, UsersPerPlatform2016, UsersPerPlatform2017 = Analytics.UsersPerCountryOrPlatform()
labels = []
sizes16 = []
sizes17 = []
sumc1 = 0
sumc2 = 0
percentages = []
for k, v in dict1.iteritems():
    sumv1 += v
for k, v in dict1.iteritems():
    v1 = round(((float(v) / sumc1) * 100), 1)
    percentages.append(v1)
    labels.append(k)
    sizes16.append(c)
for k, v in dict2.iteritems():
    sumv1 += v
for k, v in dict2.iteritems():
    v2 = round(((float(v) / sumc1) * 100), 1)
    percentages.append(v2)
    sizes17.append(c)
for i in range(5):
    labels[i] += (', ' + str(percentages[i]) + '%' + ', ' + str(percentages[i + 5]) + '%')
This is what the label looks like:
EDIT: I have now added the variable declaration. I thought the hashed line about setting all variables to empty lists or 0 would suffice.
You could use pandas' DataFrame class to simplify things. I am a bit unsure of how your percentages are being calculated, so that may need to be worked out a bit, but otherwise try this:
import pandas as pd
#convert data to DataFrame class
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
#compute the percentages
percnt1 = df1.sum(axis=0).div(df1.sum().sum())
percnt2 = df2.sum(axis=0).div(df2.sum().sum())
#to get the sum:
percnt1 + percnt2
Here's an example:
## create a data frame:
import numpy as np
df1 = pd.DataFrame({'Android':np.random.poisson(10,100), 'iPhone':np.random.poisson(10,100),
'OSX':np.random.poisson(10,100), 'WEBGL':np.random.poisson(10,100), 'Windows':np.random.poisson(10,100)})
In [11]: df1.head()
Out[11]:
Android OSX WEBGL Windows iPhone
0 12 12 9 9 5
1 9 8 14 7 11
2 12 10 7 10 11
3 11 12 7 17 5
4 15 16 15 11 13
In [10]: df1.sum(axis=0).div(df1.sum(axis=0).sum())
Out[10]:
Android 0.205279
OSX 0.198782
WEBGL 0.200609
Windows 0.198376
iPhone 0.196954
dtype: float64
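Building the combined labels from two such Series is then a one-liner per key. A small sketch with made-up 2016/2017 counts (the real dictionaries are not shown in the question):
import pandas as pd

dict1 = {'Android': 120, 'iPhone': 80, 'OSX': 40, 'WEBGL': 30, 'Windows': 130}  # 2016, made up
dict2 = {'Android': 150, 'iPhone': 90, 'OSX': 50, 'WEBGL': 20, 'Windows': 90}   # 2017, made up

s1, s2 = pd.Series(dict1), pd.Series(dict2)
pct1 = (s1 / s1.sum() * 100).round(1)
pct2 = (s2 / s2.sum() * 100).round(1)

labels = ['{}, {}%, {}%'.format(k, pct1[k], pct2[k]) for k in s1.index]
print(labels)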
Without Pandas:
You should take advantage of some of Python's built-in features, as well as functions. Here I'm trying to replicate what you're doing to be a little more Pythonic.
Note this is untested because you didn't give a full code snippet (sumc1 and c were undeclared). I wrote this based on what I think you're trying to do.
# Your size16/size17 lists appear to be full of the constant c
# can use Python's list replication operation
sizes16 = [c]*len(dict1)
sizes17 = [c]*len(dict2)
# define function for clarity / reduce redundancy
def get_percentages(l):
    s = sum(l)
    # percentage calculation is a great place for a list comprehension
    percentages = [round(((float(n) / s) * 100), 1) for n in l]
    return percentages
# can grab the labels directly, rather than in a loop
labels = dict1.keys()
percentages1 = get_percentages(dict1.values())
percentages2 = get_percentages(dict2.values())
# no magic number 5
for i in range(len(labels)):
    labels[i] += (', ' + str(percentages1[i]) + '%' + ', ' + str(percentages2[i]) + '%')
That last line could be cleaned up if I had a better idea of what you were doing.
I haven't looked closely, but this code may run over the data an extra once or twice, so it may be a little less efficient. However, it's much more readable IMO.
Here's a way to go without an external library. You don't mention any problems in the way the code runs, just its aesthetics (which one could argue has an effect on the way it runs). Anyway, this looks clean:
# Sample data
d1 = {'a':1.,'b':6.,'c':10.,'d':5.}
d2 = {'q':10.,'r':60.,'s':100.,'t':50.}
# List comprehension for each dictionary sum
sum1 = sum([v for k,v in d1.items()])
sum2 = sum([v for k,v in d2.items()])
# Using maps and lambda functions to get the distributions of each dictionary
d1_dist = list(map(lambda x: round(x/sum1*100, 1), d1.values()))
d2_dist = list(map(lambda y: round(y/sum2*100, 1), d2.values()))
# Insert your part with the labels here (I really didn't get that part)
>>> print(d1_dist)
[4.5, 45.5, 27.3, 22.7]
And if you want to join the original keys from a dictionary to these new distribution values, just use:
d1_formatted = dict(zip(list(d1.keys()), d1_dist))
>>> print(d1_formatted)
{'a': 4.5, 'c': 45.5, 'b': 27.3, 'd': 22.7}