Pairwise Elements Using Python - Calculating Average of individual elements of array

So I have a query; I am accessing an API that gives the following response:
[["22014",201939,"0021401229","APR 15 2015",Team1 vs. Team2","W",
19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],["22014",201939,"0021401","APR
13 2015",Team1 vs. Team3","W",
15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],["22014",201939,"0021401192","APR
11 2015",Team1 vs. Team4","W",
22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
I could just as easily create 16 different variables, assign zero to each, and then compute and print each average as in the following example:
import json

sum_pts = 0
for n in range(0, len(shots_array)):  # range of games; these lengths vary per player
    sum_pts = sum_pts + float(json.dumps(shots_array[n][24]))
print sum_pts / float(len(shots_array))
Output:
23.75
But I'd rather not create 16 different variables to compute the average of each individual element in this list. I'm looking for an easier way to get the averages for Team1.
I would like the output to eventually look like the following, so that I can apply this to any number of players or individual stats:
Team1 AVGPTS AVGAST AVGSTL AVGREB...
23.75 5.3 2.1 3.2
Or it could be:
Player1 AVGPTS AVGAST AVGSTL AVGREB ...
23.75 5.3 2.1 3.2 ...

To get the averages of the numeric columns of each entry (everything from index 6 onward), you could use the following approach; it avoids defining a separate variable for each column:
data = [
["22014",201939,"0021401229","APR 15 2015", "Team1 vs. Team2","W", 19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],
["22014",201939,"0021401","APR 13 2015","Team1 vs. Team3","W", 15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],
["22014",201939,"0021401192","APR 11 2015","Team1 vs. Team4","W", 22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
length = float(len(data))
values = []
for entry in data:
    values.append(entry[6:])
values = zip(*values)
averages = [sum(v) / length for v in values]
for col in averages:
    print "{:.2f}".format(col),
This would display:
18.67 4.33 11.00 0.40 2.00 6.00 0.50 0.00 0.00 0.00 2.00 2.00 4.00 7.00 5.00 0.00 4.00 1.00 10.00 14.00 1.00
Note: your data is missing an opening quote before each "Team1 vs. Team2" string.
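To reproduce the labelled layout from the question, you could pair the averages with stat names. A small sketch follows; the labels are hypothetical, since the question doesn't document which API column is which:
# Hypothetical stat labels -- the question doesn't say which column is which
labels = ["AVGPTS", "AVGAST", "AVGSTL", "AVGREB"]

print "Team1 " + " ".join(labels)
print " ".join("{:.2f}".format(v) for v in averages[:len(labels)])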

Related

How to sum all rows from multiple columns

I want to perform several operations that are repeated across several columns, but I can't work out how to do it with a list comprehension or a loop.
The dataframe I have is concern_polls and I want to rescale the percentages and the total amounts.
text very somewhat \
0 How concerned are you that the coronavirus wil... 19.00 33.00
1 How concerned are you that the coronavirus wil... 26.00 32.00
2 Taking into consideration both your risk of co... 13.00 26.00
3 How concerned are you that the coronavirus wil... 23.00 32.00
4 How concerned are you that you or someone in y... 11.00 24.00
.. ... ... ...
625 How worried are you personally about experienc... 33.09 36.55
626 How do you feel about the possibility that you... 30.00 31.00
627 Are you concerned, or not concerned about your... 34.00 35.00
628 Are you personally afraid of contracting the C... 28.00 32.00
629 Taking into consideration both your risk of co... 22.00 40.00
not_very not_at_all url
0 23.00 11.00 https://morningconsult.com/wp-content/uploads/...
1 25.00 7.00 https://morningconsult.com/wp-content/uploads/...
2 43.00 18.00 https://d25d2506sfb94s.cloudfront.net/cumulus_...
3 24.00 9.00 https://morningconsult.com/wp-content/uploads/...
4 33.00 20.00 https://projects.fivethirtyeight.com/polls/202...
.. ... ... ...
625 14.92 12.78 https://docs.google.com/spreadsheets/d/1cIEEkz...
626 14.00 16.00 https://www.washingtonpost.com/context/jan-10-...
627 19.00 12.00 https://drive.google.com/file/d/1H3uFRD7X0Qttk...
628 16.00 15.00 https://leger360.com/wp-content/uploads/2021/0...
629 21.00 16.00 https://docs.cdn.yougov.com/4k61xul7y7/econTab...
[630 rows x 15 columns]
The variables very, somewhat, not_very and not_at_all are represented as percentages of the column sample_size, which is not shown in the sample above. The percentages don't always add up to 100%, so I want to rescale them.
To do this, I take the following steps: I calculate the sum of the four columns (the variable sums); I calculate the rescaled percentage for each column (this step could stay a plain variable rather than becoming a new column in the df); and I calculate the final absolute amounts.
The code I have so far is this:
sums = concern_polls['very'] + concern_polls['somewhat'] + concern_polls['not_very'] + concern_polls['not_at_all']
concern_polls['Very'] = concern_polls['very'] / sums * 100
concern_polls['Somewhat'] = concern_polls['somewhat'] / sums * 100
concern_polls['Not_very'] = concern_polls['not_very'] / sums * 100
concern_polls['Not_at_all'] = concern_polls['not_at_all'] / sums * 100
concern_polls['Total_Very'] = concern_polls['Very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Somewhat'] = concern_polls['Somewhat'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_very'] = concern_polls['Not_very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_at_all'] = concern_polls['Not_at_all'] / 100 * concern_polls['sample_size']
I have tried to write this as a list comprehension but I can't get it to work.
Could someone make a suggestion?
The problem I keep finding is that I want to sum rows across several columns, and repeat the same operations over several columns, but those columns are not all of the df's columns.
Thank you.
df[newcolumn] = df.apply(lambda row : function(row), axis=1)
is your friend here I think.
"axis=1" means it does it row by row.
As an example :
concern_polls['Very'] = concern_polls.apply(lambda row: row['very'] / sums * 100, axis=1)
And if you want sums to be the total of each of those df columns it'll be
sums = concern_polls[['very', 'somewhat', 'not_very', 'not_at_all']].sum().sum()
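For what it's worth, a minimal sketch of the loop the question asks for, following the question's row-wise definition of sums and reproducing its column names (it assumes sample_size is a column of concern_polls):
cols = ['very', 'somewhat', 'not_very', 'not_at_all']

# Row-wise sum of just these four columns (axis=1 sums across columns)
sums = concern_polls[cols].sum(axis=1)

for col in cols:
    # 'not_very'.capitalize() == 'Not_very', matching the question's names
    rescaled = concern_polls[col] / sums * 100
    concern_polls[col.capitalize()] = rescaled
    concern_polls['Total_' + col.capitalize()] = (
        rescaled / 100 * concern_polls['sample_size']
    )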

Efficient Numpy Array Multiplication and Reshaping

Is there a simpler way to get this done?
Consider that I have an array of data points of length m - for instance, the amount of rain that accumulated at a weather station over the course of a single day for m many days. Now, we want to add n many small, semi-random perturbations to each day's data to create m * n many perturbed observations. Furthermore, we can divide the day into q many periods, and we have an estimate of the proportion of any day's rain that accumulates in each of those periods; we are assuming that the proportion of rain that accumulates during any period is not dependent on the day.
So I have an array of the daily observations of length m (EDIT: length n was fixed to length m), which when perturbed becomes an array of shape [m,n], and an array of period proportions of length q. What I want now is an array of shape [n,m*q] with one row for each perturbation, where each row is the concatenation of the "period-expanded" perturbed estimates of the daily rainfall observations.
As an example, we can define a toy set of data:
import numpy as np
m = 4
n = 3
q = 5
X = np.arange(m*n).reshape(m,n)
E = (np.arange(1,m*n + 1) /10).reshape(m,n)
X = X - E
Y = np.arange(1,q)
np.random.shuffle(Y)
Y = Y / np.sum(np.arange(1,q))
print(f'X : \n{X}')
print(f'Y : \n{Y}')
which gives us
X :
[[-0.1  0.8  1.7]
 [ 2.6  3.5  4.4]
 [ 5.3  6.2  7.1]
 [ 8.   8.9  9.8]]
Y :
[0.2 0.3 0.1 0.4]
My solution is:
res = (X[:,:,np.newaxis] * Y[np.newaxis,np.newaxis,:]).transpose(1,0,2).reshape(X.shape[1],-1)
print(f'res : \n{res}')
which gives us the appropriate answer:
res :
[[-0.02 -0.03 -0.01 -0.04  0.52  0.78  0.26  1.04  1.06  1.59  0.53  2.12
   1.6   2.4   0.8   3.2 ]
 [ 0.16  0.24  0.08  0.32  0.7   1.05  0.35  1.4   1.24  1.86  0.62  2.48
   1.78  2.67  0.89  3.56]
 [ 0.34  0.51  0.17  0.68  0.88  1.32  0.44  1.76  1.42  2.13  0.71  2.84
   1.96  2.94  0.98  3.92]]
Admittedly it would be easier to expand the observations first and then randomly perturb them, but the order of operations is a hard requirement for this particular problem: the observations must first be perturbed, and then the perturbed observations must be expanded and concatenated.
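For what it's worth, two equivalent one-liners (a sketch; neither appears in the original post) that avoid the explicit transpose:
import numpy as np

# np.kron builds the same block layout directly:
# kron(X.T, Y)[j, i*len(Y) + k] == X[i, j] * Y[k],
# which is exactly what the transpose/reshape version produces
res_kron = np.kron(X.T, Y)

# einsum makes the axis bookkeeping explicit; each row's (m, q) block
# is then flattened
res_einsum = np.einsum('mn,q->nmq', X, Y).reshape(X.shape[1], -1)

assert np.allclose(res_kron, res_einsum)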

Swap and group column names in a pandas DataFrame

I have a data frame with some quantitative columns and one qualitative column. I would like to use describe to compute stats, grouping by the qualitative column. But I do not obtain the order I want for the column levels. Here is an example:
df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want but I would like to swap and group the column index such as:
A B C
qual init final init final init final
Is there a way to do it?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
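Note that sort_index orders level 1 alphabetically, which is why final precedes init above. If you want init first, one option (a sketch, not part of the original answer) is to impose the order explicitly:
# reindex on level 1 keeps the A/B/C grouping but applies a custom
# qual order inside each group instead of the alphabetical one
ddf = ddf.reindex(columns=['init', 'final'], level=1)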
Try ddf.stack().unstack(level=[0,2]) in place of ddf.unstack(level=0).

optimizing the for loop for faster performance

I have a dataframe of 100x100 similarity scores for 100 products against 100 products (data_ibs), and a dataframe with the top-10 most similar products for each product (data_neighbours). I have another dataframe with data at the user and product level (1000x100). For each user and each product, I want to get the top-10 similar products from data_neighbours and their corresponding similarity scores, and compute a function getScore as below:
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)

for i in range(0, len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
        if data.ix[i][j] == 1:
            data_sims.ix[i][j] = 0
        else:
            product_top_names = data_neighbours.ix[product][1:10]
            product_top_sims = data_ibs.ix[product].order(ascending=False)[1:10]
            user_purchases = data_germany.ix[user, product_top_names]
            data_sims.ix[i][j] = getScore(user_purchases, product_top_sims)
How can I optimize this loop for faster processing? The example is taken from here: http://www.salemmarafi.com/code/collaborative-filtering-with-python/
Sample data:
Data (1000x101), user is the 101st column:
Index user song1 song2.....
0 1 0 0
1 33 0 1
2 42 1 0
3 51 0 0
data_ibs(similarity scores)--(100x100):
song1 song2 song3 song4
song1 1.00 0.00 0.02 0.05
song2 0.00 1.00 0.05 0.03
song3 0.02 0.05 1.00 0.11
song4 0.05 0.03 0.11 1.00
data_neighbours(top10 similar songs for each song based on sorted score from data_ibs)--(100x10):
1 2 3......... 10
song1 song5 song10 song4
song2 song8 song11 song5
song3 song9 song12 song10
data_germany (user-level data with each song as a column, except userid)--(1000x100):
index song1 song2 song3
1 0 0 0
2 1 0 0
3 0 0 1
Expected dataset(data_sims)--1000x101:
user song1 song2 song3
1 0.00 0.00 0.22
33 0.09 0.00 0.11
42 0.00 0.10 0.00
51 0.09 0.09 0.00
where, if the value in data for any song is 1, its score is simply set to 0. In the other cases, the top 10 songs are fetched from data_neighbours and their corresponding scores from data_ibs. It is then checked whether those songs are already present for the user (1/0) in the user_purchases dataset. Finally, the similarity score for position i x j is computed as the user_purchases values (1/0 for each top-10 song) multiplied by the similarity scores from data_ibs, divided by the sum of the top-10 similarity scores. The same is repeated for every user x song combination.
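No answer is included in this snippet, but as a hedged sketch of one way to speed this up: loop over the 100 products instead of the 1000 x 100 user/product pairs, and let a matrix product compute every user's score for a product at once (this assumes data_sims rows align with data_germany rows, and uses sort_values in place of the deprecated order()):
import numpy as np

for product in data_germany.columns:
    # top-9 most similar products and their scores; positions 1:10 skip
    # the product itself, mirroring the original loop
    top = data_ibs[product].sort_values(ascending=False)[1:10]
    # 0/1 purchase matrix restricted to those products: (n_users, 9)
    purchases = data_germany[top.index].values
    scores = purchases.dot(top.values) / top.values.sum()
    # zero out the products a user already has, as in the original loop
    owned = data_germany[product].values == 1
    data_sims[product] = np.where(owned, 0.0, scores)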

Opening a space(?) delimited text file in python 2.7?

I have what I think is a space delimited text file that I would like to open and copy some of the data to lists (Python 2.7). This is a snippet of the data file:
0.000000 11.00 737.09 1.00 1116.00
0.001000 14.00 669.29 10.00 613.70
0.002000 15.00 962.27 2.00 623.50
0.003000 7.00 880.86 7.00 800.71
0.004000 9.00 634.67 3.00 1045.00
0.005000 12.00 614.67 3.00 913.33
0.006000 12.00 782.58 6.00 841.00
0.007000 13.00 860.08 6.00 354.00
0.008000 14.00 541.07 4.00 665.25
0.009000 14.00 763.00 6.00 1063.00
0.010000 9.00 790.33 6.00 857.83
0.011000 6.00 899.83 4.00 1070.75
0.012000 16.00 710.88 10.00 809.90
0.013000 12.00 863.50 7.00 923.14
0.014000 9.00 591.67 6.00 633.17
0.015000 12.00 740.58 6.00 837.00
0.016000 10.00 727.60 7.00 758.00
0.017000 12.00 838.75 4.00 638.75
0.018000 9.00 991.33 7.00 731.57
0.019000 12.00 680.75 5.00 1079.40
0.020000 15.00 843.20 3.00 546.00
0.021000 11.00 795.18 5.00 1317.20
0.022000 9.00 943.33 5.00 911.00
0.023000 13.00 711.23 3.00 981.67
0.024000 11.00 922.73 5.00 1111.00
0.025000 1112.00 683.58 6.00 542.83
0.026000 15.00 1053.80 5.00 1144.40
Below is the code I have tried, which does not work. I would like to have two lists, one each from the second and the fourth column.
import csv

listb = []
listd = []
with open('data_file.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for a, b, c, d, e in reader:
        listb.append(int(b))
        listd.append(int(d))
What am I doing wrong?
One alternative is to take advantage of the built-in str.split():
a, b, c, d, e = zip(*((map(float, line.split()) for line in open('data_file.txt'))))
The problem is the multiple spaces between fields (columns).
CSV stands for comma-separated values. Imagine for a second that you are using commas instead of spaces. Line 1 in your file would then look like:
,,,,0.000000,,,,,,,11.00,,,,,,737.09,,,,,,,1.00,,,,,1116.00
So, the CSV reader sees more than 5 fields (columns) in that row.
You have two options:
Switch to using single-space separators
Use a simple split() to deal with runs of whitespace:
listb = []
listd = []
with open('text', 'r') as file:
    for row in file:
        a, b, c, d, e = row.split()
        listb.append(int(b))
        listd.append(int(d))
P.S.: Once this part is working, you will run into a problem calling int() on strings like "11.00", which aren't really integers. So I recommend using something like:
int(float(b))
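If you'd rather stay with the csv module, its skipinitialspace option tells the reader to ignore the whitespace that follows each delimiter, which copes with runs of spaces; a sketch, not part of the original answers:
import csv

listb = []
listd = []
with open('data_file.txt', 'r') as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    for a, b, c, d, e in reader:
        listb.append(int(float(b)))
        listd.append(int(float(d)))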
f=open("input.txt",'r')
x=f.readlines()
list1=[]
list2=[]
import re
for line in x:
pattern=re.compile(r"(\d+)(?=\.)")
li=pattern.findall(line)
list1.append(li[1])
list2.append(li[3])
You can use this if you only want to capture the integer parts and not the floats.
You can find all the values you need using a regexp:
import re

list_b = []
list_d = []
with open('C://data_file.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1]))  # appends second column
        list_d.append(float(list_line[3]))  # appends fourth column

print list_b
print list_d
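For completeness, a sketch (not from the thread) using numpy, whose loadtxt splits each line on any run of whitespace by default:
import numpy as np

# loadtxt parses the whole file into a (rows, 5) float array
data = np.loadtxt('data_file.txt')
listb = data[:, 1].astype(int).tolist()  # second column
listd = data[:, 3].astype(int).tolist()  # fourth column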
