I have a large DataFrame (thousands of rows and 20 columns) and I want to calculate the average (or any other mathematical function, like the total sum) over all columns. Example:
x = [
    [0.5, 0.7, 0.1, 4, 80, 101],
    [0.1, 0.7, 0.8, 5, 4, 58],
    [0.4, 0.1, 0.6, 6, 1, 66],
    ...
    [0.9, 0.4, 0.1, 7, 44, 12]
]
This should result in
avg = [0.475, 0.95, ...]
or
sum = [15.1, 8.17, ...]
Is there any quick formula or one-liner that can easily apply this calculation? It does not have to be a pandas DataFrame; a NumPy array is also fine.
df.mean(axis=0)
df.sum(axis=0)
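For context, a minimal runnable sketch (the data below just mimics the question's x) showing both the pandas calls and their NumPy equivalents:

import numpy as np
import pandas as pd

# Sample data shaped like the question's x
df = pd.DataFrame([[0.5, 0.7, 0.1, 4, 80, 101],
                   [0.1, 0.7, 0.8, 5, 4, 58],
                   [0.4, 0.1, 0.6, 6, 1, 66],
                   [0.9, 0.4, 0.1, 7, 44, 12]])

avg = df.mean(axis=0)    # column-wise mean, one value per column
total = df.sum(axis=0)   # column-wise sum

# NumPy equivalents on a plain array
x = df.to_numpy()
avg_np = x.mean(axis=0)
sum_np = x.sum(axis=0)

axis=0 aggregates down the rows, producing one value per column; axis=1 would aggregate across the columns instead.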
I have a dataframe A that looks like this:
value  Frequency
0.1    3
0.2    2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A lists the samples and their frequency (repetition count), and dataframe B literally expands that. Is there a straightforward way to do this?
What I did (minimal working example reproducing the above):
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value] * Freq
Data = pd.DataFrame({'Sample': pd.Series(Sample)})
You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
Use repeat
>>> df['value'].repeat(df.Frequency)
0 0.1
0 0.1
0 0.1
1 0.2
1 0.2
Name: value, dtype: float64
Or create a new DataFrame with:
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
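Note that reindex keeps the repeated index labels (0, 0, 0, 1, 1); a sketch of tidying the result into the Sample frame from the question (names assumed) might be:

B = (X.reindex(X.index.repeat(X.Frequency))   # repeat each row Frequency times
       .reset_index(drop=True)                # discard the duplicated index
       [['value']]
       .rename(columns={'value': 'Sample'}))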
I have a 2D numpy array with repeated values in the first column. The repeated values can have any corresponding value in the second column. It's easy to take a cumulative sum with numpy, but here I need the sum of the second-column values over every group of repeated first-column values. How can we do this efficiently using numpy or pandas?
Below I solve the problem with an inefficient for-loop; I was wondering if there is a more elegant solution.
Question
How can we get the same result more efficiently?
Help will be appreciated.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Imports
import pandas as pd
import numpy as np

np.random.seed(42)  # make results reproducible
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10) * 0.1
unq = np.unique(aa)
ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)
for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]
print(ans)
"""
# given data
idx col0 col1
0 7. 0.0
1 15. 0.1
2 11. 0.2
3 8. 0.3
4 7. 0.4
5 19. 0.5
6 11. 0.6
7 11. 0.7
8 4. 0.8
9 8. 0.9
# sorted data
4. 0.8
7. 0.0
7. 0.4
8. 0.9
8. 0.3
11. 0.6
11. 0.7
11. 0.2
15. 0.1
19. 0.5
# cumulative sum for repeated serial
4. 0.8
7. 0.0 + 0.4
8. 0.9 + 0.3
11. 0.6 + 0.7 + 0.2
15. 0.1
19. 0.5
# Required answer
4. 0.8
7. 0.4
8. 1.2
11. 1.5
15. 0.1
19. 0.5
"""
You can groupby col0 and find the .sum() for col1.
df.groupby('col0')['col1'].sum()
Output:
col0
4.0 0.8
7.0 0.4
8.0 1.2
11.0 1.5
15.0 0.1
19.0 0.5
Name: col1, dtype: float64
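If you want the result back as a two-column DataFrame rather than a Series with col0 as the index, a small variant (assuming df holds col0 and col1 as above) is:

df.groupby('col0', as_index=False)['col1'].sum()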
I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:
np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])
which returns:
array([[ 4. , 0.8],
[ 7. , 0.4],
[ 8. , 1.2],
[ 11. , 1.5],
[ 15. , 0.1],
[ 19. , 0.5]])
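If speed matters on large arrays, a fully vectorized sketch with no Python-level loop (assuming aa and bb as defined in the question) could use np.unique with return_inverse plus np.bincount:

unq, inv = np.unique(aa, return_inverse=True)  # inv maps each aa entry to its group
sums = np.bincount(inv, weights=bb)            # sum of bb within each group
result = np.column_stack([unq, sums])          # same 2-column layout as above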
I have a numpy vector my_indexes of size 1×N containing boolean values, and a 2D array my_array of size M×K, where K << N. The boolean vector marks the columns that I removed from (or kept in) my_array, and I want to add those columns back, filled with zeros (or NaNs). My code for removing the columns:
cols = np.all(np.isnan(my_array), axis=0)  # remember which columns are all-NaN
my_array = my_array[:, ~cols]
my_array = some_process(my_array)
# How can I add the removed columns back?
My array is of size M×N before the removal and M×K afterwards. How can I fill the removed columns again with NaN or zeros?
An example could be:
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
Firstly, I want to remove the nan columns using my_array = my_array[:, ~np.all(np.isnan(my_array), axis=0)]:
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
And the my_indexes vector is:
False True False False True False False
Then I want to process the matrix and afterwards add the nan columns back (note that the processing cannot happen with the nan columns present). I guess I need to use the np.insert function, but how can I do so using my boolean vector?
You can probably use masked arrays for that:
import numpy as np
import numpy.ma as ma
def some_process(x):
    return x**2
x = np.arange(9, dtype=float).reshape(3, 3)
x[:,1] = np.nan
print(x)
# [[ 0. nan 2.]
# [ 3. nan 5.]
# [ 6. nan 8.]]
# mask all np.nan and np.inf
masked_x = ma.masked_invalid(x)
# Compute the process only on the unmasked values and fill back np.nan
x = ma.filled(some_process(masked_x), np.nan)
print(x)
# [[ 0. nan 4.]
# [ 9. nan 25.]
# [ 36. nan 64.]]
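Alternatively, if you prefer the drop-then-process flow from the question, a sketch that rebuilds the full-width array from the boolean column mask (some_process is a stand-in, as above) could be:

import numpy as np

def some_process(x):
    return x ** 2  # stand-in for the real processing step

my_array = np.array([[0.1, np.nan, 0.3],
                     [0.4, np.nan, 0.6]])

cols = np.all(np.isnan(my_array), axis=0)      # True for the all-NaN columns
processed = some_process(my_array[:, ~cols])   # work on the reduced M x K array
restored = np.full(my_array.shape, np.nan)     # M x N frame of NaNs (or zeros)
restored[:, ~cols] = processed                 # reinsert processed columns in place

The same assignment works with any boolean vector like my_indexes, as long as its length matches the original number of columns.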
I was wondering if there is a better way to compute probabilities over a 2D numpy array, maybe using some of numpy's built-in functions.
For simplicity, say we have an example array:
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
I would like to get the probabilities, such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here 'juice' as the second word has a probability of 0.2, since 'apple' occurs with probability 2/5 and 'juice' follows 'apple' with probability 1/2. On the other hand, 'pie' as a second word has a probability of 0.4: the combined probability from 'apple' and 'orange'.
The way I approached the problem was to add 3 new columns to the array: the probability of the 1st column, the probability of the 2nd column, and the final probability. I group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  ## ADD 3 COLUMNS to a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I was wondering if there is a shorter/faster way of doing this using numpy or other means. Adding the columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.:
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of the first word and divide by the total number of rows
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# Divide 1 by the total repetition of the first word
df[3] = 1 / (df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4] = df[2] * df[3]
Output:
0 1 2 3 4
0 apple pie 0.4 0.5 0.2
1 apple juice 0.4 0.5 0.2
2 orange pie 0.2 1.0 0.2
3 strawberry cream 0.4 0.5 0.2
4 strawberry candy 0.4 0.5 0.2
If you want that in the form of a list, you can use df.values.tolist().
If you don't want the intermediate columns, then:
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
0 1 2
0 apple pie 0.2
1 apple juice 0.2
2 orange pie 0.2
3 strawberry cream 0.2
4 strawberry candy 0.2
For the combined probability, print(df.groupby(1)[2].sum()):
candy 0.2
cream 0.2
juice 0.2
pie 0.4
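If you would rather stay in plain NumPy, a vectorized sketch of the same computation (assuming a as defined in the question) could be:

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],
              ['strawberry','cream'],['strawberry','candy']])

# P(first word) = count of each first word / total rows
u0, inv0, c0 = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = c0[inv0] / len(a)

# P(second | first) = 1 / count(first), because every (first, second)
# pair is unique in this data; in general you would count pairs per group
p_cond = 1 / c0[inv0]

p = p_first * p_cond  # one probability per row, matching the table above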
I am trying to print the items of two separate lists so that items in list_1 align with items in list_2.
Here is my attempt:
import numpy as np

list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.4, 0.1)
for x in list_1:
    j = x / 2.0
    for y in list_2:
        print(j, ',', y)
My Output:
0.5 , 0.1
0.5 , 0.2
0.5 , 0.3
0.5 , 0.4
1.0 , 0.1
1.0 , 0.2
1.0 , 0.3
1.0 , 0.4
1.5 , 0.1
1.5 , 0.2
1.5 , 0.3
1.5 , 0.4
2.0 , 0.1
2.0 , 0.2
2.0 , 0.3
2.0 , 0.4
Desired Output:
0.5 , 0.1
1.0 , 0.2
1.5 , 0.3
2.0 , 0.4
What you want is zip().
Example:
>>> l1 = range(10)
>>> l2 = range(20, 30)
>>> for x, y in zip(l1, l2):
...     print(x, y)
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
Explanation:
zip receives iterables and iterates over all of them at once, starting from element 0 of each, then element 1, and so on. Once any of the iterables reaches its end, zip stops; you can use zip_longest from itertools (izip_longest in Python 2) to fill the missing items with None (or do fancier things, but that is for a different question).
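Applied to the question's lists, and with zip_longest for inputs of unequal length, a sketch might look like:

import numpy as np
from itertools import zip_longest  # izip_longest in Python 2

list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.4, 0.1)

for x, y in zip(list_1, list_2):
    print(x / 2.0, ',', y)

# zip stops at the shorter input; zip_longest pads the shorter one with None
for x, y in zip_longest([1, 2, 3], ['a']):
    print(x, y)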