I am trying to copy an array and replace all values in the copy below a threshold, while keeping the original array intact.
Here is a simplified example of what I need to do.
import numpy as np
A = np.arange(0,1,.1)
B = A
B[B<.3] = np.nan
print ('A =', A)
print ('B =', B)
Which yields
A = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
B = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
I can't understand why the values in A below .3 are also overwritten.
Can someone explain this to me and suggest a work around?
Change B = A to B = A.copy() and this should work as expected. As written, B and A refer to the same object in memory.
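A minimal sketch of the fix, reusing the question's arrays:

```python
import numpy as np

A = np.arange(0, 1, .1)
B = A.copy()          # independent copy; plain B = A would alias the same array
B[B < .3] = np.nan

print('A =', A)       # A keeps all of its original values
print('B =', B)       # only B has NaN below the threshold
```

`A.copy()` allocates new memory, so the boolean-mask assignment on B no longer touches A.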
I am using pandas to get subgroup averages, and the basics work fine. For instance,
import numpy as np
import pandas as pd
from pprint import pprint
d = np.array([[1,4],[1,1],[0,1],[1,1]])
m = d.mean(axis=1)
p = pd.DataFrame(m,index='A1,A2,B1,B2'.split(','),columns=['Obs'])
pprint(p)
x = p.groupby([v[0] for v in p.index])
pprint(x.mean())
x = p.groupby([v[1] for v in p.index])
pprint(x.mean())
YIELDS:
Obs
A1 2.5
A2 1.0
B1 0.5
B2 1.0
Obs
A 1.75 <<<< 1.75 is (2.5 + 1.0) / 2
B 0.75
Obs
1 1.5
2 1.0
But, I also need to know how much A and B (1 and 2) deviate from their common mean. That is, I'd like to have tables like:
Obs Dev
A 1.75 0.50 <<< deviation of the Obs average, i.e., 1.75 - 1.25
B 0.75 -0.50 <<< 0.75 - 1.25 = -0.50
Obs Dev
1 1.5 0.25
2 1.0 -0.25
I can do this using loc, apply etc - but this seems silly. Can anyone think of an elegant way to do this using groupby or something similar?
Aggregate the means, then compute the difference to the mean of means:
(p.groupby(p.index.str[0])
.agg(Obs=('Obs', 'mean'))
.assign(Dev=lambda d: d['Obs']-d['Obs'].mean())
)
Or, in case of a variable number of items per group, if you want the difference to the overall mean (not the mean of means!):
(p.groupby(p.index.str[0])
.agg(Obs=('Obs', 'mean'))
.assign(Dev=lambda d: d['Obs']-p['Obs'].mean()) # notice the p (not d)
)
output:
Obs Dev
A 1.75 0.5
B 0.75 -0.5
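The same pattern gives the question's second table, grouping on the second character of the index; a sketch reusing the question's setup:

```python
import numpy as np
import pandas as pd

d = np.array([[1, 4], [1, 1], [0, 1], [1, 1]])
p = pd.DataFrame(d.mean(axis=1), index='A1,A2,B1,B2'.split(','), columns=['Obs'])

# group by the second character ('1' or '2'), then deviation from the mean of means
out = (p.groupby(p.index.str[1])
        .agg(Obs=('Obs', 'mean'))
        .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean()))
print(out)
```

This reproduces the 1.5/0.25 and 1.0/-0.25 rows from the desired output.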
I have written a Python function for a calculation that uses the two data sets below. I want to calculate z for every value in data_2 using each row (row1, row2, row3, ...) of data_1. However, as I am new to Python, my attempt fails partway through. Please help. Thanks.
data_1                        data_2
file    a    b    c    d      x
file1  0.5  0.6  0.8  0.3     0.5
file1  0.2  0.2  0.4  0.1     0.8
file1  0.1  0.4  0.5  0.2     0.9
Here is the code I tried:
import numpy as np
file1=np.loadtxt('data_1',skiprows=1,usecols=(1,2,3))
file2=np.loadtxt('data_2',skiprows=1,usecols=(0))
def calculation(a,b,c,x):
    z=(a+b+c)*x
    return z
for value in file2:
    print(value)
    calculation  # bug: this only references the function, it never calls it
my expected output should be something like
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
Python is a dynamic language and numpy tends to override normal operators to apply operations to entire collections of data. Often, if you have a for loop, you are not taking advantage of that.
numpy arrays can only hold a single data type, but you have a string in column 0. pandas wraps numpy and makes multiple data types easier to deal with, so I've switched to reading pandas.DataFrame objects instead of arrays.
It looks like you want the cartesian product of file2["x"] with the rows in file1. One way to do that is to create a dummy column in both dataframes that have matching values and then merge. Use the sum method for a + b + c and then multiply with x, and you have the result.
import pandas as pd
# read space separated tables
file1=pd.read_table('data_1', sep=r"\s+")
file2=pd.read_table('data_2', sep=r"\s+")
# we want (a+b+c)*x, for each value in file2["x"]. Do the sum, then
# use `merge` with a temporary key to create the cartesian product
# with x. For each `x`, merge will create a row for each matching
# key and since all keys match, we've got a cartesian product.
# Finally, multiply.
file1["_tmpsums"] = file1[["a", "b", "c"]].sum(axis=1)
file1["_tmpmergekey"] = file2["_tmpmergekey"] = 1
file1 = pd.merge(file1, file2, on="_tmpmergekey")
file1["z"] = file1["_tmpsums"] * file1["x"]
file1 = file1.drop(["_tmpsums", "_tmpmergekey", "x"], axis=1)
print(" data_3")
print(file1.to_string(col_space=6, index=False, justify="center"))
Result
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 0.95
file1 0.5 0.6 0.8 0.3 1.52
file1 0.5 0.6 0.8 0.3 1.71
file1 0.2 0.2 0.4 0.1 0.40
file1 0.2 0.2 0.4 0.1 0.64
file1 0.2 0.2 0.4 0.1 0.72
file1 0.1 0.4 0.5 0.2 0.50
file1 0.1 0.4 0.5 0.2 0.80
file1 0.1 0.4 0.5 0.2 0.90
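As an aside, on pandas 1.2 and later the temporary-key trick can be replaced by merge(..., how="cross"); a sketch with the question's data inlined as literals:

```python
import pandas as pd

file1 = pd.DataFrame({'file': ['file1'] * 3,
                      'a': [0.5, 0.2, 0.1], 'b': [0.6, 0.2, 0.4],
                      'c': [0.8, 0.4, 0.5], 'd': [0.3, 0.1, 0.2]})
file2 = pd.DataFrame({'x': [0.5, 0.8, 0.9]})

out = file1.merge(file2, how='cross')        # cartesian product, no dummy key needed
out['z'] = (out['a'] + out['b'] + out['c']) * out['x']
out = out.drop(columns='x')
```

The result has the same 9 rows and z values as the merge-on-dummy-key version above.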
Using pandas as follows
import pandas as pd
# The calculation function from the question
def calculation(a, b, c, x):
    return (a + b + c) * x
# Load Data
data_1 = pd.read_csv('data_1.txt', delimiter = r"\s+")
data_2 = pd.read_csv('data_2.txt', delimiter = r"\s+")
# Compute the cartesian product of data_1 with data_2
# since for each row in data_1, we need sequence of rows in data_2
# We do this using DataFrame merge by injecting a key that is repeated for each row
# i.e. 'merge_key'
data_1['merge_key'] = pd.Series([1]*len(data_1))
data_2['merge_key'] = pd.Series([1]*len(data_2))
df = pd.merge(data_1, data_2, on = 'merge_key')
# Drop merge key from result
df.drop('merge_key', axis = 'columns', inplace = True)
# DataFrame df now has columns File, a, b, c, d, x
# We can apply the function calculation to each row using apply,
# specifying the columns to send to calculation
df['z'] = df.apply(lambda row: calculation(row['a'], row['b'], row['c'], row['x']), axis = 'columns')
# Drop x column
df.drop('x', axis = 'columns', inplace = True)
# Write to CSV file
df.to_csv('data_3.txt', index=False, sep = " ")
Output
Pandas DataFrame df
file a b c d z
0 file1 0.5 0.6 0.8 0.3 0.95
1 file1 0.5 0.6 0.8 0.3 1.52
2 file1 0.5 0.6 0.8 0.3 1.71
3 file1 0.2 0.2 0.4 0.1 0.40
4 file1 0.2 0.2 0.4 0.1 0.64
5 file1 0.2 0.2 0.4 0.1 0.72
6 file1 0.1 0.4 0.5 0.2 0.50
7 file1 0.1 0.4 0.5 0.2 0.80
8 file1 0.1 0.4 0.5 0.2 0.90
CSV File data_3.txt
file a b c d z
file1 0.5 0.6 0.8 0.3 0.9500000000000001
file1 0.5 0.6 0.8 0.3 1.5200000000000002
file1 0.5 0.6 0.8 0.3 1.7100000000000002
file1 0.2 0.2 0.4 0.1 0.4
file1 0.2 0.2 0.4 0.1 0.6400000000000001
file1 0.2 0.2 0.4 0.1 0.7200000000000001
file1 0.1 0.4 0.5 0.2 0.5
file1 0.1 0.4 0.5 0.2 0.8
file1 0.1 0.4 0.5 0.2 0.9
Basic Python
Same output
# The calculation function from the question
def calculation(a, b, c, x):
    return (a + b + c) * x

# Get data from first file
with open('data_1.txt', 'r') as f:
    # first file header
    header1 = f.readline()
    # Let's get the lines of data
    data_1 = []
    for line in f:
        new_data = line.rstrip().split()  # strip '\n' and split on whitespace
        for i in range(1, len(new_data)):
            new_data[i] = float(new_data[i])  # convert columns after file to float
        data_1.append(new_data)

# Get data from second file
with open('data_2.txt', 'r') as f:
    # second file header
    header2 = f.readline()
    # Let's get the lines of data
    data_2 = []
    for line in f:
        new_data = float(line.rstrip())  # only one value per line
        data_2.append(new_data)

with open('data_3.txt', 'w') as f:
    # Output file
    # Write header
    f.write("file a b c d z\n")
    # Use double loop to loop through all rows of data_2 for each row in data_1
    for v1 in data_1:
        # For each row in data_1
        file, a, b, c, d = v1  # unpack the values in v1 into individual variables
        for v2 in data_2:
            # for each row in data_2
            x = v2  # data_2 just has a single value per row
            # Calculation using posted formula
            z = calculation(a, b, c, x)
            # Write result
            f.write(f"{file} {a} {b} {c} {d} {z}\n")
Numpy Version
import numpy as np

# The calculation function from the question
def calculation(a, b, c, x):
    return (a + b + c) * x

file1 = np.loadtxt('data_1.txt', skiprows=1, usecols=(1, 2, 3, 4))
file2 = np.loadtxt('data_2.txt', skiprows=1, usecols=(0,))
# read the file-name column separately, since a float array cannot hold it
names = np.loadtxt('data_1.txt', skiprows=1, usecols=(0,), dtype=str)
with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Double loop through the values of file1 and file2
    for name, val1 in zip(names, file1):
        for val2 in file2:
            # Only use the first 3 values (val1[:3]), so d is ignored
            z = calculation(*val1[:3], val2)  # *val1[:3] unpacks the values into calculation
            # ' '.join(...) builds a space-separated line: file name, a b c d, then z
            f.write(' '.join([name, *map(str, val1), str(z)]) + "\n")
This question already has answers here:
Pandas Groupby and Sum Only One Column
(3 answers)
Pandas sum by groupby, but exclude certain columns
(4 answers)
Closed 4 years ago.
I have a 2d numpy array with repeated values in first column.
The repeated values can have any corresponding value in second column.
It's easy to find a cumulative sum using numpy, but here I need the total for each group of repeated values.
How can we do this efficiently using numpy or pandas?
Here, I have solved the problem using an inefficient for-loop.
I was wondering if there is a more elegant solution.
Question
How can we get the same result more efficiently?
Help will be appreciated.
#!python
# -*- coding: utf-8 -*-#
#
# Imports
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10)*0.1
unq = np.unique(aa)
ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)
for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]
print(ans)
"""
# given data
idx col0 col1
0 7. 0.0
1 15. 0.1
2 11. 0.2
3 8. 0.3
4 7. 0.4
5 19. 0.5
6 11. 0.6
7 11. 0.7
8 4. 0.8
9 8. 0.9
# sorted data
4. 0.8
7. 0.0
7. 0.4
8. 0.9
8. 0.3
11. 0.6
11. 0.7
11. 0.2
15. 0.1
19. 0.5
# cumulative sum for repeated serial
4. 0.8
7. 0.0 + 0.4
8. 0.9 + 0.3
11. 0.6 + 0.7 + 0.2
15. 0.1
19. 0.5
# Required answer
4. 0.8
7. 0.4
8. 1.2
11. 1.5
15. 0.1
19. 0.5
"""
You can build a DataFrame from the two columns, then groupby col0 and take the .sum() of col1.
df = pd.DataFrame({'col0': aa, 'col1': bb})
df.groupby('col0')['col1'].sum()
Output:
col0
4.0 0.8
7.0 0.4
8.0 1.2
11.0 1.5
15.0 0.1
19.0 0.5
Name: col1, dtype: float64
I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:
np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])
which returns:
array([[ 4. , 0.8],
[ 7. , 0.4],
[ 8. , 1.2],
[ 11. , 1.5],
[ 15. , 0.1],
[ 19. , 0.5]])
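For larger arrays, a fully vectorized numpy variant avoids the Python-level loop entirely, using np.unique with return_inverse plus np.bincount; a sketch on the question's data (hard-coded here instead of drawn from the random seed):

```python
import numpy as np

aa = np.array([7, 15, 11, 8, 7, 19, 11, 11, 4, 8], dtype=float)
bb = np.arange(10) * 0.1

# inv[j] is the position of aa[j] within the sorted unique keys
unq, inv = np.unique(aa, return_inverse=True)
sums = np.bincount(inv, weights=bb)      # sum bb within each key group
result = np.column_stack((unq, sums))
print(result)
```

np.bincount does the per-group accumulation in C, so this scales much better than the nested loop or the list comprehension.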
I have a numpy vector my_indexes of size 1×N which contains boolean indexing values, and a 2D array my_array of size M×K where K << N. The boolean vector corresponds to the columns that I removed from (or kept in) the array my_array, and I want to add those columns back, filled with zeros (or NaNs). My code for removing the columns:
cols = np.all(np.isnan(my_array), axis=0)  # record which columns are all-NaN first
my_array = my_array[:, ~cols]              # then drop them
my_array = some_process(my_array)
# How can I add the removed columns
My array is of size M×N and afterwards the size is M×K. How can I fill the removed columns again with nan or zeros?
An example could be:
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
0.1 nan 0.3 .04 nan 0.12 0.12
First, I want to remove the nan columns using my_array = my_array[:, ~np.all(np.isnan(my_array), axis=0)].
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
0.1 0.3 .04 0.12 0.12
And the my_indexes vector is:
False True False False True False False
Then I want to process the matrix and afterwards get the nan columns back (note that the processing cannot happen with the nan columns present). I guess that I need to use the np.insert function, but how can I do so using my boolean vector?
You can probably use masked arrays for that:
import numpy as np
import numpy.ma as ma
def some_process(x):
    return x**2
x = np.arange(9, dtype=float).reshape(3, 3)
x[:,1] = np.nan
print(x)
# [[ 0. nan 2.]
# [ 3. nan 5.]
# [ 6. nan 8.]]
# mask all np.nan and np.inf
masked_x = ma.masked_invalid(x)
# Compute the process only on the unmasked values and fill back np.nan
x = ma.filled(some_process(masked_x), np.nan)
print(x)
# [[ 0. nan 4.]
# [ 9. nan 25.]
# [ 36. nan 64.]]
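Alternatively, to address the np.insert part of the question directly: you can skip np.insert and assign through the boolean mask, writing the processed M×K block back into a NaN-filled M×N array. A sketch using the question's example values, with squaring as a stand-in for some_process:

```python
import numpy as np

# the 3x7 example from the question, with two all-NaN columns
my_array = np.array([[0.1, np.nan, 0.3, 0.04, np.nan, 0.12, 0.12]] * 3)

my_indexes = np.all(np.isnan(my_array), axis=0)   # True for the all-NaN columns
kept = my_array[:, ~my_indexes]                   # M x K array without those columns
processed = kept ** 2                             # stand-in for some_process

# rebuild the M x N array, with the removed columns filled with NaN again
restored = np.full(my_array.shape, np.nan)
restored[:, ~my_indexes] = processed
```

Boolean-mask assignment places each processed column back at its original position, so no index bookkeeping for np.insert is needed.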
I am trying to print items in two separate lists in a way that items in list-1 will align with items in list-2.
Here is my attempt:
import numpy as np
list_1=[1,2,3,4]
list_2=np.arange(0.1,0.4,0.1)
for x in list_1:
    j=x/2.0
    for y in list_2:
        print j,',', y
My Output:
0.5 , 0.1
0.5 , 0.2
0.5 , 0.3
0.5 , 0.4
1.0 , 0.1
1.0 , 0.2
1.0 , 0.3
1.0 , 0.4
1.5 , 0.1
1.5 , 0.2
1.5 , 0.3
1.5 , 0.4
2.0 , 0.1
2.0 , 0.2
2.0 , 0.3
2.0 , 0.4
Desired Output:
0.5 , 0.1
1.0 , 0.2
1.5 , 0.3
2.0 , 0.4
What you want is zip().
Example:
>>> l1 = range(10)
>>> l2 = range(20,30)
>>> for x,y in zip(l1, l2):
...     print x, y
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
Explanation:
zip receives iterables and iterates over all of them at once, starting from element 0 of each, then element 1, and so on. Once any of the iterables reaches its end, zip stops. You can use izip_longest from itertools (zip_longest in Python 3) to fill in the missing items with None (or do fancier things, but that is for a different question).
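Applied to the question's own data (with Python 3 print syntax), a sketch:

```python
import numpy as np

list_1 = [1, 2, 3, 4]
# floating-point end-point effects make this include 0.4, matching the
# question's output of four values
list_2 = np.arange(0.1, 0.4, 0.1)

for x, y in zip(list_1, list_2):
    print(x / 2.0, ',', round(y, 1))
```

Each halved element of list_1 is paired with the element of list_2 at the same position, giving the four desired lines instead of the sixteen produced by the nested loops.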