I have a 3 row x 96 column dataframe. I'm trying to compute the average of every 12 data points in each of the two data rows beneath the header. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
Something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ...
Then I get lost.
I just don't know how to link this range with the dataframe in order to compute the mean of every 12 data points in each row.
To summarize, I want the average of every 12 runs for each of the two data rows. So, I should end up with a separate dataframe of 2 x 8 average values (96/12 = 8 groups).
Any ideas?
Thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the two data rows and coerce them to float (the presence of the ------ row means you've probably got an object dtype, which will make mean unhappy).
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
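On newer pandas versions, groupby with axis=1 is deprecated; the same grouping can be expressed by transposing first. A minimal sketch of the equivalent call, reusing h from above:
groups = np.arange(len(h.columns)) // 12
result = h.T.groupby(groups).mean().T  # group the 96 rows of the transpose, then flip back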
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to reduce each consecutive group of twelve cells, so you might be better off simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative; you will obviously see your column sums.) You can then turn this into an 8x12 matrix, one row per group of twelve consecutive runs, with
ds.values.reshape(8, 12)
whose value is
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
       [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71],
       [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83],
       [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array would give you the sum of all its elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(8, 12))
and then sum along each row:
rs.sum(axis=1)
giving
0      66
1     210
2     354
3     498
4     642
5     786
6     930
7    1074
dtype: int64
You may find in practice that it is easier to simply create two 8x12 matrices in the first place, which you can add together before creating a dataframe that you can then sum. Much depends on how you are reading your data.
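For what it's worth, here is a compact sketch of the reshape idea applied directly to the question's goal (the 2 x 8 table of means, assuming the two data rows are at index 0 and 1):
vals = df.iloc[0:2].astype(float).values                   # shape (2, 96)
means = pd.DataFrame(vals.reshape(2, 8, 12).mean(axis=2))  # shape (2, 8): 8 blocks of 12 runs per row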
I have a matrix (3x5) in which a number is randomly selected. I want to swap the selected number with the one diagonally down and to the right of it. I'm able to locate the index of the randomly selected number, but I'm not sure how to swap it with the one that is down then right. For example, given the matrix:
[[169 107 229 317 236]
[202 124 114 280 106]
[306 135 396 218 373]]
and the selected number 280 (at position [1,3]) needs to be swapped with 373 at [2,4]. I'm having trouble moving around with the indices. I can hard-code it, but it becomes a little more complex when the number to swap is randomly selected.
If the selected number were at [0,0], the hard-coded version would look like:
selected_task = tard_generator1[0,0]
right_swap = tard_generator1[1,1]
tard_generator1[1,1] = selected_task
tard_generator1[0,0] = right_swap
Any suggestions are welcome!
How about something like:
chosen = (1, 2)
right_down = chosen[0] + 1, chosen[1] + 1
matrix[chosen], matrix[right_down] = matrix[right_down], matrix[chosen]
For example, with a 5x5 array:
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> index = (1, 2)
>>> right_down = index[0] + 1, index[1] + 1
>>> a[index], a[right_down] = a[right_down], a[index]
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 13, 8, 9],
[10, 11, 12, 7, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
There should be a boundary check, but it's omitted here.
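If you do want the check, a minimal sketch (the helper name is mine) that refuses to swap when there is no down-right neighbour:
def swap_down_right(matrix, index):
    row, col = index
    # Guard: the down-right neighbour must exist inside the array bounds.
    if row + 1 >= matrix.shape[0] or col + 1 >= matrix.shape[1]:
        raise IndexError('no down-right neighbour for index {}'.format(index))
    matrix[row, col], matrix[row + 1, col + 1] = matrix[row + 1, col + 1], matrix[row, col]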
Try this:
import numpy as np

def swap_rdi(mat, index):
    row, col = index
    rows, cols = mat.shape
    # Make sure the down-right neighbour exists before swapping.
    assert row + 1 < rows and col + 1 < cols
    mat[row, col], mat[row+1, col+1] = mat[row+1, col+1], mat[row, col]
    return mat  # return the matrix so printing the call shows the result
Example:
mat = np.matrix([[1,2,3], [4,5,6]])
print('Before:\n{}'.format(mat))
print('After:\n{}'.format(swap_rdi(mat, (0,1))))
Outputs:
Before:
[[1 2 3]
[4 5 6]]
After:
[[1 6 3]
[4 5 2]]
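For the randomly-selected case in the question, one approach (a sketch) is to draw the index so that a down-right neighbour always exists, then reuse swap_rdi:
mat = np.array([[169, 107, 229, 317, 236],
                [202, 124, 114, 280, 106],
                [306, 135, 396, 218, 373]])
row = np.random.randint(mat.shape[0] - 1)  # exclude the last row
col = np.random.randint(mat.shape[1] - 1)  # exclude the last column
swap_rdi(mat, (row, col))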
Having done the clustering:
from sklearn.cluster import AffinityPropagation
import numpy as np

final_n_clusters = []
preference = np.arange(-20, 1, 0.3)  # preference values
iter_value = np.arange(1, 20, 1)     # maximum number of iterations
for k in preference:
    n_cluster_list = []
    for j in iter_value:
        af = AffinityPropagation(preference=k, max_iter=j, affinity='precomputed').fit(X)
        labels = af.labels_
        n_clusters = len(np.unique(labels))
        n_cluster_list.append(n_clusters)
    final_n_clusters.append(n_cluster_list)
And for final_n_clusters I get:
[[1, 97, 97, 97, 97, 1, 1, 97, 97, 97, 97, 97, 1, 97, 97, 97, 21, 1, 97],
...
[1, 30, 37, 5, 45, 33, 13, 8, 7, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7]]
It means: each row holds the results for one value of "preference", starting from -20, and each number within a row corresponds to a different value of "iter_value", starting from 1.
My question is:
Can I get a data frame like the one below by applying zip, or any other method? I already have the cluster numbers in final_n_clusters.
preference  iter_value  number_of_cluster
-20         1           1    # the cluster counts come from final_n_clusters
-20         2           97
-20         3           1
-20         4           97
...         ...         ...
Use enumerate in a list comprehension, looping over preference in the outer position so the rows come out in the order shown above:
data = [(kv, jv, final_n_clusters[ki][ji])
        for ki, kv in enumerate(preference)
        for ji, jv in enumerate(iter_value)]
df = pd.DataFrame(data, columns=['preference', 'iter_value', 'number_of_cluster'])
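An equivalent sketch without the explicit comprehension, building the three columns directly with numpy (this assumes final_n_clusters is a full rectangular list of lists):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'preference': np.repeat(preference, len(iter_value)),
    'iter_value': np.tile(iter_value, len(preference)),
    'number_of_cluster': np.ravel(final_n_clusters),  # row-major flatten matches the repeat/tile order
})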
Hi, I am fairly new to Python/programming and am having trouble unpacking a nested column in my dataframe.
The column I am trying to unpack looks like this (in JSON format):
df['id_data'] = [{u'metrics': {u'app_clicks': [6, 28, 13, 28, 43, 45],
u'card_engagements': [6, 28, 13, 28, 43, 45],
u'carousel_swipes': None,
u'clicks': [18, 33, 32, 48, 70, 95],
u'engagements': [25, 68, 46, 79, 119, 152],
u'follows': [0, 4, 1, 1, 1, 5],
u'impressions': [1697, 5887, 3174, 6383, 10250, 12301],
u'likes': [3, 4, 6, 9, 12, 15],
u'poll_card_vote': None,
u'qualified_impressions': None,
u'replies': [0, 0, 0, 0, 0, 1],
u'retweets': [1, 3, 0, 2, 5, 6],
u'tweets_send': None,
u'url_clicks': None},
u'segment': None}]
As you can see, there is a lot going on in this column: a list -> dictionary -> dictionary -> potentially another list. I would like each individual metric (app_clicks, card_engagements, carousel_swipes, etc.) to be its own column. I've tried the following code with no progress:
df['app_clicks'] = df.apply(lambda row: u['app_clicks'] for y in row['id_data'] if y['metricdata'] = 'list', axis=1)
Any thoughts on how I could tackle this?
You should be able to pass the dictionary directly to the dataframe constructor:
foo = pd.DataFrame(df['id_data'][0]['metrics'])
foo.iloc[:3, :4]
app_clicks card_engagements carousel_swipes clicks
0 6 6 None 18
1 28 28 None 33
2 13 13 None 32
Hopefully I am understanding your question correctly and this gets you what you need.
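If df ever holds more than one row of id_data, the same idea can be applied row by row; a sketch (the row column is mine, just to keep track of the origin):
import pandas as pd

frames = [pd.DataFrame(cell['metrics']).assign(row=i)
          for i, cell in enumerate(df['id_data'])]
all_metrics = pd.concat(frames, ignore_index=True)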
You can use to_json (note that this needs the standard-library json module):
import json

df1 = pd.DataFrame(json.loads(df["id_data"].to_json(orient="records")))
df2 = pd.DataFrame(json.loads(df1["metrics"].to_json(orient="records")))
df = pd.DataFrame({'email': ["a#gmail.com", "b#gmail.com", "c#gmail.com", "d#gmail.com", "e#gmail.com"],
                   'one': [88, 99, 11, 44, 33],
                   'two': [80, 80, 85, 80, 70],
                   'three': [50, 60, 70, 80, 20]})
Given this DataFrame, I would like to compute, for each of the columns one, two and three, how many values fall in certain ranges.
The ranges are for example: 0-70, 71-80, 81-90, 91-100
So the result would be:
out = pd.DataFrame({'colname': ["one", "two", "three"],
'b0to70': [3, 1, 4],
'b71to80': [0, 3, 1],
'b81to90': [1, 1, 0],
'b91to100': [1, 0, 0]})
What would be a nice idiomatic way to do this?
This would do it:
out = pd.DataFrame()
for name in ['one', 'two', 'three']:
    out[name] = pd.cut(df[name], bins=[0, 70, 80, 90, 100]).value_counts()
out.sort_index(inplace=True)
Returns:
one two three
(0, 70] 3 1 4
(70, 80] 0 3 1
(80, 90] 1 1 0
(90, 100] 1 0 0
Is there a more idiomatic way of doing this in Pandas?
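One more idiomatic variant, as a sketch: apply pd.cut column-wise, which avoids the explicit loop entirely:
out = df[['one', 'two', 'three']].apply(
    lambda s: pd.cut(s, bins=[0, 70, 80, 90, 100]).value_counts().sort_index())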
I want to set up a column that repeats the integers 1 to 48 for an index of length 2000:
df = pd.DataFrame(np.zeros((2000, 1)), columns=['HH'])
h = 1
for i in range(0, 2000):
    df.loc[i, 'HH'] = h
    if h >= 48:
        h = 1
    else:
        h += 1
Here is a more direct and faster way:
pd.DataFrame(np.tile(np.arange(1, 49), 2000 // 48 + 1)[:2000], columns=['HH'])
The detailed step:
np.arange(1, 49) creates an array from 1 to 48 (inclusive):
>>> l = np.arange(1, 49)
>>> l
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
np.tile(A, N) repeats the array A N times, so in this case you get [1 2 3 ... 48 1 2 3 ... 48 ... 1 2 3 ... 48]. You should repeat the array 2000 // 48 + 1 times in order to get at least 2000 values.
>>> r = np.tile(l, 2000 // 48 + 1)
>>> r
array([ 1, 2, 3, ..., 46, 47, 48])
>>> r.shape # The array is slightly larger than 2000
(2016,)
[:2000] retrieves the first 2000 values from the generated array to create your DataFrame.
>>> d = pd.DataFrame(r[:2000], columns=['HH'])
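Since the values just cycle, a modular-arithmetic one-liner gives the same column without any tiling; a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'HH': np.arange(2000) % 48 + 1})  # 0..1999 mod 48, shifted to 1..48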
df = pd.DataFrame({'HH':np.append(np.tile(range(1,49),int(2000/48)), range(1,np.mod(2000,48)+1))})
That is, it appends two arrays:
(1) np.tile(range(1,49),int(2000/48))
len(np.tile(range(1,49),int(2000/48)))
1968
(2) range(1,np.mod(2000,48)+1)
len(range(1,np.mod(2000,48)+1))
32
And constructing the DataFrame from a corresponding dictionary.
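For comparison, the same column can also be built lazily with itertools; a sketch:
from itertools import cycle, islice
import pandas as pd

df = pd.DataFrame({'HH': list(islice(cycle(range(1, 49)), 2000))})  # cycle through 1..48, take 2000 values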