I am trying to apply a function to a column of a dataframe. After getting multiple results as dataframes, I want to concat them all into one. Why does the first option work but not the second?
import numpy as np
import pandas as pd

def testdf(n):
    test = pd.DataFrame(np.random.randint(0, n*100, size=(n*3, 3)), columns=list('ABC'))
    test['index'] = n
    return test

test = pd.DataFrame({'id': [1, 2, 3, 4]})
testapply = test['id'].apply(func=testdf)

# option 1
pd.concat([testapply[0], testapply[1], testapply[2], testapply[3]])
# option 2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your second option passes a list containing a single pd.Series (whose elements happen to be dataframes), so no concatenation happens: you just get that series back as is. To fix your second approach, use unpacking:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4
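Equivalently, since testapply is a series whose values are dataframes, you can hand pd.concat a plain list of those dataframes, for example via tolist() (a small sketch, same result as the unpacked version):
pd.concat(testapply.tolist())  # list of dataframes, so concat concatenates them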
I have the following .txt file:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
I need to copy the first line and add it to the .txt file as the last line, in order to get this:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
How can I do that? (Notice the 13 as the first entry of the last line)
Try the following; I have added some comments to describe the steps:
with open('yourfile.txt') as f:
    t = f.readlines()

row = t[-1].split()[0]  # get the last row index
row = str(int(row) + 1)  # increment it for the new row
new_line = t[0]  # copy the first line
# swap in the new row index, then drop len(row)-1 spaces to keep the columns aligned
new_line = new_line.replace('0', row, 1).replace(' ', '', len(row) - 1)
t[-1] = t[-1] + '\n'  # make sure the old last line ends with a newline
t.append(new_line)  # append the new line

with open('yourfile.txt', 'w') as f:
    f.writelines(t)
Applied to your existing .txt file, the result of the above code is:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
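For what it's worth, a more defensive variant of the same idea, a sketch assuming the fields are whitespace-separated and the file ends with a newline, rebuilds the new line from its fields instead of patching characters:
with open('yourfile.txt') as f:
    lines = f.readlines()

next_index = int(lines[-1].split()[0]) + 1  # index for the new last row
fields = lines[0].split()  # fields of the first line
fields[0] = str(next_index)  # swap in the new index

# append the rebuilt line (note: this normalizes field separators to single spaces)
with open('yourfile.txt', 'a') as f:
    f.write(' '.join(fields) + '\n')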
You can use:
with open('fileName.txt') as file:
    first_line = file.readline()
    count = 1 + sum(1 for _ in file)  # total number of lines = the new row index

fields = first_line.split()
fields[0] = str(count)  # swap in the new row index
new_line = ' '.join(fields)

# then append it to the end of the file (assuming the file ends with a newline):
with open('fileName.txt', 'a') as file_object:
    file_object.write(new_line + '\n')
I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like this:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200-odd rows, and I am looking for something generic that pads this dataframe with 0's up to 256 rows. I am quite new to pandas and could not find a function to do this.
Use reindex with fill_value:
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
Alternatively, we can do:
new_df = data.reindex(range(257)).fillna(0, downcast='infer')
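Note that range(257) yields indices 0 through 256, i.e. 257 rows, which matches the output shown in the question. If you literally want 256 rows (indices 0 through 255), use range(256):
df_final = data.reindex(range(256), fill_value=0)  # exactly 256 rows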
I have some data, df, which looks as shown below.
I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used to do another calculation per group using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1, 2, 3, 4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2)
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2)
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a': a, 'b': b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun:
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args=(mu,))  # this should be applied element-wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean angle of the corresponding group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated.
You don't need your second function; just pass the necessary columns to np.where(). Creating your dataframe in the same manner, and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
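For what it's worth, the reason transform() is the right tool here: apply() (or agg()) returns one value per group, while transform() broadcasts that per-group value back to every row, so the result aligns element-wise with df['b']. A quick sketch:
per_group = df.groupby('a')['b'].apply(mean_angle)  # one mean angle per group, indexed by a
per_row = df.groupby('a')['b'].transform(mean_angle)  # the group's mean repeated on every row
print(per_group.shape, per_row.shape)  # (4,) and (16,)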
I have a dataframe object with date and calltime columns.
I was trying to build a histogram based on the second column, e.g.
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
I got the following histogram, with almost everything in the first bar. The thing is that I want more detail for that first bar: its range, 0-2500, is huge, and all the data is hidden there... Is there a way to split it into smaller bins, e.g. of width 50?
UPD
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use the bins argument:
import numpy as np
import pandas as pd

s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
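Applied to the calltime column from the question, a sketch with fixed-width bins of 50 (assuming df is the dataframe shown in the update):
# explicit bin edges of width 50, from 0 up past the largest calltime
edges = range(0, df['calltime'].max() + 50, 50)
df['calltime'].hist(bins=edges)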
I'm trying to do a substring on data from column "ORG". I only need the 2nd and 3rd character. So for 413 I only need 13. I've tried the following:
Attempt 1: dr2['unit'] = dr2[['ORG']][1:2]
Attempt 2: dr2['unit'] = dr2[['ORG'].str[1:2]
Attempt 3: dr2['unit'] = dr2[['ORG'].str([1:2])
My dataframe:
REGION ORG
90 4 413
91 4 413
92 4 413
93 5 503
94 5 503
95 5 503
96 5 503
97 5 504
98 5 504
99 1 117
100 1 117
101 1 117
102 1 117
103 1 117
104 1 117
105 1 117
106 3 3
107 3 3
108 3 3
109 3 3
Expected output:
REGION ORG UNIT
90 4 413 13
91 4 413 13
92 4 413 13
93 5 503 03
94 5 503 03
95 5 503 03
96 5 503 03
97 5 504 04
98 5 504 04
99 1 117 17
100 1 117 17
101 1 117 17
102 1 117 17
103 1 117 17
104 1 117 17
105 1 117 17
106 3 3 03
107 3 3 03
108 3 3 03
109 3 3 03
Thanks for any and all help!
Your square brackets are not balanced, and you can simply slice with .str[-2:]. Then apply str.zfill with a width of 2 to zero-pad the items in the new series:
>>> import pandas as pd
>>> ld = [{'REGION': '4', 'ORG': '413'}, {'REGION': '4', 'ORG': '414'}, {'REGION': '4', 'ORG': '3'}]
>>> df = pd.DataFrame(ld)
>>> df
   ORG REGION
0  413      4
1  414      4
2    3      4
>>> df['UNIT'] = df['ORG'].str[-2:].apply(str.zfill, args=(2,))
>>> df
ORG REGION UNIT
0 413 4 13
1 414 4 14
2 3 4 03
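For what it's worth, the same padding can be written entirely with the vectorized .str accessor; astype(str) is a guard in case ORG is stored as a numeric column in your dataframe:
dr2['unit'] = dr2['ORG'].astype(str).str[-2:].str.zfill(2)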