So I have a CSV that contains daily data separated by header lines. Is there any way I can make a separate pandas DataFrame each time the program hits a header?
The data basically looks like this:
#dateinformation
data1, data2, data3
data4, data5, data6
#dateinformation
An example of the real CSV is this:
#7240320140101002301 131
21101400B 86 12B 110 325 25
10100000 200B 6B 110 325 77
20 95300 -9999 -27B 100-9999-9999
10 92500 820B -39B 90 290
.....
#7240320140102002301
21101400B 86 14B 110 325 25
10100000 200B 2B 110 325 77
20 95300 -9999 -85B 100-9999-9999
10 92500 820B -25B 90 290
I've already got the formatting of the actual data down fine. I just need some help with how to separate out the different sets within the CSV.
(code below based on the header row starting with '#')
I suppose in theory you'd do this with read_table and chunksize, but in practice I had trouble getting that to work well because of the different number of fields per row. The following is fairly simple, though I did have to resort to iterrows.
In [1435]: df_list = []
      ...: rows = []
      ...: foo = pd.read_csv('foo.txt', sep=r'\s+', names=list('abcdef'))
      ...: for i, row in foo.iloc[1:].iterrows():      # skip the first header line
      ...:     if str(row.iloc[0]).startswith('#'):    # hit a header: close out the current frame
      ...:         df_list.append(pd.DataFrame(rows))
      ...:         rows = []
      ...:     else:
      ...:         rows.append(row)
      ...: df_list.append(pd.DataFrame(rows))          # and the last frame
In [1436]: df_list[0]
Out[1436]:
a b c d e f
1 21101400B 86 12B 110 325 25
2 10100000 200B 6B 110 325 77
3 20 95300 -9999 -27B 100-9999-9999 NaN
4 10 92500 820B -39B 90 290
In [1437]: df_list[1]
Out[1437]:
a b c d e f
6 21101400B 86 14B 110 325 25
7 10100000 200B 2B 110 325 77
8 20 95300 -9999 -85B 100-9999-9999 NaN
9 10 92500 820B -25B 90 290
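As an aside, the iterrows loop can be avoided entirely: flag the header rows, use the running count of headers as a group key, and split in a single groupby pass. A minimal sketch, assuming the same foo frame as above:
is_header = foo['a'].astype(str).str.startswith('#')   # True on each '#...' header row
key = is_header.cumsum()[~is_header]                   # frame number for every data row
df_list = [g for _, g in foo[~is_header].groupby(key)]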
This answer is based on the assumption that each 'frame' contains the same number of rows.
First we read the file with pandas read_csv(), leveraging the comment parameter to skip each of your headers and read in just the data:
df = pd.read_csv('data.txt', comment='#', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 21101400B 86 12B 110 325 25
1 10100000 200B 6B 110 325 77
2 20 95300 -9999 -27B 100-9999-9999 NaN
3 10 92500 820B -39B 90 290
4 21101400B 86 14B 110 325 25
5 10100000 200B 2B 110 325 77
6 20 95300 -9999 -85B 100-9999-9999 NaN
7 10 92500 820B -25B 90 290
Then a for loop parses and stores each frame in a list. I'm assuming each frame has 4 rows:
frames = []
for begin in range(0, len(df), 4):
    frames.append(df[begin:begin + 4])
frames[0]
0 1 2 3 4 5
0 21101400B 86 12B 110 325 25
1 10100000 200B 6B 110 325 77
2 20 95300 -9999 -27B 100-9999-9999 NaN
3 10 92500 820B -39B 90 290
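If the fixed-row assumption holds, numpy can also do the chunking in one call; a sketch, assuming the 4-row frames above:
import numpy as np

frames = np.array_split(df, len(df) // 4)   # list of 4-row DataFrames, same as the loop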
I am trying to apply a function on a column of a dataframe.
After getting multiple results as dataframes, I want to concat them all into one.
Why does the first option work and the second not?
import numpy as np
import pandas as pd
def testdf(n):
    test = pd.DataFrame(np.random.randint(0, n*100, size=(n*3, 3)), columns=list('ABC'))
    test['index'] = n
    return test
test = pd.DataFrame({'id': [1,2,3,4]})
testapply = test['id'].apply(func = testdf)
#option 1
pd.concat([testapply[0],testapply[1],testapply[2],testapply[3]])
#option2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your second option passes a list containing a single pd.Series whose elements are DataFrames, so no concatenation happens; you just get that Series back as is. To fix your second approach, unpack the Series:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4
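Equivalently, you can hand pd.concat the Series' elements as a plain list, which is the same fix without the unpacking syntax:
print(pd.concat(testapply.tolist()))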
I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want column 1 to be the same as column 0 for the rows where column 0 contains "DVBC".
Then split the values on ":" and fill the empty ones with 0.
The end data frame should look like this
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I tried to do this, starting with:
if df[0].str.contains("DVBC") is True:
df[1] = df[0]
But after this the data frame looks the same, and I'm not sure why.
My idea after that is to move the values to the respective columns, then split on ":" and rename the columns.
How can I implement this?
A universal solution for splitting the values on ":" and pivoting: first create a Series with DataFrame.stack, split it with Series.str.split, and finally reshape with DataFrame.pivot:
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot(index='level_0', columns=0, values=1).fillna(0).rename_axis(index=None, columns=None)
print(df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0
Here is one way that should work with any number of columns:
(df
 .apply(lambda c: c.str.extract(r':(\d+)', expand=False))
.ffill(axis=1)
.mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
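For completeness, the reason the original if did nothing: df[0].str.contains("DVBC") returns a boolean Series, and a Series is never identical to the singleton True, so the branch never ran. The row-wise move the question describes can also be written with a boolean mask; a minimal sketch, assuming the frame from the question with None placeholders in column 1:
mask = df[0].str.contains('DVBC')   # rows whose DVBC value sits in column 0
df.loc[mask, 1] = df.loc[mask, 0]   # move it over to column 1
df.loc[mask, 0] = 'OTT:0'           # and fill column 0 with a zero OTT
out = df.apply(lambda c: c.str.split(':').str[1]).astype(int)
out.columns = ['OTT', 'DVBC']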
I have the following .txt file:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
I need to copy the first line and add it to the .txt file as last line in order to get this:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
How can I do that? (Notice the 13 as the first entry of the last line)
Try the following. I have added some comments to describe the steps:
with open('yourfile.txt') as f:
    t = f.readlines()

row = t[-1].split()[0]   # get the last row index
row = str(int(row) + 1)  # increase the last row index
new_line = t[0]          # copy the first line
new_line = new_line.replace('0', row, 1).replace(' ', '', len(row) - 1)  # swap in the new row index, taking care of spaces
t[-1] = t[-1] + '\n'     # make sure the current last line ends in a newline
t.append(new_line)       # append the new line

with open('yourfile.txt', 'w') as f:
    f.writelines(t)
Applied to your existing .txt, the result of the above code is:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
You can use:
with open('fileName.txt') as file:
    first_line = file.readline()
    count = sum(1 for _ in file) + 1  # lines after the first, plus one, gives the new row index

line1 = first_line.split()
line1[0] = str(count)  # the index has to be a string before joining
new_line = ' '.join(line1)

# then you can append it to the end of the file with
# (assuming, as above, that the file does not end with a newline):
with open('fileName.txt', 'a') as file:
    file.write('\n' + new_line)
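A more compact variant of the same idea; a sketch that rewrites the whole file, with the caveat that rejoining the fields on single spaces loses the original column alignment:
with open('yourfile.txt') as f:
    lines = f.read().splitlines()

fields = lines[0].split()
fields[0] = str(len(lines))   # the new index equals the current number of lines
lines.append(' '.join(fields))

with open('yourfile.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')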
I have some data which looks as shown below in df.
I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used to do another calculation per group using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun:
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args = (mu,)) #this function should be element wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean_angle per group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated.
You don't need your second function; just pass the necessary columns to np.where(). Creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
I have a big DataFrame df and I want to count each value. I can't do:
df = pandas.read_csv('my_big_data.csv')
values_df = df.apply(pandas.value_counts)
because it is a very big dataset.
I think it must be possible to do it chunk by chunk with chunksize, but I can't see how.
In [8]: import numpy as np; from pandas import DataFrame, Series; import pandas as pd

In [9]: pd.set_option('max_rows',10)
Construct a sample frame
In [10]: df = DataFrame(np.random.randint(0,100,size=100000).reshape(-1,1))
In [11]: df
Out[11]:
0
0 50
1 35
2 20
3 66
4 8
... ..
99995 51
99996 33
99997 43
99998 41
99999 56
[100000 rows x 1 columns]
In [12]: df.to_csv('test.csv')
Read it in chunks and construct the value_counts for each chunk.
Then concatenate all of these results (so you have a frame that is indexed by the counted values, and whose values are the counts).
In [13]: result = pd.concat([ chunk.apply(Series.value_counts) for chunk in pd.read_csv('test.csv',index_col=0,chunksize=10000) ] )
In [14]: result
Out[14]:
0
18 121
75 116
39 116
55 115
60 114
.. ...
88 83
8 83
56 82
76 76
18 73
[1000 rows x 1 columns]
Then group by the index, which puts all of the duplicate indexes into groups. Summing gives the total of the individual value_counts:
In [15]: result.groupby(result.index).sum()
Out[15]:
0
0 1017
1 1015
2 992
3 1051
4 973
.. ...
95 1014
96 949
97 1011
98 999
99 981
[100 rows x 1 columns]
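The same pattern works outside the interactive session too; a minimal sketch, assuming the test.csv built above, that sums the per-chunk counts directly instead of concatenating and grouping:
from functools import reduce

import pandas as pd

chunks = pd.read_csv('test.csv', index_col=0, chunksize=10000)
counts = reduce(lambda a, b: a.add(b, fill_value=0),
                (chunk.iloc[:, 0].value_counts() for chunk in chunks))
print(counts.sort_index())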