Add a new line to .txt file in python - python

I have the following .txt file:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
I need to copy the first line and add it to the .txt file as last line in order to get this:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
How can I do that? (Notice the 13 as the first entry of the last line)

Try the following. I have added some comments to describe the steps
with open('yourfile.txt') as f:
t=f.readlines()
row=t[-1].split()[0] #get the last row index
row=str(int(row)+1) #increase the last row index
new_line=t[0] #copy first line
new_line=new_line.replace('0', row, 1).replace(' ', '',len(row)-1) #add the next row index to new line, taking care of spaces
t[-1]=t[-1]+'\n'
t.append(new_line) #append the new line
with open('yourfile.txt', 'w') as f:
f.writelines(t)
Applied to your existing .txt, result of the above code is:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0

You can use:
with open('fileName.txt') as file:
first_line = file.readline()
count = sum(1 for _ in file)
line1 = first_line.split()
line1[0] = count
str = ' '.join(line1)
#then you can add it to the end of the file with:
file_object.write(str)

Related

concat result of apply in python

I am trying to apply a function on a column of a dataframe.
After getting multiple results as dataframes, I want to concat them all in one.
Why does the first option work and the second not?
import numpy as np
import pandas as pd
def testdf(n):
test = pd.DataFrame(np.random.randint(0,n*100,size=(n*3, 3)), columns=list('ABC'))
test['index'] = n
return test
test = pd.DataFrame({'id': [1,2,3,4]})
testapply = test['id'].apply(func = testdf)
#option 1
pd.concat([testapply[0],testapply[1],testapply[2],testapply[3]])
#option2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your #2 case/option passes a sequence of single pd.Series object that contains multiple dataframes, so it doesn't make concatenation - you just get that series as is.To fix your 2nd approach use unpacking:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using the rfm_aboveavg (RFM combinations response rates above the overall response). Response rate is given by considering 0/No and 1/Yes, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to only display the rows that are also shown in the training data output by the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000

How to read .data format data with Python?

I have uploaded data from https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/ . As you see it has .data format. How to read it as pandas datframe in Python?
I try this. but it dens work:
with open("arrhythmia.data", "r") as f:
arryth_df = pd.DataFrame(f.read())
It says ValueError: DataFrame constructor not properly called!
You can pass url of file to read_csv because here .data is csv format, but no header, so added header=None:
#if want see all data
pd.options.display.max_columns = None
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None)
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 75 0 190 80 91 193 371 174 121 -16 13 64 -2 ? 63 0
1 56 1 165 64 81 174 401 149 39 25 37 -17 31 ? 53 0
2 54 0 172 95 138 163 386 185 102 96 34 70 66 23 75 0
3 55 0 175 94 100 202 380 179 143 28 11 -5 20 ? 71 0
4 75 0 190 80 88 181 360 177 103 -16 13 61 3 ? ? 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 \
0 52 44 0 0 32 0 0 0 0 0 0 0 44 20 36
1 48 0 0 0 24 0 0 0 0 0 0 0 64 0 0
2 40 80 0 0 24 0 0 0 0 0 0 20 56 52 0
3 72 20 0 0 48 0 0 0 0 0 0 0 64 36 0
4 48 40 0 0 28 0 0 0 0 0 0 0 40 24 0
...
...
...
If want also convert ? to missing values NaNs add na_values='?' parameter:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None, na_values='?')
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 75 0 190 80 91 193 371 174 121 -16 13.0 64.0 -2.0 NaN
1 56 1 165 64 81 174 401 149 39 25 37.0 -17.0 31.0 NaN
2 54 0 172 95 138 163 386 185 102 96 34.0 70.0 66.0 23.0
3 55 0 175 94 100 202 380 179 143 28 11.0 -5.0 20.0 NaN
4 75 0 190 80 88 181 360 177 103 -16 13.0 61.0 3.0 NaN
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 \
0 63.0 0 52 44 0 0 32 0 0 0 0 0 0 0 44
1 53.0 0 48 0 0 0 24 0 0 0 0 0 0 0 64
2 75.0 0 40 80 0 0 24 0 0 0 0 0 0 20 56
3 71.0 0 72 20 0 0 48 0 0 0 0 0 0 0 64
4 NaN 0 48 40 0 0 28 0 0 0 0 0 0 0 40
...
...
Do it this way with StringIO:
from io import StringIO
import pandas as pd
with open("arrhythmia.data", "r") as f:
data = StringIO(f.read())
arryth_df = pd.read_csv(data)

cumsum in numpy ndarray (edited)

I had an issue with cumsum in a dataframe, which was nicely resolved here : https://stackoverflow.com/a/61842690/7937578
But when I tried to do it with my entire dataframe, I couldn't fit all my data into pandas, so I tried converting it to numpy arrays only, but I can't seem to reproduce the code in numpy only.
So far I have this :
test = np.arange(200).reshape(4, 50)
test[2] = np.random.choice([-1, 0, 1], size=50)
TARGET_SUM = 10
x = np.cumsum(test[2] != 0)
changing = np.roll(x, 1) != x
indices = np.where(changing & (x % TARGET_SUM == 0) & (x > 0))[0]
indices = np.concatenate(([-1,], indices))
indices += 1
for i1, i2 in zip(indices[0:-1], indices[1:]):
print(i1, i2)
print(test[i1:i2])
But the output is this :
0 13
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[ 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
86 87 88 89 90 91 92 93 94 95 96 97 98 99]
[ -1 1 0 -1 -1 0 0 -1 1 -1 1 1 -1 -1 0 -1 0 0
0 -1 0 0 -1 1 -1 1 1 -1 -1 0 1 0 0 -1 1 -1
1 0 0 0 1 0 -1 1 1 1 1 1 1 1]
[150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
186 187 188 189 190 191 192 193 194 195 196 197 198 199]]
13 29
[]
29 46
[]
Where it should be more like this :
0 13
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12]
[ 50 51 52 53 54 55 56 57 58 59 60 61 62]
[ -1 1 0 -1 -1 0 0 -1 1 -1 1 1 -1]
[ 150 151 152 153 154 155 156 157 158 159 160 161 162]]
13 29
etc...
The solution of #Ben.T juste above worked perfectly !
Quoting : "I think suppr is a list of arrays, so I guess what you need is print(np.array(suppr)[:, i1:i2]) and if it is already an array, then suppr[:, i1:i2] should be enough. "

Pandas dataframe from nested dictionary to melted data frame

I converted a nested dictionary to a Pandas DataFrame which I want to use as to create a heatmap.
The nested dictionary is simple to create:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps which would just be this data frame. However, the dataframes needed for ggplot are a little different. I can use the pandas.melt function to get close, but I'm missing the row titles.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest thing to make this dataframe would be is to add the value of the amino acid so the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested dictionary dataframe in order to get the dataframe at the end. If there is a more efficient way to do this, I'm open for suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df,id_vars['index']
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
if i understand properly your question, i think you can simply do the following :
mdf = pandas.melt(df)
mdf['rowvalue'] = df.index
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K

Categories