How to remove Form Feed from file

I have written a script that extracts a certain section of data from a file, and it works properly. However, the section I extract also contains a Form Feed page-header line, followed by an empty list [] on the next line, and I want neither of them.
The Form Feed line appears in the output as:
['STAAD', 'SPACE', '--', 'PAGE', 'NO.', '']
import csv
found_type = False
filename = (r'C:\Users\GURBHEJ SINGH\Desktop\3d_frame.txt')
with open(filename, 'r') as f:
    for line in f:
        if line == '\n':
            continue
        if '0.00 MAX' in line:
            found_type = True
        if found_type:
            if '|---------------------------------------------------------------------------' in line:
                found_type = False
            else:
                t_line = str(line)
                t_line = list(t_line.split())
                print(t_line)
At some positions in the result, the Form Feed line and an empty list [] on the next line appear, which I don't want, as shown below:
['STAAD', 'SPACE', '--', 'PAGE', 'NO.', '8']
[]
To illustrate the problem, here is a short part of the data I want, taken from the file I used.
FILE
2 0.00 MAX 64.08 2 6.17 2 0.00 2 0.00 2
MIN 42.72 1 4.11 1 0.00 2 0.00 2
STAAD SPACE -- PAGE NO. 6
0.31 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
0.62 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
0.92 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
1.23 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
1.54 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
1.85 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.15 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.46 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
2.77 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
3.08 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
3.38 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
3.69 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
4.00 MAX 42.72 1 4.11 1 0.00 2 0.00 2
MIN 64.08 2 6.17 2 0.00 2 0.00 2
3 0.00 MAX 64.08 2 6.17 2 0.00 2 0.00 2
MIN 42.72 1 4.11 1 0.00 2 0.00 2
0.31 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
0.62 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
0.92 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
1.23 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
1.54 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
1.85 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.15 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
STAAD SPACE -- PAGE NO. 7
2.46 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
2.77 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
3.08 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
3.38 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
3.69 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
4.00 MAX 42.72 1 4.11 1 0.00 2 0.00 2
MIN 64.08 2 6.17 2 0.00 2 0.00 2
But the output I want is as shown below:
2 0.00 MAX 64.08 2 6.17 2 0.00 2 0.00 2
MIN 42.72 1 4.11 1 0.00 2 0.00 2
0.31 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
0.62 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
0.92 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
1.23 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
1.54 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
1.85 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.15 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.46 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
2.77 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
3.08 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
3.38 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
3.69 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
4.00 MAX 42.72 1 4.11 1 0.00 2 0.00 2
MIN 64.08 2 6.17 2 0.00 2 0.00 2
3 0.00 MAX 64.08 2 6.17 2 0.00 2 0.00 2
MIN 42.72 1 4.11 1 0.00 2 0.00 2
0.31 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
0.62 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
0.92 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
1.23 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
1.54 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
1.85 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.15 MAX 4.93 2 57.54 2 0.00 2 0.00 2
MIN 3.29 1 38.36 1 0.00 2 0.00 2
2.46 MAX 14.79 2 54.51 2 0.00 2 0.00 2
MIN 9.86 1 36.34 1 0.00 2 0.00 2
2.77 MAX 24.65 2 48.44 2 0.00 2 0.00 2
MIN 16.43 1 32.29 1 0.00 2 0.00 2
3.08 MAX 34.51 2 39.34 2 0.00 2 0.00 2
MIN 23.00 1 26.23 1 0.00 2 0.00 2
3.38 MAX 44.37 2 27.20 2 0.00 2 0.00 2
MIN 29.58 1 18.14 1 0.00 2 0.00 2
3.69 MAX 54.23 2 12.04 2 0.00 2 0.00 2
MIN 36.15 1 8.02 1 0.00 2 0.00 2
4.00 MAX 42.72 1 4.11 1 0.00 2 0.00 2
MIN 64.08 2 6.17 2 0.00 2 0.00 2
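One possible way to drop the unwanted lines is sketched below. It assumes that every page header contains the text PAGE NO. and that the stray [] comes from a line holding only a form feed character ('\f'), which the `line == '\n'` check does not catch. The function name extract_rows, its parameters, and the shortened stop marker are hypothetical and not part of the original script.

```python
def extract_rows(lines, start="0.00 MAX", stop="|-----", header="PAGE NO."):
    """Collect whitespace-split rows between the start and stop markers."""
    rows = []
    inside = False
    for line in lines:
        tokens = line.split()  # split() treats '\f' as whitespace
        if not tokens:
            continue  # skips blank lines and form-feed-only lines
        if header in line:
            continue  # skips the "STAAD SPACE -- PAGE NO. n" header
        if start in line:
            inside = True
        if inside:
            if stop in line:
                inside = False
            else:
                rows.append(tokens)
    return rows


# Small in-memory sample standing in for the real file.
sample = [
    "2 0.00 MAX 64.08 2 6.17 2 0.00 2 0.00 2\n",
    "MIN 42.72 1 4.11 1 0.00 2 0.00 2\n",
    "\n",
    "STAAD SPACE -- PAGE NO. 6\n",
    "\f\n",
    "0.31 MAX 54.23 2 12.04 2 0.00 2 0.00 2\n",
    "|--------\n",
    "trailing text\n",
]
rows = extract_rows(sample)
for row in rows:
    print(row)
```

With a real file, the same function can be fed the open file object directly, for example `with open(filename) as f: rows = extract_rows(f)`.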

Related

change values in dataframe row based on condition

I have this dataframe
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.50 0.50
2 N.Amerca 4.40 4.40 4.40
3 N.Amerca 0.00 8.00 8.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 6.00 6.00
9 Asia 7.50 7.50 7.50
10 Asia 3.75 3.75 3.75
11 Asia 3.50 3.50 3.50
12 Asia 3.80 3.80 3.80
13 Asia 0.00 0.00 0.00
14 Europe 6.52 6.52 6.52
Once a non-zero value is found in 2021, the remaining columns (2022 and 2023) should be set to 0,
and if the first non-zero value is found in 2022, the columns after it should be set to 0. In other words, once a value is found in column 2021 or later, every column to its right should be zeroed.
expected result would be:
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.00 0.00
2 N.Amerca 4.40 0.00 0.00
3 N.Amerca 0.00 8.00 0.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 0.00 0.00
9 Asia 7.50 0.00 0.00
10 Asia 3.75 0.00 0.00
11 Asia 3.50 0.00 0.00
12 Asia 3.80 0.00 0.00
13 Asia 0.00 0.00 0.00
14 Europe 6.52 0.00 0.00
I have tried to apply a lambda:
def foo(r):
    # if r['2021'] > 0, then 2022 and later should be zero
    ...
df = df.apply(lambda x: foo(x), axis=1)
but the challenge is that the columns run from 2021 to 2030, so foo becomes a mess.

Let us try duplicated (this works here because each row repeats the same value once it appears):
df = df.mask(df.T.apply(pd.Series.duplicated).T,0)
Out[57]:
Region 2021 2022 2023
0 Europe 0.00 0.0 0.00
1 N.Amerca 0.50 0.0 0.00
2 N.Amerca 4.40 0.0 0.00
3 N.Amerca 0.00 8.0 0.00
4 Asia 0.00 0.0 1.75
5 Asia 0.00 0.0 0.00
6 Asia 0.00 0.0 2.00
7 N.Amerca 0.00 0.0 0.50
8 Eurpoe 6.00 0.0 0.00
9 Asia 7.50 0.0 0.00
10 Asia 3.75 0.0 0.00
11 Asia 3.50 0.0 0.00
12 Asia 3.80 0.0 0.00
13 Asia 0.00 0.0 0.00
14 Europe 6.52 0.0 0.00

This is another way:
df2 = df.set_index('Region').diff(axis=1).reset_index()
df2['2021'] = df['2021']
or:
df.iloc[:,1:].where(df.iloc[:,1:].ne(0).cumsum(axis=1).eq(1),0)
Output:
2021 2022 2023
0 0.00 0.0 0.00
1 0.50 0.0 0.00
2 4.40 0.0 0.00
3 0.00 8.0 0.00
4 0.00 0.0 1.75
5 0.00 0.0 0.00
6 0.00 0.0 2.00
7 0.00 0.0 0.50
8 6.00 0.0 0.00
9 7.50 0.0 0.00
10 3.75 0.0 0.00
11 3.50 0.0 0.00
12 3.80 0.0 0.00
13 0.00 0.0 0.00
14 6.52 0.0 0.00
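The cumsum idea above can also be written back into the original frame, keeping the Region column. The following is a minimal sketch on a three-row subset of the question's data; the names year_cols and nonzero_count are illustrative. Unlike the duplicated and diff variants, it does not rely on the later values being equal to the first non-zero one.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["N.Amerca", "N.Amerca", "Asia"],
    "2021": [0.5, 0.0, 0.0],
    "2022": [0.5, 8.0, 0.0],
    "2023": [0.5, 8.0, 1.75],
})

# Every column except Region is treated as a year column (an assumption).
year_cols = df.columns[1:]

# Running count of non-zero values per row, left to right.
nonzero_count = df[year_cols].ne(0).cumsum(axis=1)

# Keep cells up to and including the first non-zero value, zero the rest.
df[year_cols] = df[year_cols].where(nonzero_count.eq(1), 0)
print(df)
```

Because the year columns are selected by position, the same lines should work unchanged when the frame runs from 2021 to 2030.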

Get proportionate values of columns in a dataframe - Pandas

I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.

You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00

Set the first column as the index, get the sum of each row, divide the dataframe by those sums, and fill the null entries with 0:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)

Pandas sum over partition by rows following SQL equivalent

I am looking for a way to aggregate (in pandas) a subset of values over a window within a partition, the equivalent of:
select table.*,
sum(income) over (order by id, num_yyyymm rows between 3 preceding and 1 preceding) as prev_income_3,
sum(income) over (order by id, num_yyyymm rows between 1 following and 3 following) as next_income_3
from table order by a.id_customer, num_yyyymm;
I tried with the following solution but it has some problems:
1) Takes ages to complete
2) I have to merge all the results at the end of
for x, y in df.groupby(['id_customer']):
    print(y[['num_yyyymm', 'income']])
    y['next3'] = y['income'].iloc[::-1].rolling(3).sum()
    print(y[['num_yyyymm', 'income', 'next3']])
    break
Results:
num_yyyymm income next3
0 201501 0.00 0.00
1 201502 0.00 0.00
2 201503 0.00 0.00
3 201504 0.00 0.00
4 201505 0.00 0.00
5 201506 0.00 0.00
6 201507 0.00 0.00
7 201508 0.00 0.00
8 201509 0.00 0.00
9 201510 0.00 0.00
10 201511 0.00 0.00
11 201512 0.00 0.00
12 201601 0.00 0.00
13 201602 0.00 0.00
14 201603 0.00 0.00
15 201604 0.00 0.00
16 201605 0.00 0.00
17 201606 0.00 0.00
18 201607 0.00 0.00
19 201608 0.00 0.00
20 201609 0.00 1522.07
21 201610 0.00 1522.07
22 201611 0.00 1522.07
23 201612 1522.07 0.00
24 201701 0.00 -0.00
25 201702 0.00 1.52
26 201703 0.00 1522.07
27 201704 0.00 1522.07
28 201705 1.52 1520.55
29 201706 1520.55 0.00
30 201707 0.00 NaN
31 201708 0.00 NaN
32 201709 0.00 NaN
Does anybody have an alternative solution?
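One possible alternative is sketched below. It assumes the windows should be computed per id_customer (as in the groupby attempt, although the SQL has no PARTITION BY) and that rows are ordered by num_yyyymm. The function name add_window_sums and its parameters are hypothetical. It uses groupby().transform() so no manual merge is needed, and shift(1) so the current row is excluded from both windows.

```python
import pandas as pd


def add_window_sums(df, key="id_customer", order="num_yyyymm", value="income", n=3):
    """Add sums over the n preceding and n following rows within each key."""
    out = df.sort_values([key, order]).copy()
    grouped = out.groupby(key)[value]
    # shift(1) excludes the current row; min_periods=1 sums whatever rows exist.
    out[f"prev_{value}_{n}"] = grouped.transform(
        lambda s: s.shift(1).rolling(n, min_periods=1).sum()
    )
    # Reverse, apply the same backward window, then restore the original order.
    out[f"next_{value}_{n}"] = grouped.transform(
        lambda s: s.iloc[::-1].shift(1).rolling(n, min_periods=1).sum().iloc[::-1]
    )
    return out


# Made-up sample data, only to exercise the function.
sample = pd.DataFrame({
    "id_customer": [1, 1, 1, 1, 1, 2, 2],
    "num_yyyymm": [201501, 201502, 201503, 201504, 201505, 201501, 201502],
    "income": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0],
})
result = add_window_sums(sample)
print(result)
```

Rows with no preceding (or following) row get NaN, which mirrors the NULL that the SQL window sum returns for an empty frame.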

removing the name of a pandas dataframe index after appending a total row to a dataframe

I have calculated a series of tips totals by day of the week and appended it to the bottom of the totalpt dataframe.
I have set index.name for the totalpt dataframe to None.
However, while the dataframe displays the default 0, 1, 2, 3 index, the cell in the top left, directly above the index, is not empty as it would be by default.
How can I make this cell empty in the dataframe?
total_bill tip sex smoker day time size tip_pct
0 16.54 1.01 F N Sun D 2 0.061884
1 12.54 1.40 F N Mon D 2 0.111643
2 10.34 3.50 M Y Tue L 4 0.338491
3 20.25 2.50 M Y Wed D 2 0.123457
4 16.54 1.01 M Y Thu D 1 0.061064
5 12.54 1.40 F N Fri L 2 0.111643
6 10.34 3.50 F Y Sat D 3 0.338491
7 23.25 2.10 M Y Sun B 3 0.090323
pivot = tips.pivot_table('total_bill', index=['sex', 'size'],columns=['day'],aggfunc='sum').fillna(0)
print(pivot)
day         Fri    Mon    Sat    Sun    Thu    Tue    Wed
sex size
F   2     12.54  12.54   0.00  16.54   0.00   0.00   0.00
    3      0.00   0.00  10.34   0.00   0.00   0.00   0.00
M   1      0.00   0.00   0.00   0.00  16.54   0.00   0.00
    2      0.00   0.00   0.00   0.00   0.00   0.00  20.25
    3      0.00   0.00   0.00  23.25   0.00   0.00   0.00
    4      0.00   0.00   0.00   0.00   0.00  10.34   0.00
totals_row = tips.pivot_table('total_bill',columns=['day'],aggfunc='sum').fillna(0).astype('float')
totalpt = pivot.reset_index('sex').reset_index('size')
totalpt.index.name = None
totalpt = totalpt[['Fri', 'Mon','Sat', 'Sun', 'Thu', 'Tue', 'Wed']]
totalpt = totalpt.append(totals_row)
print(totalpt)
**day** Fri Mon Sat Sun Thu Tue Wed #problem text day
0 12.54 12.54 0.00 16.54 0.00 0.00 0.00
1 0.00 0.00 10.34 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 16.54 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 20.25
4 0.00 0.00 0.00 23.25 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 10.34 0.00
total_bill 12.54 12.54 10.34 39.79 16.54 10.34 20.25

That's the columns' name.
In [11]: df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
In [13]: df.columns.name = 'XX'
In [14]: df
Out[14]:
XX A B
0 1 2
You can set it to None to clear it.
In [15]: df.columns.name = None
In [16]: df
Out[16]:
A B
0 1 2
An alternative, if you wanted to keep it, is to give the index a name:
In [21]: df.columns.name = "XX"
In [22]: df.index.name = "index"
In [23]: df
Out[23]:
XX A B
index
0 1 2

You can use rename_axis, available since pandas 0.17.0:
In [3939]: df
Out[3939]:
XX A B
0 1 2
In [3940]: df.rename_axis(None, axis=1)
Out[3940]:
A B
0 1 2
In [3942]: df = df.rename_axis(None, axis=1)
In [3943]: df
Out[3943]:
A B
0 1 2
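Applied to a frame with an appended totals row, the same idea might look like the sketch below. The two-column frame is a made-up stand-in for the question's pivot, and pd.concat is used because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Stand-in for the pivot: the columns Index carries the name "day".
pivot = pd.DataFrame(
    [[12.54, 0.0], [0.0, 10.34]],
    columns=pd.Index(["Fri", "Sat"], name="day"),
)

# One-row frame of column totals, labelled "total_bill".
totals = pivot.sum().to_frame("total_bill").T

# Append the totals row, then clear the columns' name ("day").
combined = pd.concat([pivot, totals]).rename_axis(None, axis=1)
print(combined)
```

Setting `combined.columns.name = None` after the concat would have the same effect as the rename_axis call.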

missing lines while appending files

I am new to both stackoverflow and python, so this might look obvious:
In this procedure I want to create a new file named database out of a list of files generated by a previous procedure. The files in the list are quite big (around 13.6 MB). The goal is to have a single file with the lines from all the others:
database = open('current_database', 'a')
def file_apender(new):
    for line in new:
        database.write(line)
def file_join(list_of_files):
    for file in list_of_files:
        file_apender(file)
Then if I run:
file_join(a_file_list)
I get the database file, but 26 lines are missing and the last one is not complete.
Here is the ending of the file:
63052300774565. 12 4 3 0 0.37 0.79 10.89 12.00 1.21 25.26 0.00 0.00 0.00 0.00
63052300774565. 12 2 0 0 0.06 0.12 2.04 2.21 0.86 5.30 0.00 0.00 0.00 0.00
63052300774565. 12 0 0 0 0.12 0.26 3.13 4.63 3.81 11.95 0.00 0.00 0.00 0.00
63052300774565. 12 2 2 0 0.06 0.15 1.35 2.39 0.00 3.94 0.00 0.00 0.00 0.00
63052300774565. 12 0 1 0 0.06 0.08 1.13 1.29 3.60 6.16 0.00 0.00 0.00 0.00
63052300774565. 12 2 0 0 0.23 0.41 4.02 6.47 8.39 19.52 0.00 0.00 0.00 0.00
63052300774565. 12 1 3 0 0.05 0.16 1.85 2.50 0.57 5.13 0
I have tried to find out whether there is a memory limitation... Otherwise I have no ideas.

I'm going to use my psychic debugging skills, and guess that you don't have a database.close().
If you don't close the file when writing to it, there may still be data in the Python output buffers that hasn't been written to the OS yet. If your program exits at that point, then the data is not written to disk and you will be missing data at the end.
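A minimal sketch of one way to guarantee the close is to open the output file in a with block, so it is flushed and closed even if an error occurs. The version below assumes the list holds file paths rather than already-open file objects, and the name join_files and the temporary sample files are made up for illustration.

```python
import os
import tempfile


def join_files(paths, target):
    """Append every line of each source file to target."""
    with open(target, "a") as database:
        for path in paths:
            with open(path) as source:
                for line in source:
                    database.write(line)
    # Leaving the with block closes the file, which flushes buffered data.


# Create three small sample files in a temporary directory.
workdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    part = os.path.join(workdir, f"part{i}.txt")
    with open(part, "w") as fh:
        fh.write(f"line {i}a\nline {i}b\n")
    parts.append(part)

target = os.path.join(workdir, "current_database")
join_files(parts, target)

with open(target) as fh:
    joined = fh.read().splitlines()
print(len(joined))
```

If the global database handle from the question is kept instead, calling database.close() after file_join(a_file_list) should have the same effect.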
