Delete lines that contain decimal numbers - python

I am trying to delete lines that contain decimal numbers. For instance:
82.45 76.16 21.49 -2.775
5 24 13 6 9 0 3 2 4 9 7 11 54 11 1 1 18 5 0 0
1 1 0 2 2 0 0 0 0 0 0 0 14 90 21 5 24 26 73 13
20 33 23 59 158 85 17 6 158 66 15 13 13 10 2 37 81 0 0 0
1 3 0 19 8 158 75 7 10 8 5 1 23 58 148 77 120 78 6 7
158 80 15 10 16 21 6 37 100 25 0 0 0 0 0 3 1 10 9 1
0 0 0 0 11 16 57 15 0 0 0 0 158 76 9 1 0 0 0 0
22 17 0 0 0 0 0 0
50.04 143.84 18.52 -1.792
3 0 0 0 0 0 0 0 36 0 0 0 2 4 0 1 23 2 0 0
8 24 4 12 21 9 5 2 0 0 0 4 40 0 0 0 0 0 0 12
150 11 2 7 12 16 4 59 72 8 30 88 68 83 15 27 21 11 49 94
6 1 1 8 17 8 0 0 0 0 0 5 150 150 33 46 9 0 0 20
28 49 81 150 76 5 8 17 36 23 41 48 7 1 16 88 0 3 0 0
0 0 0 0 36 108 13 9 2 0 3 61 19 26 14 34 27 8 98 150
14 2 0 1 1 0 115 150
114.27 171.37 10.74 -2.245
.................. and this pattern continues for thousands of lines, and likewise I have about 3000 files with a similar pattern of data.
So, I want to delete the lines that contain these decimal numbers. In most cases every 8th line has decimal numbers, and hence I tried using awk 'NR % 8 != 0' < file_name. But the problem is that not all files in the database have decimal numbers on every 8th line. So, is there a way in which I can delete the lines that contain decimal numbers? I am coding in Python 2.7 on Ubuntu.

You can just look for lines containing a decimal separator. Note that str.index raises ValueError when the substring is missing, so use str.find, which returns -1 instead:
with open('filename_without_decimals.txt', 'wb') as of:
    with open('filename.txt') as fp:
        for line in fp:
            if line.find(".") == -1:
                of.write(line)
If you prefer sed, it is even cleaner:
sed -i '/\./d' file.txt

The solution would be something like:
infile = open('textfile.txt')
text = ""
for line in infile.readlines():
    if '.' not in line:
        text += line
print text

Have you tried this, using awk:
awk '!/\./{print}' your_file

deci = open('with.txt')
no_deci = open('without.txt', 'w')
for line in deci.readlines():
    if '.' not in line:
        no_deci.write(line)
deci.close()
no_deci.close()
readlines returns a list of all the lines in the file.
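Since the question mentions about 3000 similar files, a minimal sketch of applying the same filter to a whole directory is shown below. The directory path, the *.txt glob pattern, the .nodecimals output suffix, and the use of a regular expression to match decimal numbers (rather than any line containing a dot) are assumptions for illustration, not part of the answers above:
import glob
import re

# Assumption: the lines to drop are exactly those containing a number with a
# fractional part, e.g. 82.45 or -2.775.
DECIMAL = re.compile(r'-?\d+\.\d+')

for path in glob.glob('/path/to/data/*.txt'):      # hypothetical location
    with open(path) as src:
        kept = [line for line in src if not DECIMAL.search(line)]
    with open(path + '.nodecimals', 'w') as dst:   # write a filtered copy
        dst.writelines(kept)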


How to use videos in validate_data?

This is the code I am using for training my model.
model_training = convlstm_model.fit(x=features_train,y=labels_train,epochs=50,batch_size=4,shuffle=True,validation_split=0.2,callbacks=[early_stopping_callback])
I have used validation_split above, but I want to instead use validation_data to validate my model. Here's my validation data:
0 0 0.mp4
1 0 1.mp4
2 0 2.mp4
3 0 3.mp4
4 0 4.mp4
5 0 5.mp4
6 0 6.mp4
7 0 7.mp4
8 0 8.mp4
9 0 9.mp4
10 0 10.mp4
11 0 11.mp4
12 0 12.mp4
13 0 13.mp4
14 0 14.mp4
15 0 15.mp4
16 0 16.mp4
17 0 17.mp4
18 0 18.mp4
19 0 19.mp4
20 0 20.mp4
21 0 21.mp4
22 0 22.mp4
23 0 23.mp4
24 1 24.mp4
25 1 25.mp4
26 1 26.mp4
27 1 27.mp4
28 1 28.mp4
29 1 29.mp4
30 1 30.mp4
31 1 31.mp4
32 1 32.mp4
33 1 33.mp4
34 1 34.mp4
35 1 35.mp4
36 1 36.mp4
37 1 37.mp4
38 1 38.mp4
39 1 39.mp4
40 1 40.mp4
41 1 41.mp4
42 1 42.mp4
43 1 43.mp4
44 1 44.mp4
45 1 45.mp4
46 1 46.mp4
47 1 47.mp4
48 1 48.mp4
49 1 49.mp4
50 1 50.mp4
So, it's basically a text file in which the first column is the serial number, the second column is the class number, and the third column is the video file name. The video files for validation are given separately in another folder.
So my question is: how do I give this as input to validation_data, given that I generally see validation_data = (x_test, y_test)?
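One common way to feed such a listing to validation_data is to decode a fixed number of frames per video and stack them into arrays, as in this minimal sketch using OpenCV. The val_list.txt and val_videos/ paths, the 20-frame sequence length, the 64x64 resize, and the integer class labels are assumptions for illustration; the preprocessing must of course match whatever produced features_train:
import cv2
import numpy as np

SEQ_LEN, H, W = 20, 64, 64            # assumed to match the training features

def load_video(path, seq_len=SEQ_LEN):
    """Decode up to seq_len evenly spaced frames, resized and scaled to [0, 1]."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // seq_len, 1)
    frames = []
    for i in range(seq_len):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (W, H)) / 255.0)
    cap.release()
    # pad with black frames if the clip is too short
    while len(frames) < seq_len:
        frames.append(np.zeros((H, W, 3)))
    return np.array(frames)

x_val, y_val = [], []
with open('val_list.txt') as f:                     # hypothetical listing file
    for line in f:
        _, label, name = line.split()               # serial no., class, file name
        x_val.append(load_video('val_videos/' + name))  # hypothetical folder
        y_val.append(int(label))

x_val = np.array(x_val)
y_val = np.array(y_val)

# then: convlstm_model.fit(..., validation_data=(x_val, y_val))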

ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

def bar1():
    df=pd.read_csv('#CSVFILELOCATION#',encoding= 'unicode_escape')
    x=np.arange(11)
    df=df.set_index(['Country'])
    dfl=df.iloc[:,[4,9]]
    w=dfl.groupby('Country')['SummerTotal' , 'WinterTotal'].sum()
    final_df=w.sort_values(by='Country').tail(11)
    final_df.reset_index(inplace=True)
    final_df.columns=('Country','SummerTotal','WinterTotal')
    final_df=final_df.drop(11,axis='index')
    Countries=df['Country']
    STotalMed=df['SummerTotal']
    WTotalMed=df['WinterTotal']
    plt.bar(x-0.25,STotalMed,label='Total Medals by Countries in Summer',color='g')
    plt.bar(x+0.25,WTotalMed,label='Total Medals by Countries in Winter',color='r')
    plt.xticks(r,Countries,rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries',color='red',fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()
This is the code for a bar graph I am using in a project. It raises this error:
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
Please help, I want to submit this project soon.
CSV:
Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
 Afghanistan  14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
 Algeria  13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
 Argentina  24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
 Armenia  6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
 Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
 Australia  26 147 163 187 497 19 5 5 5 15 45 152 168 192 512
 Austria  27 18 33 36 87 23 64 81 87 232 50 82 114 123 319
 Azerbaijan  6 7 11 24 42 6 0 0 0 0 12 7 11 24 42
 Bahamas  16 6 2 6 14 0 0 0 0 0 16 6 2 6 14
 Bahrain  9 2 1 0 3 0 0 0 0 0 9 2 1 0 3
 Barbados 12 0 0 1 1 0 0 0 0 0 12 0 0 1 1
 Belarus  6 12 27 39 78 7 8 5 5 18 13 20 32 44 96
 Belgium  26 40 53 55 148 21 1 2 3 6 47 41 55 58 154
 Bermuda  18 0 0 1 1 8 0 0 0 0 26 0 0 1 1
 Bohemia  3 0 1 3 4 0 0 0 0 0 3 0 1 3 4
 Botswana  10 0 1 0 1 0 0 0 0 0 10 0 1 0 1
 Brazil  22 30 36 63 129 8 0 0 0 0 30 30 36 63 129
 British West Indies  1 0 0 2 2 0 0 0 0 0 1 0 0 2 2
 Bulgaria  20 51 87 80 218 20 1 2 3 6 40 52 89 83 224
 Burundi  6 1 1 0 2 0 0 0 0 0 6 1 1 0 2
 Cameroon 14 3 1 2 6 1 0 0 0 0 15 3 1 2 6
Info: SummerTimesPart is the number of times each country participated in summer;
WinterTimesPart is the number of times each country participated in winter.
A few changes were needed to get the chart working:
A tick array is required to plot the country names
Use final_df for the chart data, not df
Set the bar width so the bars don't overlap
Here is the updated code:
data = '''
Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
Afghanistan 14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
Algeria 13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
Argentina 24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
Armenia 6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
Australia 26 147 163 187 497 19 5 5 5 15 45 152 168 192 512
Austria 27 18 33 36 87 23 64 81 87 232 50 82 114 123 319
Azerbaijan 6 7 11 24 42 6 0 0 0 0 12 7 11 24 42
Bahamas 16 6 2 6 14 0 0 0 0 0 16 6 2 6 14
Bahrain 9 2 1 0 3 0 0 0 0 0 9 2 1 0 3
Barbados 12 0 0 1 1 0 0 0 0 0 12 0 0 1 1
Belarus 6 12 27 39 78 7 8 5 5 18 13 20 32 44 96
Belgium 26 40 53 55 148 21 1 2 3 6 47 41 55 58 154
Bermuda 18 0 0 1 1 8 0 0 0 0 26 0 0 1 1
Bohemia 3 0 1 3 4 0 0 0 0 0 3 0 1 3 4
Botswana 10 0 1 0 1 0 0 0 0 0 10 0 1 0 1
Brazil 22 30 36 63 129 8 0 0 0 0 30 30 36 63 129
BritishWestIndies 1 0 0 2 2 0 0 0 0 0 1 0 0 2 2
Bulgaria 20 51 87 80 218 20 1 2 3 6 40 52 89 83 224
Burundi 6 1 1 0 2 0 0 0 0 0 6 1 1 0 2
Cameroon 14 3 1 2 6 1 0 0 0 0 15 3 1 2 6
'''.strip()
with open('data.csv', 'w') as f: f.write(data)  # write test file
############################
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def bar1():
    df=pd.read_csv('data.csv', encoding='unicode_escape', sep=' ', index_col=False)
    x=np.arange(11)
    df=df.set_index(['Country'])
    dfl=df.iloc[:,[4,9]]
    w=dfl.groupby('Country')[['SummerTotal','WinterTotal']].sum()
    final_df=w.sort_values(by='Country').tail(11)
    final_df.reset_index(inplace=True)
    final_df.columns=('Country','SummerTotal','WinterTotal')
    print(final_df)
    # final_df=final_df.drop(11,axis='index')
    Countries=final_df['Country']
    STotalMed=final_df['SummerTotal']
    WTotalMed=final_df['WinterTotal']
    plt.bar(x-0.25,STotalMed,width=.2, label='Total Medals by Countries in Summer',color='g')
    plt.bar(x+0.25,WTotalMed,width=.2, label='Total Medals by Countries in Winter',color='r')
    plt.xticks(np.arange(11),Countries,rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries',color='red',fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()
bar1()
Output

How to fill in values of a dataframe column if the difference between values in another column is sufficiently small?

I have a dataframe df1:
Time Delta_time
0 0 NaN
1 15 15
2 18 3
3 30 12
4 45 15
5 64 19
6 80 16
7 82 2
8 100 18
9 120 20
where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'.
How do I set the value of Short_gap to 1 for all Time values that fall within an interval whose Delta_time is smaller than 5? For example, the Short_gap column should have a value of 1 for Time = 15, 16, 17, 18, since the interval ending at Time = 18 has Delta_time = 3 < 5.
Edit: Currently, df2 looks like this.
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
... ... ...
118 118 0
119 119 0
120 120 0
The expected output for df2 is
Time Short_gap
0 0 0
1 1 0
2 2 0
... ... ...
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 1
19 19 0
20 20 0
... ... ...
78 78 0
79 79 0
80 80 1
81 81 1
82 82 1
83 83 0
84 84 0
... ... ...
119 119 0
120 120 0
Try the following: the flag (Delta_time < 5) of each interval is repeated once for every time step that the interval covers, which directly yields the Short_gap column.
t = df1['Delta_time'].shift(-1)
df2 = ((t < 5).repeat(t.fillna(1).astype(int)).astype(int).reset_index(drop=True)
       .to_frame(name='Short_gap').rename_axis('Time').reset_index())
print(df2.head(20))
print('...')
print(df2.loc[78:84])
Output:
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 0
19 19 0
...
Time Short_gap
78 78 0
79 79 0
80 80 1
81 81 1
82 82 0
83 83 0
84 84 0
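If the end of each short gap should also be flagged (the expected output in the question marks Time = 18 and Time = 82 with 1, while the output above leaves them at 0), one alternative is to loop over consecutive time pairs and flag the whole closed interval. A minimal sketch, rebuilding df1 and df2 from the values shown in the question:
import pandas as pd

df1 = pd.DataFrame({'Time': [0, 15, 18, 30, 45, 64, 80, 82, 100, 120]})
df1['Delta_time'] = df1['Time'].diff()

df2 = pd.DataFrame({'Time': range(121), 'Short_gap': 0})

# For every pair of consecutive times less than 5 apart, flag the whole
# closed interval [start, end] in df2.
for start, end, gap in zip(df1['Time'], df1['Time'].shift(-1), df1['Delta_time'].shift(-1)):
    if pd.notna(gap) and gap < 5:
        df2.loc[df2['Time'].between(start, end), 'Short_gap'] = 1

print(df2.loc[14:19])   # Time 15..18 flagged
print(df2.loc[79:84])   # Time 80..82 flagged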

get data from xml using pandas

I'm trying to get some data from XML using pandas. Currently I have "working" code, and by working I mean it almost works.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
response = requests.get(url).content
soup = BeautifulSoup(response)
tables = soup.find_all('tabela_rozklad')
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
        'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all])
df.columns = tags
a = df[(df.sem == "1")]
a = a[(a.id_spec == "0")]
a = a[(a.dzien == "1")]
print(a)
I'm getting an error on a = df[(df.sem == "1")], which is:
File "pandas\index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas\index.c:4443)
File "pandas\index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas\index.c:4289)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13733)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13687)
Reading other Stack Overflow questions, I saw people suggest using df.loc, so I modified this line to
a = df.loc[(df.sem == "1")]
Now the code runs, but the result looks as if this line didn't exist. I should mention that the problem is with the "sem" tag only; the rest works perfectly, but unfortunately I need exactly this tag. If anyone could explain what is causing this error and how to fix it, I would be grateful.
You can add ignore_index=True to append to avoid a duplicated index, and you need to select the column sem with brackets (df['sem']) rather than attribute access, because sem is also a DataFrame method (standard error of the mean), so df.sem resolves to the method instead of the column:
df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all], ignore_index=True)
df.columns = tags
#print (df)

a = df[(df['sem'] == '1') & (df.id_spec == "0") & (df.dzien == "1")]
print(a)
dzien godz ilosc tyg id_naucz id_sala id_prz rodz grupa id_st sem id_spec
0 1 1 2 0 52 79 13 W 1 13 1 0
1 1 3 2 0 12 79 32 W 1 13 1 0
2 1 5 2 0 52 65 13 Ćw 1 13 1 0
3 1 11 2 0 201 3 70 Ćw 10 13 1 0
4 1 5 2 0 36 78 13 Ps 5 13 1 0
5 1 5 2 1 18 32 450 Ps 3 13 1 0
6 1 5 2 2 18 32 450 Ps 4 13 1 0
7 1 7 2 1 18 32 450 Ps 7 13 1 0
8 1 7 2 2 18 32 450 Ps 8 13 1 0
9 1 7 2 0 66 65 104 Ćw 1 13 1 0
10 1 7 2 0 283 3 104 Ćw 5 13 1 0
11 1 7 2 0 346 5 104 Ćw 8 13 1 0
12 1 7 2 0 184 29 13 Ćw 7 13 1 0
13 1 9 2 0 66 65 104 Ćw 2 13 1 0
14 1 9 2 0 346 5 70 Ćw 8 13 1 0
15 1 9 1 0 73 3 203 Ćw 9 13 1 0
16 1 10 1 0 73 3 203 Ćw 10 13 1 0
17 1 9 2 0 184 19 13 Ps 13 13 1 0
18 1 11 2 0 184 19 13 Ps 14 13 1 0
19 1 11 2 0 44 65 13 Ćw 9 13 1 0
87 1 9 2 0 201 54 463 W 1 17 1 0
88 1 3 2 0 36 29 13 Ćw 2 17 1 0
89 1 3 2 0 211 5 70 Ćw 1 17 1 0
90 1 5 2 0 211 5 70 Ćw 2 17 1 0
91 1 7 2 0 36 78 13 Ps 4 17 1 0
105 1 1 2 1 11 16 32 Ps 2 18 1 0
106 1 1 2 2 11 16 32 Ps 3 18 1 0
107 1 3 2 0 51 3 457 W 1 18 1 0
110 1 5 2 2 11 16 32 Ps 1 18 1 0
111 1 7 2 0 91 64 97 Ćw 2 18 1 0
112 1 5 2 0 283 3 457 Ćw 2 18 1 0
254 1 5 1 0 12 29 32 Ćw 6 13 1 0
255 1 6 1 0 12 29 32 Ćw 5 13 1 0
462 1 7 2 0 98 1 486 W 1 19 1 0
463 1 9 1 0 91 1 484 W 1 19 1 0
487 1 5 2 0 116 19 13 Ps 1 17 1 0
488 1 7 2 0 116 19 13 Ps 2 17 1 0
498 1 5 2 0 0 0 431 Ps 2 17 1 0
502 1 5 2 0 0 0 431 Ps 15 13 1 0
503 1 5 2 0 0 0 431 Ps 16 13 1 0
504 1 5 2 0 0 0 431 Ps 19 13 1 0
505 1 5 2 0 0 0 431 Ps 20 13 1 0
531 1 13 2 0 350 79 493 W 1 13 1 0
532 1 13 2 0 350 79 493 W 2 17 1 0
533 1 13 2 0 350 79 493 W 1 18 1 0
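As a side note, DataFrame.append was deprecated and later removed in pandas 2.0, so with current pandas the same collection step is usually written by gathering one dict per element and building the frame once. A sketch assuming the soup, tables and tags objects from the question:
import pandas as pd

rows = []
for table in tables:
    # one dict per <tabela_rozklad> element, keyed by tag name
    rows.append({tag: table.find(tag).text for tag in tags})

df = pd.DataFrame(rows, columns=tags)
a = df[(df['sem'] == '1') & (df['id_spec'] == '0') & (df['dzien'] == '1')]
print(a)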

Pandas dataframe from nested dictionary to melted data frame

I converted a nested dictionary to a Pandas DataFrame which I want to use to create a heatmap.
The nested dictionary is simple to create:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps, and the heatmap would essentially be this data frame. However, the dataframes needed for ggplot are a little different. I can use the pandas.melt function to get close, but I'm missing the row titles.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest way to make this dataframe would be to add the amino acid of each row as a value, so the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested dictionary dataframe in order to get the dataframe at the end? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df, id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
If I understand your question properly, I think you can do the following. Note that melt stacks the columns one after another, so the row labels have to be repeated once per column before they can be attached:
mdf = pandas.melt(df)
mdf['rowvalue'] = list(df.index) * len(df.columns)
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
