I would like to know how to remove some rows from a dataset, specifically those whose text is a number or appears in a list of strings. For example:
Text Num
0 bam 132
1 - 65
2 creation 47
3 MAN 32
4 41 831
... ... ...
460 Luchino 21
461 42 4126 7
462 finger 43
463 washing 1
I would like to have something like
Text Num
0 bam 132
2 creation 47
... ... ...
460 Luchino 21
462 finger 43
463 washing 1
where I (manually) removed MAN (which should come from a list of strings, like stop words), -, and the numbers.
I have tried isdigit, but it is not working, so I am sure there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])
If you want to filter rows, you could use .loc with a boolean mask:
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN', '-']), :]
.where(cond, other) returns a dataframe or series of the same shape as self, but keeps the original values where cond is true and replaces with other where it is false.
Read more in the docs
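To make the difference between .where and .loc concrete, here is a minimal sketch built from the first five rows of the question. The column name Text follows the code in the question, and the variable names is_number, is_stop, masked and filtered are only illustrative.

```python
import pandas as pd

# First five rows of the example data from the question
df = pd.DataFrame({
    "Text": ["bam", "-", "creation", "MAN", "41"],
    "Num": [132, 65, 47, 32, 831],
})
my_stop = ["MAN", "-"]

is_number = df["Text"].str.isdigit()   # True where the text is all digits
is_stop = df["Text"].isin(my_stop)     # True where the text is a stop word

# .where keeps the shape and replaces non-matching values with NaN
masked = df["Text"].where(~is_number)

# .loc with a boolean mask actually drops the unwanted rows
filtered = df.loc[~is_number & ~is_stop]
print(filtered)
```

If the column also holds non-string values, .str.isdigit() returns NaN for them, so casting with df["Text"].astype(str) first may be needed.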
You could also try this, which keeps only the rows where Text is not 'MAN':
df[df['Text']!='MAN']
I have a text file txt that has 6 columns:
1.sex (M /F) 2.age 3.height 4.weight 5.-/+ 6.zip code
I need to find out how many males have a - sign (for example: 30 of the M (male) rows in the file are -).
So I only need the number at the end.
Logically I need to work with column 1 and column 5, but I am struggling to get a single total at the end.
This is the content of the text:
M 87 66 133 - 33634
M 17 77 119 - 33625
M 63 57 230 - 33603
F 55 50 249 - 33646
M 45 51 204 - 33675
M 58 49 145 - 33629
F 84 70 215 - 33606
M 50 69 184 - 33647
M 83 60 178 - 33611
M 42 66 262 - 33682
M 33 75 176 + 33634
M 27 48 132 - 33607
I am getting a result now, but I want rows that are both M and positive. How can I add that condition to occurrences?
f=open('corona.txt','r')
data=f.read()
occurrences=data.count('M')
print('Number of Males that have been tested positive:',occurrences)
You can split the lines like this:
occurrences = 0
with open('corona.txt') as f:
    for line in f:
        cells = line.split()
        if cells[0] == "M" and cells[4] == "-":
            occurrences += 1
print("Occurrences of M-:", occurrences)
But it is better to use the csv module or pandas for this type of work.
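As a rough sketch of the csv-module route, assuming the file is space-delimited with one record per line as in the question: the helper name count_rows and the abridged sample are made up for illustration.

```python
import csv

def count_rows(lines, sex, sign):
    """Count records whose first column is `sex` and fifth column is `sign`."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    return sum(
        1
        for row in reader
        if len(row) >= 5 and row[0] == sex and row[4] == sign
    )

# Four records copied from the question's data
sample = [
    "M 87 66 133 - 33634",
    "F 55 50 249 - 33646",
    "M 33 75 176 + 33634",
    "M 27 48 132 - 33607",
]
print(count_rows(sample, "M", "-"))
```

The same function accepts an open file object, for example inside `with open('corona.txt', newline='') as f:`, because csv.reader works on any iterable of lines.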
If you do any significant amount of work with text and columnar data, I would suggest starting to learn pandas.
For this task, if your csv is one record per line and is space-delimited:
import pandas as pd
d = pd.read_csv('data.txt',
                names=['Sex', 'Age', 'Height', 'Weight', 'Sign', 'ZIP'],
                sep=' ', index_col=False)
d[(d.Sex=='M') & (d.Sign=='-')].shape[0] # or
len(d[(d.Sex=='M') & (d.Sign=='-')]) # same result, in this case = 9
Pandas is a very extensive package. This code builds a DataFrame from your csv data, giving each column a name. It then selects every row where both of your conditions hold (Sex == 'M' and Sign == '-') and reports the number of records found.
I recommend starting here
I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a csv dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are both less than 10, I want to remove the second line.
For example, from the first two lines I want to delete the second line, similarly from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr group by, lag and lead functions. However, I am not sure how to combine different functions in python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
Not sure what function should I call within transform and how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:
#df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any()) else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
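The apply approach above keeps every second row of a qualifying group, which matches this sample. An alternative sketch that checks each row individually uses a boolean mask. It assumes "less than 10" means the absolute difference (the expected output keeps the row with ydiff -41), and that the first row of each group is always kept. The names first_in_group and close are illustrative.

```python
import io
import pandas as pd

csv_text = """imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
"""
df = pd.read_csv(io.StringIO(csv_text))

# True for the first row of every (imagename, brandname) group
first_in_group = df.groupby(["imagename", "brandname"]).cumcount().eq(0)

# True where the row sits within 10 units of the previous row on both axes
close = df["xdiff"].abs().lt(10) & df["ydiff"].abs().lt(10)

# Keep group leaders and any row that is not a near-duplicate
result = df[first_in_group | ~close]
print(result)
```

This keeps the original row order, so the result lines up with the expected output in the question.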
Input:
LineNo word_num left top width text
1 1 322 14 14 My
1 2 304 4 41 Name
1 3 322 5 9 is
1 4 316 14 20 Raghav
2 1 420 129 34 Problem
2 2 420 31 27 just
2 3 420 159 27 got
2 4 431 2 38 complicated
1 1 322 14 14 #40
1 2 304 4 41 #gmail.com
2 1 420 129 34 2019
2 2 420 31 27 January
As you can see, there are columns LineNo, left, top and word_num, so I was wondering whether I could build some logic on these to reach my solution.
I want to tweak the output. It comes from a PDF after it has been converted into an image, so whole lines are captured together and the output does not make sense. What I am thinking of doing now is grouping the text in a meaningful way. For example,
let's say I get this output by using:
g = df['line_num'].ne(df['line_num'].shift()).cumsum()
out = '\n'.join(df.groupby(g)['text'].agg(' '.join))
print (out)
Output=
"My name is raghav #40 #gmail.com
Problem just got complicated $2019 January"
Expected Output=
"My name is raghav
*40
#gmail.com
Problem just got complicated
2019 January"
Each logical group should end up on its own line, regardless of whether the words were on the same physical line.
In my understanding maybe we can achieve this by doing these steps:
a) Words on same line are grouped if x distance < threshold
b) Words on next line are grouped with previous if y distance < threshold
Threshold is width(image)/ 100; x distance is calculated from left; y distance is calculated from top.
Can we do this?
Let me know if the question is not clear enough!
Thanks!
I have added the image I am trying to get the output from. The data in it is a little complicated, so I have changed it myself for this example.
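Steps (a) and (b) from the question could be sketched roughly as follows. This is only one possible reading: the x distance is taken as the gap between the previous word's right edge (left + width) and the current word's left, and the y distance as the difference between the two top values. The function name group_words and the word boxes are hypothetical and not taken from the question's data.

```python
def group_words(words, threshold):
    """Group (line, left, top, width, text) tuples given in reading order."""
    groups = []
    prev = None
    for line, left, top, width, text in words:
        if prev is None:
            joined = False
        elif line == prev[0]:
            # (a) same line: horizontal gap from the previous word's right edge
            joined = left - (prev[1] + prev[3]) < threshold
        else:
            # (b) next line: vertical distance between the two top values
            joined = abs(top - prev[2]) < threshold
        if joined:
            groups[-1].append(text)
        else:
            groups.append([text])
        prev = (line, left, top, width, text)
    return [" ".join(group) for group in groups]

# Hypothetical word boxes, made up for illustration
words = [
    (1, 10, 5, 20, "My"),
    (1, 35, 5, 40, "Name"),
    (1, 200, 5, 30, "#40"),
    (2, 10, 60, 50, "Problem"),
    (2, 65, 60, 30, "just"),
]
image_width = 1000  # assumed image width in pixels
result = group_words(words, threshold=image_width / 100)
print("\n".join(result))
```

With real OCR output the words would first need sorting by line and left position, since the rule only compares each word with the one before it.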
To answer your second concern, maybe try iterating through the column like so:
phrase = ""
for value in df['text']:
    # Skip NaN and other non-string cells
    if isinstance(value, str):
        phrase = phrase + " " + value
To add the space/..., I agree with jezrael, use the str.cat method.
Use join twice: once inside agg to build each line, and once on the resulting Series:
out = '.....'.join(df.groupby('LineNo')['text'].agg(' '.join))
print (out)
My Name is Raghav.....Roll No. # 242
Another solution with str.cat:
out = df.groupby('LineNo')['text'].agg(' '.join).str.cat(sep='.....')
EDIT:
g = df['LineNo'].ne(df['LineNo'].shift()).cumsum()
out = '.....'.join(df.groupby(g)['text'].agg(' '.join))
print (out)
My Name is Raghav.....Roll No. # 242.....hello the problem just.....got more complicated !!!!
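To show why the EDIT groups on ne/shift/cumsum rather than on LineNo directly, here is a small sketch using the LineNo and text columns from the question's input table. The variable names by_value and by_run are illustrative.

```python
import pandas as pd

# LineNo and text columns from the question's input
df = pd.DataFrame({
    "LineNo": [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2],
    "text": ["My", "Name", "is", "Raghav",
             "Problem", "just", "got", "complicated",
             "#40", "#gmail.com", "2019", "January"],
})

# Grouping on the value merges every row that shares a LineNo
by_value = df.groupby("LineNo")["text"].agg(" ".join).tolist()

# A new group id starts each time LineNo differs from the previous row
g = df["LineNo"].ne(df["LineNo"].shift()).cumsum()
by_run = df.groupby(g)["text"].agg(" ".join).tolist()

print("\n".join(by_run))
```

Grouping on consecutive runs gives four separate lines here, whereas grouping on the LineNo value alone gives two merged ones.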
I have read other similar posts, but they don't seem to work in my case, so I'm posting a new question here.
I have a text file which has varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g. in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101 to 104' because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that the code below starts from '101' but keeps reading values from the other lines up to '304' or beyond. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")
for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break
for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)
input_file.close()
output_file.close()
EDIT: The code used to find parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break
input_file.close()
>>> with open('filename') as fh:                      # Open the file
...     for line in fh:                               # For each line in the file
...         values = line.split()                     # Split the values into a list
...         if len(values) > 1 and values[1] == '1':  # Skip short lines, compare the second value
...             print(values[-2], values[-1])         # Print the 2nd from last and last
...
1 101
101 2
2 102
102 3
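As for why the code in the question reads too far: the second enumerate(input_file) starts counting at 0 again, while the file object continues from where the first loop stopped, so the stop index is reached much later than intended. A sketch that avoids line arithmetic altogether is shown below. It only looks inside the $Elements section, which is an extra safeguard not present in the answer above. The function name two_node_pairs is hypothetical and the sample is an abridged copy of the file from the question.

```python
def two_node_pairs(lines):
    """Return the last two values of every element line whose second value is '1'."""
    pairs = []
    in_elements = False
    for line in lines:
        values = line.split()
        if not values:
            continue                      # skip blank lines
        if values[0] == "$Elements":
            in_elements = True            # section of interest begins
        elif values[0] == "$EndElements":
            in_elements = False           # section of interest ends
        elif in_elements and len(values) > 1 and values[1] == "1":
            pairs.append((values[-2], values[-1]))
    return pairs

# Abridged sample based on the file shown in the question
sample = """$MeshFormat
2.2 0 8
$EndMeshFormat
$Elements
630
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
$EndElements
""".splitlines()

for first, second in two_node_pairs(sample):
    print(first, second)
```

The function takes any iterable of lines, so an open file object can be passed in directly instead of the sample list.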