How to extract a count from a text file in Python - python

I have a text file with 6 columns:
1. sex (M/F) 2. age 3. height 4. weight 5. -/+ 6. zip code
I need to find how many males in this file have a - sign (for example: 30 M (males) in the file are -).
So I only need the number at the end.
Logically I need to work with columns 1 and 5, but I am struggling to get a single number (the count) at the end.
This is the content of the text:
M 87 66 133 - 33634
M 17 77 119 - 33625
M 63 57 230 - 33603
F 55 50 249 - 33646
M 45 51 204 - 33675
M 58 49 145 - 33629
F 84 70 215 - 33606
M 50 69 184 - 33647
M 83 60 178 - 33611
M 42 66 262 - 33682
M 33 75 176 + 33634
M 27 48 132 - 33607
I am getting a result now, but I want rows that are both M and positive. How can I add that to occurrences?
f=open('corona.txt','r')
data=f.read()
occurrences=data.count('M')
print('Number of Males that have been tested positive:',occurrences)

You can split the lines like this:
occurrences = 0
with open('corona.txt') as f:
    for line in f:
        cells = line.split()
        if cells[0] == "M" and cells[4] == "-":
            occurrences += 1
print("Occurrences of M-:", occurrences)
But it is better to use the csv module or pandas for this type of work.
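As a rough illustration of the csv-module route mentioned above, here is a minimal sketch. The function name count_rows and the in-memory sample are assumptions made for the example; the column positions (sex first, sign fifth) come from the question.

```python
import csv
import io

def count_rows(lines, sex, sign):
    """Count rows whose first column is `sex` and fifth column is `sign`."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    count = 0
    for row in reader:
        # Skip blank or short lines instead of raising IndexError.
        if len(row) >= 5 and row[0] == sex and row[4] == sign:
            count += 1
    return count

# A few rows from the question, standing in for the real file.
sample = io.StringIO(
    "M 87 66 133 - 33634\n"
    "F 55 50 249 - 33646\n"
    "M 33 75 176 + 33634\n"
    "M 27 48 132 - 33607\n"
)
print(count_rows(sample, "M", "-"))  # 2
```

With the real file you would pass the open file object instead of the StringIO, for example inside with open('corona.txt', newline='') as f.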

If you do any significant amount of work with text and columnar data, I would suggest getting started on learning pandas.
For this task, if your csv is one record per line and is space-delimited:
import pandas as pd
d = pd.read_csv('data.txt',
                names=['Sex', 'Age', 'Height', 'Weight', 'Sign', 'ZIP'],
                sep=' ', index_col=False)
d[(d.Sex=='M') & (d.Sign=='-')].shape[0] # or
len(d[(d.Sex=='M') & (d.Sign=='-')]) # same result, in this case = 9
Pandas is a very extensive package. This code builds a DataFrame from your csv data, giving each column a name. It then selects each row where both of your conditions hold (Sex == 'M' and Sign == '-') and reports the number of records found.
I recommend starting here

Related

Extract information from an Excel (by updating arrays) with Excel / Python

I have an Excel file with thousands of columns on the following format:
Member No.  X    Y    Z
1000        25   60   -30
            -69  38   68
            45   2    43
1001        24   55   79
            4    -7   89
            78   51   -2
1002        45   -55  149
            94   77   -985
            -2   559  56
I need a way to get a new table with the absolute maximum value of each column for each member. In this example, something like:
Member No.  X    Y    Z
1000        69   60   68
1001        78   55   89
1002        94   559  985
I have tried it in Excel (using VLOOKUP to find the member number in the first row and then HLOOKUP to find the values in the rows after it), but HLOOKUP is not automatically updated with the new array (the array containing member number 1001). So my solution works for member 1000 but not for 1001 and 1002: it always searches for the new value ONLY in the first row (i.e. the row with member number 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much headway. Once the dataset has been read, how do I tell it to read the next 3 rows and get the (absolute) maximum in each column?
Can someone please help? Solution required in Python 3 or Excel (ideally, Excel 2014).
The solution below gets your desired output using Python.
I first use ffill to fill in the blanks in your Member No. column (axis=0 fills down the column). Then I convert the dataframe values to positive using abs. Lastly, I group by Member No. and take the max value of every column.
Assuming your dataframe is called data:
import pandas as pd
data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()
res then looks like this:
Member No. X Y Z A B C
0 1000 69 60 68 60 74 69
1 1001 78 55 89 78 92 87
2 1002 94 559 985 985 971 976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
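For readers without pandas, the same three steps (fill the member number down, take absolute values, keep the per-member maximum) can be sketched in plain Python. This is an alternative technique, not the answer's code; the function name and the convention of using None for blank member cells are assumptions made for illustration.

```python
def abs_max_per_member(rows):
    """rows: (member, x, y, z) tuples; member is None on continuation rows."""
    result = {}
    current = None
    for member, *values in rows:
        if member is not None:  # forward-fill the member number
            current = member
        if current is None:
            continue  # rows before the first member number are ignored
        best = result.setdefault(current, [0] * len(values))
        for i, v in enumerate(values):
            best[i] = max(best[i], abs(v))
    return result

# The sample data from the question.
rows = [
    (1000, 25, 60, -30), (None, -69, 38, 68), (None, 45, 2, 43),
    (1001, 24, 55, 79), (None, 4, -7, 89), (None, 78, 51, -2),
    (1002, 45, -55, 149), (None, 94, 77, -985), (None, -2, 559, 56),
]
for member, maxima in abs_max_per_member(rows).items():
    print(member, *maxima)
# 1000 69 60 68
# 1001 78 55 89
# 1002 94 559 985
```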

print 2 line above (with specific word) after a specific match pattern

I am a newbie in Python. I have a file with a list of patterns and a large file that contains these patterns.
I want to extract the lines that match these patterns, and the two lines above each match that contain a specific word.
My big file: bigfile.txt
QUERY Query_17 Peptide 93 ANN2
ENDQUERY
2 Query_17 Specific 197609 50 89 - 389788
2 Query_17 Specific 197609 50 89 - 389788
2 Query_17 Specific 197609 50 89 LysM - 389788
ENDQUERY
QUERY Query_33 Peptide 305 ANN2
2 Query_33 Specific 372835 33 134 GUB_WAK_bind - 45
2 Query_33 Non-specific 373037 222 WAK_assoc N 45
ENDQUERY
QUERY Query_42 Peptide 217 ANN3
ENDQUERY
QUERY Query_43 Peptide 435 ANN3
2 Query_43 Specific 237995 47 164 B_lectin - 390234
ENDQUERY
QUERY Query_45 Peptide 717 ANN34
ENDQUERY
2 Query_45 Specific 214519 44 160 - 390234
2 Query_45 Specific 237995 376 449 B_lectin N 390234
My pattern match file:
LysM, GUB_WAK_bind, WAK_assoc, B_lectin
Expected output:
QUERY Query_17 Peptide 93 ANN2
2 Query_17 Specific 197609 50 89 LysM - 389788
QUERY Query_33 Peptide 305 ANN2
2 Query_33 Specific 372835 33 134 GUB_WAK_bind - 45
2 Query_33 Non-specific 373037 222 WAK_assoc N 45
QUERY Query_43 Peptide 435 ANN3
2 Query_43 Specific 237995 47 164 B_lectin - 390234
QUERY Query_45 Peptide 717 ANN34
2 Query_45 Specific 237995 376 449 B_lectin N 390234
Any help would be great.
Thanks
Update
After some clarification, it seems that the original question was not formulated quite correctly. Instead, the goal is to get, for each matching line:
the previous QUERY line (unless it's already been printed),
the matching line.
Different goal --> different answer:
def query_grep(file, substrings):
    # file is an already-open file object (or any iterable of lines)
    lastquery = None
    for line in file:
        line = line.rstrip('\n')  # remove newline
        if line.startswith('QUERY'):
            lastquery = line
        else:
            if any(s in line for s in substrings):
                if lastquery is not None:
                    yield lastquery
                    lastquery = None
                yield line
Example:
substrings = ['LysM', 'GUB_WAK_bind', 'WAK_assoc', 'B_lectin']
with open(filename, 'r') as f:
    for res in query_grep(f, substrings):
        print(res)
# or, to get the whole list at once and print:
with open(filename, 'r') as f:
    print('\n'.join(query_grep(f, substrings)))
# either way, the output is:
QUERY Query_17 Peptide 93 ANN2
2 Query_17 Specific 197609 50 89 LysM - 389788
QUERY Query_33 Peptide 305 ANN2
2 Query_33 Specific 372835 33 134 GUB_WAK_bind - 45
2 Query_33 Non-specific 373037 222 WAK_assoc N 45
QUERY Query_43 Peptide 435 ANN3
2 Query_43 Specific 237995 47 164 B_lectin - 390234
QUERY Query_45 Peptide 717 ANN34
2 Query_45 Specific 237995 376 449 B_lectin N 390234
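The question keeps its patterns in a separate comma-separated file, so the substrings list does not have to be typed by hand. A minimal sketch of loading it follows; the helper name load_patterns is hypothetical, and the input is assumed to look like the one-line pattern file shown in the question.

```python
def load_patterns(text):
    """Split a comma-separated pattern list, dropping whitespace and empties."""
    pieces = text.replace("\n", ",").split(",")
    return [p.strip() for p in pieces if p.strip()]

# The content of the pattern file from the question.
patterns = load_patterns("LysM, GUB_WAK_bind, WAK_assoc, B_lectin\n")
print(patterns)
# ['LysM', 'GUB_WAK_bind', 'WAK_assoc', 'B_lectin']
```

In practice the text would come from reading the pattern file, and the resulting list would be passed as substrings.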
Original answer
Python's deques come in handy for this kind of thing (and many, many others):
from collections import deque

# assuming you just want to look for substrings. Otherwise, use regex.
substrings = ['LysM', 'GUB_WAK_bind', 'WAK_assoc', 'B_lectin']
d = deque(maxlen=2)  # keeps only the two most recent lines
with open(filename, 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if any(s in line for s in substrings):
            print('\n'.join(d))
            print(line)
        d.append(line)
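The comment in the code mentions regex as the alternative to plain substring tests. One possible sketch, assuming the patterns should match as whole words only (that requirement is an assumption, not something the question states):

```python
import re

substrings = ['LysM', 'GUB_WAK_bind', 'WAK_assoc', 'B_lectin']
# \b anchors each alternative at word boundaries; re.escape guards against
# patterns that happen to contain regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, substrings)) + r')\b')

def matches(line):
    """Return True if the line contains one of the patterns as a whole word."""
    return pattern.search(line) is not None

print(matches('2 Query_17 Specific 197609 50 89 LysM - 389788'))   # True
# 'LysM2' is a made-up token to show that partial words are rejected.
print(matches('2 Query_17 Specific 197609 50 89 LysM2 - 389788'))  # False
```

matches(line) can replace the any(s in line for s in substrings) test in either version of the answer.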

writing .txt to .csv excel columns in Python

I have a rather large text file with multiple columns that I must convert to a 15-column .csv file to be read in Excel. The logic for parsing the fields I need is written out below, but I am having trouble writing it to .csv.
columns = [ 'TRANSACTN_NBR', 'RECORD_NBR',
'SEQUENCE_OR_PIC_NBR', 'CR_DB', 'RT_NBR', 'ACCOUNT_NBR',
'RSN_COD', 'ITEM_AMOUNT', 'ITEM_SERIAL', 'CHN_IND',
'REASON_DESCR', 'SEQ2', 'ARCHIVE_DATE', 'ARCHIVE_TIME', 'ON_US_IND' ]
for line in in_file:
    values = line.split()
    if 'PRINT DATE:' in line:
        dtevalue = line.split(a,1)[-1].split(b)[0]
        lines.append(dtevalue)
    elif 'PRINT TIME:' in line:
        timevalue = line.split(c,1)[-1].split(b)[0]
        lines.append(timevalue)
    elif (len(values) >= 4 and values[3] == 'C'
            and len(values[2]) >= 2 and values[2][:2] == '41'):
        print(values)
    elif (len(values) >= 4 and values[3] == 'D'
            and values[4] in rtnbr):
        on_us = '1'
    else:
        on_us = '0'
print (lines[0])
print (lines[1])
I originally tried the csv module, but the parsed rows are written in 12 columns and I could not find a way to write the date and time (parsed separately) in the columns after each row.
I also looked at the pandas package but have only seen ways to extract patterns, which wouldn't work with the parsing criteria I have established.
Is there a way to write to csv using the above criteria? Or do I have to scrap it and rewrite the code within a specific package?
Any help is appreciated.
EDIT: Text file sample:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
Desired output: looking at only lines containing seq starting with 42, contains C
1293 83834 4100225908 C 538196620 9860890913 10 161.5 0 CREDIT 41 3-Aug-17 11:15:51
1294 83838 4100225911 C 538196620 25715845 10 138 0 CREDIT 41 3-Aug-17 11:15:51
Look at the pandas package, more specifically the DataFrame class. With a little cleverness you ought to be able to read your table using pandas.read_table(), which returns a DataFrame that you can write to csv with to_csv(): effectively a two-line solution. You'll need to look at the docs to find the parameters needed to read your table format properly, but it should be a little easier than doing it manually.
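If you would rather stay with the csv module, the date and time that are parsed separately can be appended to every row at write time. The sketch below is only an illustration: write_rows, the shortened header, and the example row are assumptions, and the date and time values are placeholders taken from the sample and the desired output.

```python
import csv
import io

def write_rows(out, header, rows, date, time):
    """Write the header, then every parsed row with date and time appended."""
    writer = csv.writer(out)
    writer.writerow(header)
    for row in rows:
        writer.writerow(list(row) + [date, time])

# Hypothetical, shortened header and one whitespace-split row from the sample.
header = ['TRANSACTN_NBR', 'RECORD_NBR', 'SEQUENCE_OR_PIC_NBR', 'CR_DB',
          'ARCHIVE_DATE', 'ARCHIVE_TIME']
rows = [['34,615', '207', '4100223726', 'C']]

buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
write_rows(buffer, header, rows, '08/03/2017', '11:15:51')
print(buffer.getvalue())
```

Collecting the matching rows in a list during the parsing loop and writing them afterwards means the date and time are already known by the time the rows are written, wherever they appear in the report.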

Reading values from a text file with different row and column size in python

I have read other similar posts, but their solutions don't seem to work in my case, so I'm posting a new question here.
I have a text file which has varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g. in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101 to 104' because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that the code I have come up with (shown below) starts from '101' but then also reads the values from the other lines, up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")
for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break
for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)
input_file.close()
output_file.close()
*EDIT: The piece of code used to find the parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break
input_file.close()
>>> with open('filename') as fh: # Open the file
...     for line in fh:  # For each line in the file
...         values = line.split()  # Split the values into a list
...         if len(values) > 1 and values[1] == '1':  # Compare the second value (skip short lines)
...             print(values[-2], values[-1])  # Print the 2nd from last and last
...
1 101
101 2
2 102
102 3
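A slightly more defensive variant of the same idea only looks at lines between $Elements and $EndElements, so a node line that happens to have '1' as its second field cannot be picked up by mistake. This is a sketch under that assumption; the function name two_node_pairs is made up for the example.

```python
def two_node_pairs(lines):
    """Yield the last two fields of $Elements rows whose second field is '1'."""
    in_elements = False
    for line in lines:
        values = line.split()
        if not values:
            continue  # ignore blank lines
        if values[0] == '$Elements':
            in_elements = True
        elif values[0] == '$EndElements':
            in_elements = False
        elif in_elements and len(values) >= 3 and values[1] == '1':
            yield values[-2], values[-1]

# A shortened excerpt of the file from the question.
sample = """$Elements
630
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
301 2 2 0 303 178 78 250
$EndElements""".splitlines()
print(list(two_node_pairs(sample)))
# [('1', '101'), ('101', '2')]
```

Because the function takes any iterable of lines, an open file object can be passed in directly, and no line-number bookkeeping is needed.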

python using data from a list, where by you call the data from its index number

I must open a file, compute the averages of each row and column, and then the max of the data sheet. The data is imported from a text file. When I am done with the program, it should look like an Excel sheet, only printed on my terminal.
Data file must be seven across by six down.
88 90 94 98 100 110 120
75 77 80 86 94 103 113
80 83 85 94 111 111 121
68 71 76 85 96 122 125
77 84 91 102 105 112 119
81 85 90 96 102 109 134
Later, I will have to print the above data. The math is easy; my problem is selecting a number from the indexed list. Ex:
selecting index 0, 8, 16, 24, 32, 40, which should be the numbers 88, 75, 80, 68, 77, 81.
What I get when I input the index number is 0 = 8, 1 = 8, 2 = " ", etc.
What have I done wrong here? I have another program where I typed the list into the program itself, and it works the way I want this one to work. That program used the index numbers to select a month: 0 = a blank index, 1 = January, 2 = February, etc.
I hope this example makes clear what I intend to do but cannot seem to do. Again, the only difference between my months program and this program is that in the code below I open a file to fill the list. Have I loaded the data poorly? Split and stripped the list poorly? Help is more useful than answers, as I can learn rather than be given the answer.
def main():
    print("Program to output a report of noise for certain model cars.")
    print("Written by censored.")
    print()
    fileName = input("Enter the name of the data file: ")
    infile = open(fileName, "r")
    infileData = infile.read()
    line = infileData
    #for line in infile:
    list = line.split(',')
    list = line.strip("\n")
    print(list)
    n = eval(input("Enter a index number: ", ))
    print("The index is", line[n] + ".")
    print("{0:>38}".format(str("Speed (MPH)")))
    print("{0:>6}".format(str("Car")), ("{0:>3}".format(str(":"))),
          ("{0:>6}".format(str("30"))), ("{0:>4}".format(str("40"))),
          ("{0:>4}".format(str("50"))))
main()
Thank you for your time.
You keep overwriting your variables, and I wouldn't recommend masking a built-in (list).
infileData = infile.read()
line = infileData
#for line in infile:
list = line.split(',')
list = line.strip("\n")
should be:
infileData = list(map(int, infile.read().strip().split()))
This reads the file contents into a string, strips off the leading and trailing whitespace, splits it up into a list separated by whitespace, maps each element as an int, and creates a list out of that.
Or:
stringData = infile.read()
stringData = stringData.strip()
stringData = stringData.split()
infileData = []
for item in stringData:
    infileData.append(int(item))
Storing each element as an integer lets you easily do calculations on it, such as if item > 65 or exceed = item - 65. When you want to treat them as strings, such as for printing, cast them as such with str():
print("The index is", str(infileData[n]) + ".")
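Once the values are integers in one flat list, rows and columns can be picked out with slices. The sketch below assumes seven values per row, as in the data file; note that the first column then sits at indices 0, 7, 14, ... rather than 0, 8, 16, .... The variable names are only illustrative, and the data is inlined in place of the file read.

```python
COLS = 7  # values per row in the data file

# Stands in for infile.read().
text = """88 90 94 98 100 110 120
75 77 80 86 94 103 113
80 83 85 94 111 111 121
68 71 76 85 96 122 125
77 84 91 102 105 112 119
81 85 90 96 102 109 134"""

values = [int(tok) for tok in text.split()]

# Consecutive chunks of COLS values are the rows.
rows = [values[i:i + COLS] for i in range(0, len(values), COLS)]
# Every COLS-th value starting at c is column c.
first_column = values[0::COLS]

row_averages = [sum(r) / len(r) for r in rows]
column_averages = [sum(values[c::COLS]) / len(rows) for c in range(COLS)]

print(first_column)     # [88, 75, 80, 68, 77, 81]
print(row_averages[0])  # 100.0
print(max(values))      # 134
```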
Just to be clear, it looks like your data is space-separated, not comma-separated. So when you call
list = line.split(',')
the list looks like this:
['88 90 94 98 100 110 120 75 77 80 86 94 103 113 80 83 85 94 111 111 121 68 71 76 85 96 122 125 77 84 91 102 105 112 119 81 85 90 96 102 109 134']
So when you access list[0] you get '8', not '88', and when you access list[2] you get ' ', not '94'.
list = line.split() # this is what you should call (space-separated)
Again this answer is based on how your data is presented.
