regex on reading a txt file in Python

I have a txt file that contains data for classification purposes. The first column is the class, which is 0 or 1, and the other four columns contain the features. However, the features have numbers before them, that is, 1: for feature 1, 2: for feature 2, etc. I tried to use a regex with numpy's split but I failed. How can I take only the columns I need? Below is the txt file with the data.
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
1 1:1.451200e+02 2:2.088600e+02 3:-1.760859e-01 4:1.542257e+02
1 1:3.849699e+01 2:4.146600e+01 3:-1.886419e-01 4:1.239661e+02
1 1:2.927699e+01 2:1.072510e+02 3:1.149632e-01 4:1.077885e+02
1 1:2.886700e+01 2:1.090240e+02 3:-1.239433e-01 4:9.799130e+01
1 1:2.401300e+01 2:7.602000e+01 3:2.850990e-01 4:9.891692e+01
1 1:2.837900e+01 2:1.452160e+02 3:3.870011e-01 4:1.549975e+02
1 1:2.238140e+01 2:8.242810e+01 3:-2.814865e-01 4:8.998764e+01
1 1:1.232100e+02 2:4.561600e+02 3:-1.518468e-01 4:1.432996e+02
1 1:2.008405e+01 2:1.774510e+02 3:2.578101e-01 4:9.253101e+01
1 1:3.285699e+01 2:1.826750e+02 3:2.204406e-01 4:9.457175e+01
1 1:0.000000e+00 2:1.154780e+02 3:1.504970e-01 4:1.096315e+02
1 1:3.954504e+01 2:2.374420e+02 3:1.089429e-01 4:1.376333e+02
1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02
1 1:3.408200e+01 2:1.198280e+02 3:2.200156e-01 4:1.383639e+02
1 1:0.000000e+00 2:8.671080e+01 3:4.201880e-01 4:1.298851e+02
1 1:4.865997e+01 2:3.071500e+02 3:1.756066e-01 4:1.640174e+02
1 1:2.341090e+01 2:8.347140e+01 3:1.766868e-01 4:9.803250e+01
1 1:1.222390e+02 2:4.357930e+02 3:-1.812907e-01 4:1.687663e+02
1 1:1.624560e+01 2:4.830620e+01 3:5.508614e-01 4:2.632639e+01
1 1:4.389899e+01 2:2.421300e+02 3:2.006008e-01 4:1.331948e+02
1 1:6.143698e+01 2:2.338500e+02 3:2.758731e-01 4:1.612433e+02
1 1:5.952499e+01 2:2.176700e+02 3:-8.601014e-02 4:1.170831e+02
1 1:2.915850e+01 2:1.259875e+02 3:1.910455e-01 4:1.279927e+02
1 1:5.059702e+01 2:2.430620e+02 3:1.863443e-01 4:1.352273e+02
1 1:6.024097e+01 2:1.977340e+02 3:-1.319924e-01 4:1.320220e+02
1 1:2.620490e+01 2:6.270790e+01 3:-1.402450e-01 4:1.135866e+02
1 1:2.847198e+01 2:1.483760e+02 3:-1.868249e-01 4:1.672337e+02
1 1:2.707990e+01 2:7.770390e+01 3:-2.509235e-01 4:9.798032e+01
1 1:2.068600e+01 2:8.446800e+01 3:1.761782e-01 4:1.199423e+02
1 1:1.962450e+01 2:4.923090e+01 3:4.302725e-01 4:9.361318e+01
1 1:4.961401e+01 2:3.234850e+02 3:-1.963741e-01 4:1.622486e+02
1 1:7.982401e+01 2:2.017540e+02 3:-1.412161e-01 4:1.310716e+02
1 1:6.696402e+01 2:2.214030e+02 3:-1.187778e-01 4:1.416626e+02
1 1:5.842999e+01 2:1.348610e+02 3:2.876077e-01 4:1.286684e+02
1 1:6.982007e+01 2:3.693401e+02 3:-1.539849e-01 4:1.511659e+02
1 1:1.902200e+01 2:2.210120e+02 3:1.689450e-01 4:1.368066e+02
1 1:4.582898e+01 2:2.215950e+02 3:2.419124e-01 4:1.627100e+02

I do hate pandas, but try these few lines:
import pandas as pd
# Use pandas read_csv; a multi-character sep is interpreted as a regex,
# which forces the python parsing engine. header=None keeps the first
# data row from being consumed as column names.
x = pd.read_csv('file.txt', sep='[: ]', engine='python', header=None).to_numpy()
# Now select the required columns (the feature values)
output = x[:, (2, 4, 6, 8)]
print(output)
"""
array([[ 5.707397e+01, 2.214040e+02, 8.607959e-02, 1.229114e+02],
[ 1.725900e+01, 1.734360e+02, -1.298053e-01, 1.250318e+02],
[ 2.177940e+01, 1.249531e+02, 1.538853e-01, 1.527150e+02],
[ 9.133997e+01, 2.935699e+02, 1.423918e-01, 1.605402e+02],
[ 5.537500e+01, 1.792220e+02, 1.654953e-01, 1.112273e+02],
[ 2.956200e+01, 1.913570e+02, 9.901439e-02, 1.034076e+02]])
"""
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can also check out How to use regex as delimiter function while reading a file into numpy array or similar questions.
I rediscovered this solution independently, but the following answer uses the same strategy via sep:
https://stackoverflow.com/a/51195469/1021819
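If you would rather stay with plain numpy and the re module (as the linked question does), here is a minimal sketch; it assumes the data is stored in a file named file.txt, as above:
import re
import numpy as np

# Split each line on spaces and colons; after the class label, every
# second token is a feature value.
rows = []
with open('file.txt') as fh:
    for line in fh:
        parts = re.split(r'[: ]', line.strip())
        rows.append([float(p) for p in parts[2::2]])

features = np.array(rows)
print(features[:3])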

You wish to parse a line like this:
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
Your parser reads the input with the usual idiom of:
for line in input_file:
Start by distinguishing the class label from the features.
label, *raw_features = line.split()
label = int(label)
Now it just remains to strip the unhelpful N: prefix from each feature.
features = [float(feat.split(':')[1]) for feat in raw_features]
Sure, you could solve this with a regex.
But that doesn't sound like the simplest solution, in this case.
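Putting those pieces together, a complete sketch (assuming the data sits in a file named file.txt) would be:
labels = []
rows = []
with open('file.txt') as input_file:
    for line in input_file:
        label, *raw_features = line.split()
        labels.append(int(label))
        rows.append([float(feat.split(':')[1]) for feat in raw_features])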

I was bored :-), so I thought of writing a snippet for you. See below; it is a somewhat dirty piece of text processing that loads the result into a dataframe.
lines = open("predictions.txt", "r").readlines()
column_lines = [
    # keep the class label, then the feature values sorted by their "N:" prefix
    [fline[0]] + [feat[1] for feat in sorted([tuple(feature.split(":")) for feature in fline[1:]], key=lambda f: f[0])]
    # split() without an argument also drops the trailing newline
    for fline in [line.split() for line in lines]
]
import pandas as pd
table = pd.DataFrame(column_lines, columns=["Class", "Feature1", "Feature2", "Feature3", "Feature4"])
Instead of this, you can also think of transforming the file to a CSV with similar text processing and then using that directly to create a dataframe, so you don't need to run this code every time.
I hope this is helpful.
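Note that the values in this dataframe are still strings. As a possible follow-up (my addition, not part of the snippet above), you could convert them to numeric types:
# Assumes 'table' was built as in the snippet above; cast the string
# columns to numeric dtypes.
table = table.astype({"Class": int, "Feature1": float, "Feature2": float,
                      "Feature3": float, "Feature4": float})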

If you want to use a regex to extract only your columns, you can apply this regular expression to each line:
import re
line = '1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02'
reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')
# Your columns:
reg.findall(line)
>>> ['1.067980e+02', '3.237560e+02', '-1.509505e-01', '1.754021e+02']
# Assuming you also want those values as numbers:
list(map(float, reg.findall(line)))
>>> [106.798, 323.756, -0.1509505, 175.4021]
What it does: in (-?\d+\.\d+e[+-]\d+), the parentheses create a capturing group. Inside the group, -? is an optional minus sign. Then there is at least one digit (\d+), followed by a decimal point and its decimals (\.\d+). Finally there is an exponent sign, either + or - (e[+-]), followed by more digits (\d+).
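Applied to the whole file, a short sketch (assuming the data is stored in file.txt) could look like this:
import re
import numpy as np

reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')

labels = []
features = []
with open('file.txt') as fh:
    for line in fh:
        labels.append(int(line.split()[0]))               # first column is the class
        features.append([float(v) for v in reg.findall(line)])

features = np.array(features)                             # shape (n_samples, 4)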

Related

Regex in Python returns nothing (search parameters keywords for search for when using regex)

I'm not so sure how regex works, but I'm trying to make a project to analyze a mark scheme PDF and then do something useful with the extracted data (I haven't set it all up yet; I'm working on the PDF indexing side of the code first, with a test PDF).
The issue is that when I enter the search parameters in regex, it returns nothing from the PDF. I'm trying to iterate over each row that begins with 1-2 digits (Question column), followed by A-D (Answer column), using re.compile(r'\d{1} [A-D]') in the following code:
import re
import requests
import pdfplumber
import pandas as pd
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename

ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)

with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()
    #print(text)

new_vend_re = re.compile(r'\d{1} [A-D]')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)
When I run the code, I do not get anything in return. Printing the text though will print the whole page.
Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf
You're matching only a single space between the question numbers and the answer letters, but if you look at the output of text, there is more than one space between them.
'9700/12 Cambridge International AS/A Level – Mark Scheme March 2019\nPUBLISHED \n \nQuestion Answer Marks \n1 A 1\n2 C 1\n3 C 1\n4 A 1\n5 A 1\n6 C 1\n7 A 1\n8 D 1\n9 A 1\n10 C 1\n11 B 1\n12 D 1\n13 B 1\n...
Change your regex to the following to match not just one space, but one or more whitespace characters:
new_vend_re = re.compile(r'\d{1}\s+[A-D]')
See the answer by alexpdev to learn the difference between new_vend_re.match() and new_vend_re.search(). If you run this within your code, you will get the following output:
1 A 1
2 C 1
3 C 1
4 A 1
5 A 1
6 C 1
7 A 1
8 D 1
9 A 1
(You can also see here that there are always two spaces instead of one.)
//Edit: Fixed typo in regex
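As a possible extension (my addition, not part of the answer above), you can capture the question number and the answer letter explicitly, which also allows two-digit question numbers:
row_re = re.compile(r'^(\d{1,2})\s+([A-D])\s')
rows = []
for line in text.split('\n'):
    m = row_re.match(line)
    if m:
        rows.append((int(m.group(1)), m.group(2)))
# rows -> [(1, 'A'), (2, 'C'), (3, 'C'), ...]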

How to count word frequencies from an input file? [duplicate]

This question already has answers here:
How do I count the occurrences of a list item?
(29 answers)
Closed 2 years ago.
I'm trying to have my program read a single line formed by words separated by commas. For example if we have:
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
in the input file, the program would need to put each word on its own line and ditch the commas. After that, the program would count the frequency of each word in the input file.
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
for word in seperated:
    freq = seperated.count(word)
    print(word, freq)
The problem with this code is that it prints a count for every occurrence of a word, so words that appear more than once are printed more than once. The output of this program is:
hello 1
cat 2
man 2
hey 2
dog 2
boy 1
Hello 1
man 2
cat 2
woman 1
dog 2
Cat 1
hey 2
boy
1
The correct output would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 2
Hello 1
woman 1
Cat 1
The question is: how do I make my output look more polished by showing the final count for each word instead of the initial one?
This is a common pattern and core programming skill. You should try collecting and counting words each time you encounter them, in a dictionary. I'll give you the idea, but it's best you practise the exact implementation yourself. Happy hacking!
(I'd also recommend the pprint ("pretty print") module from the Python standard library.)
import pprint

word_dict = {}
for word in seperated:
    word_dict[word] = word_dict.get(word, 0) + 1

pprint.pprint(word_dict)
A couple of extra tips: you may want to f.close() your file when you're finished, (E: I misread so disregard the rest...) and it looks like you want to look at converting your words to lower case so that different capitalisations aren't counted separately. There are Python built-in methods to do this that you can find by searching.
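For completeness, a compact alternative (my addition, using only the standard library, in the spirit of the linked duplicate) is collections.Counter:
from collections import Counter

with open('input1.csv') as f:
    words = f.read().strip().split(',')

counts = Counter(words)
for word, freq in counts.items():
    print(word, freq)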
try using a dictionary:
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
wordsDict = {}
for word in seperated:
    if word not in wordsDict:
        wordsDict[word] = 1
    else:
        wordsDict[word] = wordsDict[word] + 1
for i in wordsDict:
    print(i, wordsDict[i])
Create a new dictionary. Add each word as a key and its count as the value:
count_dict = {}
for w in seperated:
    count_dict[w] = seperated.count(w)
for key, value in count_dict.items():
    print(key, value)

How do I remove blank lines from a string in Python?

Let's say I have a variable whose data contains blank lines. How would I remove them without turning everything into one long line?
How would I turn this:
1

2

3
Into this:
1
2
3
Without turning it into this:
123
import os
text = os.linesep.join([s for s in text.splitlines() if s])
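For example, using the sample data from the question (and assuming text holds it as a string):
import os

text = '1\n\n2\n\n3\n'
text = os.linesep.join([s for s in text.splitlines() if s])
print(text)
# 1
# 2
# 3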
You can simply do this by using replace(), like data.replace('\n\n', '\n').
Refer to this example for a better understanding:
data = '1\n\n2\n\n3\n\n'
print(data)
data = data.replace('\n\n', '\n')
print(data)
Output
1

2

3

1
2
3
# str.replace() does not understand regex patterns, so use re.sub()
import re
text = re.sub(r"\n{2,}", "\n", text)

Interpreting a string received from a socket

I am trying to interpret a string that I have received from a socket. The first set of data is seen below:
2 -> 1
1 -> 2
2 -> 0
0 -> 2
0 -> 2
1 -> 2
2 -> 0
I am using the following code to get the numerical values:
for i in range(0, len(data) - 1):
    if data[i] == "-":
        n1 = data[i-2]
        n2 = data[i+3]
        moves.append([int(n1), int(n2)])
But when a number greater than 9 appears in the data, the program only takes the second digit of that number (e.g. with 10 the program would get 0). How would I get both of the digits while keeping the ability to read single-digit numbers?
Well, you just grab one character on each side. Use a slice instead:
for the second value you can make it like this: data[i+3:len(data)-1]
for the first one: data[0:i-2]
Use the split() function
numlist = data[i].split('->')
moves.append([int(numlist[0]),int(numlist[1])])
I assume each line is available as a (byte) string in a variable named line. If it's a whole bunch of lines then you can split it into individual lines with
lines = data.splitlines()
and work on each line inside a for statement:
for line in lines:
    # do something with the line
If you are confident the lines will always be correctly formatted the easiest way to get the values you want uses the string split method. A full code starting from the data would then read like this.
lines = data.splitlines()
moves = []
for line in lines:
    first, _, second = line.split()
    moves.append([int(first), int(second)])
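For the sample data shown in the question, this gives (assuming data holds that text as a string):
data = "2 -> 1\n1 -> 2\n2 -> 0\n0 -> 2\n0 -> 2\n1 -> 2\n2 -> 0"
moves = []
for line in data.splitlines():
    first, _, second = line.split()
    moves.append([int(first), int(second)])
print(moves)  # [[2, 1], [1, 2], [2, 0], [0, 2], [0, 2], [1, 2], [2, 0]]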

Printing on the same line as a header in python

I am writing a function to calculate a score for a matrix and output the score along with some other variables as a header. My code for the output is as follows:
header=">"+motif+" "+gene+" "+str(score)
append_copy = open(newpwmfile, "r")
original_text = append_copy.read()
append_copy.close()
append_copy = open(newpwmfile, "w")
append_copy.write(header)
append_copy.write(original_text)
append_copy.close()
However the header is printing the score on the next line instead of the same line, as follows:
>ATGC ABC/CDF
5.8
0.23076923076923 0 0.69230769230769 0.076923076923077
0.46153846153846 0.23076923076923 0.23076923076923 0.076923076923077
0 0 1 0
0 1 0 0
1 0 0 0
What could be the reason? I also tried interchanging the variables, and then the header is printed on the same line. However, the order of the fields is relevant in this case.
When reading fields from a file, it is good practice to remove possible extra whitespace (including trailing newlines) with the strip() method.
As an example, this is a typical workflow to manually get the fields from a csv file:
for line in open(fname).readlines():
    linefields = [field.strip() for field in line.strip().split(',')]
This removes both the whitespace at the ends of the lines and the whitespace around the individual fields.
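Applied to the code in the question, a likely fix (assuming motif or gene was read from a file and still carries a trailing newline, which is what pushes the score onto the next line) would be:
# Strip the values read from the file before building the header
header = ">" + motif.strip() + " " + gene.strip() + " " + str(score)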
