Python Regular Expression loop - python

I have this code which looks for certain things in a file. The file looks like this:
name;lastname;job;5465465
name2;lastname2;job2;5465465
name3;lastname3;job3;5465465
This is the python code:
import re
import sys
filehandle = open('somefile.csv', 'r')
text = filehandle.read()
b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
print (b.group(2),b.group(1),b.group(3),b.group(4))
Now it only prints:
lastname;name;job;5465465
It is supposed to print the lastname first, so I did that with groups. Now I need a loop to print all lines like this:
lastname;name;job;5465465
lastname2;name2;job2;5465465
lastname3;name3;job3;5465465
I tried all kinds of loops, but none of them goes through the whole file. How do I need to do this?
It must be done with the re module. I know it's easy with the csv module ;)

You need to process the file line by line.
import re
import sys
with open('somefile.csv', 'r') as filehandle:
    for text in filehandle:
        b = re.search(r"([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?", text)
        if b:  # re.search returns None for lines the pattern does not match
            print(b.group(2), b.group(1), b.group(3), b.group(4))
Your file has nicely semi-colon separated values, so it would be easier to just use split or the csv library as has been suggested.
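As a minimal sketch of the split approach mentioned above: the rows below are the sample data from the question, held in an in-memory buffer rather than read from somefile.csv, and the variable names are only illustrative.

```python
import io

# In-memory stand-in for somefile.csv (sample rows from the question).
sample = io.StringIO(
    "name;lastname;job;5465465\n"
    "name2;lastname2;job2;5465465\n"
    "name3;lastname3;job3;5465465\n"
)

reordered = []
for line in sample:
    fields = line.rstrip("\n").split(";")
    if len(fields) != 4:
        continue  # skip blank or malformed lines
    name, lastname, job, number = fields
    reordered.append(";".join([lastname, name, job, number]))

print("\n".join(reordered))
```

With a real file, replacing the StringIO object with open('somefile.csv') should behave the same way.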

No need for re, but a good job for csv:
import csv
with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        print (rec[1], rec[0], rec[2], rec[3])
You can use re if you want to check the validity of individual elements (valid phone number, no numbers in name, capitalized names, etc.).
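A rough sketch of that kind of per-field validation follows. The rules themselves (letters and spaces only in names, digits and dashes only in the number) and the helper name is_valid are assumptions made up for illustration, not requirements from the question.

```python
import re

# Hypothetical validation rules, chosen only for illustration.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z ]*")  # letters and spaces, no digits
NUMBER_RE = re.compile(r"[0-9-]+")           # digits and dashes only


def is_valid(rec):
    """Return True if a 4-field record passes the illustrative checks."""
    if len(rec) != 4:
        return False
    name, lastname, _job, number = rec
    return bool(
        NAME_RE.fullmatch(name)
        and NAME_RE.fullmatch(lastname)
        and NUMBER_RE.fullmatch(number)
    )


print(is_valid(["name", "lastname", "job", "5465465"]))
print(is_valid(["name2", "lastname2", "job2", "5465465"]))
```

re.fullmatch is used so that the whole field has to fit the pattern, not just a prefix of it.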

The fault is not with the loops, but with the character classes in your capture groups. The class [a-zA-Z]+ excludes digits, so it will not match "lastname3" or "lastname2". This sample works:
import re
import sys
for line in open('somefile.csv', 'r'):
    b = re.search(r"(\w+);(\w+);(\w*);([0-9^-]+)\n?", line)
    if b:
        print("%s;%s;%s;%s" % (b.group(2), b.group(1), b.group(3), b.group(4)))
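If you would rather keep reading the whole file into one string, as the question does, a sketch along these lines may work: re.finditer walks over every match, and re.MULTILINE lets ^ and $ anchor at each line. The pattern is an adaptation (it allows digits and spaces in the lastname), and the sample text stands in for filehandle.read().

```python
import re

# Stand-in for text = filehandle.read() from the question.
text = (
    "name;lastname;job;5465465\n"
    "name2;lastname2;job2;5465465\n"
    "name3;lastname3;job3;5465465\n"
)

# One row per line: name;lastname;job;number
ROW_RE = re.compile(r"^(\w+);([\w ]+);(\w*);([0-9-]+)$", re.MULTILINE)

# group(2, 1, 3, 4) returns the groups as a tuple in the requested order.
rows = [";".join(m.group(2, 1, 3, 4)) for m in ROW_RE.finditer(text)]
print("\n".join(rows))
```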

Seems as if you just want to reorder what you have, in which case I don't know whether regex are needed. I believe the following might be of use:
reorder = operator.itemgetter(1, 0, 2, 3)
http://docs.python.org/library/operator.html
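For illustration, here is how the itemgetter above could be applied to one split line. The sample line comes from the question; the variable names are assumptions.

```python
import operator

# Pick fields in the order lastname, name, job, number.
reorder = operator.itemgetter(1, 0, 2, 3)

line = "name;lastname;job;5465465"
result = ";".join(reorder(line.split(";")))
print(result)
```

Called with several indexes, itemgetter returns a tuple of the selected items, which is why the result can be passed straight to join.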

Related

python open csv search for pattern and strip everything else

I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the 2 digits in the last directory, so that the result looks like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2}).
For an explanation, copy the regex and paste it into https://regexr.com, which shows how the regex works so that you can make any required changes.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])  # each match is a tuple such as ('PL5', '00')
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')  # writes PL5,00
In the above code, I create an empty list to hold the tuples, then write them to 'outfile.txt'. Joining each tuple with ',' produces PL5,00; slicing with str(row)[1:-1] would only strip the parentheses and leave the quotes in place ('PL5', '00').
You can use a regex for this (in general, a regex is a good option when you need to extract a pattern):
import re
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (the same test as __contains__), then extracts the last pair of adjacent digits according to the pattern.
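Because a greedy .* picks up the last pair of adjacent digits anywhere in the line, a host name that ends in two digits could throw it off. A slightly stricter sketch is shown below. It assumes the last path component always has the layout <system>_<letters><two digits>_<host>, which is only inferred from the two sample lines; PROFILE_RE and sid_and_instance are hypothetical names.

```python
import re

# Assumed layout of the last path component: <system>_<letters><2 digits>_<host>
PROFILE_RE = re.compile(r"/profile/([^/_]+)_[A-Za-z]+(\d{2})_")


def sid_and_instance(line):
    """Return 'SYSTEM,NN' for a profile line, or None if it does not fit."""
    m = PROFILE_RE.search(line)
    return f"{m.group(1)},{m.group(2)}" if m else None


lines = [
    "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1",
    "pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs",
]
cleaned = [r for r in map(sid_and_instance, lines) if r is not None]
print("\n".join(cleaned))
```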
I made the assumption that the number always sits between the two underscores. You could run something similar to this inside your for loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # split the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # strip leading and trailing letters
try:
    int(output)  # raises ValueError if anything other than digits is left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')

Loop through fixed number of chars in text file

Let's say that I have a file.txt with consecutive characters (no spaces nor newlines), like this:
ABCDHELOABCDFOOOABCD
And I want to loop through the file, iterating through fixed amounts of 4 characters, like this:
[ABCD, HELO, ABCD, FOOO, ABCD]
A regular loop won't do: how can I achieve this?
You can read four characters from the file at a time, by using TextIOWrapper.read's optional size parameter. Here I'm using Python 3.8's "walrus" operator, but it's not strictly required:
with open("file.txt", "r") as file:
    while chunk := file.read(4):
        print(chunk)
A simple loop like this would work. It is not very Pythonic, but it gets the job done:
s = 'ABCDHELOABCDFOOOABCD'
for i in range(0, len(s), 4):
    print(s[i:i+4])
The built-in textwrap module has a wrap function, so you can do this without an explicit loop:
import textwrap
with open('file.txt', 'r') as f:
    chunked = textwrap.wrap(f.read(), 4)
# chunked -> ['ABCD', 'HELO', 'ABCD', 'FOOO', 'ABCD']
Assuming that you have read the file into a single string called data, you can slice it into 4-character pieces like so:
individual_strings = [data[i:i+4] for i in range(0, len(data), 4)]
This gives you a list of strings as required, which you can then loop over. (A plain data[::4] would instead return every fourth character.)
Try this:
with open('file.txt', 'r') as f:
    content = f.read()
splited_by_four_letters = [content[i:i+4] for i in range(0, len(content), 4)]
# do whatever you want with your data here
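The slicing idea used in several of the answers above can be wrapped in a small generator, sketched here. The name chunks and its parameters are illustrative; the sample string is the one from the question.

```python
def chunks(text, size):
    """Yield consecutive pieces of text, each size characters long.

    The last piece is shorter when len(text) is not a multiple of size.
    """
    for start in range(0, len(text), size):
        yield text[start:start + size]


pieces = list(chunks("ABCDHELOABCDFOOOABCD", 4))
print(pieces)
```

A generator avoids building the full list when you only need to loop over the pieces once.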

Extracting substrings from CSV table

I'm trying to clean up the data from a csv table that looks like this:
KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000
I want to extract just the "#" handles, such as:
#katyperry
#justinbieber
#BarackObama
This is the code I've put together, but all it does is repeat the second line of the table over and over:
import csv
import re
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            if i.isalpha():
                stringafterword = re.split('\\#\\',row)[-1]
                print(stringafterword)
If you are willing to use re, you can get a list of strings in one line:
import re
#content string added to make it a working example
content = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""
#solution using 're':
m = re.findall('#.*', content)
print(m)
#option without 're' but using str.find() based on your loop:
for row in content.split():
    pos_of_at = row.find('#')
    if pos_of_at > -1: #-1 indicates "substring not found"
        print(row[pos_of_at:])
You should of course replace the content string with the file content.
First, i.isalpha() returns False for a field such as "KATY PERRY#katyperry", because "#" (and the space) are not alphabetic characters. Your re.split() is therefore never reached for the fields that contain a handle.
Try this:
import csv
import re
with open('C:\\Users\\input.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            stringafterword = re.findall('#.*',i)
            print(stringafterword)
Here I have removed the if condition and replaced re.split() with re.findall('#.*', i), which returns the handle (an empty list is printed for fields without one).
Hope it works.
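The pattern '#.*' keeps everything up to the end of the line, which is fine for the sample but would also pick up any text that follows a handle. A narrower sketch is shown below; it assumes that handles consist only of letters, digits and underscores, which the question does not state.

```python
import re

# Sample data from the question.
content = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""

# '#' followed by word characters only (assumed handle alphabet).
handles = re.findall(r"#\w+", content)
print(handles)
```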

Extract numeric data from matched regular expression

I have some temperature data in a csv file and I want to extract only the temperatures for, say, the first month of the year, so after processing I want a list of [1.4, -5.8] in the example below.
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
I thought of doing this with python module re, but I always have issues getting to grips with regular expressions! For instance my quick test below returns all lines when I only expect it to return the entries from the first month of the year...
import numpy as np
import re
regex = '\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01;[-+]?\d*\.\d+|\d+;G'
with open('test.csv', 'rb') as fid:
    for line in fid:
        match = re.findall(regex,line)
        if match:
            print line
            print match
Use the csv module, specifying ; as the delimiter. The third column in the data is YYYY-MM, so check whether it's the first month and print the temperature if it is:
import csv
with open('data') as f:
    for row in csv.reader(f, delimiter=';'):
        year, month = row[2].split('-')
        if int(month) == 1:
            print(row[3])
Output
1.4
-5.8
For comparison, here is the simplest regex that I could come up with to extract the required value:
import re
with open('data') as f:
    temperature = re.findall(r'\d{4}-01;(.+?);', f.read())
    print('\n'.join(temperature))
You can see how it takes more effort to read & understand the regex than it does the Python code.
There is an even easier way that relies on your data consisting of fixed width fields:
with open('data') as f:
    for line in f:
        if line[45:47] == '01':
            print(line[48:-3])
I suggest the following regex:
^(?:\d{4}-01-.*?)(-?\d+\.\d+)
Demo and explanation of behavior: regex101
The number is in the first capturing group.
Alternatively, with a positive lookahead:
^(?=\d{4}-01).*?(-?\d+\.\d+)
Demo and explanation of behavior: regex101
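As a quick check, both patterns can be run from Python over the sample rows in the question. Passing re.MULTILINE is an assumption on my part: it is needed so that ^ anchors at the start of every line when the whole file is searched as one string. The variable names are illustrative.

```python
import re

# Sample rows from the question, held in memory instead of read from a file.
data = (
    "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G\n"
    "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G\n"
    "1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G\n"
)

non_capturing = re.compile(r"^(?:\d{4}-01-.*?)(-?\d+\.\d+)", re.MULTILINE)
lookahead = re.compile(r"^(?=\d{4}-01).*?(-?\d+\.\d+)", re.MULTILINE)

# With a single capturing group, findall returns just the captured strings.
temps_a = [float(t) for t in non_capturing.findall(data)]
temps_b = [float(t) for t in lookahead.findall(data)]
print(temps_a, temps_b)
```

Note that both patterns test the month in the first column (the start timestamp), not the YYYY-MM column.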
You have to put parentheses around what you want to extract, which also confines the | alternation to the number. So you should change the last part to ;([-+]?\d*\.\d+|\d+);G.
Try this code and tell me if it works:
import re
regex1 = re.compile(r'\d{4}-01-\d{2}')
regex2 = re.compile(r'([-+]?\d*\.\d+|\d+);G')
with open('test.csv', 'r') as fid:
    for line in fid:
        # match() anchors at the start of the line, so only the first
        # timestamp is tested for month 01
        match1 = regex1.match(line)
        if match1:
            match2 = regex2.findall(line)
            print(line)
            print(match2)
Hope this helps.
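To see why the original pattern returned every line, it helps to compare a shortened version of it with and without the parentheses. In a regex, | has the lowest precedence, so without grouping the pattern means "everything on the left, or \d+;G on its own". The two patterns below are trimmed-down illustrations, not the full expression from the question.

```python
import re

february = "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G"
january = "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G"

# Ungrouped: the alternation splits the WHOLE pattern in two.
ungrouped = re.compile(r"\d{4}-01;[-+]?\d*\.\d+|\d+;G")
# Grouped: the alternation only applies to the number itself.
grouped = re.compile(r"\d{4}-01;([-+]?\d*\.\d+|\d+);G")

# The February row still matches the ungrouped pattern via "\d+;G".
accidental = ungrouped.search(february)
print(accidental.group())

print(grouped.search(february))
print(grouped.search(january).group(1))
```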

Problem with replacing a word in a file, using Python

I have a .txt file containing data like this:
1,Rent1,Expense,16/02/2010,1,4000,4000
1,Car Loan1,Expense,16/02/2010,2,4500,9000
1,Flat Loan1,Expense,16/02/2010,2,4000,8000
0,Rent2,Expense,16/02/2010,1,4000,4000
0,Car Loan2,Expense,16/02/2010,2,4500,9000
0,Flat Loan2,Expense,16/02/2010,2,4000,8000
I want to replace the first item: if it is 1 it should remain the same, but if it is 0 I want to change it to 1. So I tried the following code:
import fileinput
for line in fileinput.FileInput("sample.txt",inplace=1):
    s=line.split(",")
    print a
    print ','.join(s)
But after the program runs successfully, my .txt file looks like this:
1,Rent1,Expense,16/02/2010,1,4000,4000
1,Car Loan1,Expense,16/02/2010,2,4500,9000
1,Flat Loan1,Expense,16/02/2010,2,4000,8000
0,Rent2,Expense,16/02/2010,1,4000,4000
0,Car Loan2,Expense,16/02/2010,2,4500,9000
0,Flat Loan2,Expense,16/02/2010,2,4000,8000
Now I want to remove the empty lines. Is that possible, or is there another way to replace the 0's?
print adds an extra newline after the input and you already have one newline there. You should either strip the existing newline (line.rstrip("\n")) or use sys.stdout.write() instead.
import fileinput
import re
p = re.compile(r'^0,')
for line in fileinput.FileInput("sample.txt", inplace=True):
    print(p.sub('1,', line.strip()))
The existing code you have doesn't actually change the lines like you want; print a doesn't do anything if a isn't actually defined! So you end up just printing a blank line (the print a bit) and then printing the existing line, hence why you get a file that's unaltered except for the addition of some blank lines.
Either use rstrip to remove the trailing new lines before printing or use sys.stdout.write instead of print.
Also, if you only need to modify the first element, there is no need to split the entire line and join it again. You only need to split on the first comma:
line.split(',', 1)
If you want even better performance you could also just test the value of line[0] directly.
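A sketch of that direct test is shown below. It checks the first two characters with startswith("0,") so that a value such as 05 is not rewritten by accident; the function name fix_line is hypothetical.

```python
def fix_line(line):
    """Replace a leading '0,' with '1,' and return other lines unchanged."""
    if line.startswith("0,"):
        return "1" + line[1:]
    return line


sample = [
    "1,Rent1,Expense,16/02/2010,1,4000,4000\n",
    "0,Rent2,Expense,16/02/2010,1,4000,4000\n",
]
fixed_lines = [fix_line(line) for line in sample]
print("".join(fixed_lines), end="")
```

Because the trailing newline is left untouched, the result can be written back with write() without adding blank lines.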
fixed = []
with open('sample.txt') as infile:
    for l in infile:
        parts = l.split(',', 1)
        if parts[0] == '0':
            # change a leading 0 to 1; keep it a string so join() works
            parts[0] = '1'
        fixed.append(','.join(parts))
with open('sample.txt', 'w') as outp:
    for fixed_line in fixed:
        outp.write(fixed_line)
This is untested, but it should get you most of the way there.
Good luck
import fileinput
for line in fileinput.FileInput("sample.txt", inplace=True):
    s = line.rstrip().split(",")
    if s[0] == '0':
        s[0] = '1'
    print(','.join(s))
In Python 2, you have to put a comma at the end of your print statement so that it doesn't add a newline. Like so:
print "Hello",
In Python 3, the equivalent is print("Hello", end="").
This is what I came up with:
with open('file.txt', 'r') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        values = line.split(',')
        if values[0] == '0':
            values[0] = '1'
        outfile.write(','.join(values))
If you want a better csv handling library you might want to use this instead of split.
The cleanest way to do it is to use the CSV parser:
import fileinput
import csv
f = fileinput.FileInput("test.txt", inplace=True)
fichiercsv = csv.reader(f, delimiter=',')
for line in fichiercsv:
    line[0] = "1"
    print(",".join(line))
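If some fields could ever contain commas or quotes, pairing csv.reader with csv.writer keeps the quoting intact on the way out. The sketch below works on in-memory buffers (io.StringIO) purely so that it is self-contained; with real files you would pass file objects opened with newline='' instead, and the two-row sample is taken from the question.

```python
import csv
import io

# In-memory stand-ins for the input and output files.
src = io.StringIO(
    "1,Rent1,Expense,16/02/2010,1,4000,4000\n"
    "0,Rent2,Expense,16/02/2010,1,4000,4000\n"
)
dst = io.StringIO()

writer = csv.writer(dst, lineterminator="\n")
for row in csv.reader(src):
    if row and row[0] == "0":  # guard against empty rows
        row[0] = "1"
    writer.writerow(row)

output_text = dst.getvalue()
print(output_text, end="")
```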
