I have one table (please refer to the image). In this table I want to remove the character "A" from each row. How can I do this in Python? Below is my code using regexp_replace, but it is repetitive and I would like it optimised:
def re(s):
    return regexp_replace(s, "A", "").cast("Integer")

finalDF = finalD.select(re(col("C0")).alias("C0"), col("C1"),
                        re(col("C2")).alias("C2"),
                        re(col("C3")).alias("C3"), col("C4"),
                        re(col("C5")).alias("C5"),
                        re(col("C6")).alias("C6"), col("C7"),
                        re(col("C8")).alias("C8"),
                        re(col("C9")).alias("C9"), col("C10"),
                        re(col("C11")).alias("C11"), col("C12"),
                        re(col("C13")).alias("C13"),
                        re(col("C14")).alias("C14"), col("C15"),
                        re(col("C16")).alias("C16"), col("C17"),
                        re(col("C18")).alias("C18"),
                        re(col("C19")).alias("C19"), col("Label"))
finalDF.show(2)
Thank you in advance.
Why regex? A regex is overkill here. If your data is in the format you have shown, use the string replace method, as below:
Content of master.csv:
A11| 6|A34|A43|
A11| 6|A35|A44|
Code :
with open('master.csv', 'r') as fh:
    for line in fh.readlines():
        print "Before - ", line
        line = line.replace('A', '')
        print "After - ", line
        print "---------------------------"
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Before - A11| 6|A34|A43|
After - 11| 6|34|43|
---------------------------
Before - A11| 6|A35|A44|
After - 11| 6|35|44|
---------------------------
Code that replaces 'A' in the complete data in one shot (without going line by line):
with open("master.csv", 'r') as fh:
    data = fh.read()

data_after_remove = data.replace('A', '')
print "Before remove ..."
print data
print "After remove ..."
print data_after_remove
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Before remove...
A11| 6|A34|A43|
A11| 6|A35|A44|
After remove ...
11| 6|34|43|
11| 6|35|44|
C:\Users\dinesh_pundkar\Desktop>
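Since the original PySpark code also casts the cleaned values to Integer, here is a minimal plain-Python sketch of the same idea, cleanup plus conversion, assuming the pipe-delimited layout above (the helper name is illustrative):

```python
def clean_row(line):
    # Split on '|', drop the trailing empty field, strip 'A', convert to int.
    fields = [f.strip() for f in line.strip().strip('|').split('|')]
    return [int(f.replace('A', '')) for f in fields]

print(clean_row("A11| 6|A34|A43|"))  # [11, 6, 34, 43]
```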
Sorry for asking such a basic question, but I really tried to find the answer before coming here. Basically I have a script which searches inside .py files and reads their code line by line. The goal is to find whether a line ends with a space or a tab, as in the example below:
i = 5
z = 25
Basically, after the i variable there should be a \s, and after the z variable a \t (I hope the code formatting does not erase them).
def custom_checks(file, rule):
    """
    :param file: file in which you search for a specific character
    :param rule: the specific character you search for
    :return: dict obj with the form {line number: character}
    """
    rule = re.escape(rule)
    logging.info(f"File {os.path.abspath(file)} checked for {repr(rule)} inside it")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule: 2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors: {result_dict}')
After that, if I check the logging output I see this:

checked for '\+s\\s\$' inside it

I don't know why the backslashes are doubled.
Also, I get all the regexes from a config.json, which is this one:
{
    "ends with tab": "+\\t$",
    "ends with space": "+s\\s$"
}
Could someone point me in the right direction? I know I could do it other ways, such as reversing the line with [::-1], taking the first character and checking whether it is \s, etc., but I really want to do it with regex.
Thanks!
Try:
rules = {
    'ends with tab': re.compile(r'\t$'),
    'ends with space': re.compile(r' $'),
}
Note: while getting lines from iterating the file will leave newline ('\n') at the end of each string, $ in a regex matches the position before the first newline in the string. Thus, if using regex, you don't need to explicitly strip newlines.
if rule.search(line):
    ...
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
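A minimal sketch of that one-shot check (the sample lines are illustrative):

```python
def has_trailing_whitespace(line):
    # rstrip() removes all trailing whitespace; rstrip('\n') removes only the
    # newline. They differ exactly when a space or tab precedes the newline.
    return line.rstrip() != line.rstrip('\n')

print(has_trailing_whitespace('i = 5 \n'))    # True (trailing space)
print(has_trailing_whitespace('z = 25\t\n'))  # True (trailing tab)
print(has_trailing_whitespace('ok = 1\n'))    # False
```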
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
Addendum 1: read rules from JSON config
# here from a string, but could be in a file, of course
json_config = """
{
"ends with tab": "\\t$",
"ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment 

text = """
# xyz 
bar 
"""
def foo(): 
    # to be continued 
    pass 

def bar():
    pass 



EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
    # to be continued |
    pass |
|
def bar():|
    pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at the end of lines, only one blank line between the functions, etc.). autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
    # to be continued|
    pass|
|
|
def bar():|
pass|
I have a text file containing multiple lines. I want to check whether a specific line (calling xyz ...) is present between two lines (a ++ start line and a -- exiting line). If the calling xyz ... line is present, it should be returned; if not, NULL should be returned instead. I want to store the results in a list.
Example file:
++ start line
22 15:36:53
dog, cat, monkey, rat
calling xxxxx
animal already added
-- exiting line
The above block should add calling xxxxx to the list.
++ start line
12 12:56:34
cat, camel, cow, dog
animal already added
-- exiting line
In the above block, calling xyz is missing, so NULL should be added to the list.
Expected Output
calling xxxxx
NULL
You can use this regex to check the condition that you have mentioned:
^\+\+(?=(?:(?!\-\-).)*\s+(calling[^\n]+)).*?\s+--
If it matches then you get the calling line as group 1
Sample source:
import re
regex = r"(?:^\+\+(?=(?:(?!\-\-).)*\s+(calling[^\n]+)).*?\s+--)|(?:^\+\+(?=(?:(?!\-\-).)*\s+(?!calling[^\n]+)).*?\s+--)"
test_str = ("++ start line \n"
"22 15:36:53 \n"
"dog, cat, monkey, rat\n"
"calling xxxxx\n"
"animal already added\n"
"-- exiting line\n\n\n"
"++ start line \n"
"12 12:56:34 \n"
"cat, camel, cow, dog \n"
"animal already added\n"
"-- exiting line\n\n"
"++ start line \n"
"12 12:56:34 \n"
"cat, camel, cow, dog \n"
"calling pqr \n"
"animal already added\n"
"-- exiting line\n\n")
matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)
for match in matches:
print(match.group(1))
Output:
calling xxxxx
None
calling pqr
You may want to use multiple patterns: one to separate the blocks, and one to search for calling... within each block.
Expression for the block:
^\+\+
(?P<block>[\s\S]+?)
^--.+
Expression for calling...:
^calling.+
As a Python snippet:
import re
rx_block = re.compile(r'''
^\+\+
(?P<block>[\s\S]+?)
^--.+''', re.MULTILINE | re.VERBOSE)
rx_calling = re.compile(r'''
^calling.+
''', re.MULTILINE | re.VERBOSE)
numbers = [number.group(0) if number else None
for block in rx_block.finditer(your_string_here)
for number in [rx_calling.search(block.group('block'))]]
print(numbers)
Which yields
['calling xxxxx', None]
One can use the split function to get the sub-parts and check them:
outlist = []
with open("calling.txt", "r") as ff:
    lines = ff.read()

records = lines.split("++ start line ")
records = list(filter(lambda x: len(x) > 0, records))
for rec in records:
    found = False
    rows = rec.split("\n")
    for row in rows:
        if not found and row.startswith("calling"):
            outlist.append(row.split(" ")[1])
            found = True
    if not found:
        outlist.append("NULL")
print(outlist)
Output:
['xxxxx', 'NULL', 'pqr']
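Note that this appends only the word after calling. If the whole calling ... line is wanted, as in the expected output above, a small standalone variant (a sketch operating on an inline string rather than a file) would be:

```python
text = """++ start line 
22 15:36:53 
dog, cat, monkey, rat
calling xxxxx
animal already added
-- exiting line

++ start line 
12 12:56:34 
cat, camel, cow, dog 
animal already added
-- exiting line
"""

outlist = []
for rec in text.split("++ start line"):
    if not rec.strip():
        continue
    # Keep the whole "calling ..." line, or "NULL" if the block has none.
    found = next((row.strip() for row in rec.split("\n")
                  if row.startswith("calling")), "NULL")
    outlist.append(found)
print(outlist)  # ['calling xxxxx', 'NULL']
```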
Python newbie here. I've been working through this code to create a string which includes a date. I have parts of the code working to get the data I want; however, I need help formatting the string to tie the data together.
This is what I have so far:
def get_rectype_count(filename, rectype):
    return int(subprocess.check_output('''zcat %s | '''
                                       '''awk 'BEGIN {FS=";"};{print $6}' | '''
                                       '''grep -i %r | wc -l''' %
                                       (filename, rectype), shell=True))

str = "MY VALUES ("
rectypes = 'click', 'bounce'
for myfilename in glob.iglob('*.gz'):
    #print (rectypes)
    print str.join(rectypes)
    print (timestr)
    print([get_rectype_count(myfilename, rectype)
           for rectype in rectypes])
My output looks like this:
clickMY VALUES (bounce
'2015-07-01'
[222, 0]
I'm trying to create this output file:
MY VALUES ('2015-07-01', click, 222)
MY VALUES ('2015-07-01', bounce, 0)
When you call join on a string it joins together everything in the sequence passed to it, using itself as the separator.
>>> '123'.join(['click', 'bounce'])
click123bounce
Python supports formatting strings using replacement fields:
>>> values = "MY VALUES ('{date}', {rec}, {rec_count})"
>>> values.format(date='2015-07-01', rec='click', rec_count=222)
"MY VALUES ('2015-07-01', click, 222)"
With your code:
for myfilename in glob.iglob('*.gz'):
    for rec in rectypes:
        rec_count = get_rectype_count(myfilename, rec)
        print values.format(date=timestr, rec=rec, rec_count=rec_count)
edit:
If you want to use join, you can join a newline, \n:
>>> print '\n'.join(['line1', 'line2'])
line1
line2
Putting it together:
print '\n'.join(values.format(date=timestr,
                              rec=rec,
                              rec_count=get_rectype_count(filename, rec))
                for filename in glob.iglob('*.gz')
                for rec in rectypes)
try this:
str1 = "MY VALUES ("
rectypes = ['click', 'bounce']
K = []
for myfilename in glob.iglob('*.gz'):
    #print (rectypes)
    #print str.join(rectypes)
    #print (timestr)
    k = [get_rectype_count(myfilename, rectype)
         for rectype in rectypes]
    for i in range(0, len(rectypes)):
        print str1 + str(timestr) + "," + rectypes[i] + "," + str(k[i]) + ")"
I have a text file which contains content in a format like:
PAGE(leave) 'Data1'
line 1
line 2
line 2
...
...
...
PAGE(enter) 'Data1'
I need to get all the lines between the two keywords and save them to a text file. I have come up with the following so far, but I have an issue with the single quotes: the regular expression treats them as quoting characters of the expression rather than as part of the keyword.
My codes so far:
log_file = open('messages', 'r')
data = log_file.read()
block = re.compile(ur'PAGE\(leave\) \'Data1\'[\S ]+\s((?:(?![^\n]+PAGE\(enter\) \'Data1\').)*)', re.IGNORECASE | re.DOTALL)
data_in_home_block = re.findall(block, data)
file = 0
make_directory("home_to_home_data", 1)
for line in data_in_home_block:
    file = file + 1
    with open("home_to_home_" + str(file), "a") as data_in_home_to_home:
        data_in_home_to_home.write(str(line))
It would be great if someone could guide me on how to implement it.
As pointed out by @JoanCharmant, it is not necessary to use a regex for this task, because the records are delimited by fixed strings.
Something like this should be enough:
messages = open('messages').read()
blocks = [block.rpartition(r"PAGE\(enter\) 'Data1'")[0]
          for block in messages.split(r"PAGE\(leave\) 'Data1'")
          if block and not block.isspace()]
for count, block in enumerate(blocks, 1):
    with open('home_to_home_%d' % count, 'a') as stream:
        stream.write(block)
If it's the single quotes that worry you, you can delimit the regular expression string with double quotes:
'hello "howdy"' # Correct
"hello 'howdy'" # Correct
Now, there are more issues here. Even when the pattern is declared as an r string, you still must escape your regular expression's backslashes in the .compile (see "What does the r in Python's re.compile(r'pattern', flags) mean?"). It's just that without the r, you would need a lot more backslashes.
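As a quick illustration of what the r prefix buys you (pattern chosen purely for illustration), these two patterns are the same three-character string, and both match a literal backslash followed by d:

```python
import re

plain = re.compile('\\\\d')  # plain string: every backslash must be doubled
raw = re.compile(r'\\d')     # raw string: backslashes pass through as-is

print(plain.pattern == raw.pattern)  # True
print(bool(raw.search(r'C:\data')))  # True: '\d' occurs in 'C:\data'
```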
I've created a test file with two "sections":
PAGE\(leave\) 'Data1'
line 1
line 2
line 3
PAGE\(enter\) 'Data1'
PAGE\(leave\) 'Data1'
line 4
line 5
line 6
PAGE\(enter\) 'Data1'
The code below will do what you want (I think)
import re
log_file = open('test.txt', 'r')
data = log_file.read()
log_file.close()
block = re.compile(
    ur"(PAGE\\\(leave\\\) 'Data1'\n)"
    "(.*?)"
    "(PAGE\\\(enter\\\) 'Data1')",
    re.IGNORECASE | re.DOTALL | re.MULTILINE
)
data_in_home_block = [result[1] for result in re.findall(block, data)]
for data_block in data_in_home_block:
    print "Found data_block: %s" % (data_block,)
Outputs:
Found data_block: line 1
line 2
line 3
Found data_block: line 4
line 5
line 6
I want to put a \n after every 20 characters.
My_string = "aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeffffffffffffffffffff"
I tried this: a = "\n".join(re.findall("(?s).{,20}", My_string))[0:-1]
Whenever I print it like:
print '''
---------------------------------------------------------------
Value of a is
%s
---------------------------------------------------------------
''' % a
OUTPUT:
---------------------------------------------------------------
Value of a is
aaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbc
cccccccccccccccccccd
ddddddddddddddddddde
eeeeeeeeeeeeeeeeeeef
fffffffffffffffffff
---------------------------------------------------------------
I want output like:
---------------------------------------------------------------
Value of a is
aaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbc
cccccccccccccccccccd
ddddddddddddddddddde
eeeeeeeeeeeeeeeeeeef
fffffffffffffffffff
---------------------------------------------------------------
You want to create a list of all lines, both predefined and wrapped, then add space indentation in front of each one (preferably in a single step, to avoid duplicate code), and then join everything into a single string.
While regular expressions would do the trick, have a look at the nice standard module textwrap, which wraps lines for you:
import textwrap
My_string = "aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeffffffffffffffffffff"
print '\n'.join(' {0}'.format(line) for line in [
    '---------------------------------------------------------------',
    'Value of a is'] + textwrap.fill(My_string, 20).split('\n') +
    ['---------------------------------------------------------------'])
prints
---------------------------------------------------------------
Value of a is
aaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbc
cccccccccccccccccccd
ddddddddddddddddddde
eeeeeeeeeeeeeeeeeeef
fffffffffffffffffff
---------------------------------------------------------------
try this:
# -*- encoding: utf-8 -*-
import re
My_string = "aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeffffffffffffffffffff"
split="\n "
a = split.join(re.findall("(?s).{,20}", My_string))[0:-1]
print ''' ---------------------------------------------------------------
Value of a is
%s ---------------------------------------------------------------''' % a
It looks like this can meet your requirements.