I am trying to parse a text document line by line, and in doing so I stumbled onto some weird behavior which I believe is caused by the presence of some kind of ankh symbol (☥). I am not able to copy the real symbol here.
In my code I try to determine whether a '+' symbol is present in the first two characters of each line. To check that this works, I added a print statement showing a boolean and the line itself.
The relevant part of my code:
with open(file_path) as input_file:
    content = input_file.readlines()

for line in content:
    plus = '+' in line[0:2]
    print('Plus: {0}, line: {1}'.format(plus, line))
A file I could try to parse:
+------------------------------
row 1 with some content
+------+------+-------+-------
☥+------+------+-------+------
| col 1 | col 2 | col 3 ...
+------+------+-------+-------
|_ valu | val | | dsf |..
|_ valu | valu | ...
What I get as output:
Plus: True, line: +------------------------------
Plus: False, line: row 1 with some content
Plus: True, line: +------+------+-------+-------
♀+------+------+-------+------
Plus: False, line: | col 1 | col 2 | col 3 ...
Plus: True, line: +------+------+-------+-------
Plus: False, line: |_ valu | val | | dsf |..
Plus: False, line: |_ valu | valu | ...
So my question is: why does it print the line containing the symbol without the 'Plus: True/False' prefix? How should I solve this?
Thanks.
What you are seeing is the gender symbol (♀). It is how the original IBM PC character set displays the character encoded as 0x0C, aka form feed, aka Ctrl-L.
If you are parsing text data with these present, they likely were inserted to indicate to a printer to start a new page.
From Wikipedia:
Form feed is a page-breaking ASCII control character. It forces the printer to eject the current page and to continue printing at the top of another. Often, it will also cause a carriage return. The form feed character code is defined as 12 (0xC in hexadecimal), and may be represented as control+L or ^L.
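If you just want the parsing to be robust against stray form feeds, one option (a minimal sketch, assuming you simply want to ignore them) is to strip the character before testing each line:

with open(file_path) as input_file:
    for line in input_file:
        line = line.replace('\x0c', '')  # drop any form feed characters
        plus = '+' in line[0:2]
        print('Plus: {0}, line: {1}'.format(plus, line))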
Sorry for asking such a low-level question, but I really tried to look for the answer before coming here...
Basically I have a script which searches inside .py files and reads their code line by line. The goal of the script is to find out whether a line ends with a space or a tab, as in the example below:
i = 5
z = 25
Basically, after the i variable we should have a \s and after the z variable a \t. (I hope the code formatting will not erase them.)
def custom_checks(file, rule):
    """
    :param file: file in which you search for a specific character
    :param rule: the specific character you search for
    :return: dict obj with the form { line number : character }
    """
    rule = re.escape(rule)
    logging.info(f" File {os.path.abspath(file)} checked for {repr(rule)} inside it ")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule: 2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors: {result_dict}')
After that, if I check the logging output I see this:
checked for '\+s\\s\$' inside it
I don't know why the backslashes are doubled.
Also, I basically get all the regexes from a config.json, which is this one:
{
    "ends with tab": "+\\t$",
    "ends with space": "+s\\s$"
}
Could someone please help me in this direction? I basically know that I could do it other ways, such as reversing the line with [::-1] and checking whether the first character is \s etc., but I really want to do it with regex.
Thanks!
Try:
import re

rules = {
    'ends with tab': re.compile(r'\t$'),
    'ends with space': re.compile(r' $'),
}
Note: while iterating over the file leaves a newline ('\n') at the end of each string, $ in a regex matches the position just before a newline at the end of the string. Thus, if using regex, you don't need to explicitly strip newlines.
if rule.search(line):
    ...
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
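For example, a quick sketch of that check over a file (the filename here is hypothetical):

with open('example.py') as f:  # hypothetical file
    for n, line in enumerate(f, 1):
        if line.rstrip() != line.rstrip('\n'):
            print(f'line {n} ends with whitespace')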
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
Addendum 1: read rules from JSON config
import json
import re

# here from a string, but could be in a file, of course
json_config = """
{
    "ends with tab": "\\t$",
    "ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
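As a quick sketch of how the compiled rules might then be applied (the filename is hypothetical):

violations = {}
with open('example.py') as f:  # hypothetical file
    for n, line in enumerate(f, 1):
        for name, rx in rules.items():
            if rx.search(line):
                violations.setdefault(n, []).append(name)
print(violations)  # e.g. {3: ['ends with space']}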
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment 

text = """
# xyz 
bar 
"""
def foo(): 
    # to be continued 
    pass 

def bar():
    pass 



EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
    # to be continued |
    pass |
|
def bar():|
    pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at end of lines, only 1 line between the functions, etc.). Autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
    # to be continued|
    pass|
|
|
def bar():|
    pass|
I want to print the items from a list on the same line.
The code I have tried:
dice_art = ["""
-------
| |
| N |
| |
------- ""","""
-------
| |
| 1 |
| |
------- """] etc...
player = [0, 1, 2]
for i in player:
    print(dice_art[i], end='')
output =
ASCII0
ASCII1
ASCII2
I want output to =
ASCII0 ASCII1 ASCII2
This code still prints the ASCII art representation of my die on a new line. I would like to print it on the same line to save space and show each player's roll on one screen.
Since the elements of dice_art are multiline strings, this is going to be harder than just changing the end character.
First, remove newlines from the beginning of each string and make sure all lines in ASCII art have the same length.
Then try the following
player = [0, 1, 2]
lines = [dice_art[i].splitlines() for i in player]
for l in zip(*lines):
    print(*l, sep='')
If you apply the described changes to your ASCII art, the code will print
------- ------- -------
|     ||     ||     |
|  N  ||  1  ||  2  |
|     ||     ||     |
------- ------- -------
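For reference, here is a small sketch of that normalization step (a hedged example, assuming you want to do it in code rather than by editing the strings by hand): strip the leading newline and pad every line to a common width.

def normalize(art):
    # drop the leading newline and pad each line to the longest line's width
    rows = art.lstrip('\n').splitlines()
    width = max(len(r) for r in rows)
    return [r.ljust(width) for r in rows]

lines = [normalize(dice_art[i]) for i in player]
for l in zip(*lines):
    print(*l, sep='')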
The fact that your boxes are multiline changes everything.
Your intended output, as I understand it, is this:
------- -------
|     ||     |
|  N  ||  1  | ...and so on...
|     ||     |
------- -------
You can do this like so:
art_split = [art.split("\n") for art in dice_art]
zipped = zip(*art_split)
for elems in zipped:
    print("".join(elems))
# ------- -------
# |     ||     |
# |  N  ||  1  |
# |     ||     |
# ------- -------
N.B. You need to guarantee that each line is the same length in your output. If the lines of hyphens are shorter than the other lines, your alignment will be off.
In the future, if you provide the intended output, you can get much better responses.
Change print(dice_art[i], end='') to:
print(dice_art[i], end=' ')
(Notice the space between the two quotes.)
If you want the output printed dynamically, use the following syntax:
print(dice_art[i], sep=' ', end='', flush=True)
A join command should do it.
dice_art = ['ASCII0', 'ASCII1', 'ASCII2']
print(" ".join(dice_art))
The output would be:
ASCII0 ASCII1 ASCII2
Hello, I am not very familiar with programming and found Stack Overflow while researching my task. I want to do natural language processing on a .csv file that looks like this and has about 15,000 rows:
ID | Title | Body
----------------------------------------
1 | Who is Jack? | Jack is a teacher...
2 | Who is Sam? | Sam is a dog....
3 | Who is Sarah?| Sarah is a doctor...
4 | Who is Amy? | Amy is a wrestler...
I want to read the .csv file, do some basic NLP operations, and write the results back into a new file or into the same one. After some research, Python and NLTK seem to be the technologies I need. (I hope that's right.) After tokenizing, I want my .csv file to look like this:
ID | Title | Body
-----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Sam" "is" "a" "dog"....
3 | "Who" "is" "Sarah" "?"| "Sarah" "is" "a" "doctor"...
4 | "Who" "is" "Amy" "?" | "Amy" "is" "a" "wrestler"...
What I have achieved after a day of research and putting pieces together looks like this:
ID | Title | Body
----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Jack" "is" "a" "teacher"...
3 | "Who" "is" "Sarah" "?"| "Jack" "is" "a" "teacher"...
4 | "Who" "is" "Amy" "?" | "Jack" "is" "a" "teacher"...
My first idea was to read a specific cell in the .csv, do an operation, and write it back to the same cell, and then somehow do that automatically on all rows. Obviously I managed to read a cell and tokenize it. But I could not manage to write it back into that specific cell, and I am far away from "do that automatically for all rows". I would appreciate some help if possible.
My code:
import csv
from nltk.tokenize import word_tokenize

############Read CSV File######################
########## ID , Title, Body####################
line_number = 1    # line to read (need some kind of loop here)
column_number = 2  # column to read (need some kind of loop here)

with open('test10in.csv', 'rb') as f:
    reader = csv.reader(f)
    reader = list(reader)
    text = reader[line_number][column_number]
    stringtext = ''.join(text)  # tokenizing just works on strings
    tokenizedtext = (word_tokenize(stringtext))
    print(tokenizedtext)

#############Write back in same cell in new CSV File######

with open('test11out.csv', 'wb') as g:
    writer = csv.writer(g)
    for row in reader:
        row[2] = tokenizedtext
        writer.writerow(row)
I hope I asked the question correctly and someone can help me out.
The pandas library will make all of this much easier.
pd.read_csv() will handle the input much more easily, and you can apply the same function to a column using pd.DataFrame.apply()
Here's a quick example of how the key parts you'll want work. In the .applymap() method, you can replace my lambda function with word_tokenize() to apply that across all elements instead.
In [58]: import pandas as pd
In [59]: pd.read_csv("test.csv")
Out[59]:
0 1
0 wrestler Amy dog is teacher dog dog is
1 is wrestler ? ? Sarah doctor teacher Jack
2 a ? Sam Sarah is dog Sam Sarah
3 Amy a a doctor Amy a Amy Jack
In [60]: df = pd.read_csv("test.csv")
In [61]: df.applymap(lambda x: x.split())
Out[61]:
0 1
0 [wrestler, Amy, dog, is] [teacher, dog, dog, is]
1 [is, wrestler, ?, ?] [Sarah, doctor, teacher, Jack]
2 [a, ?, Sam, Sarah] [is, dog, Sam, Sarah]
3 [Amy, a, a, doctor] [Amy, a, Amy, Jack]
Also see: http://pandas.pydata.org/pandas-docs/stable/basics.html#row-or-column-wise-function-application
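Putting the two together — a minimal sketch, assuming your file really has Title and Body columns as in your sample (the sep value is an assumption based on how the sample is displayed; adjust it to your file's actual delimiter):

import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('test10in.csv', sep='|')
df.columns = [c.strip() for c in df.columns]  # the sample headers have padding spaces
df[['Title', 'Body']] = df[['Title', 'Body']].applymap(word_tokenize)
df.to_csv('test11out.csv', index=False)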
You first need to parse your file and then process (tokenize, etc.) each field separately.
If your file really looks like your sample, I wouldn't call it a CSV. You could parse it with the csv module, which is specifically for reading all sorts of CSV files: add delimiter="|" to the arguments of csv.reader() to separate your rows into cells. (And don't open the file in binary mode.) But your file is easy enough to parse directly:
with open('test10in.csv', encoding="utf-8") as fp:  # Or whatever encoding is right
    content = fp.read()

lines = content.splitlines()
allrows = [[fld.strip() for fld in line.split("|")] for line in lines]

# Headers and data:
headers = allrows[0]
rows = allrows[2:]
You can then use nltk.word_tokenize() to tokenize each field of rows, and go on from there.
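For instance, a brief sketch of that last step (it only tokenizes in memory and prints; writing the result back out in your format is left to you):

from nltk.tokenize import word_tokenize

tokenized = [[word_tokenize(field) for field in row] for row in rows]
for row in tokenized:
    print(row)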
What's an easy way to convert the output of Python PrettyTable to a programmatically usable format such as CSV?
The output looks like this:
C:\test> nova list
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
Perhaps this will get you close:
nova list | grep -v '\-\-\-\-' | sed 's/^[^|]\+|//g' | sed 's/|\(.\)/,\1/g' | tr '|' '\n'
This will strip the --- lines
Remove the leading |
Replace all but the last | with ,
Replace the last | with \n
Here's a real ugly one-liner
import csv
s = """\
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
result = [tuple(filter(None, map(str.strip, splitline))) for line in s.splitlines() for splitline in [line.split("|")] if len(splitline) > 1]
with open('output.csv', 'wb') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerows(result)
I can unwrap it a bit to make it nicer:
splitlines = s.splitlines()
splitdata = [line.split("|") for line in splitlines]
splitdata = [line for line in splitdata if len(line) > 1]
# toss the lines that don't have any data in them -- pure separator lines
header, *data = [[field.strip() for field in line if field.strip()] for line in splitdata]
result = [header] + data
# I'm really just separating these, then re-joining them, but sometimes having
# the headers separately is an important thing!
Or possibly more helpful:
result = []
for line in s.splitlines():
    splitdata = line.split("|")
    if len(splitdata) == 1:
        continue  # skip lines with no separators
    linedata = []
    for field in splitdata:
        field = field.strip()
        if field:
            linedata.append(field)
    result.append(linedata)
@AdamSmith's answer has a nice method for parsing the raw table string. Here are a few additions to turn it into a generic function (I chose not to use the csv module, so there are no additional dependencies).
def ptable_to_csv(table, filename, headers=True):
    """Save PrettyTable results to a CSV file.

    Adapted from @AdamSmith https://stackoverflow.com/questions/32128226

    :param PrettyTable table: Table object to get data from.
    :param str filename: Filepath for the output CSV.
    :param bool headers: Whether to include the header row in the CSV.
    :return: None
    """
    raw = table.get_string()
    data = [tuple(filter(None, map(str.strip, splitline)))
            for line in raw.splitlines()
            for splitline in [line.split('|')] if len(splitline) > 1]
    if table.title is not None:
        data = data[1:]
    if not headers:
        data = data[1:]
    with open(filename, 'w') as f:
        for d in data:
            f.write('{}\n'.format(','.join(d)))
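A quick usage sketch (assuming the prettytable package is installed; the table contents here are made up):

from prettytable import PrettyTable

table = PrettyTable(['ID', 'Name', 'Status'])
table.add_row(['6bca09f8', 'tester', 'ACTIVE'])
ptable_to_csv(table, 'nova_list.csv')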
Here's a solution using a regular expression. It also works for an arbitrary number of columns (the number of columns is determined by counting the number of plus signs in the first input line).
input_string = """spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
import re, csv, sys

def pretty_table_to_tuples(input_str):
    lines = input_str.split("\n")
    num_columns = len(re.findall(r"\+", lines[0])) - 1
    line_regex = r"\|" + (r" +(.*?) +\|" * num_columns)
    for line in lines:
        m = re.match(line_regex, line.strip())
        if m:
            yield m.groups()

w = csv.writer(sys.stdout)
w.writerows(pretty_table_to_tuples(input_string))
I have this CSV that I have to modify using Python. The number of files varies each time. The input CSV files I have are just lists of coordinates (x, y, z), and I have to turn each file into a "model" which contains the same coordinates plus some information/headers.
The model looks like this:
Number | 1 | |
Head | N | E | El
Begin | list | list | list
| . | . | .
| . | . | .
| . | . | .
| . | . | .
End | . | . | .
| . | . | .
BeginR | Ok | |
EndR | | |
The dots are the coordinates that are in the lists.
So far I've managed to write almost everything.
What's left is to write the Begin and the End in the first column.
Because the size of the lists varies, I have difficulties placing them where they should be: Begin on the same line as the first coordinate, and End on the second-to-last coordinate line.
This is my updated code:
for i in ficList:
    with open(i, newline='') as f:
        reader = csv.reader(f, delimiter=';')
        next(reader)  # skip the header
        for row in reader:
            coord_x.append(row[0])  # X
            coord_y.append(row[1])  # Y
            coord_z.append(row[2])  # Z

    list_list = [coord_x, coord_y, coord_z]  # list of coordinates
    len_x = len(coord_x)  # length of list

    with open(i, 'w', newline='') as fp:
        writer = csv.writer(fp, delimiter=';')
        writer.writerow(['Number', number])
        writer.writerow(['Head', 'N', 'E', 'El'])
        for l in range(len_x):
            if l == 0:
                writer.writerow(['Begin', list_list[0][l], list_list[1][l], list_list[2][l]])
            if l == len_x - 2:
                writer.writerow(['End', list_list[0][l], list_list[1][l], list_list[2][l]])
            writer.writerow(['', list_list[0][l], list_list[1][l], list_list[2][l]])  # write the coordinates
        writer.writerow(['BeginR', 'Ok'])
        writer.writerow(['EndR'])

    coord_x.clear()  # empty list x
    coord_y.clear()  # empty list y
    coord_z.clear()  # empty list z
You're probably better off defining the row labels in advance in a map, then looking them up for each row. Also, list_list is not really needed; you should just stick to the separate vectors:
...
with open(i, 'w', newline='') as fp:
    writer = csv.writer(fp, delimiter=';')
    writer.writerow(['Number', number])
    writer.writerow(['Head', 'N', 'E', 'El'])
    row_label_map = {0: 'Begin', len_x - 2: 'End'}
    for l in range(len_x):
        row_label = row_label_map.get(l, "")
        writer.writerow([row_label, coord_x[l], coord_y[l], coord_z[l]])
    writer.writerow(['BeginR', 'Ok'])
    writer.writerow(['EndR'])
...
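To see how the label lookup behaves on its own, here is a tiny sanity check (the coordinate values are made up):

coord_x = ['1.0', '2.0', '3.0', '4.0']  # hypothetical data
row_label_map = {0: 'Begin', len(coord_x) - 2: 'End'}
for l in range(len(coord_x)):
    print(row_label_map.get(l, ''), coord_x[l])
# Begin 1.0
#  2.0
# End 3.0
#  4.0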
Also, if you create the vectors coord_x etc. inside the loop, you don't need to clear them afterwards, since fresh ones are created for each file.
With your latest code, I am guessing the issue is that you first write the line with the Begin tag and then write it again without it. Move the logic into an if..elif..else block:
for l in range(len_x):
    if l == 0:
        writer.writerow(['Begin', list_list[0][l], list_list[1][l], list_list[2][l]])
    elif l == len_x - 2:
        writer.writerow(['End', list_list[0][l], list_list[1][l], list_list[2][l]])
    else:
        writer.writerow(['', list_list[0][l], list_list[1][l], list_list[2][l]])  # write the coordinates
To me it seems like it would be easier to first modify the input CSV with sed to include an extra column that holds the Begin and End tags, like this:
sed -e 's/^/,/' -e '1s/^/Begin/' -e '$ s/^/End/' -e 's/^,/ ,/' test.csv
Then you can simply write the columns out as they are, without having to add logic in Python for when to insert the additional tags. This assumes that the input CSV is called test.csv.