Why write() method writes unknown characters? - python

I have this simple txt file:
[header]
width=8
height=5
tilewidth=175
tileheight=150
[tilesets]
tileset=../GFX/ts1.png,175,150,0,0
[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1
I want to separate the text by the "[header]", "[tilesets]" and "[layer]" sections. The problem is, if I split it this way:
m = open(self.fullPath, 'r+')
sliced = m.read().split() # Default = \n
print sliced
It separates every line, because read() always leaves a '\n' at the end of each line:
['[header]', 'width=8', 'height=5', 'tilewidth=175', 'tileheight=150', '[tilesets]', 'tileset=../GFX/ts1.png,175,150,0,0', '[layer]', 'type=Tile', 'Layer', '1', 'data=', '1,1,1,1,1,1,1,1,', '1,0,0,0,0,0,0,1,', '1,0,0,0,0,1,1,1,', '1,0,0,0,6,0,0,1,', '1,1,1,1,4,1,1,1']
But it would be possible to split it perfectly if, instead of a newline character, there were a "#" sign or some other marker separating each section.
Then I thought: "There are empty lines there, and they are newline characters, so I just need to test whether a line equals the newline character and replace it with '#'":
for line in m.readlines():
    if line == '\n':
        m.write('#')
for line in m.readlines():
    print line
Perfect.. Except that.. Instead of achieving this:
[header]
width=8
height=5
tilewidth=175
tileheight=150
#
[tilesets]
tileset=../GFX/ts1.png,175,150,0,0
#
[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1
I get this:
[header]
width=8
height=5
tilewidth=175
tileheight=150
[tilesets]
tileset=../GFX/ts1.png,175,150,0,0
[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1##õÙÓ Z d Z d d l Z d d l Z d " Z d $ „ Z d - f d „ ƒ Y Z e H d ƒ Z e I j ƒ Æ Çîà  õÙÓ ; | j d ƒ } i < g d 6 g d 6 g d 6 } d = d d g } x 0 ·ð? | j ƒ D u tîà  õÙÓI À¶ð ) (–à W # "íà  õ#ÎÔ €·ðB | j ƒ D ú ú–à  õ(Tò `·ð } | C G H q | #·ð Ñ Ñ–à  õ#ÎÔ ¨ ¨–à  õ#ÎÔ
E G H | F j ƒ –à  õ#ÎÔ S V V–à  õ#ÎÔž ÿÿÿÿ t | j d ƒ } i g d 6g d 6g d 6} d d d g } x0 | j ƒ D]" } | d k rk | j d ƒ n qI Wx | j ƒ D] } | GHq| Wd
GH| j ƒ d S `:ð> >§à  õ#ÎÔÀ:ðà¢îðà:ð ;ð`ßî ;ð#;ð0ð`;ð £îXð# ï€;ð€ð ;ð`£îÀ;ðà;ð ð2 2›à  õ#ÎÔ`<ð€<ðà¤î <ð ?îÀ<ðà<ð =ð =ð#=ðÀ?î ïÐð`=ð¸ï€=ð =ðøðÀ=ðà=ð >ð >ð#>ð`>ð ð€>ð >ðÀ>ðà>ð ?ð#OÑ ?ð#?ð`?ð€?ð ?ðHðpðÀ?ð˜ðÀðà?ð #ðÀ£î##ð`#ð€#ð #ð PðHPðÀ#ðà#ð
It makes no sense :).

Simultaneously reading from and writing to a file tends to have unpredictable effects on what kind of output you get.
If your categories are always separated by two newlines, then just split on that, instead of doing any fancy find/replace operations.
m = open("input.txt", "r+")
sliced = m.read().split("\n\n")
print "data has been split into {} categories.".format(len(sliced))
# print the starting line of each category
for category in sliced:
    print category.split("\n")[0]
Result:
data has been split into 3 categories.
[header]
[tilesets]
[layer]
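Building on the split above, here is a minimal sketch (in Python 3 syntax) that keys each block by its first line. The function name split_sections and the shortened sample text are assumptions made for illustration, and it assumes the sections really are separated by exactly one blank line.

```python
def split_sections(text):
    """Split text on blank lines and key each block by its first line."""
    sections = {}
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        # The first line is the section name, the rest is its content.
        sections[lines[0]] = lines[1:]
    return sections


# Shortened sample standing in for the contents of the map file.
sample = (
    "[header]\nwidth=8\nheight=5\n\n"
    "[tilesets]\ntileset=../GFX/ts1.png,175,150,0,0\n\n"
    "[layer]\ntype=Tile Layer 1\ndata=\n1,1,1,1"
)

sections = split_sections(sample)
print(list(sections))         # ['[header]', '[tilesets]', '[layer]']
print(sections["[header]"])   # ['width=8', 'height=5']
```

Because the file is only read and never written, the problem of mixing reads and writes on the same handle does not come up.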


How I can delete extra whitespaces between characters in string python [duplicate]

How can I normalize a string that contains extra whitespace between characters?
From this:
'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
into this:
'I need to delete the extra whitespace between characters'
Here's an option using split and join:
>>> s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
>>> ''.join(c if c else " " for c in s.split(" "))
'I need to delete the extra whitespace between characters'
You can use a regex to delete every space that is not followed by another space:
s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
import re
out = re.sub(r'\s(?!\s)', '', s)
output: 'I need to delete the extra whitespace between characters'
Alternative that handles runs of spaces: if there is more than one, keep one, otherwise delete it:
s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
import re
out = re.sub(r'\s+?(\s)?', r'\1', s)
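Note that the lazy pattern above consumes whitespace at most two characters at a time, so a run of four spaces would come out as two. Here is a sketch of a variant that treats each whole run as one unit by using a replacement function. The name collapse_spacing is chosen here purely for illustration.

```python
import re


def collapse_spacing(s):
    """Delete single whitespace characters; shrink longer runs to one space."""
    return re.sub(r"\s+", lambda m: " " if len(m.group()) > 1 else "", s)


print(collapse_spacing("I  n e e d    t o"))  # I need to
```

The replacement function only looks at the length of each run, so it does not depend on how unmatched groups are substituted.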
As discussed in the comments, you want to remove every whitespace character that is followed by a non-whitespace character. Doing exactly this in the Python shell:
>>> import re
>>> s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
>>> re.sub(r"\s(\S)", r"\1", s)
'I need to delete the extra whitespace between characters'
>>>
The r"\s(\S)" says "match a whitespace character and then a non-whitespace character". The non-whitespace character is captured in group 1. The r"\1" says to replace the full match with group 1.

Control signs appear as smileys in visual studio

When I type
print( "Hello ,\x00\x01\x02world!" )
in IDLE, the special signs are ignored, but when I do this in Visual Studio Code, I get this:
Hello ,☺☻world!
I was wondering just why this is.
for x in range(127):
    print(chr(x), end = ' ')
Edit: When I run this code, this displays on the terminal in vscode:
☺ ☻ ♥ ♦ ♣ ♠
♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ∟↔▲▼ 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ]
^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
When I do this in IDLE, this shows up:
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
Maybe this has to do with the version?
I'm using Python 3.9.1.
Oh and for some reason there are these rectangles in IDLE that get printed before the '!', but they can't be shown in the answer. (these rectangles represent the control codes: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Control_codes)
It seems that my VSC prints the first characters from code page 437
https://en.wikipedia.org/wiki/Code_page_437
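Since each terminal decides for itself how to draw control characters, one way to get the same output everywhere is to escape them before printing. The following is only a sketch, and the helper name escape_controls is hypothetical.

```python
def escape_controls(text):
    """Replace non-printable characters with a visible \\xNN-style escape."""
    return "".join(
        # str.isprintable() is False for control codes such as \x00-\x1f.
        ch if ch.isprintable() else "\\x{:02x}".format(ord(ch))
        for ch in text
    )


print(escape_controls("Hello ,\x00\x01\x02world!"))
# Hello ,\x00\x01\x02world!
```

Printing repr() of the string is a quicker alternative when the surrounding quotes do not matter.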
Duplicate of "Emoji symbols/emoticons in Python IDLE":
As per @mata:
Tcl (and therefore tkinter and IDLE) supports only characters in the 16-bit range (U+0000-U+FFFF), so you can't.

Alphabetical Grid using python3

How do I write a function grid that returns an alphabetical grid of size NxN, where a = 0, b = 1, c = 2, ... in Python?
Example:
a b c d
b c d e
c d e f
d e f g
Here I tried to write a script using 3 for loops, but it prints the whole alphabet:
def grid(N):
    for i in range(N):
        for j in range(N):
            for k in range(ord('a'),ord('z')+1):
                print(chr(k))
    pass
Not the most elegant, but gets the job done.
import string
def grid(N):
    i = 0
    for x in range(N):
        for y in string.ascii_lowercase[i:N+i]:
            print(y, end=" ")
        i += 1
        print()
grid(4)
Output
a b c d
b c d e
c d e f
d e f g
Extending @MichHeng's suggestion, and using a list comprehension:
letters = [chr(x) for x in range(ord('a'),ord('z')+1)]
def grid(N):
    for i in range(N):
        # j runs over the N letters of row i
        print(' '.join([letters[j] for j in range(i,N+i)]))
grid(4)
output is
a b c d
b c d e
c d e f
d e f g
You have specified for k in range(ord('a'),ord('z')+1) which prints out the entire series from 'a' to 'z'. What you probably need is a reference list comprehension to pick your letters from, for example
[chr(x) for x in range(ord('a'),ord('z')+1)]
Try this:
letters = [chr(x) for x in range(ord('a'),ord('z')+1)]
def grid(N):
    for i in range(N):
        for j in range(i, N+i):
            print(letters[j], end=' ')
            if j==N+i-1:
                print('') #to move to next line
grid(4)
Output
a b c d
b c d e
c d e f
d e f g
Do you need to add a check for N<=13 ?
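On the N<=13 point: the last row needs the letter at index 2N-2, so the 26-letter alphabet runs out once N exceeds 13. Here is a sketch that returns the grid as a string, as the question asks, and rejects sizes that do not fit. Raising ValueError is a design choice made here and does not come from any of the answers.

```python
import string


def grid(n):
    """Return an n x n alphabetical grid as a newline-separated string."""
    if not 1 <= n <= 13:
        # Row n-1 ends at letter index 2n-2, which must stay below 26.
        raise ValueError("n must be between 1 and 13")
    letters = string.ascii_lowercase
    return "\n".join(" ".join(letters[row:row + n]) for row in range(n))


print(grid(4))
# a b c d
# b c d e
# c d e f
# d e f g
```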

how to speed up writing a large string into a file in python

So I have a 1 GB input txt file (1 million lines * 10 columns), and I am using Python to process it: I compute some information for each of the 1 M lines, append it to a string, and eventually save the string. When I ran my script, the process got slower and slower as the string grew. Is it possible to append each line to the output file and discard the previously buffered line to reduce memory usage? Thank you. Example code:
import pandas as pd
# main_df.txt has more than 1 million lines and 10 columns
main_df = pd.read_csv('main_df.txt')
"""
processing main_df into new_df, but new_df still has 1 M lines in the end
"""
sum_df = ''
# I'm guessing sum_df gets super big here as it goes, which uses up memory and slows the process .
# I have a bunch of complex loops, to simplify, I will just make an example for one single loop:
for i in range(len(new_df)):
    sum_df += new_df.loc[i, 1] + '\t' + new_df.loc[i, 3] + '\t' + new_df.loc[i, 5] + '\n'
with open('out.txt', 'w') as w:
    w.write(sum_df)
Hard to tell what your goal is here, but a few things might help. Here is an example df.
import string
import numpy as np
import pandas as pd

# ten rows of random lowercase letters in seven columns labelled 0..6
new_df = pd.DataFrame({0: np.random.choice(list(string.ascii_lowercase), size=10),
                       1: np.random.choice(list(string.ascii_lowercase), size=10),
                       2: np.random.choice(list(string.ascii_lowercase), size=10),
                       3: np.random.choice(list(string.ascii_lowercase), size=10),
                       4: np.random.choice(list(string.ascii_lowercase), size=10),
                       5: np.random.choice(list(string.ascii_lowercase), size=10),
                       6: np.random.choice(list(string.ascii_lowercase), size=10)})
print(new_df)
   0  1  2  3  4  5  6
0  z  k  o  m  s  k  w
1  x  g  k  k  h  b  v
2  o  y  m  r  g  l  r
3  i  n  m  q  o  j  h
4  r  d  s  r  s  p  s
5  t  o  d  w  e  b  a
6  t  z  w  y  q  s  n
7  r  r  d  x  b  s  s
8  g  v  h  m  w  c  l
9  r  v  y  i  w  i  z
Your code outputs:
sum_df = ''  # this is a string, not a df
for i in range(len(new_df)):
    sum_df += new_df.loc[i, 1] + '\t' + new_df.loc[i, 3] + '\t' + new_df.loc[i, 5] + '\n'
print(sum_df)
i k z
x g o
y l x
g s l
p h e
u s v
r u l
m j e
q k f
d p b
I'm not really sure what your other loops are supposed to do, but the one in your example looks like it's just taking columns 1, 3, and 5. So rather than a for loop, you could do something like this.
sum_df = new_df[[1,3,5]]
print(sum_df)
   1  3  5
0  k  m  k
1  g  k  b
2  y  r  l
3  n  q  j
4  d  r  p
5  o  w  b
6  z  y  s
7  r  x  s
8  v  m  c
9  v  i  i
Then save it to a .txt with something like this.
sum_df.to_csv('new_df.txt', header=False, index=False, sep='\t')
Generally speaking, you want to avoid looping over DataFrames. If you need to do something more complex than the example, you can use DataFrame.apply() to apply a custom function along an axis of the df. If you must loop over the df, df.itertuples() or df.iterrows() are preferable to indexing with .loc inside a range loop, as they use a generator, as mentioned in Datanovice's comment.
I eventually figured it out...
w = open('out.txt', 'a')
for i in range(len(new_df)):
    sum_df = new_df.loc[i, 1] + '\t' + new_df.loc[i, 3] + '\t' + new_df.loc[i, 5] + '\n'
    w.write(sum_df)
w.close()
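The same write-as-you-go idea can be sketched without holding the whole table in memory at all. The version below uses the standard csv module rather than pandas, so it is a substitute technique, not the poster's code. It assumes the input is comma-separated and that columns are picked by position; the function name stream_columns is made up for this example.

```python
import csv


def stream_columns(src_path, dst_path, columns=(1, 3, 5)):
    """Copy the chosen columns of a CSV file to a tab-separated file, row by row."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            # Only the current row is held in memory; it is written immediately.
            writer.writerow([row[c] for c in columns])
```

A call such as stream_columns('main_df.txt', 'out.txt') would then process the file one line at a time, and the with block closes both files even if an error occurs partway through.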

Trying to verify last position of a string

I'm trying to verify that the last char is not in my list:
def acabar_char(input):
    list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
    tam = 0
    tam = (len(input)-1)
    for char in input:
        if char[tam] in list_chars:
            return False
        else:
            return True
When I try this, I get this error:
if char[tam] in list_chars:
IndexError: string index out of range
You can index from the end (of a string or a list) with negative numbers:
def acabar_char(input, list_chars):
    return input[-1] not in list_chars
It seems that you are trying to assert that the last element of an input string (or also list/tuple) is NOT in a subset of disallowed chars.
Currently, your loop never gets past the first iteration because you return inside the loop, so the last element of the input only gets checked if the input has a length of 1.
I suggest something like this instead (also using the string.ascii_letters definition):
import string
DISALLOWED_CHARS = string.ascii_letters + string.digits
def acabar_char(val, disallowed_chars=DISALLOWED_CHARS):
    if len(val) == 0:
        return False
    return val[-1] not in disallowed_chars
Does this work for you?
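To make the cause of the IndexError concrete: inside the loop, char is a one-character string, so indexing it with len(input)-1 fails for any input longer than one character. Here is a small sketch. The sample word and the helper name last_char_is_alnum are chosen for illustration only.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)


def last_char_is_alnum(text):
    """Return True if text is non-empty and ends in an ASCII letter or digit."""
    return bool(text) and text[-1] in ALNUM


word = "abc!"
for char in word:
    # Each item produced by the loop is a single character.
    assert len(char) == 1

try:
    "a"[len(word) - 1]  # same as indexing a one-character string with 3
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: string index out of range

print(last_char_is_alnum("abc!"))  # False
print(last_char_is_alnum("abc1"))  # True
```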
You are already iterating over the input in that for loop, so there's no need to use indices. You can use a list comprehension as the other answer suggests, but I'm guessing you're trying to learn Python, so here is a way to rewrite your function.
list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
def acabar_char(input):
    for char in input:
        if char in list_chars:
            return False
    return True
list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
def acabar_char(input):
    if input in list_chars:
        print('True')
