How to retrieve the original string from UTF-8 file?

I am doing some web scraping with python and BeautifulSoup.
body = soup.find("article")
tempvar = body.find()
fuu = open('tempfile', 'w')
tempvar = tempvar.encode('utf-8')
fuu.write(str(tempvar))
fuu.close()
fupa = open('tempfile')
joji = BeautifulSoup(fupa,'html.parser')
fupa.close()
print(joji)
tempvar contains HTML, sometimes with emojis.
I want to use the contents of the file tempfile later in a real HTML file.
print(joji) produces something like this:
<b>mencapai\xc2\xa0batas aksara 140</b>, tapi sudah tentu itu tidak termasuk semua <i>tweet </i>yang tak pernah dihantar kerana pengguna tidak boleh nak luahkan apa yang mereka mahukan. Selepas <b>mengaktifkan aksara 280</b> pada <b>sejumlah kecil akaun </b>yang bertuah, <b>Twitter </b>mengatakan <b>hanya 1%</b> sahaja <b>pengguna yang capai had aksara 280</b>. Tulis panjang\xc2\xb2 nak buat karangan ka. \xf0\x9f\x98\x9c<br/>\n<br/>\nIa juga jarang berlaku bagi pengguna untuk mencapai aksara 280, hanya <b>2%</b> dari <i>tweet </i><b>melebihi aksara 190</b>. <b>Had aksara tweet sebanyak 280 </b>juga <b>mendapat lebih <i>likes </i>dan <i>retweets </i></b>daripada had aksara <i>tweet </i>sebanyak 140. \xf0\x9f\x98\x8a<br/>\n<br/>

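tempvar is a Unicode string. Do not call .encode() and then str() on it: str() of a bytes object yields its b'...' representation, which is where the \xc2\xa0 escapes come from. To write it correctly to a file:
with open('tempfile', 'w', encoding='utf8') as fuu:
    fuu.write(tempvar)
Read it back in with:
with open('tempfile', encoding='utf8') as fupa:
    ...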
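A minimal round-trip sketch of the approach above. The sample HTML string and the temporary directory are assumptions made for illustration; they are not taken from the scraped page. It also shows what the question's str(tempvar.encode('utf-8')) call actually writes.

```python
import os
import tempfile

# Hypothetical sample: HTML text with a superscript two and an emoji.
html = "<b>Tulis panjang\u00b2</b> \U0001F61C"

# Write and read back with an explicit encoding (temporary path is an assumption).
path = os.path.join(tempfile.mkdtemp(), "tempfile")
with open(path, "w", encoding="utf8") as fuu:
    fuu.write(html)
with open(path, encoding="utf8") as fupa:
    restored = fupa.read()

# What the question's code wrote instead: the repr of a bytes object.
escaped = str(html.encode("utf-8"))

print(restored == html)  # the text survives unchanged
print(escaped)           # starts with b' and contains \x.. escapes
```

The file on disk holds real UTF-8 bytes in the first case, so any later reader that opens it with encoding='utf8' gets the original characters back.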

Related

Check data column is same or not with Pandas

I want to check whether the 'tokenizing' and 'lemmatization' columns are the same in each row, as in the table below, but my code gives me an error.
tokenizing | lemmatization | check
[pergi, untuk, melakukan, penanganan, banjir] | [pergi, untuk, laku, tangan, banjir] | False
[baca, buku, itu, asik] | [baca, buku, itu, asik] | True
from spacy.lang.id import Indonesian
import pandas as pd
nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()
data = [
'pergi untuk melakukan penanganan banjir',
'baca buku itu asik'
]
df = pd.DataFrame({'text': data})
#Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]
#Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]
df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)
#Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df
How can I compare them?
Result before the error, from df.to_clipboard():
text tokenizing lemmatization
0 pergi untuk melakukan penanganan banjir [pergi, untuk, melakukan, penanganan, banjir] [pergi, untuk, laku, tangan, banjir]
1 baca buku itu asik [baca, buku, itu, asik] [baca, buku, itu, asik]
Update
The error is fixed; it was caused by a typo. But after fixing the typo, every result is False. What I want is the result shown in the table.

Based on your code, you left out the second i in df['lemmatizaton'].
So change
df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)
to
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)
Then it may work.
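The typo fix does not address the all-False result mentioned in the update. One likely cause, stated here as an assumption, is that the 'tokenizing' column holds spaCy Token objects while 'lemmatization' holds plain strings, so the two lists never compare equal. The sketch below assumes both columns hold lists of strings (for example by returning token.text from the tokenizer) and compares them row by row. The data is typed in by hand from the question's table, so spaCy is not needed to run it.

```python
import pandas as pd

# Hand-typed from the question's table; both columns hold lists of plain strings.
df = pd.DataFrame({
    "tokenizing": [
        ["pergi", "untuk", "melakukan", "penanganan", "banjir"],
        ["baca", "buku", "itu", "asik"],
    ],
    "lemmatization": [
        ["pergi", "untuk", "laku", "tangan", "banjir"],
        ["baca", "buku", "itu", "asik"],
    ],
})

# Compare the two lists in each row with ordinary Python list equality.
df["check"] = [tok == lem for tok, lem in zip(df["tokenizing"], df["lemmatization"])]
print(df["check"].tolist())
```

A plain list comprehension is used instead of Series.eq because each cell is itself a list, and list equality is the comparison the table asks for.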

Normalization words for sentiment analysis

I'm currently doing sentiment analysis and have run into a problem.
I have a large normalization dictionary for words, and I want to normalize the text before tokenizing it, as in this example:
data | normal
kamu knp sayang | kamu kenapa sayang
drpd sedih mending belajar | dari pada sedih mending belajar
dmna sekarang | di mana sekarang
knp: kenapa
drpd: dari pada
dmna: di mana
This is my code:
import pandas as pd
slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})
normalisasi = {}
for index, row in slang.iterrows():
    if row[0] not in normalisasi:
        normalisasi[row[0]] = row[1]
def normalized_term(document):
    return [normalisasi[term] if term in normalisasi else term for term in document]
df['normal'] = df['data'].apply(normalized_term)
df
But the result looks like this:
[image: result]
I want the result to look like the example table.

pandas has a utility named str.replace that replaces one substring with another, or finds and replaces regex patterns. You can find full documentation here. Your desired output would have appeared like this:
UPDATE
There were two things wrong with the answer:
Replacement must match whole words only, not substrings inside other words.
The changes made for each entry in the slang file must be kept, not discarded.
So it would be like this:
import pandas as pd
df = pd.read_excel('data bersih.xlsx')
slang = pd.read_excel('slang.xlsx')
df['normal'] = df.text
for idx, row in slang.iterrows():
    df['normal'] = df.normal.str.replace(r"\b"+row['before']+r"\b", row['after'], regex=True)
output:
text \
0 hari ini udh mulai ppkm yaa
1 mohon info apakah pgs pasar turi selama ppkm b...
2 di rumah aja soalnya lagi ppkm entah bakal nga...
3 pangkal penanganan pandemi di indonesia yang t...
4 ppkm mikro anjingggggggg
... ...
9808 drpd nonton sinetron mending bagi duit kayak g...
9809 ppkm pelan pelan kalau masukin
9810 masih ada kepala desa camat bahkan kepala daer...
9811 aku suka ppkm tapi tanpa pp di depannya
9812 menteri ini perlu tidak dibayarkan gajinya set...
normal
0 hari ini sudah mulai ppkm yaa
1 mohon informasi apakah pgs pasar turi selama p...
2 di rumah saja soalnya lagi ppkm entah bakal se...
3 pangkal penanganan pandemi di indonesia yang t...
4 ppkm mikro anjingggggggg
... ...
9808 dari pada nonton sinema elektronik lebih baik ...
9809 ppkm pelan pelan kalau masukkan
9810 masih ada kepala desa camat bahkan kepala daer...
9811 aku suka ppkm tapi tanpa pulang pergi di depannya
9812 menteri ini perlu tidak dibayarkan gajinya set...
[9813 rows x 2 columns]
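The answer's code reads two Excel files that are not available here, so the following is a self-contained sketch of the same whole-word replacement using the small frames from the question. Wrapping each slang term in re.escape is an addition not present in the answer; it is there on the assumption that a slang entry could contain regex metacharacters.

```python
import re

import pandas as pd

# Frames copied from the question's code.
slang = pd.DataFrame({"before": ["knp", "dmna", "drpd"],
                      "after": ["kenapa", "di mana", "dari pada"]})
df = pd.DataFrame({"data": ["kamu knp sayang", "drpd sedih mending bermain"]})

# Start from the raw text and keep the result of every replacement.
df["normal"] = df["data"]
for before, after in zip(slang["before"], slang["after"]):
    # \b anchors restrict the match to whole words.
    pattern = r"\b" + re.escape(before) + r"\b"
    df["normal"] = df["normal"].str.replace(pattern, after, regex=True)

print(df["normal"].tolist())
```

Because each pass assigns back to df["normal"], later slang entries operate on text that earlier entries have already normalized.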

Append list to existing csv according the header

I have a CSV file with the header
KLM1,KLM2,KLM3
and I have a list
['1', 'Tafsir An-Nas (114) Aya 6', '4-6. Aku berlindung kepada-Nya dari kejahatan bisikan setan yang bersembunyi pada diri manusia dan selalu bersamanya layaknya darah yang mengalir di dalam tubuhnya, yang membisikkan kejahatan dan kesesatan ke dalam dada manusia dengan cara yang halus, lihai, licik, dan menjanjikan secara terus-menerus. Aku berlindung kepada-Nya dari setan pembisik kejahatan dan kesesatan yang berasal dari golongan jin, yakni makhluk halus yang tercipta dari api, dan juga dari golongan manusia yang telah menjadi budak setan.']
but when I save it with this code:
list1 = ['1',''+str(test)+'',''+arti2+'']
with open('Alquran.csv', 'a') as file:
    coba=csv.writer(file)
    coba.writerows(list1)
it only fills in KLM1.
How can I save the list so that its items line up with the header columns?

You need to provide all rows to writerows(...): it takes a list of rows, and each row is a list of columns:
import csv
header = ["h1","h2","h3"]
row1 = [11,12,13]
row2 = [21,22,23]
# always use newline="" for csv-module opened file
with open('Alquran.csv', 'a', newline="") as file:
    coba=csv.writer(file)
    # list of rows, each row is a list of columns
    coba.writerows( [header, row1, row2] )
with open('Alquran.csv',"r") as r:
    print(r.read())
Output:
h1,h2,h3
11,12,13
21,22,23
To later add more information:
another_row = [31,32,33]
with open('Alquran.csv', 'a', newline="") as file:
    coba=csv.writer(file)
    # list of 1 row, containing 3 columns
    coba.writerows( [another_row] )
with open('Alquran.csv',"r") as r:
    print(r.read())
Output after adding:
h1,h2,h3
11,12,13
21,22,23
31,32,33
Documentation:
csvwriter.writerows(rows)
For writing rows one at a time, read "Difference between writerow() and writerows() methods of Python csv module".
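Applied to the question's single list, the simplest option is probably writerow (singular), which writes one list as one row. The sketch below writes to an in-memory buffer rather than to Alquran.csv so that it leaves no file behind. The third field is a shortened placeholder, not the full text from the question.

```python
import csv
import io

# One record for the three header columns; the last value is a placeholder.
list1 = ["1", "Tafsir An-Nas (114) Aya 6", "placeholder text, with a comma"]

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
coba = csv.writer(buffer)
coba.writerow(["KLM1", "KLM2", "KLM3"])  # header row
coba.writerow(list1)                     # one list -> one row, three columns

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows the three values landed in three separate columns.
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows[1])
```

Passing the flat list to writerows instead makes the writer treat each string as its own row, which is why the values did not line up with the header.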

Spaces in middle of each row in file csv

I have a problem with my CSV file. It is saved with a blank line between each row, and I don't know why. How do I solve this? I'm asking because I couldn't find any answer or solution to this.
Here is the code:
import csv
import random
def dict_ID_aeropuertos():
    with open('AeropuertosArg.csv') as archivo_csv:
        leer = csv.reader(archivo_csv)
        dic_ID = {}
        for linea in leer:
            dic_ID.setdefault(linea[0],linea[1])
        archivo_csv.close()
    return dic_ID
def ruteoAleatorio():
    dic_ID = dict_ID_aeropuertos()
    lista_ID = list(dic_ID.keys())
    cont = 0
    lista_rutas = []
    while (cont < 50):
        r1 = random.choice(lista_ID)
        r2 = random.choice(lista_ID)
        if (r1 != r2):
            t = (r1,r2)
            if (t not in lista_rutas):
                lista_rutas.append(t)
                cont += 1
    with open('rutasAeropuertos.csv', 'w') as archivo_rutas:
        escribir = csv.writer(archivo_rutas)
        escribir.writerows(lista_rutas)
        archivo_rutas.close()
ruteoAleatorio()
Here is the CSV file AeropuertosArg.csv:
1,Aeroparque Jorge Newbery,Ciudad Autonoma de Buenos Aires,Ciudad Autonoma de Buenos Aires,-34.55803,-58.417009
2,Aeropuerto Internacional Ministro Pistarini,Ezeiza,Buenos Aires,-34.815004,-58.5348284
3,Aeropuerto Internacional Ingeniero Ambrosio Taravella,Cordoba,Cordoba,-31.315437,-64.21232
4,Aeropuerto Internacional Gobernador Francisco Gabrielli,Ciudad de Mendoza,Mendoza,-32.827864,-68.79849
5,Aeropuerto Internacional Teniente Luis Candelaria,San Carlos de Bariloche,Rio Negro,-41.146714,-71.16203
6,Aeropuerto Internacional de Salta Martin Miguel de Guemes,Ciudad de Salta,Salta,-24.84423,-65.478412
7,Aeropuerto Internacional de Puerto Iguazu,Puerto Iguazu,Misiones,-25.731778,-54.476181
8,Aeropuerto Internacional Presidente Peron,Ciudad de Neuquen,Neuquen,-38.952137,-68.140484
9,Aeropuerto Internacional Malvinas Argentinas,Ushuaia,Tierra del Fuego,-54.842237,-68.309701
10,Aeropuerto Internacional Rosario Islas Malvinas,Rosario,Santa Fe,-32.916887,-60.780391
11,Aeropuerto Internacional Comandante Armando Tola,El Calafate,Santa Cruz,-50.283977,-72.053641
12,Aeropuerto Internacional General Enrique Mosconi,Comodoro Rivadavia,Chubut,-45.789435,-67.467498
13,Aeropuerto Internacional Teniente General Benjamin Matienzo,San Miguel de Tucuman,Tucuman,-26.835888,-65.108361
14,Aeropuerto Comandante Espora,Bahia Blanca,Buenos Aires,-38.716152,-62.164955
15,Aeropuerto Almirante Marcos A. Zar,Trelew,Chubut,-43.209957,-65.273405
16,Aeropuerto Internacional de Resistencia,Resistencia,Chaco,-27.444926,-59.048739
17,Aeropuerto Internacional Astor Piazolla,Mar del Plata,Buenos Aires,-37.933205,-57.581518
18,Aeropuerto Internacional Gobernador Horacio Guzman,San Salvador de Jujuy,Jujuy,-24.385987,-65.093755
19,Aeropuerto Internacional Piloto Civil Norberto Fernandez,Rio Gallegos,Santa Cruz,-51.611788,-69.306315
20,Aeropuerto Domingo Faustino Sarmiento,San Juan,San Juan,-31.571814,-68.422568

The problem is that the csv module's writerows has its own newline logic, which interferes with the default newline behaviour of open().
Fix it like this:
with open('rutasAeropuertos.csv', 'w', newline='' ) as archivo_rutas:
#                                      ^^^^^^^^^^
This is also documented in the example in the documentation: csv.writer(csvfile, dialect='excel', **fmtparams):
If csvfile is a file object, it should be opened with newline='' [1]
with a link to a footnote telling you:
[1] If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
You are using Windows, which uses \r\n line endings, so an extra \r is added to every row. That is what produces the blank lines in your output.
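The effect can be reproduced on any platform with a small sketch. It simulates the Windows default by wrapping an in-memory byte buffer in a text layer that translates \n to \r\n; that simulation and the write_rows helper are assumptions made for illustration, not part of the question's code.

```python
import csv
import io

def write_rows(newline):
    """Write two rows through a text layer using the given newline mode; return raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerows([("9", "3"), ("16", "10")])
    text.flush()
    data = raw.getvalue()
    text.detach()  # release the buffer without closing it
    return data

# newline="\r\n" mimics what open() does by default on Windows.
windows_default = write_rows("\r\n")
# newline="" is what the csv documentation recommends.
recommended = write_rows("")

print(windows_default)
print(recommended)
```

The csv writer already ends each row with \r\n. The translating text layer then turns the \n into another \r\n, producing \r\r\n, which many programs display as an empty line between rows.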
Full code with some optimizations:
import csv
import random
def dict_ID_aeropuertos():
    with open('AeropuertosArg.csv') as archivo_csv:
        leer = csv.reader(archivo_csv)
        dic_ID = {}
        for linea in leer:
            dic_ID.setdefault(linea[0],linea[1])
    return dic_ID
def ruteoAleatorio():
    dic_ID = dict_ID_aeropuertos()
    lista_ID = list(dic_ID.keys())
    lista_rutas = set() # a set only holds unique values
    while (len(lista_rutas) < 50): # simply check the length of the set
        r1,r2 = random.sample(lista_ID, k=2) # draw 2 different ones
        lista_rutas.add( (r1,r2) ) # you can not add duplicates, no need to check
    with open('rutasAeropuertos.csv', 'w', newline='' ) as archivo_rutas:
        escribir = csv.writer(archivo_rutas)
        escribir.writerows(lista_rutas)
ruteoAleatorio()
Output:
9,3
16,10
15,6
[snipp lots of values]
13,14
13,7
20,4

Python: Error in function [closed]

I have a function called loopingSpace. It takes no parameters.
def loopingSpace():
    for i in range (3):
        print ""
        i +=1
Whenever it is called, it prints three blank lines.
For example, if I type
def loopingSpace():
    for i in range (3):
        print ""
        i +=1
print"Hi"
loopingSpace()
print"Hi"
it nicely outputs
Hi



Hi # As you can see, three blank lines.
However, when I put this function in a much larger program like this:
from random import randint
#Variables
#--------------------------------------------------------------------
playername=[]
#Each player list records the score. Cash/Stock value/Total assets/Health
playerOne=[0,0,0,0]
playerTwo=[0,0,0,0]
playerThree=[0,0,0,0]
#Each stock list records how many shares the player owns. A/B/C/D/E/F/G/H
stockOne=[0,0,0,0,0,0,0,0]
stockTwo=[0,0,0,0,0,0,0,0]
stockThree=[0,0,0,0,0,0,0,0]
#Each tool list records whether the player owns an item. Diamond/Buy/Sell/Diamond/Poison
toolOne=[0,0,0,0,0]
toolTwo=[0,0,0,0,0]
toolThree=[0,0,0,0,0]
#Price records the stock prices. A/B/C/D/E/F/G/H
price=[0,0,0,0,0,0,0,0]
#clockTracker. Day and turn counter
clockTracker=[0,0]
hari=["Senin","Selasa","Rabu","Kamis","Jumat","Sabtu","Minggu"]
listing=[playerOne,playerTwo,playerThree]
toolListing=[toolOne,toolTwo,toolThree]
#Stock names
stockName=["Alama Inc.","Bwah! Bwah! Bwah! LCD","CUIT! CUIT CV","Dong Inc.","Eeeeeeeeeeeeah!","Foo il company.","Gogogogo Ind.","Halllo."
#--------------------------------------------------------------------
# THIS IS THE CODE. THIS IS THE CODE. THIS IS THE CODE
def loopingSpace():
    for i in range (3):
        print ""
        i +=1
#
def startingTheGame():
    loopingSpace()
    print "Selamat datang di Stock Game."
    print "Anda mau [M]ain atau Baca [A]turan?"
    answer= raw_input(">")
    answerRecognizerOne(answer)
#
def answerRecognizerOne(inbox):
    inbox.lower()
    if inbox=="a":
        ruleExplainer()
    elif inbox=="m":
        gameStarter()
    else:
        loopingSpace()
        print "Syntax tidak dimengerti. Mohon ulangi."
        loopingSpace()
        startingTheGame()
#
def answerRecognizerTwo(inbox):
    inbox.lower()
    if inbox=="y":
        print "Kita akan mengambil kartu kesempatan lagi"
        cekKartuKesempatan()
    elif inbox=="n":
        print "Game akan dilanjutkan"
#
def ruleExplainer():
    loopingSpace()
    print "Aturan:"
    print "Dalam awal giliran kamu, kamu akan mengambil kartu kesempatan."
    print "Kartu kesempatan kamu akan memberikan kamu hak untuk mengubah harga saham."
    print "Lalu kamu bisa jual atau beli saham."
    print "Kamu juga dapat bekerja pada Weekend, sehingga kamu dapat uang tambahan."
    print "Setelah 33 hari. Peserta dengan uang tertinggi akan menang."
    loopingSpace()
    startingTheGame()
#
def gameSetUp():
    playerOne=[250,0,0]
    playerOne[2]=playerOne[0]+playerOne[1]
    playerTwo=[250,0,0]
    playerTwo[2]=playerTwo[0]+playerTwo[1]
    playerThree=[250,0,0]
    playerThree[2]=playerThree[0]+playerThree[1]
    price=[30,30,30,30,30,30,30,30]
    print playerOne
    print playerTwo
    print playerThree
    print price
    loopingSpace()
    print "Saya akan memberi kamu semua $250 untuk berinvestasi."
    print "Kamu juga akan memasuki dunia Wallsheet."
    print "Sebuah bursa saham di dunia Kryxban."
    print "Semoga beruntung."
    loopingSpace()
    answer= raw_input ("Tekan enter untuk melanjutkan")
    clockTracker=[1,0]
    print "Good Luck"
    loopingSpace()
    kartuKesempatan()
#
def refreshScore():
    playerOne[1]=0
    for i in range(8):
        playerOne[1]+=(stockOne[i]*price[i])
    playerTwo [1]=0
    for i in range(8):
        playerTwo[1]+=(stockTwo[i]*price[i])
    playerThree [1]=0
    for i in range(8):
        playerThree[1]+=(stockThree[i]*price[i])
    playerOne[2] = playerOne[0] + playerOne[1]
    playerTwo[2] = playerTwo[0] + playerTwo[1]
    playerThree[2] = playerThree[0] + playerThree[1]
# printScore function -> prints the scores
def printScore():
    print playername[0]+": Uang: $"+str(playerOne[0])+" Saham: $"+str(playerOne[1])+" Total: $"+str(playerOne[2])
    print playername[1]+": Uang: $"+str(playerTwo[0])+" Saham: $"+str(playerTwo[1])+" Total: $"+str(playerTwo[2])
    print playername[2]+": Uang: $"+str(playerThree[0])+" Saham: $"+str(playerThree[1])+" Total: $"+str(playerThree[2])
# kartuKesempatan function -> runs the Chance Card phase
def kartuKesempatan ():
    refreshScore()
    print "Sekarang adalah giliran " + playername[clockTracker[1]]
    print ""
    printScore()
    loopingSpace()
    print "Kamu mengambil kartu kesempatan"
    answer = raw_input ("Kamu siap? Tekan enter jika kamu siap?")
    cekKartuKesempatan()
#
def gameStarter():
    print loopingSpace()
    for i in range(3):
        answer= raw_input("Masukan nama pemain ke " + str(i+1)+ ">")
        playername.append(answer)
    gameSetUp()
#
def cekKartuKesempatan():
    foo = randint(1,2)
    if foo==1:
        print "**KAMU MENDAPAT $25**"
        print "Dompet kamu tiba-tiba memberat."
        print "Kamu mengecek dompetmu."
        print "Ada $25 muncul!"
        (listing[clockTracker[1]][0])+=25
        updateScore()
        checkForToool()
    elif foo==2:
        print "**SAHAM NAIK 10%**"
        woo=0
        for i in range(8):
            woo= floor.(price[i]/10)
            if woo != 0:
                price[i] += woo
                print "Saham "+stockName[i]+" naik $"+woo
        checkForTool()
def checkForTool()
    if (toolListing[clockTracker[1]][0])!=0:
        print "Anda mempunyai 'Kesempatan Extra'. Mau dipakai? ( Anda punya "+str.(toolListing[clockTracker[1]][0])+" ) [y]es /[n]o"
        answer= raw_input(">")
        answerRecognizerTwo(answer)
It threw this error:
  File "main.py", line 31
    def loopingSpace():
      ^
SyntaxError: invalid syntax
Note: I use an online executor.
Note: The online executors I tried use Python 2.7.4 and 2.7.5 respectively.
What am I doing wrong?
NOTE: When I delete the loopingSpace() function from the code completely, it then complains about the next function in line, which in this case is startingTheGame(), and so on.
PS: This script is structured as a game. I call one function (startingTheGame()), that function calls other functions, and so on. I'm pretty sure you get the point.
Sincerely,
DEO

The syntax error is actually on the line before it:
stockName=["Alama Inc.","Bwah! Bwah! Bwah! LCD","CUIT! CUIT CV",
"Dong Inc.","Eeeeeeeeeeeeah!","Foo il company.",
"Gogogogo Ind.","Halllo."
You have to add a closing bracket ]:
stockName=["Alama Inc.","Bwah! Bwah! Bwah! LCD","CUIT! CUIT CV",
"Dong Inc.","Eeeeeeeeeeeeah!","Foo il company.",
"Gogogogo Ind.","Halllo."]

Currently, your list is defined as:
stockName=["Alama Inc.","Bwah! Bwah! Bwah! LCD","CUIT! CUIT CV","Dong Inc.","Eeeeeeeeeeeeah!","Foo il company.","Gogogogo Ind.","Halllo."
Notice how there is no ] at the end of the list. Thus, you have technically not finished defining the list yet, and so Python keeps reading to continue it. It then reads the next line with text, which is def loopingSpace():. As [..., "Halllo." def loopingSpace() isn't proper syntax, a SyntaxError is raised at the line of the function definition.
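The behaviour can be checked without running the whole game by compiling a shortened version of the source. This sketch is written for Python 3, not the 2.7 interpreter from the question; the two-element list and the find_syntax_error helper are assumptions made for illustration. The line number the interpreter reports varies by version: Python 2.7 and older Python 3 releases point at the def line, while newer releases tend to point at the unclosed bracket itself.

```python
# Shortened stand-in for the question's source: the list is never closed.
source = 'stockName = ["Alama Inc.", "Halllo."\ndef loopingSpace():\n    pass\n'

def find_syntax_error(code):
    """Return the SyntaxError raised while compiling code, or None if it compiles."""
    try:
        compile(code, "main.py", "exec")
    except SyntaxError as err:
        return err
    return None

broken = find_syntax_error(source)
# Adding the closing bracket is the whole fix.
fixed = find_syntax_error(source.replace('"Halllo."', '"Halllo."]'))

print(broken)  # a SyntaxError; message and line number depend on the Python version
print(fixed)   # None: the corrected source compiles
```

This also explains the note in the question: deleting loopingSpace() only moves the error to whichever statement follows the unclosed list.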
