Normalizing words for sentiment analysis - python

I'm currently doing sentiment analysis and I've run into a problem.
I have a large normalization dictionary for words, and I want to normalize the text before tokenizing, like this example:
data                          normal
kamu knp sayang               kamu kenapa sayang
drpd sedih mending belajar    dari pada sedih mending belajar
dmna sekarang                 di mana sekarang
The slang mapping looks like this:
knp: kenapa
drpd: dari pada
dmna: di mana
This is my code:
import pandas as pd

slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'], 'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang', 'drpd sedih mending bermain']})

normalisasi = {}
for index, row in slang.iterrows():
    if row[0] not in normalisasi:
        normalisasi[row[0]] = row[1]

def normalized_term(document):
    return [normalisasi[term] if term in normalisasi else term for term in document]

df['normal'] = df['data'].apply(normalized_term)
df
But the result is not what I expect.
I want the result like the example table.

There is a utility in pandas named str.replace that allows us to replace one substring with another, or even find/replace patterns. You can find the full documentation here. It can produce your desired output, as shown below.
UPDATE
There were two things wrong with the first attempt:
You must only replace in whole-word mode, not on substrings.
After each entry in the slang file you must keep the changes, not discard them.
So it would be like this:
import pandas as pd

df = pd.read_excel('data bersih.xlsx')
slang = pd.read_excel('slang.xlsx')

df['normal'] = df.text
for idx, row in slang.iterrows():
    df['normal'] = df.normal.str.replace(r"\b" + row['before'] + r"\b", row['after'], regex=True)
output:
text \
0 hari ini udh mulai ppkm yaa
1 mohon info apakah pgs pasar turi selama ppkm b...
2 di rumah aja soalnya lagi ppkm entah bakal nga...
3 pangkal penanganan pandemi di indonesia yang t...
4 ppkm mikro anjingggggggg
... ...
9808 drpd nonton sinetron mending bagi duit kayak g...
9809 ppkm pelan pelan kalau masukin
9810 masih ada kepala desa camat bahkan kepala daer...
9811 aku suka ppkm tapi tanpa pp di depannya
9812 menteri ini perlu tidak dibayarkan gajinya set...
normal
0 hari ini sudah mulai ppkm yaa
1 mohon informasi apakah pgs pasar turi selama p...
2 di rumah saja soalnya lagi ppkm entah bakal se...
3 pangkal penanganan pandemi di indonesia yang t...
4 ppkm mikro anjingggggggg
... ...
9808 dari pada nonton sinema elektronik lebih baik ...
9809 ppkm pelan pelan kalau masukkan
9810 masih ada kepala desa camat bahkan kepala daer...
9811 aku suka ppkm tapi tanpa pulang pergi di depannya
9812 menteri ini perlu tidak dibayarkan gajinya set...
[9813 rows x 2 columns]
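For a quick self-check without the Excel files, here is a minimal sketch of the same whole-word replacement using the inline slang/data frames from the question (the column names before, after and data are taken from there):

import pandas as pd

slang = pd.DataFrame({'before': ['knp', 'dmna', 'drpd'],
                      'after': ['kenapa', 'di mana', 'dari pada']})
df = pd.DataFrame({'data': ['kamu knp sayang',
                            'drpd sedih mending belajar',
                            'dmna sekarang']})

df['normal'] = df['data']
for _, row in slang.iterrows():
    # \b matches word boundaries, so only whole words are replaced;
    # assigning back to df['normal'] keeps the earlier replacements
    df['normal'] = df['normal'].str.replace(r"\b" + row['before'] + r"\b",
                                            row['after'], regex=True)

print(df['normal'].tolist())
# ['kamu kenapa sayang', 'dari pada sedih mending belajar', 'di mana sekarang']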

Related

how to extract specific content from dataframe based on condition python

Consider the following pandas dataframe:
Here are two examples of ingredients_text:
farine de blé 34% (france), pépites de chocolat 20g (ue) (sucre, pâte de cacao, beurre de cacao, émulsifiant lécithines (tournesol), arôme) (cacao : 44% minimum), matière grasse végétale (palme), sucre, 8,5% chocolat(sucre, pâte de cacao, cacao et cacao maigre en poudre) (cacao: 38% minimum), 5,5% éclats de noix de pécan (non ue), poudres à lever : diphosphates carbonates de sodium, blancs d’œufs, fibres d'acacia, lactose et protéines de lait, sel. dont lait.
oignon 18g oil hell: kartoffelstirke, milchzucker, maltodextrin, reismehl. 100g produkt enthalten: 1559KJ ,energie 369 kcal lt;0.5g lt;0.1g 909 fett davon gesättigte fettsāuren kohlenhydrate davon ,zucker 26g
I separated the ingredients of each line into words with the following code:
for i in df['ingredients_text'][:].index:
    words = df["ingredients_text"][i].split(',')
    df["ingredients_text"][i] = words
Any idea how to extract the ingredients containing % or g from the text into another column called 'ingredient'?
For instance, the desired output should be:
['farine de blé 34%', 'pépites de chocolat 20g','cacao : 44%' ,'8,5% chocolat' ,'cacao: 38%', '5,5% éclats de noix de pécan']
['oignon 18g oil hell', '100g produkt enthalten', 'lt;0.5g', 'lt;0.1g' , '26g zucker']
df = pd.DataFrame({'ingredient_text': ['a%bgC, abc, a%, cg', 'xyx']})
ingredient_text
0 a%bgC, abc, a%, cg
1 xyx
Split the ingredients into a list
df['ingredient_text'] = df['ingredient_text'].str.split(',')
ingredient_text
0 [a%bgC, abc, a%, cg]
1 [xyx]
Search for your strings in the list
df['ingredient'] = df['ingredient_text'].apply(lambda x: [s for s in x if ('%' in s) or ('g' in s)])
ingredient_text ingredient
0 [a%bgC, abc, a%, cg] [a%bgC, a%, cg]
1 [xyx] []
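Putting both steps together on a shortened version of the first ingredient string from the question (a sketch; note the filter is the same simple membership test, so any word that merely contains the letter g would also be kept):

import pandas as pd

text = "farine de blé 34% (france), pépites de chocolat 20g (ue), sucre, sel"
df = pd.DataFrame({'ingredients_text': [text]})

# split each row into a list of ingredients, then keep only the parts mentioning % or g
df['ingredients_text'] = df['ingredients_text'].str.split(',')
df['ingredient'] = df['ingredients_text'].apply(
    lambda parts: [p.strip() for p in parts if '%' in p or 'g' in p])

print(df['ingredient'][0])
# ['farine de blé 34% (france)', 'pépites de chocolat 20g (ue)']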

how to create groups from string? [duplicate]

This question already has answers here:
How do I split a list into equally-sized chunks?
(66 answers)
Closed 2 years ago.
I have a string, and after splitting it I get 320 elements. I need to create 4 groups: the first three groups should have 100 elements each, and the last group the remaining 20 elements. Finally, every group must be a string again, not a list. How can I do that?
I can do that if I know how many elements I have:
s_l = 'AA AAL AAOI ACLS ADT ADTX ADVM AEL AIG ALEC ALLY ALT AMCX ANGI APA APDN APLS APPS APRN AQUA ARMK ARNC ARVN ATNM ATOM ATRA ATSG AVCT AVT AX AXDX BCLI BE BEAM BJRI BKE BKU BLDR BLNK BMRA BOOT BXS BYD CAKE CALX CAPR CARG CARR CARV CATM CC CCL CELH CEQP CFG CHEF CHRS CIT CLDX CLR CLSK CNK CNST CODX COLB COOP CPE CRS CTVA CUK CVET CVI CVM CYTK DAL DBX DCP DDS DEI DISCA DISCK DK DNB DRNA DVAX DXC ECOM EIGR ELAN ELF ELY ENVA EQ EQT EXEL FE FHI FIXX FL FLWS FMCI FORM FOX FOXA FRTA FUN GBX GIII GM GNMK GOCO GPRE GRAF GRPN GRWG GTHX GWB HALO HCC HCSG HEAR HFC HGV HIBB HMSY HOG HOME HP HSC HTH HWC IMUX IMVT INO INOV INSG INSM INT IOVA IRDM ITCI JELD JWN KMT KODK KPTI KSS KTB KTOS KURA LAKE LB LCA LL LPI LPRO LSCC LYFT MAXR MBOT MCRB MCS MD MDP MGM MGNX MIC MLHR MOS MRSN MTOR MXL MYGN NCLH NCR NK NKTR NLS NMIH NOVA NTLA NTNX NUAN NVST NXTC ODP OFC OKE OMER OMF OMI ONEM OSPN OSUR OXY OZK PACW PD PDCE PDCO PEAK PGNY PLAY PLCE PLT PLUG PPBI PRPL PRTS PRVB PS PSNL PSTX PSXP PTGX PVAC RCUS REAL REZI RKT RMBL RPAY RRGB RRR RVLV RVP RXN SANM SAVE SBGI SC SCPL SEAS SEM SFIX SFM SGMS SGRY SHLL SHOO SHYF SIX SKX SLQT SMCI SNAP SNDX SNV SONO SPAQ SPCE SPR SPWH SPWR SRG SRNE SSNT SSSS STOR SUM SUN SUPN SVMK SWBI SYF SYRS TBIO TCDA TCF TCRR TDC TEX TFFP TGTX THC TMHC TRGP TRIP TSE TUP TVTY UBX UCBI UCTT UFS UNFI UONE UPWK URBN USFD VCRA VERI VIAC VIRT VIVO VREX VSLR VSTO VXRT WAFD WBS WFC WHD WIFI WKHS WORK WORX WRK WRTC WW WWW WYND XEC XENT XPER XRX YELP ZGNX ZUMZ ZYXI'
split_s_l = s_l.split(" ")

part_1 = ' '.join(split_s_l[:100])
part_2 = ' '.join(split_s_l[100:200])
part_3 = ' '.join(split_s_l[200:300])
part_4 = ' '.join(split_s_l[300:])

for part in (part_1, part_2, part_3, part_4):
    print(part)
But I don't know how to do that if I don't know in advance how many elements the list has.
For a variable number of items, you can loop using:
sep = ' '
num = 100

split_s_l = s_l.split(sep)
for i in range(0, len(split_s_l), num):
    part = sep.join(split_s_l[i : i + num])
    print(part)
Bear in mind that for the last slice in the example case ([300:400]) it does not matter that there are only 320 elements -- just the last 20 items will be included (no error).
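A quick illustration of that slicing behaviour:

items = ['a', 'b', 'c']
print(items[1:10])  # ['b', 'c'] -- slicing past the end just stops at the end, no IndexError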
Something like this?
def break_up(s, nwords, separator):
    words = s.split()
    return [separator.join(words[n:n+nwords]) for n in range(0, len(words), nwords)]

print(break_up('a b c d e f g h', 3, ' '))
Result:
['a b c', 'd e f', 'g h']
Of course, you might call as print(break_up(s_l, 100, ' '))
Below
s_l = 'AA AAL AAOI ACLS ADT ADTX ADVM AEL AIG ALEC ALLY ALT AMCX ANGI APA APDN APLS APPS APRN AQUA ARMK ARNC ARVN ATNM ATOM ATRA ATSG AVCT AVT AX AXDX BCLI BE BEAM BJRI BKE BKU BLDR BLNK BMRA BOOT BXS BYD CAKE CALX CAPR CARG CARR CARV CATM CC CCL CELH CEQP CFG CHEF CHRS CIT CLDX CLR CLSK CNK CNST CODX COLB COOP CPE CRS CTVA CUK CVET CVI CVM CYTK DAL DBX DCP DDS DEI DISCA DISCK DK DNB DRNA DVAX DXC ECOM EIGR ELAN ELF ELY ENVA EQ EQT EXEL FE FHI FIXX FL FLWS FMCI FORM FOX FOXA FRTA FUN GBX GIII GM GNMK GOCO GPRE GRAF GRPN GRWG GTHX GWB HALO HCC HCSG HEAR HFC HGV HIBB HMSY HOG HOME HP HSC HTH HWC IMUX IMVT INO INOV INSG INSM INT IOVA IRDM ITCI JELD JWN KMT KODK KPTI KSS KTB KTOS KURA LAKE LB LCA LL LPI LPRO LSCC LYFT MAXR MBOT MCRB MCS MD MDP MGM MGNX MIC MLHR MOS MRSN MTOR MXL MYGN NCLH NCR NK NKTR NLS NMIH NOVA NTLA NTNX NUAN NVST NXTC ODP OFC OKE OMER OMF OMI ONEM OSPN OSUR OXY OZK PACW PD PDCE PDCO PEAK PGNY PLAY PLCE PLT PLUG PPBI PRPL PRTS PRVB PS PSNL PSTX PSXP PTGX PVAC RCUS REAL REZI RKT RMBL RPAY RRGB RRR RVLV RVP RXN SANM SAVE SBGI SC SCPL SEAS SEM SFIX SFM SGMS SGRY SHLL SHOO SHYF SIX SKX SLQT SMCI SNAP SNDX SNV SONO SPAQ SPCE SPR SPWH SPWR SRG SRNE SSNT SSSS STOR SUM SUN SUPN SVMK SWBI SYF SYRS TBIO TCDA TCF TCRR TDC TEX TFFP TGTX THC TMHC TRGP TRIP TSE TUP TVTY UBX UCBI UCTT UFS UNFI UONE UPWK URBN USFD VCRA VERI VIAC VIRT VIVO VREX VSLR VSTO VXRT WAFD WBS WFC WHD WIFI WKHS WORK WORX WRK WRTC WW WWW WYND XEC XENT XPER XRX YELP ZGNX ZUMZ ZYXI'
words = s_l.split(" ")
num_of_100_groups = int(len(words) / 100)

groups = []
for i in range(0, num_of_100_groups):
    groups.append(words[i * 100 : (i + 1) * 100])
groups.append(words[num_of_100_groups * 100:])

for sub_group in groups:
    print(' '.join(sub_group))

Append list to existing csv according the header

I have a CSV with this header:
KlM1,KLM2,KLM3
and I have this list:
['1', 'Tafsir An-Nas (114) Aya 6', '4-6. Aku berlindung kepada-Nya dari kejahatan bisikan setan yang bersembunyi pada diri manusia dan selalu bersamanya layaknya darah yang mengalir di dalam tubuhnya, yang membisikkan kejahatan dan kesesatan ke dalam dada manusia dengan cara yang halus, lihai, licik, dan menjanjikan secara terus-menerus. Aku berlindung kepada-Nya dari setan pembisik kejahatan dan kesesatan yang berasal dari golongan jin, yakni makhluk halus yang tercipta dari api, dan juga dari golongan manusia yang telah menjadi budak setan.']
but when I save with this code:
list1 = ['1', '' + str(test) + '', '' + arti2 + '']

with open('Alquran.csv', 'a') as file:
    coba = csv.writer(file)
    coba.writerows(list1)

the values just fill in KlM1.
How can I save the list so that it lines up with the header?
You need to provide all rows to writerows(...) - it takes a list of rows, and each row is a list of columns:
import csv

header = ["h1", "h2", "h3"]
row1 = [11, 12, 13]
row2 = [21, 22, 23]

# always use newline="" for a file opened for the csv module
with open('Alquran.csv', 'a', newline="") as file:
    coba = csv.writer(file)
    # list of rows, each row is a list of columns
    coba.writerows([header, row1, row2])

with open('Alquran.csv', "r") as r:
    print(r.read())
Output:
h1,h2,h3
11,12,13
21,22,23
To later add more information:
another_row = [31, 32, 33]

with open('Alquran.csv', 'a', newline="") as file:
    coba = csv.writer(file)
    # list of 1 row, containing 3 columns
    coba.writerows([another_row])

with open('Alquran.csv', "r") as r:
    print(r.read())
Output after adding:
h1,h2,h3
11,12,13
21,22,23
31,32,33
Docs:
csv.writer.writerows(...)
For separate row-writing, read: Difference between writerow() and writerows() methods of Python csv module
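For completeness, a small sketch of that difference ('example.csv' is just a placeholder name):

import csv

with open('example.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['h1', 'h2', 'h3'])        # writerow: a single row (a list of column values)
    w.writerows([[1, 2, 3], [4, 5, 6]])   # writerows: an iterable of such rows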

How to retrieve the original string from UTF-8 file?

I am doing some web scraping with python and BeautifulSoup.
body = soup.find("article")
tempvar = body.find()
fuu = open('tempfile', 'w')
tempvar = tempvar.encode('utf-8')
fuu.write(str(tempvar))
fuu.close()
fupa = open('tempfile')
joji = BeautifulSoup(fupa,'html.parser')
fupa.close()
print(joji)
tempvar would contain HTML stuff, sometimes with emojis.
I want to use the contents of the file tempfile later in a real html file.
The print(joji) produces something like this:
<b>mencapai\xc2\xa0batas aksara 140</b>, tapi sudah tentu itu tidak termasuk semua <i>tweet </i>yang tak pernah dihantar kerana pengguna tidak boleh nak luahkan apa yang mereka mahukan. Selepas <b>mengaktifkan aksara 280</b> pada <b>sejumlah kecil akaun </b>yang bertuah, <b>Twitter </b>mengatakan <b>hanya 1%</b> sahaja <b>pengguna yang capai had aksara 280</b>. Tulis panjang\xc2\xb2 nak buat karangan ka. \xf0\x9f\x98\x9c<br/>\n<br/>\nIa juga jarang berlaku bagi pengguna untuk mencapai aksara 280, hanya <b>2%</b> dari <i>tweet </i><b>melebihi aksara 190</b>. <b>Had aksara tweet sebanyak 280 </b>juga <b>mendapat lebih <i>likes </i>dan <i>retweets </i></b>daripada had aksara <i>tweet </i>sebanyak 140. \xf0\x9f\x98\x8a<br/>\n<br/>
str(tempvar) is already a Unicode string, so don't call .encode('utf-8') yourself. To write it correctly to a file, open the file with an explicit encoding:
with open('tempfile', 'w', encoding='utf8') as fuu:
    fuu.write(str(tempvar))
Read it back in with:
with open('tempfile', encoding='utf8') as fupa:
    ...
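A full round trip as a sketch, assuming soup is the already-parsed page from your scraping code:

from bs4 import BeautifulSoup

# body is the fragment scraped earlier, as in the question; soup is assumed
body = soup.find("article")

# write the fragment out as UTF-8 text (no manual .encode() needed)
with open('tempfile', 'w', encoding='utf8') as fuu:
    fuu.write(str(body))

# read it back and re-parse it; emoji and other non-ASCII characters survive
with open('tempfile', encoding='utf8') as fupa:
    joji = BeautifulSoup(fupa, 'html.parser')

print(joji)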

Parsing XML file with many children and grandchildren

I am very new to python. I have this very large xml file and I want to extract some data from it. Here is an excerpt:
<program>
<id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id>
<programID>3853</programID>
<orchestra>New York Philharmonic</orchestra>
<season>1842-43</season>
<concertInfo>
<eventType>Subscription Season</eventType>
<Location>Manhattan, NY</Location>
<Venue>Apollo Rooms</Venue>
<Date>1842-12-07T05:00:00Z</Date>
<Time>8:00PM</Time>
</concertInfo>
<worksInfo>
<work ID="52446*">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
<conductorName>Hill, Ureli Corelli</conductorName>
</work>
<work ID="8834*4">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="3642*">
<composerName>Hummel, Johann</composerName>
<workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle>
<soloists>
<soloist>
<soloistName>Scharfenberg, William</soloistName>
<soloistInstrument>Piano</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Hill, Ureli Corelli</soloistName>
<soloistInstrument>Violin</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Derwort, G. H.</soloistName>
<soloistInstrument>Viola</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Boucher, Alfred</soloistName>
<soloistInstrument>Cello</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Rosier, F. W.</soloistName>
<soloistInstrument>Contrabass</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="0*">
<interval>Intermission</interval>
</work>
<work ID="8834*3">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>Overture</movement>
<conductorName>Etienne, Denis G.</conductorName>
</work>
<work ID="8835*1">
<composerName>Rossini, Gioachino</composerName>
<workTitle>ARMIDA</workTitle>
<movement>Duet</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8837*6">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>FIDELIO, OP. 72</workTitle>
<movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8336*4">
<composerName>Mozart, Wolfgang Amadeus</composerName>
<workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle>
<movement>"Ach Ich liebte," Konstanze (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="5543*">
<composerName>Kalliwoda, Johann W.</composerName>
<workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle>
<conductorName>Timm, Henry C.</conductorName>
</work>
</worksInfo>
</program>
<program>
What I would like to do is extract the following pieces of information: programID, orchestra, season, eventType, work ID, soloistName, soloistInstrument, soloistRoles
Here is the code I am using:
import csv
import xml.etree.cElementTree as ET

tree = ET.iterparse('complete.xml.txt')
#root = tree.getroot()

for program in root.iter('program'):
    ID = program.findtext('id')
    programID = program.findtext('programID')
    orchestra = program.findtext('orchestra')
    season = program.findtext('season')
    for concert in program.findall('concertInfo'):
        event = concert.findtext('eventType')
    for worksInfo in program.findall('worksInfo'):
        for work in worksInfo.iter('work'):
            workid = work.get('ID')
            for soloists in work.iter('soloists'):
                for soloist in soloists.iter('soloist'):
                    soloname = soloist.findtext('soloistName')
                    soloinstrument = soloist.findtext('soloistInstrument')
                    solorole = soloist.findtext('soloistRoles')
                    #print(soloname, soloinstrument, solorole)
            #print(workid)
        #print(event)
    #print(programID, " , ", orchestra, " , ", season)

with open("nyphil.txt", "a") as nyphil:
    nyphilwriter = csv.writer(nyphil)
    nyphilwriter.writerow([programID, orchestra, season, event, workid, soloname.encode('utf-8'), soloinstrument, solorole])
nyphil.close()
When I run this code I only get the last soloistName and soloistInstrument. The outcome that I have in mind is sort of like repeated observations for each program. So I'd have something like:
13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S
13918,...., 3642*, Scharfenberg, William , Piano, A
13918,...., 3642*, Hill, Ureli Corelli , Violin, A
and so on until the last work ID:
13918,...., 8336*4 , Otto, Antoinette, Soprano, S
What I am getting is only the last work:
13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S
In the file there are over 15,000 programs like the example I posted. I want to parse all of them and extract the information I mentioned above. I am not entirely sure how to go about doing this, I've scoured the internet for a way to do this, but everything I tried just doesn't work!!
Your problem here is that you misunderstand the way loops work. Specifically, the values only change while you're in the loop:
for x in range(10):
    pass
print(x)  # prints 9
vs
for x in range(10):
    print(x)
Those are two different things. You're doing the former. What you need to do is something like this:
with open('nyphil.txt', 'w') as f:
    nyphilwriter = csv.writer(f)
    for program in root.iter('program'):
        id_ = program.findtext('id')
        program_id = program.findtext('programID')
        orchestra = program.findtext('orchestra')
        season = program.findtext('season')
        for concert in program.findall('concertInfo'):
            event = concert.findtext('eventType')
        for info in program.findall('worksInfo'):
            for work in info.iter('work'):
                work_id = work.get('ID')
                for soloists in work.iter('soloists'):
                    for soloist in soloists.iter('soloist'):
                        # Change this line to whatever you want to write out
                        nyphilwriter.writerow([id_, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])
The 13918 does not appear in your data. Leaving that aside, here's what I wrote, which appears to process your data successfully.
from lxml import etree

tree = etree.parse('test.xml')
programs = tree.xpath('.//program')

for program in programs:
    programID, orchestra, season = [program.xpath(_)[0].text for _ in ['programID', 'orchestra', 'season']]
    print(programID, orchestra, season)
    works = program.xpath('worksInfo/work')
    for work in works:
        workID = work.attrib['ID']
        soloistItems = work.xpath('soloists/soloist')
        for soloistItem in soloistItems:
            print(workID, soloistItem.find('soloistName').text,
                  soloistItem.find('soloistInstrument').text,
                  soloistItem.find('soloistRoles').text)
The script produces the following output.
3853 New York Philharmonic 1842-43
8834*4 Otto, Antoinette Soprano S
3642* Scharfenberg, William Piano A
3642* Hill, Ureli Corelli Violin A
3642* Derwort, G. H. Viola A
3642* Boucher, Alfred Cello A
3642* Rosier, F. W. Contrabass A
8835*1 Otto, Antoinette Soprano S
8835*1 Horn, Charles Edward Tenor S
8837*6 Horn, Charles Edward Tenor S
8336*4 Otto, Antoinette Soprano S
One other thing to note: I wrapped your XML excerpt in a root element (an opening tag before the first <program> and the matching closing tag at the end), since the real data contains multiple <program> elements.
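If you also want the repeated-observation CSV described in the question (one row per soloist, with the program-level fields repeated), either traversal can be combined with csv.writer. Here is a sketch using the standard-library ElementTree; the input file name, output file name, and wrapping root element are assumptions:

import csv
import xml.etree.ElementTree as ET

# 'complete.xml' is a placeholder; the file is assumed to have a single root
# element wrapping all the <program> elements, as noted above.
tree = ET.parse('complete.xml')
root = tree.getroot()

with open('nyphil.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['programID', 'orchestra', 'season', 'eventType',
                     'workID', 'soloistName', 'soloistInstrument', 'soloistRoles'])
    for program in root.iter('program'):
        program_id = program.findtext('programID')
        orchestra = program.findtext('orchestra')
        season = program.findtext('season')
        event = program.findtext('concertInfo/eventType')
        for work in program.iter('work'):
            work_id = work.get('ID')
            for soloist in work.iter('soloist'):
                # one CSV row per soloist, repeating the program-level fields
                writer.writerow([program_id, orchestra, season, event, work_id,
                                 soloist.findtext('soloistName'),
                                 soloist.findtext('soloistInstrument'),
                                 soloist.findtext('soloistRoles')])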
