Many emoji characters are not read by python file read - python

I have a JSON file containing a dictionary of about 1500 emoji, and I want to import it into my Python code. I read the file and converted it to a Python dictionary, but I end up with only 143 records. How can I import all the emoji? This is my code:
import sys
import ast
file = open('emojidescription.json','r').read()
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
emoji_dictionary = ast.literal_eval(file.translate(non_bmp_map))
#word = word.replaceAll(",", " ");
keys = list(emoji_dictionary["emojis"][0].keys())
values = list(emoji_dictionary["emojis"][0].values())
file_write = open('output.txt','a')
print(len(keys))
for i in range(len(keys)):
    try:
        content = 'word = word.replace("{0}", "{1}")'.format(keys[i],values[i][0])
    except Exception as e:
        content = 'word = word.replace("{0}", "{1}")'.format(keys[i],'')
    #file.write()
    #print(keys[i],values[i])
    print(content)
file_write.close()
This is my input sample
{
    "emojis": [
        {
            "👨‍🎓": ["Graduate"],
            "©": ["Copy right"],
            "®": ["Registered"],
            "👨‍👩‍👧": ["family"],
            "👩‍❤️‍💋‍👩": ["love"],
            "™": ["trademark"],
            "👨‍❤‍👨": ["love"],
            "⌚": ["time"],
            "⌛": ["wait"],
            "⭐": ["star"],
            "🐘": ["Elephant"],
            "🐕": ["Cat"],
            "🐜": ["ant"],
            "🐔": ["cock"],
            "🐓": ["cock"],
This is my result, and the 143 denotes number of emoji.
143
word = word.replace("๏ฟฝโ€๏ฟฝโ€๏ฟฝโ€๏ฟฝ", "family")
word = word.replace("โ“‚", "")
word = word.replace("โ™ฅ", "")
word = word.replace("โ™ ", "")
word = word.replace("โŒ›", "wait")

I'm not sure why you're seeing only 143 records from an input of 1500 (your sample doesn't seem to display this behavior).
The setup doesn't seem to do anything useful, but what you're doing boils down to (simplified and skipping lots of details):
d = ...  # the JSON file read into a Python dict
keys = list(d.keys())      # list() is needed in Python 3 so the keys can be indexed
values = list(d.values())
for i in range(len(keys)):
    key = keys[i]
    value = values[i]
and that should be completely correct. There are better ways to do this in Python, however, like using the zip function:
d = ...  # the JSON file read into a Python dict
keys = d.keys()
values = d.values()
for key, value in zip(keys, values):  # zip picks pair-wise elements
    ...
or simply asking the dict for its items:
for key, value in d.items():
    ...
The json module makes reading and writing json much simpler (and safer), and using the idiom from above the problem reduces to this:
import json
with open('emoji.json', 'rb') as fp:
    emojis = json.load(fp)
with open('output.py', 'wb') as fp:
    for k, v in emojis['emojis'][0].items():
        val = u'word = word.replace("{0}", "{1}")\n'.format(k, v[0] if v else "")
        fp.write(val.encode('utf-8'))

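Why do you replace all emojis with 0xfffd in these lines?
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
emoji_dictionary = ast.literal_eval(file.translate(non_bmp_map))
Just don't do this! Most emoji lie outside the Basic Multilingual Plane, so after the translation many distinct keys become the same string and overwrite each other in the dictionary.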
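To see what that translation table does in isolation, here is a minimal sketch. The three sample keys are illustrative and not taken from the asker's file; they are written as escape sequences so the snippet does not depend on the encoding of the source file.

```python
import sys

# Map every code point outside the Basic Multilingual Plane to U+FFFD,
# exactly as the table in the question does.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xFFFD)

# Three distinct emoji keys (illustrative sample, not the asker's data).
source = {
    "\U0001F418": "Elephant",
    "\U0001F41C": "ant",
    "\U0001F414": "chicken",
}

# After translation every key is the single character U+FFFD,
# so later entries overwrite earlier ones.
translated = {key.translate(non_bmp_map): value for key, value in source.items()}

print(len(source))      # 3
print(len(translated))  # 1
```

Keys made only of BMP characters, such as the copyright sign, are left alone by the table, which would be consistent with a small number of records surviving.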
Using json:
import json
with open('emojidescription.json', encoding="utf8") as emojis:
    emojis = json.load(emojis)
with open('output.txt','a', encoding="utf8") as output:
    for emoji, text in emojis["emojis"][0].items():
        text = "" if not text else text[0]
        output.write('word = word.replace("{0}", "{1}")\n'.format(emoji, text))

Related

Python formatting data to csv file

I'm asking for help once more. My base code is ready: it first converts all negative values to 0, and then calculates the sum and the cumulative values of the CSV data:
import csv
from collections import defaultdict, OrderedDict
def convert(data):
    try:
        return int(data)
    except ValueError:
        return 0
with open('MonthData1.csv', 'r') as file1:
    read_file = csv.reader(file1, delimiter=';')
    delheader = next(read_file)
    data = defaultdict(int)
    for line in read_file:
        valuedata = max(0, sum([convert(i) for i in line[1:5]]))
        data[line[0].split()[0]] += valuedata
for key in OrderedDict(sorted(data.items())):
    print('{};{}'.format(key, data[key]))
print("")
previous_values = []
for key, value in OrderedDict(sorted(data.items())).items():
    print('{};{}'.format(key, value + sum(previous_values)))
    previous_values.append(value)
This code prints:
1.5.2018 245
2.5.2018 105
4.5.2018 87
1.5.2018 245
2.5.2018 350
4.5.2018 437
That's how I want it to print the data. First the sum of each day, and then the cumulative value. My question is, how can I format this data so it can be written to a new csv file with the same format as it prints it? So the new csv file should look like this:
I have tried to do it myself (with datetime) and searched for answers, but I just can't find a way. I hope to get a solution this time; I'd appreciate it massively.
The data file as csv: https://files.fm/u/2vjppmgv
Data file in pastebin https://pastebin.com/Tw4aYdPc
Hope this can be done with default libraries
Writing a CSV is simply a matter of writing values separated by commas (or semicolons in this case). A CSV is a plain text file (a .txt, if you will), so you can read and write it using Python's open() function if you'd like to.
You could actually do without the csv module if you wish; I included an example of this at the end.
This version uses only the libraries that were already in your original code.
import csv
from collections import defaultdict, OrderedDict
def convert(data):
    try:
        return int(data)
    except ValueError:
        return 0
file1 = open('MonthData1.csv', 'r')  # same file name (and case) as in the question
file2 = open('result.csv', 'w')
read_file = csv.reader(file1, delimiter=';')
delheader = next(read_file)
data = defaultdict(int)
for line in read_file:
    valuedata = max(0, sum([convert(i) for i in line[1:5]]))
    data[line[0].split()[0]] += valuedata
for key in OrderedDict(sorted(data.items())):
    file2.write('{};{}\n'.format(key, data[key]))
file2.write('\n')
previous_values = []
for key, value in OrderedDict(sorted(data.items())).items():
    file2.write('{};{}\n'.format(key, value + sum(previous_values)))
    previous_values.append(value)
file1.close()
file2.close()
A note on line endings: the output file is opened in text mode, so Python translates each \n into the platform's line separator when writing. Writing \n is therefore correct on Linux, macOS, and Windows alike. Do not use os.linesep as the line terminator for a file opened in text mode; on Windows that would produce \r\r\n.
As a sidenote this is an example of how you could read your CSV without the need for the CSV module:
data = [i.split(";") for i in open('MonthData1.csv').read().split('\n')]
If you had a more complex CSV file, especially if it had strings that could have semi-colons within, you'd better go for the CSV module.
The pandas library, mentioned in other answers, is a great tool. It can handle most needs you might have when dealing with CSV data.
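As an alternative to formatting each line by hand, the csv module can also do the writing. The following is a minimal sketch, not the answer's original method: the `daily` dict holds hypothetical totals standing in for the `data` dict built above, `write_report` is a helper name made up for this example, and the output goes to an in-memory buffer so the snippet runs without any input file.

```python
import csv
import io
from itertools import accumulate

# Hypothetical per-day totals, standing in for the `data` dict built above.
daily = {"1.5.2018": 245, "2.5.2018": 105, "4.5.2018": 87}

def write_report(daily_totals, out_file):
    """Write daily sums, a blank line, then running totals, ';'-separated."""
    writer = csv.writer(out_file, delimiter=";", lineterminator="\n")
    days = sorted(daily_totals)  # plain string sort; fine for this small sample
    sums = [daily_totals[day] for day in days]
    writer.writerows(zip(days, sums))
    out_file.write("\n")
    # accumulate() yields the running total without re-summing a list each time.
    writer.writerows(zip(days, accumulate(sums)))

buffer = io.StringIO()
write_report(daily, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be called with a real file object, for example one opened with open('result.csv', 'w', newline=''), which is how the csv documentation recommends opening files for a writer.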
This code creates a new csv file with the same format as what's printed.
import pandas as pd #added
import csv
from collections import defaultdict, OrderedDict
def convert(data):
    try:
        return int(data)
    except ValueError:
        return 0
keys = [] #added
data_keys = [] #added
with open('MonthData1.csv', 'r') as file1:
    read_file = csv.reader(file1, delimiter=';')
    delheader = next(read_file)
    data = defaultdict(int)
    for line in read_file:
        valuedata = max(0, sum([convert(i) for i in line[1:5]]))
        data[line[0].split()[0]] += valuedata
for key in OrderedDict(sorted(data.items())):
    print('{} {}'.format(key, data[key]))
    keys.append(key) #added
    data_keys.append(data[key]) #added
print("")
keys.append("") #added
data_keys.append("") #added
previous_values = []
for key, value in OrderedDict(sorted(data.items())).items():
    print('{} {}'.format(key, value + sum(previous_values)))
    keys.append(key) #added
    data_keys.append(value + sum(previous_values)) #added
    previous_values.append(value)
df = pd.DataFrame(data_keys,keys) #added
df.to_csv('new_csv_file.csv', header=False) #added
This is the version that does not use any imports at all
def convert(data):
    try:
        out = int(data)
    except ValueError:
        out = 0
    return out ### try to avoid multiple return statements
# Text mode ('r'/'w') so that split(';') and str.format work on str in Python 3
with open('MonthData1.csv', 'r') as file1:
    lines = file1.readlines()
data = [ [ d.strip() for d in l.split(';')] for l in lines[ 1 : : ] ]
myDict = dict()
for d in data:
    key = d[0].split()[0]
    value = max(0, sum([convert(i) for i in d[1:5]]))
    try:
        myDict[key] += value
    except KeyError:
        myDict[key] = value
s1=""
s2=""
accu = 0
for key in sorted( myDict.keys() ):
    accu += myDict[key]
    s1 += '{} {}\n'.format( key, myDict[key] )
    s2 += '{} {}\n'.format( key, accu )
with open( 'out.txt', 'w') as fPntr:
    fPntr.write( s1 + "\n" + s2 )
The keys here are date strings, though, so sorted() orders them as text rather than chronologically (for example, '10.5.2018' sorts before '2.5.2018'). So you might actually want to parse them with datetime, e.g.:
import datetime
# convert() is the function defined in the previous version
with open('MonthData1.csv', 'r') as file1:
    lines = file1.readlines()
data = [ [ d.strip() for d in l.split(';')] for l in lines[ 1 : : ] ]
myDict = dict()
for d in data:
    key = datetime.datetime.strptime( d[0].split()[0], '%d.%m.%Y' )
    value = max(0, sum([convert(i) for i in d[1:5]]))
    try:
        myDict[key] += value
    except KeyError:
        myDict[key] = value
s1=""
s2=""
accu = 0
for key in sorted( myDict.keys() ):
    accu += myDict[key]
    s1 += '{} {}\n'.format( key.strftime('%d.%m.%y'), myDict[key] )
    s2 += '{} {}\n'.format( key.strftime('%d.%m.%y'), accu )
with open( 'out.txt', 'w') as fPntr:
    fPntr.write( s1 + "\n" + s2 )
Note that I changed to the 2 digit year by using %y instead of %Y in the output. This formatting also adds a 0 to day and month.
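The difference between sorting date strings as text and sorting them as dates can be shown with a short sketch. The three dates are made-up sample values in the same d.m.Y format as the data file.

```python
from datetime import datetime

# Made-up sample dates in the same day.month.year format as the data file.
days = ["10.5.2018", "2.5.2018", "1.5.2018"]

# Plain string sort compares character by character.
as_text = sorted(days)

# Sorting with a parsed datetime as the key gives chronological order
# while keeping the original strings.
as_dates = sorted(days, key=lambda d: datetime.strptime(d, "%d.%m.%Y"))

print(as_text)   # ['1.5.2018', '10.5.2018', '2.5.2018']
print(as_dates)  # ['1.5.2018', '2.5.2018', '10.5.2018']
```

Using the parsed date only as the sort key keeps the original strings for output, so the year format does not have to change.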

split() issues with pdf extractText()

I'm working on a small content analysis program that should run through several PDF files and return how often certain words are mentioned in the text. The words to search for are specified in a separate text file (list.txt) and can be altered. The program runs just fine on files in .txt format, but the result is completely different when I run it on a .pdf file. To illustrate, the test text that I run the program through is the following:
"Hello
This is a product development notice
We’re working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"
The list of words grouped in categories are the following (marked in .txt file with ">>"):
innovation: innovat
product: Product, development, introduction
organization: Process
The output from running the code with a .txt file is the following:
Whereas the output from running it with a .pdf is the following:
As you can see, my issue concerns the splitting of the words: in the .pdf output a string like "world" can be split into 'w', 'o', 'rld'. I have searched tirelessly for why this happens, without success. As I am rather new to Python programming, I would appreciate any answer, or a pointer to a source where I can find an answer to why this happens.
Thanks
The code for the .txt is as follows:
import string, re, os
import PyPDF2
dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()
dic = {}
scores = {}
i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) +'.txt'
    textfile = open(f)
    text = textfile.read().split() # lowercase the text
    print (text)
    textfile.close()
    i = i + 1
# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0
# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category
# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match( token ):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1
print (os.path.basename(f))
for key in scores.keys():
    print (key, ":", scores[key])
While the code for the .pdf is as follows:
import string, re, os
import PyPDF2
dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()
dic = {}
scores = {}
i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) +'.pdf'
    textfile = open(f, 'rb')
    text = PyPDF2.PdfFileReader(textfile)# lowercase the text
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()
        print (textfile)
    i = i + 1
# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0
# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category
# examine the text
for token in textfile:
    for pattern in dic.keys():
        if pattern.match( token ):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1
print (os.path.basename(f))
for key in scores.keys():
    print (key, ":", scores[key])
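The scoring logic itself can be checked independently of PyPDF2. The following is a minimal sketch under some assumptions: the category patterns are the ones listed in the question (the real list.txt is not shown), the sample text is typed in directly with a plain apostrophe, and `score` is a helper name made up for this example.

```python
import re

# Category patterns as listed in the question (assumed contents of list.txt).
CATEGORIES = {
    "innovation": ["innovat"],
    "product": ["Product", "development", "introduction"],
    "organization": ["Process"],
}

SAMPLE_TEXT = """Hello
This is a product development notice
We're working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"""

def score(tokens, categories):
    """Count tokens whose beginning matches any pattern of a category."""
    compiled = {
        name: [re.compile(p, re.IGNORECASE) for p in patterns]
        for name, patterns in categories.items()
    }
    scores = {name: 0 for name in categories}
    for token in tokens:
        for name, patterns in compiled.items():
            for pattern in patterns:
                if pattern.match(token):  # match() anchors at the token start
                    scores[name] += 1
    return scores

print(score(SAMPLE_TEXT.split(), CATEGORIES))
# {'innovation': 3, 'product': 5, 'organization': 1}
```

If the counts obtained from the PDF differ from these, the difference most likely comes from the tokens that the text extraction produced rather than from the matching loop, so printing the extracted tokens is a reasonable first check.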

Python extract values from text using keys

I have a text file in the following format of Key Value
--START--
FirstName Kitty
LastName McCat
Color Red
random_data
Meow Meow
--END--
I want to extract specific values from the text into a variable or a dict. For example, if I want to extract the values of LastName and Color, what would be the best way to do this?
The random_data may be anywhere in the file and span multiple lines.
I've considered using regex but am concerned with performance and readability as in the real code I have many different keys to extract.
I could also loop over each line and check for each key but it's quite messy when having 10+ keys. For example:
if line.startswith("LastName"):
    #split line at space and handle
if line.startswith("Color"):
    #split line at space and handle
Hoping for something a little cleaner
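tokens = ['LastName', 'Color']
dictResult = {}
fileName = 'sampletxt.txt'  # assumed file name; point this at your input file
with open(fileName,'r') as fileHandle:
    for line in fileHandle:
        lineParts = line.split()  # splits on any whitespace and drops the trailing newline
        if len(lineParts) == 2 and lineParts[0] in tokens:
            dictResult[lineParts[0]] = lineParts[1]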
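The same line-by-line idea can be tried without a file on disk. This is a minimal sketch, assuming the sample from the question; `extract_values` is a hypothetical helper name, and str.partition is used instead of split so that a value containing spaces stays in one piece.

```python
SAMPLE = """--START--
FirstName Kitty
LastName McCat
Color Red
random_data
Meow Meow
--END--"""

def extract_values(lines, wanted):
    """Return {key: value} for 'Key Value' lines whose key is in wanted."""
    wanted = set(wanted)  # constant-time membership test per line
    result = {}
    for line in lines:
        # partition splits at the first space only: ('Key', ' ', 'rest of line')
        key, _, value = line.strip().partition(" ")
        if key in wanted and value:
            result[key] = value
    return result

print(extract_values(SAMPLE.splitlines(), ["LastName", "Color"]))
# {'LastName': 'McCat', 'Color': 'Red'}
```

Because the wanted keys are held in a set, adding ten or more keys does not add any branches to the loop.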
Assuming your file is in something called sampletxt.txt, this would work. It creates a dictionary mapping from key -> list of values.
import re
with open('sampletxt.txt', 'r') as f:
    txt = f.read()
keys = ['FirstName', 'LastName', 'Color']
d = {}
for key in keys:
    d[key] = re.findall(key+r'\s(.*)\s*\n*', txt)
This version allows you to optionally specify the tokens
import re
s = """--START--
FirstName Kitty
LastName McCat
Color Red
random_data
Meow Meow
--END--"""
tokens = ["LastName", "Color"]
if len(tokens) == 0:
    # raw strings keep the backslash in \w intact
    print(re.findall(r"({0}) ({0})".format(r"\w+"), s))
else:
    print(list((t, re.findall(r"{} (\w+)".format(t), s)[0]) for t in tokens))
Output
[('LastName', 'McCat'), ('Color', 'Red')]
Building off the other answers, this function would use regular expressions to take any text key and return the value if found:
import re
file_name = 'test.txt'
def get_text_value(text_key, file_name):
    match_str = text_key + r"\s(\w+)\n"  # raw string so \s and \w reach the regex engine
    with open(file_name, "r") as f:
        text_to_check = f.readlines()
    text_value = None
    for line in text_to_check:
        matched = re.match(match_str, line)
        if matched:
            text_value = matched.group(1)
    return text_value
if __name__ == "__main__":
    first_key = "FirstName"
    first_value = get_text_value(first_key, file_name)
    print('Check for first key "{}" and value "{}"'.format(first_key,
                                                           first_value))
    second_key = "Color"
    second_value = get_text_value(second_key, file_name)
    print('Check for second key "{}" and value "{}"'.format(second_key,
                                                            second_value))

how to read a specific line which starts in "#" from file in python

How can I read a specific line which starts with "#" from a file in Python,
set that line as a key in a dictionary (without the "#"), and set all the lines after it, until the next "#", as the value in the dictionary?
Please help me.
Here is the file:
from collections import defaultdict
key = 'NOKEY'
d = defaultdict(list)
with open('thefile.txt', 'r') as f:
    for line in f:
        if line.startswith('#'):
            # drop the leading '#' and the surrounding whitespace/newline
            key = line.lstrip('#').strip()
            continue
        d[key].append(line)
Your dictionary will have a list of lines under each key. All lines that come before the first line starting with '#' would be stored under the key 'NOKEY'.
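To try this approach without a file on disk, here is a minimal sketch that runs the same loop over an in-memory sample. The sample text is a shortened, assumed version of the input, and `parse_sections` is a helper name made up for this example; unlike the code above, it also strips the newline from each stored line.

```python
from collections import defaultdict
import io

# Shortened, assumed sample of the input file.
SAMPLE = """#The Piper at the gates of dawn
*Lucifer sam....
sksdlkdfslkj
# A Saucerful of Secrets
*Let there be
People heard him say
"""

def parse_sections(lines):
    """Group lines under the most recent '#' header line."""
    sections = defaultdict(list)
    key = "NOKEY"  # used for any lines that appear before the first header
    for line in lines:
        if line.startswith("#"):
            key = line.lstrip("#").strip()
        else:
            sections[key].append(line.rstrip("\n"))
    return dict(sections)

sections = parse_sections(io.StringIO(SAMPLE))
print(sections)
```

Passing any iterable of lines, such as an open file object, works the same way as the io.StringIO used here.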
You could make use of Python's groupby function as follows:
from itertools import groupby
d = {}
key = ''
with open('input.txt', 'r') as f_input:
    for k, g in groupby(f_input, key=lambda x: x[0] == '#'):
        if k:
            key = next(g).strip(' #\n')
        else:
            d[key] = ''.join(g)
print(d)
This would give you the following kind of output:
{'The Piper at the gates of dawn': '*Lucifer sam....\nsksdlkdfslkj\ndkdkfjoiupoeri\nlkdsjforinewonre\n', 'A Saucerful of Secrets': '*Let there be\nPeople heard him say'}
Tested using Python 2.7.9
A pretty simple version
filename = 'test'
results = {}
with open(filename, 'r') as f:
    while (1):
        text = f.readline()
        if (text == ''):
            break
        elif (text[0] == "#"):
            key = text
            results[key] = ''
        else:
            results[key] += text
From (ignoring additional blank lines, a by-product of the answer formatting):
#The Piper at the gates of dawn
*Lucifer sam....
sksdlkdfslkj
dkdkfjoiupoeri
lkdsjforinewonre
# A Saucerful of Secrets
*Let there be
People heard him say
Produces:
{'#The Piper at the gates of dawn\n': '*Lucifer sam....\nsksdlkdfslkj\ndkdkfjoiupoeri\nlkdsjforinewonre\n', '# A Saucerful of Secrets \n': '*Let there be\nPeople heard him say\n'}

Turning a python dict. to an excel sheet

I am having an issue with the below code.
import urllib2
import csv
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.ny.com/clubs/nightclubs/index.html').read())
clubs = []
trains = ["A","C","E","1","2","3","4","5","6","7","N","Q","R","L","B","D","F"]
for club in soup.find_all("dt"):
    clubD = {}
    clubD["name"] = club.b.get_text()
    clubD["address"] = club.i.get_text()
    text = club.dd.get_text()
    nIndex = text.find("(")
    if(text[nIndex+1]=="2"):
        clubD["number"] = text[nIndex:nIndex+15]
    sIndex = text.find("Subway")
    sIndexEnd = text.find(".",sIndex)
    if(text[sIndexEnd-1] == "W" or text[sIndexEnd -1] == "E"):
        sIndexEnd2 = text.find(".",sIndexEnd+1)
        clubD["Subway"] = text[sIndex:sIndexEnd2]
    else:
        clubD["Subway"] = text[sIndex:sIndexEnd]
    try:
        cool = clubD["number"]
    except (ValueError,KeyError):
        clubD["number"] = "N/A"
    clubs.append(clubD)
keys = [u"name", u"address",u"number",u"Subway"]
f = open('club.csv', 'wb')
dict_writer = csv.DictWriter(f, keys)
dict_writer.writerow([unicode(s).encode("utf-8") for s in clubs])
I get the error ValueError: dict contains fields not in fieldnames. I don't understand how this could be. Any assistance would be great. I am trying to turn the dictionary into an Excel file.
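clubs is a list of dictionaries, where each dictionary has four fields: name, address, number, and Subway. writerow() expects a single dictionary, but your code passes it a list of strings, which is why the fieldnames check fails. Write each dictionary separately and encode each of its fields: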
# Instead of:
#dict_writer.writerow([unicode(s).encode("utf-8") for s in clubs])
# Do this:
for c in clubs:
    # Encode each field: name, address, ...
    for k in c.keys():
        c[k] = c[k].encode('utf-8').strip()
    # Write to file
    dict_writer.writerow(c)
Update
I looked at your data, and some of the fields end with a newline (\n), so I updated the code to encode and strip whitespace at the same time.
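The code above targets Python 2. For Python 3, a minimal sketch of the same idea might look as follows; the two club records are made-up sample data shaped like the `clubs` list, and the rows are written to an in-memory buffer so the snippet runs without network access or files.

```python
import csv
import io

# Made-up records shaped like the `clubs` list built by the scraper.
clubs = [
    {"name": "Club A", "address": "1 Main St", "number": "N/A",
     "Subway": "Subway: A to 14th St"},
    {"name": "Club B", "address": "2 Side Ave\n", "number": "(212) 555-0100",
     "Subway": ""},
]
fieldnames = ["name", "address", "number", "Subway"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
for club in clubs:
    # One dictionary per writerow() call; strip stray whitespace from each field.
    writer.writerow({key: value.strip() for key, value in club.items()})

print(buffer.getvalue())
```

With a real file in Python 3, opening it as open('club.csv', 'w', newline='', encoding='utf-8') lets the file object handle the encoding, so the fields do not need to be encoded by hand.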
