Removing duplicates in a JSON file - Python

I have a JSON file which contains some duplicates, and I am looking for a way to remove them. Here are two examples of the beginning of my JSON records:
"date": "May 16, 2012 Wednesday", "body": "THE future of one of Scotland's most important listed buildings .... World Monuments Fund. o See a picture gallery of Mavisbank House at scotsman.com/scotland ", "title": "Rescue deal to bring Adam mansion back from brink"
"date": "May 16, 2012 Wednesday", "body": "The future of one of Scotland's most important listed buildings .... World Monuments Fund.", "title": "Rescue deal to bring Adam mansion back from brink"
I have cut the text in the middle due to its length, and because it is irrelevant since the two records match almost perfectly. As we can see, the text matches nearly 100%, except at the beginning (THE vs The) and at the end (the extra sentence: o See a picture gallery of Mavisbank House at scotsman.com/scotland). Along these lines, I would like to come up with a way to I) find the duplicates and II) remove one of them (note that there can also be more than one duplicate). I just started programming in Python and I am not sure how to handle this problem. Any help is really appreciated!
kind regards!

I think it would be better if you first convert your JSON string into a model object.
After that you can simply iterate over the elements and remove the duplicates (at whatever level). You can ignore case while comparing the individual elements.
Alternatively, you can convert each of your body/title elements to a consistent case and add them to a set for the duplicate check while iterating, as @ForceBru pointed out in the comments.
The following link will point you in the appropriate direction for JSON-to-object conversion:
Is there a python json library can convert json to model objects, similar to google-gson?
Hope this helps.
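A minimal sketch of the set-based approach, assuming the file holds a JSON array of records with "date", "body" and "title" keys as in the question; using date + title as the duplicate key is an assumption, since the bodies differ slightly between duplicates:

import json

with open("articles.json") as f:  # hypothetical filename
    records = json.load(f)

seen = set()
unique_records = []
for rec in records:
    # Lower-case the key fields so that "THE future..." and "The future..."
    # compare equal; date + title is assumed to identify an article uniquely.
    key = (rec["date"].lower(), rec["title"].lower())
    if key not in seen:
        seen.add(key)
        unique_records.append(rec)

with open("articles_deduped.json", "w") as f:
    json.dump(unique_records, f, indent=2)

This keeps the first occurrence of each duplicate group and drops the rest, however many duplicates there are.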

Related

REGEX to find all matches inside a given string

I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed confirm a string: print(type(item)) gives <class 'str'>
Now I searched online for a possible (and preferably fast - because of the million entries) regex solution to extract all the categories.
I found several questions here, e.g. Match single quotes from python re. I tried re.findall(r"'(\w+)'", item) but only got empty brackets [].
Then I went on and searched for alternative methods like this one: Python Regex to find a string in double quotes within a string. There someone tries the following: matches = re.findall(r'\"(.+?)\"', item); print(matches). But this failed in my case as well...
After that I tried an admittedly crude approach to at least get a workaround and solve the problem properly later: list_cat_split = item.split(','), which gives me
e["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)
However, even this approach failed: [['Electronics'], []]
I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and probably this is a no-brainer for regular regex users.
UPDATE:
Somehow I cannot answer my own question, therefore here is an update:
Thanks for the answers - sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML application that is written entirely in Python. Also, this is for my MSc project, so there is no production environment. Therefore I am fine with a slower, but working, solution, since I run it once and for all. As far as I can see, the solution of @FailSafe worked for me (verified in my Jupyter notebook, with the result as a list).
But yes, I totally agree with @Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)
This may not be your final answer, but it creates a list of categories.
x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y=x[2:-2]
z=y.split(',')
for item in z:
print(item)
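For completeness, a regex variant that also captures multi-word categories (a sketch, not from the original answers; it matches any characters except quotes between a pair of single quotes):

import re

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# '[^']+' matches everything between two single quotes, so entries
# containing spaces or '&' (which \w+ rejects) are captured whole.
categories = re.findall(r"'([^']+)'", item)
print(categories)
# ['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']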

How to loop using .split() function on a text file python

I have an HTML file with different team names written throughout the file. I just want to grab the team names. The team names always occur after certain text and end before certain text, so I've used the split function to find the team name. I'm a beginner, and I'm sure I'm making this harder than it is. data holds the file contents.
teams = data.split('team-away">')[1].split("</sp")[0]
for team in teams:
    print team
This returns each individual character of the first team that it finds (so, for example, if teams is "San Francisco 49ers", it prints "S", then "a", etc.) instead of what I need it to do: print "San Francisco 49ers", then on the next line the next team, "Carolina Panthers", and so on.
Thank you!
"I'm a beginner, and I'm sure I'm making this harder than it is."
Well, kind of.
import re
teams = re.findall('team-away">(.*)</sp', data)
(with credit to Kurtis, for a simpler regular expression than I originally had)
Though an actual HTML parser would be best practice.
Don't re-invent the wheel! Look into BeautifulSoup, it'll do the job for you.
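A minimal BeautifulSoup sketch, assuming the team names sit in <span class="team-away"> elements (the tag and class are inferred from the 'team-away">' and '</sp' markers in the question):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(data, "html.parser")
# Print the text of every away-team span, one team per line.
for span in soup.find_all("span", class_="team-away"):
    print(span.get_text())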

How do I find the max value in a dict when Python thinks the values are strings?

I'm working with the Tumblr API -- I want to build a little doo-dad that will pull all of my posts to my various Tumblrs and archive them to my WordPress blog once a week.
So each time the script runs, I want to log the max post id. So I'm trying to get something like this to work:
jdata = json.loads(rawjson)
jposts = jdata['response']['posts']
for post in jposts:
    print post['id']
print max(jposts['id'])
What I'm getting is a lot of post ids (expected), followed by "list indices must be integers, not str". But if I do print type(post['id']), Python recognizes them as <type 'long'>. So... what am I doing wrong here?
Here's a snippet of Tumblr's sample output that you can use for rawjson
{
"meta":{
"status":200,
"msg":"OK"
},
"response":{
"blog":{
"title":"Scipsy",
"name":"scipsy",
"posts":8524,
"url":"http:\/\/scipsy.tumblr.com\/",
"updated":1365196814,
"description":"\u0022Science is interesting and if you don\u0027t agree, fuck off\u0022",
"ask":true,
"ask_anon":true,
"is_nsfw":false,
"share_likes":false
},
"posts":[
{
"blog_name":"scipsy",
"id":47218422365,
"post_url":"http:\/\/scipsy.tumblr.com\/post\/47218422365\/you-are-missed-that-is-all",
"slug":"you-are-missed-that-is-all",
"type":"answer",
"date":"2013-04-05 21:20:14 GMT",
"timestamp":1365196814,
"state":"published",
"format":"html",
"reblog_key":"slI4NU3a",
"tags":[
"maybe I should start another blog",
"or maybe not",
"Maybe I have a concussion from all the throws received during judo and I\u0027m not thinking straight",
"yeah I think that\u0027s it"
],
"short_url":"http:\/\/tmblr.co\/ZW3EPyh_R-9T",
"highlighted":[
],
"note_count":35,
"asking_name":"psydoctor8",
"asking_url":"http:\/\/psydoctor8.tumblr.com\/",
"question":"you are missed. that is all.",
"answer":"\u003Cp\u003EAlthough this break from tumblr have had some positive effects on me (I\u2019ve stopped from compulsively sifting through the NASA\u2019s archive like a maniac and I no longer fell the need to stay up all night trying to answer the most absurd questions) I have to admit that I miss all this sciency stuff and I miss a lot the awesome people I’ve known through this silly blog.\u003C\/p\u003E"
},
{
"blog_name":"scipsy",
"id":32267453988,
"post_url":"http:\/\/scipsy.tumblr.com\/post\/32267453988\/in-the-last-months-this-blog-has-experienced-a",
"slug":"in-the-last-months-this-blog-has-experienced-a",
"type":"text",
"date":"2012-09-25 16:31:00 GMT",
"timestamp":1348590660,
"state":"published",
"format":"html",
"reblog_key":"0EMwke5R",
"tags":[
],
"short_url":"http:\/\/tmblr.co\/ZW3EPyU3IaOa",
"highlighted":[
],
"note_count":246,
"title":null,
"body":"\u003Cp\u003EIn the last months this blog has experienced a progressive decline in the number of produced posts. There are several reasons that come to my mind to explain why is that, but probably the better one is about the fact that I lost motivation. Unexpectedly, despite the lack of regular updates the blog gained more and more followers. [That’s flattering, but at the same time makes me suspicious about the relationship between quality of a blog and number of followers.]\u00a0\u003C\/p\u003E\n\u003Cp\u003EAnyway, I always feel a little lost when I follow a blog and it slowly fades away, and then it just stops posting, without saying anything, so I thought to make this post.\u003C\/p\u003E\n\u003Cp\u003EI’m not going to update \u003Cem\u003Escipsy\u003C\/em\u003E anymore. This could change, but for now I don’t feel like posting here anymore. I’m not going to delete it.\u003C\/p\u003E\n\u003Cp\u003EIf someone would like to stay in touch, just send a message or something. This is my mail: \u003Cem\u003Edr.scipsy#gmail.com\u003C\/em\u003E\u003C\/p\u003E\n\u003Cp\u003EIf someone is wondering: “\u003Cem\u003EWho will fill my dash with science now?\u003C\/em\u003E\u0022 here’s a list of \u003Cem\u003Esciency\u003C\/em\u003E tumblr I followed:\u003C\/p\u003E\n\u003Cul\u003E\u003Cli\u003E\u003Ca href=\u0022http:\/\/psydoctor8.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Epsydoctor8\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/electricorchid.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eelectricorchid\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/gradmom.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Egradmom\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/www.itsokaytobesmart.com\/\u0022 target=\u0022_blank\u0022\u003Eitsokaytobesmart\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/realcleverscience.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Erealcleverscience\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/scientistintraining.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Escientistintraining\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/crookedindifference.com\/\u0022 target=\u0022_blank\u0022\u003Ecrookedindifference\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/ohyeahdevelopmentalbiology.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eohyeahdevelopmentalbiology\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/scienceisbeauty.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Escienceisbeauty\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/fuckyeahneuroscience.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Efuckyeahneuroscience\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/climateadaptation.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eclimateadaptation\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/exp.lore.com\/\u0022 target=\u0022_blank\u0022\u003Eexp.lore\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/xenogifh.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Exenogifh\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/www.ziyadnazem.info\/\u0022 target=\u0022_blank\u0022\u003Eziyadnazem\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/intothecontinuum.tumblr.com\/\u0022 
target=\u0022_blank\u0022\u003Eintothecontinuum\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/approachingsignificance.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eapproachingsignificance\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/sciencesoup.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Esciencesoup\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/blog.matthen.com\/\u0022 target=\u0022_blank\u0022\u003Ematthen\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/ulaulaman.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eulaulaman\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/mindovermatterzine.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Emindovermatterzine\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/doctorswithoutborders.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Edoctorswithoutborders\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/mothernaturenetwork.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Emothernaturenetwork\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/wnycradiolab.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Ewnycradiolab\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/blog.nysci.org\/\u0022 target=\u0022_blank\u0022\u003Enysci\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/amnhnyc.tumblr.com\u0022 target=\u0022_blank\u0022\u003Eamnhnyc\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/discoverynews.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Ediscoverynews\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/tumblr.poptech.org\/\u0022 target=\u0022_blank\u0022\u003Epoptech\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/retina.smithsonianmag.com\/\u0022 target=\u0022_blank\u0022\u003Eretina.smithsonianmag\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/onearth.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Eonearth\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/huffpostscience.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Ehuffpostscience\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/bpod-mrc.tumblr.com\/\u0022 target=\u0022_blank\u0022\u003Ebpod-mrc\u003C\/a\u003E\u003C\/li\u003E\n\u003Cli\u003E\u003Ca href=\u0022http:\/\/blog.tedx.com\/\u0022 target=\u0022_blank\u0022\u003Etedx\u003C\/a\u003E\u003C\/li\u003E\n\u003C\/ul\u003E\u003Cp\u003EThat’s it, I think.\u003C\/p\u003E\n\u003Cp\u003E\u003Csmall\u003E\u003Cem\u003ESo long, and thanks for all the fish.\u003C\/em\u003E\u003C\/small\u003E\u003C\/p\u003E"
}
],
"total_posts":8524
}
}
jPosts is a list:
jPosts = [{'id': 123}, {'id': 54233}, ...]  # example
so you cannot say jPosts['id'], since it is a list, not a dictionary. However, you could say something like jPosts[0]['id'], since you are now looking up 'id' on the 0th element of jPosts, which is a dictionary.
I think you want
max(jPosts, key=lambda item: item['id'])  # compare based on each item's 'id' field
You can use max() with a key parameter specified:
from operator import itemgetter
print max(jPosts, key=itemgetter('id'))['id'] # prints 47218422365
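Putting the pieces together with the question's variables (a sketch; rawjson is the sample response above):

import json
from operator import itemgetter

jdata = json.loads(rawjson)
jposts = jdata['response']['posts']  # a list of post dicts

# max() compares the post dicts by their numeric 'id' values,
# so the ids are never treated as strings.
newest = max(jposts, key=itemgetter('id'))
print(newest['id'])  # 47218422365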

Python 3 Extracting Candidate Words from a Debate File

This is my first post, so I'm sorry if I do anything wrong. That said, I searched for this question and found something similar that was never answered because the OP didn't give sufficient information. This is also homework, so I'm just looking for a hint; I really want to get this on my own.
I need to read in a debate file (.txt), and pull out and store all of the lines that one candidate says, to put into a word cloud. The file format is supposed to help, but I'm blanking on how to do this. The hint is that each time a new person speaks, their name followed by a colon is the first word of the first line. However, a candidate's response can span multiple lines. I am supposed to store each person's lines separately. Here is a sample of the file:
LEHRER: This debate and the next three -- two presidential, one vice
presidential -- are sponsored by the Commission on Presidential
Debates. Tonight's 90 minutes will be about domestic issues and will
follow a format designed by the commission. There will be six roughly
15-minute segments with two-minute answers for the first question,
then open discussion for the remainder of each segment.
Gentlemen, welcome to you both. Let's start the economy, segment one,
and let's begin with jobs. What are the major differences between the
two of you about how you would go about creating new jobs?
LEHRER: You have two minutes. Each of you have two minutes to start. A
coin toss has determined, Mr. President, you go first.
OBAMA: Well, thank you very much, Jim, for this opportunity. I want to
thank Governor Romney and the University of Denver for your
hospitality.
There are a lot of points I want to make tonight, but the most
important one is that 20 years ago I became the luckiest man on Earth
because Michelle Obama agreed to marry me.
This is what I have for a function so far:
def getCandidate(myFile):
    file = open(myFile, "r")
    obama = []
    romney = []
    lehrer = []
    file = file.readlines()
I'm just not sure how to iterate through the data so that it separates each person's words correctly. I created a dummy file to create the word cloud, and I'm able to do that fine, so all I am wondering is how to extract the information I need.
Thank you! If there is more information I can offer please let me know. This is a beginning Python course.
EDIT: New code added from a response. This works to an extent, but it only grabs the first line of each candidate's response, not the entire response. I need to write code that keeps storing each line under the current candidate until a new name appears at the start of a line.
def getCandidate(myFile, candidate):
    file = open(myFile, "r")
    OBAMA = []
    ROMNEY = []
    LEHRER = []
    file = file.readlines()
    for line in file:
        if line.startswith("OBAMA:"):
            OBAMA.append(line)
        if line.startswith("ROMNEY:"):
            ROMNEY.append(line)
        if line.startswith("LEHRER:"):
            LEHRER.append(line)
    if candidate == "OBAMA":
        return OBAMA
    if candidate == "ROMNEY":
        return ROMNEY
EDIT: I now have a new question. How can I generalize the code so that I can open any debate file between two people and a moderator? I am having a lot of trouble with this one.
I've been given a hint to look at the first word of each line and check whether it ends in ":", but I'm still not sure how to do this. I tried splitting each line on spaces and then looking at the first item of the line, but that's as far as I've gotten.
The hint is this: after you split your lines, iterate over them, check each one with the string method startswith for each candidate, then append.
The iteration over a file is very simple:
for row in file:
    do_something_with_row
EDIT:
To keep adding the lines until you find a new candidate, keep track of the last candidate seen in a variable; if you don't find any match at the beginning of a line, you stick with the same candidate as before.
if line.startswith('OBAMA'):
    last_seen = OBAMA
    OBAMA.append(line)
elif line.startswith('ROMNEY'):
    last_seen = ROMNEY
    ROMNEY.append(line)
elif line.startswith('LEHRER'):
    last_seen = LEHRER
    LEHRER.append(line)
else:
    last_seen.append(line)
By the way, I would change the definition of the function: instead of taking the name of one candidate and returning only his lines, it would be better to return a dictionary with the candidate names as keys and their lines as values, so you wouldn't need to parse the file more than once. When you work with bigger files this could be a lifesaver.
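A sketch of that dictionary-based version, which also answers the generalization question by detecting any speaker from a first word that ends in ":" (the all-caps check is an assumption based on the sample file):

def getSpeakers(myFile):
    speakers = {}
    last_seen = None
    with open(myFile, "r") as f:
        for line in f:
            first_word = line.split(None, 1)[0] if line.strip() else ""
            if first_word.endswith(":") and first_word[:-1].isupper():
                # New speaker: create their entry on first sight,
                # then keep appending until the speaker changes.
                last_seen = speakers.setdefault(first_word[:-1], [])
            if last_seen is not None:
                last_seen.append(line)
    return speakers

# Usage: obama_lines = getSpeakers("debate.txt")["OBAMA"]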

lists and sublists

I use this code to split some data to make a list with three sublists, splitting when there is * or -. But it also keeps the \n\n * sequences, and I don't know why; I don't want those. Can someone tell me what I'm doing wrong?
This is the data:
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*')  # split the data at the '*'
newlist = [item.split("-") for item in data if item]
to turn that into something like the list I'm trying to get.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [[subitem.strip() for subitem in item] for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[subitem.replace("\\n", "\n").strip() for subitem in item] for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
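For example, reading line by line and dropping each trailing newline (a sketch of that suggestion):

with open("data.dat") as f:
    # rstrip("\n") removes only the newline at the end of each line.
    lines = [line.rstrip("\n") for line in f]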
This is going to split on any asterisk you have in the text as well.
Better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()])  # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem, I believe:
result = [[subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
          for item in open('data.txt').read().split('\n*')]

# now let's pretty-print the result
for i in result:
    print '***', i[0], '***'
    for j in i[1:]:
        print '\t--', j
    print
Note that I split on a newline followed by * or -, so it won't split on dashes inside the text. I also replace the literal character sequence \n\n (r'\n\n') with a real newline character ('\n'). The one-liner expression is a list comprehension, a way to construct a list in one gulp, without multiple .append() calls or concatenation.
