I have a list in Python with Twitter user information, which I exported to an Excel file with Pandas.
Each row is one Twitter user with nearly all of the user's information (name, screen name, location, etc.).
Here is my code that creates the list and fills it with the user data:
def get_usernames(userids, api):
    fullusers = []
    u_count = len(userids)
    try:
        # lookup_users accepts at most 100 ids per call, so work in batches of 100
        for i in range(int(u_count / 100) + 1):
            end_loc = min((i + 1) * 100, u_count)
            fullusers.extend(
                api.lookup_users(user_ids=userids[i * 100:end_loc])
            )
        print('\n' + 'Done! We found ' + str(len(fullusers)) + ' followers in total for this account.' + '\n')
        return fullusers
    except Exception:
        import traceback
        traceback.print_exc()
        print('Something went wrong, quitting...')
The only problem is that every row is a JSON object and therefore one long comma-separated string. I would like to create headers (no problem with Pandas) and write only parts of the string (e.g. ID or name) to columns.
Here is an example of a row from my output.xlsx:
User(_api=<tweepy.api.API object at 0x16898928>, _json={'id': 12345, 'id_str': '12345', 'name': 'Jane Doe', 'screen_name': 'jdoe', 'location': 'Nirvana, NI', 'description': 'Just some random descrition')
I have two ideas, but I don't know how to realize them due to my lack of skills and experience with Python.
Create a loop which saves certain parts ('id', 'name', etc.) of the JSON string to columns.
Cut off the User(_api=<tweepy.api.API object at 0x16898928>, _json={ at the beginning and ) at the end, so that I can export the file as CSV.
Could anyone help me out with one of my two solutions or suggest a "simple" way to do this?
fyi: I want to do this to gather data for my thesis.
Try the Python json library:
import json

jsonstring = '{"id": 12345, "id_str": "12345", "name": "Jane Doe", "screen_name": "jdoe", "location": "Nirvana, NI", "description": "Just some random description"}'
jsondict = json.loads(jsonstring)
# type(jsondict) == dict
Now you can just extract the data you want from it:
id = jsondict["id"]
name = jsondict["name"]
newdict = {"id":id,"name":name}
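If you already have the list of tweepy User objects returned by get_usernames, you may not need to parse strings at all: each User keeps the parsed data in its ._json attribute, which is already a dictionary. A minimal sketch, assuming fullusers is that list and the column names below are just examples taken from your output:
import pandas as pd

rows = [user._json for user in fullusers]   # one dict per Twitter user
df = pd.DataFrame(rows)                     # dict keys become column headers
df = df[['id', 'name', 'screen_name', 'location', 'description']]  # keep only the columns you need
df.to_excel('output.xlsx', index=False)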
I'm writing a python program to save and retrieve a customer data in cloud datastore. My entity looks like below:
entity.update({
    'customerId': args['customerId'],
    'name': args['name'],
    'email': args['email'],
    'city': args['city'],
    'mobile': args['mobile']
})
datastore_client.put(entity)
I'm successfully saving the data. Now, I want to retrieve a random email address from a record. I have written the code below:
def get_customer():
    query = datastore_client.query(kind='CustomerKind')
    results = list(query.fetch())
    chosen_customer = random.choice(results)
    print(chosen_customer)
But instead of getting just one random email address, I'm getting the entire row, like this:
<Entity('CustomerKind', 6206716152643584) {'customerId': '103', 'city': 'bhubaneswar', 'name': 'Amit', 'email': 'amit#gmail.com', 'mobile': '7879546732'}>
Can anyone suggest how I can get only 'email': 'amit#gmail.com'? I'm new to Datastore.
When using
query = datastore_client.query(kind='CustomerKind')
results = list(query.fetch())
you are retrieving all the properties from all the entities that will be returned.
Instead, you can use a projection query, which allows you to retrieve only the specified properties from the entities:
query = datastore_client.query(kind="CustomerKind")
query.projection = ["email"]
results = list(query.fetch())
Using projection queries is recommended for cases like this, in which you only need some properties as they reduce cost and latency.
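From there, getting one random email address is just a matter of picking one of the returned entities, which behave like dictionaries. A small sketch building on the projection query above:
import random

query = datastore_client.query(kind='CustomerKind')
query.projection = ['email']
results = list(query.fetch())

chosen_customer = random.choice(results)   # an Entity holding only the projected property
print(chosen_customer['email'])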
I have a method that writes JSON data to a file. The title is based on the book, and the data is the book's publisher, date, author, etc. The method works fine if I want to add just one book.
Code
import json

def createJson(title, firstName, lastName, date, pageCount, publisher):
    print "\n*** Inside createJson method for " + title + "***\n"
    data = {}
    data[title] = []
    data[title].append({
        'firstName:', firstName,
        'lastName:', lastName,
        'date:', date,
        'pageCount:', pageCount,
        'publisher:', publisher
    })
    with open('data.json', 'a') as outfile:
        json.dump(data, outfile, default=set_default)

def set_default(obj):
    if isinstance(obj, set):
        return list(obj)

if __name__ == '__main__':
    createJson("stephen-king-it", "stephen", "king", "1971", "233", "Viking Press")
JSON File with one book/one method call
{
"stephen-king-it": [
["pageCount:233", "publisher:Viking Press", "firstName:stephen", "date:1971", "lastName:king"]
]
}
However, if I call the method multiple times, thus adding more book data to the JSON file, the format is all wrong. For instance, if I simply call the method twice with a main method of
if __name__ == '__main__':
createJson("stephen-king-it","stephen","king","1971","233","Viking Press")
createJson("william-golding-lord of the flies","william","golding","1944","134","Penguin Books")
My JSON file looks like
{
"stephen-king-it": [
["pageCount:233", "publisher:Viking Press", "firstName:stephen", "date:1971", "lastName:king"]
]
} {
"william-golding-lord of the flies": [
["pageCount:134", "publisher:Penguin Books", "firstName:william","lastName:golding", "date:1944"]
]
}
This is obviously wrong. Is there a simple fix to my method that produces correct JSON? I looked at many simple examples online for writing JSON data in Python, but all of them gave me format errors when I checked them on JSONLint.com. I have been racking my brain over this problem and editing the file by hand to make it correct, but all my efforts were to no avail. Any help is appreciated. Thank you very much.
Simply appending new objects to your file doesn't create valid JSON. You need to add your new data inside the top-level object, then rewrite the entire file.
This should work:
def createJson(title, firstName, lastName, date, pageCount, publisher):
    print "\n*** Inside createJson method for " + title + "***\n"

    # Load any existing json data, or start with an empty object
    # if the file is missing or doesn't contain valid JSON yet
    try:
        with open('data.json') as infile:
            data = json.load(infile)
    except (IOError, ValueError):
        data = {}

    data[title] = []
    data[title].append({
        'firstName:', firstName,
        'lastName:', lastName,
        'date:', date,
        'pageCount:', pageCount,
        'publisher:', publisher
    })
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile, default=set_default)
A JSON document can be either an array or an object (dictionary). In your case the file contains two separate objects, one with the key stephen-king-it and another with william-golding-lord of the flies. Either of these on its own would be okay, but placing them back to back like that is invalid.
Using an array you could do this:
[
{ "stephen-king-it": [] },
{ "william-golding-lord of the flies": [] }
]
Or a dictionary style format (I would recommend this):
{
"stephen-king-it": [],
"william-golding-lord of the flies": []
}
Also, the data you are appending uses commas instead of colons, which creates a set rather than a dictionary of key-value pairs (a dictionary is what you want here, and it is why you needed set_default in the first place). You need to change it to this:
data[title].append({
'firstName': firstName,
'lastName': lastName,
'date': date,
'pageCount': pageCount,
'publisher': publisher
})
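With both changes applied (rewriting the whole file and appending a real dictionary), calling createJson twice should leave data.json as one valid top-level object, roughly like this:
{
    "stephen-king-it": [
        {"firstName": "stephen", "lastName": "king", "date": "1971", "pageCount": "233", "publisher": "Viking Press"}
    ],
    "william-golding-lord of the flies": [
        {"firstName": "william", "lastName": "golding", "date": "1944", "pageCount": "134", "publisher": "Penguin Books"}
    ]
}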
How can I add several entries to a document at once in Flask using MongoEngine/Flask-MongoEngine?
I tried to iterate over the dictionary that contains my entries. I simplified the example a bit; originally the data is an RSS feed that my WordPress site spits out and that I parsed via feedparser.
But the problem obviously is that I cannot dynamically generate variables that hold my entries before being saved to the database.
Here is what I tried so far.
How can I add the entries to my MongoDB database in bulk?
# model
class Entry(db.Document):
    created_at = db.DateTimeField(
        default=datetime.datetime.now, required=True)
    title = db.StringField(max_length=255, required=True)
    link = db.StringField(required=True)
# dictionary with entries
e = {'entries': [
    {'title': u'title1', 'link': u'http://www.me.com'},
    {'title': u'title2', 'link': u'http://www.me.com/link/'},
]}
# multiple entries via views
i = 0
while i < len(e['entries']):
    post[i] = Entry(title=e['entries'][i]['title'], link=e['entries'][i]['link'])
    post[i].save()
    i += 1
Edit 1:
I thought about skipping the variables altogether and translating the dictionary into the form that MongoEngine can understand.
Because when I create a list manually, I can enter them in bulk into MongoDB:
newList = [RSSPost(title="test1", link="http://www.google.de"),
           RSSPost(title="test2", link="http://www.test2.com")]
RSSPost.objects.insert(newList)
This works, but I could not translate it completely to my problem.
I tried
f = []
for x in e['entries']:
    f.append("insert " + x['link'] + " and " + x['title'])
But as you see I could not recreate the list I need.
How to do it correctly?
How is your data/case different from the examples you posted? As long as I'm not missing something you should be able to instantiate Entry objects like:
entries = []
for entry in e['entries']:
    new_entry = Entry(title=entry['title'], link=entry['link'])
    entries.append(new_entry)

Entry.objects.insert(entries)
Quick and easy way:
for i in e['entries']:
    new_e = Entry(**i)   # the dict keys ('title', 'link') match the field names, so ** unpacks them as keyword arguments
    new_e.save()
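If the real data comes straight from feedparser, the same pattern applies. A rough sketch, assuming a placeholder feed URL and that each parsed entry exposes title and link attributes (as feedparser entries normally do):
import feedparser

feed = feedparser.parse('http://example.com/feed/')   # placeholder URL
entries = [Entry(title=item.title, link=item.link) for item in feed.entries]
Entry.objects.insert(entries)   # single bulk insert into MongoDB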
The CSV file works fine, and so does the dictionary, but I can't seem to check the values in the CSV file to make sure I'm not adding duplicate entries. How can I check this? The code I tried is below:
def write_csv():
    csvfile = csv.writer(open("address.csv", "a"))
    check = csv.reader(open("address.csv"))
    for items in address2:
        csvfile.writerow([address2[items]['address']['value'], address2[items]['address']['count'], items, datetime.datetime.now()])

def check_csv():
    check = csv.reader(open("address.csv"))
    csvfile = csv.writer(open("address.csv", "a"))
    for stuff in address2:
        address = address2[str(stuff)]['address']['value']
        for sub in check:
            if sub[0] == address:
                print "equals"
                try:
                    address2[stuff]['delete'] = True
                except:
                    address2[stuff]['delete'] = True
            else:
                csvfile.writerow([address2[stuff]['address']['value'], address2[stuff]['address']['count'], stuff, datetime.datetime.now()])
Any ideas?
Your CSV and dict structures are a little wonky; I'd love to know if they are set in stone or if you can change them to be more useful. Here is an example that does basically what you want, though you'll have to change some things to fit your format. The most important change is probably not writing to a file while you are reading it; that is going to lead to headaches.
This does what you asked with the delete flag. Is there an external need for it? If not, there is almost certainly a better way (removing the bad rows, saving the good rows somewhere else, etc., depending on what you are doing).
Anyway, here is the example. I used just the commented block to create the CSV file in the first place, then added the new address to the list and ran the rest. Instead of looping through the file over and over, it builds a lookup dict keyed by address that stores each entry's index in the addresses list, and uses that to set the delete flag when an address is found while reading the CSV. You'll want to take the prints out and uncomment the last line to actually write the new rows.
import csv, datetime

addresses = [
    {'address': {'value': '123 road', 'count': 1}, 'delete': False},
    {'address': {'value': '456 road', 'count': 1}, 'delete': False},
    {'address': {'value': '789 road', 'count': 1}, 'delete': False},
    {'address': {'value': '1 new road', 'count': 1}, 'delete': False},
]

now = datetime.datetime.now()

### create the csv
##with open('address.csv', 'wb') as csv_file:
##    writer = csv.writer(csv_file)
##    for row in addresses:
##        writer.writerow([row['address']['value'], row['address']['count'], now.strftime('%Y-%m-%d %H:%M:%S')])

# make lookup keys for the dict
address_lookup = {}
for i in range(len(addresses)):
    address_row = addresses[i]
    address_lookup[address_row['address']['value']] = i

# read csv once
with open('address.csv', 'rb') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        print row
        # if address is found in the dict, set delete flag to true
        if row[0] in address_lookup:
            print 'flagging address as old: %s' % row[0]
            addresses[address_lookup[row[0]]]['delete'] = True

with open('address.csv', 'ab') as csv_file:
    # go back through addresses and add any that shouldn't be deleted to the csv
    writer = csv.writer(csv_file)
    for address_row in addresses:
        if address_row['delete'] is False:
            print 'adding row: '
            print address_row
            #writer.writerow([address_row['address']['value'], address_row['address']['count'], now.strftime('%Y-%m-%d %H:%M:%S')])
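If the delete flag isn't needed elsewhere, a plainer variant is to read the existing addresses into a set once and skip anything already present before appending. A minimal sketch, assuming the address2 structure from the question (the helper name is just illustrative):
import csv, datetime

def append_new_addresses(address2, path='address.csv'):   # hypothetical helper
    # collect the addresses already written to the file
    with open(path) as f:
        seen = set(row[0] for row in csv.reader(f) if row)
    # append only entries whose address is not in the file yet
    with open(path, 'a') as f:
        writer = csv.writer(f)
        for key, info in address2.items():
            value = info['address']['value']
            if value in seen:
                continue   # duplicate, skip it
            writer.writerow([value, info['address']['count'], key, datetime.datetime.now()])
            seen.add(value)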
I have created a script which scrapes many PDFs for abstracts and keywords. I also have a collection of BibTeX files in which I want to place the text I've extracted. What I'm looking for is a way of adding elements to the BibTeX files.
I have written a short parser:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from pybtex.database.input import bibtex

dir_path = "nime_archive/nime/bibtex/"
num_texts = 0

class Bibfile:
    def __init__(self, bibs):
        self.bibs = bibs
        for a in self.bibs.entries.keys():
            num_texts += 1
            print bibs.entries[a].fields['title']
            # Need to implement a way of getting just the nime-identificator
            try:
                print bibs.entries[a].fields['url']
            except:
                print "couldn't find URL for text: %s " % a

print "creating new bibfile"
bibfiles = []
parser = bibtex.Parser()
for infile in os.listdir(dir_path):
    if infile.endswith(".bib"):
        print infile
        bibfiles = Bibfile(parser.parse_file(dir_path + infile))
My question is if there is possible to use Pybtex to add elements into the existing bibtex-files (or create a copy) so I can merge my extractions with what is already available. If this is not possible in Pybtex, what other bibtex parser can I use?
I've never used pybtex, but from a quick glance, you can add entries. Since self.bibs.entries appears to be a dict, you can come up with a unique key, and add more entries to it. Example:
from pybtex.core import Entry, Person  # pybtex.database in newer pybtex versions

key = "some_unique_string"
new_entry = Entry('article',
    fields={
        'language': u'english',
        'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
        'journal': u'Ind. Eng. Chem. Res.',
        'volume': u'36',
        'year': u'1997',
        'pages': u'888-895',
    },
    persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)
self.bibs.entries[key] = new_entry
(caveat: untested)
If you wonder where I got this example form: have a look in the tests/ subdirectory of the source of pybtex. I got the above code example mainly from tests/database_test/data.py. Tests can be a good source of documentation if the actual documentation is lacking.
.data.add_entry(key, entry) works for me. Here I use a manually created entry (taken from Evert's example), but you can copy an existing entry from another bib file that you're also parsing.
from pybtex.database.input.bibtex import Parser
from pybtex.core import Entry, Person

key = "some_unique_string"
new_entry = Entry('article',
    fields={
        'language': u'english',
        'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
        'journal': u'Ind. Eng. Chem. Res.',
        'volume': u'36',
        'year': u'1997',
        'pages': u'888-895',
    },
    persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)

newbib_parser = Parser()
newbib_parser.data.add_entry(key, new_entry)
print newbib_parser.data
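To get the merged data back onto disk, pybtex also provides a BibTeX writer next to the parser. Something along these lines should work, though the exact module path can differ between pybtex versions and out.bib is just an example filename:
from pybtex.database.output.bibtex import Writer

# write the BibliographyData (including entries added via add_entry) to a new .bib file
Writer().write_file(newbib_parser.data, 'out.bib')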