I have a csv file in which each line contains a person's ID # and then a bunch of attributes. I want to be able to create a tuple for each person that contains all their attributes and then name the tuple some variation of their ID #.
All these tuples will then be added to a set in redis for storage.
I can't seem to figure out how to create a tuple that is named after the person's ID #.
I know it's not best practice to dynamically name variables, but I would rather not put all the tuples in a list or set just to move them into a Redis set (which is a must); it seems inefficient and cumbersome.
This is the code I have now:
with open('personlist.csv', 'rb') as f:
    for line in f:
        row = line.split(',')
        personID = row[0]
        attrb1 = row[1]
        attrb2 = row[2]
        attrb3 = row[3]
        # Need to name the tuple here and define it as (attrb1, attrb2, attrb3)
        r.lpush('allpersonslist', tuple)
This example needs additional code to function. I'm assuming you are using a Redis API such as redis-py, and that the variable r is an open connection to Redis.
import pickle

with open('personlist.csv', 'r') as f:
    for line in f:
        row = line.rstrip('\n').split(',')
        personID = row[0]
        attrb1 = row[1]
        attrb2 = row[2]
        attrb3 = row[3]
        # put the attributes in a tuple
        # (named attrs to avoid shadowing the built-in 'tuple')
        attrs = (attrb1, attrb2, attrb3)
        # serialize the tuple before adding it to Redis
        # (%s, since personID is read from the file as a string)
        r.set("person/%s" % personID, pickle.dumps(attrs, -1))

def getPerson(Id):
    return pickle.loads(r.get("person/%s" % Id))
You can call getPerson(5) to return the tuple associated with a person of ID 5.
If each person has at most N attributes, there is a language-independent solution based on a hash. Here are three commands to save/read/delete the values for one person, using fields named personID:0 through personID:N (note that HMSET takes field/value pairs):
HMSET allpersonshash personID:0 attrb1 personID:1 attrb2 ...
HMGET allpersonshash personID:0 personID:1 personID:2 ... personID:N
HDEL allpersonshash personID:0 personID:1 personID:2 ... personID:N
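From Python, the same hash layout might look like the following sketch. It assumes the redis-py client; only the field-building helper is shown live, with the actual Redis calls commented out since they need a live connection.

```python
# Flatten one person's attributes into hash fields '<personID>:<i>',
# matching the field-naming scheme in the commands above.
def person_fields(person_id, attrs):
    return {"%s:%d" % (person_id, i): attr for i, attr in enumerate(attrs)}

# fields = person_fields('619738273', ['Admiral Ackbar', '383 NeiMoidian Road'])
# r.hset('allpersonshash', mapping=fields)   # save (HMSET is deprecated in redis-py)
# r.hmget('allpersonshash', *fields)         # read
# r.hdel('allpersonshash', *fields)          # delete
```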
A fairly general way to do it would be to use a sorted set with JSON blobs, e.g. (ZADD takes a score and a member; here the numeric user id serves as the score):
ZADD allpersons userid '{"field1": "value1", "field2": "value2"}'
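In redis-py this could be sketched as below; the key name and the choice of the numeric ID as the score are assumptions, and the zadd call is commented out since it needs a live connection.

```python
import json

def person_member(fields):
    # Serialize the attributes to a JSON blob to use as the set member;
    # sort_keys makes the same data always produce the same member string.
    return json.dumps(fields, sort_keys=True)

# member = person_member({'field1': 'value1', 'field2': 'value2'})
# r.zadd('allpersons', {member: 619738273})  # mapping form, redis-py >= 3.0
```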
class Customer:
    def __init__(self, custid, name, addr, city, state, zipcode):
        self.custid = custid
        self.name = name
        self.addr = addr
        self.city = city
        self.state = state
        self.zipcode = zipcode
        self.memberLevel = BasicMember()
        self.monthlySpending = 0
Well, I am able to read the file and split it so that in the dictionary the key is my customer ID and the value is the Customer object. But I can't retrieve the attributes of each object stored in my dictionary. How do I get each object's attributes from the dictionary?
for line in open('customers.dat', 'r'):
    item = line.rstrip()
    intput = line.rstrip().split(',')
    cc = Customer.Customer(*intput)
    s2 = item.split(',', 1)[0]
    d[s2] = [cc]
Sample customer data is:
619738273,Admiral Ackbar,383 NeiMoidian Road,Utapau,MA,01720
118077058,Padme Amidala,846 Amani Road,D'Qar,MA,01508
360513913,Wedge Antilles,700 NeiMoidian Road,D'Qar,MA,01508
while my output after storing each object in the dictionary is:
{'739118188': [<Customer.Customer object at 0x005FF8B0>],
'578148567': [<Customer.Customer object at 0x005FF9B0>]}
So how do I get the attributes of an object stored in the dictionary?
I'm not sure why you wrapped each one in a list, but simply access them as normal:
>>> d['619738273'][0].name
'Admiral Ackbar'
I'd recommend not wrapping each one in a list:
d[s2] = cc
Then you don't need the [0]:
>>> d['619738273'].name
'Admiral Ackbar'
You can also streamline the parsing step:
with open('customers.dat') as f:
    for line in f:
        k, *data = line.split(',')
        d[k] = Customer.Customer(k, *data)
Although it'd be better to use csv, since it looks like you're working with a CSV file:
import csv

with open('customers.dat') as f:
    reader = csv.reader(f)
    for k, *data in reader:
        d[k] = Customer.Customer(k, *data)
I have a list of dictionaries that map different IDs to a central ID, and a document that associates those different IDs with terms. I have written a function that looks up the central ID for each of the different IDs in the document. The goFile is the document: its first column contains an ID and its second column a GO term. The mappingList is a list of dictionaries in which each ID in the goFile is mapped to a main ID.
My expected output is a dictionary with a main ID as a key and a set with the go terms associated with it as value.
def parseGO(mappingList, goFile):
    # open the file
    file = open(goFile)
    # this will be the dictionary that this function returns
    # entries will have as a key an Ensembl ID
    # and the value will be a set of GO terms
    GOdict = {}
    GOset = set()
    for line in file:
        splitline = line.split(' ')
        GO_term = splitline[1]
        value_ID = splitline[0]
        for dict in mappingList:
            if value_ID in dict:
                ENSB_term = dict[value_ID]
        # my best try
        for dict in mappingList:
            for key in GOdict.keys():
                if value_ID in dict and key == dict[value_ID]:
                    GOdict[ENSB_term].add(GO_term)
        GOdict[ENSB_term] = GOset
    return GOdict
My problem is that I now have to add, for each central ID in GOdict, the terms that the document associates with its different IDs. To avoid duplicates I use a set (GOset). How do I do it? All my tries end up with every term mapped to every main ID.
Some sample:
mappingList = [{'1234': 'mainID1', '456': 'mainID2'}, {'789': 'mainID2'}]
goFile:
1234 GOTERM1
1234 GOTERM2
456 GOTERM1
456 GOTERM3
789 GOTERM1
expected output:
GOdict = {'mainID1': set([GOTERM1, GOTERM2]), 'mainID2': set([GOTERM1, GOTERM3])}
First off, you shouldn't use the variable name 'dict', as it shadows the built-in dict class, and will cause you problems at some point.
The following should work for you:
from collections import defaultdict

def parse_go(mapping_list, go_file):
    go_dict = defaultdict(set)
    with open(go_file) as f:  # 'with' guarantees the file gets closed
        for line in f:
            (value_id, go_term) = line.split()  # adjust the split behaviour
                                                # to suit your file format
            for map_dict in mapping_list:
                if value_id in map_dict:
                    go_dict[map_dict[value_id]].add(go_term)
    return go_dict
The code is fairly straightforward, but here's a breakdown anyway.
We use a defaultdict instead of a normal dictionary so we can eliminate all the `if in` / setdefault() boilerplate.
For each line in the file, we check whether the first item (value_id) is a key in any of the mapping dictionaries, and if so, add the line's second item (go_term) to that value_id's set in the dictionary.
EDIT: by request, the same thing without defaultdict(). Assuming go_dict is a normal dictionary (go_dict = {}), the inner loop would look like:

for map_dict in mapping_list:
    if value_id in map_dict:
        esnb_entry = go_dict.setdefault(map_dict[value_id], set())
        esnb_entry.add(go_term)
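As a quick sanity check, the same setdefault pattern run in memory against the sample data from the question reproduces the expected output (file handling is skipped here for brevity):

```python
mapping_list = [{'1234': 'mainID1', '456': 'mainID2'}, {'789': 'mainID2'}]
go_lines = ["1234 GOTERM1", "1234 GOTERM2", "456 GOTERM1",
            "456 GOTERM3", "789 GOTERM1"]

go_dict = {}
for line in go_lines:
    value_id, go_term = line.split()
    for map_dict in mapping_list:
        if value_id in map_dict:
            # one set per main ID, created on first sight
            go_dict.setdefault(map_dict[value_id], set()).add(go_term)

# go_dict == {'mainID1': {'GOTERM1', 'GOTERM2'},
#             'mainID2': {'GOTERM1', 'GOTERM3'}}
```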
I have a .tsv file of text data, named world_bank_indicators.
I have another .tsv file, named world_bank_regions, which contains additional information that I need to append to a list item in my script.
So far, I have code (thanks to some of the good people on this site) that filters the data I need from world_bank_indicators and writes it as a 2D list to the variable mylist. Additionally, I have code that reads the second file into a dictionary. The code is below:
from math import log
import csv
import re

# filehandles for spreadsheets
fhand = open("world_bank_indicators.txt", "rU")
fhand2 = open("world_bank_regions.txt", "rU")

# csv reader objects for the files
reader = csv.reader(fhand, dialect="excel", delimiter="\t")
reader2 = csv.reader(fhand2, dialect="excel", delimiter="\t")

# empty containers for appending data into
# appending to mylist will create a 2D list, or "a list OF lists"
mylist = list()
mylist2 = list()
mydict = dict()
myset = set()
newset = set()

# filter the data by iterating over each row in the reader object
# note that this IGNORES headers; they will need to be appended later
for row in reader:
    if row[1] == "7/1/2000" or row[1] == "7/1/2010":
        # plug columns into specific variables, for easier coding
        # replace "," with empty string for columns that need to be converted to floats
        name = row[0]
        date = row[1]
        pop = row[9].replace(",", '')
        mobile = row[4].replace(",", '')
        health = row[6]
        internet = row[5]
        gdp = row[19].replace(",", '')
        # only append COMPLETE rows of data
        if name != '' and date != '' and pop != '' and mobile != '' and health != '' and internet != '' and gdp != '':
            # declare calculated variables
            mobcap = float(mobile) / float(pop)
            gdplog = log(float(gdp))
            healthlog = log(float(health))
            # re-declare variables as strings, rounding to the 5th decimal place
            # this could have been done in the step above; re-coded here for easier reading
            mobcap = str(round(mobcap, 5))
            gdplog = str(round(gdplog, 5))
            healthlog = str(round(healthlog, 5))
            # put all columns into the 2D list (list of lists)
            newrow = [name, date, pop, mobile, health, internet, gdp, mobcap, gdplog, healthlog]
            mylist.append(newrow)
            myset.add(name)

for row in reader2:
    mydict[row[2]] = row[0]
What I need to do now is:
1. read the country name from the mylist variable,
2. look up that string in the keys of mydict, and
3. append the value of that key back to mylist.
I'm totally stumped on how to do this.
Should I make both data structures dictionaries? I still wouldn't know how to execute the above steps.
Thanks for any insights.
It depends what you mean by "append the value of that key back to mylist". Do you mean, append the value we got from mydict to the list that contains the country name we used to look it up? Or do you mean to append that value from mydict to mylist itself?
The latter would be a strange thing to do, since mylist is a list of lists, whereas the value we are talking about ("row[0]") is a string. I can't intuit why we would append some strings to a list of lists, even though this is what your description says to do. So I'm assuming the former :)
Let's assume that your mylist is actually called "indicators", and mydict is called "region_info"
for indicator in indicators:
    try:
        indicator.append(region_info[indicator[0]])
    except KeyError:
        print "there is no region info for country name %s" % indicator[0]
Another comment on readability. I think that the elements of mylist would be better being dicts than lists. I would do this:
newrow = {"country_name": name,
          "date": date,
          "population": pop,
          # ... etc
          }
because then when you use these things, you can use them by name instead of number, which will be more readable:
for indicator in indicators:
    try:
        indicator["region_info"] = region_info[indicator["country_name"]]
    except KeyError:
        print "there is no region info for country name %s" % indicator["country_name"]
I am reading a text file with python, formatted where the values in each column may be numeric or strings.
When those values are strings, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).
What would be an efficient way to do it?
Use a defaultdict with a default value factory that generates new ids:
import collections
import itertools

ids = collections.defaultdict(itertools.count().next)  # Python 2: bound .next method

ids['a']  # 0
ids['b']  # 1
ids['a']  # 0
When you look up a key in a defaultdict and it's not already present, the defaultdict calls a user-provided default value factory to get the value, and stores it before returning it.
itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.
Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
defaultdict answer updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:

ids = collections.defaultdict(functools.partial(next, itertools.count()))
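A quick demonstration of the Python 3 version:

```python
import collections
import functools
import itertools

# Each missing key triggers the factory, which pulls the next
# integer from the shared counter.
ids = collections.defaultdict(functools.partial(next, itertools.count()))
print(ids['a'])  # 0
print(ids['b'])  # 1
print(ids['a'])  # 0 (already assigned, so no new id)
```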
Create a set, and then add strings to the set. This ensures that strings are not duplicated; you can then use enumerate to get a unique id for each string, and use that ID when writing the file out again.
Here I am assuming the second column is the one you want to scan for text or integers.
import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # adds the string to the set

# print the unique ids for each string
for id, text in enumerate(seen):
    print("{}: {}".format(id, text))
Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can keep a list of sets. Suppose the file has three columns:
import csv

unique_strings = [set(), set(), set()]
with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # not an integer, so it must be a string
                unique_strings[column].add(value)
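Combining this column-wise idea with the defaultdict recipe from the first answer, here is a sketch (the function name is illustrative) that assigns per-column ids in a single pass, so each string gets its id the moment it is first seen:

```python
import collections
import functools
import itertools

def assign_ids(rows):
    """Replace each string value with a per-column id; integers pass through."""
    # One id-assigning defaultdict per column, created on demand,
    # each with its own fresh counter.
    per_column = collections.defaultdict(
        lambda: collections.defaultdict(functools.partial(next, itertools.count())))
    result = []
    for row in rows:
        out_row = []
        for column, value in enumerate(row):
            try:
                out_row.append(int(value))
            except ValueError:
                out_row.append(per_column[column][value])
        result.append(out_row)
    return result

# assign_ids([['x', '1'], ['y', '2'], ['x', 'z']])
# → [[0, 1], [1, 2], [0, 0]]
```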
I am writing a script that looks through my inventory, compares it with a master list of all possible inventory items, and tells me what items I am missing. My goal is a .csv file where the first column contains a unique key integer and then the remaining several columns would have data related to that key. For example, a three row snippet of my end-goal .csv file might look like this:
100001,apple,fruit,medium,12,red
100002,carrot,vegetable,medium,10,orange
100005,radish,vegetable,small,10,red
The data for this is being drawn from a couple of sources. First, a query to an API server gives me a list of keys for items that are in inventory. Second, I read a .csv file into a dict that matches keys with item names for all possible keys. A snippet of the first 5 rows of this .csv file might look like this:
100001,apple
100002,carrot
100003,pear
100004,banana
100005,radish
Note how any key in my list of inventory will be found in this two-column .csv file of all keys and their corresponding item names; this full list minus my inventory on hand yields what I'm looking for (the inventory I need to get).
So far I can get a .csv file that contains just the keys and item names for the items that I don't have in inventory. Given a list of inventory on hand like this:
100003,100004
A snippet of my resulting .csv file looks like this:
100001,apple
100002,carrot
100005,radish
This means that I have pear and banana in inventory (so they are not in this .csv file.)
To get this I have a function to get an item name when given an item id, which looks like this:

def getNames(id_to_name, ids):
    return [id_to_name[id] for id in ids]
Then a server API call returns my inventory as a list of integer keys, and I've run it like this:
invlist = ServerApiCallFunction(AppropriateInfo)
A third function takes this invlist as its input and returns a dict of keys (the item id) and names for the items I don't have. It also writes the information of this dict to a .csv file. I am using the set1 - set2 method to do this. It looks like this:
def InventoryNumbers(inventory):
    with open(csvfile, 'w') as c:
        c.write('InvName' + ',InvID' + '\n')
    missinginvnames = []
    with open("KeyAndItemNameTwoColumns.csv", "rb") as fp:
        reader = csv.reader(fp, skipinitialspace=True)
        fp.readline()  # skip header
        invidsandnames = {int(id): str.upper(name) for id, name in reader}
    invids = set(invidsandnames.keys())
    invnames = set(invidsandnames.values())
    invonhandset = set(inventory)
    missinginvidsset = invids - invonhandset
    missinginvids = list(missinginvidsset)
    missinginvnames = getNames(invidsandnames, missinginvids)
    missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
    print missinginvnameswithids
    with open(csvfile, 'a') as c:
        for invname, invid in missinginvnameswithids.iteritems():
            c.write(invname + ',' + str(invid) + '\n')
    return missinginvnameswithids
Which I then call like this:
InventoryNumbers(invlist)
With that explanation, now on to my question here. I want to expand the data in this output .csv file by adding in additional columns. The data for this would be drawn from another .csv file, a snippet of which would look like this:
100001,fruit,medium,12,red
100002,vegetable,medium,10,orange
100003,fruit,medium,14,green
100004,fruit,medium,12,yellow
100005,vegetable,small,10,red
Note how this does not contain the item name (so I have to pull that from a different .csv file that just has the two columns of key and item name) but it does use the same keys. I am looking for a way to bring in this extra information so that my final .csv file will not just tell me the keys (which are item ids) and item names for the items I don't have in stock but it will also have columns for type, size, number, and color.
One option I've looked at is the defaultdict piece from collections, but I'm not sure if this is the best way to go about what I want to do. If I did use this method I'm not sure exactly how I'd call it to achieve my desired result. If some other method would be easier I'm certainly willing to try that, too.
How can I take my dict of keys and corresponding item names for items that I don't have in inventory and add to it this extra information in such a way that I could output it all to a .csv file?
EDIT: As I typed this up it occurred to me that I might make things easier on myself by creating a new single .csv file that would have data in the form key,item name,type,size,number,color (basically just copying the item-name column into the .csv that already has the other information for each key). This way I would only need to draw from one .csv file rather than from two. Even if I did this, though, how would I go about making my desired .csv file based on only those keys for items not in inventory?
ANSWER: I posted another question here about how to implement the solution I accepted (because it was giving me a value error, since my dict values were strings rather than sets to start with) and I ended up deciding that I wanted a list rather than a set (to preserve the order). I also ended up adding the column with item names to my .csv file that had all the other data, so that I only had to draw from one .csv file. That said, here is what this section of code now looks like:
MyDict = {}
infile = open('FileWithAllTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    # missinginvids is the list I was using as the keys for my dict, which I was
    # zipping together with a corresponding list of item names to make my dict before
    if int(spl_line[0]) in missinginvids:
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print MyDict
It sounds like what you need is a dict mapping ints to sets, i.e.:

MyDict = {100001: set(['apple']), 100002: set(['carrot'])}

You can add with update:

MyDict[100001].update(['fruit'])

which would give you: {100001: set(['apple', 'fruit']), 100002: set(['carrot'])}

Also, if you had a list of carrot's attributes, ['vegetable', 'orange'], you could say MyDict[100002].update(['vegetable', 'orange'])
and get: {100001: set(['apple', 'fruit']), 100002: set(['carrot', 'vegetable', 'orange'])}

Does this answer your question?
EDIT:
to read into CSV...
infile = open('MyFile.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    key = int(spl_line[0])  # convert before the lookup so the key types match
    if key in MyDict:
        MyDict[key].update(spl_line[1:])
This isn't an answer to the question, but here is a possible way of simplifying your current code.
This:
invids = set(invidsandnames.keys())
invnames = set(invidsandnames.values())
invonhandset = set(inventory)
missinginvidsset = invids - invonhandset
missinginvids = list(missinginvidsset)
missinginvnames = getNames(invidsandnames, missinginvids)
missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
Can be replaced with:
invonhandset = set(inventory)
missinginvnameswithids = {name: id for id, name in invidsandnames.iteritems() if id not in invonhandset}
Or:
invonhandset = set(inventory)
for key in invidsandnames.keys():
    if key in invonhandset:
        del invidsandnames[key]
missinginvnameswithids = invidsandnames

(Note that this version is keyed by id rather than by name.)
Have you considered making a temporary relational database? Python has sqlite support baked in, and for reasonable numbers of items I don't think you would have a performance issue.
I would turn each CSV file and the result from the web API into a table (one table per data source). You can then do everything you want with some SQL queries and joins. Once you have the data you want, you can dump it back to CSV.
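A minimal in-memory sketch of that approach using the standard library's sqlite3; the table and column names are illustrative, based on the sample rows in the question:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE attrs (id INTEGER PRIMARY KEY, type TEXT, '
            'size TEXT, number INTEGER, color TEXT)')
cur.execute('CREATE TABLE inventory (id INTEGER PRIMARY KEY)')

# In practice these rows would come from csv.reader / the API call.
cur.executemany('INSERT INTO names VALUES (?, ?)',
                [(100001, 'apple'), (100002, 'carrot'), (100003, 'pear')])
cur.executemany('INSERT INTO attrs VALUES (?, ?, ?, ?, ?)',
                [(100001, 'fruit', 'medium', 12, 'red'),
                 (100002, 'vegetable', 'medium', 10, 'orange'),
                 (100003, 'fruit', 'medium', 14, 'green')])
cur.executemany('INSERT INTO inventory VALUES (?)', [(100003,)])

# Missing items = everything in the master list that is not on hand,
# joined with its attributes.
cur.execute('''SELECT n.id, n.name, a.type, a.size, a.number, a.color
               FROM names n JOIN attrs a ON n.id = a.id
               WHERE n.id NOT IN (SELECT id FROM inventory)
               ORDER BY n.id''')
missing = cur.fetchall()
# missing rows are then ready to dump back out with csv.writer
```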