Assign strings to IDs in Python

I am reading a text file with python, formatted where the values in each column may be numeric or strings.
When a value is a string, I need to assign it a unique ID. The ID must be unique across all the strings in the same column, and the same ID must be assigned if the same string appears again in that column.
What would be an efficient way to do it?

Use a defaultdict with a default value factory that generates new ids:
import collections, itertools

ids = collections.defaultdict(itertools.count().next)
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0
When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.
Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.

defaultdict answer updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:
ids = collections.defaultdict(functools.partial(next, itertools.count()))
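Applied to the question's per-column requirement, a minimal sketch (the file name, delimiter, and the int() test for numeric cells are assumptions):
import collections, csv, functools, itertools

# one independent counter per column, created on demand
column_ids = collections.defaultdict(
    lambda: collections.defaultdict(functools.partial(next, itertools.count())))

with open('somefile.txt') as f:
    for row in csv.reader(f, delimiter=','):
        for col, value in enumerate(row):
            try:
                int(value)  # numeric cells stay as they are
            except ValueError:
                row[col] = column_ids[col][value]  # same string -> same ID within a column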

Create a set, and then add strings to the set. This ensures that strings are not duplicated; then you can use enumerate to get a unique id for each string. Use this ID when you are writing the file out again.
Here I am assuming the second column is the one you want to scan for text or integers.
import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # adds string to set

# print the unique ids for each string
for id, text in enumerate(seen):
    print("{}: {}".format(id, text))
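To use those IDs when writing the file back out, you can freeze the enumeration into a dict first; note that iteration order over a set is arbitrary, so build the mapping once and reuse it:
string_ids = {text: id for id, text in enumerate(seen)}
string_ids['some string']  # its unique ID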
Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can keep a list of sets. Suppose the file has three columns:
import csv

unique_strings = [set(), set(), set()]
with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # not an integer, so it must be a string
                unique_strings[column].add(value)
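From there, a {string: id} mapping per column can be built the same way (a sketch):
ids_per_column = [{text: id for id, text in enumerate(strings)}
                  for strings in unique_strings]
# ids_per_column[2]['some string'] would be the ID of 'some string' in column 2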

Related

How to extract all occurrences of a JSON object that share a duplicate key:value pair?

I am writing a python script that reads a large JSON file containing data from an API and iterates through all the objects. I want to extract all objects that share a specific duplicate "key:value" pair and save them to a separate JSON file.
Currently, I have it almost doing this, however the one flaw in my code that I cannot fix is that it skips the first occurrence of the duplicate object, and does not add it to my dupObjects list. I have an OrderedDict keeping track of unique objects, and a regular list for duplicate objects. I know this means that when I add the second occurrence, I must add the first (unique) object, but how would I create a conditional statement that only does this once per unique object?
This is my code at the moment:
from collections import OrderedDict
import json

with open('input.json') as data:
    data = json.load(data)

uniqueObjects = OrderedDict()
dupObjects = list()

for d in data:
    value = d["key"]
    if value in uniqueObjects:
        # dupObjects.append(uniqueObjects[value])
        dupObjects.append(d)
    if value not in uniqueObjects:
        uniqueObjects[value] = d

with open('duplicates.json', 'w') as g:
    json.dump(dupObjects, g, indent=4)
The commented line is where I tried to just add the object from the OrderedDict to my list, but that causes it to be added as many times as there are duplicates. I only want it to be added one time.
Edit:
There are several unique objects that have duplicates. I'm looking for some conditional statement that can add the first occurrence of an object that has duplicates, once per unique object.
You could group by key.
Using itertools (note that groupby only groups adjacent elements, so the data must be sorted by the same key first):
import itertools

def by_key(element):
    return element["key"]

grouped_by_key = itertools.groupby(sorted(data, key=by_key), key=by_key)
Then it's just a matter of finding the groups that have more than one element.
For details check: https://docs.python.org/3/howto/functional.html#grouping-elements
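For example, collecting only the groups with more than one element might look like this (a sketch building on grouped_by_key from above):
duplicate_groups = []
for key, group in grouped_by_key:
    members = list(group)  # materialize the group iterator
    if len(members) > 1:
        duplicate_groups.append(members)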
In this line you forgot .keys(), so you skip the values you need:
if value in uniqueObjects.keys():
And this line
if value not in uniqueObjects.keys():
Edit #1
My mistake :)
You need to add the first duplicate object from uniqueObjects in the first if:
if value in uniqueObjects:
    if uniqueObjects[value] != -1:
        dupObjects.append(uniqueObjects[value])
        uniqueObjects[value] = -1
    dupObjects.append(d)
Edit #2
Try this option; it will write only the first occurrence to duplicates:
if value in uniqueObjects:
    if uniqueObjects[value] != -1:
        dupObjects.append(uniqueObjects[value])
        uniqueObjects[value] = -1
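The same "copy the first occurrence only once per unique object" logic can also be written with a helper set instead of the -1 sentinel; a sketch of that variant, keeping every duplicate as in Edit #1:
first_copied = set()  # keys whose first occurrence is already in dupObjects
for d in data:
    value = d["key"]
    if value in uniqueObjects:
        if value not in first_copied:
            dupObjects.append(uniqueObjects[value])  # first occurrence, added once
            first_copied.add(value)
        dupObjects.append(d)
    else:
        uniqueObjects[value] = d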

How to duplicate a python DictReader object?

I'm currently trying to modify a DictReader object to strip all the spaces for every cell in the csv. I have this function:
def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            value.strip()
    return csv_reader
However, the issue with this function is that the reader it returns has already been iterated through, so I can't re-iterate through it (as I would be able to if I had just called csv.DictReader(input_file)). I want to be able to create a new object that is exactly like the DictReader (i.e., it has the fieldnames attribute too), but with all fields stripped of whitespace. Any tips on how I can accomplish this?
Two things: firstly, the reader is a lazy iterator object which is exhausted after one full run (meaning it will be empty once you return it at the end of your function!), so you have to either collect the modified rows in a list and return that list in the end or make the function a generator producing the modified rows. Secondly, str.strip() does not modify strings in-place (strings are immutable), but returns a new stripped string, so you have to rebind that new value to the old key:
import csv

def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            row[key] = value.strip()  # reassign the stripped value
        yield row
Now you can use that generator function like you did the DictReader:
reader = read_the_csv(input_file)
for row in reader:
    # process data, which is already stripped
I prefer using inheritance; make a subclass of DictReader as follows:
from csv import DictReader
from collections import OrderedDict

class MyDictReader(DictReader):
    def __next__(self):
        return OrderedDict((k, v.strip())
                           for k, v in super().__next__().items())
Usage, just as DictReader:
with open('../data/risk_level_model_5.csv') as input_file:
    for row in MyDictReader(input_file):
        print(row)

python loop through a dictionary to see if values exist

I am trying to loop through a python dictionary to see if the values I am getting from a csv file already exist in it. If the values do not exist, I want to add them to the dictionary and then append the dictionary to a list.
I am getting the error list indices must be integers, not str.
example input
first name last name
john smith
john smith
example output
first_name john last name smith
user_list = []
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary = {
                'first_name': row['first_name'],
                'last_name': row['last_name']
            }
            user_list.append(user_dictionary)
Currently, your code is creating a new dictionary on every iteration of the for-loop. If each value of the dictionary is a list, then you can append to that list via the key:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_dictionary = {"first_name": ["name1", "name2", ...], "last_name": ["name3", "name4", ...]}
    for row in reader:
        if row['first_name'] not in user_dictionary['first_name'] and row['last_name'] not in user_dictionary['last_name']:
            user_dictionary["first_name"].append(row['first_name'])
            user_dictionary['last_name'].append(row['last_name'])
Generally, you can use a membership test (x in y) on dict.values() view to check if the value already exists in your dictionary.
However, if you are trying to add all unique users from your CSV file to a list of users, that has nothing to do with dictionary values testing, but a list membership testing.
Instead of iterating over the complete list each time for a slow membership check, you can use a set that will contain the "ids" of all users added to the list and enable a fast O(1) (amortized) membership check:
with open(input_path, 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    user_list = []
    user_set = set()
    for row in reader:
        user_id = (row['first_name'], row['last_name'])
        if user_id not in user_set:
            user = {
                'first_name': row['first_name'],
                'last_name': row['last_name'],
                # something else ...
            }
            user_list.append(user)
            user_set.add(user_id)
The error "list indices must be integers, not str" makes the problem clear: On the line that throws the error, you have a list that you think is a dict. You try to use a string as a key for it, and boom!
You don't give enough information to guess which dict it is: it could be user_dictionary, or it could be that you're using csv.reader and not csv.DictReader as you say you do. It could even be something else; there's no telling what else you left out of your code. But it's a list that you're using as if it's a dict.
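For illustration, indexing a list with a string reproduces exactly that error:
row = ['john', 'smith']  # what csv.reader yields: a list, not a dict
row['first_name']        # TypeError: list indices must be integers, not str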

Dynamically naming tuples for redis

I have a csv file in which each line contains a person's ID # and then a bunch of attributes. I want to be able to create a tuple for each person that contains all their attributes and then name the tuple some variation of their ID #.
All these tuples will then be added to a set in redis for storage.
I can't seem to figure out how to create a tuple that is named after the person's ID #.
I know it's not best practice to dynamically name variables, but I would rather not put all the tuples into a list or set first and then move them into a redis set (which is a must); it just seems inefficient and cumbersome.
This is the code I have now:
with open('personlist.csv', 'rb') as f:
    for line in f:
        row = line.split(',')
        personID = row[0]
        attrb1 = row[1]
        attrb2 = row[2]
        attrb3 = row[3]
        # Need to name tuple here and define as (attrb1, attrb2, attrb3)
        r.lpush('allpersonslist', tuple)
This example needs additional code to function. I'm assuming you are using a redis API such as redis-py. The variable r is an open connection to redis.
import pickle

with open('personlist.csv', 'rb') as f:
    for line in f:
        row = line.split(',')
        personID = row[0]
        attrb1 = row[1]
        attrb2 = row[2]
        attrb3 = row[3]
        # put the attributes in a tuple (avoid naming it "tuple", which shadows the builtin)
        person = (attrb1, attrb2, attrb3)
        # serialize the tuple before storing it; personID is a string, so use %s
        r.set("person/%s" % personID, pickle.dumps(person, -1))

def getPerson(Id):
    return pickle.loads(r.get("person/%s" % Id))
You can call getPerson(5) to return the tuple associated with a person of ID 5.
If each person has at most N attributes, there is a language-independent solution based on a redis hash. Here are three commands to save/read/delete the values for a person:
HMSET 'allpersonshash' personID:0 personID:1 ......
HMGET 'allpersonshash' personID:0 personID:1 personID:2 ... personID:N
HDEL 'allpersonshash' personID:0 personID:1 personID:2 ... personID:N
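With redis-py, that hash scheme could look like the sketch below (r is an open connection; the field layout personID:index follows the commands above, and the function names are made up):
def save_person(person_id, attributes):
    for i, attr in enumerate(attributes):
        r.hset('allpersonshash', '%s:%d' % (person_id, i), attr)

def read_person(person_id, n_attributes):
    fields = ['%s:%d' % (person_id, i) for i in range(n_attributes)]
    return r.hmget('allpersonshash', fields)

def delete_person(person_id, n_attributes):
    fields = ['%s:%d' % (person_id, i) for i in range(n_attributes)]
    r.hdel('allpersonshash', *fields)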
A fairly general way to do it would be to use sorted sets with JSON blobs, e.g.:
ZADD allpersons <userid> '{field1:value1,field2:value2}'
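In redis-py terms that might look like the following sketch, using the numeric person ID as the score ('allpersons' is a hypothetical key name):
import json

person = {'attrb1': attrb1, 'attrb2': attrb2, 'attrb3': attrb3}
r.zadd('allpersons', {json.dumps(person): int(personID)})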

looking up values and adding to data structure

I have a .tsv file of text data, named world_bank_indicators.
I have another tsv file, which contains additional information that I need to append to a list item in my script. That file is named world_bank_regions.
So far, I have code (thanks to some of the good people on this site) that filters the data I need from world_bank_indicators and writes it as a 2D list to the variable mylist. Additionally, I have code that reads in the second file as a dictionary. The code is below:
from math import log
import csv
import re

# filehandles for spreadsheets
fhand = open("world_bank_indicators.txt", "rU")
fhand2 = open("world_bank_regions.txt", "rU")

# csv reader objects for files
reader = csv.reader(fhand, dialect="excel", delimiter="\t")
reader2 = csv.reader(fhand2, dialect="excel", delimiter="\t")

# empty list for appending data into
# appending into this will create a 2d list, or "a list OF lists"
mylist = list()
mylist2 = list()
mydict = dict()
myset = set()
newset = set()

# filters data by iterating over each row in the reader object
# note that this IGNORES headers. This will need to be appended later
for row in reader:
    if row[1] == "7/1/2000" or row[1] == "7/1/2010":
        # plug columns into specific variables, for easier coding
        # replaces "," with empty string for columns that need to be converted to floats
        name = row[0]
        date = row[1]
        pop = row[9].replace(",", '')
        mobile = row[4].replace(",", '')
        health = row[6]
        internet = row[5]
        gdp = row[19].replace(",", '')
        # only appends COMPLETE rows of data
        if name != '' and date != '' and pop != '' and mobile != '' and health != '' and internet != '' and gdp != '':
            # declare calculated variables
            mobcap = float(mobile) / float(pop)
            gdplog = log(float(gdp))
            healthlog = log(float(health))
            # re-declare variables as strings, rounded to 5 decimal places
            # this could have been done once in the step above, merely re-coded here for easier reading
            mobcap = str(round(mobcap, 5))
            gdplog = str(round(gdplog, 5))
            healthlog = str(round(healthlog, 5))
            # put all columns into 2d list (list of lists)
            newrow = [name, date, pop, mobile, health, internet, gdp, mobcap, gdplog, healthlog]
            mylist.append(newrow)
            myset.add(name)

for row in reader2:
    mydict[row[2]] = row[0]
What I need to do now is:
1. read the country name from the mylist variable,
2. look up that string in the keys of mydict, and
3. append the value of that key back to mylist.
I'm totally stumped on how to do this.
Should I make both data structures dictionaries? I still wouldn't know how to execute the above steps.
Thanks for any insights.
It depends what you mean by "append the value of that key back to mylist". Do you mean, append the value we got from mydict to the list that contains the country name we used to look it up? Or do you mean to append that value from mydict to mylist itself?
The latter would be a strange thing to do, since mylist is a list of lists, whereas the value we are talking about ("row[0]") is a string. I can't intuit why we would append some strings to a list of lists, even though this is what your description says to do. So I'm assuming the former :)
Let's assume that your mylist is actually called "indicators", and mydict is called "region_info":
for indicator in indicators:
    try:
        indicator.append(region_info[indicator[0]])
    except KeyError:
        print "there is no region info for country name %s" % indicator[0]
Another comment on readability: I think the elements of mylist would be better as dicts than lists. I would do this:
newrow = {"country_name": name,
          "date": date,
          "population": pop,
          # ... etc
          }
because then when you use these things, you can use them by name instead of number, which will be more readable:
for indicator in indicators:
    try:
        indicator["region_info"] = region_info[indicator["country_name"]]
    except KeyError:
        print "there is no region info for country name %s" % indicator["country_name"]
