Join two CSV files in python using dictreader

Join two CSV files in python using dictreader - python

I realise the info to answer this question is probably already on here, but as a python newby I've been trying to piece together the info for a few weeks now and I'm hitting some trouble.
this question Python "join" function like unix "join" answers how to do a join on two lists easily, but the problem is that dictreader objects are iterables and not straightforward lists, meaning that there's an added layer of complications.
I basically am looking for an inner join on two CSV files, using the dictreader object. Here's the code I have so far:
def test(dictreader1, dictreader2):
matchedlist = []
for dictline1 in dictreader1:
for dictline2 in dictreader2:
if dictline1['member']=dictline2['member']:
matchedlist.append(dictline1, dictline2)
else: continue
return matchedlist
This is giving me an error at the if statement, but more importantly, I don't seem to be able to access the ['member'] element of the dictionary within the iterable, as it says it has no attribute "getitem".
Does anyone have any thoughts on how to do this? For reference, I need to keep the lists as iterables because each file is too big to fit in memory. The plan is to control this entire function within another for loop that only feeds it a few lines at a time to iterate over. So it will read one line of the left hand file, iterate over the whole second file to find a member field that matches and then join the two lines, similar to an SQL join statement.
Thanks for any help in advance, please forgive any obvious errors on my part.

A few thoughts:
Replace the = with ==. The latter is used for equality tests; the former for assignments.
Add a line a the beginning, dictreader2 = list(dictreader2). That will make it possible to loop over the dictionary entries more than once.
Add a second pair of parenthese to matchedlist.append((dictline1, dictline2)). The list.append method takes just one argument, so you want to create a tuple out of dictline1 and dictline2.
The final else: continue is unnecessary. A for-loop will automatically loop for you.
Use a print statement or somesuch to verify that dictline1 and dictline2 are both dictionary objects that have member as a key. It could be that your function is correct, but is being called with something other than a dictreader object.
Here is a worked out example using a list of dicts as input (similar to what a DictReader would return):
>>> def test(dictreader1, dictreader2):
dictreader2 = list(dictreader2)
matchedlist = []
for dictline1 in dictreader1:
for dictline2 in dictreader2:
if dictline1['member'] == dictline2['member']:
matchedlist.append((dictline1, dictline2))
return matchedlist
>>> dr1 = [{'member': 2, 'value':'abc'}, {'member':3, 'value':'def'}]
>>> dr2 = [{'member': 4, 'tag':'t4'}, {'member':3, 'tag':'t3'}]
>>> test(dr1, dr2)
[({'member': 3, 'value': 'def'}, {'member': 3, 'tag': 't3'})]
A further suggestion is to combine the two dictionaries into a single entry (this is closer to what an SQL inner join would do):
>>> def test(dictreader1, dictreader2):
dictreader2 = list(dictreader2)
matchedlist = []
for dictline1 in dictreader1:
for dictline2 in dictreader2:
if dictline1['member'] == dictline2['member']:
entry = dictline1.copy()
entry.update(dictline2)
matchedlist.append(entry)
return matchedlist
>>> test(dr1, dr2)
[{'member': 3, 'tag': 't3', 'value': 'def'}]
Good luck with your project :-)

Related

How to continue iteration of JSON object/list with "for" loop nested in another "for" loop?

I'm trying to iterate through a JSON object/list to get the prices from the list. It works fine for the first "for" loop, but then for my program I need to have another "for" loop nested inside the first loop that effectively continues iterating again where the first loop left off.
edit: I apologise as I probably wasn't very clear as why I'm structuring this code this way. It probably isn't the best solution as I'm not very experienced with python but I'm trying to simulate the looping flow of price data based on historical prices from this JSON. So when the price is equal to something in the first loop, I was trying to continue the flow of data from where the last loop left off (the last price it read) in a nested loop with different if statements, now that the first one has been triggered. I understand this is probably a very poor way of getting around this problem and I'd be happy to hear suggestions on how to do it in a much simpler way.
for i in json:
time.sleep(1)
price = i['p']
if price == "0.00000183":
print("foo")
for i in json:
time.sleep(1)
price = float(i['p'])
if price == "0.00000181":
print("sold")
else:
continue
elif price == "0.00000185":
print ("sold")
break
else:
continue
Example segment of the JSON list:
[
{
"a":4886508,
"p":"0.00000182",
"q":"10971.00000000",
"f":5883037,
"l":5883037,
"T":1566503952430,
"m":1,
"M":1
},
{
"a":4886509,
"p":"0.00000182",
"q":"10971.00000000",
"f":5883038,
"l":5883038,
"T":1566503953551,
"m":1,
"M":1
},
{
"a":4886510,
"p":"0.00000182",
"q":"10971.00000000",
"f":5883039,
"l":5883039,
"T":1566503954431,
"m":1,
"M":1
}
]

You have 3 nested dicts inside the list and you’re using one value price to store all 3; therefore overwriting the value for every iteration.
I suggest using a simple list comprehension to get the price for each sub-dict.
[i['p'] for i in json]
This results in:
['0.00000182', '0.00000182', '0.00000182']
Although as a forewarning there is a builtin module called json so should you want to use that in the future you will need to use a more descriptive name; as well as your other naming conventions are very non informative in the first place consider better conventions.
P.S the price values are strings thus it’s meaningless to compare to int’s

Your sample code doesn't make much sense, so it's not clear what you're try to accomplish.
Ignoring that, if you use json.load() to read your file, the result will be a Python list of dictionaries (which are equivalent to JSON objects). Afterwards you can then just iterate through this list — you don't need to do anything else to go to the next dictionary/object.
Within each dictionary/object you can randomly access the key/value pairs it contains. If yoy want to iterate over them for some reason, you can used obj.keys(), obj.values(), or obj.items(), as you can with any dictionary.
Here's a small example of what I'm saying:
import json
filename = 'list_sample.json'
with open(filename) as json_file:
object_list = json.load(json_file)
for obj in object_list:
if obj['p'] == "0.00000183":
print("foo")
price = float(obj['p'])
print(price)

Use pandas
import pandas as pd
The question was how do I iterate through a json
I think the real question and goal is, how do I access my data from this object and gather insight from it.
That's the power of pandas. Now all of the information is available and easy to use.
If the data is inside a file, just as it's shown:
df = pd.read_json('file.json')
If the data is just a named object (e.g. data = [{...}, {...}, {...}])
df = pd.DataFrame(data)
Output of df:
Get p, or any of the other information:
p_list = df.p.tolist()
print(p_list)
>>> ['0.00000182', '0.00000182', '0.00000182']

You can write two loops using the same iterator object. You just need to create the iterator from your json list of dictionaries:
iterator = iter(json_data)
for i in iterator:
# stuff
for j in iterator: # start looping from where we left off
... # more stuff
Note that this only makes sense if the inner loop will be interrupted at some point by a break statement. Otherwise it will consume all of the iterator's contents and the outer loop will have nothing left. If you were expecting that, you should probably just break out of the outer loop and write the (formerly) inner loop at the same level:
for i in iterator:
if something():
break
for j in iterator: # iterate over the rest of the iterator's values
...
Note that I used json_data in my example code, rather than your variable name json. Using json as a variable name may be a bad idea because it may be the name of the module you're using to parse your data. Reusing the name may get confusing (json = json.load(filename) for instance can only be run once).

Apologies guys, I should've made it more clear that what I was looking for was probably simpler than what I made it out to be or how I described it. I basically just wanted to continue the second loop from where the first left off, with something like this which I have now found to be an alright solution.
elementCount = 0
for i in json:
time.sleep(1)
price = i['p']
elementCount = elementCount + 1
if price == "0.00000183":
print("foo")
for i in json[elementCount:]:
time.sleep(1)
price = float(i['p'])
if price == "0.00000181":
print("sold")
else:
continue
elif price == "0.00000185":
print ("sold")
break
else:
continue
I basically just didn't understand the format for continuing the iteration from a certain element in the list, like this [40:]. Thanks for all your answers and help.

Changing Json from API with Python

I have a json code from API and I want to get new chat members with the code below but I only get the first two results and not the last (Tester). Why? It should itereate through the whole json file, shouldn't it?
r = requests.get("https://api.../getUpdates").json()
chat_members = []
a = 0
for i in r:
chat_members.append(r['result'][a]['message']['new_chat_members'][0]['last_name'])
a = a + 1
Json here:
{"ok":true,"result":[{"update_id":213849278,
"message":{"message_id":37731,"from":{"id":593029363,"is_bot":false,"first_name": "#tutu"},"chat":{"id":-1001272017174,"title":"tester account","username":"v_glob","type":"supergroup"},"date":1537470595,"new_chat_participant":{"id":593029363,"is_bot":false,"first_name":"tutu "},"new_chat_member":{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutupeng"},"new_chat_members":[{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutu"}]}},{"update_id":213849279,
"message":{"message_id":37732,"from":{"id":658150956,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"chat":{"id":-10012720,"title":"v glob OFFICIAL","username":"v_glob","type":"supergroup"},"date":1537484441,"new_chat_participant":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_member":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_members":[{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"}]}},{"update_id":213849280,
"message":{"message_id":12,"from":{"id":696749142,"is_bot":false,"first_name":"daniel","language_code":"cs-cz"},"chat":{"id":696749142,"first_name":"daniel","type":"private"},"date":1537537013,"text":"/stat","entities":[{"offset":0,"length":5,"type":"bot_command"}]}},{"update_id":213849281,
"message":{"message_id":37740,"from":{"id":669620,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"chat":{"id":-100127201,"title":"test account","username":"v_glob","type":"supergroup"},"date":1537537597,"new_chat_participant":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_member":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_members":[{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"}]}}]}

Because you iterate over the entire response dict. The top level only has two items, so that's what you iterate over. Note that you don't actually use the iterator variable, and you have a completely unnecessary separate counter.
Instead, you should be iterating over the result dict:
for result in r['result']:
if "new_chat_members" in result['message']:
chat_members.append(result['message']['new_chat_members'][0]['last_name'])

A colleague of mine has come up with a solution:
for i in l['result']:
chat_members.append(i['message']['new_chat_member']['first_name'])
To sum up: Iterate through 'result' with no positional arguments

Django turn lists into one list

I want to put this data into one list so i can sort it by timestamp. I tried itertools chain but that didn't really work.
Thank you for your help :)
I'm very bad at making clear what i want to do so im sorry upfront if this takes some explaning.
If i try a chain i get the value back like this.
I want to display it on the html page like this :
date, name , rating, text (newline)
likes comments
Which would work the way i did it but if i want to sort it by time, it wouldn't work so i tried to think of a way to make it into a sortable list. Which can be displayed. Is that understandable ?
['Eva Simia', 'Peter Alexander', {'scale': 5, 'value': 5}, {'scale': 5, 'value': 5}, 1, 0, 1, 0]
it should look like this:
['Peter Alexander, scale:5, value:5, 1,0]
['Eva Simia, scale:5, value:5, 1,0]
for i in user:
name.append(i['name'])
for i in next_level:
rating_values.append(i['rating'])
for i in comment_values:
comments_count.append(i['count'])
for i in likes_values:
likes_count.append(i['count'])
for s in rating_values:
ratings.append(s['value'])
for s in date:
ratings.append(s['date'])
ab = itertools.chain([name], [rating_values],
[comments_count], [likes_values],
[comment_values], [date])
list(ab)

Updated after clarification:
The problem as I understand it:
You have a dataset that is split into several lists, one list per field.
Every list has the records in the same order. That is, user[x]'s rating value is necessarily rating_values[x].
You need to merge that information into a single list of composite items. You'd use zip() for that:
merged = zip(user, next_level, comment_values, likes_values, rating_values, date)
# merged is now [(user[0], next_level[0], comment_values[0], ...),
# (user[1], next_level[1], comment_values[1], ...),
# ...]
From there, you can simply sort your list using sorted():
result = sorted(merged, key=lambda i: (i[5], i[0]))
The key argument must be a function. It is given each item in the list once, and must return the key that will be used to compare items. Here, we build a short function on the fly, that returns the date and username, effectively telling things will be sorted, first by date and if dates are equal, then by username.
[Past answer about itertools.chain, before the clarification]
ab = list(itertools.chain(
(i['name'] for i in user),
(i['rating'] for i in next_level),
(i['count'] for i in comment_values),
(i['count'] for i in likes_values),
(i['value'] for i in rating_values),
(i['date'] for i in date),
))
The point of using itertools.chain is usually to avoid needless copies and intermediary objects. To do that, you want to pass it iterators.
chain will return an iterator that will iterate through each of the given iterators, one at a time, moving to the next iterator when current stops.
Note every iterator has to be wrapped in parentheses, else python will complain. Do not make it square brackets, at it would cause an intermediary list to be built.

You can join list by simply using +.
l = name + rating_values + comments_count + ...

date, rating_values,likes_values,comment_values,next_level,user = (list(t) for t in zip(*sorted(zip(date, rating_values,likes_values,comment_values,next_level,user))))

Parse string with three-level delimitation into dictionary

I've found how to split a delimited string into key:value pairs in a dictionary elsewhere, but I have an incoming string that also includes two parameters that amount to dictionaries themselves: parameters with one or three key:value pairs inside:
clientid=b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0&keyid=987654321&userdata=ip:192.168.10.10,deviceid:1234,optdata:75BCD15&md=AMT-Cam:avatar&playbackmode=st&ver=6&sessionid=&mk=PC&junketid=1342177342&version=6.7.8.9012
Obviously these are dummy parameters to obfuscate proprietary code, here. I'd like to dump all this into a dictionary with the userdata and md keys' values being dictionaries themselves:
requestdict {'clientid' : 'b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0', 'keyid' : '987654321', 'userdata' : {'ip' : '192.168.10.10', 'deviceid' : '1234', 'optdata' : '75BCD15'}, 'md' : {'Cam' : 'avatar'}, 'playbackmode' : 'st', 'ver' : '6', 'sessionid' : '', 'mk' : 'PC', 'junketid' : '1342177342', 'version' : '6.7.8.9012'}
Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries? What would the syntax be? If not, I suppose I'll have to split by & and then check & handle splits that contain : but even then I can't figure out the syntax. Can someone help? Thanks!

I basically took Kyle's answer and made it more future-friendly:
def dictelem(input):
parts = input.split('&')
listing = [part.split('=') for part in parts]
result = {}
for entry in listing:
head, tail = entry[0], ''.join(entry[1:])
if ':' in tail:
entries = tail.split(',')
result.update({ head : dict(e.split(':') for e in entries) })
else:
result.update({head: tail})
return result

Here's a two-liner that does what I think you want:
dictelem = lambda x: x if ':' not in x[1] else [x[0],dict(y.split(':') for y in x[1].split(','))]
a = dict(dictelem(x.split('=')) for x in input.split('&'))

Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries?
Of course you can, but (a) you probably don't want to, because nested comprehensions beyond two levels tend to get unreadable, and (b) this super-simple syntax won't work for cases like yours, where only some of the data can be turned into a dict.
For example, what should happen with 'PC'? Do you want to make that into {'PC': None}? Or maybe the set {'PC'}? Or the list ['PC']? Or just leave it alone? You have to decide, and write the logic for that, and trying to write it as an expression will make your decision very hard to read.
So, let's put that logic in a separate function:
def parseCommasAndColons(s):
bits = [bit.split(':') for bit in s.split(',')]
try:
return dict(bits)
except ValueError:
return bits
This will return a dict like {'ip': '192.168.10.10', 'deviceid': '1234', 'optdata': '75BCD15'} or {'AMT-Cam': 'avatar'} for cases where each comma-separated component has a colon inside it, but a list like ['1342177342'] for cases where any of them don't.
Even this may be a little too clever; I might make the "is this in dictionary format" check more explicit instead of just trying to convert the list of lists and see what happens.
Either way, how would you put that back into your original comprehension?
Well, you want to call it on the value in the line.split('='). So let's add a function for that:
def parseCommasAndColonsForValue(keyvalue):
if len(keyvalue) == 2:
return keyvalue[0], parseCommasAndColons(keyvalue[1])
else:
return keyvalue
requestDict = dict(parseCommasAndColonsForValue(line.split('='))
for line in clientRequest.split('&'))
One last thing: Unless you need to run on older versions of Python, you shouldn't often be calling dict on a generator expression. If it can be rewritten as a dictionary comprehension, it will almost certainly be clearer that way, and if it can't be rewritten as a dictionary comprehension, it probably shouldn't be a 1-liner expression in the first place.
Of course breaking expressions up into separate expressions, turning some of them into statements or even functions, and naming them does make your code longer—but that doesn't necessarily mean worse. About half of the Zen of Python (import this) is devoted to explaining why. Or one quote from Guido: "Python is a bad language for code golf, on purpose."
If you really want to know what it would look like, let's break it into two steps:
>>> {k: [bit2.split(':') for bit2 in v.split(',')] for k, v in (bit.split('=') for bit in s.split('&'))}
{'clientid': [['b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0']],
'junketid': [['1342177342']],
'keyid': [['987654321']],
'md': [['AMT-Cam', 'avatar']],
'mk': [['PC']],
'playbackmode': [['st']],
'sessionid': [['']],
'userdata': [['ip', '192.168.10.10'],
['deviceid', '1234'],
['optdata', '75BCD15']],
'ver': [['6']],
'version': [['6.7.8.9012']]}
That illustrates why you can't just add a dict call for the inner level—because most of those things aren't actually dictionaries, because they had no colons. If you changed that, then it would just be this:
{k: dict(bit2.split(':') for bit2 in v.split(',')) for k, v in (bit.split('=') for bit in s.split('&'))}
I don't think that's very readable, and I doubt most Python programmers would. Reading it 6 months from now and trying to figure out what I meant would take a lot more effort than writing it did.
And trying to debug it will not be fun. What happens if you run that on your input, with missing colons? ValueError: dictionary update sequence element #0 has length 1; 2 is required. Which sequence? No idea. You have to break it down step by step to see what doesn't work. That's no fun.
So, hopefully that illustrates why you don't want to do this.

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
if res[0] in unwanted_resource_types:
result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
for type in unwanted_resource_types:
if res[0] == type:
result.append(res[1])
also to no avail. Is there something i'm missing? I believe this would be the right place to perform list comprehension, but that's still in my grey basket of understanding fully (The Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.

Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
if int(res[0]) in unwanted_resource_types:
result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]

The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
if int( res[0] ) in unwanted_resource_types:
result.append(res[1])

The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).

The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Join two CSV files in python using dictreader - python

Related

How to continue iteration of JSON object/list with "for" loop nested in another "for" loop?

Changing Json from API with Python

Django turn lists into one list

Parse string with three-level delimitation into dictionary

How to compare an element of a tuple (int) to determine if it exists in a list

Categories

Resources