KeyError in DataFrame groupby - python

I've seen a few similar answers concerning KeyErrors from using groupby in DataFrame. However, their solutions and explanations don't fit my issue properly.
I find this particularly strange because I can't replicate the exception when I'm testing the script in the Python console, by feeding it with code line by line. The groupby attempt works normally when tested with one-entry examples.
Additionally, I had earlier used the same script for a similarly-formatted json, though with significantly smaller data size - and it worked seamlessly.
What am I trying to do?
I have a nested json string that I'm attempting to load into a DataFrame, and then count the number of times a specific value appears in each column.
The json when put through a converter looks like this:
action,timestamp,campaign_id,title,type,url
open,2019-02-08T08:57:59+00:00,192a39071b,[CAMPAIGN TITLE],,
sent,2019-02-08T07:00:00+00:00,192a39071b,[CAMPAIGN TITLE],regular,
sent,2019-02-07T11:00:00+00:00,2159592071,[CAMPAIGN TITLE],regular,
open,2019-02-07T08:33:44+00:00,214d84380b,[CAMPAIGN TITLE],,
open,2019-02-07T08:33:19+00:00,56ab3a5934,[CAMPAIGN TITLE],,
open,2019-02-07T08:32:33+00:00,811ac6cae3,[CAMPAIGN TITLE],,
sent,2019-02-07T02:45:00+00:00,214d84380b,[CAMPAIGN TITLE],regular,
sent,2019-02-05T02:30:00+00:00,56ab3a5934,[CAMPAIGN TITLE],regular,
(in case it's relevant - the json is pulled directly from an API and not written in a csv or anything)
Specifically, I want to count the number of times "open" and "sent" appear in the "action" column.
This is the relevant snippet of the code I used:
dretrieved = json.loads(response.text)
dframed = pandas.DataFrame(dretrieved['activity'])
actionssummary = dframed.groupby('action').size()
try:
    opencount = actionssummary['open']
except:
    opencount = 0
try:
    sentcount = actionssummary['sent']
except:
    sentcount = 0
And this was the traceback:
Traceback (most recent call last):
File "MC_member_data_list_api.py", line 92, in <module>
actionssummary = dframed.groupby('action').size()
File "C:\Users\username\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 7622, in groupby
observed=observed, **kwargs)
File "C:\Users\username\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\groupby\groupby.py", line 2110, in groupby
return klass(obj, by, **kwds)
File "C:\Users\username\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\groupby\groupby.py", line 360, in __init__
mutated=self.mutated)
File "C:\Users\username\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\groupby\grouper.py", line 578, in _get_grouper
raise KeyError(gpr)
KeyError: 'action'
Does anyone have a clue what's going on?
EDIT: Following suvayu's comment about Heisenbugs caused by unexpected or malformed input, I added an Exception-pass to my code as below, so that it would record a zero (0) for that user's entry and move on to the next:
dframed = pandas.DataFrame(dretrieved['activity'])
# identify action portion of the json (go to next if error)
try:
    actionssummary = dframed.groupby('action').size()
except:
    pass
# identify count of opens
try:
    opencount = actionssummary['open']
except:
    opencount = 0
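One way this KeyError can arise, assuming some members come back with an empty activity list (the question does not confirm this), is that pandas builds a DataFrame with no columns at all, so there is no 'action' column to group by. Note also that with a bare except followed by pass, actionssummary keeps its value from the previous member, so the counts recorded for the failing member may be stale rather than zero. The sketch below checks for the column explicitly; count_actions and the sample records are hypothetical names and data, not part of the original script.

```python
import pandas

def count_actions(activity):
    """Return (opencount, sentcount) for one member's activity list."""
    dframed = pandas.DataFrame(activity)
    # An empty activity list yields a DataFrame without any columns,
    # so groupby('action') would raise KeyError: 'action'.
    if 'action' not in dframed.columns:
        return 0, 0
    actionssummary = dframed.groupby('action').size()
    # Series.get returns the default instead of raising for a missing label.
    return int(actionssummary.get('open', 0)), int(actionssummary.get('sent', 0))

# Hypothetical sample data shaped like the records in the question.
sample = [
    {'action': 'open', 'campaign_id': '192a39071b'},
    {'action': 'sent', 'campaign_id': '192a39071b'},
    {'action': 'open', 'campaign_id': '214d84380b'},
]
print(count_actions(sample))
print(count_actions([]))
```

Returning fresh counts from a function for every member avoids carrying a value over from an earlier loop iteration.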

Related

How to fix "TypeError: 'int' object is not iterable" error in concurrent.futures threading?

My goal is to scrape some links and using threads to do it faster.
When I try to make threads, it raises TypeError: 'int' object is not iterable.
Here is our script:
import requests
import pandas
import json
import concurrent.futures
from collections.abc import Iterable
# our profiles that we will scrape
profile = ['kaid_329989584305166460858587','kaid_896965538702696832878421','kaid_1016087245179855929335360','kaid_107978685698667673890057','kaid_797178279095652336786972','kaid_1071597544417993409487377','kaid_635504323514339937071278','kaid_415838303653268882671828','kaid_176050803424226087137783']
# lists of the data that we are going to fill up with each profile
total_project_votes=[]
def scraper(kaid):
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    sum_votes=[]
    try:
        data=data.json()
        for item in data['scratchpads']:
            try :
                sum_votes=item['sumVotesIncremented']
            except KeyError:
                pass
        sum_votes=map(int,sum_votes) # change all items of the list in integers
        print(isinstance(sum_votes, Iterable)) #to check if it is an iterable element
        print(isinstance(sum_votes, int)) # to check if it is a int element
        sum_votes=list(sum_votes) # transform into a list
        sum_votes=map(abs,sum_votes) # change all items in absolute value
        sum_votes=list(sum_votes) # transform into a list
        sum_votes=sum(sum_votes) # sum all items in the list
        sum_votes=str(sum_votes) # transform into a string
        total_project_votes=sum_votes
    except json.decoder.JSONDecodeError:
        total_project_votes='NA'
    return total_project_votes
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]
        results = future.result()
        # print(results) why printing only one of them and then stops?
        total_project_votes.append(results[0])
# write into a dataframe and print it:
d = {'total_project_votes':total_project_votes}
dataframe = pandas.DataFrame(data=d)
print(dataframe)
I expected to get this output:
total_project_votes
0 0
1 2353
2 41
3 0
4 0
5 12
6 5529
7 NA
8 2
But instead I get this error:
TypeError: 'int' object is not iterable
I don't really understand what this error means. What is wrong in my script? How can I solve it?
Looking at the traceback, the issue seems to come from this line:
sum_votes=map(int,sum_votes)
Here is the full traceback:
Traceback (most recent call last):
File "toz.py", line 91, in <module>
results = future.result()
File "C:\Users\*\AppData\Local\Programs\Python\Python37-32\lib\concurrent\futures\_base.py", line 425, in result
return self.__get_result()
File "C:\Users\*\AppData\Local\Programs\Python\Python37-32\lib\concurrent\futures\_base.py", line 384, in __get_result
raise self._exception
File "C:\Users\*\AppData\Local\Programs\Python\Python37-32\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "my_scrap.py", line 71, in scraper
sum_votes=map(int,sum_votes) # change all items of the list in integers
TypeError: 'int' object is not iterable
I found my error. I should have put:
sum_votes.append(item['sumVotesIncremented'])
instead of:
sum_votes=item['sumVotesIncremented']
The plain assignment replaced the list with a single int, which is why map(int, sum_votes) raised the TypeError.
Also, scraper returns only one value, total_project_votes, so results is a single string rather than a tuple of several items. That causes a second problem: results[0] does not return the whole of total_project_votes but only the first character of the string (for example, "Hello" becomes "H"). If total_project_votes were an int instead of a string, results[0] would raise another error.
To solve this, either append results itself, or return a tuple containing more than one object so that results[0] actually picks out the first item.
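The two fixes can be sketched without touching the network. In the block below, total_votes and fake_responses are hypothetical stand-ins for the scraper and the decoded API responses; only the append fix and the handling of future.result() mirror the discussion above.

```python
import concurrent.futures

def total_votes(payload):
    """Sum the absolute vote counts in one decoded response (a dict)."""
    sum_votes = []
    for item in payload.get('scratchpads', []):
        try:
            # append keeps sum_votes a list; plain assignment would replace
            # it with a single int and make it non-iterable.
            sum_votes.append(item['sumVotesIncremented'])
        except KeyError:
            pass
    return str(sum(abs(int(v)) for v in sum_votes))

# Hypothetical decoded responses, keyed by made-up profile ids.
fake_responses = {
    'kaid_a': {'scratchpads': [{'sumVotesIncremented': 30},
                               {'sumVotesIncremented': -12}]},
    'kaid_b': {'scratchpads': [{'title': 'no vote key'}]},
}

total_project_votes = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future_kaid = {executor.submit(total_votes, payload): kaid
                   for kaid, payload in fake_responses.items()}
    for future in concurrent.futures.as_completed(future_kaid):
        # Append the whole result; result[0] would keep only its first character.
        total_project_votes.append(future.result())

print(sorted(total_project_votes))
```

Because as_completed yields futures in completion order, the results are sorted here only to make the printed output stable.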

ValueError in Python 3 code

I have this Python 3.6 script that is meant to count the missing rows of numbers within a CSV file. However, the program raises the following error:
Error:
Traceback (most recent call last):
File "C:\Users\GapReport.py", line 14, in <module>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
File "C:\Users\GapReport.py", line 14, in <genexpr>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
ValueError: invalid literal for int() with base 10: 'AC-SEC 000000001'
Code:
import csv
def out(*args):
    print('{},{}'.format(*(str(i).rjust(4, "0") for i in args)))
prev = 0
data = csv.reader(open('Padded Numbers_export.csv'))
print(*next(data), sep=', ') # header
for line in data:
    EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
    if start != prev+1:
        out(prev+1, start-1)
    prev = end
out(start, end)
I'm stumped on how to fix these issues. Also, I think the CSV has many lines in it, so if there's a section that limits it to a few numbers, please let me know.
CSV Snippet (Sorry if I wasn't clear before!):
The values you have in your CSV file are not numeric.
For example, FMAC-SEC 000000001 is not a number. So when you run int(s.strip()[2:]), it is not able to convert it to an int.
Some more comments on the code:
What is the utility of doing EndDoc_Padded, EndDoc_Padded = (...)? Currently you are assigning values to two variables with the same name. Either name one of them something else, or just have one variable there.
Are you trying to get the two different values from each column? In that case, you need to split line into two first. Are the contents of your file comma separated? If yes, then do for s in line.split(','), otherwise use the appropriate separator value in split().
You are running this inside a loop, so each time the values of the two variables would get updated to the values from the last line. If you're trying to obtain 2 lists of all the values, then this won't work.
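As a rough illustration of these points, the sketch below pulls the trailing digits out of each cell with a regular expression and gives the two values distinct names. It assumes each row has exactly two columns holding the first and last document number of a range; the header names, the sample rows, and the helpers doc_number and find_gaps are all made up for the example. Note that csv.reader already yields each row as a list of fields, so no extra split is needed in that case.

```python
import csv
import io
import re

def doc_number(cell):
    """Return the trailing run of digits in a cell such as 'FMAC-SEC 000000001'."""
    match = re.search(r'(\d+)\s*$', cell)
    if match is None:
        raise ValueError('no trailing number in {!r}'.format(cell))
    return int(match.group(1))

def find_gaps(rows):
    """Yield (first_missing, last_missing) for each gap between row ranges."""
    prev = 0
    for begin_cell, end_cell in rows:
        start, end = doc_number(begin_cell), doc_number(end_cell)
        if start != prev + 1:
            yield prev + 1, start - 1
        prev = end

# Hypothetical file contents; the real column names are not shown in the question.
sample = io.StringIO(
    "BegDoc_Padded,EndDoc_Padded\n"
    "FMAC-SEC 000000001,FMAC-SEC 000000003\n"
    "FMAC-SEC 000000006,FMAC-SEC 000000008\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
gaps = list(find_gaps(reader))
print(gaps)
```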

Python: IndexError immediately after successful evaluation of list at same index

I am processing a tab-delimited data set of almost three million lines. Since I have enough memory, I am loading the entire data file into memory via a list. I then go and clean up inconsistencies with the data row-by-row. After 150,000 lines of successful computation, the program halts with this error:
Traceback (most recent call last):
File "C:/Users/me/dataset_cleanup_utility/dataset_cleanup.py", line 466, in <module>
ROWS_PASSED = passed
File "C:/Users/me/dataset_cleanup_utility/dataset_cleanup.py", line 43, in dataset_cleanup
row = make_consistent(row, row_count)
File "C:/Users/me/dataset_cleanup_utility/dataset_cleanup.py", line 180, in make_consistent
row[11] = remove("(STAFF)", "", str(row[11]))
IndexError: string index out of range
The code sample that causes this is below:
if "(STAFF)" in str(row[11]):
    row[11] = remove("(STAFF)", "", str(row[11]))
def remove(unwanted, wanted, _str):
    s = str(_str).rsplit(unwanted, 1)
    if len(s) == 2:
        return str(s[0]) + wanted + str(s[1])
    else:
        return str(s[0]) + wanted
Here, row is the list containing all of the columns for a given row and the IndexError is being thrown INSIDE the if statement that checks row[11]. So what this error is telling me is that the row[11] was okay when evaluating the if statement, but inside the if statement, when evaluated again, row[11] no longer exists. How could this be if no changes to row[11] occurred after the if statement was evaluated?
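One detail in the traceback may help narrow this down: the message says "string index out of range", which Python reports when a str is indexed past its end, whereas a list produces "list index out of range". The sketch below, using made-up values, shows both messages. If that reading applies here, something indexed on the failing line was a str rather than a list; this is only an assumption, since the code that builds row is not shown.

```python
def index_error_message(sequence, position):
    """Return the IndexError message raised by sequence[position], or None."""
    try:
        sequence[position]
    except IndexError as exc:
        return str(exc)
    return None

list_row = ['col'] * 5      # a row stored as a list of columns
str_row = 'a\tb\tc'         # a row that was never split into columns

print(index_error_message(list_row, 11))
print(index_error_message(str_row, 11))
```

Printing type(row) and len(row) just before the failing line would confirm or rule out this explanation.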

Python - Delete from Array while enumerating

Error:
Traceback (most recent call last):
File "<string>", line 10, in <module>
File "/Users/georg/Programmierung/Glyphs/Glyphs/Glyphs/Scripts/GlyphsApp.py", line 59, in __iter__
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC/objc/_convenience.py", line 589, in enumeratorGenerator
yield container_unwrap(anEnumerator.nextObject(), StopIteration)
objc.error: NSGenericException - *** Collection <__NSArrayM: 0x7f9906245480> was mutated while being enumerated.
I know this error occurs because I'm trying to delete objects from the array while also enumerating these objects. But I don't know how to solve it. I'm fairly new to object orientated programming and am limiting myself to scripting.
I searched the web, and it seems that to solve the error I have to copy the array before deleting objects from it. I tried to copy the array via deepcopy, placing
import copy
pathcopy = copy.deepcopy(thisLayer.paths)
right before for path in thisLayer.paths:
But in this case I get the following error:
Cannot pickle Objective-C objects
Usually the program crashes after the first glyph. For clarification: I work in Glyphsapp, a type design application.
Here is the Code:
# loops through every Glyph and deletes every path with nodes on the left half
for myGlyph in Glyphs.font.glyphs:
    glname = myGlyph.name
    thisLayer = Glyphs.font.glyphs[glname].layers[1]
    middle = thisLayer.bounds.size.width/2+thisLayer.LSB
    thisGlyph = thisLayer.parent
    for path in thisLayer.paths: # this is where the code crashes
        for thisNode in path.nodes:
            if thisNode.position.x < middle:
                #print thisNode.position.x
                try:
                    thisLayer = path.parent()
                except Exception as e:
                    thisLayer = path.parent
                try:
                    thisLayer.removePath_ ( thisNode.parent() )
                except AttributeError:
                    pass
Thank you in advance
Thank you very much Andreas,
with your help I was able to fix my code :-)
Here is the outcome:
for myGlyph in Glyphs.font.glyphs:
    glname = myGlyph.name
    thisLayer = Glyphs.font.glyphs[glname].layers[1]
    middle = thisLayer.bounds.size.width/2+thisLayer.LSB
    thisGlyph = thisLayer.parent
    for path in thisLayer.paths:
        for thisNode in path.nodes:
            if thisNode.position.x < middle:
                nodeList = []
                nodeList.append(thisNode.parent())
                nLCopy = nodeList[:]
                for ncontainer in nLCopy:
                    thisLayer.removePath_ ( ncontainer )
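The general pattern behind this kind of fix is to decide what to delete in one pass and delete in a second pass, so the collection is never modified while it is being enumerated. The sketch below shows the idea with plain Python lists and dicts standing in for the layer's paths and nodes; it does not use the Glyphs API, and the names and coordinates are invented.

```python
# Plain-Python stand-ins: each "path" is a dict with a list of node x positions.
paths = [
    {'name': 'left stem', 'xs': [10, 20, 30]},
    {'name': 'right stem', 'xs': [70, 80, 90]},
    {'name': 'crossbar', 'xs': [40, 60]},
]
middle = 50

# Pass 1: collect the paths that have at least one node left of the middle.
to_remove = [path for path in paths if any(x < middle for x in path['xs'])]

# Pass 2: delete them; nothing is iterating over paths at this point.
for path in to_remove:
    paths.remove(path)

remaining = [path['name'] for path in paths]
print(remaining)
```

Iterating over a shallow copy such as list(paths) while removing from the original achieves the same effect, and a shallow copy avoids the pickling that deepcopy attempts.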

Substring in Python, what is wrong here?

I'm trying to simulate a substring in Python but I'm getting an error:
length_message = len(update)
if length_message > 140:
    length_url = len(short['url'])
    count_message = 140 - length_url
    update = update["msg"][0:count_message] # Substring update variable
print update
return 0
The error is the following:
Traceback (most recent call last):
File "C:\Users\anlopes\workspace\redes_sociais\src\twitterC.py", line 54, in <module>
x.updateTwitterStatus({"url": "http://xxx.com/?cat=49s", "msg": "Searching for some ....... tips?fffffffffffffffffffffffffffffdddddddddddddddddddddddddddddssssssssssssssssssssssssssssssssssssssssssssssssssseeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeedddddddddddddddddddddddddddddddddddddddddddddddfffffffffffffffffffffffffffffffffffffffffffff "})
File "C:\Users\anlopes\workspace\redes_sociais\src\twitterC.py", line 35, in updateTwitterStatus
update = update["msg"][0:count_message]
TypeError: string indices must be integers
I can't do this?
update = update["msg"][0:count_message]
The variable count_message evaluates to 120.
Give me a clue.
Best Regards,
UPDATE
I make this call; update["msg"] comes from here:
x = TwitterC()
x.updateTwitterStatus({"url": "http://xxxx.com/?cat=49", "msg": "Searching for some ...... ....?fffffffffffffffffffffffffffffdddddddddddddddddddddddddddddssssssssssssssssssssssssssssssssssssssssssssssssssseeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeedddddddddddddddddddddddddddddddddddddddddddddddfffffffffffffffffffffffffffffffffffffffffffffddddddddddddddddd"})
Are you looping through this code more than once?
If so, perhaps the first time through update is a dict, and update["msg"] returns a string. Fine.
But you set update equal to the result:
update = update["msg"][0:int(count_message)]
which is (presumably) a string.
If you are looping, the next time through the loop you will have an error because now update is a string, not a dict (and therefore update["msg"] no longer makes sense).
You can debug this by putting in a print statement before the error:
print(type(update))
or, if it is not too large,
print(repr(update))
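The effect of rebinding the name can be reproduced in a few lines. The sketch below is written for Python 3 and uses a shortened, made-up message; it shows the TypeError appearing once the dict has been replaced by a string, and a variant that stores the truncated text under a separate name (message, chosen for the example) so that update stays a dict.

```python
update = {"url": "http://xxx.com/?cat=49s", "msg": "x" * 200}
count_message = 140 - len(update["url"])

# Rebinding the same name: after this line, rebound is a str, not a dict.
rebound = update
rebound = rebound["msg"][0:count_message]
try:
    rebound["msg"]          # indexing a str with a str key
    error_text = ""
except TypeError as exc:
    error_text = str(exc)
print(type(rebound), error_text)

# Using a separate name leaves update untouched for any later pass.
message = update["msg"][0:count_message]
print(type(update), len(message))
```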
