Computing aggregate by creating nested dictionary on the fly - python

I'm new to Python and I could really use your help and guidance at the moment. I am trying to read a CSV file with three columns and do some computation based on the first and second columns, i.e.
A spent 100
A earned 60
B earned 48
B earned 180
A spent 40
...
Where A spent 2040 would be the addition of all 'A' and 'spent' amounts. This does not give me an error but it's not logically correct:
for row in rows:
    cols = row.split(",")
    truck = cols[0]
    if (truck != 'A' and truck != 'B'):
        continue
    record = cols[1]
    if (record != "earned" and record != "spent"):
        continue
    amount = int(cols[2])
    # print(truck+" "+record+" "+str(amount))
    if truck in entries:
        # entriesA[truck].update(record)
        if record in records:
            records[record].append(amount)
        else:
            records[record] = [amount]
    else:
        entries[truck] = records
        if record in records:
            records[record].append(amount)
        else:
            entries[truck][record] = [amount]
print(entries)
I am aware that this part is incorrect, because I would be adding the same inner 'records' dictionary to the outer dictionary for every truck, but I'm not sure how to go on from there:
entries[truck] = records
if record in records:
    records[record].append(amount)
However, I'm not sure of the syntax to create a new dictionary on the fly that would not be 'records'.
I am getting:
{'B': {'earned': [60, 48], 'spent': [100]}, 'A': {'earned': [60, 48], 'spent': [100]}}
But hoping to get:
{'B': {'earned': [48]}, 'A': {'earned': [60], 'spent': [100]}}
Thanks.

For the kind of calculation you are doing here, I highly recommend Pandas.
Assuming in.csv looks like this:
truck,type,amount
A,spent,100
A,earned,60
B,earned,48
B,earned,180
A,spent,40
You can do the totalling with three lines of code:
import pandas
df = pandas.read_csv('in.csv')
totals = df.groupby(['truck', 'type']).sum()
totals now looks like this:
              amount
truck type
A     earned      60
      spent      140
B     earned     228
You will find that Pandas allows you to think on a much higher level and avoid fiddling with lower level data structures in cases like this.
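If you do want the nested-dictionary shape from the question rather than a DataFrame, the grouped result converts straight back; a minimal sketch using the totals above:
nested = {}
for (truck, kind), total in totals['amount'].items():
    nested.setdefault(truck, {})[kind] = total
print(nested)  # {'A': {'earned': 60, 'spent': 140}, 'B': {'earned': 228}}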

if record in entries[truck]:
    entries[truck][record].append(amount)
else:
    entries[truck][record] = [amount]
I believe this is what you want. Here we directly access the truck's records instead of checking a local dictionary called records, just like you already do when there isn't an entry for a truck yet.
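For completeness, here is a sketch of the whole loop with that fix applied, using dict.setdefault so each truck gets its own fresh inner dictionary (variable names follow the question's code):
entries = {}
for row in rows:
    cols = row.split(",")
    truck, record = cols[0], cols[1]
    if truck not in ('A', 'B') or record not in ('earned', 'spent'):
        continue
    amount = int(cols[2])
    # setdefault returns the existing inner dict, or inserts and returns
    # a brand-new one -- no shared 'records' dictionary anywhere
    entries.setdefault(truck, {}).setdefault(record, []).append(amount)
print(entries)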

Related

Iterate through multiple lists of dictionaries

I would like to iterate through lists of dictionaries in order to get a specific value, but I can't figure it out.
I've made a simplified version of what I've been working with. These lists are much longer, with more dictionaries in them, but for the sake of an example, I hope this shortened dataset will be enough.
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
For example, I need the values of the key "30" from the dictionaries above. I've managed to get those and store them in a list of integers ([626, 914]).
These integers are basically IDs. After this, I need to get the value of these IDs from another list of dictionaries.
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
I would like to print/store the track_names and track_lengths of the IDs I've got from the listOfResults earlier. Unfortunately, I've ended up in a complete mess of for loops.
You want something like this:
ids = [626, 914]
result = { track for track in list_of_tracks if track.get("track_id") in ids }
I unfortunately can't comment on the answer given by Nathaniel Ford because I'm a new user so I just thought I'd share it here as an answer.
His answer is basically correct, but I believe you need to replace the curly braces with brackets or else you will get this error: TypeError: unhashable type: 'dict'
The answer should look like:
ids = [626, 914]
result = [track for track in listOfTrack if track.get("track_id") in ids]
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
ids = [x.get('30') for x in listOfResults]
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
out = [x for x in listOfTrack if x.get('track_id') in ids]
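If the lists are long, a dictionary keyed by track_id avoids rescanning listOfTrack for every id; a small sketch reusing the variables above:
tracks_by_id = {track["track_id"]: track for track in listOfTrack}
# O(1) lookup per id; ids with no matching track are skipped
out = [tracks_by_id[i] for i in ids if i in tracks_by_id]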
Alternatively, it may be time to learn a new library if you're going to be doing a lot of this.
import pandas as pd
results_df = pd.DataFrame(listOfResults)
track_df = pd.DataFrame(listOfTrack)
These look like:
# results_df
     29   30  10   32
0  2523  626   0  128
1  2466  914   0   69

# track_df
   track_length  track_id    track_name
0          1.26       626  Rainbow Road
1          6.21       914     Excalibur
Now we can answer your question:
# Creates a mask of rows where this is True.
mask = track_df['track_id'].isin(results_df['30'])
# Specifies that we want just those two columns.
cols = ['track_length', 'track_name']
out = track_df.loc[mask, cols]
print(out)
# Or we can make it back into a dictionary:
print(out.to_dict('records'))
Output:
   track_length    track_name
0          1.26  Rainbow Road
1          6.21     Excalibur
[{'track_length': 1.26, 'track_name': 'Rainbow Road'}, {'track_length': 6.21, 'track_name': 'Excalibur'}]
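A merge is another way to get the same rows, pairing each result with its track in one frame; a sketch using the frames above:
joined = results_df.merge(track_df, left_on='30', right_on='track_id')
print(joined[['track_length', 'track_name']])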

Iterating on a list based on different parameters

I'm once again asking for help on iterating over a list. This is the problem that eludes me this time:
I have a table that contains various combinations of countries with their relative trade flow.
Since trade goes both ways, my list has, for example, one value for ALB-ARM (how much Albania traded with Armenia that year) and then further down the list another value for ARM-ALB (the other way around).
I want to sum these two trade values for every pair of countries; I've been trying some code, but I quickly realised all my approaches were wrong.
How do I even set it up? I feel like it's too hard with a loop and that it would be easy with some function I don't even know exists.
Example data in Table format:
from astropy.table import Table
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
t = Table([country1,country2,flow],names=["1","2","flow"],meta={"Header":"Table"})
and the expected output would be:
trade = [700,90,700,320,90,320]
result = Table([country1,country2,flow,trade],names=["1","2","flow","trade"],meta={"Header":"Table"})
Thank you in advance all
Maybe this could help:
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
trade = []
pairs = map(lambda t: '-'.join(t), zip(country1, country2))
flow_map = dict(zip(pairs, flow))
for left_country, right_country in zip(country1, country2):
    trade.append(flow_map['-'.join((left_country, right_country))]
                 + flow_map['-'.join((right_country, left_country))])
print(trade)
outputs:
[700, 90, 700, 320, 90, 320]
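An alternative sketch that avoids building keys in both directions: key the totals on the unordered pair with frozenset (same data as above):
from collections import defaultdict

pair_total = defaultdict(int)
for a, b, f in zip(country1, country2, flow):
    # frozenset((a, b)) == frozenset((b, a)), so both directions
    # accumulate into the same bucket
    pair_total[frozenset((a, b))] += f

trade = [pair_total[frozenset((a, b))] for a, b in zip(country1, country2)]
print(trade)  # [700, 90, 700, 320, 90, 320]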

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. E.g., I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'],
                  dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?
Here is one way, using merge and then sorting the columns:
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
   meeting_id person_id_x person_id_y
1    meeting0  person1234  person4321
2    meeting0  person1234  person5555
5    meeting0  person4321  person5555
10   meeting1  person4321  person9999
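From there, the original "top 15 co-participants" question is one more groupby away. A sketch (note it keeps directed pairs rather than the sorted, de-duplicated ones above, since co-participation counts are per person):
pairs = df.merge(df, on='meeting_id')
pairs = pairs[pairs['person_id_x'] != pairs['person_id_y']]
# count how often each (person, co-participant) pair met
counts = pairs.groupby(['person_id_x', 'person_id_y']).size()
# sort by frequency, then keep the 15 most frequent partners per person
top15 = counts.sort_values(ascending=False).groupby(level='person_id_x').head(15)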

Python3, Pandas - New Column Value based on Column To Left Data (Dynamic)

I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged into others and I will then have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentages of the answers based on the entirety of the merged document.
Example Excel Data
**Poll Question** **Poll Responses**
The content was clear and effectively delivered  37 Total Votes
Strongly Agree 24.30%
Agree 70.30%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
The Instructor(s) were engaging and motivating  37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 5.40%
Disagree 0.00%
Strongly Disagree 0.00%
This was a good format for my training  37 Total Votes
Strongly Agree 24.30%
Agree 62.20%
Neutral 8.10%
Disagree 2.70%
Strongly Disagree 2.70%
Any comments/suggestions about this training course?  5 Total Votes
My method for calculating a non-percent number of votes will be to convert the percentages to a number, e.g. find and extract 37 from 37 Total Votes, then use the following formula to get the number of users that voted for that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99, rounded up, means 9 out of 37 people voted for "Strongly Agree".
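In code, that conversion might look like this (the regex pattern is my assumption about the cell format):
import re

cell = '37 Total Votes'
total = int(re.search(r'(\d+)\s+Total\s+Votes', cell).group(1))  # 37
votes = round(24.30 * total / 100)                               # 8.991 -> 9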
Here's an example spreadsheet of what I'd like to be able to do:
**Poll Question** **Poll Responses** **non-percent** **subtotal**
... 37 Total Votes 0 37
... 24.30% 9 37
... 70.30% 26 37
... 2.70% 1 37
... 2.70% 1 37
... 0.00% 0 37
(note: non-percent and subtotal would be newly created columns)
Currently I take a folder full of .xls files and I loop through that folder, saving them to another in an .xlsx format. Inside that loop, I've added a comment block that contains my # NEW test CODE where I'm trying to put the logic to do this.
As you can see, I'm trying to target the cell and get its value, then use a regex to extract the number from it and add it to the subtotal column in that row. I then want to keep applying that until I see a new row containing x Total Votes.
Here's my current code:
import re

import numpy as np
import pandas as pd

files = get_files('/excels/', '.xls')
df_array = []
for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class': 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id': 'eventSummary'}))
    event_title = get_event_title(pd.read_html(f, attrs={'id': 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)
    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        if df[df['Poll Responses'].str.contains("Total Votes")]:
            # if df['Poll Responses'].str.contains("Total Votes"):
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer, 'sheet')
    writer.save()
    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
TL;DR
This seems very convoluted, but if I can get the two new columns that contain the total votes for a question and the number (not percentage) of votes for an answer, then I can do some VLOOKUP magic for this on the merged document. Any help or methodology suggestions would be greatly appreciated. Thanks!
I solved this, I'll post the pseudo code below:
I loop through each sheet. Inside that loop, I loop through each row using for n, row in enumerate(df.itertuples(), 1):.
I get the value of the field that might contain "Poll Responses" with poll_response = str(row[3]).
Using an if/else, I check whether poll_response contains the text "Total Votes". If it does, it must be a question; otherwise it must be a row with an answer.
In the if branch (a question), I get the cells that contain the data I need. I then have a function that compares the question text with every object's question text in the array. If there's a match, I simply update the fields of the object; otherwise I create a new question object.
In the else branch (an answer row), I use the question text to find the object in the array and update/add the answers or data.
This process loops through all the rows in each spreadsheet, and at the end I have my array full of unique question objects.
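A minimal sketch of that pseudocode; the Question class, the column positions, and the regex are my assumptions, not the original code:
import re

class Question:
    def __init__(self, text, total_votes):
        self.text = text
        self.total_votes = total_votes
        self.answers = {}   # answer text -> accumulated vote count

questions = []   # unique question objects across all sheets

def find_question(text):
    return next((q for q in questions if q.text == text), None)

current = None
for n, row in enumerate(df.itertuples(), 1):
    poll_response = str(row[3])
    text = str(row[2])   # question/answer text; column position is an assumption
    if 'Total Votes' in poll_response:
        # question row: pull 37 out of "37 Total Votes"
        total = int(re.search(r'(\d+)\s+Total\s+Votes', poll_response).group(1))
        current = find_question(text)
        if current is None:
            current = Question(text, total)
            questions.append(current)
        else:
            current.total_votes += total
    elif current is not None:
        # answer row: turn "24.30%" into a vote count for this question
        pct = float(poll_response.rstrip('%'))
        votes = round(pct * current.total_votes / 100)
        current.answers[text] = current.answers.get(text, 0) + votes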

Compute values from sequential pandas rows

I'm a Python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges, count stops, aggregate them into routes, and understand behavior over the route. Data originally comes in the form of two CSV files (entrance, Typedoc = 0, and clearance, Typedoc = 1, each about 85k rows / 19 MB) that I merged into one file before performing some dimensionality reduction. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
                     Typedoc  Port   NRT   GRT      Draft
Vessname ECDate
400 L    2012-01-19        0  2394  2328  7762   4.166667
         2012-07-22        1  2394  2328  7762  17.000000
         2012-10-29        0  2395  2328  7762   6.000000
A 397    2012-05-27        1  3315  2928  2928  18.833333
         2012-06-01        0  3315  2928  2928   5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L': [
    ['2012-01-19', 0, 2394, 4.166667],
    ['2012-07-22', 1, 2394, 17.000000],
    ['2012-10-29', 0, 2395, 6.000000]
]}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to want to parse my dates). I'm then going to extract the links as some form of:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values serve as edge weights. The nodes are the list of unique Port values, which get assigned the '400 L': [t_dwell, NRT, GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)
    <FLOW CONTROL>  # based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}

def createStopsDict(row):
    # this reads each row in a csv file,
    # creates a dict entry from row[0]: Vessname if not in dict
    # or appends things after row[0] to the dict entry if Vessname is in dict
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves] = [row[1:]]
    return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows based on the multi-index value directly into a reusable data structure (right now the dictionary stops_dict)?
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}

def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
returns a dictionary keyed by Vessname in the structure I need, so I can compute things by indexing into the list elements.
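The same dictionary also falls out of a plain groupby, without apply, and the dwell/transit deltas I described are just grouped diffs; a sketch, assuming ECDate parses to datetime and the column names above:
import pandas as pd

df = movements.reset_index()
df['ECDate'] = pd.to_datetime(df['ECDate'])
df = df.sort_values(['Vessname', 'ECDate'])

# routemap equivalent: one list of [ECDate, Typedoc, Port, NRT, GRT, Draft] per vessel
routemap = {
    ves: grp[['ECDate', 'Typedoc', 'Port', 'NRT', 'GRT', 'Draft']].values.tolist()
    for ves, grp in df.groupby('Vessname')
}

# time gap and draft change to the *next* stop, computed within each vessel
df['t_gap'] = -df.groupby('Vessname')['ECDate'].diff(-1)
df['d_draft'] = -df.groupby('Vessname')['Draft'].diff(-1)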
