Python data wrangling issues

Python data wrangling issues - python

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.

Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop rows - take first observed, you can also take last
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35

Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
def Run(self, filename):
# Get the formatted rows from CSV file
rows = self.readCsv(filename)
for key in rows.keys():
print "\nKey: " + key
i = 1
for value in rows[key]:
print "\nValue {index} : {value}".format(index = i, value = value)
i += 1
def readCsv(self, fileName):
with open(fileName, 'rU') as csvfile:
reader = csv.DictReader(csvfile)
# Keys may or may not be pulled in with extra space by DictReader()
# The next line simply creates a small dict of stripped keys to original padded keys
keys = { key.strip(): key for (key) in reader.fieldnames }
# Format each row into the final string
groupedRows = {}
for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
return groupedRows;
def normalizeRow(self, row):
row[1] = float(row[1].replace(',','')) # "Prize_Pool"
# and so on
return row
if __name__ == "__main__":
CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)

Related

How to read space delimited data, two row types, no fixed width and plenty of missing values?

There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read Japan's Meteorological Agency typhoon history data which is supposed to have this format, but doesn't actually:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that it comma delimited, and missing values were given -999 or NaN, which simplified reading the data. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows don't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.

I went a little deep for you here, because I'm assuming you're doing this in the name of science and if I can help someone trying to understand climate change then its a good cause.
After looking the data over I've noticed the issue is relating to the data being stored in a de-normalized structure. There are 2 ways you can approach this issue off the top of my head. Re-Writing the file to another file to load into pandas or dask is what I'll show, since thats probably the easiest way to think about it (but certainly not the most efficient for those that will inevitably roast me in the comments)
Think of this like its Two Separate Tables, with a 1-to-Many relationship. 1 table for Typhoons and another for the data belonging to a given typhoon.
A decent, but not really efficient way would be to rewrite it to a better nested structure, like JSON. Then load the data in using that. Note the 2 distinct types of columns.
Step 1: map the data out
There are really 2 tables in one table here. Each typhoon is going to show up as a row that appears like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
While the records for that typhoon are going to follow that row (think of this as a separate row:
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
Load the File in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
lines = f.readlines()
Now that we have that read in, we're going to need to perform some logic to separate some lines from others. It appears the every time there is a Typhoon record, the line is preceded with a '66666', so lets key off that. So, given we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
# do stuff
else:
# do other stuff
Thats going to be a pretty solid way to separate that logic for now, which will be useful to guide splitting that up. Now, we need to write a loop that will check that for each row:
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
# do stuff
else:
# do other stuff
# read through lines list from the .readlines(), looping sequentially
for line in lines:
write_typhoon(line, collection)
Lastly, we're going to need to write some logic to now extract that data out in some manner within the if/then loop inside the write_typhoon() function. I didn't care to do a whole lot of thinking here, and opted for the simplest I could make it: defining the fwf metadata myself. because "yolo":
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
typhoon = {
"AA":row[:5],
"BB":row[6:11],
"CC":row[12:15],
"DD":row[16:20],
"EE":row[21:25],
"FF":row[26:27],
"GG":row[28:29],
"HH":row[30:50],
"II":row[51:],
"data":[]
}
# clean that whitespace
for key, value in typhoon.items():
if key != 'data':
typhoon[key] = value.strip()
collection.append(typhoon)
else:
sub_data = {
"A":row[:9],
"B":row[9:12],
"C":row[13:14],
"D":row[15:18],
"E":row[19:23],
"F":row[24:32],
"G":row[33:40],
"H":row[41:42],
"I":row[42:46],
"J":row[47:51],
"K":row[52:53],
"L":row[54:57],
"M":row[58:70],
"P":row[71:]
}
# clean that whitespace
for key, value in sub_data.items():
sub_data[key] = value.strip()
collection[-1]['data'].append(sub_data)
return collection
Okay that took me longer than I'm willing to admit. I wont lie. Gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if its too big). Here is what I was able to come up with along that front:
import pandas as pd
df = pd.json_normalize(
collection,
record_path='data',
meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
A great reference for that can be found in the answers for this question (particularly the second one, not the selected one)
Put it all together now:
from typing import Dict
import pandas as pd
# load the file as raw input
with open('./test.txt') as f:
lines = f.readlines()
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
typhoon = {
"AA":row[:5],
"BB":row[6:11],
"CC":row[12:15],
"DD":row[16:20],
"EE":row[21:25],
"FF":row[26:27],
"GG":row[28:29],
"HH":row[30:50],
"II":row[51:],
"data":[]
}
for key, value in typhoon.items():
if key != 'data':
typhoon[key] = value.strip()
collection.append(typhoon)
else:
sub_data = {
"A":row[:9],
"B":row[9:12],
"C":row[13:14],
"D":row[15:18],
"E":row[19:23],
"F":row[24:32],
"G":row[33:40],
"H":row[41:42],
"I":row[42:46],
"J":row[47:51],
"K":row[52:53],
"L":row[54:57],
"M":row[58:70],
"P":row[71:]
}
for key, value in sub_data.items():
sub_data[key] = value.strip()
collection[-1]['data'].append(sub_data)
return collection
# read through file sequentially
for line in lines:
write_typhoon(line, collection)
# load to pandas df using json_normalize
df = pd.json_normalize(
collection,
record_path='data',
meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
print(df.head(20)) # lets see what we've got!

There's someone who might have had the same problem and created a library for it, you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.

Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file int CSV format:
def get_idxs(string, char):
idxs = []
for i in range(len(string)):
if string[i - 1].isalpha() and string[i] == char:
idxs.append(i)
return idxs
def replace(string, idx, replacement):
string = list(string)
try:
for i in idx: string[i] = replacement
except TypeError:
string[idx] = replacement
return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next apply those functions to the .txt and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm
with ExitStack() as stack:
read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
for line in tqdm(read_file.readlines()):
if ' ' in line[:8]: # line is header data
write_file.write(replace(line, header_idxs, ',') + '\n')
else: # line is track data
write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']
data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')
# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]
# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]
# front fill NaN's for header data
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].fillna(method='pad')
# delete now extraneous header rows
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635

How to import data from CSV file containing certain words?

I have a CSV file containing daily data on yields of different government bonds of varying maturities. The headers are formatted as by the country followed by the maturity of the bond, for eg UK 10Y. What I would like to do is just import all the yields for one government bond at all maturities for one date, so for example import all the UK government bond yields at a particular date. The first date is 07/01/2021.
I know I can use Pandas, but all the codes I have seen require to use usecols function when importing. I'd like to just create a function and import only the data that I want without using usecols.
Snapshot of data, UK data is further right, but format is the same

You can try:
import time
import datetime
col_to_check = "UK government bond yields"
get_after = "07/01/2021"
get_after = time.mktime(datetime.datetime.strptime(get_after, "%d/%m/%Y").timetuple())
with open("yourfile.csv", "r") as msg:
data = msg.readlines()
index_to_check = data[0].split(",").index(col_to_check)
for i, v in enumerate(data):
if i == 0:
pass
else:
date = time.mktime(datetime.datetime.strptime(v.split(",")[index_to_check], "%d/%m/%Y").timetuple())
if date > get_after:
pass
else:
data[i] = ""
print ([x for x in data if x])
This is untested code as you did not provide a sample input but in principle it should work.
You have the header name of the column you want to check, the limit date.
You get the index of the first in your csv row. You convert the limit date to timestamp integer.
Then you read your data line by line and check. If the date/timestamp is greater than your limit you pass, else you assign empty value at the corresponding index of data.
Finally you remove empty elements to get the final list.

Working with data from CSV with Python without using Pandas

I am very new to using python to process data on CSV files. I have a CSV file with the data below. I want to take the averages of the time stamps for each Sprint, Jog, and Walk column by session. The below example has the subject John Doe and Session2 and Session3 that I would like to find the averages of separately and write them to a new CSV file. Is there a way not using PANDAS but other modules like CSV or Numpy to gather the data by the person (subject) and then by session. I have tried to make a dictionary but the keys get overwritten. I have also tried using a List but I cannot figure out how to target the sessions to average them out. Not sure what I am doing wrong. I also tried using dictReader to read the fieldnames and then to process the data but I cannot figure out how to group all the John Doe Session2 data to find the average of the times.
Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
PS I say no PANDAS because my groupmates are not adding this module since we have so many other dependencies.

Given your input, these built-in Python libraries can generate the output you want:
import csv
from itertools import groupby
from operator import itemgetter
from collections import defaultdict
with open('input.csv','r',newline='') as fin,open('output.csv','w',newline='') as fout:
# skip needed because sample data had spaces after comma delimiters.
reader = csv.DictReader(fin,skipinitialspace=True)
# Output file will have these fieldnames
writer = csv.DictWriter(fout,fieldnames='Subject Session Sprint Jog Walk'.split())
writer.writeheader()
# for each subject/session, groupby returns a 2-tuple of sort key and an
# iterator over the rows of that key. Data must be sorted by the key already!
for (subject,session),group in groupby(reader,key=itemgetter('Subject','Session')):
# built the row to output. defaultdict(int) assumes integer(0) if key doesn't exist.
row = defaultdict(int)
row['Subject'] = subject
row['Session'] = session
# Count the items for average.
count = 0
for item in group:
count += 1
# sum the rows, removing the 's'
for col in ('Sprint','Jog','Walk'):
row[col] += int(item[col][:-1])
# produce the average
for col in ('Sprint','Jog','Walk'):
row[col] /= count
writer.writerow(row)
Output:
Subject,Session,Sprint,Jog,Walk
John Doe,Session2,24.333333333333332,34.0,49.0
John Doe,Session3,14.666666666666666,27.333333333333332,52.0
Function links: itemgetter
groupby
defaultdict
If your data is not pre-sorted, you can use the following replacement lines to read in and sort the data by using the same key used in groupby. However, in this implementation the data must be small enough to load it all into memory at once.
sortkey = itemgetter('Subject','Session')
data = sorted(reader,key=sortkey)
for (subject,session),group in groupby(data,key=sortkey):
...

As you want the average grouped by subject and session, just compose unique keys out of that information:
import csv
times = {}
with open('yourfile.csv', 'r') as csvfile[1:]:
for row in csv.reader(csvfile, delimiter=','):
key = row[0]+row[1]
if key not in times.keys():
times[key] = row[-3:]
else:
times[key].extend(row[-3:])
average = {k: sum([int(entry[:-1]) for entry in v])/len(v) for k, v in times.items()}
This assumes that the first two entries do have regular structure as in your example and there is no ambiguity when composing the two first entries per row. To be sure one could insert a special delimiter between them in the key.
If you are also the person storing the data: Writing the unit of a column in the column header saves transformation effort later and avoids redundant information storage.

Handling data from csv file with Python

I know Python is almost made for these kind of purposes, but I am really struggling to understand how I get access to specific values in the dataset, and I tried both with pandas and csv modules. It is probably a matter of syntax. Here's the thing: I have a csv file in the form of
Nation, Year, No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
...
...
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
...
and so on. What I would like to do is to get the total amount of refugees per year, i.e. selecting all the rows with a certain year and summing all the elements in the related "no. of refugees" column. I managed to do this:
import csv
with open('refugees.csv', 'r') as f:
d_reader = csv.DictReader(f)
headers = d_reader.fieldnames
print headers
#2013
list2013=[]
for line in d_reader:
if (line['Year']=='2013'):
list2013.append(line['Refugees'])
list2013=map(int,list2013) #I have str values in my file
ref13=sum(list2013)
but I am looking for a more elegant (and, above all, iterative) solution. Moreover, if I perform that procedure multiple times for different years, I always get 0: it works for 2013 only, not sure why.
Edit: I tried this as well, without success, but I think this could be totally wrong:
import csv
refugees_dict={}
a=range(2005,2014)
a=map(str, a)
with open('refugees.csv', 'r') as f:
d_reader = csv.DictReader(f)
for element in a:
for line in d_reader:
if (line['Year']==element):
print 'hello!'
temp_list=[]
temp_list.append(line['Refugees'])
temp_list=map(int, temp_list)
refugees_dict[a]=sum(temp_list)
print refugees_dict
The next step of my work will involve further studies on the dataset, eg I am probably gonna need to access data nation-wise instead of year-wise, and I really appreciate any hint so I understand how to manipulate data.
Thanks a lot.

Since you tagged pandas in the question, here's a pandas solution to getting the number of refugees per year.
Let's say my input csv looks like this (note that I've eliminated the extra space before the column names):
Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
You can read that into a pandas DataFrame like this:
df = pd.read_csv('data.csv')
You can then get the total like this:
df.groupby(['Year']).sum()
This gives:
No. of refugees
Year
2012 7370
2013 7150

Consider:
from collections import defaultdict
by_year = defaultdict(int) # a dict that has a 0 under every key.
and then
by_year[line['year']] += int(line['Refugees'])
Now you can just look at by_year['2013'] and see your sum (same for other years).

To sum by year you can try this:
f = open('file.csv').readlines()
f = [i.strip('\n').split(',') for i in f]
years = {i[1]:0 for i in f}
for i in f:
years[i[1]] += int(i[-1])
Now, you have a dictionary that has the sum of all the refugees by year.
To access nation-wise:
nations = {i[0]:0 for i in f}
for i in f:
nations[i[0]] += int(i[-1])

Compute values from sequential pandas rows

I'm a python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges so that I can count stops, aggregate them into routes, and understand behavior over the route. Data originally comes in the form of two CSV files (entrance, Typedoc = 0 and clearance, Typedoc = 1, each about 85k rows / 19MB) that I merged into 1 file and performed some dimensionality reduction. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
Typedoc Port NRT GRT Draft
Vessname ECDate
400 L 2012-01-19 0 2394 2328 7762 4.166667
2012-07-22 1 2394 2328 7762 17.000000
2012-10-29 0 2395 2328 7762 6.000000
A 397 2012-05-27 1 3315 2928 2928 18.833333
2012-06-01 0 3315 2928 2928 5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L':[
['2012-01-19', 0, 2394, 4.166667],
['2012-07-22', 1, 2394, 17.000000],
['2012-10-29', 0, 2395, 6.000000]
]
}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to want to parse my dates). I'm then going to extract the links as some form of:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values as edge weights. The nodes are list of unique Port values that get assigned the '400 L':[t_dwell,NRT,GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename,'sb) as csvfile:
datareader = csv.reader(csvfile, delimiter=",")
next(datareader, None)
<FLOW CONTROL> #based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}
def createStopsDict(row):
#this reads each row in a csv file,
#creates a dict entry from row[0]: Vessname if not in dict
#or appends things after row[0] to the dict entry if Vessname in dict
ves = row[0]
if ves in stops_dict:
stops_dict[ves].append(row[1:])
else:
stops_dict[ves]=[row[1:]]
return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows based on the multi index value directly into a reusable data structure (right now the dictionary stop_dict).
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
lambda x: makeRoute(x.Vessname,
[x.ECDate,
x.Typedoc,
x.Port,
x.NRT,
x.GRT,
x.Draft]),
axis=1
)
where:
routemap = {}
def makeRoute(Vessname, info):
if Vessname in routemap:
route = routemap[Vessname]
route.append(info)
else:
routemap[Vessname] = [info]
return
returns a dictionary keyed to Vessname in the structure I need to compute things by calling list elements.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.