Matching tuples in pandas and data processing - Python

The following is a simplified extract of the dataframe I want to process:
first.csv
,No.,Time,Source,Destination,Protocol,Length,Info,src_dst_pair
325778,112.305107,02:e0,Broadcast,ARP,64,Who has 253.244.230.77? Tell 253.244.230.67,"('02:e0', 'Broadcast')"
801130,261.868118,02:e0,Broadcast,ARP,64,Who has 253.244.230.156? Tell 253.244.230.67,"('02:e0', 'Broadcast')"
700094,222.055094,02:e0,Broadcast,ARP,60,Who has 253.244.230.77? Tell 253.244.230.156,"('02:e0', 'Broadcast')"
766542,766543,247.796156,100.118.138.150,41.177.26.176,TCP,66,32222 > http [SYN] Seq=0,"('100.118.138.150', '41.177.26.176')"
767405,248.073313,100.118.138.150,41.177.26.176,TCP,64,32222 > http [ACK] Seq=1,"('100.118.138.150', '41.177.26.176')"
767466,248.083268,100.118.138.150,41.177.26.176,HTTP,380,Continuation [Packet capture],"('100.118.138.150', '41.177.26.176')"
I have all the unique values of the last column, src_dst_pair:
uniq_src_dst_pair = numpy.unique(data.src_dst_pair.ravel())
[('02:e0', 'Broadcast') ('100.118.138.150', '41.177.26.176')]
How can I do the following in pandas: for each element in uniq_src_dst_pair, check it against df.src_dst_pair and, where it matches, sum up df.Length and store the totals separately?
My expected result is:
('02:e0', 'Broadcast') : 188
('100.118.138.150', '41.177.26.176') : 510
How can I do this?
Below is my attempt:
import pandas
import numpy
data = pandas.read_csv('first.csv')
print data
uniq_src_dst_pair = numpy.unique(data.src_dst_pair.ravel())
print uniq_src_dst_pair
print len(uniq_src_dst_pair)
# the following is hardcoded, but it needs to be generalized over the list above
match1 = data[data.src_dst_pair == "('02:e0:ed:0a:fb:5f', 'Broadcast')"] # doesn't work

Your CSV file is messed up: you shouldn't have the leading comma in the header, and you have an extra field in your 4th non-header row. Fixing that, you could use:
In [6]: data.groupby('src_dst_pair').Length.sum()
Out[6]:
src_dst_pair
('02:e0', 'Broadcast') 188
('100.118.138.150', '41.177.26.176') 510
Name: Length, dtype: int64
However, your final field, 'src_dst_pair', is superfluous if this is what you want to accomplish, because you can simply do something like the following:
In [8]: data.groupby(['Source','Destination']).Length.sum()
Out[8]:
Source Destination
02:e0 Broadcast 188
100.118.138.150 41.177.26.176 510
Name: Length, dtype: int64
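If you specifically want the mapping shown in your expected result, a minimal follow-up sketch (assuming the CSV issues above have been fixed) is to convert the groupby result with to_dict():
totals = data.groupby('src_dst_pair').Length.sum().to_dict()
# {"('02:e0', 'Broadcast')": 188, "('100.118.138.150', '41.177.26.176')": 510}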

Related

Unable to parse DataFrame values

In spite of searching for and adapting several potential solutions online and via Stack Overflow, I seem to be making no headway. I was testing this air quality index (AQI) library (you can install it with $ pip install python-aqi) that I recently discovered, using two different inputs. The first is a single value stored in a variable; the second is a series of values in a data frame. For the former the code ran successfully, but for the latter I kept getting this error: InvalidOperation: [<class 'decimal.ConversionSyntax'>]. Please help. Thanks. Here is the link to the data: https://drive.google.com/file/d/16QQnDIEFkDQGafNYTk6apY9JcRimRBxH/view?usp=sharing
Code 1 - successful
import aqi # import air quality library
import pandas as pd
# define the PM2.5 variable and store a value
pmt_2_5 = 60.5
mush_2pt5 = aqi.to_iaqi(aqi.POLLUTANT_PM25, str(pmt_2_5)) #AQI code
mush_2pt5
Code 2 - Unsuccessful
import aqi # import air quality library
import pandas as pd
# Read data
df_mush = pd.read_csv("book1.csv", parse_dates=["Date"], index_col="Date", sep=",")
# parse data in the third column to the variable (see image)
pmt_2_5 = df_mush['mush-pm_2_5']
# parse the variable and calculate AQI
df_mush_2pt5 = aqi.to_iaqi(aqi.POLLUTANT_PM25, str(pmt_2_5))
df_mush_2pt5
You can't pass a Series, only a string:
df_mush = pd.read_csv('book1.csv', parse_dates=['Date'], index_col='Date', sep=',')
df_mush_2pt5 = (df_mush['mush-pm_2_5'].astype(str)
                .map(lambda cc: aqi.to_iaqi(aqi.POLLUTANT_PM25, cc)))
Output:
# it's not a dataframe but a series
>>> df_mush_2pt5
0 154
1 152
2 162
3 153
4 153
5 158
6 153
7 134
8 151
9 136
10 154
Name: mush-pm_2_5, dtype: object
Documentation:
Help on function to_iaqi in module aqi:
to_iaqi(elem, cc, algo='aqi.algos.epa')
Calculate an intermediate AQI for a given pollutant. This is the
heart of the algo.
.. warning:: the concentration is passed as a string so
:class:`decimal.Decimal` doesn't act up with binary floats.
:param elem: pollutant constant
:type elem: int
:param cc: pollutant concentration (µg/m³ or ppm)
:type cc: str # <-- This is a string, not Series
:param algo: algorithm module canonical name
:type algo: str
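If you then want the computed values back alongside the original data, a minimal sketch (pm25_iaqi is a hypothetical column name; it assumes the df_mush and df_mush_2pt5 objects from above) would be:
# to_iaqi returns Decimal objects, so cast them to float for easier plotting/maths
df_mush['pm25_iaqi'] = df_mush_2pt5.astype(float)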

How to read space delimited data, two row types, no fixed width and plenty of missing values?

There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read the Japan Meteorological Agency's typhoon history data, which is supposed to have this format but doesn't actually follow it:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that NOAA's data is comma delimited and its missing values are given as -999 or NaN, which simplifies reading it. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows doesn't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.
I went a little deep for you here, because I'm assuming you're doing this in the name of science, and if I can help someone trying to understand climate change then it's a good cause.
After looking the data over, I've noticed the issue relates to the data being stored in a de-normalized structure. Off the top of my head there are two ways you can approach this. Rewriting the file to another file to load into pandas or dask is what I'll show, since that's probably the easiest way to think about it (but certainly not the most efficient, for those who will inevitably roast me in the comments).
Think of this as two separate tables with a one-to-many relationship: one table for typhoons and another for the records belonging to a given typhoon.
A decent, but not really efficient, way is to rewrite it into a better nested structure, like JSON, and then load the data in from that. Note the two distinct column layouts.
Step 1: map the data out
There are really two tables in this one file. Each typhoon shows up as a header row that looks like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
while the records for that typhoon follow that row (think of these as rows of a separate table):
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
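The target nested structure (an illustrative sketch only; the single-letter field names follow the slicing defined further down, and the values are taken loosely from the two example rows above) looks roughly like this:
collection = [
    {
        "AA": "66666", "BB": "9119", "CC": "150", "DD": "0045", "EE": "9119",
        "FF": "0", "GG": "6", "HH": "MIRREILE", "II": "19920701",
        "data": [
            {"A": "20080100", "B": "002", "C": "3", "D": "178", "E": "1107",
             "F": "994", "G": "035"}   # ...plus the remaining lettered fields
        ]
    }
    # ...one dict per typhoon, each with one sub-dict per observation row
]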
Load the file in as raw lines. Using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
Now that we have that read in, we need some logic to separate one kind of line from the other. It appears that every time there is a typhoon record, the line starts with '66666', so let's key off that. Given that we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
    pass  # do stuff
else:
    pass  # do other stuff
That's going to be a pretty solid way to separate that logic for now, and it will be useful to guide splitting things up. Next, we need to write a loop that runs that check for each row:
# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        pass  # do stuff
    else:
        pass  # do other stuff

# read through the lines list from .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)
Lastly, we need to write the logic that actually extracts the data inside the if/else branches of the write_typhoon() function. I didn't care to do a whole lot of thinking here and opted for the simplest thing I could make: defining the fixed-width metadata myself, because "yolo":
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
Okay, that took me longer than I'm willing to admit. I won't lie, it gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native Python types. The fun can begin!
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here is what I was able to come up with along that front:
import pandas as pd
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)
A great reference for that can be found in the answers to this question (particularly the second one, not the accepted one).
Put it all together now:
import pandas as pd

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection

# read through the file sequentially
for line in lines:
    write_typhoon(line, collection)

# load into a pandas df using json_normalize
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)

print(df.head(20))  # let's see what we've got!
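Every resulting column is still a string at this point. As an optional follow-up sketch (which of the lettered fields are truly numeric is an assumption on my part; adjust to your data), you could coerce the numeric ones:
# assumption: B, D, E, F and G hold numeric values in the track records
for col in ['B', 'D', 'E', 'F', 'G']:
    df[col] = pd.to_numeric(df[col], errors='coerce')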
Someone may have had the same problem and created a library for it; you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.
Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file into CSV format:
def get_idxs(string, char):
    idxs = []
    for i in range(len(string)):
        if string[i - 1].isalpha() and string[i] == char:
            idxs.append(i)
    return idxs

def replace(string, idx, replacement):
    string = list(string)
    try:
        for i in idx: string[i] = replacement
    except TypeError:
        string[idx] = replacement
    return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next apply those functions to the .txt and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm

with ExitStack() as stack:
    read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
    write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))

    for line in tqdm(read_file.readlines()):
        if ' ' in line[:8]:  # line is header data
            write_file.write(replace(line, header_idxs, ',') + '\n')
        else:  # line is track data
            write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
               'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
              'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']
data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')
# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]
# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]
# front fill NaN's for header data
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].fillna(method='pad')
# delete now extraneous header rows
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635
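As a quick sanity check (just a sketch, reusing the columns defined above), the number of track rows parsed per storm should line up with the records value carried over from each header row:
# rows parsed per storm vs. the record count announced in that storm's header row
check = data.groupby('storm_id').agg(parsed_rows=('date', 'size'),
                                     announced=('records', 'first'))
print(check[check.parsed_rows != check.announced])  # ideally empty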

Match index function from excel in pandas

There is an INDEX/MATCH combination in Excel that I use to check whether elements are present in the required column:
=iferror(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)
This is the formula I am using right now, and it gives me good results, but I want to implement it in Python.
brand N Z None
Honor 63 96 190
Tecno 0 695 763
From this table I want:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
It should compare both the column and the index and give the appropriate value.
I have tried the lookup function in pandas, but that gives me:
ValueError: Row labels must have same size as column labels
What you basically do with your Excel formula is create something like a pivot table; you can also do that with pandas, e.g. like this:
# Define the columns and brands, you like to have in your result table
# along with the dataframe in variable df it's the only input
columns_query=['L', 'N', 'Z']
brands_query=['Honor', 'Tecno', 'Bar']
# now begin processing by selecting the columns
# which should be shown and are actually present
# add the brand, even if it was not selected
columns_present= {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat= df.loc[df['brand'].isin(brands_query), columns_present].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows, which would otherwise not be
# displayed and fills them with NaN (the pandas
# equivalent for None)
columns_missing= set(columns_query).difference(columns_present)
brands_missing= set(brands_query).difference(df['brand'].unique())
num_dummies= max(len(brands_missing), len(columns_missing))
dummy_records= {
'brand': list(brands_missing) + [brands_query[0]] * (num_dummies - len(brands_missing)),
'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
'value': [np.NaN] * num_dummies
}
dummy_records= pd.DataFrame(dummy_records)
flat= pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)
For my testdata, this outputs:
value
variable L N Z
brand
Bar NaN NaN NaN
Honor NaN 63.0 96.0
Tecno NaN 0.0 695.0
The testdata is (note, that above we don't see col None and row Foo, but we see row Bar and column L, which are actually not present in the testdata, but were "queried"):
brand N Z None
0 Honor 63 96 190
1 Tecno 0 695 763
2 Foo 8 111 231
You can generate this testdata using:
import pandas as pd
import numpy as np
import io
raw=\
"""brand N Z None
Honor 63 96 190
Tecno 0 695 763
Foo 8 111 231"""
df= pd.read_csv(io.StringIO(raw), sep='\s+')
Note: the result as shown in the output is a regular pandas dataframe. So in case you plan to write the data back to an Excel sheet, there should be no problem (pandas provides methods to read/write dataframes to/from Excel files).
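For the simple case shown in the question (realigning to the queried brands and columns and filling the gaps with 0), a shorter sketch using reindex would also work, assuming the brand values are unique and reusing the brands_query/columns_query lists from above:
result = (df.set_index('brand')
            .reindex(index=brands_query, columns=columns_query)
            .fillna(0))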
Do you need to use pandas for this at all? You can do it with plain Python as well: read from one text file and print out the matched and processed fields.
Basic file reading in Python goes like this, where datafile.csv is your file. This reads all the lines in the file and prints out the right result. First you need to save your file in .csv format so there is a ',' separator between fields.
import csv  # use csv

print('brand L N Z')  # print new header

with open('datafile.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    next(spamreader, None)  # skip old header
    for row in spamreader:
        # You need to add Excel Match etc... logic here.
        print(row[0], 0, row[1], row[2])  # print output
Input file:
brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763
Prints out:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
(I am not familiar with the Excel MATCH function, so you may need to add some logic to the above Python script to get it working with all your data.)

Compute values from sequential pandas rows

I'm a Python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges, so that I can count stops, aggregate them into routes, and understand behavior over the route. The data originally comes in the form of two CSV files (entrance, Typedoc = 0, and clearance, Typedoc = 1, each about 85k rows / 19 MB) that I merged into one file and then performed some dimensionality reduction on. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
Typedoc Port NRT GRT Draft
Vessname ECDate
400 L 2012-01-19 0 2394 2328 7762 4.166667
2012-07-22 1 2394 2328 7762 17.000000
2012-10-29 0 2395 2328 7762 6.000000
A 397 2012-05-27 1 3315 2928 2928 18.833333
2012-06-01 0 3315 2928 2928 5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L': [
    ['2012-01-19', 0, 2394, 4.166667],
    ['2012-07-22', 1, 2394, 17.000000],
    ['2012-10-29', 0, 2395, 6.000000]
    ]
}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to want to parse my dates). I'm then going to extract the links as some form of:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values serve as edge weights. The nodes are the list of unique Port values, which get assigned the '400 L': [t_dwell, NRT, GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename, 'rb') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)
    <FLOW CONTROL>  # based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}

def createStopsDict(row):
    # this reads each row in a csv file,
    # creates a dict entry from row[0]: Vessname if not in dict
    # or appends things after row[0] to the dict entry if Vessname in dict
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves] = [row[1:]]
    return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows based on the multi-index value directly into a reusable data structure (right now the dictionary stops_dict)?
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}

def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
returns a dictionary keyed to Vessname in the structure I need to compute things by calling list elements.
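A slightly more pandas-native sketch of the same idea (assuming the column names shown above) is to let groupby build the per-vessel lists instead of appending inside apply:
m = movements.reset_index()
routemap = {
    vessname: group[['ECDate', 'Typedoc', 'Port', 'NRT', 'GRT', 'Draft']].values.tolist()
    for vessname, group in m.groupby('Vessname')
}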

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain values (such as Prize_Pool) results in pandas reading these entries as strings. I need to convert them to floats in order to make certain calculations. I've used Python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is now numeric:
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
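A hedged aside, since pandas has changed since this answer was written: .ix and the take_last argument have been removed in current versions, and the equivalents would be:
type(d.loc[0, 'Prize_Pool'])                          # instead of d.ix[0, 'Prize_Pool']
d.drop_duplicates('Contest_Date_EST', keep='first')   # instead of take_last=False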
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
def Run(self, filename):
# Get the formatted rows from CSV file
rows = self.readCsv(filename)
for key in rows.keys():
print "\nKey: " + key
i = 1
for value in rows[key]:
print "\nValue {index} : {value}".format(index = i, value = value)
i += 1
def readCsv(self, fileName):
with open(fileName, 'rU') as csvfile:
reader = csv.DictReader(csvfile)
# Keys may or may not be pulled in with extra space by DictReader()
# The next line simply creates a small dict of stripped keys to original padded keys
keys = { key.strip(): key for (key) in reader.fieldnames }
# Format each row into the final string
groupedRows = {}
for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
return groupedRows;
def normalizeRow(self, row):
row[1] = float(row[1].replace(',','')) # "Prize_Pool"
# and so on
return row
if __name__ == "__main__":
CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
