Matching cells in CSV to return calculation - python

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset # in both CSVs, return each column's values, and then perform calculations on each column. Once I have completed those calculations I intend to graph various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV files using the following code:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)
concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net Win'))
I am very new to Python so any and all direction is welcome. Thank you!

I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and, when you're done, just pick out all the data for a specific asset number using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that depends on what you want, i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        with open(file, 'r') as f:
            data = [line for line in f.read().splitlines()
                    if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
Although how well the above performs is going to depend on how large the dataset you're going through is. A method like this needs to search through every line and perform several high-level functions on each one, so it could potentially be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned is then stored as a string.
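If the per-line Python work does become a bottleneck, a pandas-based sketch of the same get_asset idea (the path and the 'Asset' column name are assumptions carried over from the question) pushes the filtering into the DataFrame instead:

import glob
import pandas as pd

def get_asset(asset_number):
    # read each CSV once and let pandas filter the rows for this asset
    frames = []
    for path in glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv'):
        df = pd.read_csv(path)
        frames.append(df[df['Asset'] == asset_number])
    return pd.concat(frames, ignore_index=True)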

I suppose you need to add pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)
concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby, etc.
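For the specific layout in the question, a minimal sketch of the per-asset totals (the Asset/X/Y/Z names come from the sample data; the file pattern and "most recent 30 by modification time" logic are my assumptions):

import glob
import os
import pandas as pd

# take the 30 most recently modified CSVs, as the question describes
files = sorted(glob.glob('FDR*.csv'), key=os.path.getmtime)[-30:]
combined = pd.concat([pd.read_csv(f) for f in files])

# sum each data column per asset across all files
totals = combined.groupby('Asset')[['X', 'Y', 'Z']].sum()
print(totals)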


How to read space delimited data, two row types, no fixed width and plenty of missing values?

There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read Japan's Meteorological Agency typhoon history data which is supposed to have this format, but doesn't actually:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that NOAA's is comma delimited, and missing values were given as -999 or NaN, which simplified reading the data. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows doesn't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.
I went a little deep for you here, because I'm assuming you're doing this in the name of science, and if I can help someone trying to understand climate change then it's a good cause.
After looking the data over, I've noticed the issue is related to the data being stored in a de-normalized structure. There are 2 ways you can approach this issue off the top of my head. Re-writing the file to another file to load into pandas or dask is what I'll show, since that's probably the easiest way to think about it (but certainly not the most efficient, for those that will inevitably roast me in the comments).
Think of this like it's two separate tables, with a one-to-many relationship: one table for typhoons and another for the data belonging to a given typhoon.
A decent, but not really efficient, way would be to rewrite it to a better nested structure, like JSON, then load the data in using that. Note the 2 distinct types of rows.
Step 1: map the data out
There are really 2 tables in one file here. Each typhoon is going to show up as a row that appears like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
While the records for that typhoon are going to follow that row (think of these as separate rows):
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
Load the file in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
Now that we have that read in, we're going to need to perform some logic to separate some lines from others. It appears that every time there is a typhoon record, the line begins with '66666', so let's key off that. Given we look at each individual line in a (horribly inefficient) loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
    # do stuff
else:
    # do other stuff
That's going to be a pretty solid way to separate that logic for now, which will be useful to guide splitting it up. Now, we need to write a loop that will check that for each row:
# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        pass  # do stuff
    else:
        pass  # do other stuff

# read through lines list from the .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)
Lastly, we're going to need to write some logic to extract that data within the if/else inside the write_typhoon() function. I didn't care to do a whole lot of thinking here, and opted for the simplest thing I could make: defining the fixed-width metadata myself, because "yolo":
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
Okay, that took me longer than I'm willing to admit. I won't lie. Gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here is what I was able to come up with along that front:
import pandas as pd

df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)
A great reference for that can be found in the answers for this question (particularly the second one, not the selected one)
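To make the record_path/meta mechanics concrete, here is a tiny self-contained illustration (the values are made up):

import pandas as pd

collection = [
    {"AA": "66666", "HH": "MIRREILE", "data": [
        {"A": "92092806", "E": "1107"},
        {"A": "92092812", "E": "1103"},
    ]},
]
# each entry of 'data' becomes a row; 'AA' and 'HH' are repeated onto each row
df = pd.json_normalize(collection, record_path='data', meta=["AA", "HH"])
print(df)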
Put it all together now:
import pandas as pd

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection

# read through file sequentially
for line in lines:
    write_typhoon(line, collection)

# load to pandas df using json_normalize
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)
print(df.head(20))  # let's see what we've got!
There's someone who might have had the same problem and created a library for it, you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.
Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file into CSV format:
def get_idxs(string, char):
    idxs = []
    for i in range(len(string)):
        if string[i - 1].isalpha() and string[i] == char:
            idxs.append(i)
    return idxs

def replace(string, idx, replacement):
    string = list(string)
    try:
        for i in idx:
            string[i] = replacement
    except TypeError:
        string[idx] = replacement
    return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next, apply those functions to the .txt file and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm

with ExitStack() as stack:
    read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
    write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
    for line in tqdm(read_file.readlines()):
        if ' ' in line[:8]:  # line is header data
            write_file.write(replace(line, header_idxs, ',') + '\n')
        else:  # line is track data
            write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
               'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
              'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']
data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')
# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]
# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]
# forward fill the header data into the NaN's below each header row
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].ffill()
# delete now extraneous header rows
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635

Ways to improve efficiency of a Python script

I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and stores the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
# import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

# set sample name
sample = 'CON-2'
# set fraction number
f = 6
# dictionary to store values
d = {}

# load file name into variable
fileRNA = "top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
# read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")

# convert tsv file into Python list
expGenes = []
for row in readRNA:
    gene = row[0], row[1], row[2], row[3], row[4]
    expGenes.append(row)
# print(expGenes)

# establish file name for DNA reads
fileDNA = "D2_7-{}-{}.bed".format(sample, f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")

# put file into Python list
MCNgenes = []
for row in readDNA:
    read = row[0], row[1], row[2]
    MCNgenes.append(read)

# find read counts
n = 0  # counter (was previously used without being initialized)
for r in expGenes:
    # include FPKM in the dictionary
    d[r[0]] = [r[4]]
    regionCount = 0
    # set start and stop points based on transcript file
    chr = r[1]
    start = int(r[2])
    stop = int(r[3])
    # print("start:", start, "stop:", stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount += 1
    d[r[0]].append(regionCount)
    n += 1

df = pd.DataFrame.from_dict(d)
# convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
If I understood correctly and you are trying to match between coordinates of genes in different files, I believe the best option would be to use something like a KDTree partitioning algorithm.
You can use KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as coordinates:
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame()  # this is your dataframe with DNA data
rna = pd.DataFrame()  # same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])  # query expects a 2D array
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist[0][0], 0):
        # Now that you have the index of the matching row in the DNA data
        # you can extract information using the index and do whatever
        # you want with it
        dna_gene_data = dna.loc[idx[0][0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely easy to work with at the cost of being inefficient! Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with).
General advice
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e. if there are only 10 columns but 10,000 rows, loop over the columns); see the sketch below
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
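To illustrate that advice on the question's counting step, here is a sketch (with made-up toy values; the real column names come from the question's files) that replaces the inner per-read Python loop with two numpy binary searches:

import numpy as np
import pandas as pd

# toy stand-ins for the RNA gene table and the DNA read start positions
rna = pd.DataFrame({'gene': ['MSTRG.38', 'MSTRG.36'],
                    'start': [9204, 7582], 'stop': [9987, 8265]})
read_starts = np.sort(np.array([7600, 8000, 9300, 9500, 12000]))

# count reads whose start falls strictly inside each gene interval:
# searchsorted does a binary search instead of a loop over every read
lo = np.searchsorted(read_starts, rna['start'].to_numpy(), side='right')
hi = np.searchsorted(read_starts, rna['stop'].to_numpy(), side='left')
rna['read_count'] = hi - lo
print(rna)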
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a built-in one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
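The resulting file can then be inspected with the standard-library pstats module, for example:

import pstats

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # top 10 calls by cumulative time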

Pandas dataframe CSV reduce disk size

For my university assignment, I have to produce a csv file with all the distances between the airports of the world... the problem is that my csv file weighs 151 MB. I want to reduce it as much as I can. This is my csv:
and this is my code:
# drop all features we don't need
for attribute in df:
    if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
        df = df.drop(attribute, axis=1)

# create a dictionary of airports, each airport has the following structure:
# IATA : (NAME, COUNTRY, LAT, LNG)
airport_dict = {}
for airport in df.itertuples():
    airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])

# From tutorial 4 solution:
airportcodes = list(airport_dict)
airportdists = pd.DataFrame()
for i, airport_code1 in enumerate(airportcodes):
    airport1 = airport_dict[airport_code1]
    dists = []
    for j, airport_code2 in enumerate(airportcodes):
        if j > i:
            airport2 = airport_dict[airport_code2]
            dists.append(distanceBetweenAirports(airport1[2], airport1[3], airport2[2], airport2[3]))
        else:
            # little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
            dists.append(0)
    airportdists[i] = dists
airportdists.columns = airportcodes
airportdists.index = airportcodes

# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million rows, and a disk size of 419 MB... not quite an improvement...
Can you help me shrink the size of my csv? Thank you!
I have done a similar application in the past; here's what I would do:
It is difficult to shrink your file, but if your application needs, for example, the distance from one airport to all others, I suggest you create 9541 files; each file will contain the distances from one airport to all the others, and its name will be the name of the airport.
In this case, loading a file is really fast.
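A minimal sketch of that layout (the output directory and the toy two-airport matrix are my assumptions, standing in for the question's full distance matrix):

import pandas as pd
from pathlib import Path

# toy stand-in for the question's square distance matrix
airportdists = pd.DataFrame({'LHR': [0.0, 5539.0], 'JFK': [5539.0, 0.0]},
                            index=['LHR', 'JFK'])

out_dir = Path('distances')
out_dir.mkdir(exist_ok=True)
for code in airportdists.columns:
    # one small file per airport: its distances to all the others
    airportdists[code].to_csv(out_dir / '{}.csv'.format(code), header=['distance'])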
My suggestion would be: instead of storing as a CSV, try to store it in a key-value data structure like JSON, which will be very fast on retrieval. Or try the parquet file format, which will consume about 1/4 of the CSV file's storage:
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_letters

# create a dataframe
df = pd.DataFrame(np.random.randint(0, 10000, size=(1000000, 52)), columns=list(ascii_letters))

df.to_csv('csv_store.csv', index=False)
print('CSV Consumed {} MB'.format(Path('csv_store.csv').stat().st_size * 0.000001))
# CSV Consumed 255.22423999999998 MB

df.to_parquet('parquet_store', index=False)
print('Parquet Consumed {} MB'.format(Path('parquet_store').stat().st_size * 0.000001))
# Parquet Consumed 93.221154 MB
The title of the question, "..reduce disk size" is solved by outputting a compressed version of the csv.
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv', compression='zip')
Or, better still, with Pandas 0.24.0 and later the compression is inferred from the file extension:
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv.zip')
You will find the csv is hugely compressed.
This of course does not solve for optimizing load and save time and does nothing for working memory. But hopefully useful when disk space is at a premium or cloud storage is being paid for.
The best compression would be to instead store the latitude and longitude of each airport, and then compute the distance between any two pairs on demand. Say, two 32-bit floating point values for each airport plus the identifier, which would be about 110 KB total. Compressed by a factor of about 1300.
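As a sketch of that idea (the haversine formula below and the sample coordinates are my own illustration, not from the question):

import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    # great-circle distance between two points, in kilometres
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# store only the small lookup table; compute any distance on demand
airports = pd.DataFrame({'IATA': ['LHR', 'JFK'],
                         'LAT': np.float32([51.4706, 40.6398]),
                         'LNG': np.float32([-0.4619, -73.7789])}).set_index('IATA')
a, b = airports.loc['LHR'], airports.loc['JFK']
print(haversine_km(a.LAT, a.LNG, b.LAT, b.LNG))  # roughly 5540 km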

Python: Remove all lines of csv files that do not match within a range

I have been given two sets of data in the form of csv files which have 23 columns and thousands of lines of data.
The data in column 14 corresponds to the positions of stars in an image of a galaxy.
The issue is that one set of data contains values for positions that do not exist in the second set of data. They need to both contain the same positions, but the positions are off by a value of 0.0002 between the two data sets.
F435.csv has values which are 0.0002 greater than the values in F550.csv. I am trying to find the matches between the two files, but within a certain range because all values are off by a certain amount.
Then, I need to delete all lines of data that correspond to values that do not match.
Below is a sample of the data from each of the two files:
F435W.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
F550M.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE,,FALSE
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04,,
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98,,
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61,,
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
Below is the code I have so far, but I'm unsure how to find matches based on that specific column. I am very new to Python, and this task is probably way beyond my knowledge of Python, but I desperately need to figure it out. I've been working on this single task for weeks, trying different methods. Thank you in advance!
import csv

with open('F435W.csv') as csvF435:
    readCSV = csv.reader(csvF435, delimiter=',')
    with open('F550M.csv') as csvF550:
        readCSV = csv.reader(csvF550, delimiter=',')
        for x in range(0, 6348):
            a = csvF435[x]
            for y in range(0, 6349):
                b = csvF550[y]
                if b < a + 0.0002 and b > a - 0.0002:
                    newlist.append(b)
                    break
You can use the following sample:
import csv

def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

interval = 0.0002

with open('F435W.csv') as csvF435:
    csvF435_in = csv.reader(csvF435, delimiter=',')

    # clean the file content before processing
    with open("merge.csv", "w") as merge_out:
        pass

    with open("merge.csv", "a") as merge_out:
        # write the header of the output csv file
        for header in csvF435_in:
            merge_out.write(','.join(header) + '\n')
            break

        for l435 in csvF435_in:
            with open('F550M.csv') as csvF550:
                csvF550_in = csv.reader(csvF550, delimiter=',')
                for l550 in csvF550_in:
                    if isfloat(l435[13]) and isfloat(l550[13]) and abs(float(l435[13]) - float(l550[13])) < interval:
                        merge_out.write(','.join(l435) + '\n')
F435W.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
1,2017.013,0.01242859,-8.2618,0,51434.12,0.3269918,-11.7781,0,0.01957931,1387.9406,541.916,49.9898514,41.5266996,8.81E+01,1.63E+03,1.44E+02,40.535,8.65,84.72,0.00061,0.00035,62.14
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
3,215.1939,0.01242859,-5.8321,0.0001,262.2751,0.03840466,-6.0469,0.0002,-0.002961465,3248.686,52.8478,50.003155,41.5019044,4.77E+00,5.05E+00,-1.63E-01,2.263,2.166,-65.29,0.002,0.0019,-66.78
4,0.3796681,0.01240305,1.0515,0.0355,0.5823653,0.05487975,0.587,0.1023,-0.00425157,3760.344,11.113,50.0051049,41.4949256,1.93E+00,1.02E+00,-7.42E-02,1.393,1.007,-4.61,0.05461,0.03818,-6.68
5,0.9584663,0.01249223,0.0461,0.0142,1.043696,0.0175857,-0.0464,0.0183,-0.004156116,4013.2063,9.1225,50.0057256,41.4914444,1.12E+00,9.75E-01,1.09E-01,1.085,0.957,28.34,0.01934,0.01745,44.01
F550M.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE,,FALSE
2,1921.566,0.01258874,-8.2091,0,37128.06,0.2618096,-11.4243,0,0.01455503,4617.5225,554.576,49.9887896,41.5264699,6.09E+01,8.09E+02,1.78E+01,28.459,7.779,88.63,0.00054,0.00036,77.04,,
3,1.055918,0.01256313,-0.0591,0.0129,9.834856,0.1109255,-2.4819,0.0122,-0.002955142,3936.4946,85.3255,49.9949149,41.5370016,3.98E+01,1.23E+01,1.54E+01,6.83,2.336,24.13,0.06362,0.01965,23.98,,
4,151.2355,0.01260153,-5.4491,0.0001,184.0693,0.03634057,-5.6625,0.0002,-0.002626019,3409.2642,76.9891,49.9931935,41.5442109,4.02E+00,4.35E+00,-1.47E-03,2.086,2.005,-89.75,0.00227,0.00198,66.61,,
5,0.3506025,0.01258874,1.138,0.039,0.3466277,0.01300407,1.1503,0.0407,-0.002441164,3351.9893,8.9147,49.9942299,41.5451727,4.97E-01,5.07E-01,7.21E-03,0.715,0.702,62.75,0.02,0.01989,82.88
merge.csv:
NUMBER,FLUX_APER,FLUXERR_APER,MAG_APER,MAGERR_APER,FLUX_BEST,FLUXERR_BEST,MAG_BEST,MAGERR_BEST,BACKGROUND,X_IMAGE,Y_IMAGE,ALPHA_J2000,DELTA_J2000,X2_IMAGE,Y2_IMAGE,XY_IMAGE,A_IMAGE,B_IMAGE,THETA_IMAGE,ERRA_IMAGE,ERRB_IMAGE,ERRTHETA_IMAGE
2,84.73392,0.01245409,-4.8201,0.0002,112.9723,0.04012135,-5.1324,0.0004,-0.002142646,150.306,146.7986,49.9942613,41.5444109,4.92E+00,5.60E+00,-2.02E-01,2.379,2.206,-74.69,0.00339,0.0029,88.88
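For larger files, a vectorized alternative is pandas.merge_asof, which matches each row to the nearest key within a tolerance. A sketch, assuming ALPHA_J2000 (column 14) is the position being compared:

import pandas as pd

f435 = pd.read_csv('F435W.csv').sort_values('ALPHA_J2000')
f550 = pd.read_csv('F550M.csv').sort_values('ALPHA_J2000')

# pair each F435 row with the nearest F550 position within 0.0002
merged = pd.merge_asof(f435, f550, on='ALPHA_J2000',
                       direction='nearest', tolerance=0.0002,
                       suffixes=('_435', '_550'))

# keep only the rows that actually found a partner
matched = merged.dropna(subset=['NUMBER_550'])
matched.to_csv('merge_asof.csv', index=False)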

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (use keep='last' to keep the last instead):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job.
Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            i = 1
            for value in rows[key]:
                print("\nValue {index} : {value}".format(index=i, value=value))
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'r', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Format each row into the final string
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(list(v.values())) for v in g]
            return groupedRows

    def normalizeRow(self, row):
        row[1] = float(row[1].replace(',', ''))  # "Prize_Pool"
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
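Since the answer above mentions updating to pandas, a rough pandas equivalent of the grouping might look like this (a sketch, assuming the same file and the thousands=',' trick from the accepted answer):

import pandas as pd

d = pd.read_csv('./Test1.csv', thousands=',')
# group rows by timestamp; Prize_Pool is already numeric thanks to thousands=','
for timestamp, group in d.groupby('Contest_Date_EST'):
    print("\nKey: " + timestamp)
    print(group.to_string(index=False))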
