Storing L2 tick data with Python
Preamble:
I am working with L2 tick data.
The bid/offer sides will not necessarily be balanced in terms of the number of levels.
The number of levels could range from 0 to 20.
I want to save the full book to disk every time it is updated.
I believe I want to use a numpy array so that I can use h5py/vaex to perform offline data processing.
I'll ideally be writing (appending) to disk every x updates or on a timer.
If we assume an example book looks like this:
array([datetime.datetime(2017, 11, 6, 14, 57, 8, 532152), # book creation time
array(['20171106-14:57:08.528', '20171106-14:57:08.428'], dtype='<U21'), # quote entry (bid)
array([1.30699, 1.30698]), # quote price (bid)
array([100000., 250000.]), # quote size (bid)
array(['20171106-14:57:08.528'], dtype='<U21'), # quote entry (offer)
array([1.30709]), # quote price (offer)
array([100000.])], # quote size (offer)
dtype=object)
Numpy doesn't like the jaggedness of the array, and whilst I'm happy (enough) to use np.pad to pad the times/prices/sizes to a length of 20, I don't think I want to be creating an array for the book creation time.
Could/should I be going about this differently? Ultimately I'll want to do asof joins against a list of trades, hence I'd like a column-store approach. How is everyone else doing this? Are they storing multiple rows, or multiple columns?
EDIT:
I want to be able to do something like:
with h5py.File("foo.h5", "w") as f:
    f.create_dataset("book", data=my_np_array)  # dataset name here is arbitrary
and then later perform an asof join between my hdf5 tickdata and a dataframe of trades.
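To make the goal concrete, the kind of asof join I have in mind would be something like the following rough pandas sketch (made-up column names, just to illustrate):
import pandas as pd

# illustrative quotes (one row per book update) and trades frames,
# both sorted by their timestamp column
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.532", "2017-11-06 14:57:09.100"]),
    "sym": ["EURUSD", "EURUSD"],
    "bid_px": [1.30699, 1.30701],
    "ask_px": [1.30709, 1.30711],
})
trades = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.900"]),
    "sym": ["EURUSD"],
    "trade_px": [1.30705],
})

# attach the most recent quote at or before each trade
joined = pd.merge_asof(trades, quotes, on="time", by="sym")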
EDIT2:
In KDB the entry would look like:
q)t:([]time:2017.11.06D14:57:08.528;sym:`EURUSD;bid_time:enlist 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428;bid_px:enlist 1.30699 1.30698;bid_size:enlist 100000. 250000.;ask_time:enlist 2017.11.06T14:57:08.528;ask_px:enlist 1.30709;ask_size:enlist 100000.)
q)t
time sym bid_time bid_px bid_size ask_time ask_px ask_size
-----------------------------------------------------------------------------------------------------------------------------------------------------------
2017.11.06D14:57:08.528000000 EURUSD 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428 1.30699 1.30698 100000 250000 2017.11.06T14:57:08.528 1.30709 100000
q)first t
time | 2017.11.06D14:57:08.528000000
sym | `EURUSD
bid_time| 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428
bid_px | 1.30699 1.30698
bid_size| 100000 250000f
ask_time| 2017.11.06T14:57:08.528
ask_px | 1.30709
ask_size| 100000f
EDIT3:
Should I just give up on the idea of a nested column and have 120 columns (20*(bid_times + bid_prices + bid_sizes + ask_times + ask_prices + ask_sizes))? That seems excessive, and unwieldy to work with...
For anyone stumbling across this ~2 years later: I have recently revisited this code and have swapped out h5py for pyarrow + parquet.
This means I can create a schema with nested columns and read that back into a pandas DataFrame with ease:
import pyarrow as pa
schema = pa.schema([
("Time", pa.timestamp("ns")),
("Symbol", pa.string()),
("BidTimes", pa.list_(pa.timestamp("ns"))),
("BidPrices", pa.list_(pa.float64())),
("BidSizes", pa.list_(pa.float64())),
("BidProviders", pa.list_(pa.string())),
("AskTimes", pa.list_(pa.timestamp("ns"))),
("AskPrices", pa.list_(pa.float64())),
("AskSizes", pa.list_(pa.float64())),
("AskProviders", pa.list_(pa.string())),
])
In terms of streaming the data to disk, I use pq.ParquetWriter.write_table, keeping track of open file handles (one per Symbol) so that I can append to the file, only closing (and thus writing the metadata) when I'm done.
Rather than streaming pyarrow tables, I stream regular Python dictionaries, creating a pandas DataFrame when I hit a given size (e.g. 1024 rows), which I then pass to the ParquetWriter to write out. A rough sketch of this pattern is below.
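Something along these lines (not my exact production code; the batch size, file naming and helper names are illustrative, and schema refers to the pyarrow schema defined above):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 1024          # illustrative flush threshold
writers = {}               # one open ParquetWriter per symbol
buffers = {}               # pending rows (plain dicts) per symbol

def on_update(row: dict):
    # buffer one book update (a plain dict matching the schema above)
    sym = row["Symbol"]
    buffers.setdefault(sym, []).append(row)
    if len(buffers[sym]) >= BATCH_SIZE:
        flush(sym)

def flush(sym: str):
    rows = buffers.pop(sym, [])
    if not rows:
        return
    table = pa.Table.from_pandas(pd.DataFrame(rows), schema=schema, preserve_index=False)
    if sym not in writers:
        writers[sym] = pq.ParquetWriter(f"{sym}.parquet", schema)
    writers[sym].write_table(table)

def close_all():
    for sym in list(buffers):
        flush(sym)
    for writer in writers.values():
        writer.close()  # closing writes the parquet footer/metadata
Reading a file back is then just pd.read_parquet("EURUSD.parquet") or pq.read_table(...).to_pandas().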
Related
Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?
newbie python learner here! I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain stroop test data. The important columns for each are the condition column which has a random mix of incongruent and congruent conditions, the reaction time column for each condition and the column for if the response was correct, true or false. Here is an example of the dataframe for P01 I'm not sure if this counts as a code snippet? : trialnum,colourtext,colourname,condition,response,rt,correct 1,blue,red,incongruent,red,0.767041,True 2,yellow,yellow,congruent,yellow,0.647259,True 3,green,blue,incongruent,blue,0.990185,True 4,green,green,congruent,green,0.720116,True 5,yellow,yellow,congruent,yellow,0.562909,True 6,yellow,yellow,congruent,yellow,0.538918,True 7,green,yellow,incongruent,yellow,0.693017,True 8,yellow,red,incongruent,red,0.679368,True 9,yellow,blue,incongruent,blue,0.951432,True 10,blue,blue,congruent,blue,0.633367,True 11,blue,green,incongruent,green,1.289047,True 12,green,green,congruent,green,0.668142,True 13,blue,red,incongruent,red,0.647722,True 14,red,blue,incongruent,blue,0.858307,True 15,red,red,congruent,red,1.820112,True 16,blue,green,incongruent,green,1.118404,True 17,red,red,congruent,red,0.798532,True 18,red,red,congruent,red,0.470939,True 19,red,blue,incongruent,blue,1.142712,True 20,red,red,congruent,red,0.656328,True 21,red,yellow,incongruent,yellow,0.978830,True 22,green,red,incongruent,red,1.316182,True 23,yellow,yellow,congruent,green,0.964292,False 24,green,green,congruent,green,0.683949,True 25,yellow,green,incongruent,green,0.583939,True 26,green,blue,incongruent,blue,1.474140,True 27,green,blue,incongruent,blue,0.569109,True 28,green,green,congruent,blue,1.196470,False 29,red,red,congruent,red,4.027546,True 30,blue,blue,congruent,blue,0.833177,True 31,red,red,congruent,red,1.019672,True 32,green,blue,incongruent,blue,0.879507,True 33,red,red,congruent,red,0.579254,True 34,red,blue,incongruent,blue,1.070518,True 35,blue,yellow,incongruent,yellow,0.723852,True 36,yellow,green,incongruent,green,0.978838,True 37,blue,blue,congruent,blue,1.038232,True 38,yellow,green,incongruent,yellow,1.366425,False 39,green,red,incongruent,red,1.066038,True 40,blue,red,incongruent,red,0.693698,True 41,red,blue,incongruent,blue,1.751062,True 42,blue,blue,congruent,blue,0.449651,True 43,green,red,incongruent,red,1.082267,True 44,blue,blue,congruent,blue,0.551023,True 45,red,blue,incongruent,blue,1.012258,True 46,yellow,green,incongruent,yellow,0.801443,False 47,blue,blue,congruent,blue,0.664119,True 48,red,green,incongruent,yellow,0.716189,False 49,green,green,congruent,yellow,0.630552,False 50,green,yellow,incongruent,yellow,0.721917,True 51,red,red,congruent,red,1.153943,True 52,blue,red,incongruent,red,0.571019,True 53,yellow,yellow,congruent,yellow,0.651611,True 54,blue,blue,congruent,blue,1.321344,True 55,green,green,congruent,green,1.159240,True 56,blue,blue,congruent,blue,0.861646,True 57,yellow,red,incongruent,red,0.793069,True 58,yellow,yellow,congruent,yellow,0.673190,True 59,yellow,red,incongruent,red,1.049320,True 60,red,yellow,incongruent,yellow,0.773447,True 61,red,yellow,incongruent,yellow,0.693554,True 62,red,red,congruent,red,0.933901,True 63,blue,blue,congruent,blue,0.726794,True 64,green,green,congruent,green,1.046116,True 65,blue,blue,congruent,blue,0.713565,True 66,blue,blue,congruent,blue,0.494177,True 67,green,green,congruent,green,0.626399,True 68,blue,blue,congruent,blue,0.711896,True 69,blue,blue,congruent,blue,0.460420,True 
70,green,green,congruent,yellow,1.711978,False 71,blue,blue,congruent,blue,0.634218,True 72,yellow,blue,incongruent,yellow,0.632482,False 73,yellow,yellow,congruent,yellow,0.653813,True 74,green,green,congruent,green,0.808987,True 75,blue,blue,congruent,blue,0.647117,True 76,green,red,incongruent,red,1.791693,True 77,red,yellow,incongruent,yellow,1.482570,True 78,red,red,congruent,red,0.693132,True 79,red,yellow,incongruent,yellow,0.815830,True 80,green,green,congruent,green,0.614441,True 81,yellow,red,incongruent,red,1.080385,True 82,red,green,incongruent,green,1.198548,True 83,blue,green,incongruent,green,0.845769,True 84,yellow,blue,incongruent,blue,1.007089,True 85,green,blue,incongruent,blue,0.488701,True 86,green,green,congruent,yellow,1.858272,False 87,yellow,yellow,congruent,yellow,0.893149,True 88,yellow,yellow,congruent,yellow,0.569597,True 89,yellow,yellow,congruent,yellow,0.483542,True 90,yellow,red,incongruent,red,1.669842,True 91,blue,green,incongruent,green,1.158416,True 92,blue,red,incongruent,red,1.853055,True 93,green,yellow,incongruent,yellow,1.023785,True 94,yellow,blue,incongruent,blue,0.955395,True 95,yellow,yellow,congruent,yellow,1.303260,True 96,blue,yellow,incongruent,yellow,0.737741,True 97,yellow,green,incongruent,green,0.730972,True 98,green,red,incongruent,red,1.564596,True 99,yellow,yellow,congruent,yellow,0.978911,True 100,blue,yellow,incongruent,yellow,0.508151,True 101,red,green,incongruent,green,1.821969,True 102,red,red,congruent,red,0.818726,True 103,yellow,yellow,congruent,yellow,1.268222,True 104,yellow,yellow,congruent,yellow,0.585495,True 105,green,green,congruent,green,0.673404,True 106,blue,yellow,incongruent,yellow,1.407036,True 107,red,red,congruent,red,0.701050,True 108,red,green,incongruent,red,0.402334,False 109,red,green,incongruent,green,1.537681,True 110,green,yellow,incongruent,yellow,0.675118,True 111,green,green,congruent,green,1.004550,True 112,yellow,blue,incongruent,blue,0.627439,True 113,yellow,yellow,congruent,yellow,1.150248,True 114,blue,yellow,incongruent,yellow,0.774452,True 115,red,red,congruent,red,0.860966,True 116,red,red,congruent,red,0.499595,True 117,green,green,congruent,green,1.059725,True 118,red,red,congruent,red,0.593180,True 119,green,yellow,incongruent,yellow,0.855915,True 120,blue,green,incongruent,green,1.335018,True But I am only interested in the 'condition', 'rt', and 'correct' columns. I need to create a table that says the mean reaction time for the congruent conditions, and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table: Participant Stimulus Type Mean Reaction Time Percentage Correct 01 Congruent 0.560966 80 01 Incongruent 0.890556 64 02 Congruent 0.460576 89 02 Incongruent 0.956556 55 Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice! I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe? And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenates the results in a table for all the participants? 
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the dataframes of the participants. I hoped this would help me do the same analysis on all of them at once, but the problem is it doesn't identify the individual participant for each of the rows from each participant csv file (there are 120 rows for each participant like the example I give above) that I had put into one table:
import os
import glob
import pandas as pd

#set working directory
os.chdir('data')

#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

#export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table, and then perform the mean and percentage-correct analysis on the two conditions for each participant in that big concatenated table? Or would it be better to do the analysis first and then loop it over all of the individual participant csv files? I'm sorry if this is a really obvious process; I'm new to Python and trying to learn to analyse my data more efficiently, and I have been scouring the Internet and pandas tutorials but I'm stuck. Any help is welcome! I've also never used Stack Overflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, the code I've tried, and the desired output data. I really appreciate the help.
Try this:
import pandas as pd
from pathlib import Path

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the Participant ID
# (the `01` in `P01.csv`, etc), and whose values are
# the data frames initialized from the CSV
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}

# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size")
    }).assign(**{
        "Percentage Correct": lambda x: x["correct"] / x["size"]
    }).drop(columns=["correct", "size"])
    .reset_index()
)
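Since the question mentions plotting the per-condition means later, a possible follow-up using the result frame above (just a sketch; rt is assumed to be in seconds):
import matplotlib.pyplot as plt

(result
 .pivot(index="participant", columns="condition", values="Mean Reaction Time")
 .plot(kind="bar"))
plt.ylabel("Mean Reaction Time (s)")
plt.show()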
ways to improve efficiency of Python script
I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and stores the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save it as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have >12,000,000 lines. I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage for one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}

#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)

#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)

#find read counts
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
    n+=1

df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it to run more efficiently?
If I understood correctly and you are trying to match between coordinates of genes in different files, I believe the best option would be to use something like a KDTree partitioning algorithm. You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as coordinates:
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame()  # this is your dataframe with DNA data
rna = pd.DataFrame()  # Same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist, 0):
        # Now that you have the index of the matching row in DNA data
        # you can extract information using the index and do whatever
        # you want with it
        dna_gene_data = dna.loc[idx[0][0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely easy to work with at the cost of it being inefficient! Scientific libraries (such as pandas and numpy) help here by only paying the Python overhead a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with).
General advice:
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e. there may be only 10 columns, but 10,000 rows; go over the columns)
More on this topic here: How to iterate over rows in a DataFrame in Pandas? Answer: DON'T*!
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a builtin one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
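As a concrete (hypothetical) sketch of that advice applied to the read-counting step in the question - file names, separators and column names here are assumptions based on the samples shown - the inner loop can be replaced by two binary searches over the sorted read start positions:
import numpy as np
import pandas as pd

# assumed file names, following the format strings in the question
rna = pd.read_csv("top500_R8_7-CON-2-RNA.gtf", sep="\t", header=None,
                  names=["gene", "chrom", "start", "stop", "fpkm"])
dna = pd.read_csv("D2_7-CON-2-6.bed", sep="\t", header=None,
                  names=["chrom", "start", "stop", "read", "coverage", "strand"])

# sort the read start positions once ...
read_starts = np.sort(dna["start"].to_numpy())

# ... then, for every gene, count reads whose start falls strictly
# between the gene's start and stop (same condition as the original loop)
lo = np.searchsorted(read_starts, rna["start"].to_numpy(), side="right")
hi = np.searchsorted(read_starts, rna["stop"].to_numpy(), side="left")
rna["read_count"] = hi - lo

rna.to_csv("read_counts.csv", index=False)  # output name is illustrative
This keeps the strictly-between condition (start < read_start < stop) from the original code and counts reads for all ~9,000 genes in one vectorized pass.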
Sorting in pandas by multiple column without distorting index
I have just started out with Pandas and I am trying to do a multilevel sorting of data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME, CTYNAME and then sort the data by CENSUS2010POP.
After I set the index, the data appears like in pic 1 (before sorting by CENSUS2010POP), and when I sort, the data appears like pic 2 (after sorting). You can see the indices are messy and no longer sorted serially. I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back to five years ago and does not work when I try it. I am yet to learn the groupby function. Could you please tell me a way I can achieve this?
ps: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses including PY4E.com.
I used this code to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and used the code below to sort the data:
census_dfq6 = census_dfq6.sort_values(by = ['CENSUS2010POP'], ascending = [False])
Sample data I am working with (I would love to share the csv file but I don't see a way to share this):
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required end result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
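If the goal is to keep each state's counties grouped together while ordering them by CENSUS2010POP inside each state, one possible approach is a sketch like this (untested against the full census file, assuming census_dfq6 is loaded as in the question):
census_dfq6 = census_dfq6.set_index(['STNAME', 'CTYNAME'])

# Sort by population first, then do a stable sort on the STNAME level only;
# sort_remaining=False leaves the within-state (population) order untouched.
census_dfq6 = (
    census_dfq6
    .sort_values('CENSUS2010POP', ascending=False)
    .sort_index(level='STNAME', sort_remaining=False, kind='mergesort')
)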
pulling a column of data with a set number of rows from multiple text files into one text file
I have several hundred text files. I want to extract a specific column with a set number of rows. The files are exactly the same; the only thing different is the data values. I want to put that data into a new text file, with each new column preceding the previous one.
The file is a .sed, basically the same as a .txt file. This is what it looks like (the file actually goes from Wvl 350-2150):
Comment:
Version: 2.2
File Name: C:\Users\HyLab\Desktop\Curtis Bernard\PSR+3500_1596061\PSR+3500_1596061\2019_Feb_16\Contact_00186.sed
<Metadata>
Collected By:
Sample Name:
Location:
Description:
Environment:
</Metadata>
Instrument: PSR+3500_SN1596061 [3]
Detectors: 512,256,256
Measurement: REFLECTANCE
Date: 02/16/2019,02/16/2019
Time: 13:07:52.66,13:29:17.00
Temperature (C): 31.29,8.68,-5.71,31.53,8.74,-5.64
Battery Voltage: 7.56,7.20
Averages: 10,10
Integration: 2,2,2,10,8,2
Dark Mode: AUTO,AUTO
Foreoptic: PROBE {DN}, PROBE {DN}
Radiometric Calibration: DN
Units: None
Wavelength Range: 350,2500
Latitude: n/a
Longitude: n/a
Altitude: n/a
GPS Time: n/a
Satellites: n/a
Calibrated Reference Correction File: none
Channels: 2151
Columns [5]:
Data:
Chan.# Wvl Norm. DN (Ref.) Norm. DN (Target) Reflect. %
0 350.0 1.173460E+002 1.509889E+001 13.7935
1 351.0 1.202493E+002 1.529762E+001 13.6399
2 352.0 1.232869E+002 1.547818E+001 13.4636
3 353.0 1.264006E+002 1.563467E+001 13.2665
4 354.0 1.294906E+002 1.578425E+001 13.0723
I've taken some coding classes, but that was a long time ago. I figured this is a pretty straightforward problem for even a novice coder, which I am not, but I can't seem to find anything like this, so I was hoping for help on here. I honestly don't need anything fancy; just something like this would be amazing so I don't have to copy and paste each file!
12.3 11.3 etc...
12.3 11.3 etc...
12.3 11.3 etc...
etc.. etc.. etc...
In MATLAB R2016b or later, the easiest way to do this would be using readtable:
t = readtable('file.sed', delimitedTextImportOptions( ...
    'NumVariables', 5, 'DataLines', 36, ...
    'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
where
file.sed is the name of the file
'NumVariables', 5 means there are 5 columns of data
'DataLines', 36 means the data starts on the 36th line and continues to the end of the file
'Delimiter', ' ' means the character that separates the columns is a space
'ConsecutiveDelimitersRule', 'join' means treat more than one space as if they were just one (rather than as if they separate empty columns of data).
This assumes that the example file you've posted is in the exact format of your real data. If it's different you may have to modify the parameters above, possibly with reference to the help for delimitedTextImportOptions or, as an alternative, fixedWidthImportOptions.
Now you have a MATLAB table t with five columns, of which column 2 is the wavelengths and column 5 is the reflectances - I assume that's the one you want? You can access that column with
t(:,5)
So to collect all the reflectance columns into one table you would do something like
fileList = something % get the list of files from somewhere - say as a string array or a cell array of char
resultTable = table;
for ii = 1:numel(fileList)
    sedFile = fileList{ii};
    t = readtable(sedFile, delimitedTextImportOptions( ...
        'NumVariables', 5, 'DataLines', 36, ...
        'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
    t.Properties.VariableNames{5} = sprintf('Reflectance%d', ii);
    resultTable = [resultTable, t(:,5)];
end
The t.Properties.VariableNames ... line is there because column 5 of t will be called Var5 every time, but in the result table each variable name needs to be unique. Here we're renaming the output table variables Reflectance1, Reflectance2 etc. but you could change this to whatever you want - perhaps the name of the actual file from sedFile - as long as it's a valid unique variable name.
Finally you can save the result table to a text file using writetable. See the MATLAB help for how to use that.
In Python 3.x with numpy:
import numpy as np

file_list = something  # filenames in a Python list
result_array = None
for sed_file in file_list:
    reflectance_column = np.genfromtxt(sed_file, skip_header=35, usecols=4)
    result_array = (reflectance_column if result_array is None
                    else np.column_stack((result_array, reflectance_column)))
np.savetxt('outputfile.txt', result_array)
Here
skip_header=35 ignores the first 35 lines
usecols=4 only returns column 5 (Python uses zero-based indexing)
see the help for savetxt for further details
Pandas dataframe CSV reduce disk size
For my university assignment, I have to produce a csv file with all the distances between the airports of the world... the problem is that my csv file weighs 151 MB. I want to reduce it as much as I can.
This is my csv, and this is my code:
# drop all features we don't need
for attribute in df:
    if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
        df = df.drop(attribute, axis=1)

# create a dictionary of airports, each airport has the following structure:
# IATA : (NAME, COUNTRY, LAT, LNG)
airport_dict = {}
for airport in df.itertuples():
    airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])

# From tutorial 4 solution:
airportcodes=list(airport_dict)
airportdists=pd.DataFrame()
for i, airport_code1 in enumerate(airportcodes):
    airport1 = airport_dict[airport_code1]
    dists=[]
    for j, airport_code2 in enumerate(airportcodes):
        if j > i:
            airport2 = airport_dict[airport_code2]
            dists.append(distanceBetweenAirports(airport1[2],airport1[3],airport2[2],airport2[3]))
        else:
            # little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
            dists.append(0)
    airportdists[i]=dists
airportdists.columns=airportcodes
airportdists.index=airportcodes

# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)

airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million rows, and a disk size of 419 MB... not quite an improvement...
Can you help me shrink the size of my csv? Thank you!
I have done a similar application in the past; here's what I would do: it is difficult to shrink your file, but if your application needs, for example, the distances from one airport to all the others, I suggest you create 9541 files, where each file holds the distances from one airport to all the others and is named after that airport. In this case, loading a single file is really fast.
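A sketch of that per-airport layout, reusing the airportdists frame built in the question (the distances/ directory name is illustrative):
from pathlib import Path
import pandas as pd

Path("distances").mkdir(exist_ok=True)
for code in airportdists.columns:
    # the matrix in the question only fills one triangle, so take the
    # union of the airport's row and column to get all of its distances
    dists = pd.concat([airportdists[code], airportdists.loc[code]]).dropna()
    dists.to_csv(f"distances/{code}.csv", header=["distance"])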
My suggestion would be, instead of storing as a CSV, to store in a key-value data structure like JSON. It will be very fast on retrieval. Or try the parquet file format, which will consume about 1/4 of the CSV file storage.
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_letters

# create a dataframe
df = pd.DataFrame(np.random.randint(0,10000,size=(1000000, 52)),columns=list(ascii_letters))

df.to_csv('csv_store.csv',index=False)
print('CSV Consumed {} MB'.format(Path('csv_store.csv').stat().st_size*0.000001))
#CSV Consumed 255.22423999999998 MB

df.to_parquet('parquet_store',index=False)
print('Parquet Consumed {} MB'.format(Path('parquet_store').stat().st_size*0.000001))
#Parquet Consumed 93.221154 MB
The title of the question, "..reduce disk size", is solved by outputting a compressed version of the csv:
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv', compression='zip')
Or one better with Pandas 0.24.0:
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv.zip')
You will find the csv is hugely compressed. This of course does not solve for optimizing load and save time, and does nothing for working memory. But hopefully useful when disk space is at a premium or cloud storage is being paid for.
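For reading the compressed file back, pandas infers the compression from the .zip extension, e.g. (a small follow-up sketch):
import pandas as pd

airportdists = pd.read_csv(r'../Project Data Files-20190322/distances.csv.zip', index_col=0)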
The best compression would be to instead store the latitude and longitude of each airport, and then compute the distance between any pair on demand. Say, two 32-bit floating point values for each airport plus the identifier, which would be about 110K bytes - compressed by a factor of about 1300.
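For illustration, a distance-on-demand helper might look like this (a sketch using the haversine formula with an approximate 6371 km Earth radius; the LAT/LNG column names follow the question's dataframe):
import numpy as np

def haversine_km(lat1, lng1, lat2, lng2):
    # great-circle distance in kilometres between two points given in degrees
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# e.g. the distance between the first two airports in the dataframe
# d = haversine_km(df['LAT'][0], df['LNG'][0], df['LAT'][1], df['LNG'][1])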