I have a script which produces a 15x1096 array of data using
np.savetxt("model_concentrations.csv", model_con, header=','.join(sources), delimiter=",")
Each of the 15 rows corresponds to a source of emissions, while each column is one day over three years. If at all possible I would like to have a 'header' in column 1 which states the emission source. When I use the option "header='source1,source2,...'" these labels get placed in the first row (as expected), i.e.
2per 3rd_pvd 3rd_unpvd 4rai_rd 4rai_yd 5rmo 6hea
2.44E+00 2.12E+00 1.76E+00 1.33E+00 6.15E-01 3.26E-01 2.29E+00 ...
1.13E-01 4.21E-02 3.79E-02 2.05E-02 1.51E-02 2.29E-02 2.36E-01 ...
My question is: is there a way to flip the header into the first column so the csv appears like this:
2per 7.77E+00 8.48E-01 ...
3rd_pvd 1.86E-01 3.62E-02 ...
3rd_unpvd 1.04E+00 2.65E-01 ...
4rai_rd 8.68E-02 2.88E-02 ...
4rai_yd 1.94E-01 8.58E-02 ...
5rmo 7.71E-01 1.17E-01 ...
6hea 1.07E+01 2.71E+00 ...
...
Labels for rows and columns are one of the main reasons pandas exists.
import pandas as pd
# Assemble your source labels in a list
sources = ['2per', '3rd_pvd', '3rd_unpvd', '4rai_rd',
'4rai_yd', '5rmo', '6hea', ...]
# Create a pandas DataFrame wrapping your numpy array
df = pd.DataFrame(model_con, index=sources)
# Saving it to a .csv file writes the index (the row labels) too
df.to_csv('model_concentrations.csv', header=None)
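If you would rather stay with plain numpy, a minimal sketch of the same idea (an assumption, not part of the original script) is to stack the labels in front of the data and write every field as text:
import numpy as np
# prepend the source labels as the first column; the stacked array becomes strings,
# so every field is written with the '%s' format
labelled = np.column_stack([sources, model_con])
np.savetxt("model_concentrations.csv", labelled, fmt="%s", delimiter=",")
The trade-off is that you lose per-column numeric formatting, which the pandas route keeps.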
I am new to Python. Can I please seek some help from the experts here?
I wish to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response,
based on the following filter criteria:
Kraken-listed currency pairs (please take note, there are kraken-futures, I don't want those)
Currencies paired with USD only, i.e. aaveusd, adausd, ...
The ideal dataframe I am looking for is shown below (somehow Excel loads this JSON perfectly, screenshot below).
Dataframe_Excel_Screenshot
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I just paste the link in a browser, the JSON response comes back with double quotes ("), but when I get it via Python code all the double quotes (") are changed to single quotes ('). Any idea why? I tried json_normalize, but then the response is flattened to [1 rows x 62688 columns]; I am not sure how to even go about working with 1 row and 62k columns, and I don't know how to extract the exact info in the dataframe format I need (please see the Excel screenshot).
Any help is much appreciated. Thank you!
The result JSON is a dict. The approach:
load this into a dataframe
decode the columns into products & measures
filter down to the required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
Sample output:
                price.last  price.high  price.low  price.change.percentage  price.change.absolute       volume  volumeQuote
product
kraken:aaveaud      347.41      347.41     338.14                0.0274147                   9.27      1.77707      613.281
kraken:aavebtc    0.008154    0.008289   0.007874                0.0219326               0.000175      403.506       3.2797
kraken:aaveeth      0.1327      0.1346     0.1327              -0.00673653                -0.0009      287.113      38.3549
kraken:aaveeur      219.87      226.46     209.07                0.0331751                   7.06      1202.65       259205
kraken:aavegbp      191.55      191.55     179.43                 0.030559                   5.68      6.74476      1238.35
kraken:aaveusd      259.53      267.48     246.64                0.0339841                   8.53      3623.66       929624
kraken:adaaud      1.61792     1.64602      1.563                0.0211692                0.03354      5183.61      8366.21
kraken:adabtc    3.757e-05   3.776e-05  3.673e-05                0.0110334                4.1e-07       252403      9.41614
kraken:adaeth    0.0006108     0.00063  0.0006069               -0.0175326              -1.09e-05       590839      367.706
kraken:adaeur      1.01188     1.03087   0.977345                0.0209986               0.020811  1.99104e+06  1.98693e+06
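Since the question also asks for USD-quoted pairs only, a hedged extension of the same filter (not tested against the live API; it assumes the product values look like kraken:aaveusd) could be:
# keep kraken markets, then keep only those quoted in USD
kraken = df.loc[df.index.get_level_values("product").str.startswith("kraken:")]
kraken_usd = kraken.loc[kraken.index.get_level_values("product").str.endswith("usd")]
kraken_usd = kraken_usd.unstack("measure").droplevel(0, 1)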
Hello. Try the code below. I have studied the structure of the dataset and modified the code to get the desired output.
import requests
import pandas as pd
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a = resp.json()
# creating a DataFrame from the 'result' key
da = pd.DataFrame(a['result'])
# using transpose to get the required columns and index
da = da.transpose()
# the 'price' column contains a dict which needs to become separate columns in the data frame
db = da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
# initialising a separate data frame for price
z = pd.DataFrame({})
for i in db.keys():
    i = pd.DataFrame(db[i], index=[i])
    z = pd.concat([z, i], axis=0)
da = pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')
I have a large data set (1.3 billion rows) that I want to visualize with Vaex. Since the data set was very big as CSV (around 130 GB in 520 separate files), I merged the files into an HDF5 file with the pandas dataframe.to_hdf function (format: table, appending each CSV file). If I use the pandas.read_hdf function to load a slice of the data, there is no problem:
x y z
0 -8274.591528 36.053843 24.766887
1 -8273.229203 34.853409 21.883050
2 -8289.577896 15.326737 26.041516
3 -8279.589741 27.798428 26.222326
4 -8272.836821 37.035071 24.795912
... ... ... ...
995 -8258.567634 3.581020 23.955874
996 -8270.526953 4.373765 24.381293
997 -8287.429578 1.674278 25.838418
998 -8250.624879 4.884777 21.815401
999 -8287.115655 1.100695 25.931318
1000 rows × 3 columns
This is what it looks like; I can access any column I want, and the shape is (1000, 3) as it should be. However, when I try to load the HDF5 file using the vaex.open function:
# table
0 '(0, [-8274.59152784, 36.05384262, 24.7668...
1 '(1, [-8273.22920299, 34.85340869, 21.8830...
2 '(2, [-8289.5778959 , 15.32673748, 26.0415...
3 '(3, [-8279.58974054, 27.79842822, 26.2223...
4 '(4, [-8272.83682085, 37.0350707 , 24.7959...
... ...
1,322,286,736 '(2792371, [-6781.56835851, 2229.30828904, -6...
1,322,286,737 '(2792372, [-6781.71119626, 2228.78749838, -6...
1,322,286,738 '(2792373, [-6779.3251589 , 2227.46826613, -6...
1,322,286,739 '(2792374, [-6777.26078082, 2229.49535808, -6...
1,322,286,740 '(2792375, [-6782.81758335, 2228.87820639, -6...
This is what I'm getting. The shape is (1322286741, 1) and the only column is 'table'. When I try to access the vaex-imported HDF5 as galacto[0]:
[(0, [-8274.59152784, 36.05384262, 24.76688728])]
In the pandas-imported data these are the x, y, z columns of the first row. When I tried to inspect the data in another program, it also gave an error saying no data was found. So I think the problem is that pandas appends to the HDF5 file row by row, and this layout doesn't work in other programs. Is there a way I can fix this issue?
hdf5 is as flexible as, say, JSON or XML, in that you can store data in any way you want. Vaex has its own way of storing the data (you can check the structure with the h5ls utility, it's very simple) that does not align with how Pandas/PyTables stores it.
Vaex stores each column as a single contiguous array, which is optimal if you don't work with all columns, and makes it easy to memory-map to a (real) numpy array. PyTables stores the fields of each row (at least those of the same type) next to each other. This means that if you compute the mean of the x column, you effectively run over all the data.
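If you prefer to stay in Python instead of the h5ls command line tool, a minimal sketch for inspecting the layout of the two files (the file names here are placeholders) could be:
import h5py
def show_layout(path):
    # recursively print every group and dataset in the HDF5 file
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(name, obj))
show_layout("pandas_written.hdf5")  # hypothetical file written by pandas/PyTables
show_layout("vaex_written.hdf5")    # hypothetical file written by vaex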
Since PyTables hdf5 is probably already much faster to read than CSV, I suggest you do the following (not tested, but it should get the point across):
import vaex
import pandas as pd
import glob
# make sure dir vaex exists
for filename in glob.glob("pandas/*.hdf5"):  # assuming your files live there
    pdf = pd.read_hdf(filename)
    df = vaex.from_pandas(pdf)  # now df is a vaex dataframe
    df.export(filename.replace("pandas", "vaex"), progress=True)  # same data in vaex' format
df = vaex.open("vaex/*.hdf5") # it will be concatenated
# don't access df.x.values since it's not a 'real' numpy array, but
# a lazily concatenated column, so it would need to memory copy.
# If you need that, you can optionally do (and for extra performance)
# df.export("big.hdf5", progress=True)
# df_single = vaex.open("big.hdf5")
# df_single.x.values # this should reference the original data on disk (no mem copy)
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, here is the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (plus whatever it takes to reach the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
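For completeness, a minimal hedged continuation (reusing the placeholder names from the sketch above) showing how the sliced list feeds back into dask:
df_small = dd.from_delayed(dfs[:5])  # dask will only ever read these five blocks
result = df_small.compute()          # materialise just that portion as a pandas dataframe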
My input data looks like this (input.txt):
AGAP2 TCGA-BL-A0C8-01A-11R-A10U-07 66.7328
AGAP2 TCGA-BL-A13I-01A-11R-A13Y-07 186.8366
AGAP3 TCGA-BL-A13J-01A-11R-A10U-07 183.3767
AGAP3 TCGA-BL-A3JM-01A-12R-A21D-07 33.2927
AGAP3 TCGA-BT-A0S7-01A-11R-A10U-07 57.9040
AGAP3 TCGA-BT-A0YX-01A-11R-A10U-07 99.8540
AGAP4 TCGA-BT-A20J-01A-11R-A14Y-07 88.8278
AGAP4 TCGA-BT-A20N-01A-11R-A14Y-07 129.7021
I want output.txt to look like:
TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A13I-01A-11R-A13Y-07 ...
AGAP2 66.7328 186.8366
AGAP3 0 0
Using pandas: read the csv, create a pivot, and write the csv.
import pandas as pd
# two names ("x", "y") for three columns, so the first column (the gene) becomes the index
df = pd.read_table("input.txt", names="xy", sep=r'\s+')
# reset the index first - we need a named column to pivot on
new = df.reset_index().pivot(index="index", columns='x', values='y')
new.fillna(0, inplace=True)
new.to_csv("output.csv", sep='\t') # tab separated
Reshaping and Pivot Tables
EDIT: filling empty values
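A minimal alternative sketch with explicit column names (gene, sample and value are assumed names, not taken from the original data) avoids the index trick:
import pandas as pd
# name all three columns explicitly instead of letting the first one become the index
df = pd.read_table("input.txt", names=["gene", "sample", "value"], sep=r"\s+")
wide = df.pivot(index="gene", columns="sample", values="value").fillna(0)
wide.to_csv("output.csv", sep="\t")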
I have the following Excel file:
ID Name Budget
... ... ...
... ... ...
... some unfilled blank cells
ID Name Budget
... ... ...
... some unfilled blank cells
ID Name Budget
... ... ...
I want to read this Excel sheet using pandas (for instance ExcelFile) into separate structures (each table before the unfilled cells constitutes a dataframe/dictionary/...).
I need to do this so that I can process the data within one structure as well as across multiple structures (like summing the budget of a repeating ID or Name in each structure).
What is the easiest way to do this while keeping a reasonable memory performance ?
Here is code that reads all the data with read_excel() and splits it:
import pandas as pd
df = pd.read_excel("c:\\tmp\\book1.xlsx", "Sheet1")
# rows where the first column repeats the literal header "ID" mark the start of a new table
mask = df["ID"] == "ID"
nmask = ~mask
# a cumulative sum over the repeated-header rows gives each table its own group number
s = mask.astype(int).cumsum()
# drop the repeated header rows and the blank rows, then split by group
dfs = [x.dropna() for _, x in df[nmask].groupby(s[nmask])]
for df in dfs:
    print(df)
The values in dfs are all of dtype object.
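To then aggregate across the split tables (for instance the summed budget per ID that the question mentions), a minimal hedged sketch, assuming the columns really are named ID and Budget, could be:
import pandas as pd
# stack the per-table dataframes back together and aggregate
combined = pd.concat(dfs, ignore_index=True)
combined["Budget"] = pd.to_numeric(combined["Budget"])  # the values were read as object
budget_per_id = combined.groupby("ID")["Budget"].sum()
print(budget_per_id)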