Analysing a JSON file in Python using pandas

I have to analyse a lot of data during my Bachelor's project.
The data will be handed to me in .json files. My supervisor has told me that it should be fairly easy if I just use pandas.
Since I am new to Python (I have decent experience with MATLAB and C, though), I am having a rough start.
If someone would be so kind as to explain to me how to do this, I would really appreciate it.
The files look like this:
{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7 ...
"data":[[526144,1451900097533,20000000.495000001,250000093.9642499983],[...
I need to import the data and analyse it (make some plots), but I'm not sure how to import data like this.
P.S. I have Python and the required packages installed.

You did not give the full format of the JSON file, but if it looks like
{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7,8,9],
"data":[[39,69,50,51],[62,14,12,49],[17,99,65,79],[93,5,29,0],[89,37,42,47],[83,79,26,29],[88,17,2,7],[95,87,34,34],[40,54,18,68],[84,56,94,40]]}
then you can do the following (I made up random numbers). In recent pandas versions, a literal JSON string should be wrapped in io.StringIO before it is passed in.
import pandas as pd
df = pd.read_json(file_name_or_Python_string, orient='split')
print(df)
   id  timestamp  offset_freq  reprate_freq
0  39         69           50            51
1  62         14           12            49
2  17         99           65            79
3  93          5           29             0
4  89         37           42            47
5  83         79           26            29
6  88         17            2             7
7  95         87           34            34
8  40         54           18            68
9  84         56           94            40
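As a slightly fuller sketch of the same idea, the snippet below loads a tiny "split"-oriented JSON string and prepares it for plotting. The first data row is taken from the question; the second row, the `time` column name, and the assumption that `timestamp` holds milliseconds since the Unix epoch are all made up for illustration and should be checked against the real files.

```python
from io import StringIO

import pandas as pd

# Two rows in the same "split" layout as the question.
# The second row is invented purely for illustration.
raw = (
    '{"columns":["id","timestamp","offset_freq","reprate_freq"],'
    '"index":[0,1],'
    '"data":[[526144,1451900097533,20000000.495,250000093.96425],'
    '[526145,1451900098533,20000000.505,250000093.96430]]}'
)

# convert_dates=False keeps pandas from guessing at date columns itself.
df = pd.read_json(StringIO(raw), orient="split", convert_dates=False)

# ASSUMPTION: the timestamps are milliseconds since the Unix epoch.
df["time"] = pd.to_datetime(df["timestamp"], unit="ms")

# Deviation of each offset frequency from the mean, a typical quantity to plot.
offset_dev = df["offset_freq"] - df["offset_freq"].mean()
print(df[["time", "offset_freq"]])
```

With matplotlib installed, `df.plot(x="time", y="offset_freq")` would then draw the series; for a real file, pass its path to `pd.read_json` instead of the `StringIO` wrapper.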

Related

Turn a second-order (2-D) array into a pandas DataFrame

I have a data set in the form of a second-order (2-D) array of arbitrary length, as shown below:
[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']
.
.
.
]
I want to turn it into a pandas DataFrame with rows and columns, as shown below:
15 39 17 43
23 40 18 44
28 41 18 45
28 42 27 46
34 43 26 47
.
.
.
Does anyone have an idea how to achieve this without saving the data out to files along the way?

My strategy is to first define a function that deals with the commas and quotes. Keeping in mind that your data is already a 2-dimensional NumPy array, I define the following function:
import numpy as np
import pandas as pd
def str_to_flt(lst):
    # Split each 'a,b' string on the comma and convert both parts to float
    tmp = np.array([[float(i.split(",")[0]),float(i.split(",")[1])] for i in lst])
    return tmp
# Convert both string columns and place the resulting number pairs side by side
df = pd.DataFrame(np.concatenate((str_to_flt(data[:,0]), str_to_flt(data[:,1])), axis=1))
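For comparison, here is a compact variant of the same idea that can be run as is. It is only a sketch: the sample array copies the first three rows from the question, and the column names `a` to `d` are placeholders, not something the question specifies.

```python
import numpy as np
import pandas as pd

# First three rows of the array shown in the question.
data = np.array([["15,39", "17,43"],
                 ["23,40", "18,44"],
                 ["28,41", "18,45"]])

# Join the two strings of each row with a comma, then split once,
# which yields four fields per row.
rows = [",".join(row).split(",") for row in data]

# Placeholder column names; astype(int) turns the strings into integers.
df = pd.DataFrame(rows, columns=["a", "b", "c", "d"]).astype(int)
print(df)
```

Nothing is written to disk at any point, which is what the question asked for.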

Your data:
import pandas as pd
from io import StringIO
s="""[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']]"""
df=pd.read_csv(StringIO(s),header=None)
You can do:
d={r"\[\['":"",r"'\]\]":"",r"'\]\]'":"",r"'\]":"",r"\['":"","' '":','}
df=df.replace(d,regex=True)
df[[1.2,1.5]]=df.pop(1).str.extract(r"(\d+),(\d+)")
df=df.sort_index(axis=1)
output of df:
0.0 1.2 1.5 2.0
0 15 39 17 43
1 23 40 18 44
2 28 41 18 45
3 28 42 27 46
4 34 43 26 47
Of course, you can rename the columns as needed using the columns attribute or the rename() method, and convert the data types with the astype() method.

How do I sort columns of numerical file data in python

I'm trying to write a piece of code in python to graph some data from a tab separated file with numerical data.
I'm very new to Python so I would appreciate it if any help could be dumbed down a little bit.
Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.

First of all, you should not post code as images; the editor here has functionality to insert and format code.
It's as simple as calling x.sort() and y.sort(), assuming both are 1-dimensional arrays sliced from data. Note that sort() works in place and returns None, and because such slices are views, the underlying data array is reordered as well; use np.sort(x) if you want a sorted copy instead.
Here is an example:
import numpy as np
array = np.random.randint(0,100, size=20)
print(array)
Output:
[89 47 4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23 4]
So if we use the method mentioned before (sort() returns None, so sort first and then print the array):
array.sort()
print(array)
Output:
[ 4 4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]
Easy as that :)
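Applied to a tab-separated file, the same approach might look like the sketch below. The three-line sample text is made up and stands in for the real file, and `np.sort` is used instead of the in-place `sort()` so that the loaded array keeps its original order.

```python
from io import StringIO

import numpy as np

# Made-up tab-separated sample standing in for the file from the question.
text = "3.0\t10.0\t7.0\n1.0\t30.0\t9.0\n2.0\t20.0\t8.0\n"
data = np.loadtxt(StringIO(text), delimiter="\t")

# np.sort returns sorted copies, so `data` itself is left untouched.
x_sorted = np.sort(data[:, 0])
y_sorted = np.sort(data[:, 1])
print(x_sorted, y_sorted)
```

For a real file, pass its path to `np.loadtxt` in place of the `StringIO` object. With matplotlib installed, `plt.plot(x_sorted, y_sorted)` would then graph one sorted column against the other.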

How to save Jupyter notebook output in a file using python command?

I was trying to save the output of my Jupyter notebook script 'npsmiles_descriptors.py' to a text file so that it can be used by other applications.
Can anyone help me by pointing out the error in my last command? I am getting a syntax error.
df1 = df['smiles']
print(df['smiles'])
0 O1[C@@H]2C[C@H](O)[C@@]3([C@H]([C@H](OC(=O)c4...
1 OC[C@H]1N(CCC1)C(=O)[C@@H](NC(=O)[C@H](CCCCC)...
2 O1[C@@H](C[C@H](O)\C=C/[C@@H]([C@H](O)[C@H](\...
3 O1[C@H](CO)[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O...
4 O1[C@@H]([C@H](C[C@@H](C)[C@]1(O)CO)C)[C@@H]1...
5 P(O[C@H]1[C@H](O[C@@]2(O[C@@H](C\C=C\c3nc(oc3...
6 O1[C@H](CO)[C@@H](O)[C@@H](O)[C@@H]1n1c2NC=[N...
7 O1[C@H]2[C@@H](CC[C@@]3(O[C@]34[C@@H]2C(=CC4)...
8 O1[C@H](CO)[C@@H](O)[C@@H](O)[C@@H]1n1cnc(C(=...
9 O1[C@@]2(C)[C@@](O)([C@@]3([C@@H]([C@H](O)[C@...
10 S(=O)(=O)([O-])N1C[C@](OC)(NC(=O)C)C1=O
11 S1[C@H]2N(C(C(=O)[O-])=C(C1)COC(=O)N)C(=O)[C@...
12 O1[C@@H]2C[C@@]3(O[C@H]([C@H](CC)C)[C@H](C=C3...
13 O1[C@@H](C)[C@@H](O)[C@@H]([NH3+])C[C@@H]1O[C...
14 O=C1[C@]2(O)[C@@H](C=C1C)[C@]1(O)[C@H]([C@H]3...
15 S(CC[NH3+])C=1C[C@H]2N(C=1C(=O)[O-])C(=O)[C@@...
16 O=C1N(C)[C@@H]([C@H](O)[C@@H](C\C=C\C)C)C(=O)...
17 O1[C@H](/C(=C/[C@H]2C[C@@H](OC)[C@H](O)CC2)/C...
18 O1C[C@H]1C(=O)CCCCC[C@@H]1NC(=O)[C@@H]2N(CCC2...
19 O(C)c1cc2N([C@@H]3[C@]4([C@H]5[NH+](CC=C[C@@]...
20 O(C)C1=CC=C2c3c(cc(OC)c(OC)c3OC)CC[C@H](NC(=O...
21 O=C([C@@H](\C=C(\C=C\C(=O)N[O-])/C)C)c1ccc(N(...
22 O1[C@](C)([C@H]2[C@H](OC)[C@H](OC(=O)\C=C\C=C...
23 O1[C@H]2n3c4c(c5c(CNC5=O)c5c6c(n(c45)[C@]1(C)...
24 O1[C@H](CC)[C@](O)(C)[C@H](O)[C@@H](C)C(=O)[C...
25 O1[C@@H](CO)[C@H](O)[C@@H](O)[C@H]([NH2+]C)[C...
26 S1[C@H]2N([C@@H](C(=O)[O--])C1(C)C)C(=O)[C@H]...
27 O=C(N[C@@H](O)C(=O)NCCCC[NH2+]CCC[NH3+])C[C@@...
28 O1[C@@H](CC(=O)[C@@H](\C=C(/C)\[C@@H](O)[C@@H...
29 Oc1ccc(cc1)[C@H](O)[C@@H](O)[C@@H]1NC(=O)[C@H...
30 O1[C@@H]2[C@@](O)([C@]34O[C@@H]5OC(=O)[C@H](O...
31 Clc1c2Oc3cc4[C@@H](NC(=O)[C@@H](NC(=O)[C@H](N...
32 O1[C@@H](C)[C@H](C)[C@H](O)[C@H](\C=C\C=C\C=C...
33 Clc1c2c(C(O[C@@H](C[C@H]3O[C@@H]3/C=C\C=C\C(=...
34 O1[C@H](C[C@@H](O)[C@H](C\C=C\Cc2c(C1=O)c(O)c...
35 S1C2=N[C@H](c3oc(c(n3)-c3oc(c(n3)-c3occ(n3)-c...
36 O1c2c3c4c(c(O)c2C)c(O)c(NC(=O)/C(=C\C=C\[C@H]...
37 O1[C@@H](C[C@H](OC)[C@@H](O)CC\C=C(\C=C\[C@H]...
38 O1[C@@H](C\C=C\C=C\[C@H](O)[C@@H](C[C@H](CC=O...
39 O1[C@@]2(C(=O)[O-])[C@](O)(C(O)=O)[C@H](O[C@]...
40 O1C[C@H](CO)[C@@H](O)C[C@]12OC[C@@H](CC2)CCS
41 ClC(\C=C\[C@@H](O)CC(C[C@H]1O[C@H]2[C@H](O)[C...
42 O1[C@@H]2[C@H](O[C@@]3([C@H](O[C@@H]4[C@H](O[...
43 O(C)c1cc2c(nccc2[C@@H](O)[C@H]2[N@@H+]3C[C@@H...
44 O1C[C@H](N=C1c1ccccc1O)C(=O)N[C@@H](CCCCN([O-...
45 O(C)c1c(OC)c2[nH]c(cc2cc1OC)C(=O)N1C=2[C@]3([...
46 [S+](CCCNC(=O)c1nc(sc1)-c1nc(sc1)CCNC(=O)[C@@...
47 O1[C@H](CCC\C=C\[C@H]2[C@@H](C[C@@H](O)C2)[C...
48 O1[C@@]23[C@@H]([C@H](C)C(=C)[C@@H](O)[C@@H]2...
49 s1cc(nc1C)\C=C(/C)\[C@H]1OC(=O)C[C@H](O)C(C)(...
50 S(C(=O)[C@@]1(NC(=O)[C@H](C)[C@@H]1O)[C@@H](O...
51 Ic1c(C)c(C(S[C@@H]2[C@H](O[C@@H](ON[C@H]3[C@H...
52 O1[C@@H]2O[C@@]3(OO[C@]24[C@@H](CC[C@H]([C@@H...
53 O1[C@@H](C[C@@H](O)CC1=O)CC[C@@H]1[C@@H]2C(C=...
54 O1[C@@H](C[C@@H](OC(=O)[C@@H](NC=O)CC(C)C)C\C...
55 O1[C@@H](C[C@H]2CO[C@@H](C\C(=C\C(OCCCCCCCCC(...
56 O1[C@H](C)[C@H](NC(=O)[C@@H](NC(=O)[C@H](NC(=...
57 O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](O)[C@H]([...
58 OC/C(=C\CC\C(=C\CO)\C)/CC\C=C(\CC\C=C(\C)/C)/C
59 O(C)C1=C2C[C@H](C[C@H](OC)[C@H](O)[C@H](\C=C(...
Name: smiles, dtype: object
python npsmiles_descriptors.py > out.txt
File "<ipython-input-35-e13d177d9791>", line 1
python npsmiles_descriptors.py > out.txt
^
SyntaxError: invalid syntax

Use !python npsmiles_descriptors.py > out.txt instead. In a Jupyter notebook, !<command line command> runs the command in the system shell, which is roughly equivalent to
import os
os.system("<command line command>")
Credit: https://stackoverflow.com/a/47952494/14212394
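If the goal is only to get the column into a text file, it can also be written directly from Python without going through the shell. The sketch below is one way to do that; the two SMILES strings are simple stand-ins for the real data, and the temporary output location is a choice made for this example only.

```python
import os
import tempfile

import pandas as pd

# Stand-in data; in the notebook, `df` would already hold the real SMILES column.
df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"]})

# Write one SMILES string per line. The temp directory is used here only so
# that the example does not clutter the working directory.
out_path = os.path.join(tempfile.gettempdir(), "out.txt")
with open(out_path, "w") as handle:
    for smiles in df["smiles"]:
        handle.write(smiles + "\n")
```

Writing the strings line by line avoids the truncated display (the trailing `...`) that `print(df['smiles'])` produces for long values.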

Python code crashes when running, but not when debugging (Ctypes)

I am running into a REALLY weird case with a small ctypes-based class that I am writing. The objective of this class is to load a matrix stored in a proprietary format into a Python structure that I had to create (these matrices can have several cores/layers, and each core/layer can have several indices that refer to only a few elements of the matrix, thus forming submatrices).
The code that tests the class is this:
import numpy as np
from READS_MTX import mtx
import time
mymatrix=mtx()
mymatrix.load('D:\\MyMatrix.mtx', True)
and the class I created is this:
import os
import numpy as np
from ctypes import *
import ctypes
import time
def main():
    pass
#A mydll mtx can have several cores
#we need a class to define each core
#and a class to hold the whole file together
class mtx_core:
    def __init__(self):
        self.name=None #Matrix core name
        self.rows=-1 #Number of rows in the matrix
        self.columns=-1 #Number of columns in the matrix
        self.type=-1 #Data type of the matrix
        self.indexcount=-1 #Tuple with the number of indices for each dimension
        self.RIndex={} #Dictionary with all indices for the rows
        self.CIndex={} #Dictionary with all indices for the columns
        self.basedata=None
        self.matrix=None
    def add_core(self, mydll,mat,core):
        nameC=ctypes.create_string_buffer(50)
        mydll.MATRIX_GetLabel(mat,0,nameC)
        nameC=repr(nameC.value)
        nameC=nameC[1:len(nameC)-1]
        #Add the information to the objects' methods
        self.name=repr(nameC)
        self.rows= mydll.MATRIX_GetBaseNRows(mat)
        self.columns=mydll.MATRIX_GetBaseNCols(mat)
        self.type=mydll.MATRIX_GetDataType(mat)
        self.indexcount=(mydll.MATRIX_GetNIndices(mat,0 ),mydll.MATRIX_GetNIndices(mat,0 ))
        #Define the data type Numpy will have according to the data type of the matrix in question
        dt=np.float64
        v=(self.columns*c_float)()
        if self.type==1:
            dt=np.int32
            v=(self.columns*c_long)()
        if self.type==2:
            dt=np.int64
            v=(self.columns*c_longlong)()
        #Instantiate the matrix
        time.sleep(5)
        self.basedata=np.zeros((self.rows,self.columns),dtype=dt)
        #Read matrix and puts in the numpy array
        for i in range(self.rows):
            mydll.MATRIX_GetBaseVector(mat,i,0,self.type,v)
            self.basedata[i,:]=v[:]
        #Reads all the indices for rows and put them in the dictionary
        for i in range(self.indexcount[0]):
            mydll.MATRIX_SetIndex(mat, 0, i)
            v=(mydll.MATRIX_GetNRows(mat)*c_long)()
            mydll.MATRIX_GetIDs(mat,0, v)
            t=np.zeros(mydll.MATRIX_GetNRows(mat),np.int64)
            t[:]=v[:]
            self.RIndex[i]=t.copy()
        #Do the same for columns
        for i in range(self.indexcount[1]):
            mydll.MATRIX_SetIndex(mat, 1, i)
            v=(mydll.MATRIX_GetNCols(mat)*c_long)()
            mydll.MATRIX_GetIDs(mat,1, v)
            t=np.zeros(mydll.MATRIX_GetNCols(mat),np.int64)
            t[:]=v[:]
            self.CIndex[i]=t.copy()
class mtx:
    def __init__(self):
        self.data=None
        self.cores=-1
        self.matrix={}
        mydll=None
    def load(self, filename, verbose=False):
        #We load the DLL and initiate it
        mydll=cdll.LoadLibrary('C:\\Program Files\\Mysoftware\\matrixDLL.dll')
        mydll.InitMatDLL()
        mat=mydll.MATRIX_LoadFromFile(filename, True)
        if mat<>0:
            self.cores=mydll.MATRIX_GetNCores(mat)
            if verbose==True: print "Matrix has ", self.cores, " cores"
            for i in range(self.cores):
                mydll.MATRIX_SetCore(mat,i)
                nameC=ctypes.create_string_buffer(50)
                mydll.MATRIX_GetLabel(mat,i,nameC)
                nameC=repr(nameC.value)
                nameC=nameC[1:len(nameC)-1]
                #If verbose, we list the matrices being loaded
                if verbose==True: print " Loading core: ", nameC
                self.datafile=filename
                self.matrix[nameC]=mtx_core()
                self.matrix[nameC].add_core(mydll,mat,i)
        else:
            raise NameError('Not possible to open file. TranCad returned '+ str(tc_value))
        mydll.MATRIX_CloseFile(filename)
        mydll.MATRIX_Done(mat)
if __name__ == '__main__':
    main()
When I run the test code in ANY form (double-clicking, Python's IDLE or PyScripter) it crashes with the familiar error "WindowsError: exception: access violation writing 0x0000000000000246", but when I debug the code using PyScripter, stopping in any inner loop, it runs perfectly.
I'd really appreciate any insights.
EDIT
The dumpbin output for the DLL:
File Type: DLL
Section contains the following exports for CaliperMTX.dll
00000000 characteristics
52FB9F15 time date stamp Wed Feb 12 08:19:33 2014
0.00 version
1 ordinal base
81 number of functions
81 number of names
ordinal hint RVA name
1 0 0001E520 InitMatDLL
2 1 0001B140 MATRIX_AddIndex
3 2 0001AEE0 MATRIX_Clear
4 3 0001AE30 MATRIX_CloseFile
5 4 00007600 MATRIX_Copy
6 5 000192A0 MATRIX_CreateCache
7 6 00019160 MATRIX_CreateCacheEx
8 7 0001EB10 MATRIX_CreateSimple
9 8 0001ED20 MATRIX_CreateSimpleLike
10 9 00016D40 MATRIX_DestroyCache
11 A 00016DA0 MATRIX_DisableCache
12 B 0001A880 MATRIX_Done
13 C 0001B790 MATRIX_DropIndex
14 D 00016D70 MATRIX_EnableCache
15 E 00015B10 MATRIX_GetBaseNCols
16 F 00015B00 MATRIX_GetBaseNRows
17 10 00015FF0 MATRIX_GetBaseVector
18 11 00015CE0 MATRIX_GetCore
19 12 000164C0 MATRIX_GetCurrentIndexPos
20 13 00015B20 MATRIX_GetDataType
21 14 00015EE0 MATRIX_GetElement
22 15 00015A30 MATRIX_GetFileName
23 16 00007040 MATRIX_GetIDs
24 17 00015B80 MATRIX_GetInfo
25 18 00015A50 MATRIX_GetLabel
26 19 00015AE0 MATRIX_GetNCols
27 1A 00015AB0 MATRIX_GetNCores
28 1B 00016EC0 MATRIX_GetNIndices
29 1C 00015AC0 MATRIX_GetNRows
30 1D 00018AF0 MATRIX_GetVector
31 1E 00015B40 MATRIX_IsColMajor
32 1F 00015B60 MATRIX_IsFileBased
33 20 000171A0 MATRIX_IsReadOnly
34 21 00015B30 MATRIX_IsSparse
35 22 0001AE10 MATRIX_LoadFromFile
36 23 0001BAE0 MATRIX_New
37 24 00017150 MATRIX_OpenFile
38 25 000192D0 MATRIX_RefreshCache
39 26 00016340 MATRIX_SetBaseVector
40 27 00015C20 MATRIX_SetCore
41 28 00016200 MATRIX_SetElement
42 29 00016700 MATRIX_SetIndex
43 2A 0001AFA0 MATRIX_SetLabel
44 2B 00018E50 MATRIX_SetVector
45 2C 00005DA0 MAT_ACCESS_Create
46 2D 00005E40 MAT_ACCESS_CreateFromCurrency
47 2E 00004B10 MAT_ACCESS_Done
48 2F 00005630 MAT_ACCESS_FillRow
49 30 000056D0 MAT_ACCESS_FillRowDouble
50 31 00005A90 MAT_ACCESS_GetCurrency
51 32 00004C30 MAT_ACCESS_GetDataType
52 33 000058E0 MAT_ACCESS_GetDoubleValue
53 34 00004C40 MAT_ACCESS_GetIDs
54 35 00005AA0 MAT_ACCESS_GetMatrix
55 36 00004C20 MAT_ACCESS_GetNCols
56 37 00004C10 MAT_ACCESS_GetNRows
57 38 000055A0 MAT_ACCESS_GetRowBuffer
58 39 00005570 MAT_ACCESS_GetRowID
59 3A 00005610 MAT_ACCESS_GetToReadFlag
60 3B 00005870 MAT_ACCESS_GetValue
61 3C 00005AB0 MAT_ACCESS_IsValidCurrency
62 3D 000055E0 MAT_ACCESS_SetDirty
63 3E 000059F0 MAT_ACCESS_SetDoubleValue
64 3F 00005620 MAT_ACCESS_SetToReadFlag
65 40 00005960 MAT_ACCESS_SetValue
66 41 00005460 MAT_ACCESS_UseIDs
67 42 00005010 MAT_ACCESS_UseIDsEx
68 43 00005490 MAT_ACCESS_UseOwnIDs
69 44 00004D10 MAT_ACCESS_ValidateIDs
70 45 0001E500 MAT_pafree
71 46 0001E4E0 MAT_palloc
72 47 0001E4F0 MAT_pfree
73 48 0001E510 MAT_prealloc
74 49 00006290 MA_MGR_AddMA
75 4A 00006350 MA_MGR_AddMAs
76 4B 00005F90 MA_MGR_Create
77 4C 00006050 MA_MGR_Done
78 4D 000060D0 MA_MGR_RegisterThreads
79 4E 00006170 MA_MGR_SetRow
80 4F 00006120 MA_MGR_UnregisterThread
81 50 0001E490 UnloadMatDLL
Summary
6000 .data
5000 .pdata
C000 .rdata
1000 .reloc
1000 .rsrc
54000 .text
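One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: ctypes assumes that every foreign function returns a C int unless restype is set, so on 64-bit Python a pointer-sized handle can be truncated before it is passed back to the DLL. The sketch below shows how prototypes could be declared. Every signature in it is an assumption made for illustration (the real ones must come from the DLL's header or vendor documentation), `declare_prototypes` is a hypothetical helper, and a stand-in object replaces the proprietary DLL so the snippet runs anywhere.

```python
import ctypes
from types import SimpleNamespace


def declare_prototypes(dll):
    """Attach return and argument types to a few of the exported functions.

    ASSUMPTION: all signatures below are guesses for illustration only.
    """
    # Assumed: the loader returns an opaque, pointer-sized matrix handle.
    dll.MATRIX_LoadFromFile.restype = ctypes.c_void_p
    dll.MATRIX_LoadFromFile.argtypes = [ctypes.c_char_p, ctypes.c_int]
    # Assumed: the size queries take the handle and return a C long.
    for name in ("MATRIX_GetBaseNRows", "MATRIX_GetBaseNCols"):
        func = getattr(dll, name)
        func.restype = ctypes.c_long
        func.argtypes = [ctypes.c_void_p]
    return dll


# Stand-in object so the sketch runs without the proprietary DLL.
fake_dll = SimpleNamespace(
    MATRIX_LoadFromFile=SimpleNamespace(),
    MATRIX_GetBaseNRows=SimpleNamespace(),
    MATRIX_GetBaseNCols=SimpleNamespace(),
)
declare_prototypes(fake_dll)
```

In the real class, the object returned by `cdll.LoadLibrary(...)` would be passed to the helper right after loading, before any `MATRIX_*` call is made.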

programming challenge help (python)? [duplicate]

I'm trying to solve Project Euler problem 18/67. I have an attempt, but it isn't correct.
tri = '''\
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23'''
sum = 0
spot_index = 0
triarr = list(filter(lambda e: len(e) > 0, [[int(nm) for nm in ln.split()] for ln in tri.split('\n')]))
for i in triarr:
    if len(i) == 1:
        sum += i[0]
    elif len(i) == 2:
        spot_index = i.index(max(i))
        sum += i[spot_index]
    else:
        spot_index = i.index(max(i[spot_index],i[spot_index+1]))
        sum += i[spot_index]
print(sum)
When I run the program, the result is always a little bit off from the correct sum. I'm pretty sure that it's an algorithm problem, but I don't know exactly how to fix it or what the best approach to the original problem might be.

Your algorithm is wrong. Consider what would happen if there were a large number like 1000000 on the bottom row: your algorithm might follow a path that never reaches it.
The question hints that this one can be brute forced, but that there is also a more clever way to solve it.
Somehow your algorithm will need to consider all possible pathways/sums.
The brute force method is to try each and every one from top to bottom.
The clever way uses a technique called dynamic programming.
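To make the brute-force idea concrete, here is a minimal sketch that tries every path. It uses the small four-row example triangle from the Project Euler problem statement rather than the full input, and the function name `brute_force_max` is made up for this illustration.

```python
from itertools import product

# Small example triangle from the Project Euler problem statement.
rows = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]


def brute_force_max(rows):
    """Try every top-to-bottom path and return the largest sum."""

    def path_sum(steps):
        # Each step is 0 (keep the same column) or 1 (move one column right).
        col = 0
        total = rows[0][0]
        for row, step in zip(rows[1:], steps):
            col += step
            total += row[col]
        return total

    # A triangle with n rows has 2 ** (n - 1) distinct paths.
    return max(path_sum(steps) for steps in product((0, 1), repeat=len(rows) - 1))


print(brute_force_max(rows))  # 23
```

The number of paths doubles with every row, so this stays feasible for the 15 rows of problem 18 but not for the 100 rows of problem 67.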

Here's the algorithm. I'll let you figure out a way to code it.
Start with the two bottom rows. At each element of the next-to-bottom row, figure out what the sum will be if you reach that element by adding the maximum of the two elements of the bottom row that correspond to the current element of the next-to-bottom row. For instance, given the sample above, the left-most element of the next-to-bottom row is 63, and if you ever reach that element, you will certainly choose its right child 62. So you can replace the 63 on the next-to-bottom row with 63 + 62 = 125. Do the same for each element of the next-to-bottom row; you will get 125, 164, 102, 95, 112, 123, 165, 128, 166, 109, 122, 147, 100, 54. Now delete the bottom row and repeat on the reduced triangle.
There is also a top-down algorithm that is dual to the one given above. I'll let you figure that out, too.
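For readers who want to check their own attempt afterwards, the bottom-up procedure described above can be sketched as follows. The helper names `collapse` and `max_path_sum` are made up for this illustration; the triangle is the one from the question.

```python
TRIANGLE = """\
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23"""

rows = [[int(n) for n in line.split()] for line in TRIANGLE.splitlines()]


def collapse(row, below):
    """Add to each element of `row` the larger of its two children in `below`."""
    return [value + max(below[i], below[i + 1]) for i, value in enumerate(row)]


def max_path_sum(rows):
    """Fold the triangle from the bottom row upwards until one number is left."""
    best = list(rows[-1])
    for row in reversed(rows[:-1]):
        best = collapse(row, best)
    return best[0]


print(collapse(rows[-2], rows[-1]))  # first step: 125, 164, 102, ...
print(max_path_sum(rows))  # 1074
```

Each row is visited once, so the work grows with the number of elements in the triangle instead of the number of paths, which is what makes the larger problem 67 tractable.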
