I have some JSON data from a phone accelerometer, which looks like this:
{u'timestamps': {u'1524771017235': [[u'x',
u'y',
u'z',
u'rotationX',
u'rotationY',
u'rotationZ'],
[-0.02, 0, 0.04, 102.65, 68.15, 108.61],
[-0.03, 0.02, 0.02, 102.63, 68.2, 108.5],
[-0.05, 0.01, 0.1, 102.6, 68.25, 108.4],
[-0.02, 0, 0.09, 102.6, 68.25, 108.4],
[-0.01, 0, 0.03, 102.6, 68.25, 108.4]]}}
What I want is a DataFrame whose columns are named after the data (x, y, z, rotationX, rotationY, rotationZ), with one row per data entry. The timestamp information can be stored elsewhere.
When I used d = pd.read_json('data.json'), this is what I get:
timestamps
2018-04-26 19:30:17.235 [[x, y, z, rotationX, rotationY, rotationZ], [...
It seems to take the timestamp as the index and put everything else in a single cell.
I don't have much experience with JSON, so I couldn't make much sense of the pandas.read_json API. Please help.
My current workaround is to manually step through the first two dictionaries and create a DataFrame using the first entry as the headers. It works, but it is really not ideal...
dataDf = pd.DataFrame(data = d['timestamps']['1524771017235'][1:], columns = d['timestamps']['1524771017235'][0])
x y z rotationX rotationY rotationZ
0 -0.02 0.00 0.04 102.65 68.15 108.61
1 -0.03 0.02 0.02 102.63 68.20 108.50
2 -0.05 0.01 0.10 102.60 68.25 108.40
Thanks
What you need is access to the key of the inner dictionary {u'1524771017235': [[u'x', ..., which is the value associated with the key timestamps of the dictionary d loaded from your JSON file. Then try:
d['timestamps'].keys()[0]
and it should return your '1524771017235' key, so to create your dataDf just do:
dataDf = pd.DataFrame(data = d['timestamps'][d['timestamps'].keys()[0]][1:],
                      columns = d['timestamps'][d['timestamps'].keys()[0]][0])
and you get the same result.
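Note that .keys()[0] only works on Python 2; on Python 3, dict.keys() returns a view that cannot be indexed. A minimal Python 3 sketch of the same idea, assuming the file on disk is valid JSON (which pd.read_json's success suggests) and is named data.json as in the question:
import json
import pandas as pd

with open('data.json') as f:
    d = json.load(f)

# grab the single timestamp key without hard-coding it
ts_key = next(iter(d['timestamps']))
rows = d['timestamps'][ts_key]

# the first row holds the column names, the rest are the measurements
dataDf = pd.DataFrame(rows[1:], columns=rows[0])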
I have a numpy array that contains the coordinates (X,Y,Z) of 5 points:
Coordinates = np.array([[1000, 1000,10],[1003, 1003,10],[1004, 1004,10],[1002, 1002,10],[1001, 1001,10]])
On the other hand, I have a Pandas dataframe that contains the value of a variable for each of these 5 points:
d = {"Values": [0.25, 0.24,0.23,0.3,0.22]}
df = pd.DataFrame(data=d)
With a BallTree I get the indices of the neighbors of each point within a radius of 2 m:
treeBall_Neighbors = sklearn.neighbors.BallTree(Coordinates, leaf_size=2)
indices_Neighbors=treeBall_Neighbors.query_radius(Coordinates[:], r=2)
And finally I want to add the mean value of the neighbors of each point into the dataframe:
df["Neighbors_Values"]=df["Values"].iloc[indices_Neighbors.tolist()[:]].mean()
But sadly I'm getting the error "ValueError: setting an array element with a sequence". The only partial solution I found works only for the first row:
df["Neighbors_Values"]=df["Values"].iloc[indices_Neighbors.tolist()[0]].mean()
Do you have any idea how I can obtain the other values without writing a loop? The final result should look like this:
Values Neighbors_Values
0 0.25 0.235
1 0.24 0.256667
2 0.23 0.235
3 0.30 0.253333
4 0.22 0.256667
Finally I resolved the problem with the following code (using a function and lambda):
def Obtain_Mean(x, df):
    # mean of the Values at the neighbour indices stored in this row
    return df.Values.iloc[x.indices_Neighbors].mean()
Coordinates = np.array([[1000, 1000,10],[1003, 1003,10],[1004, 1004,10],[1002, 1002,10],[1001, 1001,10]])
d = {"Values": [0.25, 0.24,0.23,0.3,0.22]}
df = pd.DataFrame(data=d)
treeBall_Neighbors = sklearn.neighbors.BallTree(Coordinates, leaf_size=2)
indices_Neighbors=treeBall_Neighbors.query_radius(Coordinates[:], r=2)
df["indices_Neighbors"]=indices_Neighbors
df['Mean_Neighbors'] = df.apply(lambda x: Obtain_Mean(x, df) , axis=1)
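For what it's worth, apply also iterates row by row under the hood, so an equivalent one-liner is a list comprehension over the neighbour index arrays returned by query_radius; it gives the same numbers without the helper column or function:
df["Mean_Neighbors"] = [df["Values"].iloc[idx].mean() for idx in indices_Neighbors]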
I'm trying to encode genomes from strings stored in a dataframe to an array of corresponding numerical values.
Here is some of my dataframe (for some reason it only shows 2 of the 5 columns):
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
So I need to split these strings character by character and map each character to a float. This is the lookup table I was using:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
I tried to apply this directly using:
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
But I have too much data to fit into memory, so I'm trying to process it in chunks, and I'm having trouble defining a preprocessing function.
Here's my code so far:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath, chunksize=10)
chunk_list = []
def preprocess(chunk):
    chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return;
for chunk in dataframe:
    chunk_filter = preprocess(chunk)
    chunk_list.append(chunk_filter)
dataframe1 = pd.concat(chunk_list)
print(dataframe1)
Thanks in advance!
You have chunk_filter = preprocess(chunk), but your preprocess() function returns nothing, so chunk_filter is always None. Modify your preprocess function to store the result of the apply() call and then return that value. For example:
def preprocess(chunk):
    processed_chunk = chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return processed_chunk
By doing this, you actually return the data from the preprocess function so that it can be appended to the chunk list. As you have it currently, the preprocess function works correctly but essentially discards the results.
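One more thing to watch for: because of the .values, each chunk comes back as a NumPy array, and pd.concat only accepts pandas objects, so the final pd.concat(chunk_list) will fail. A minimal sketch of the full loop, assuming you drop .values so each processed chunk stays a DataFrame (path and chunksize copied from the question):
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}

def preprocess(chunk):
    # one row of floats per genome string; characters not in the lookup become 0.0
    return chunk['Genome'].apply(
        lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()]))

chunk_list = []
for chunk in pd.read_csv('C:\\Users\\CAAVR\\Desktop\\Ison.csv', chunksize=10):
    chunk_list.append(preprocess(chunk))

dataframe1 = pd.concat(chunk_list)
If you really need a plain NumPy array at the end, call dataframe1.values once after the concat.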
I'm calculating the frequency of words in many text files (140 docs). The end goal is to create a CSV file where I can sort the frequency of every word per doc and across all docs.
Let say I have:
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
...
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
So, what I need is a CSV file to export to Excel that looks like this:
WORD, ABS_FREQ, DOC_1_FREQ, DOC_2_FREQ, ..., DOC_140_FREQ
hello, 0.001, 0.8, 0.2, 0.1
world, 0.002, 0.9, 0.3, 0.5
baby, 0.005, 0.7, 0.6, 0.9
How can I do it with Python?
You could also convert it to a pandas DataFrame and save it as a CSV file, or continue your analysis in a clean format.
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
all = [absolut_freq, doc_1, doc_2, doc_140]
# if you have a bunch of docs, you could use enumerate and then format the colname as you iterate over and create the dataframe
colnames = ['AbsoluteFreq', 'Doc1', 'Doc2', 'Doc140']
import pandas as pd
masterdf = pd.DataFrame()
for i in all:
    df = pd.DataFrame([i]).T
    masterdf = pd.concat([masterdf, df], axis=1)
# assign the column names
masterdf.columns = colnames
# get a glimpse of what the data frame looks like
masterdf.head()
# save to csv
masterdf.to_csv('docmatrix.csv', index=True)
# and to sort the dataframe by frequency
masterdf.sort_values('AbsoluteFreq')
You can make it a mostly data-driven process, given only the variable names of all the dictionaries, by first creating a table with all the data listed in it and then using the csv module to write a transposed version of it (rows and columns swapped) to the output file.
import csv
absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 ={u'hello': 0.1, u'world': 0.5, u'baby': 0.9}
dic_names = ('absolut_freq', 'doc_1', 'doc_2', 'doc_140') # dict variable names
namespace = globals()
words = namespace[dic_names[0]].keys() # assume dicts all contain the same words
table = [['WORD'] + list(words)] # header row (becomes first column of output)
for dic_name in dic_names:  # add values from each dictionary given its name
    table.append([dic_name.upper()+'_FREQ'] + list(namespace[dic_name].values()))
# Use open('merged_dicts.csv', 'wb') for Python 2.
with open('merged_dicts.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(zip(*table))
print('done')
CSV file produced:
WORD,ABSOLUT_FREQ_FREQ,DOC_1_FREQ,DOC_2_FREQ,DOC_140_FREQ
world,0.002,0.9,0.3,0.5
baby,0.005,0.7,0.6,0.9
hello,0.001,0.8,0.2,0.1
No matter how you want to write this data, first you need an ordered data structure, for example a 2D list:
docs = []
docs.append( {u'hello':0.001, u'world':0.002, u'baby':0.005} )
docs.append( {u'hello':0.8, u'world':0.9, u'baby':0.7} )
docs.append( {u'hello':0.2, u'world':0.3, u'baby':0.6} )
docs.append( {u'hello':0.1, u'world':0.5, u'baby':0.9} )
words = docs[0].keys()
result = [ [word] + [ doc[word] for doc in docs ] for word in words ]
then you can use the built-in csv module: https://docs.python.org/2/library/csv.html
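To finish that example, here is a minimal sketch of the csv step, assuming the docs list and result built above; the output filename is just a placeholder, and the DOC_i numbering assumes the docs were appended in order:
import csv

# one column per doc plus the word and the absolute frequency
header = ['WORD', 'ABS_FREQ'] + ['DOC_%d_FREQ' % i for i in range(1, len(docs))]
with open('word_freq.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(result)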
My function outputs a list, for instance when I type:
My_function('TV', 'TV_Screen')
it outputs the following:
['TV', 1, 'TV_Screen', 0.04, 'True']
Now, my TV is made of several parts, such as the speaker, transformer, etc. I can keep running my function for each part, changing 'TV_Screen' to 'TV_Speaker', 'TV_transformer', and so on.
The alternative is to create a list with all the parts, such as:
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
What I am trying to get is a pandas DataFrame with 5 columns (because my function outputs 5 values, see "it outputs the following" above) and, in this case, 3 rows (one each for 'TV_Screen', 'TV_Speaker', and 'TV_transformer'). Basically, I want the following to be in a DataFrame:
['TV', 1, 'TV_Screen', 0.04, 'True']
['TV', 9, 'TV_Speaker', 0.56, 'True']
['TV', 3, 'TV_transformer', 0.80, 'False']
I know I need a for loop somewhere, but I am not sure how to create this data frame. Could you please help? (I can change the output of my function to be a pd.Series or something else that would work better).
Thanks!
If you have many arrays, it may be worth converting them into a numpy matrix first and then converting them into a dataframe.
import pandas as pd
import numpy as np
a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']
matrix = np.matrix([a,b,c])
df = pd.DataFrame(data=matrix)
You can do it like this:
def My_function(part):
    # prepare result
    result = ['TV', 1, part, 0.04, 'True']  # for testing
    return result
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
df = pd.DataFrame([My_function(part) for part in TV_parts])
>>> df
0 1 2 3 4
0 TV 1 TV_Screen 0.04 True
1 TV 1 TV_Speaker 0.04 True
2 TV 1 TV_transformer 0.04 True
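If you want named columns instead of 0-4, pass them to the constructor. The question does not say what the five values mean, so the names below are just placeholders:
df = pd.DataFrame([My_function(part) for part in TV_parts],
                  columns=['device', 'code', 'part', 'value', 'flag'])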
I have a file in following format:
10000
2
2
2
2
0.00
0.00
0 1
0.00
0.01
0 1
...
I want to create a dataframe from this file (skipping the first 5 lines) like this:
x1 x2 y1 y2
0.00 0.00 0 1
0.00 0.01 0 1
So the lines are converted to columns (where every third line is also split into two columns, y1 and y2).
In R I did this as follows:
df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))
I am looking for a python alternative (pandas?) to this scan(file, what=list(...)) function.
Does it exist or do I have to write a more extended script?
You can skip the first 5 lines and then take groups of 4 to build a Python list, then put that into pandas as a start... I wouldn't be surprised if pandas offered something better, though:
from itertools import islice, izip_longest

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use izip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in izip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print map(float, lines[:2]) + map(int, lines[2].split())
        #[0.0, 0.0, 0, 1]
        #[0.0, 0.01, 0, 1]
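That snippet is Python 2; on Python 3 the same sketch needs zip_longest instead of izip_longest, print as a function, and list() around the maps, but nothing else changes:
from itertools import islice, zip_longest

with open('input') as fin:
    after5 = islice(fin, 5, None)
    for lines in zip_longest(*[iter(after5)] * 4):
        # first two lines are floats, the third line holds the two integers
        print(list(map(float, lines[:2])) + list(map(int, lines[2].split())))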
As far as I know, there is no option at http://pandas.pydata.org/pandas-docs/stable/io.html to organize your DataFrame the way you want, but you can achieve it easily:
lines = open('YourDataFile.txt').read() # read the whole file
import re # import re
elems = re.split('\n| ', lines)[5:] # split each element and exclude the first 5
grouped = zip(*[iter(elems)]*4) # group them 4 by 4
import pandas as pd # import pandas
df = pd.DataFrame(grouped) # construct DataFrame
df.columns = ['x1', 'x2', 'y1', 'y2'] # columns names
It's not concise, it's not elegant, but it's clear what it does...
OK, here's how I did it (it is in fact a combo of Jon's & Giupo's answers, thanks guys!):
with open('myfile.txt') as file:
    data = file.read().split()[5:]
grouped = zip(*[iter(data)]*4)
import pandas as pd
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
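For reference, a slightly more numeric variant of the same idea (not from the original answers): read every whitespace-separated token, skip the 5 header values, and reshape, which is probably the closest analogue to R's scan(). It keeps all columns as float, so cast y1/y2 back to int if you need integers:
import numpy as np
import pandas as pd

with open('myfile.txt') as f:
    tokens = f.read().split()[5:]   # every value as a string, header skipped

arr = np.array(tokens, dtype=float).reshape(-1, 4)   # x1, x2, y1, y2 per observation
df = pd.DataFrame(arr, columns=['x1', 'x2', 'y1', 'y2'])
df[['y1', 'y2']] = df[['y1', 'y2']].astype(int)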