I printed out the composed array and saved it to a text file; it looks like this:
({
ngram_a67e6f3205f0-n: 1,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)
({
ngram_a67e6f3205f0-n: 2,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8880895806519427)
({
ngram_a67e6f3205f0-n: 3,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8656452460818544)
I want to extract the data to produce a pandas DataFrame, like:
1, 10000, 0.8580469779197205
2, 10000, 0.8880895806519427
My advice is to change the input format of your file, if possible. It would greatly simplify your life. If this is not possible, the following code solves your problem:
import pandas as pd
import re

# capture the contents of each (...) tuple
pattern_tuples = r'(?<=\()[^\)]*'
# capture numbers preceded by a space or a comma
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'

col_name = ['ngram', 'logreg', 'vocabSize', 'score']

with open('test.txt', 'r') as f:
    matches = re.findall(pattern_tuples, f.read())

arr_data = [[float(val.replace(',', '')) for val in re.findall(pattern_numbers, match)]
            for match in matches]
df = pd.DataFrame(arr_data, columns=col_name).astype({'ngram': 'int', 'vocabSize': 'int'})
and gives:
ngram logreg vocabSize score
0 1 0.01 10000 0.858047
1 2 0.01 10000 0.888090
2 3 0.01 10000 0.865645
Brief explanation
Read the file
Using re.findall and the regex pattern_tuples, find all the tuples in the file
For each tuple, use the regex pattern_numbers to find the 4 numerical values that interest you (a sample match is shown below); in this way you get a list of lists containing your data
Load the results into a pandas DataFrame
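For illustration, here is what pattern_numbers pulls out of the first tuple; the leading space or comma on each match is removed by the replace call and by float:
print(re.findall(pattern_numbers, matches[0]))
# [' 1', ' 0.01', ' 10000', ',0.8580469779197205']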
Extra
Here's how you could save your CV results in JSON format, so you can manage them more easily:
Create a cv_results list to keep the CV results
Each CV loop produces a tuple t with the results; turn it into a dictionary and append it to cv_results
At the end of the CV loops, save the results in JSON format
import json

cv_results = []
for _ in range_cv:  # CV loop
    # ... calculate the results of this CV iteration in t
    t = ({'ngram_a67e6f3205f0-n': 1,
          'logreg_c120232d9faa-regParam': 0.01,
          'cntVec_9c0e7831261d-vocabSize': 10000},
         0.8580469779197205)  # FAKE DATA for this example
    # append the results as a dict
    cv_results.append({'res': t[0], 'score': t[1]})

# store the results in JSON format
with open('cv_results.json', 'w') as outfile:
    json.dump(cv_results, outfile, indent=4)
Now you can read the JSON file back and access all the fields like a normal Python dictionary:
with open('cv_results.json') as json_file:
    data = json.load(json_file)

data[0]['score']
# output: 0.8580469779197205
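And, tying the Extra back to the original question, a minimal sketch of rebuilding the target DataFrame from the saved JSON (the parameter keys are the ones from the sample above):
import pandas as pd

rows = [{'ngram': d['res']['ngram_a67e6f3205f0-n'],
         'vocabSize': d['res']['cntVec_9c0e7831261d-vocabSize'],
         'score': d['score']}
        for d in data]
df = pd.DataFrame(rows)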
Why not do:
import pandas as pd
with open('file.txt') as file:
    df = pd.DataFrame([i for i in eval(file.readline())])
eval takes a string and evaluates it as a Python expression, which is pretty nifty. That would convert each parenthetical to a tuple, which is then stored in a list. The pandas DataFrame class can take a list of dictionaries with identical keys and create a DataFrame. Be aware, though, that this only works if each record is a valid Python literal on a single line; in the file shown above the records span several lines and the dictionary keys are unquoted, so eval would fail on it as-is.
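As a side note, eval will happily execute arbitrary code, so if you control the format, ast.literal_eval is a safer variant of the same idea: it only accepts Python literals. The quoted-key line below is a hypothetical example of a format it can parse:
import ast

line = "({'ngram': 1, 'regParam': 0.01, 'vocabSize': 10000}, 0.8580469779197205)"
params, score = ast.literal_eval(line)  # params is a dict, score is a float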
I have a JSON file which resulted from YouTube's iframe API and I want to put this JSON data into a pandas dataframe, where each JSON key will be a column, and each record should be a new row.
Normally I would use a loop and iterate over the rows of the JSON but this particular JSON looks like this :
[
"{\"timemillis\":1563467467703,\"date\":\"18.7.2019\",\"time\":\"18:31:07,703\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:02\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.3,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}",
"{\"timemillis\":1563467468705,\"date\":\"18.7.2019\",\"time\":\"18:31:08,705\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:03\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.5,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}"
]
In this JSON not every key is written as a new line. How can I extract the keys in this case, and express them as columns?
A Pythonic solution would be to use the keys and values API of the Python dictionary.
It should be something like this:
ls = [
"{\"timemillis\":1563467467703,\"date\":\"18.7.2019\",\"time\":\"18:31:07,703\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:02\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.3,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}",
"{\"timemillis\":1563467468705,\"date\":\"18.7.2019\",\"time\":\"18:31:08,705\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:03\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.5,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}"
]
import json

ls = [json.loads(j) for j in ls]
keys = [j.keys() for j in ls]    # this will get you all the keys
vals = [j.values() for j in ls]  # this will get the values, which you can then work with
print(keys)
print(vals)
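Since the end goal is a DataFrame, note that once the strings are parsed you can hand the list of dicts straight to pandas; a minimal sketch:
import pandas as pd

df = pd.DataFrame(ls)  # one row per record, one column per key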
The easiest way is to leverage json_normalize from pandas.
import json
from pandas.io.json import json_normalize
input_dict = [
"{\"timemillis\":1563467467703,\"date\":\"18.7.2019\",\"time\":\"18:31:07,703\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:02\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.3,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}",
"{\"timemillis\":1563467468705,\"date\":\"18.7.2019\",\"time\":\"18:31:08,705\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:03\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.5,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}"
]
input_json = [json.loads(j) for j in input_dict]
df = json_normalize(input_json)
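One caveat: in pandas 1.0 and later the pandas.io.json import path is deprecated, and the same function is exposed at the top level:
import pandas as pd

df = pd.json_normalize(input_json)  # pandas >= 1.0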
I think you are asking to break down your keys and values, with the keys as columns and the values as a row.
This is my approach; please always include what your expected output should look like.
ChainMap flattens your dicts into keys and values and is pretty much self-explanatory.
data = ["{\"timemillis\":1563467467703,\"date\":\"18.7.2019\",\"time\":\"18:31:07,703\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:02\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.3,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}","{\"timemillis\":1563467468705,\"date\":\"18.7.2019\",\"time\":\"18:31:08,705\",\"videoId\":\"0HJx2JhQKQk\",\"startSecond\":\"0\",\"stopSecond\":\"90\",\"playerStateNumeric\":1,\"playerStateVerbose\":\"Playing\",\"curTimeFormatted\":\"0:03\",\"totalTimeFormatted\":\"9:46\",\"playoutLevelPercent\":0.5,\"bufferLevelPercent\":1.4,\"qual\":\"large\",\"qualLevels\":[\"hd720\",\"large\",\"medium\",\"small\",\"tiny\",\"auto\"],\"playbackRate\":1,\"playbackRates\":[0.25,0.5,0.75,1,1.25,1.5,1.75,2],\"playerErrorNumeric\":\"\",\"playerErrorVerbose\":\"\"}"]
import json
from collections import ChainMap

import pandas as pd

data = [json.loads(i) for i in data]
data = dict(ChainMap(*data))

keys = []
vals = []
for k, v in data.items():
    keys.append(k)
    vals.append(v)

data = pd.DataFrame(zip(keys, vals)).T
new_header = data.iloc[0]
data = data[1:]
data.columns = new_header
#startSecond playbackRates playbackRate qual totalTimeFormatted timemillis playerStateNumeric playerStateVerbose playerErrorNumeric date time stopSecond bufferLevelPercent playerErrorVerbose qualLevels videoId curTimeFormatted playoutLevelPercent
#0 [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2] 1 large 9:46 1563467467703 1 Playing 18.7.2019 18:31:07,703 90 1.4 [hd720, large, medium, small, tiny, auto] 0HJx2JhQKQk 0:02 0.3
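One caveat about this approach: dict(ChainMap(*data)) keeps only one value per key (the first mapping wins), so the two records collapse into a single row. If you want one row per record instead, the parsed list can go straight into pandas; a sketch, where raw_strings stands for the original list of JSON strings:
records = [json.loads(i) for i in raw_strings]
df = pd.DataFrame(records)  # one row per record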
I'm trying to make a pandas dataframe from a .npy file which, when read in using np.load, returns a numpy array containing a dictionary. My initial instinct was to extract the dictionary and then create a dataframe using pd.from_dict, but this fails every time because I can't seem to get the dictionary out of the array returned from np.load. It looks like it's just np.array([dictionary, dtype=object]), but I can't get the dictionary by indexing the array or anything like that. I've also tried using np.load('filename').item() but the result still isn't recognized by pandas as a dictionary.
Alternatively, I tried pd.read_pickle and that didn't work either.
How can I get this .npy dictionary into my dataframe? Here's the code that keeps failing...
import pandas as pd
import numpy as np
import os

targetdir = '../test_dir/'

filenames = []
successful = []
unsuccessful = []

for dirs, subdirs, files in os.walk(targetdir):
    for name in files:
        filenames.append(name)
        path_to_use = os.path.join(dirs, name)
        if path_to_use.endswith('.npy'):
            try:
                file_dict = np.load(path_to_use).item()
                df = pd.from_dict(file_dict)
                #df = pd.read_pickle(path_to_use)
                successful.append(path_to_use)
            except:
                unsuccessful.append(path_to_use)
                continue

print str(len(successful)) + " files were loaded successfully!"
print "The following files were not loaded:"
for item in unsuccessful:
    print item + "\n"

print df
Let's assume that once you load the .npy file, the item (np.load(path_to_use).item()) looks similar to this:
{'user_c': 'id_003', 'user_a': 'id_001', 'user_b': 'id_002'}
So, if you need to come up with a DataFrame like the one below using the above dictionary:
user_name user_id
0 user_c id_003
1 user_a id_001
2 user_b id_002
You can use:
df = pd.DataFrame(list(x.item().items()), columns=['user_name', 'user_id'])  # x = np.load(path_to_use)
If you have a list of dictionaries like the one below:
users = [{'u_name': 'user_a', 'u_id': 'id_001'}, {'u_name': 'user_b', 'u_id': 'id_002'}]
You can simply use:
df = pd.DataFrame(users)
to come up with a DataFrame similar to:
u_id u_name
0 id_001 user_a
1 id_002 user_b
It seems like you have a dictionary similar to this:
data = {
    'Center': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    'Vpeak': [1.1, 2.2],
    'ID': ['id_001', 'id_002']
}
In this case, you can simply use:
df = pd.DataFrame(data)  # df = pd.DataFrame(file_dict) in your case
to come up with a DataFrame similar to:
Center ID Vpeak
0 [0.1, 0.2, 0.3] id_001 1.1
1 [0.4, 0.5, 0.6] id_002 2.2
If you have an ndarray within the dict, do some preprocessing similar to the below, and then use the result to create the df:
for key in data:
    if isinstance(data[key], np.ndarray):
        data[key] = data[key].tolist()

df = pd.DataFrame(data)
If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc., with the same code appearing on multiple rows with different values (0, 1, or ?, meaning not known).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary?
I want to use a dictionary for the columns (204,700 + 1 of them).
Is there a built-in function (or package) that returns a pattern to me?
(I expect a percentage pattern; I mean something like 90% of 1s in column 1, 80% of 1s in column 2.)
Alright, so I am going to assume you want this in a dictionary for storage purposes, but I will tell you that you don't want that with this kind of data; use a pandas DataFrame.
This is how you get your data into a DataFrame:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple function for it:
def one_percentage(data):
    # get the total number of rows for calculating percentages
    size = len(data)
    # get the dtype of a data column so we only grab the correct columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and the number of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary with column names and fractions of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now, if you want to get the fraction of 1s in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now, if you want to have it do every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
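As an aside, since the value columns are 0/1, pandas can compute the same percentages directly; a minimal sketch, assuming the data columns load as numeric (rows with ? would need cleaning first):
percent_ones = df.select_dtypes(include='number').mean() * 100
print(percent_ones)  # percentage of 1s per numeric column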
let me know if you need anything else!
EDIT: See end of my post for working code, obtained from zeekay here.
I have a CSV file with two columns (voltage and current). Because the voltage is recorded to many significant digits and the current only has 2, there are many identical current values as the value of the voltage changes. This isn't important to the programming but I'm just explaining how the data is physically obtained. I want to perform the following action:
For as long as the value of the second column (current) does not change, collect the values of the first column (voltage) into a list and average them. Then write a row into a new CSV file which is this averaged value of the voltage in the first column and the constant current value which did not change in the second column. In other words, if there are 20 rows for which the current did not change (say it is 6 uA), the 20 corresponding voltage values are averaged (say this average comes out to be 600 mV) and a row is generated in a new csv file which reads ('0.6','0.000006'). Then I want to continue iterating through the csv which is being read, repeating the above procedure for each set of fixed currents.
I've got the following code so far, but I'm not sure if I'm on the right track:
import sys, csv

with open('filetowriteto.csv', 'w') as avg:
    loadeddata = open('filetoreadfrom.csv', 'r')
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]
    for row in readloaded:
        newcurrent = row[1]
        biaslist = []
        if newcurrent == oldcurrent:
            biaslist.append(row[0])
        else:
            biasavg = float(sum(biaslist)) / len(biaslist)
            writer.writerow([biasavg, newcurrent])
            newcurrent = row[1]
and then I'm not sure where to go.
Edit: It seems that zeekay is on the right track for what I want to do. I'm trying to implement his itertools.groupby() method but I'm currently getting a blank file generated. Here's my new code so far:
import sys, csv, itertools

with open('VI_avg(12).csv', 'w') as avg:  # this is the file which gets written
    loadeddata = open('VI(12).csv', 'r')  # this is the file which is read
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]  # looks like this is no longer required
    for current, row in itertools.groupby(readloaded, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        # write it out
        writer.writerow(biasavg, current)
Suppose the CSV file being opened is something like this (shortened example):
0.595417,0.000065
0.595177,0.000065
0.594937,0.000065
0.594697,0.000065
0.594457,0.000065
0.594217,0.000065
0.593977,0.000065
0.593737,0.000065
0.593497,0.000064
0.593017,0.000064
0.592777,0.000064
0.592537,0.000064
0.592297,0.000064
0.587018,0.000064
0.586778,0.000064
0.586538,0.000063
0.586299,0.000063
0.586059,0.000063
0.585579,0.000063
0.585339,0.000063
0.585099,0.000063
0.584859,0.000063
0.584619,0.000063
0.584379,0.000063
0.584139,0.000063
0.583899,0.000063
0.583659,0.000063
Final update: Here's the working version, obtained from zeekay:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv', 'w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        writer.writerow([biasavg, current])
You can use itertools.groupby to group results as you read through the csv, which would simplify things a lot. Given your updated example:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv', 'w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        writer.writerow([biasavg, current])
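A detail worth spelling out: itertools.groupby only groups consecutive equal keys, which is exactly the behaviour the question asks for, since each run of a constant current becomes one averaged row. A tiny illustration:
import itertools

runs = [(key, list(group)) for key, group in itertools.groupby('aabba')]
print(runs)  # [('a', ['a', 'a']), ('b', ['b', 'b']), ('a', ['a'])]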
Maybe you can try using pandas:
import pandas

voltage = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
current = [1.0, 1.0, 1.1, 1.3, 1.2, 1.3]
df = pandas.DataFrame({'voltage': voltage, 'current': current})
result = df.groupby('current').mean()

# Output:
#          voltage
# current
# 1.0         1.15
# 1.1         1.30
# 1.2         2.20
# 1.3         2.20

result.to_csv('grouped_data.csv')
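Note that, unlike itertools.groupby, DataFrame.groupby pools every row with the same current regardless of position. If you need the consecutive-run behaviour from the question, one common idiom is to group on a run counter built with shift and cumsum; a sketch:
runs = (df['current'] != df['current'].shift()).cumsum()  # new id at each change of current
result = df.groupby([runs, df['current']])['voltage'].mean()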
One way:
curDict = {}
for row in readloaded:  # iterate over the csv rows
    if row[1] not in curDict:  # if not already there, create a key/value pair
        curDict[row[1]] = [float(row[0])]
    else:  # already exists, append to the key/value pair
        curDict[row[1]].append(float(row[0]))

# You'll end up with:
# {'0.000065': [0.595417, 0.595177, ...], ...}

# write the rows
for k, v in curDict.items():
    avgValue = sum(v) / len(v)  # calculate the avg of the voltages
    writer.writerow([k, avgValue])
This version will do what you describe, but it will average all values with the same current, regardless of whether they are consecutive or not. Apologies if that's not what you want, but maybe it can help you along the way:
import csv
from collections import defaultdict
from functools import reduce  # reduce is not a builtin in Python 3

def f(acc, row):
    acc[row[1]].append(float(row[0]))
    return acc

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    data = open('in.csv', 'r')
    r = csv.reader(data)
    reduced = reduce(f, r, defaultdict(list))
    for v, c in reduced.items():
        writer.writerow([v, sum(c) / len(c)])
Yet another way using some very small test data (haven't included the csv stuff as you appear to have a handle on that):
#!/usr/bin/python3

test_data = [  # Only 3 currents in the test data:
    (0.00030, 5),  # 5 : one entry, total 0.00030 - so should give 0.00030 as the average
    (0.00012, 6),  # 6 : two entries, total 0.00048 - so should give 0.00024 as the average
    (0.00036, 6),
    (0.00001, 7),  # 7 : four entries, total 0.00010 - so should give 0.000025 as the average
    (0.00001, 7),
    (0.00001, 7),
    (0.00007, 7)]

currents = dict()

for row in test_data:
    if not row[1] in currents:
        matching_currents = list(each[0] for each in test_data if each[1] == row[1])
        current_average = sum(matching_currents) / len(matching_currents)
        currents[row[1]] = current_average

print("There were {0} unique currents found:\n".format(len(currents)))

for current, bias in currents.items():
    print("Current: {0:2d} ( Average: {1:1.5f} )".format(current, bias))