I printed out the composed array and saved it to a text file. It looks like this:
({
ngram_a67e6f3205f0-n: 1,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)
({
ngram_a67e6f3205f0-n: 2,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8880895806519427)
({
ngram_a67e6f3205f0-n: 3,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8656452460818544)
I want to extract the data to produce a Python DataFrame, like this:
1, 10000, 0.8580469779197205
2, 10000, 0.8880895806519427
My advice is to change the input format of your file, if possible. It would greatly simplify your life. If this is not possible, the following code solves your problem:
import pandas as pd
import re

pattern_tuples = r'(?<=\()[^\)]*'
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'
col_name = ['ngram', 'logreg', 'vocabSize', 'score']

with open('test.txt', 'r') as f:
    matches = re.findall(pattern_tuples, f.read())

arr_data = [[float(val.replace(',', '')) for val in re.findall(pattern_numbers, match)]
            for match in matches]
df = pd.DataFrame(arr_data, columns=col_name).astype({'ngram': 'int', 'vocabSize': 'int'})
and gives:
ngram logreg vocabSize score
0 1 0.01 10000 0.858047
1 2 0.01 10000 0.888090
2 3 0.01 10000 0.865645
Brief explanation
Read the file
Use re.findall with the regex pattern_tuples to find all the tuples in the file
For each tuple, use the regex pattern_numbers to find the 4 numerical values that interest you; this gives you a list of lists containing your data (see the quick check below)
Load the results into a pandas DataFrame
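For example, here is what the two patterns return on a single tuple copied from the question; running this quick check before processing the whole file makes the regexes easier to follow:
import re

sample = """({
ngram_a67e6f3205f0-n: 1,
logreg_c120232d9faa-regParam: 0.01,
cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)"""

# one match per "( ... )" block: everything between the parentheses
print(re.findall(r'(?<=\()[^\)]*', sample))

# inside that block: the four numbers, each still carrying its leading space or comma,
# which is why the code above strips the comma before calling float()
print(re.findall(r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?', sample))
# [' 1', ' 0.01', ' 10000', ',0.8580469779197205']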
Extra
Here's how you could save your CV results in JSON format, so you can manage them more easily:
Create a cv_results list to hold the CV results
In each CV loop you get a tuple t with the results; turn it into a dictionary and append it to cv_results
At the end of the CV loops, save the results in JSON format
import json

cv_results = []
for _ in range_cv:  # CV loop
    # ... calculate the results of this CV fold in t
    t = ({'ngram_a67e6f3205f0-n': 1,
          'logreg_c120232d9faa-regParam': 0.01,
          'cntVec_9c0e7831261d-vocabSize': 10000},
         0.8580469779197205)  # FAKE DATA for this example
    # append the results as a dict
    cv_results.append({'res': t[0], 'score': t[1]})

# Store the results in JSON format
with open('cv_results.json', 'w') as outfile:
    json.dump(cv_results, outfile, indent=4)
Now you can read the JSON file back and access all the fields like a normal Python dictionary:
with open('cv_results.json') as json_file:
    data = json.load(json_file)

data[0]['score']
# output: 0.8580469779197205
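Since the original goal was a DataFrame, here is a minimal sketch of loading those JSON results straight into pandas (assuming pandas 0.25+ for the top-level pd.json_normalize, which flattens the nested res dict into dotted column names):
import json
import pandas as pd

with open('cv_results.json') as json_file:
    data = json.load(json_file)

# one row per CV result; nested keys become columns like "res.ngram_a67e6f3205f0-n"
df = pd.json_normalize(data)
print(df.columns.tolist())
print(df['score'])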
Why not do:
import pandas as pd

with open('file.txt') as file:
    df = pd.DataFrame([i for i in eval(file.readline())])
eval takes a string and evaluates it as the Python expression it represents, which is pretty nifty. That converts each parenthetical to a single-item iterable, which is then stored in a list. The pandas DataFrame class can take a list of dictionaries with identical keys and create a DataFrame.
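A more cautious variant of the same idea is ast.literal_eval, which only accepts Python literals. This is just a sketch under a strong assumption: it works only if each tuple sits on a single line and its dictionary keys are quoted, which is not how the file in the question was printed.
import ast
import pandas as pd

# Assumes every non-empty line of file.txt looks like
# ({'ngram_a67e6f3205f0-n': 1, 'logreg_c120232d9faa-regParam': 0.01, ...}, 0.8580469779197205)
with open('file.txt') as fh:
    rows = [ast.literal_eval(line) for line in fh if line.strip()]

df = pd.DataFrame([params for params, score in rows])
df['score'] = [score for params, score in rows]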
Related
I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final data frame effectively has 2.5 million rows and is taking a lot of time to execute.
Is there any way I can optimize this code so that it runs faster?
The current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these, so I am hoping for a faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd

#%%
f = h5py.File('mer.h5', 'r')

for key in f.keys():
    #print(key) # Names of the root-level objects in the HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())

# Get the HDF5 group; key needs to be a group name from above
key = 'DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)

#for key in ls:
    #data = f.get(key)
    #dataset1 = np.array(data)
    #length = len(dataset1)

masterdf = pd.DataFrame()

data = f.get(key)
dataset1 = np.array(data)
#masterdf[key] = dataset1

X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)

#%%
data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")

#%%
final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])):  # X and Y ranges are not the same
        final.loc[k, 'X'] = X_1[0][x]
        final.loc[k, 'Y'] = Y_1[0][y]
        final.loc[k, 'GHI'] = data_df.iloc[y, x]
        k = k + 1
        # print(k)
We can optimize the loops by vectorizing the operations. Vectorized operations are one to two orders of magnitude faster than their pure-Python equivalents (especially in numerical computations). Vectorization is what NumPy gives us: it is a library with efficient data structures designed to hold matrix data.
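As a tiny, self-contained illustration of that difference (hypothetical arrays, not the HDF5 data from the question):
import numpy as np

x = np.arange(1_000_000, dtype=float)

# pure-Python loop: one interpreter-level operation per element
squares_loop = [v * v for v in x]

# vectorized: a single NumPy operation over the whole array, executed in C
squares_vec = x * x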
Could you please try the following (file.h5 is your file):
import pandas as pd
import h5py

with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross product of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
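One caveat: how="cross" requires pandas 1.2 or newer. On older versions, a rough equivalent is a merge on a constant dummy key:
# cross-join fallback for pandas < 1.2: merge on a constant dummy key
final = (
    df_Y.assign(_key=1)
        .merge(df_X.assign(_key=1), on="_key")
        .drop(columns="_key")
        .assign(DHI=DHI)[["X", "Y", "DHI"]]
)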
With the following code, I filtered a dataset and obtained its statistic values using stats.linregress(x, y). I would like to merge the obtained lists into one table and then convert it to CSV. How do I merge the lists? I tried .append(), but then it adds [...] at the end of each list. How do I write these lists to a single CSV? The code below only converts the last list to CSV. Also, where is it appropriate to apply the .2f format to shorten the digits? Many thanks!
for i in df.ingredient.unique():
    mask = df.ingredient == i
    x_data = df.loc[mask]["single"]
    y_data = df.loc[mask]["total"]
    ing = df.loc[mask]["ingredient"]
    res = stats.linregress(x_data, y_data)
    result_list = list(res)
    #sum_table = result_list.append(result_list)
    sum_table = result_list
    np.savetxt("sum_table.csv", sum_table, delimiter=',')
    #print(f"{i} = res")
    #print(f"{i} = result_list")
output:
[0.555725080482033, 15.369647540612188, 0.655901508882146, 0.34409849111785396, 0.45223586826559015, [...]]
[0.8240446598271236, 16.290731244189164, 0.7821893273053173, 0.00012525348188386877, 0.16409500805404134, [...]]
[0.6967783360917531, 25.8981921144781, 0.861561500951743, 0.13843849904825695, 0.29030899523536124, [...]]
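For what it's worth, here is a minimal sketch of one way to collect the per-ingredient results into a single table and write it out once after the loop; it assumes the slope/intercept/rvalue/pvalue/stderr attributes of scipy's linregress result and shortens the digits only in the CSV output:
import pandas as pd
from scipy import stats

rows = []
for i in df.ingredient.unique():
    mask = df.ingredient == i
    res = stats.linregress(df.loc[mask]["single"], df.loc[mask]["total"])
    # linregress returns slope, intercept, rvalue, pvalue and stderr
    rows.append({"ingredient": i, "slope": res.slope, "intercept": res.intercept,
                 "rvalue": res.rvalue, "pvalue": res.pvalue, "stderr": res.stderr})

sum_table = pd.DataFrame(rows)
# keep full precision in memory; shorten the digits only in the CSV output
sum_table.to_csv("sum_table.csv", index=False, float_format="%.2f")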
I have data in triplicate and want to pool all three replicates into one data frame, keeping the position of each value in its row and column. Say, the average of the values in column 2, row 3 of all replicate files should appear in the new data frame at column 2, row 3. A sample of how the data looks and the code that I tried are as follows. Any help is highly appreciated. Thanks
data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename', file, '.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col=0, header=0)
    data[name].columns = pd.to_numeric(data[name].columns)
data['file1_A']
A B
1.8 1.7
1.3 1.3
data['file1_B']
A B
1.7 1.4
1.9 1.7
data['file1_C']
A B
1.2 1.6
2.1 2.9
expected outcome
file1
A B
1.56 1.56
1.76 1.96
i.e.,
A B
(1.8+1.7+1.2)/3 (1.7+1.4+1.6)/3
(1.3+1.9+2.1)/3 (1.3+1.7+2.9)/3
# I usually write the following code for a small number of samples
file1 = (data['file1_A'] + data['file1_B'] + data['file1_C'])/3

# I tried to write a loop for a large number of samples, but it seems like it is not quite right.
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    replicates = {}
    for sample, df in totals.items():
        if f in sample:
            replicates[sample] = df
    final_df = df/3
Working with multiple matrices is a job for numpy! It has a function numpy.mean() which takes the mean (=average) over multiple matrices. The trick is that you have to convert your pandas.DataFrame into a numpy.array and back. Have a look at this example:
import numpy
import pandas
import random
import itertools

# Given that loading the files isn't the problem, I'll create some dummy data here
data = {
    f"file{filenumber}_{filename}": pandas.DataFrame(
        [
            {
                "A": random.random() + random.randint(0, 2),
                "B": random.random() + random.randint(0, 2),
            }
            for _ in range(2)
        ]
    )
    for filenumber, filename in itertools.chain.from_iterable(
        [[(i, l) for l in ["A", "B", "C"]] for i in range(1, 6)]
    )
}

# Loop over the files
for filenumber in range(1, 6):
    print(f"Processing files that start with: file{filenumber}_")
    # Convert all files to numpy arrays
    numpy_arrays = [
        item.to_numpy()
        for name, item in data.items()
        if name.startswith(f"file{filenumber}_")
    ]
    # Use numpy to take the mean of each cell across the frames
    # (the mean is the same as summing and dividing by the number of elements)
    means = numpy.mean(numpy_arrays, axis=0)
    # Convert back to a dataframe
    df = pandas.DataFrame(means, columns=data[f"file{filenumber}_A"].columns)
    # Or in a single line
    df = pandas.DataFrame(
        numpy.mean(
            [item.to_numpy() for name, item in data.items() if name.startswith(f"file{filenumber}_")],
            axis=0,
        ),
        columns=data[f"file{filenumber}_A"].columns,
    )
    print(df)
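For comparison, a pandas-only sketch that skips the round-trip through NumPy, assuming the replicate frames share the same index and columns:
import pandas

# stack the replicates row-wise, then average cell-by-cell via the shared row index
replicates = [df for name, df in data.items() if name.startswith("file1_")]
mean_df = pandas.concat(replicates).groupby(level=0).mean()
print(mean_df)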
It seems the answer is quite easy. Here is a simple loop that worked to get the average matrix of all replicates.
# load all files into an empty dictionary
data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename', file, '.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col=0, header=0)
    data[name].columns = pd.to_numeric(data[name].columns)

# write a loop to get the average matrix of the replicates
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    df = (data[f + 'A'] + data[f + 'B'] + data[f + 'C']) / 3
    totals[f] = df
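If the number of replicates per prefix is not always three, a possible generalization (untested sketch) is to average every frame in data whose name starts with the prefix:
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    # collect every replicate frame whose name starts with this prefix
    replicates = [df for name, df in data.items() if name.startswith(f)]
    totals[f] = sum(replicates) / len(replicates)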
I'm calculating the frequency of words in many text files (140 docs). The end goal of my work is to create a CSV file where I can order the frequency of every word by single doc and by all docs.
Let say I have:
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
...
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
So, what I need is a CSV file to export to Excel that looks like this:
WORD, ABS_FREQ, DOC_1_FREQ, DOC_2_FREQ, ..., DOC_140_FREQ
hello, 0.001, 0.8, 0.2, ..., 0.1
world, 0.002, 0.9, 0.3, ..., 0.5
baby, 0.005, 0.7, 0.6, ..., 0.9
How can I do it with Python?
You could also convert it to a pandas DataFrame and save it as a CSV file, or continue the analysis in a clean format.
import pandas as pd

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 = {u'hello': 0.1, u'world': 0.5, u'baby': 0.9}

all_docs = [absolut_freq, doc_1, doc_2, doc_140]
# if you have a bunch of docs, you could use enumerate and format the column name as you iterate and build the dataframe
colnames = ['AbsoluteFreq', 'Doc1', 'Doc2', 'Doc140']

masterdf = pd.DataFrame()
for d in all_docs:
    df = pd.DataFrame([d]).T
    masterdf = pd.concat([masterdf, df], axis=1)

# assign the column names
masterdf.columns = colnames
# get a glimpse of what the data frame looks like
masterdf.head()
# save to csv
masterdf.to_csv('docmatrix.csv', index=True)
# and to sort the dataframe by frequency
masterdf.sort_values(['AbsoluteFreq'])
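A shorter variant of the same idea builds the frame in a single call from a dict of the documents; the outer keys become the columns and the shared words become the index:
import pandas as pd

# columns come from the outer keys, the row index from the word keys
masterdf = pd.DataFrame({
    'AbsoluteFreq': absolut_freq,
    'Doc1': doc_1,
    'Doc2': doc_2,
    'Doc140': doc_140,
})
masterdf.to_csv('docmatrix.csv', index=True)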
You can make it a mostly data-driven process, given only the variable names of all the dictionary variables, by first creating a table with all the data listed in it and then using the csv module to write a transposed version of it (columns and rows swapped) to the output file.
import csv

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 = {u'hello': 0.1, u'world': 0.5, u'baby': 0.9}

dic_names = ('absolut_freq', 'doc_1', 'doc_2', 'doc_140')  # dict variable names
namespace = globals()

words = namespace[dic_names[0]].keys()  # assume all dicts contain the same words
table = [['WORD'] + list(words)]  # header row (becomes the first column of the output)
for dic_name in dic_names:  # add the values from each dictionary, given its name
    table.append([dic_name.upper() + '_FREQ'] + list(namespace[dic_name].values()))

# Use open('merged_dicts.csv', 'wb') for Python 2.
with open('merged_dicts.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(zip(*table))

print('done')
CSV file produced:
WORD,ABSOLUT_FREQ_FREQ,DOC_1_FREQ,DOC_2_FREQ,DOC_140_FREQ
world,0.002,0.9,0.3,0.5
baby,0.005,0.7,0.6,0.9
hello,0.001,0.8,0.2,0.1
No matter how you want to write this data, first you need an ordered data structure, for example a 2D list:
docs = []
docs.append( {u'hello':0.001, u'world':0.002, u'baby':0.005} )
docs.append( {u'hello':0.8, u'world':0.9, u'baby':0.7} )
docs.append( {u'hello':0.2, u'world':0.3, u'baby':0.6} )
docs.append( {u'hello':0.1, u'world':0.5, u'baby':0.9} )
words = docs[0].keys()
result = [ [word] + [ doc[word] for doc in docs ] for word in words ]
then you can use the built-in csv module: https://docs.python.org/2/library/csv.html
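For completeness, a minimal sketch of that final write step with the csv module; the header names here are made up to match the layout asked for in the question:
import csv

# docs[0] is the absolute-frequency dict, the rest are the per-document dicts
header = ['WORD', 'ABS_FREQ'] + ['DOC_%d_FREQ' % i for i in range(1, len(docs))]

with open('frequencies.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(header)
    writer.writerows(result)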
I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id' : [1, 2, 3],
'dist' : [67, 34, 98],
'loc_lat': [45.7, 46.89, 45.70],
'loc_lon': [38.9, 36.7, 31.0],
'arrival': ["Monday", "Tuesday", "NA"],
'characteristics_body_color':["red", "blue", "yellow"],
'characteristics_body_make':["sedan", "sedan", "NA"],
'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
# Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

# Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

# Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

# Loop through and parse the JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

# Merge results back onto the original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
arrival characteristics.body.color characteristics.body.make \
0 Monday red sedan
1 Tuesday blue sedan
2 NaN yellow NaN
characteristics.manuf_year loc.lat loc.lon ids
0 2014 45.70 38.9 1
1 2014 46.89 36.7 2
2 2010 45.70 31.0 3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res


def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec


def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
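Usage is then just (same sample.csv assumption as above):
df = from_records('sample.csv')
print(df.head())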
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
method 1 (s) method 2 (s) original time (s)
n_rows
96 0.008217 0.002971 0.362257
192 0.014484 0.004720 0.678590
384 0.027308 0.008720 1.373918
768 0.055644 0.016175 2.791400
1536 0.105730 0.030914 5.727828
3072 0.209049 0.060105 11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.