I have two large lists, each with about 100,000 elements (one larger than the other), that I want to iterate through. My loop looks like this:
for i in list1:
    for j in list2:
        function()
This loop currently takes too long. However, each element of list1 only needs to be checked against part of list2: beyond a certain index there are no more matching entries in list2. This means that looping over index ranges might be faster, but I do not know how to do that.
In my project, list2 is a list of dicts with three keys: value, name, and timestamp. list1 is a list of the timestamps, in order. The function looks up the value for a given timestamp and writes it into a CSV file under the appropriate name column.
This is an example of entries from list1:
[1364310855.004000, 1364310855.005000, 1364310855.008000]
This is what list2 looks like:
{"name":"vehicle_speed","value":2,"timestamp":1364310855.004000}
{"name":"accelerator_pedal_position","value":4,"timestamp":1364310855.004000}
{"name":"engine_speed","value":5,"timestamp":1364310855.005000}
{"name":"torque_at_transmission","value":-3,"timestamp":1364310855.008000}
{"name":"vehicle_speed","value":1,"timestamp":1364310855.008000}
In my final csv file, I should have something like this:
http://s000.tinyupload.com/?file_id=03563948671103920273
If you want this to be fast, you should restructure the data you have in list2 to speed up your lookups:
# The following code converts list2 into a multivalue dictionary
from collections import defaultdict
list2_dict = defaultdict(list)
for item in list2:
    list2_dict[item['timestamp']].append((item['name'], item['value']))
This gives you a much faster way to look up your timestamps:
print(list2_dict)
defaultdict(<class 'list'>, {
    1364310855.008: [('torque_at_transmission', -3), ('vehicle_speed', 1)],
    1364310855.005: [('engine_speed', 5)],
    1364310855.004: [('vehicle_speed', 2), ('accelerator_pedal_position', 4)]})
Lookups will be much more efficient when using list2_dict:
for i in list1:
    for j in list2_dict[i]:
        # here j is a tuple in the form (name, value)
        function()
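In case it helps, here is a minimal sketch of what function() could collapse into once list2_dict exists, assuming one column per signal name and one row per timestamp as in your example; the field names and the output filename are placeholders:
import csv

# Column layout assumed from the example data; adjust to your real signal names.
fieldnames = ['timestamp', 'accelerator_pedal_position', 'engine_speed',
              'torque_at_transmission', 'vehicle_speed']

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    for ts in list1:
        # one dict lookup per timestamp instead of a full scan of list2
        row = {'timestamp': ts}
        row.update(dict(list2_dict[ts]))
        writer.writerow(row)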
You appear to only want to use the elements of list2 at indexes i*2 and i*2+1, that is, elements 0, 1, then 2, 3, and so on.
You only need one loop.
for i in range(len(list1)):
    j = list2[i*2]
    k = list2[i*2 + 1]
    # Process function using j and k
You will only process up to the end of list1.
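Equivalently, a small sketch that avoids the index arithmetic (assuming list2 really is laid out in consecutive pairs) pairs the elements with slicing and zip:
# Pair up consecutive elements of list2: (0, 1), (2, 3), ...
for j, k in zip(list2[0::2], list2[1::2]):
    # Process function using j and k
    function()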
I think the pandas module is a perfect match for your goals...
import ujson # 'ujson' (Ultra fast JSON) is faster than the standard 'json'
import pandas as pd
filter_list = [1364310855.004000, 1364310855.005000, 1364310855.008000]
def file2list(fn):
    with open(fn) as f:
        return [ujson.loads(line) for line in f]
# Use pd.read_json('data.json') instead of pd.DataFrame(load_data('data.json'))
# if you have a proper JSON file
#
# df = pd.read_json('data.json')
df = pd.DataFrame(file2list('data.json'))
# filter DataFrame with 'filter_list'
df = df[df['timestamp'].isin(filter_list)]
# convert UNIX timestamps to readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# pivot data frame
# fill NaN's with zeroes
df = df.pivot(index='timestamp', columns='name', values='value').fillna(0)
# save data frame to CSV file
df.to_csv('output.csv', sep=',')
#pd.set_option('display.expand_frame_repr', False)
#print(df)
output.csv
timestamp,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0.0,0.0,-3.0,1.0
PS: I don't know where you got the [Latitude, Longitude] columns from, but it's pretty easy to add those columns to your result DataFrame; just add the following lines before calling df.to_csv():
df.insert(0, 'latitude', 0)
df.insert(1, 'longitude', 0)
which would result in:
timestamp,latitude,longitude,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,0,0,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0,0,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0,0,0.0,0.0,-3.0,1.0
Related
I have 2 sets of JSON files, one for CO2 and one for temperature, stored in co2_results and temp_results. I want to convert them into pandas DataFrames. So far I am using the method below, which is not very efficient, especially when I have a lot of JSON files.
co2_sensor1 = pd.DataFrame(co2_results[0])
co2_sensor2 = pd.DataFrame(co2_results[1])
co2_sensor3 = pd.DataFrame(co2_results[2])
temp_sensor1 = pd.DataFrame(temp_results[0])
temp_sensor2 = pd.DataFrame(temp_results[1])
temp_sensor3 = pd.DataFrame(temp_results[2])
Is there a way I can make the above code more efficient? Like using functions or for loops?
If I store them in a list:
my_list = ['co2_sensor1', 'co2_sensor2', 'co2_sensor3', 'temp_sensor1', 'temp_sensor2', 'temp_sensor3']
can I then iterate through this list by index (e.g. for indexes 0 to 2 take data from co2_results, and afterwards take the results from temp_results), and then return all the names in my_list as DataFrames?
Try this:
my_list = [pd.DataFrame(i) for i in co2_results + temp_results]
print(my_list)
This calls pd.DataFrame() on each item of a list combined from the two results lists.
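If you also want to keep the sensor names from my_list rather than a plain list of frames, a dictionary comprehension is a minimal sketch (the generated key names are just illustrative):
co2_dfs = {f"co2_sensor{i + 1}": pd.DataFrame(r) for i, r in enumerate(co2_results)}
temp_dfs = {f"temp_sensor{i + 1}": pd.DataFrame(r) for i, r in enumerate(temp_results)}
# co2_dfs["co2_sensor1"] is the frame previously assigned to the co2_sensor1 variable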
I have 2 DataFrames, df1 with 35k rows and df2 with 76k rows, where I need to check whether df1["col1"] elements exist among the sub-elements of df2["col2"]. The code seems to work fine on the sample dataset I have provided, but the runtime takes forever on the original one. Here is the for-loop code I used on the sample dataset:
import pandas as pd
post_token_list = [['wXrL3TbK'], ['wXmTQKw1'], ['wXvnlWej'], ['wXvXBjKp']]
tokens_list = [['wXv3qoPQ', 'wXvT7ylu', 'wXvnIJuH', 'wXvXH7vy', 'wXvDXSS1', 'wXvjVE1F', 'wXvPV6z1', 'wXvHF1uw',
'wXvH1q03', 'wXvnTlcr', 'wXvDEG9U', 'wXLfZtO6', 'wXvLDDDl', 'wXvHTgjk', 'wXvHDDr8', 'wXvPBLbu',
'wXvvxXHI', 'wXvPBFge', 'wXvLxSii', 'wXvDhk2h', 'wXv3Alan', 'wXvvQuKy', 'wXvvQ6LO', 'wXpHNjw9'],
['wXYr2lVk', 'wXXj7iDP', 'wXXXIsQr', 'wXQbXKz6', 'wXN3tMp1', 'wXMfZV5N', 'wXvnlWej', 'wXSDyEaW',
'wXQ7mM78', 'wXMPvojh', 'wXMjo-8G', 'wXLfZtO6', 'wXN3tMp1'],
['wXr_jZmX', 'wXr7D0AM', 'wXrzjhxL', 'wXrfjQNe', 'wXrnihqT', 'wXrjyqm5', 'wXr3CD4h', 'wXrnSZsy',
'wXrTieP7', 'wXLfZtO6', 'wXgHVwkc', 'wXdvewsV', 'wXrfxZeg', 'wXrLB7Zo', 'wXprtX71', 'wXrHhjtO',
'wXrzwKBt', 'wXqz-RlY', 'wXq_fp7F', 'wXq7Po7n', 'wXq7fC73', 'wXqzvRSW', 'wXqf_PQ3', 'wXML2vCd'],
['wXv3aQrv', 'wXvn6ONM', 'wXvfaG0M', 'wXvf6LIr', 'wXvjJBg_', 'wXvL6M-0', 'wXv7p2cd', 'wXv3poSs',
'wXvz5kUz', 'wXvrZz0_', 'wXv_YVCb', 'wXLfZtO6', 'wXvX5Hgi', 'wXvz3Ptg', 'wXvHJUU-', 'wXvr4fB7',
'wXvnlWej', 'wXv_YUrK', 'wXv7Id05', 'wXv7IYOV', 'wXvfYfLo', 'wXv7Y3AV', 'wXvT4_pE', 'wXvPovRt'],
['wXoDui-2', 'wXoT9yTg', 'wXmTQKw1', 'wXormLxu', 'wXMX-NNQ', 'wXo7kUfB', 'wXon0rt_', 'wXozT-3V',
'wXnvYjEc', 'wXnTn9D6', 'wXnLH7Cz', 'wXn_2HV_', 'wXnPGou9', 'wXnPVSNo', 'wXuG0sl3', 'wXnjAs7X',
'wXm38mLv', 'wXmnj5Oh', 'wXmfjQ2h', 'wXm_wXuD', 'wXlPOUmy', 'wXcfHkmx', 'wXQ_62cx', 'wXUD3qyx']]
df1 = pd.DataFrame({"col1": post_token_list})
df2 = pd.DataFrame({"col2": tokens_list})
query_bounce = []
def query_bounce_checker(dataset_clicked, dataset_loaded, col1, col2):
    for i in dataset_clicked[col1]:
        for j in i:
            [query_bounce.append(k) for k in dataset_loaded[col2] if j in k]
    return query_bounce
query_bounce_checker(df1, df2, "col1", "col2")
The i, j, and k values are used to access and compare the elements and sub-elements of the two respective columns.
Speed is a contributing factor for me, and the function written here is not fast enough for a dataset of this size.
If this is actually what you want, this should be pretty fast.
import numpy as np
np.intersect1d(np.hstack(df1.col1),np.hstack(df2.col2))
Output
array(['wXmTQKw1', 'wXvnlWej'], dtype='<U8')
I am not sure if this is what you want. If you just want to check which values in df1 also exist in df2, you can transform the two dataframes into arrays and use np.in1d() to do so.
Try this:
array1 = np.array((','.join(df1['col1'].apply(lambda x: ','.join(x)))).split(','))
array2 = np.array((','.join(df2['col2'].apply(lambda x: ','.join(x)))).split(','))
print(array1[np.in1d(array1,array2)])
Output:
['wXmTQKw1' 'wXvnlWej']
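A plain-Python alternative sketch: flatten both columns into sets and intersect them. Set membership is O(1), so the nested loop disappears.
from itertools import chain

set1 = set(chain.from_iterable(df1['col1']))
set2 = set(chain.from_iterable(df2['col2']))
print(set1 & set2)   # {'wXmTQKw1', 'wXvnlWej'}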
Hi, I have code which looks like this:
with open("file123.json") as json_file:
data = json.load(json_file)
df_1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][1].items()]))
df_1_made = pd.json_normalize(json.loads(df_1.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
df_2 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][2].items()]))
df_2_made = pd.json_normalize(json.loads(df_2.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
df_3 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][3].items()]))
df_3_made = pd.json_normalize(json.loads(df_3.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
where the dataframes are built from a JSON file.
The problem is that I am dealing with different JSON files, and each one can lead to a different number of dataframes: the code above builds 3, but another file may need 7. Is there any way to make a for loop that takes the length of the data:
length = len(data["spt"])
and creates the correct number of dataframes from it, so I do not need to do it manually?
The simplest option here will be to put all your dataframes into a dictionary or a list. First define a function that creates the dataframe and then use a list comprehension.
def create_df(data):
    df = pd.DataFrame(
        dict(
            [(k, pd.Series(v)) for k, v in data]
        )
    )
    df = pd.json_normalize(
        json.loads(
            df.to_json(orient="records")
        )
    ).T.drop(["content.id", "shortname", "name"])
    return df

my_list_of_dfs = [create_df(x.items()) for x in data["spt"]]
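If you still want named handles like df_1_made, df_2_made, and so on, you can keep the frames in a dictionary instead of separate variables; a small sketch:
dfs_made = {f"df_{i + 1}_made": df for i, df in enumerate(my_list_of_dfs)}
# dfs_made["df_1_made"] plays the role of the old df_1_made variable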
Sorry if this has been asked before -- I couldn't find this specific question.
In Python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. Should I use something in pandas or numpy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep=r'\s+')
Note that the even/odd terms may be counterintuitive because python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You probably want to rename these column names, though.
>>> result_df
0 1 2
0 101.849 110.349 68.513
1 109.950 110.912 61.274
2 100.612 110.050 62.150
3 107.750 118.687 59.712
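For the renaming, one minimal option (assuming you want to reuse the signal column labels) is:
# carry the signal column names (ROI121, ROI124, ROI126) over to the differences
result_df.columns = signal_df.columns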
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data)  # read the first line from data
# print(header[:-1])
for line in data:
    # print(line)
    floats = [float(val) for val in line.split()]  # create a list of floats
    for prev, cur in zip(floats[::2], floats[1::2]):
        print('{:6.3f}'.format(prev - cur), end=' ')
    print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
If you know what data[start:stop:step] means and how zip works, this should be easy to understand.
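For reference, a tiny illustration of those two pieces using the first row of the example data:
floats = [292.087, 190.238, 299.837, 189.488, 255.525, 187.012]
print(floats[::2])    # [292.087, 299.837, 255.525]  -> the odd (signal) columns
print(floats[1::2])   # [190.238, 189.488, 187.012]  -> the even (background) columns
print(list(zip(floats[::2], floats[1::2])))
# [(292.087, 190.238), (299.837, 189.488), (255.525, 187.012)]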
I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id' : [1, 2, 3],
'dist' : [67, 34, 98],
'loc_lat': [45.7, 46.89, 45.70],
'loc_lon': [38.9, 36.7, 31.0],
'arrival': ["Monday", "Tuesday", "NA"],
'characteristics_body_color':["red", "blue", "yellow"],
'characteristics_body_make':["sedan", "sedan", "NA"],
'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")
#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm
#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])
#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))
#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
arrival characteristics.body.color characteristics.body.make \
0 Monday red sedan
1 Tuesday blue sedan
2 NaN yellow NaN
characteristics.manuf_year loc.lat loc.lon ids
0 2014 45.70 38.9 1
1 2014 46.89 36.7 2
2 2010 45.70 31.0 3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
method 1 (s) method 2 (s) original time (s)
n_rows
96 0.008217 0.002971 0.362257
192 0.014484 0.004720 0.678590
384 0.027308 0.008720 1.373918
768 0.055644 0.016175 2.791400
1536 0.105730 0.030914 5.727828
3072 0.209049 0.060105 11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
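Putting the second method to use is then just a call to from_records; a minimal sketch (the column order and output filename are up to you):
df = from_records('sample.csv')
# put id and dist first, then write the flattened result back out as CSV
cols = ['id', 'dist'] + [c for c in df.columns if c not in ('id', 'dist')]
df[cols].to_csv('flattened.csv', index=False)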