I have an Excel file (.xlsx) with about 800 rows and 128 columns with pretty dense data in the grid. There are about 9500 cells that I am trying to replace the cell values of using Pandas data frame:
xlsx = pandas.ExcelFile(filename)
frame = xlsx.parse(xlsx.sheet_names[0])
media_frame = frame[media_headers] # just get the cols that need replacing
from_filenames = get_from_filenames() # returns ~9500 filenames to replace in DF
to_filenames = get_to_filenames()
media_frame = media_frame.replace(from_filenames, to_filenames)
frame.update(media_frame)
frame.to_excel(filename)
The replace() takes 60 seconds. Any way to speed this up? This is not huge data or task, I was expecting pandas to move much faster. FYI I tried doing the same processing with same file in CSV, but the time savings was minimal (about 50 seconds on the replace())
strategy
create pd.Series representing a map from filenames to filenames.
stack our dataframe, map, then unstack
setup
import pandas as pd
import numpy as np
from string import letters
media_frame = pd.DataFrame(
pd.DataFrame(
np.random.choice(list(letters), 9500 * 800 * 3) \
.reshape(3, -1)).sum().values.reshape(9500, -1))
u = np.unique(media_frame.values)
from_filenames = pd.Series(u)
to_filenames = from_filenames.str[1:] + from_filenames.str[0]
m = pd.Series(to_filenames.values, from_filenames.values)
solution
media_frame.stack().map(m).unstack()
timing
5 x 5 dataframe
100 x 100
9500 x 800
9500 x 800
map using series vs dict
d = dict(zip(from_filenames, to_filenames))
I got the 60 second task to complete in 10 seconds by removing replace() altogether and using set_value() one element at a time.
I found creating new col and dropping the existing column one is faster than waiting forever. ;)
Related
I have an issue converting a JSON column (which contains around 250 variables) into 250 separate columns. I'm able to use Pandas dataframe, but just for 46k rows it takes 30 minutes and sometimes kernel is crashing due to low memory (for 0.5 million rows in database).
Can somebody help me with code using NumPy arrays (which should decrease conversion time and reduce file size)?
The JSON column has data in below format:
My code :
for x in records:
list_ = list(x)
json_acceptable_string = list_[4].read()
list_features.append(json.loads(json_acceptable_string)
Once I get the list-features I'm preprocessing and using machine learning pipeline. This isn't working for large data.
I think this could help for building your np array
variable_name_list = ['var1','va2',....,'var250']
list_features = np.empty(shape=(len(records),len(variable_name_list)),dtype=str)
for index in range(records):
list_ = list(records[index])
json_acceptable_string = list_[4].read()
tmp_feature_list = []
tmp_feature_dict = json.loads(json_acceptable_string)
for var_name in variable_name_list:
if var_name not in tmp_feature_dict.keys():
tmp_feature_list.append("missing_val")
else :
tmp_feature_list.append(tmp_feature_dict[var_name])
tmp_feature_list = np.asarray(tmp_feature_list,dtype=str).reshape(1,len(variable_name_list))
list_features[index] = tmp_feature_list
I have a file with x number of string names and their associated IDs. Essentially two columns of data.
What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.
This is sort of what I had in mind. Just to show my intent:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.
I hope my issue is clearly worded, and really, any input is appreciated,
Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
This is considerably more elegant than my first answer.
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())
# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
# This results in the following:
# strings abc abracadabra brabra cadra
# strings
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:
def cross_fuzz(df):
ct = pd.crosstab(df['strings'], df['strings'])
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
return ct
df.groupby('id').apply(cross_fuzz)
In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.
Here is what this could look like:
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Cross product, using a temporary column.
df['_tmp'] = 0
mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])
# Apply the function between the two strings.
mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)
# Reorganize data.
ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
ret.index.name = None
ret.columns.name = None
# This results in the following:
# abc abracadabra brabra cadra
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
import csv
from fuzzywuzzy import fuzz
import numpy as np
input_file = csv.DictReader(open('random_data_file.csv'))
string = []
for row in input_file: #file is appended row by row into a python dictionary
string.append(row["String"]) #keys for the dict. are the headers
#now you have a list of the string values
length = len(string)
resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X
for i in range (length):
for j in range (length):
resultMat[i][j] = fuzz.ratio(string[i], string[j])
print resultMat
I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100
Hope it helps
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically ocurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3
def skip(line_nr):
return line_nr % (atom_nr + 2) < 2
pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
header=None)
But it looks like the Dask dataframe does not support to pass a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
return line_nr % (atom_nr + 2) < 2
def pandaread(data_in_bytes):
pseudo_file = BytesIO(data_in_bytes[0])
return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offer this kind of functionality (to my knowledge)
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs correspond to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want you can also select only a few of these dfs objects to get a smaller portion of your data
dfs = dfs[:5] # only the first five blocks of `blocksize` data
The problem: I have data stored in csv file with the following columns data/id/value. I have 15 files each containing around 10-20mio rows. Each csv file covers a distinct period so the time indexes are non overlapping, but the columns are (new ids enter from time to time, old ones disappear). What I originally did was running the script without the pivot call, but then I run into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemd at first a nice way out (roughly 2/3 less data) but now perfomance kicks in. If I run the following script the concat function will run "forever" (I always interrupted manually so far after some time (2h>)). Concat/append seem to have limitations in terms of size (I have roughly 10000-20000 columns), or do I miss something here? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
#loop through list of raw file names
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
data = pd.concat([data,data_tmp])
del data_tmp
EDIT I:To clarify, each csv file has about 10-20mio rows and three columns, after pivot is applied this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full-set of ids into subsets and run the needed calculations based on each subset as they are independent for each id. I know it makes me reload the same files n-times, where n is the number of subsets used, but this is still reasonable fast. I still wonder why append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation, which is as close as possible to the actual data structure. I hope it is clear, I didn't spend to much time minimizing simulation-time, but it runs reasonable fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------
# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
id_list.append(''.join(
random.choice(string.ascii_uppercase + string.digits) for _
in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]
time_index = pd.bdate_range(start_date,end_date,freq='D')
chunk_size = math.ceil(len(time_index)/num_files)
data = []
# Simulate files
for file in range(0, run_to_file):
tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
# TODO not all cases cover, make sure ints are obtained
tmp_ids = id_list[file * id_interval:
start_ids + (file + 1) * id_interval]
tmp_data = pd.DataFrame(np.random.standard_normal(
(len(tmp_time), len(tmp_ids))), index=tmp_time,
columns=tmp_ids)
tmp_file = tmp_data.stack().sortlevel(1).reset_index()
# final simulated data structure of the parsed csv file
tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
'ID', 0: 'Value'})
# comment/uncomment if pivot takes place on aggregate level or not
tmp_file = tmp_file.pivot(index='Date', columns='ID',
values='Value')
data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier way to use when having to deal with a full list of dataframes.
The reason that pd.concat is slower is because it does more. E.g. when the column names are not equal, it checks for every column if the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure to have all the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
Summary, three key performance drivers depending on the set-up:
1) Make sure datatype are the same when concatenating two dataframes
2) Use integer based column names if possible
3) When using string based columns, make sure to use the align method before concat is called as suggested by joris
As #joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
dfs.append(data_tmp)
del data_tmp
data = pd.concat(dfs)
I have an excel/( to be converted to CSV a link) file.
The data- , has 8 columns. The first two are day of the year and time respectively while two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of the day subtract and save the value for that day.
Two problems I ran into, how do I parse 24 lines at a time ( there are no missing data lines!) and in each batch find the maximum or minimum.
I have 63126 lines=24 hr*263 days
So to iterate through the lines;
import numpy as np
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
dt_per_all_days=[]
for i in range (0,63126,1):
# I get stuck here how to limit the iteration for 24 at a time.
# if I can do that I think I can get the rest done.
min_d=[]
max_d=[]
min_d.append( up_air_min[i])
max_d.append( up_air_max[i])
max_per_day=max(max_d)
min_per_day=min(min_d)
dt_d=max_per_day-min_per_day
dt_per_all_days.append(dt_d)
del(min_d)
del(max_d)
move to the next batch of 24 lines....
`
Use the Numpy, Luke, avoid for-loops.
Then you have ap_air_min and ap_air_max numpy arrays you can easily do what you want by using numpy element-wise functions.
At first, create 2d array with 263 rows (one for a day) and 24 columns like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use np.min and np.max functions along axis 1 (good array tip sheet):
min_temperature = np.min(min_matrix, axis=1)
max_temperature = mp.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is array with needed values. Let's save it to foo.csv:
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
And final code looks like this:
import numpy as np
# This I got from your answer.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Split arrays and create matrix with 263 lines-days and 24 values in every line.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = mp.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv.
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists each holding 24 values, which is a bit easier to process.
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from StringIO import StringIO
from random import random as r
import numpy as np
import operator
s = StringIO()
for x in xrange(0,10000):
s.write('%f,%f,%f\n' % (r(),r()*10,r()*100))
s.seek(0)
data = np.genfromtxt(s,dtype=None, names=['pitch','yaw','thrust'], delimiter=',')
for x in range(0,len(data),24):
print('Acting on hours %d through %d' % (x, x+24))
one_day = data[x:x+24]
minimum_yaw = min(one_day['yaw'])
max_yaw = max(one_day['yaw'])
print 'min',minimum_yaw,'max',max_yaw,'one_day',one_day['yaw']