I have a file with x number of string names and their associated IDs. Essentially two columns of data.
What I would like is a correlation-style table of size x by x (with the data in question on both the x-axis and the y-axis), but instead of a correlation, I would like the output of the fuzzywuzzy library's fuzz.ratio(x, y) function, using the string names as input. Essentially, running every entry against every entry.
This is sort of what I had in mind. Just to show my intent:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
But clearly this approach is not working for me at the moment. Any help is appreciated. It doesn't have to be pandas; it is just an environment I am relatively more familiar with.
I hope my issue is clearly worded, and really, any input is appreciated.
Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
This is considerably more elegant than my first answer.
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())
# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
# This results in the following:
# strings abc abracadabra brabra cadra
# strings
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
For simplicity, I omitted the groupby operation as suggested in your question. In case you want to apply the fuzzy string matching on groups, simply create a separate function:
def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct
df.groupby('id').apply(cross_fuzz)
In pandas, the Cartesian product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation extracts the format you had in mind. For simplicity, I omitted the groupby operation, but of course you could apply the procedure to all group tables by moving the code below into a separate function.
Here is what this could look like:
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
columns=['id', 'strings'])
# Cross product, using a temporary column.
df['_tmp'] = 0
mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])
# Apply the function between the two strings.
mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)
# Reorganize data.
ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
ret.index.name = None
ret.columns.name = None
# This results in the following:
# abc abracadabra brabra cadra
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
import csv
from fuzzywuzzy import fuzz
import numpy as np
input_file = csv.DictReader(open('random_data_file.csv'))
string = []
for row in input_file: #the file is read row by row into a Python dictionary
    string.append(row["String"]) #keys for the dict are the headers
#now you have a list of the string values
length = len(string)
resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X
for i in range(length):
    for j in range(length):
        resultMat[i][j] = fuzz.ratio(string[i], string[j])
print(resultMat)
I did the implementation with a NumPy 2D matrix. I am not that good with pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning String[i] would be matched with String_Dup[i], so all results would be 100.
Hope it helps.
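If you prefer a labeled table like in the other answers, the NumPy matrix can be wrapped in a DataFrame afterwards (a small sketch reusing string and resultMat from the code above):
import pandas as pd
result_df = pd.DataFrame(resultMat, index=string, columns=string)
print(result_df)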
I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final data frame effectively has 2.5 million rows, and the code takes a lot of time to execute.
Is there any way I can optimize this code so that it runs faster?
Current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these and expect a faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    #print(key) #Names of the root level object names in HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())
#Get the HDF5 group; key needs to be a group name from above
key ='DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)
#for key in ls:
#data = f.get(key)
#dataset1 = np.array(data)
#length=len(dataset1)
masterdf=pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
#masterdf[key]=dataset1
X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)
#%%
data_df = pd.DataFrame(index=range(len(Y_1)),columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")
#%%
final = pd.DataFrame(index=range(1616*1616),columns=['X', 'Y','GHI'])
k=0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])): #X and Y ranges are not same
        final.loc[k,'X'] = X_1[0][x]
        final.loc[k,'Y'] = Y_1[0][y]
        final.loc[k,'GHI'] = data_df.iloc[y,x]
        k=k+1
        # print(k)
We can optimize loops by vectorizing operations. This is one to two orders of magnitude faster than the pure Python equivalents (especially in numerical computations). Vectorization is something we get with NumPy, a library with efficient data structures designed to hold matrix data.
Could you please try the following (file.h5 is your file):
import pandas as pd
import h5py
with h5py.File("file.h5", "r") as file:
df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
DHI = file.get("DHI")[0][:, :-2].reshape(-1)
final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross product of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
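As a tiny, standalone illustration of the row-major ("C" order) flattening used in the DHI step above:
import numpy as np
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.reshape(-1))  # [1 2 3 4 5 6] -> rows are concatenated one after another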
I have a multi-hierarchical pandas dataframe shown below. How, for a given attribute, attr ('rh', 'T', 'V'), can I set certain values (say values > 0.5) to NaN over the entire set of pLevs? I have seen answers on how to set a specific column (e.g., df['rh', 50]) but have not seen how to select the entire set.
attr rh T V
pLev 50 75 100 50 75 100 50 75 100
refIdx
0 0.225026 0.013868 0.306472 0.144581 0.379578 0.760685 0.686463 0.476179 0.185635
1 0.496020 0.956295 0.471268 0.492284 0.836456 0.852873 0.088977 0.090494 0.604290
2 0.898723 0.733030 0.175646 0.841776 0.517127 0.685937 0.094648 0.857104 0.135651
3 0.136525 0.443102 0.759630 0.148536 0.426558 0.731955 0.523390 0.965385 0.094153
To facilitate assistance, I am including code to create the dataframe here:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((4,9)))
df.columns = pd.MultiIndex.from_product([['rh','T','V'],[50,75,100]])
df.columns.names = ['attr', 'pLev']
df.index.name = 'refIdx'
The notation is mildly annoying, but you can use IndexSlice:
df.loc[:, pd.IndexSlice['rh', :]] = np.nan
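To apply the "values > 0.5" condition from the question rather than blanking the whole block, the same selection can be combined with a mask (a small sketch reusing df from the setup above; sub is just a temporary name):
sub = df.loc[:, pd.IndexSlice['rh', :]]
df.loc[:, pd.IndexSlice['rh', :]] = sub.where(sub <= 0.5)  # entries > 0.5 become NaN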
If your 'given attribute' is 'rh' then you can take a cross-section with the following:
df_xs = df.xs('rh', level='attr', axis=1, drop_level=False)
Then you can update the original df as follows:
df[df_xs > 0.5] = np.nan
This works because drop_level=False was given to .xs
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms; the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom count and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up to the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
Sorry if this has been asked before -- I couldn't find this specific question.
In python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. Should I use something in pandas or NumPy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep='\s+')
Note that the even/odd terms may be counterintuitive because Python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You probably want to rename these columns, though (see the small sketch after the printout below).
>>> result_df
0 1 2
0 101.849 110.349 68.513
1 109.950 110.912 61.274
2 100.612 110.050 62.150
3 107.750 118.687 59.712
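One minimal way to do that renaming, assuming you want names derived from the original headers (the names below are just one possible choice):
# Pair each signal header with its background header.
result_df.columns = ['ROI121-ROI122', 'ROI124-ROI125', 'ROI126-ROI127']
# Or simply reuse the signal column names:
# result_df.columns = signal_df.columns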
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data) # read the first line from data
# print(header[:-1])
for line in data:
    # print(line)
    floats = [float(val) for val in line.split()] # create a list of floats
    for prev, cur in zip(floats[::2], floats[1::2]):
        print('{:6.3f}'.format(prev-cur), end=' ')
    print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
If you know what data[start:stop:step] means and how zip works, this should be easy to understand.
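A tiny illustration of the slicing and zip used above, on the first data row (a standalone sketch):
floats = [292.087, 190.238, 299.837, 189.488, 255.525, 187.012]
print(floats[::2])   # [292.087, 299.837, 255.525] -> 1st, 3rd, 5th columns
print(floats[1::2])  # [190.238, 189.488, 187.012] -> 2nd, 4th, 6th columns
print(list(zip(floats[::2], floats[1::2])))  # pairs each column with its neighbour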
I have an Excel file (.xlsx) with about 800 rows and 128 columns with pretty dense data in the grid. There are about 9500 cells whose values I am trying to replace using a pandas data frame:
xlsx = pandas.ExcelFile(filename)
frame = xlsx.parse(xlsx.sheet_names[0])
media_frame = frame[media_headers] # just get the cols that need replacing
from_filenames = get_from_filenames() # returns ~9500 filenames to replace in DF
to_filenames = get_to_filenames()
media_frame = media_frame.replace(from_filenames, to_filenames)
frame.update(media_frame)
frame.to_excel(filename)
The replace() takes 60 seconds. Any way to speed this up? This is not a huge amount of data or a big task; I was expecting pandas to move much faster. FYI, I tried the same processing with the same file as CSV, but the time savings were minimal (about 50 seconds on the replace()).
strategy
create pd.Series representing a map from filenames to filenames.
stack our dataframe, map, then unstack
setup
import pandas as pd
import numpy as np
from string import ascii_letters as letters  # string.letters is Python 2 only
media_frame = pd.DataFrame(
    pd.DataFrame(
        np.random.choice(list(letters), 9500 * 800 * 3) \
            .reshape(3, -1)).sum().values.reshape(9500, -1))
u = np.unique(media_frame.values)
from_filenames = pd.Series(u)
to_filenames = from_filenames.str[1:] + from_filenames.str[0]
m = pd.Series(to_filenames.values, from_filenames.values)
solution
media_frame.stack().map(m).unstack()
timing
(The original answer included timing plots for a 5 x 5 dataframe, a 100 x 100 dataframe, and the 9500 x 800 dataframe, plus a comparison of mapping with a pd.Series versus a dict.)
d = dict(zip(from_filenames, to_filenames))
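With that dict, the same pattern applies, since map accepts a dict as well as a Series (a sketch):
media_frame.stack().map(d).unstack()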
I got the 60-second task to complete in 10 seconds by removing replace() altogether and using set_value() one element at a time.
I found that creating a new column and dropping the existing one is faster than waiting forever. ;)
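A minimal sketch of that element-wise idea, assuming a lookup dict built from the two filename lists (lookup is a hypothetical name; note that set_value() is deprecated in recent pandas, where .at is the equivalent):
lookup = dict(zip(from_filenames, to_filenames))  # hypothetical mapping
for row in media_frame.index:
    for col in media_frame.columns:
        val = media_frame.at[row, col]
        if val in lookup:
            media_frame.at[row, col] = lookup[val]  # media_frame.set_value(row, col, ...) in older pandas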