I'm currently trying to normalize a DataFrame (~600k rows) with prices (pricevalue) in different currencies (pricecurrency) so that every row has its price in EUR.
I'd like to convert each price with the daily rate taken from a date column.
My current "solution" (using the CurrencyConverter package found on PyPI) looks like this:
from currency_converter import CurrencyConverter
c = CurrencyConverter(fallback_on_missing_rate=True, fallback_on_missing_rate_method="last_known")
def convert_currency(row):
    return c.convert(row["pricevalue"], row["pricecurrency"], row["date"])
df["converted_eur"] = df.apply(lambda x: convert_currency(x), axis=1)
However, this solution is taking forever to run.
Is there a faster way to accomplish that? Any help is appreciated :)
It sounds strange to say this, but unfortunately you're not doing anything wrong!
The currency interpolation code is doing what you need it to do, and not much else, and in your own code you're doing everything right too. This means there's nothing you can quickly fix to get more performance. You do have a redundant lambda wrapping a function where a plain function reference would do, but removing it won't make much of a difference:
i.e.
df["converted_eur"] = df.apply(lambda x: convert_currency(x),axis=1)
should be
df["converted_eur"] = df.apply(convert_currency, axis=1)
The first thing to do is to understand how long this processing will actually take by adding a progress bar:
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects
df["converted_eur"] = df.progress_apply(convert_currency, axis=1)
Once you know how long the job will actually take, try out these, in order:
1. Live with it.
2. Single-instance parallelization, with something like pandarallel (see the sketch below).
3. Multi-instance parallelization, with something like Dask.
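If you try the pandarallel route, a minimal sketch (assuming the convert_currency function, the converter c, and the df defined above, plus a stock pandarallel install) might look like this:
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)  # one worker per core by default
# parallel_apply mirrors DataFrame.apply but distributes the rows across the workers
df["converted_eur"] = df.parallel_apply(convert_currency, axis=1)
The work is still row-by-row, so this only buys you roughly a factor of the number of cores, but that can be the difference between hours and minutes.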
Related
I have this code that functions properly and produces the result I am looking for:
from thefuzz import fuzz
import pandas as pd
df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
df_compare = pd.DataFrame(
    df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
print('Done did')
However, since string comparison is a very costly operation, the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) takes days before throwing a memory-related error message. I attempted running it with modin on 32 cores with 32 GB of memory, but it did not change anything and I ended up with the same result.
import glob
from thefuzz import fuzz
import modin.pandas as pd
files = glob.glob('/folder/folder/2013/*.csv')
for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0
    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')
It works on smaller files run as separate jobs, but there are too many files to do them all separately. Would there be a way to optimise this code, or is there some other possible solution?
The data is a collection of tweets, and only one column (out of around 30) is being compared. It looks like this:
ID      Text
11213   I am going to the cinema
23213   Black is my favourite colour
35455   I am going to the cinema with you
421323  My friends think I am a good guy.
It appears that the requirement is to compare each sentence against every other sentence. Given that overall approach, I don't think there is a great answer: you are looking at n^2 comparisons, and as your row count gets large the overall processing requirement turns into a monster very quickly.
To figure out the feasibility, you could run some smaller tests, calculating n^2 for each test to get a comparisons-per-second metric, and then calculate n^2 for the big datasets you want to process to get an idea of the required processing time. That is assuming your memory could handle it. There may be existing work on handling n^2 problems like this; it might be worth looking around for something along those lines.
You are also doing more than twice the work you need to do: you compare everything against everything twice, and against itself. But even if you only do the unique combinations, n(n-1)/2 is still monstrous once things get large.
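For what it's worth, here is a rough sketch of restricting the work to unique pairs with itertools.combinations instead of building the full n x n score matrix. It assumes the same 'text' column and the 75 threshold from your code, and it is still O(n^2) comparisons in the worst case, just without the redundant half:
from itertools import combinations
import pandas as pd
from thefuzz import fuzz

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
texts = df['text'].tolist()

# Mark a row as a near-duplicate as soon as any earlier row matches it closely enough
is_duplicate = [False] * len(texts)
for i, j in combinations(range(len(texts)), 2):
    if not is_duplicate[j] and fuzz.partial_ratio(texts[i], texts[j]) >= 75:
        is_duplicate[j] = True

df[[not dup for dup in is_duplicate]].to_csv('/folder/folder/2011_05-ready.csv', index=False)
Like your upper-triangle zeroing, this keeps the first occurrence of each cluster of similar tweets and drops the later ones.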
I have tried iterating over the rows in my dataframe to get sentiment values.
My code is:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
analyzer = SentimentIntensityAnalyzer()
df['Sentiment Values'] = df['Comments'].apply(lambda Comments: analyzer.polarity_scores(Comments))
but it returns
'float' object has no attribute 'encode'
My df is:
Comments
1    The main thing is the price appreciation of the token (this determines the gains or losses more than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities and protocols that accept the asset as collateral, the better. Finally, the yield for staking comes into play.
2    No problem. I’m the same. Good to hold both for sure!
3    I understood most of that. Thank you.
4    I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then you'll know for sure. Peace of mind has value too.
EDIT:
I'm not able to reproduce this - probably because the error is happening further down the dataframe than the part you've shared here.
I'm guessing the issue is that you've got some non-strings (floats, specifically) in your Comments column. Ideally you should examine and remove them, but you can also just convert them to strings before sentiment analysis with .astype(str):
df['Sentiment Values'] = df['Comments'].astype(str).apply(lambda Comments: analyzer.polarity_scores(Comments))
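If you want to look at the offending rows before deciding, a quick sketch (assuming the column is named Comments as above):
# Show the rows whose Comments value is not a string (e.g. NaN, which is a float)
print(df[~df['Comments'].apply(lambda v: isinstance(v, str))])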
I was able to execute the code using a .csv file adapted from your data.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
analyzer = SentimentIntensityAnalyzer()
df = pd.read_csv('df.csv', delimiter='_')
df['Sentiment Values'] = df['Comments'].apply(lambda Comments: analyzer.polarity_scores(Comments))
Note that I've used an underscore as the delimiter, which may not be totally robust depending on your input data.
In the dataframe there should be only a single column of comments. It looks like you may have included the indices (1, 2, 3, 4, ...), which could be the source of your error.
df.csv:
Comments
The main thing is the price appreciation of the token (this determines the gains or losses more than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities and protocols that accept the asset as collateral, the better. Finally, the yield for staking comes into play.
No problem. I’m the same. Good to hold both for sure!
I understood most of that. Thank you.
I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then you'll know for sure. Peace of mind has value too.
I have a very large dataset comprised of data for several dozen samples, and several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share, as it is part of an active research project; I hope to open-source it, but that will depend on the IP rules in the agreement) and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each loop, it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i]) and uses unique() on that subset to get a list of subsamples. Then it uses another for loop to repeat that process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[ df["sample"] == sample_number ]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[ sample["subsample"] == subsample_number ]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try and diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and other stuff. If I run:
mean = df["value"].mean()
it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if I instead run:
sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()
The program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong, so I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, you can use by=['sample', 'subsample']
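Applied to the structure in the question (assuming columns named sample, subsample and value, as in your loop), a rough sketch would be:
import vaex

# One pass over the data instead of one filtered pass per (sample, subsample) pair
means = df.groupby(
    by=['sample', 'subsample'],
    agg={'mean_value': vaex.agg.mean('value')},
)
This returns a small dataframe with one row per (sample, subsample) pair, which you can then convert or iterate over cheaply.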
Download the Data Here
Hi, I have data something like the below, and would like to multi-label it.
Something like this: target (screenshot).
But the problem here is that data gets lost when multi-labelling it, something like below: issue (screenshot).
I'm using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset. Thank you.
That column, when read in with pandas, will be stored as a string, so first we'd need to convert it to an actual list.
From there, use .explode() to expand that list into a series (where the index will match the index it came from, and the values will be the values in that list).
Then crosstab that series so that each row/column combination marks the presence of a value.
Then join that back up with the dataframe on the index values.
Keep in mind that when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows, it'll take a while (maybe a minute or so) to process and you end up with close to 1,300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it a bit to make it less complex. Perhaps find a way to combine movie ids into a set number of genres or something like that? But then test to see if simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)
s = df['movieId'].explode()
df = df[['userId']].join(pd.crosstab(s.index, s))
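If you'd rather stay with the MultiLabelBinarizer from your original attempt, the literal_eval step above is still the key fix; a rough sketch, assuming movieId holds list-like strings as in your file:
from ast import literal_eval
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)  # turn the string "[1, 2, 3]" into a real list

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['movieId']), columns=mlb.classes_, index=df.index)
df_enc = df[['userId']].join(encoded)
Either way you end up with the same very wide one-hot table, so the caveat about cardinality above still applies.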
I would like to change some values in an mdf file (specifically, I would like to check for consistency, since the measurement instrument for some reason writes 10**10 when no value could be found). I can't figure out how to access specific values and change them. I figured out how to include the channel units in the channel names, which works reasonably fast:
from asammdf import MDF  # assuming MDF here is asammdf's MDF class

with MDF(file) as mdf:
    # add units to channel names (faster than using pandas)
    for i, gp in enumerate(mdf.groups):
        for j, ch in enumerate(gp.channels):
            mdf.groups[i].channels[j].name = ch.name + " [" + ch.unit + "]"
Unfortunately, gp.channels doesn't seem to have a way to access the data, only some metadata for each channel (or at least I can't figure out the attribute or method).
I already tried to convert to a dataframe, where this is rather easy, but the file is quite large so it takes waaaay too long to sift through all the datapoints - my guess is this could be quite a bit faster if it is done in the mdf directly.
# slow method with dataframe conversion
import numpy as np

data = mdf.to_dataframe()
columns = data.columns.tolist()
for col in columns:
    for i, val in enumerate(data[col]):
        if val == 10**10:
            data.loc[i, col] = np.nan
Downsampling solves the taking too long part, but this is not really a solution either since I do need the original sample rate.
Accessing the data is not a problem, since I can use the select() or get() methods, but I can't change the values - I don't know how. Ideally, I'd change any 10**10 to np.nan.
OK, I figured out how to do it efficiently in pandas, which works for me.
I used a combination of a lambda function and the applymap method of a pandas DataFrame:
data = data.applymap(lambda x: np.nan if x==10**10 else x)
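As an aside, a plain vectorized replacement should give the same result without calling a Python lambda per element; a minimal sketch, assuming the same data DataFrame:
import numpy as np

# Replace every exact 10**10 sentinel with NaN in one vectorized pass
data = data.replace(10**10, np.nan)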
Do you still get the 10**10 values when you call get() with ignore_invalidation_bits=False? In MDF v4 the writing applications can use the invalidation bits to mark invalid samples.
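A minimal way to check that (the channel name 'SomeChannel' is just a placeholder for one of your own channels):
with MDF(file) as mdf:
    # If the writer flagged the bad samples, honouring the invalidation bits here
    # should keep the 10**10 sentinels from showing up as ordinary values
    sig = mdf.get('SomeChannel', ignore_invalidation_bits=False)
    print(sig.samples)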