Can't seem to iterate over a column to assign sentiment values - python

I have tried iterating over rows in my dataframe to get sentiment values.
My code is:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
analyzer = SentimentIntensityAnalyzer()
df['Sentiment Values'] = df['Comments'].apply(lambda Comments: analyzer.polarity_scores(Comments))
but it returns
'float' object has no attribute 'encode'
My df is:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 No problem. I’m the same. Good to hold both for sure!
3 I understood most of that. Thank you.
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
EDIT:

I'm not able to reproduce this - probably because the error occurs further down the dataframe than the rows you've shown here.
I'm guessing the issue is that you've got some non-strings (floats, specifically) in your Comments column. Ideally you should examine and remove them, but you can also just convert them to strings before sentiment analysis with .astype(str):
df['Sentiment Values'] = df['Comments'].astype(str).apply(lambda Comments: analyzer.polarity_scores(Comments))
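If you want to inspect the offending rows first (a minimal sketch, assuming the column really is named Comments as above), you can filter on type before deciding whether to drop or convert them:
# rows whose Comments value is not a string (e.g. NaN, which pandas stores as a float)
non_strings = df[~df['Comments'].apply(lambda c: isinstance(c, str))]
print(non_strings)
# drop them ...
df = df[df['Comments'].apply(lambda c: isinstance(c, str))]
# ... or keep them and coerce to str as shown above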

I was able to execute the code using a .csv file adapted from your data.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
analyzer = SentimentIntensityAnalyzer()
df = pd.read_csv('df.csv', delimiter='_')
df['Sentiment Values'] = df['Comments'].apply(lambda Comments: analyzer.polarity_scores(Comments))
Note I've used an underscore as delimiter, which may not be totally robust depending on your input data.
In the dataframe, there is only a single column of comments. It looks like you may have included indices (1,2,3,4...) which could be the source of your error.
df.csv:
Comments
The main thing is the price appreciation of the token (this determines the gains or losses more than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities and protocols that accept the asset as collateral, the better. Finally, the yield for staking comes into play.
No problem. I’m the same. Good to hold both for sure!
I understood most of that. Thank you.
I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then you'll know for sure. Peace of mind has value too.
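As a follow-up sketch (not part of the original answer): polarity_scores returns a dict of neg/neu/pos/compound scores, so if you want them as separate columns you can expand the result, for example:
import pandas as pd

scores = df['Comments'].astype(str).apply(analyzer.polarity_scores)
# turn the list of score dicts into one column per score, aligned on the index
df = df.join(pd.DataFrame(scores.tolist(), index=df.index))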

Related

How do I query more than one column in a data frame?

I'm taking a Data Science class that uses Python, and this is a question that stumped me today: "How many babies are named “Oliver” in the state of Utah for all years?"
To answer this question we were supposed to use data from this set https://raw.githubusercontent.com/byuidatascience/data4names/master/data-raw/names_year/names_year.csv
So I started by loading in pandas.
import pandas as pd
Then I loaded in the data set and created a data frame
url='https://raw.githubusercontent.com/byuidatascience/data4names/master/data-raw/names_year/names_year.csv'
names=pd.read_csv(url)
Finally, I used the .query() method to single out the rows that I wanted, the name Oliver.
oliver=names.query("name == 'Oliver'")
I eventually found the total number of babies that had been named Oliver in Utah using this code
total=pd.DataFrame.sum(quiz)
print(total)
but I wasn't sure how to single out the data for both the name and the state, or if that is even possible. Is there anyone out there that knows of a better way to find this answer?
You have all the code there; you just need one more line to sum according to the state:
print(oliver.UT.sum()) # this will give you the total for the state of UTAH
and forget about the quiz.
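Putting both filters together (a minimal sketch, assuming the state columns are two-letter abbreviations like UT, as the answer above implies):
import pandas as pd

url = 'https://raw.githubusercontent.com/byuidatascience/data4names/master/data-raw/names_year/names_year.csv'
names = pd.read_csv(url)

# filter the rows by name, then sum only the Utah column
oliver_in_utah = names.query("name == 'Oliver'")['UT'].sum()
print(oliver_in_utah)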

How do I efficiently calculate the mean of nested subsets of Vaex dataframes?

I have a very large dataset comprised of data for several dozen samples, and several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share as it is part of an active research project; I hope to open-source it, but that will depend on IP rules in the agreement) and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each iteration, it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i]) and uses unique() on that subset to get a list of subsamples. Then, it uses another for loop to repeat that process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[df["sample"] == sample_number]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[sample["subsample"] == subsample_number]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try and diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and other machinery. If I run:
mean = df["value"].mean()
it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if instead I run:
sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()
The program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong, so I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious about what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
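Applied to the column names from the question (a sketch, assuming the dataframe really has 'sample', 'subsample' and 'value' columns), a single grouped aggregation replaces both loops:
import vaex

# one pass over the data instead of one mean() call per subsample
means = df.groupby(by=['sample', 'subsample'],
                   agg={'value_mean': vaex.agg.mean('value')})
print(means)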

Why is data lost when multi-label encoding in Python pandas, and how do I solve it?

Download the Data Here
Hi, I have data something like below, and would like to multi-label it.
Something like this: [target screenshot]
But the problem is that data is lost when I multi-label it, something like below:
[issue screenshot]
Using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', 1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset, thank you.
That column, when read in with pandas, will be stored as a string. So first we need to convert it to an actual list.
From there, use .explode() to expand that list into a series (where the index will match the index it came from, and the values will be the values in that list).
Then use crosstab to pivot that series so that each unique value becomes its own indicator column.
Then join that back up with the dataframe on the index values.
Keep in mind, when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns. With the 225,000+ rows, it'll take a while (maybe a minute or so) to process, and you end up with close to 1,300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it a bit: perhaps combine movie ids into a set number of genres or something like that, and then test whether simplifying it improves your model's performance.
import pandas as pd
from ast import literal_eval

df = pd.read_csv('ratings_action.csv')
# the movieId column is read in as a string, so parse it into an actual list
df.movieId = df.movieId.apply(literal_eval)

# explode each list into one row per movie id, keeping the original index
s = df['movieId'].explode()
# crosstab pivots those values into indicator columns, joined back on the index
df = df[['userId']].join(pd.crosstab(s.index, s))
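Since the question imports MultiLabelBinarizer, here is an alternative sketch (assuming the same ratings_action.csv layout) that does the encoding with scikit-learn once the column has been parsed into lists:
import pandas as pd
from ast import literal_eval
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)

mlb = MultiLabelBinarizer()
# one indicator column per movie id, aligned with the original index
encoded = pd.DataFrame(mlb.fit_transform(df['movieId']),
                       columns=mlb.classes_, index=df.index)
df_enc = df[['userId']].join(encoded)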

Splitting a single large csv file to resample by two columns

I am doing a machine learning project with phone sensor data (accelerometer). I need to preprocess the dataset before I export it to the ML model. I have 25 classes (alphabets in the dataset) and there are 20 subjects (recordings of the alphabet) for each class. Since the lengths are different for each class and subject, I have to resample. I want to split the single csv file by class and subject to be able to resample. I have tried things like groupby(), but they did not work. I would be glad if you could share thoughts on what I can do about this problem. This is my first time asking a question on this site; if I made a mistake, I would appreciate it if you pointed it out. Thank you in advance.
I'm sharing some code and outputs to help you understand my question better.
This is what I got when I tried groupby(), but it is not exactly what I wanted:
This is what my csv file looks like. It contains more than 300,000 rows of data.
Some code snippet:
import pandas as pd
import numpy as np
def read_data(file_path):
    data = pd.read_csv(file_path)
    return data
# read csv file
dataset = read_data('raw_data.csv')
df1 = pd.DataFrame( dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)
I also need to do this for x_axis, y_axis and z_axis, so what can I use other than the groupby() function? I do not want to use only the lengths, but also the values of all three axes, to be able to resample.
First, calculate the largest number of samples available in every group (i.e. the smallest group size):
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()
Now you can sample
df.groupby(['alphabet', 'subject']).sample(num_sample)
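Putting it together (a sketch, assuming raw_data.csv has alphabet, subject, x_axis, y_axis and z_axis columns as in the question):
import pandas as pd

df = pd.read_csv('raw_data.csv')

# smallest group size across all (alphabet, subject) pairs
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()

# draw that many rows from every group, keeping x_axis, y_axis and z_axis together
resampled = (df.groupby(['alphabet', 'subject'], group_keys=False)
               .sample(n=num_sample, random_state=0))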

Convert prices with daily rate in pandas

I'm currently trying to normalize a DataFrame (~600k rows) with prices (pricevalue) in different currencies (pricecurrency) so that every row has its price in EUR.
I'd like to convert them with the daily rate taken from a column date.
My current "solution" (using the CurrencyConverter package found on PyPI) looks like this:
from currency_converter import CurrencyConverter
c = CurrencyConverter(fallback_on_missing_rate=True, fallback_on_missing_rate_method="last_known")

def convert_currency(row):
    return c.convert(row["pricevalue"], row["pricecurrency"], row["date"])

df["converted_eur"] = df.apply(lambda x: convert_currency(x), axis=1)
However, this solution is taking forever to run.
Is there a faster way to accomplish that? Any help is appreciated :)
It sounds strange to say this, but unfortunately you're not doing anything wrong!
The currency conversion code is doing what you need it to do, and not much else, so there's nothing you can quickly fix to get better performance. You do wrap the function in a lambda where you could pass it directly, but that won't make much of a difference:
i.e.
df["converted_eur"] = df.apply(lambda x: convert_currency(x),axis=1)
should be
df["converted_eur"] = df.apply(convert_currency, axis=1)
The first thing to do is to understand how long this processing will actually take by adding a progress bar:
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply() on pandas objects
df["converted_eur"] = df.progress_apply(convert_currency, axis=1)
Once you know how long the job will actually take, try these options, in order:
1. Live with it.
2. Single-instance parallelization, with something like pandarallel (see the sketch below).
3. Multi-instance parallelization, with something like Dask.
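A minimal pandarallel sketch (assuming the same df and convert_currency as above; pandarallel is a third-party package that spreads apply across CPU cores):
from pandarallel import pandarallel

# start one worker per CPU core and show a per-worker progress bar
pandarallel.initialize(progress_bar=True)

df["converted_eur"] = df.parallel_apply(convert_currency, axis=1)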
