Implement a text classifier with Python

I'm trying to implement a Persian text classifier with Python, and I use Excel to read my data and build my data set.
I'd be thankful for any suggestions about a better implementation.
I tried the code below to access the body of the messages that match my conditions and store them. I took a screenshot of my Excel file to help explain.
For example, I want to store the body of every message whose "foolish" column (column F) has the value 1 (true).
https://ibb.co/DzS1RpY "screenshot"
import pandas as pd
file='1.xlsx'
sorted=pd.read_excel(file,index_col='foolish')
var=sorted[['body']][sorted['foolish']=='1']
print(var.head())
The expected result is the body of rows 2, 4, 6 and 8.

Try assigning like this:
df_data = df["body"][df["foolish"] == 1.0]
Don't use -, which is a Python operator; use _ (underscore) instead.
Also note that this returns a Series.
For a DataFrame, use:
df_data = pd.DataFrame(df['body'][df["foolish"] == 1.0])
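To illustrate the fix, here is a minimal self-contained sketch. The rows are made up to stand in for the Excel file from the screenshot; only the "body" and "foolish" column names come from the question:

```python
import pandas as pd

# Hypothetical data mirroring the screenshot: a numeric "foolish" flag and a "body" column.
df = pd.DataFrame({
    "body": ["msg1", "msg2", "msg3", "msg4"],
    "foolish": [0, 1, 0, 1],
})

# Compare against the number 1, not the string '1', since the column is numeric.
# Selecting with a list of columns keeps the result a DataFrame rather than a Series.
flagged = df.loc[df["foolish"] == 1, ["body"]]
print(flagged)
```

The same comparison against '1' (a string) would match nothing, which is why the original code printed an empty result.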

Related

Python: show rows if there's certain keyword from the list and show what was the detected keyword

I was trying to get a data frame of spam messages so I can analyze them. This is what the original CSV file looks like:
I want it to look like this:
This is what I tried:
# import the original CSV (a simplified sample with only two columns: sender, text)
import pandas as pd
df = pd.read_csv("spam.csv")
# if any of these keywords is in the text column, put that row in the new data frame
keyword = ["prize", "bit.ly", "shorturl"]
# put rows that contain a keyword into a new data frame
spam_list = df[df['text'].str.contains('|'.join(keyword))]
# create a new column 'detected word' to show which keyword was detected
spam_list['detected word'] = keyword
spam_list
However, "detected word" just repeats the list in order.
I know it's because I assigned the list to the new column, but I couldn't think of or find a better way to do this. Should I have used a for loop? Or am I approaching it in a totally wrong way?
You can define a function that gets the result for each row:
def detect_keyword(row):
    for key in keyword:
        if key in row['text']:
            return key
Then apply it to every row with pandas.apply() and save the results as a new column:
df['detected_word'] = df.apply(lambda x: detect_keyword(x), axis=1)
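A small self-contained demo of this approach, with made-up sender/text rows standing in for spam.csv:

```python
import pandas as pd

keyword = ["prize", "bit.ly", "shorturl"]

def detect_keyword(row):
    # Return the first keyword found in this row's text, or None if none match.
    for key in keyword:
        if key in row["text"]:
            return key
    return None

# Hypothetical sample rows in place of the real CSV.
df = pd.DataFrame({
    "sender": ["a", "b", "c"],
    "text": ["win a prize now", "click bit.ly/xyz", "hello friend"],
})
df["detected_word"] = df.apply(detect_keyword, axis=1)
print(df)
```

Rows with no matching keyword get None, so you can still filter them out afterwards with df["detected_word"].notna().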
You can use the code shown in the linked picture to solve the stated problem; I wasn't able to paste the code inline because Stack Overflow wouldn't let me post the short links. The code has been adapted from here.

Reading and writing an Excel sheet using pandas in Python: should I use append or concat, and how?

I'm writing a small script that reads the id of an episode from an Excel sheet and fills in its corresponding series name. Here's an example of the Excel sheet used as input.
My script reads the "tconst" value, uses it to find the corresponding episode on IMDb, gets the website title, and uses that to find the name of the series:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

dataset_loc = 'C:\\Users\\Ghandy\\Documents\\Datasets\\Episodes with over 1k ratings 2020+Small.xlsx'
dataset = pd.read_excel(dataset_loc)
for tconst in dataset['tconst']:
    url = 'https://www.imdb.com/title/{}/'.format(tconst)
    soup = BeautifulSoup(urlopen(url), features="lxml")
    dataset = dataset.append({"Name": re.findall(r'"([^"]*)"', soup.title.get_text())[0]}, ignore_index=True)
dataset.to_excel(dataset_loc, index=False)
I have a few problems with this code. First, pandas keeps telling me not to use append and to use concat instead, but all the answers on Google and Stack Overflow give examples with append, and I don't know how exactly to use concat.
Second, my data is being appended into a completely new, empty row instead of next to the original data it belongs to, so in this example I get "The Mandalorian" at row 4 instead of row 2.
Finally, I want to know whether it's better to add the data one row at a time or to collect it all in a temporary list and add it at once, and how I would do that with concat.
I can't really tell exactly what your problem with append and concat consists in. Here is a post on the difference between concat and append.
append adds whole new rows; to fill in a value next to existing data you might want to use .at instead.
As for adding one at a time versus collecting first: it depends on how much data you already have and how much you are going to add. For less overhead and copying I would write directly into the dataframe, but if a lot happens between the URL call and the write, the collected version could be better.
Thanks to #Stimmot, using .at the code now looks like this:
for index, tconst in enumerate(dataset['tconst']):
    url = 'https://www.imdb.com/title/{}/'.format(tconst)
    soup = BeautifulSoup(urlopen(url), features="lxml")
    dataset.at[index, 'Name'] = re.findall(r'"([^"]*)"', soup.title.get_text())[0]
dataset.to_excel(dataset_loc)
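To illustrate the "collect into a temporary list first" variant asked about in the third question, here is a sketch. scrape_title is a hypothetical stand-in for the urlopen/BeautifulSoup lookup, and the tconst values are invented:

```python
import pandas as pd

def scrape_title(tconst):
    # Stand-in for the real IMDb scrape; returns a canned name per id.
    return {"tt0001": "The Mandalorian"}.get(tconst, "Unknown")

dataset = pd.DataFrame({"tconst": ["tt0001", "tt9999"]})

# Collect all names in one pass, then assign the whole column at once,
# so each name lands on the row it belongs to instead of a new empty row.
names = [scrape_title(t) for t in dataset["tconst"]]
dataset["Name"] = names
print(dataset)
```

Assigning a full column this way avoids both the deprecated append and row-by-row .at writes, at the cost of holding the scraped names in memory first.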

Why is data lost when multi-label encoding in Python pandas, and how do I solve it?

Download the Data Here
Hi, I have data like the sample below and would like to multi-label it,
something like this: target
But the problem is that data is lost when I multi-label it, as shown below:
issue
using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', 1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset, thank you.
When read in with pandas, that column is stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand the list into a Series (the index will match the row it came from, and the values will be the items in the list).
Then crosstab that Series so each list value becomes a column.
Then join the result back to the dataframe on the index values.
Keep in mind that one-hot encoding a high-cardinality column blows the table up into a very wide one. I did this on just the first 20 rows and ended up with 233 columns; with the 225,000+ rows it will take a while (maybe a minute or so) to process, and you end up with close to 1,300 columns. That may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it, perhaps by combining movie ids into a set number of genres or something like that, and then test whether simplifying improves your model's performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)
s = df['movieId'].explode()
df = df[['userId']].join(pd.crosstab(s.index, s))
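Here is the same explode/crosstab pipeline on a tiny hand-made frame. The genre-style labels are invented for illustration; the real file uses lists of movie ids:

```python
import pandas as pd

# Hypothetical frame where each row already holds a real Python list of labels.
df = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [["Action", "Comedy"], ["Action"]],
})

# explode: one row per (original index, label) pair.
s = df["movieId"].explode()

# crosstab of index vs. label gives the one-hot matrix; join restores userId.
encoded = df[["userId"]].join(pd.crosstab(s.index, s))
print(encoded)
```

Because crosstab is built from the exploded Series' original index, the join lines every indicator row up with the row it came from, so no data is lost.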

Convert csv column into list using pandas

I'm currently working on a project that takes a CSV list of the student names who attended a meeting and converts it into a Python list (later to be compared to the full student roster list, but one thing at a time). I've been looking for answers for hours but I still feel stuck. I've tried both pandas and the csv module. I'd like to stick with pandas, but if it's easier with the csv module that works too. CSV file example and code below.
The file is autogenerated by our video call software- so the formatting is a little weird.
Attendance.csv
see sample as image, I can't insert images yet
Code:
import pandas

data = pandas.read_csv("2A Attendance Report.csv", header=3)
AttendanceList = data['A'].to_list()
print(str(AttendanceList))
However, this is raising KeyError: 'A'
Any help is really appreciated, thank you!!!
As seen in the sample image, the column headers are in the first row itself. Hence you need to remove header=3 from your read_csv call: either replace it with header=0 or don't specify an explicit header value at all.
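A minimal sketch of the fix, using an in-memory CSV via io.StringIO instead of the real attendance file (the Name/Duration columns are assumptions about the layout):

```python
import io
import pandas as pd

# Hypothetical file whose headers are on the very first line, as in the screenshot.
csv_text = "Name,Duration\nAlice,45\nBob,50\n"

# header=0 is the default, so no header argument is needed at all.
data = pd.read_csv(io.StringIO(csv_text))
attendance_list = data["Name"].to_list()
print(attendance_list)
```

With header=3 pandas would instead treat the fourth line as the header row, which is why column 'A' raised a KeyError.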

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, which is a matrix in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized for processing time.
Here is the official documentation, take a look.
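For example, with a small hypothetical JSON string in place of the real file (wrapped in io.StringIO, which read_json accepts like a file):

```python
import io
import pandas as pd

# Invented records standing in for the real JSON file.
json_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# read_json turns each top-level field into a DataFrame column.
df = pd.read_json(io.StringIO(json_text))
print(df)
```

Deeply nested fields like the discord block stay as dict-valued cells with plain read_json; pandas.json_normalize can flatten those into dotted column names if needed.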
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
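The same attribute path can be sketched with the standard json module on a hypothetical payload shaped like the object described above (the "1234" value is invented):

```python
import json

# Hypothetical JSON shaped like the nested object in the question.
raw = '{"json_data": {"attributes": {"social_connections": {"discord": {"user_id": "1234"}}}}}'

obj = json.loads(raw)  # parses into nested dicts
user_id = obj["json_data"]["attributes"]["social_connections"]["discord"]["user_id"]
print(user_id)
```

json.loads gives plain dicts, so the path is walked with ["key"] lookups rather than attribute access; dot access only works if the library wraps the JSON in objects.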
