Delete text that is before a character - python

I'm trying to keep only the text from "Background" onward, but I haven't managed to do it. For instance, I have a comment like this:
05/2022: AB: 6/20/22 - I'm learning how to use pandas library.
Background: I'm trying to learn python.
How can I make all cells have only the background comment? It should look like this:
Background: I'm trying to learn python.
Please see my code below:
import pandas as pd
df = pd.read_excel(r"C:\Users\R\Desktop\PythonLib\data\52022.xlsx")
comments = df["Comment"]
df['new_background'] = df["Comment"].str.split('Background:').str[0]
print(df["new_background"])

You should provide a sample of your data.
That said, you should probably do:
df['new_background'] = df["Comment"].str.replace(r'.*(?=Background:)',
'', regex=True)
Or, if you want NaN in case of missing background:
df['new_background'] = df["Comment"].str.extract(r'(Background:.*)')
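One caveat, assuming the comment cell contains a line break between the dated note and the Background line (as the example in the question suggests): by default `.` does not match a newline, so the patterns above would leave text on earlier lines in place. A minimal sketch with made-up sample data, using the inline `(?s)` flag so that `.` also matches newlines:

```python
import pandas as pd

# Hypothetical sample data standing in for the Excel file in the question.
df = pd.DataFrame({
    "Comment": [
        "05/2022: AB: 6/20/22 - I'm learning how to use pandas library.\n"
        "Background: I'm trying to learn python.",
        "No background in this one.",
    ]
})

# (?s) makes "." match newlines too, so text on earlier lines is removed as well.
df["new_background"] = df["Comment"].str.replace(
    r"(?s).*(?=Background:)", "", regex=True
)

# extract gives NaN when "Background:" is missing; expand=False returns a Series.
df["background_or_nan"] = df["Comment"].str.extract(
    r"(?s)(Background:.*)", expand=False
)

print(df[["new_background", "background_or_nan"]])
```

With `replace`, a comment that has no "Background:" is left unchanged; with `extract` it becomes NaN.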

Related

how to delete commas in whole DataFrame using pandas or python

I'm a complete newbie to this kind of programming.
I studied philosophy and economics, and I'm trying to learn python to build a web crawler for my own investment strategy.
I'm from South Korea, so I'm quite nervous about typing English here, but I'm trying to be brave! (Please excuse my ugly English.)
This is the DataFrame that I got from the website.
I'm crawling financial data and, as you might see, the numbers have commas in them.
Their dtype is object.
What I want to do is convert them to integers so I can do some math (sum, multiplication, etc.).
I searched (including Korean web sites) and found a way to do it using column names,
like this code:
cols = ['col1', 'col2', ..., 'colN']
df[cols] = df[cols].replace({'\$': '', ',': ''}, regex=True)
But what I need is to do it regardless of the column names.
I need data for over 2000 companies, and the column names differ from company to company.
I'd like to write code like
"Delete ',' in cols, cols from col#0 to col#end"
Thanks in advance.
The first thing you can do is split the columns of the data frame by dtype and give each group the processing it needs.
object_list = list(df.select_dtypes(include="object"))
float_list = list(df.select_dtypes(include="float64"))
int_list = list(df.select_dtypes(include="int64"))
Then replace whatever you need (regex=True is required so that commas inside a value are replaced, not only cells that are exactly ","):
df[object_list] = df[object_list].replace(",", "", regex=True)
df[float_list] = df[float_list].astype(str)  # so that you can replace easily
df[float_list] = df[float_list].replace(",", "", regex=True)
df[float_list] = df[float_list].astype(float)  # now it's clean and float again
df[int_list] = df[int_list].astype(str)
df[int_list] = df[int_list].replace(",", "", regex=True)
df[int_list] = df[int_list].astype(int)  # back to int
Based on this answer, you can get the list of column names, store it in a variable, and pass that variable wherever you would otherwise write the list of columns. There are other things to keep in mind as well. According to the documentation, replace is a method that is applied to the DataFrame, and you might get errors if you do something like df = df.replace(). Finally, the number formatting might be visual only. Can you not work with the data as it is? A conversion might help, but it might also not be an issue at all if you simply want to work with the data. Another idea would be to convert the numbers to strings and replace the commas with spaces, if need be. This answer might help you with that.
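For the original goal (remove the commas without knowing any column names and end up with numbers), here is a minimal sketch. The DataFrame and its column names are made up for illustration, and it assumes that every non-numeric column holds only digits and commas:

```python
import pandas as pd

# Hypothetical stand-in for one company's scraped table; column names are arbitrary.
df = pd.DataFrame({
    "Revenue": ["1,234", "5,678"],
    "Net income": ["12,345,678", "90"],
    "Year": [2017, 2018],
})

# Loop over whatever columns exist; only the non-numeric ones need cleaning.
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        # Strip the commas as plain text, then convert the column to numbers.
        df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

print(df.dtypes)
print(df.sum())
```

pd.to_numeric raises an error if a column still contains non-numeric text after the commas are removed; passing errors="coerce" would turn such values into NaN instead, which may or may not be what you want.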

How would I be able to remove this part of the variable?

I am writing a guessing game. The data for the game is in a CSV file, so I decided to use pandas. I have tried to use pandas to import my CSV file, pick a random row, and put the data into variables so I can use it in the rest of the code, but I can't figure out how to format the data in the variables correctly.
I've tried to split the string with split(), but I am quite lost.
ar = pandas.read_csv('names.csv')
ar.columns = ["Song Name","Artist","Intials"]
randomsong = ar.sample(1)
songartist = randomsong["Artist"]
songname = (randomsong["Song Name"])
songintials = randomsong["Intials"]
print(songname)
My CSV file looks like this.
Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
Wow,Post Malone, W
I expect the output to be the name of the song from the csv file. For Example
Bad Guy
Instead the output is
1 Bad Guy
Name: Song Name, dtype:object
If anyone knows the solution please let me know. Thanks
You're getting a Series object as output. You can try
randomsong["Song Name"].to_string(index=False)
Use df['column'].values to get the values of the column.
In your case, songartist = randomsong["Artist"].values[0], because you want only the first element of the returned array.
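Another way to get plain strings, sketched with the CSV from the question inlined so it runs without a file: take the single sampled row with iloc[0] and index it by column name.

```python
import io
import pandas as pd

# The CSV from the question, inlined so the sketch runs without names.csv.
csv_text = """Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
Wow,Post Malone, W
"""

ar = pd.read_csv(io.StringIO(csv_text))

# sample(1) returns a one-row DataFrame; iloc[0] turns it into that single row (a Series).
row = ar.sample(1).iloc[0]

# Indexing the row by column name gives plain Python strings.
songname = row["Song Name"]
songartist = row["Artist"]
songintials = row["Intials"]

print(songname)
```

This prints only the song name, without the index or the "Name: ..., dtype: ..." footer.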

Unable to run Stanford Core NLP annotator over whole data set

I have been trying to use Stanford Core NLP over a data set but it stops at certain indexes which I am unable to find.
The data set is available on Kaggle: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data
This is a function that outputs the sentiment of a paragraph by taking the mean sentiment value of individual sentences.
import json
def funcSENT(paragraph):
    all_scores = []
    output = nlp.annotate(paragraph, properties={
        "annotators": "tokenize,ssplit,parse,sentiment",
        "outputFormat": "json",
        # Only split the sentence at End Of Line. We assume that this method only takes in one single sentence.
        #"ssplit.eolonly": "true",
        # Setting enforceRequirements to skip some annotators and make the process faster
        "enforceRequirements": "false"
    })
    all_scores = []
    for i in range(0,len(output['sentences'])):
        all_scores.append((int(json.loads(output['sentences'][i]['sentimentValue']))+1))
    final_score = sum(all_scores)/len(all_scores)
    return round(final_score)
Now I run this code for every review in the 'Reviews' column using this code.
import pandas as pd
data_file = 'C:\\Users\\SONY\\Downloads\\Amazon_Unlocked_Mobile.csv'
data = pd.read_csv( data_file)
from pandas import *
i = 0
my_reviews = data['Reviews'].tolist()
senti = []
while(i<data.shape[0]):
    senti.append(funcSENT(my_reviews[i]))
    i=i+1
But somehow I get this error and I am not able to find the problem. It's been many hours now; kindly help.
[1]: https://i.stack.imgur.com/qFbCl.jpg
How to avoid this error?
As I understand, you're using pycorenlp with nlp=StanfordCoreNLP(...) and a running StanfordCoreNLP server. I won't check the data you are using since it appears to require a Kaggle account.
Running the same setup on a different paragraph and printing "output" by itself shows an error from the Java server; in my case:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Input word not tagged
I THINK that because there is no part-of-speech annotator, the server cannot perform the parsing. Whenever you use parse or depparse, I think you need to have the "pos" annotator as well.
I am not sure what the sentiment annotator needs, but you may need other annotators such as "lemma" to get good sentiment results.
Print output by itself. If you get the same java error, try adding the "pos" annotator to see if you get the expected json. Otherwise, try to give a simpler example, using your own small dataset maybe, and comment or adjust your question.
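A minimal sketch of that check, so that one bad review does not stop the whole loop. It assumes, in line with what printing output shows above, that a failed request comes back as a plain error string rather than a dict. The names PROPERTIES and mean_sentiment are made up for illustration, the annotator list is just the question's list with "pos" added, and the two sample outputs are stand-ins for what nlp.annotate might return:

```python
# Assumed annotator list: the one from the question with "pos" added before "parse".
PROPERTIES = {
    "annotators": "tokenize,ssplit,pos,parse,sentiment",
    "outputFormat": "json",
}

def mean_sentiment(output):
    """Return the rounded mean of (sentimentValue + 1), or None if output is unusable."""
    # A failed request comes back as plain text (the Java error), not a dict.
    if not isinstance(output, dict):
        return None
    scores = [int(s["sentimentValue"]) + 1 for s in output.get("sentences", [])]
    if not scores:
        return None  # no sentences (e.g. an empty review): avoid dividing by zero
    return round(sum(scores) / len(scores))

# Stand-ins for what nlp.annotate(review, properties=PROPERTIES) might return.
ok_output = {"sentences": [{"sentimentValue": "3"}, {"sentimentValue": "1"}]}
error_output = (
    "java.util.concurrent.ExecutionException: "
    "java.lang.IllegalArgumentException: Input word not tagged"
)

print(mean_sentiment(ok_output))
print(mean_sentiment(error_output))
```

In the loop from the question you could then record the index of every review for which the result is None and inspect those reviews separately.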

Apply Number formatting to Pandas HTML CSS Styling

In Pandas, there is a new styler option for formatting CSS ( http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.core.style.Styler.html ).
Before, when I wanted to make my numbers into accounting/dollar terms, I would use something like below:
df = pd.DataFrame.from_dict({'10/01/2015': {'Issued': 200}}, orient='index')
html = df.to_html(formatters={'Issued': format_money})
format_money function:
def format_money(item):
    return '${:,.0f}'.format(item)
Now I want to use the Style options, and keep my $ formatting. I'm not seeing any way to do this.
Style formatting for example would be something like this:
s = df.style.bar(color='#009900')
#df = df.applymap(config.format_money) -- Doesn't work
html = s.render()
This would add bars to my HTML table (docs here: http://pandas.pydata.org/pandas-docs/stable/style.html).
So basically, how do I do something like add the bars, and keep or also add in the dollar formatting to the table? If I try to do it before, the Style bars don't work because now they can't tell that the data is numerical and it errors out. If I try to do it after, it cancels out the styling.
That hasn't been implemented yet (version 0.17.1) - but there is a pull request for that (https://github.com/pydata/pandas/pull/11667) and should come out in 0.18. For now you have to stick to using the formatters.

Process string using sphinx

I was wondering whether it is possible to process a string in python, using sphinx. Basically, I would like to pass a string containing restructured text to sphinx, and then generate the corresponding HTML output. I meant something like this
import sphinx.something
s = 'some restructured text'
html_out = sphinx.something(s)
However, I could not find anything along these lines. So, is this possible, and if so, how would one do this?
The quickest solution I've seen is:
from docutils.examples import html_body
def rst(rstStr):
    return html_body(rstStr, input_encoding='utf-8',
                     output_encoding='utf-8').strip()
I'd be interested myself in better solutions..
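An alternative sketch that also goes through docutils rather than Sphinx itself, using docutils.core.publish_parts. The function name rst_to_html is made up for illustration, and as far as I know Sphinx-specific roles and directives are not handled this way, only plain reStructuredText:

```python
from docutils.core import publish_parts

def rst_to_html(rst_text):
    """Render a reStructuredText string to an HTML body fragment."""
    # publish_parts returns a dict of document pieces; "html_body" is the body markup.
    parts = publish_parts(source=rst_text, writer_name="html")
    return parts["html_body"]

html_out = rst_to_html("some *restructured* text")
print(html_out)
```

The result is a str, so no encoding arguments are needed.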
