How to display the full-text of a column in Pandas - python

I have a data frame that contains a column with long texts.
To demonstrate how it looks (note the ellipses "..." where text should continue):
id text group
123 My name is Benji and I ... 2
The above text is actually longer than that phrase. For example it could be:
My name is Benji and I am living in Kansas.
The actual text is much longer than this.
When I try to subset the text column only, it only shows the partial text with the dots "...".
I need to make sure the full text is shown for text summarization later.
But I'm not sure how to show the full text when selecting the text column.
My df['text'] output looks something like this:
1 My name is Benji and I ...
2 He went to the creek and ...
How do I show the full text and without the index number?

You can use pd.set_option with display.max_colwidth to display automatic line-breaks and multi-line cells:
display.max_colwidth : int or None
The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a “…” placeholder is embedded in the output. A ‘None’ value means unlimited. [default: 50]
So in your case:
pd.set_option('display.max_colwidth', None)
For older versions, like version 0.22, use -1 instead of None
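The question also asks for output without the index. A minimal sketch of one way to do that, assuming pandas 1.0 or later (where None is accepted) and a made-up two-row frame standing in for the real data: the option is lifted only temporarily with pd.option_context, and to_string(index=False) drops the index.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's frame.
df = pd.DataFrame({
    "id": [123, 124],
    "text": [
        "My name is Benji and I am living in Kansas, and this sentence keeps going well past fifty characters.",
        "He went to the creek and stayed there until the sun had set behind the hills.",
    ],
    "group": [2, 3],
})

# Lift the column-width limit only inside this block; the previous
# setting is restored when the block exits.
with pd.option_context("display.max_colwidth", None):
    # index=False leaves out the index labels.
    full_text = df["text"].to_string(index=False)

print(full_text)
```

Shorter values may be padded with spaces so the lines align, so strip each line if the exact text is needed.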

You can convert to a list and join with newlines ("\n"):
import pandas as pd
text = """The bullet pierced the window shattering it before missing Danny's head by mere millimeters.
Being unacquainted with the chief raccoon was harming his prospects for promotion.
There were white out conditions in the town; subsequently, the roads were impassable.
The hawk didn’t understand why the ground squirrels didn’t want to be his friend.
Nobody loves a pig wearing lipstick."""
df = pd.DataFrame({"id": list(range(5)), "text": text.splitlines()})
Original output:
print(df["text"])
Yields:
0 The bullet pierced the window shattering it be...
1 Being unacquainted with the chief raccoon was ...
2 There were white out conditions in the town; s...
3 The hawk didn’t understand why the ground squi...
4 Nobody loves a pig wearing lipstick.
Desired output:
print("\n".join(df["text"].to_list()))
Yields:
The bullet pierced the window shattering it before missing Danny's head by mere millimeters.
Being unacquainted with the chief raccoon was harming his prospects for promotion.
There were white out conditions in the town; subsequently, the roads were impassable.
The hawk didn’t understand why the ground squirrels didn’t want to be his friend.
Nobody loves a pig wearing lipstick.

Related

Drop 'near' duplicates with 80% or 90% match in pandas dataframe column

I have a dataframe df column which has contents like :
A header  content
First     ['Mary is going to school happily with a big smile in her face wearing a bright blue uniform']
Second    ['Ramy is going to school happily with a big smile in her face in a prussian blue jeans']
Third     ['Mary is going to college happily with a big smile in her face wearing a bright blue uniform']
I want to drop duplicates that are about 90 percent similar and keep only the first occurrence.
Is it possible to incorporate drop_duplicates in this scenario without a fuzzywuzzy-style matcher?
My desired output should be:
A header  content
First     ['Mary is going to school happily with a big smile in her face wearing a bright blue uniform']
Second    ['Ramy is going to school happily with a big smile in her face in a prussian blue uniform']
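drop_duplicates only matches exact values, so some similarity measure is needed. A minimal sketch using the standard library's difflib instead of fuzzywuzzy, assuming the cells hold plain strings rather than one-element lists; the helper name drop_near_duplicates and the 0.9 threshold are illustrative choices, not part of pandas:

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is less than `threshold` similar to every text kept so far."""
    kept = []
    for candidate in texts:
        is_duplicate = any(
            SequenceMatcher(None, candidate, earlier).ratio() >= threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


rows = [
    "Mary is going to school happily with a big smile in her face wearing a bright blue uniform",
    "Ramy is going to school happily with a big smile in her face in a prussian blue jeans",
    "Mary is going to college happily with a big smile in her face wearing a bright blue uniform",
]

unique_rows = drop_near_duplicates(rows)
for row in unique_rows:
    print(row)
```

Every candidate is compared against all rows kept so far, so the cost grows quadratically with the number of rows; this is a sketch for small frames, not a tuned solution.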

Highlight text in dataframe based on regex pattern

Problem: I have a use case wherein I'm required to highlight the word/words with red font color in a dataframe row based on a regex pattern. I landed upon a regex pattern as it ignores all spaces, punctuation, and case sensitivity.
Source: The original source comes from a csv file. So I'm looking to load it into a dataframe, do the pattern match highlight formatting and output it on excel.
Code: The code helps me with the count of words that match in the dataframe row.
import pandas as pd
import re
df = pd.read_csv("C:/filepath/filename.csv", engine='python')
p = r'(?i)(?<![^ .,?!-])Crust|good|selection|fresh|rubber|warmer|fries|great(?!-[^ .,?!;\r\n])'
df['Output'] = df['Output'].apply(lambda x: re.sub(p, red_fmt.format(r"\g<0>"), x))
Sample Data:
Input
Wow... Loved this place.
Crust is not good.
The selection on the menu was great and so were the prices.
Honeslty it didn't taste THAT fresh.
The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.
The fries were great too.
Output: What I'm trying to achieve.
import re
# Console output color.
red_fmt = "\033[1;31m{}\033[0m"
s = """
Wow... Loved this place.
Crust is not good.
The selection on the menu was great and so were the prices.
Honeslty it didn't taste THAT fresh.
The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.
The fries were great too.
"""
p = r'(?i)(?<![^ \r\n.,?!-])Crust|good|selection|fresh|rubber|warmer|fries|great(?!-[^ .,?!;\r\n])'
print(re.sub(p, red_fmt.format(r"\g<0>"), s))
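One thing worth checking in the pattern above: | has the lowest precedence in a regex, so the lookbehind applies only to Crust and the lookahead only to great. A minimal sketch of grouping the alternatives, where plain \b word boundaries are swapped in for the original lookarounds (an assumption about the intent, which changes what counts as a boundary):

```python
import re

# Console output color, as in the question.
red_fmt = "\033[1;31m{}\033[0m"

keywords = ["Crust", "good", "selection", "fresh", "rubber", "warmer", "fries", "great"]

# (?:...) groups the alternatives so both \b anchors apply to every keyword.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)


def highlight(text):
    """Wrap each whole-word keyword match in the red console format."""
    return pattern.sub(lambda m: red_fmt.format(m.group(0)), text)


print(highlight("Crust is not good."))
print(highlight("Goodness, the crusty bread."))  # no whole-word match, left unchanged
```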

How to extract string that contains specific characters in Python

I'm trying to extract ONLY the one string that contains the $ character. The input is text that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided text is the string you wrote above):
price_start = text.find('$')
price = text[price_start:].split(' ')[0]
This works only if there is a single occurrence of '$', as you said.
Alternatively, you could use a regex:
price = re.findall(r'\S*\$\S*\d', text)[0]
price = price.replace('$', '')
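The sample text also contains a lone "$" ("several hundred $ to purchase"), which a plain search for the first '$' would hit before the price. A small sketch that requires digits directly after the dollar sign, run on a shortened excerpt of the input; the variable names are illustrative:

```python
import re

# Shortened excerpt of the listing, keeping both "$" occurrences.
text = (
    "it was several hundred $ to purchase after the fact.\n"
    "Asking $2000 obo. PayPal shipped. CONUS."
)

# \$ matches a literal dollar sign; the group captures the digits right
# after it, so a lone "$" with no number attached is skipped.
match = re.search(r"\$(\d+)", text)
price = match.group(1) if match else None
print(price)
```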

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
            'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi,w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1,wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi],ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names to know which name corresponds to which id. This would allow to keep your current algorithm as is.
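A minimal sketch of that preprocessing step, run on a short excerpt written in the spirit of the question's text. The PERSON{i}ID id scheme is a made-up convention, and whitespace is normalized first because a name such as 'Lady Burlesdon' can be split across a line break:

```python
import re

# Short excerpt; note the line break inside "Lady\nBurlesdon".
text = """scandal which my dear Lady
Burlesdon wishes forgotten said Rose and Robert spoke of
Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
            'a Knight of the Garter', 'James', 'Lady Burlesdon']

# Collapse all whitespace runs (including newlines) to single spaces.
normalized = " ".join(text.split())

# Single-token ids that should not occur as ordinary words in the text.
id_to_name = {f"PERSON{i}ID": name for i, name in enumerate(involved)}
name_to_id = {name: pid for pid, name in id_to_name.items()}

# Try longer names first so a short name never matches inside a longer one.
longest_first = sorted(involved, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in longest_first) + r")\b")
processed = pattern.sub(lambda m: name_to_id[m.group(0)], normalized)

# The original word-based loop, now operating on ids.
ws = processed.split()
x = 14
pairs = []
for wi, w in enumerate(ws):
    if w not in id_to_name:
        continue
    for other in ws[wi + 1:wi + x]:
        if other in id_to_name:
            pairs.append((id_to_name[w], id_to_name[other]))

for first, second in pairs:
    print(first, second)
```

After the replacement a multi-word name counts as one word, so distances are measured in tokens of the processed text rather than words of the original.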

Pandas unable to parse csv with multiple lines within a cell

I have a csv file Decoded.csv
Query,Doc,article_id,data_source
5000,how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.,459,random
5001,what is hypospadia,A birth defect of the male urethra.,409,dummy
5002,difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.,461,nytimes
There are 3 queries: 5000, 5001 and 5002.
Query 5000 has a Doc value that spans multiple lines, and that confuses pandas.
(1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible)
My Python code is as follows:
def main():
    import pandas as pd
    dataframe = pd.read_csv("Decoded.csv")
    queries, docs = dataframe['Query'], dataframe['Doc']
    for idx in range(len(queries)):
        print("idx: ", idx, " ", queries[idx], " <-> ", docs[idx])
        query_doc_appended = (queries[idx] + " " + docs[idx])
        print(query_doc_appended)

if __name__ == '__main__':
    main()
And it fails. Please show me how to get rid of the newline characters so that Query 5000 has the complete set of statements for Doc.
Your Query 5001 row has too many fields in it: it has 5 fields, while the header declares only 4 columns.
5001,what is hypospadia,A birth defect of the male urethra.,409,dummy
You can double-quote your Doc content within Decoded.csv to get around this.
Two problems:
To allow multiline fields, the field data has to be enclosed in double quotes.
There are also commas in your field data.
So, the csv should look like this:
Query,Doc,article_id,data_source
5000,"how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning.
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.",459,random
5001,"what is hypospadia,A birth defect of the male urethra.",409,dummy
5002,"difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.",461,nytimes
In case there are double quotes inside those fields, they have to be escaped with another double quote.
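These quoting rules do not have to be applied by hand. A minimal sketch using Python's standard csv module, with made-up rows that contain a comma, a newline and a double quote inside one field:

```python
import csv
import io

# Hypothetical rows; the second field of row 5000 has a comma, a newline
# and embedded double quotes.
rows = [
    ["Query", "Doc", "article_id", "data_source"],
    ["5000", 'Line one, with a comma.\nLine two with a "quoted" word.', "459", "random"],
    ["5001", "A birth defect of the male urethra.", "409", "dummy"],
]

# By default csv.writer quotes any field containing a comma, quote or
# newline (QUOTE_MINIMAL) and doubles embedded double quotes.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back recovers the multi-line field as a single cell.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1][1])
```

pandas.read_csv uses the same double-quote character and doubled-quote escaping by default, so a file written this way should load with the multi-line cell intact.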
