My database contains one column of strings, and I want to create a new column based on part of the strings in that column. For example:
"content" "other column"
The father has two dogs father
One cat stay at home of my mother mother
etc. etc.
I thought of creating an array with the words that interest me. For example:
people = ["mother", "father", "etc."]
Then I iterate over the column "content" and extract the word to insert into the new column:
def extract_people(df):
    column = []
    people = ["mother", "father", "etc."]
    for row in df.select("content").collect():
        for word in people:
            if str(row).find(word) != -1:  # find() returns -1 when the word is absent
                column.append(word)
                break
    return pd.Series(column)
df_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe it's because my file is too large; it has 15 million rows. How can I create the new column in a different way?
Using the following DataFrame as an example:
+---------------------------------+
|content |
+---------------------------------+
|Thefatherhas two dogs |
|The fatherhas two dogs |
|Thefather has two dogs |
|Thefatherhastwodogs |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother |
|etc. |
|my feet smell |
+---------------------------------+
You can do the following:
from pyspark.sql import functions
arr = ["father", "mother", "etc."]
expression = (
    "CASE " +
    "".join(["WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr]) +
    "ELSE 'None' END")
df = df.withColumn("other_column", functions.expr(expression))
df.show(truncate=False)
+---------------------------------+------------+
|content |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs |father |
|The fatherhas two dogs |father |
|Thefather has two dogs |father |
|Thefatherhastwodogs |father |
|One cat stay at home of my mother|mother |
|One cat stay at home of mymother |mother |
|Onecatstayathomeofmymother |mother |
|etc. |etc. |
|my feet smell |None |
+---------------------------------+------------+
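Equivalently, here is a sketch of the same logic using the DataFrame API with chained when/otherwise calls instead of a SQL expression string (assuming the same arr as above):

from pyspark.sql import functions as F

arr = ["father", "mother", "etc."]

# Build the fallback first, then wrap each keyword around it so that
# earlier entries in arr take precedence, mirroring the SQL CASE above.
col = F.lit("None")
for val in reversed(arr):
    col = F.when(F.col("content").contains(val), val).otherwise(col)

df = df.withColumn("other_column", col)

Either way, Spark evaluates the expression on the executors, so nothing is collected to the driver and the 15 million rows are not a problem.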
I'm a total Python noob and programming beginner, but I'm trying to analyse some fictional data using apyori in Python for school.
I didn't write most of this program myself, I got it from a Jupyter Notebook my teacher gave me and I understand most of it, except for the actual creation of the data frame we output at the end.
My biggest issue is this though:
If I output the entire rule for one of the rows in my data frame, with
print(association_results[0])
I get this output:
RelationRecord(items=frozenset({'Spiderman 3', 'Moana', 'Tomb Rider'}), support=0.007199040127982935, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Moana', 'Tomb Rider'}), items_add=frozenset({'Spiderman 3'}), confidence=0.20300751879699247, lift=3.0825089038385434)])
If I understand it correctly, that should essentially mean "When someone buys Moana and Tomb Rider, they're likely to also buy Spiderman 3". However, in my DataFrame at the end I only get the output "When someone buys Moana, they're likely to also buy Spiderman 3".
That happens for multiple rows in my data frame, and I couldn't find an example online of anyone having a rule including two items as the "when", so I don't understand how I can output both movies into the data frame.
import pandas as pd
from apyori import apriori

movie_data = pd.read_csv(r"C:\Users\XY\Desktop\movie_dataset.csv", header=None)  # import data
num_records = len(movie_data)

records = []
for i in range(0, num_records):
    records.append([str(movie_data.values[i, j]) for j in range(0, 20)])

for i, j in enumerate(records):  # deletes empty items out of the data frame (this was an issue before,
    while 'nan' in records[i]:   # because it output "when Green Lantern is bought, nothing is bought as well")
        records[i].remove('nan')

association_rules = apriori(records, min_support=0.0053, min_confidence=0.20, min_lift=3, min_length=2)
association_results = list(association_rules)

results = []
for item in association_results:
    # first index of the inner list contains the base item and the add item
    pair = item[0]
    items = [x for x in pair]
    value0 = str(items[0])
    value1 = str(items[1])
    # second index of the inner list is the support
    value2 = str(item[1])[:7]
    # third index of the list located at the 0th position of the
    # third index of the inner list: confidence and lift
    value3 = str(item[2][0][2])[:7]
    value4 = str(item[2][0][3])[:7]
    rows = (value0, value1, value2, value3, value4)
    results.append(rows)

labels = ['When bought', 'Likely bought as well', 'Support', 'Confidence', 'Lift']
movie_suggestion = pd.DataFrame.from_records(results, columns=labels)
print(movie_suggestion)
The output looks like this:
|| When bought| Likely bought as well| Support| Confidence| Lift|
|----|-------------|-------------------------|---------|-----------|---------|
|0 | Red Sparrow| Green Lantern| 0.00573| 0.30069| 0.30069|
|1 |Green Lantern| Star Wars| 0.00586| 0.37288| 0.37288|
|2 |Kung Fu Panda| Jumanji| 0.01599| 0.32345| 0.32345|
|3 | Wonder Woman| Jumanji| 0.00533| 0.37735| 0.37735|
|4 | Spiderman 3| The Spy Who Dumped Me| 0.00799| 0.27149| 0.27149|
|5 | Moana| Spiderman 3| 0.00533| 0.23255| 0.23255|
etc.
Instead of:
|| When bought| Likely bought as well| Support| Confidence| Lift|
|----|-------------|-------------------------|---------|-----------|---------|
|0 | Red Sparrow| Green Lantern| 0.00573| 0.30069| 0.30069|
|1 |Green Lantern| Star Wars| 0.00586| 0.37288| 0.37288|
|2 |Kung Fu Panda| Jumanji| 0.01599| 0.32345| 0.32345|
|3 | Wonder Woman| Jumanji| 0.00533| 0.37735| 0.37735|
|4 | Spiderman 3| The Spy Who Dumped Me| 0.00799| 0.27149| 0.27149|
|5 |Moana, Tomb Rider| Spiderman 3| 0.00533| 0.23255| 0.23255|
I tried looking at all the variables to understand the data frame creation and figure out how to get the output I want, but I don't understand it, and like I said, I didn't find anything matching my issue.
I just realized the table at the end of my question did not work as intended...
BUT I figured out a semi-acceptable answer for now.
Using string slicing for value0 and value1 like this:
value0 = str(item[2][0][0])[11:-2]
value1 = str(item[2][0][1])[11:-2]
And the output looks like this (didn't include confidence, support and lift for this):
|When bought |Likely bought as well|
|----------: |--------------------:|
|'Intern','Tomb Rider'|'World War Z' |
and so on
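For the record, a less brittle sketch than slicing the string representation is to join the frozensets directly; item[2][0] is the first OrderedStatistic shown in the output above, and its items_base and items_add fields hold the "when" and "likely bought as well" sides of the rule:

for item in association_results:
    stat = item[2][0]                    # first OrderedStatistic of this rule
    value0 = ', '.join(stat.items_base)  # e.g. "Moana, Tomb Rider"
    value1 = ', '.join(stat.items_add)   # e.g. "Spiderman 3"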
I have a data frame with long comments and I want to split them into individual sentences using the spaCy sentencizer.
Comments = pd.read_excel('Comments.xlsx', sheet_name = 'Sheet1')
Comments
>>>
reviews
0 One of the rare films where every discussion leaving the theater is about how much you
just had, instead of an analysis of its quotients.
1 Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving,
and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that
re-watchability factor.
I loaded the model like this
import spacy
nlp = spacy.load("en_core_web_sm")
And I'm using the sentencizer:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x: list(nlp(x).sents))
But when I check, the sentences of each comment are all in just one row, like this:
[One of the rare films where every discussion leaving the theater is about how much you just had.,
Instead of an analysis of its quotients.]
Thanks a lot for any help. I'm new to using NLP tools with DataFrames.
Currently, Data is a Series whose rows are lists of sentences (more precisely, lists of spaCy Span objects). You probably want to obtain the text of these sentences and put each sentence on a different row.
import pandas as pd

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]
comments = pd.DataFrame(comments)  # building your input DataFrame
+----+--------------------------------------------------------------------------+
| | reviews |
|----+--------------------------------------------------------------------------|
| 0 | This is the first sentence of the first review. And this is the second. |
| 1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+
Now let's define a function which, given a string, returns the list of its sentences as texts (strings).
def obtain_sentences(s):
    doc = nlp(s)
    sents = [sent.text for sent in doc.sents]
    return sents
The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.
data = comments.copy()
data['reviews'] = comments.apply(lambda x: obtain_sentences(x['reviews']), axis=1)
data = data.explode('reviews').reset_index(drop=True)
data
I used explode to transform the elements of the lists of sentences into rows.
And this is the obtained output!
+----+--------------------------------------------------+
| | reviews |
|----+--------------------------------------------------|
| 0 | This is the first sentence of the first review. |
| 1 | And this is the second. |
| 2 | This is the first sentence of the second review. |
| 3 | And this is the second. |
+----+--------------------------------------------------+
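If the real Comments frame is large, a sketch using nlp.pipe (which batches the texts through the pipeline instead of calling nlp row by row) may be noticeably faster; it assumes the same comments DataFrame and sentencizer nlp defined above:

texts = comments['reviews'].tolist()
data = comments.copy()
data['reviews'] = [[sent.text for sent in doc.sents] for doc in nlp.pipe(texts)]
data = data.explode('reviews').reset_index(drop=True)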
So, for example, I have this DataFrame:
branch_name       | address
Mcdonald's - BGC  | 2nd str. BGC
Jollibee - Taguig | BGC, Taguig
...
How can I remove words from branch_name based on the words in address, as in the data below, and then create a new column to store the output for each row?
branch_name       | address      | store_name
Mcdonald's - BGC  | 2nd str. BGC | Mcdonald's
Jollibee - Taguig | BGC, Taguig  | Jollibee
...
In the expected output, the special characters have also been removed, except the apostrophe.
You can use Series.str.extract with a regex:
df['store_name'] = df.branch_name.str.extract(r'(\S+)')
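That regex only grabs the first whitespace-delimited token, which happens to work for the sample rows. If a store name can span several words, a sketch like the following instead drops every branch_name token that also appears in address, keeping apostrophes and discarding other special characters (clean_name is a hypothetical helper, not a pandas built-in):

import re
import pandas as pd

df = pd.DataFrame({
    'branch_name': ["Mcdonald's - BGC", 'Jollibee - Taguig'],
    'address': ['2nd str. BGC', 'BGC, Taguig'],
})

def clean_name(row):
    # Lowercased tokens of the address, used as a stop list
    address_words = set(re.findall(r"[\w']+", row['address'].lower()))
    # Keep only branch_name tokens absent from the address; the character
    # class keeps apostrophes and drops other punctuation such as "-"
    kept = [w for w in re.findall(r"[\w']+", row['branch_name'])
            if w.lower() not in address_words]
    return ' '.join(kept)

df['store_name'] = df.apply(clean_name, axis=1)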
I am writing Python code to show items in a store. As I am still learning, I want to know how to make a table which looks exactly like a table made using Texttable.
My code is:
Goods = ['Book','Gold']
Itemid= [711001,711002]
Price= [200,50000]
Count= [100,2]
Category= ['Books','Jewelry']
titles = ['', 'Item Id', 'Price', 'Count','Category']
data = [titles] + list(zip(Goods, Itemid, Price, Count, Category))
for i, d in enumerate(data):
    line = '|'.join(str(x).ljust(12) for x in d)
    print(line)
    if i == 0:
        print('=' * len(line))
My Output:
|Item Id |Price |Count |Category
================================================================
Book |711001 |200 |100 |Books
Gold |711002 |50000 |2 |Jewelry
Output I want:
+------+---------+-------+-------+-----------+
| | Item Id | Price | Count | Category |
+======+=========+=======+=======+===========+
| Book | 711001 | 200 | 100 | Books |
+------+---------+-------+-------+-----------+
| Gold | 711002 | 50000 | 2 | Jewelry |
+------+---------+-------+-------+-----------+
Your code is building its output by hand, using str.join(). You can do it that way, but it is very tedious. Use string formatting instead.
To help you along, here is one line:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:9s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books")
Texttable adjusts its cell widths to fit the data. If you want to do the same, you will have to put computed field widths in content_format instead of the numeric literals I used in the example above. Again, here is an example to get you going:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:{CategoryWidth}s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books",CategoryWidth=9)
But if you already know how to do this using Texttable, why not use that? Your comment says it's not available in Python: that's not true; I just downloaded version 0.9.0 using pip.
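For reference, a minimal Texttable sketch for the same data (assuming the package has been installed with pip install texttable):

from texttable import Texttable

table = Texttable()
table.add_rows([
    ['', 'Item Id', 'Price', 'Count', 'Category'],  # first row becomes the header
    ['Book', 711001, 200, 100, 'Books'],
    ['Gold', 711002, 50000, 2, 'Jewelry'],
])
print(table.draw())

This produces the boxed layout shown above, with the column widths computed for you.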
Before you judge me for asking a FAQ: my problem is not as easy as "the best shortest path algorithm" (at least I think so).
I have a Google Spreadsheet.
Every row begins with a town name, followed by the number of roads that go through it and the names of those roads. Something like below:
Ex. Spreadsheet / First sheet:
    A          | B                             | C         | D         | E         |
1   Town name  | Number of roads you find here | road name | road name | road name |
2   Manchester | 3                             | M1        | M2        | M3        |
3   Leeds      | 1                             | M3        |           |           |
4   Blackpool  | 2                             | M1        | M2        |           |
Now, this one Spreadsheet has many worksheets, one for every road name (in my case M1, M2, M3; M1 is the second worksheet, since the first one has the content from above, M2 is the third, etc.).
Ex. Spreadsheet / Second sheet:
    A         | B          | C              | D            | E          | F          |
1   This road | Town name  | Distance in km | type of road | other road | other road |
2   M1        | Manchester | 0              | M2           | M3         |            |
3   M1        | Blackpool  | 25             | M2           |            |            |
The third sheet is similar, and the following sheets have a similar structure. One town can be contained in many sheets, depending on how many roads link to it. You can see that from the example above.
The Spreadsheet is not made by me. It's like this. It will not get any better.
I have no problem pulling the data from the Google Spreadsheet into the program. Reading spreadsheet data with Python is not the question here.
What is the best way to write a program in wxPython/Python where a user inputs a Starting Town and a Finishing Town?
The program will read the spreadsheet and the appropriate worksheets.
It will somehow find the best path in this jungle of worksheets.
It will additionally return the total distance from the Starting Town to the Finishing Town, even if it has to go through more than 2-3 worksheets to get there.
It will return the results to the user's screen in a lovely form :)
I hope you find my problem challenging enough to deserve a question.
Please help me: show me the way to go about this very specific problem.
What came of your previous attempt:
Way too slow wxPython application getting data from Google Spreadsheet and User input needs speed up solution
Did you find what was taking so long? What other issues did you encounter there?
I'm relatively new to Stack Overflow, but I've seen this style of question, which can be interpreted as "Could you write this code for me?", rejected pretty swiftly.
You might want to consider sharing some of the challenges from the above link and explaining a specific problem within the project.
UPDATED
Regarding points 1 and 5:
From the wx point of view, you'll want to keep the UI responsive whilst the search is going on. One way to do this is to kick off the search in a separate thread which calls wx.PostEvent once it has finished. Then, in the main wx App, you have an event handler which receives the event and processes it; in your case, it "shows the results on a lovely form".
See here for an example: http://wiki.wxpython.org/LongRunningTasks
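Here is a minimal sketch of that pattern, assuming a hypothetical find_best_path function that performs the actual spreadsheet search:

import threading
import wx
import wx.lib.newevent

# Custom event used to hand the finished search result back to the UI thread
ResultEvent, EVT_RESULT = wx.lib.newevent.NewEvent()

class SearchThread(threading.Thread):
    def __init__(self, window, start_town, end_town):
        super().__init__(daemon=True)
        self.window = window
        self.start_town = start_town
        self.end_town = end_town

    def run(self):
        # find_best_path is a placeholder for your worksheet-hopping search
        path, distance = find_best_path(self.start_town, self.end_town)
        # wx.PostEvent is thread-safe; the bound handler runs in the main thread
        wx.PostEvent(self.window, ResultEvent(path=path, distance=distance))

In your frame you would then call self.Bind(EVT_RESULT, self.on_result) and start SearchThread(self, start, end) when the user clicks Search.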