How do I remove stop words in a PySpark RDD?

How do I remove the stop words in a PySpark RDD?
I have the following RDD:
my_doc = sc.parallelize([("Alex Smith", 101, ["i", "saw", "a", "sheep"]), ("John Lee", 102, ["he", "likes", "ice", "cream"])])
which looks like this:
(("Alex Smith", 101, ["i", "saw", "a", "sheep"]), ("John Lee", 102, ["he", "likes", "ice", "cream"]))
I want to remove the stop words from the word list in x[2], such as "a", "he", "i", etc.
After removing the stop words, it should look like this:
(("Alex Smith", 101, ["saw", "sheep"]), ("John Lee", 102, ["likes", "ice", "cream"]))

Map over the RDD to create new tuples with the filtered word lists:
stop_words = ['i', 'a', 'he']
my_doc.map(lambda x: (x[0], x[1], list(filter(lambda word: word.lower() not in stop_words, x[2])))).collect()
[('Alex Smith', 101, ['saw', 'sheep']),
('John Lee', 102, ['likes', 'ice', 'cream'])]
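
If the stop word list grows large, membership tests are faster against a set, and broadcasting it avoids re-shipping the list with every task. A minimal sketch, reusing my_doc and sc from above:

stop_words_bc = sc.broadcast({"i", "a", "he"})  # broadcast a set for O(1) lookups

result = my_doc.map(
    lambda x: (x[0], x[1],
               [w for w in x[2] if w.lower() not in stop_words_bc.value])
).collect()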

It would be more efficient to use the DataFrame API instead of the RDD API - in this case the data won't need to be sent to a Python process for filtering, since everything stays inside the JVM, and Catalyst will be able to optimize it better.
For filtering the array elements you can use the filter function (the Python-lambda form requires Spark 3.1+), like this:
import pyspark.sql.functions as F

df = spark.createDataFrame([("Alex Smith", 101, ["i", "saw", "a", "sheep"]),
                            ("John Lee", 102, ["he", "likes", "ice", "cream"])],
                           schema="name string, id int, words array<string>")
stop_words = ["he", "i", "a", "the"]  # ...
df2 = df.select("name", "id",
                F.filter("words", lambda x: ~F.lower(x).isin(stop_words)).alias("words"))
df2.show(truncate=False)
and it will give you the desired result:
+----------+---+-------------------+
|name |id |words |
+----------+---+-------------------+
|Alex Smith|101|[saw, sheep] |
|John Lee |102|[likes, ice, cream]|
+----------+---+-------------------+
The main piece is ~F.lower(x).isin(stop_words): it lowercases each array element, applies the standard .isin function against the given list of stop words, and negates the result with ~.
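On Spark versions before 3.1, where F.filter does not yet accept a Python lambda, the same higher-order function has been available through SQL expressions since Spark 2.4. A sketch against the same df:

from pyspark.sql import functions as F

# SQL higher-order filter; the stop words are inlined into the expression here
df2 = df.select(
    "name", "id",
    F.expr("filter(words, x -> NOT lower(x) IN ('he', 'i', 'a', 'the'))").alias("words")
)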

Related

pandas df explode and implode to remove specific dict from the list

I have a pandas dataframe with multiple columns. One of the columns, request_headers, contains a list of dictionaries, for example:
[{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}]
I would like to remove only those elements from the list whose name matches a specific value. For example, with:
blacklist = "name2"
I should get the same dataframe, with all the columns including request_headers, but its value (based on the example above) should be:
[{"name": "name1", "value": "value1"}]
How can I achieve this? I tried to explode first, then filter, but was not able to "implode" correctly.
Exploding is expensive; rather, use a list comprehension:
blacklist = "name2"
df['request_headers'] = [[d for d in l if 'name' in d and d['name'] != blacklist]
                         for l in df['request_headers']]
Output:
request_headers
0 [{'name': 'name1', 'value': 'value1'}]
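
For completeness, the explode-then-implode route the question attempted can work, because explode preserves the original index and grouping on it rebuilds the lists. A sketch, assuming no row ends up with every element removed (such rows would drop out of the groupby):

exploded = df.explode('request_headers')
kept = exploded[exploded['request_headers'].apply(lambda d: d.get('name') != blacklist)]
# "implode": group on the preserved index to rebuild the per-row lists
df['request_headers'] = kept.groupby(level=0)['request_headers'].agg(list)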
You can also use an .apply function:
blacklist = 'name2'
df['request_headers'] = df['request_headers'].apply(lambda x: [d for d in x if blacklist not in d.values()])
Alternatively, if each dictionary is loaded as a row of its own dataframe, a boolean mask identifies the blacklisted entries:
df1 = pd.DataFrame([{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}])
blacklist = "name2"
col1 = df1.name.eq(blacklist)
df1.loc[col1]
Output:
    name   value
1  name2  value2
Inverting the mask with df1.loc[~col1] keeps the non-blacklisted rows instead.

How to subset a pandas dataframe based on column names of another dataframe that may be in random order?

I want to subset the row names of the raw_clin dataframe by the column names of the common dataframe.
common dataframe example:
common = pd.DataFrame([["PPP1R15A", -0.5880, 1.3980, -0.9402, -0.3741],
                       ["AVPR1A", 1.5472, -0.8588, -0.1703, -0.5198],
                       ["RGR", -0.3225, 0.8372, 0.2006, -0.0271]],
                      columns=['Hugo_Symbol', 'TCGA-02-0010-01', 'TCGA-41-2571-01', 'TCGA-14-1821-01', 'TCGA-32-2632-01'])
raw_clin dataframe example:
raw_clin = pd.DataFrame([["TCGA-02-0010-01", "I", "want", "to", "subset"],
                         ["TCGA-14-1821-01", "clin_var", "rownames", "by", "common"],
                         ["TCGA-41-2571-01", "colnames", "where", "the", "latter"],
                         ["TCGA-32-2632-01", "may", "be", "random", "order"]],
                        columns=['PATIENT_ID', 'Something1', 'something2', 'something3', 'something4'])
desired output (rows reordered to follow the column order of common):
raw_clin = pd.DataFrame([["TCGA-02-0010-01", "I", "want", "to", "subset"],
                         ["TCGA-41-2571-01", "colnames", "where", "the", "latter"],
                         ["TCGA-14-1821-01", "clin_var", "rownames", "by", "common"],
                         ["TCGA-32-2632-01", "may", "be", "random", "order"]],
                        columns=['PATIENT_ID', 'Something1', 'something2', 'something3', 'something4'])
My attempt yielded no matches:
raw_clin = raw_clin[raw_clin.index.isin(common.columns)]
If I understand correctly, by "row names" you mean the index, so you need to use set_index on that dataframe first.
Then your code raw_clin = raw_clin[raw_clin.index.isin(common.columns)] will work and create your desired output.
raw_clin = pd.DataFrame([["TCGA-02-0010-01", "I", "want", "to", "subset"],
                         ["TCGA-14-1821-01", "clin_var", "rownames", "by", "common"],
                         ["TCGA-41-2571-01", "colnames", "where", "the", "latter"],
                         ["TCGA-32-2632-01", "may", "be", "random", "order"]],
                        columns=['PATIENT_ID', 'Something1', 'something2', 'something3', 'something4']).set_index('PATIENT_ID')
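
Note that isin keeps raw_clin's own row order. Since the desired output follows the column order of common, reindex may be the better fit; a sketch, assuming raw_clin is indexed by PATIENT_ID as above:

ids = [c for c in common.columns if c != 'Hugo_Symbol']  # sample IDs, in common's column order
result = raw_clin.reindex(ids).dropna(how='all')  # dropna guards against IDs absent from raw_clin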

How to change dataframe into a specific json format?

I want to convert the below dataframe into the JSON format shown:
Dataframe:
  Party name       y
0  Adam  ABC    2.00
1  Adam  DEF    5.00
2  John  GHI   29.01
3  John  FMI  219.77
Desired output:
{"Adam": [{"name": "ABC", "y": 2.0}, {"name": "DEF", "y": 5}],
"John": [{"name": "GHI", "y": 29.01}, {"name": "FMI", "y": 219.77}]}
I tried creating an index on the Party column and using df.to_json(orient="index"), but it fails due to duplicates in the index (the Party column). Can someone please help?
Use a custom lambda function in GroupBy.apply:
j = (df.groupby('Party')[['name','y']]
       .apply(lambda x: x.to_dict('records'))
       .to_json(orient="index"))
print(j)
{"Adam":[{"name":"ABC","y":2.0},{"name":"DEF","y":5.0}],
"John":[{"name":"GHI","y":29.01},{"name":"FMI","y":219.77}]}

Clean multiple JSONs in a pandas dataframe

I have a dataframe created like below, with countries in JSON format:
df = pd.DataFrame([['matt', '''[{"c_id": "cn", "c_name": "China"}, {"c_id": "au", "c_name": "Australia"}]'''],
                   ['david', '''[{"c_id": "jp", "c_name": "Japan"}, {"c_id": "cn", "c_name": "China"}, {"c_id": "au", "c_name": "Australia"}]'''],
                   ['john', '''[{"c_id": "br", "c_name": "Brazil"}, {"c_id": "ag", "c_name": "Argentina"}]''']],
                  columns=['person', 'countries'])
I'd like to have the output as below, with just the country names, separated by a comma and sorted in alphabetical order:
result = pd.DataFrame([['matt', 'Australia, China'],
                       ['david', 'Australia, China, Japan'],
                       ['john', 'Argentina, Brazil']],
                      columns=['person', 'countries'])
I tried doing this using a few methods, but none worked successfully. I was hoping the below would split the JSON format appropriately, but it didn't work out - perhaps because the JSONs are in string format in the dataframe?
result = pd.io.json.json_normalize(df, 'c_name')
One solution could be to use ast.literal_eval to treat the string as a list of dictionaries:
import ast
df["countries"] = df["countries"].map(lambda x: ast.literal_eval(x))
df["countries"] = df["countries"].map(lambda x: sorted([c["c_name"] for c in x]))

extracting values by keywords in a pandas column

I have a column that is a list of dictionaries. I extracted only the values under the name key and saved them to a list. Since I need to feed the column to a TfidfVectorizer, I need the column to be a string of words. My code is as follows:
import json

def transform(s, to_extract):
    # parse the JSON string and pull out one field from each object
    return [obj[to_extract] for obj in json.loads(s)]

cols = ['genres', 'keywords']
for col in cols:
    lst = df[col]
    df[col] = list(map(lambda x: transform(x, to_extract='name'), lst))
    df[col] = [', '.join(x) for x in df[col]]
For testing, here are 2 rows:
data = {'genres': [[{"id": 851, "name": "dual identity"},{"id": 2038, "name": "love of one's life"}],
[{"id": 5983, "name": "pizza boy"},{"id": 8828, "name": "marvel comic"}]],
'keywords': [[{"id": 9663, "name": "sequel"},{"id": 9715, "name": "superhero"}],
[{"id": 14991, "name": "tentacle"},{"id": 34079, "name": "death", "id": 163074, "name": "super villain"}]]
}
df = pd.DataFrame(data)
I'm able to extract the necessary data and save it accordingly. However, I find the code too verbose, and I would like to know if there's a more Pythonic way to achieve the same outcome?
The desired output for one row is a string delimited only by a comma, e.g. 'dual identity,love of one's life'.
Is this what you need?
df.applymap(lambda x : pd.DataFrame(x).name.tolist())
Out[278]:
genres keywords
0 [dual identity, love of one's life] [sequel, superhero]
1 [pizza boy, marvel comic] [tentacle, super villain]
Update
df.applymap(lambda x : pd.DataFrame(x).name.str.cat(sep=','))
Out[280]:
genres keywords
0 dual identity,love of one's life sequel,superhero
1 pizza boy,marvel comic tentacle,super villain
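
Building a DataFrame per cell is fairly heavyweight; a plain generator expression over the already-parsed dictionaries yields the same strings, as in this sketch (on pandas 2.1+, DataFrame.map is the non-deprecated spelling of applymap):

df.applymap(lambda lst: ','.join(d['name'] for d in lst))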
