I've used gensim for text summarization in Python. I want the summarized output to be stored in a different column of the same dataframe.
I've used this code:
for n, row in df_data_1.iterrows():
    text = df_data_1['Event Description (SAP)']
    print(text)
    df_data_1['Summary'] = summarize(text)
    print(df_data_1['Summary'])
The error occurs on line 4 of this code: TypeError: expected string or bytes-like object.
How do I store the processed text in the pandas dataframe?
If it's not a string or bytes-like object, what is it? You could check what type your summarize call returns and move forward from there:
test_text = df_data_1['Event Description (SAP)'].iloc[0]
print(type(summarize(test_text)))
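It can also help to check what the loop is actually handing to summarize; in the original code, text is assigned the whole column (a Series) rather than one row's value, which would explain the TypeError:

print(type(df_data_1['Event Description (SAP)']))          # the whole column -> pandas Series
print(type(df_data_1['Event Description (SAP)'].iloc[0]))  # a single cell -> ideally str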
Another remark: typically you'd want to avoid looping over a dataframe (see discussion). If you want to apply a function to an entire column, use df.apply() as follows:
df_data_1['Summary'] = df_data_1['Event Description (SAP)'].apply(lambda x: summarize(x))
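If some cells might not be strings (e.g. NaN), a hedged variant that skips them (assuming summarize is gensim's function, as in the question):

df_data_1['Summary'] = df_data_1['Event Description (SAP)'].apply(
    lambda x: summarize(x) if isinstance(x, str) else ''
)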
Related
I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags= re.VERBOSE|re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine, but when I write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits):
for i in AR_list:
    print(i)
    pat = i+"_regex"
    print(pat)
    f = lambda x: x.str.findall(i+"_regex", flags= re.VERBOSE|re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help us understand your question. In any case, your code sample appears to have a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
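As a hedged sketch of the dict idea, assuming (as your first working snippet suggests) that d['AR'] is itself a DataFrame; adapt the assignment to .at if your structure differs:

import re

# Hypothetical dict mapping a name to its pattern; the loop variable then
# yields the pattern itself instead of a string that merely looks like a
# variable name.
regex_patterns = {
    "AR11": r"(?=.*(?:slide|waterslide)).*pool",
    # "AR12": r"...",
    # "AR21": r"...",
}

for name, pattern in regex_patterns.items():
    f = lambda x, p=pattern: x.str.findall(p, flags=re.VERBOSE | re.IGNORECASE)
    # assign a whole new column in one step rather than via chained indexing
    d['AR'][name] = d['AR'].astype(str).apply(f).any(axis=1).astype(int)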
Add a sample data frame and refine your question for more suggestions.
I need to calculate the lexical diversity. An example of the text column from a pandas dataframe is:
Text
Happy new Year!
happy new Year! Wishing you all the best
New year is coming
[Oh Oh oh... 2022 is here] # this is a string, not a list
I have tried the following:
from lexical_diversity import lex_div as ld
tok = ld.tokenize(df['Text'])
flt = ld.flemmatize(df['Text'])
ld.mtld_ma_bid(flt)
but I got the error TypeError: expected string or bytes-like object when I run ld.tokenize. The Text column's dtype is object.
Is there anything that I am missing? I also dropped rows with missing data.
ld.tokenize doesn't understand how to deal with a list (or a Series). You have to apply the function to each row individually:
tok = df['Text'].apply(ld.tokenize)
flt = df['Text'].apply(ld.flemmatize)
flt.apply(ld.mtld_ma_bid)
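If you want to keep the per-row score alongside the text, something like this should work (the column name 'MTLD' is just an example, not from the original post):

df['MTLD'] = df['Text'].apply(lambda t: ld.mtld_ma_bid(ld.flemmatize(t)))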
Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all into a PySpark RDD and perform some operations on them that don't involve any shuffling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext

input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
"results",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files, each named with "part", so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names, or to introduce my own naming pattern in this case?
Desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz
You could use pyspark.sql.functions.input_file_name() (documented here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the newly created column.
This way, 5 input files should give you a categorical column with 5 different values and partitioning on it should split your output into 5 parts.
Alternatively, if you wish to have a full naming pattern, then functionally split the dataframe on the input_file_name() column (here into 5 dataframes), repartition (e.g. to 1 using coalesce(1)) and then save with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: When coalescing to 1 partition, be sure that the data fits entirely into memory!
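A minimal sketch of both ideas, assuming plain-text input files (the paths and the source_file column name are illustrative, not part of your code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

# Read as a DataFrame and remember which file each row came from
df = spark.read.text("data_directory/*").withColumn("source_file", input_file_name())

# Option 1: partition the output by source file; each input file ends up in its
# own sub-directory named after the source_file value (the part files inside
# keep Spark's naming, but the directory carries the origin)
df.write.partitionBy("source_file").option("compression", "gzip").text("results")

# Option 2: split per input file and write each piece under a custom name
for path in [r.source_file for r in df.select("source_file").distinct().collect()]:
    name = path.rsplit("/", 1)[-1].replace(".txt", "")
    (df.filter(df.source_file == path)
       .drop("source_file")
       .coalesce(1)
       .write.mode("overwrite")
       .option("compression", "gzip")
       .text("results/" + name))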
Good morning,
I am trying to iterate through a CSV to produce a title for each stock chart that I am making.
The CSV is formatted as Ticker, Description and spans about 200 rows.
The code is shown below:
df_symbol_description = pd.read_csv('C:/TS/Combined/Tickers Desc.csv')
print(df_symbol_description['Description'])
for r in df_symbol_description['Description']:
    plt.suptitle(df_symbol_description['Description'][r], size='20')
It is erroneous, as it comes back with this error: "KeyError: 'iShrs MSCI ACWI ETF'".
The error is just showing me the first ticker description in the CSV. If anyone knows how to fix this, it is much appreciated!
Thank you
I don't know how to fix the error, since it's unclear what you are trying to achieve, but we can have a look at the problem itself.
Consider this example, which is essentially your code in miniature.
import pandas as pd

df = pd.DataFrame({"x": ["A", "B", "C"]})

for r in df['x']:
    print(r, df['x'][r])
The dataframe consists of one column, called x, which contains the values "A", "B", "C". In the for loop you iterate over those values, so in the first iteration r is "A". You then use "A" as an index into the column, which is not possible, since the column is indexed by 0, 1 or 2, not by the strings it contains.
So in order to print the column values, you can simply use
for r in df['x']:
    print(r)
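Applied back to the original code, that suggests using the row value directly as the title. A sketch, assuming one figure is created per description (the chart-building part is omitted, since it isn't shown in the question):

import pandas as pd
import matplotlib.pyplot as plt

df_symbol_description = pd.read_csv('C:/TS/Combined/Tickers Desc.csv')

for description in df_symbol_description['Description']:
    fig = plt.figure()
    # ... build the stock chart for this ticker here ...
    plt.suptitle(description, size=20)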
I have many files in a folder that look like this one:
[screenshot of one of the input files]
and I'm trying to build a dictionary for the data. I want it keyed on two levels: the first key is the HTTP address and the second is the third field (the plugin used, like adblock). The values are the different metrics, and my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For the mean, for example, I would consider all the 4th-field values in the file, and so on. I tried to write some code but, first of all, I'm not sure that it is correct.
[screenshot of the attempted code]
I read other posts but none solved my problem, since they treat only one key or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is ok, how do I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is appreciated, and I'm happy to provide further details.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around a tabular data structure called a "DataFrame" that excels at column-wise and row-wise calculations such as the ones you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
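To get the per-site, per-plugin statistics the question asks for, a groupby on the site and plugin columns should work. A sketch, assuming the site is in column 0, the plugin in column 2, and the metrics in columns 3-5 (adjust the numbers to your files):

# Mean, median and variance of the assumed metric columns, grouped by
# site (column 0) and plugin (column 2):
metric_cols = [3, 4, 5]
stats = df.groupby([0, 2])[metric_cols].agg(['mean', 'median', 'var'])
print(stats)

# Look up one site/plugin pair, e.g. www.google.it with adblock:
print(stats.loc[("www.google.it", "adblock")])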