PySpark Replace Characters using regex and remove column on Databricks - python

I am trying to remove a column and special characters from the dataframe shown below.
The code below used to create the dataframe is as follows:
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')
The above produces the following output:
I need help with regex to remove the characters  and delete the first column.
As regards regex, I have tried the following:
dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', #"[^0-9a-zA-Z_]+"_ ""))
However, I'm getting a syntax error.
Any help much appreciated.

If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:
import re
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)  # raw string so \s is not treated as a string escape
print(colname,colt)
pdf.rename(columns={colname:colt}, inplace = True)
And for dropping index column you can refer to this stack answer
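A minimal sketch of the same idea applied to every column name at once, plus dropping the first column by position. The dataframe contents, the stray characters and the "Unnamed: 0" column are made up for illustration; only the regex comes from the answer above.

```python
import re

import pandas as pd

# Hypothetical data: an unwanted first column and stray characters in a header.
pdf = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "ï»¿COUNTRY ID": ["GB", "FR"],
        "COUNTRY NAME": ["United Kingdom", "France"],
    }
)

# Drop the first column by position, whatever it is called.
cleaned = pdf.iloc[:, 1:].copy()

# Keep only letters, digits, underscores and whitespace in each column name.
cleaned.columns = [re.sub(r"[^0-9a-zA-Z_\s]+", "", name) for name in cleaned.columns]

print(list(cleaned.columns))
```

Selecting by position avoids having to type the garbled name of the column you want to get rid of.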

You have read the data in as a pandas dataframe. From what I see, you want a Spark dataframe. Convert from pandas to Spark and rename the columns. The conversion drops the default pandas index column, which in your case you refer to as the first column. Note that show() returns None, so call it separately instead of assigning its result. Code below:
df = spark.createDataFrame(df).toDF('COUNTRY', 'COUNTRY NAME')
df.show()

Related

How to drop rows in python that don't have date values?

I need help cleaning a very large dataframe. One of the columns, "PostingTimeUtc", should contain only dates, but several rows were inserted incorrectly and hold strings of text instead. How can I select all the rows where "PostingTimeUtc" holds a string instead of a date and drop them?
I'm new to this site and to coding, so please let me know if I'm being vague.
Please remember to add examples even if short -
This may work in your case:
from pandas.api.types import is_datetime64_any_dtype as is_datetime
df[df['column name'].map(is_datetime)]
Where map applies the is_datetime function (results in True or False) to each row and the Boolean filter is applied to the dataframe.
Don't forget to assign the result back to df to keep it, as the filtering is not done in place.
df = df[df['column name'].map(is_datetime)]
I am assuming it's a pandas dataframe. You can build a Boolean mask from a regex and use it to filter the rows:
df.column_name.str.contains('your regex here')
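Another approach that may be worth trying is to let pandas parse the column and treat anything that fails to parse as missing. This is only a sketch: the sample values and the "value" column are invented, and it assumes the bad rows really are unparseable text.

```python
import pandas as pd

# Hypothetical sample: two valid timestamps and two rows of stray text.
df = pd.DataFrame(
    {
        "PostingTimeUtc": [
            "2021-03-01 10:00:00",
            "not a date",
            "2021-03-02 11:30:00",
            "see attachment",
        ],
        "value": [1, 2, 3, 4],
    }
)

# errors="coerce" turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(df["PostingTimeUtc"], errors="coerce")

# Keep only the rows that parsed, and store the parsed datetimes.
clean = df[parsed.notna()].copy()
clean["PostingTimeUtc"] = parsed[parsed.notna()]

print(clean)
```

The rows that would be dropped can be inspected first with df[parsed.isna()].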

Python - splitting sentences dataframe into multiple columns

I have sentences in a csv file that I'd like to split on spaces.
I've tried using this :-
df2 = df["Text"].str.split()
but it doesn't give the ideal result. It shows up like this instead.
I know how to do this with Power Query in Excel, but I would like to learn how to do something similar in Python.
Here's the ideal result that I'd like to achieve
try this:
df2 = df["Text"].str.split(',', expand=True)
The catch with this approach is that the number of resulting columns is set by the longest sentence. Having said that, you could try the following code:
import pandas as pd
final_df = original_df['Sentence'].str.split(',', expand=True)
final_df = final_df.add_prefix('Text.')
Note that empty columns will be filled with None. If you want these columns to look like empty entries, you could add the following code, which will replace all None's by an empty string:
final_df = final_df.replace([None], [''])
Hope this will be useful.
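Since the question asks for splitting on spaces, here is a small sketch of the same expand=True idea with whitespace as the delimiter. The two sentences are made up; the "Text" column name comes from the question.

```python
import pandas as pd

# Hypothetical sentences of different lengths.
df = pd.DataFrame({"Text": ["the quick brown fox", "hello world"]})

# With no pattern, str.split splits on runs of whitespace.
# expand=True spreads the words across columns instead of returning lists.
words = df["Text"].str.split(expand=True)

# Name the columns Text.0, Text.1, ... and blank out the missing cells.
words = words.add_prefix("Text.").fillna("")

print(words)
```

Shorter sentences end up with empty strings in the trailing columns.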

Pandas Group By and Sum , Header being removed

After I run the following code I seem to lose the headers of my dataframe. If I remove the line below, the headers are still there.
unifiedview = unifiedview.groupby(['key','MTM'])['MTM'].sum()
When I use to_csv, the file I open in Excel has no headers.
I've tried:
unifiedview = unifiedview.groupby(['key','MTM'], as_index = False)['MTM'].sum()
unifiedview = unifiedview.reset_index()
Any help would be appreciated.
Calling
unifiedview.groupby(['key','MTM'])['MTM']
will select only the 'MTM' column of the grouped data (a SeriesGroupBy object, not a dataframe)...
Therefore, the expression
unifiedview.groupby(['key','MTM'])['MTM'].sum() will return the sum of the GroupBy'd 'MTM' column...
unifiedview.groupby(['key','MTM']).sum().reset_index() should return the sum of all columns in unifiedview of the int or float dtype.
Are you looking to preserve all columns from the original dataframe?
Also, you must place an aggregate function after the groupby clause...
unifiedview.groupby(['key','MTM']) must have a .count(), .sum(), .mean(), ... method to group your columns...
unifiedview.groupby(['key','MTM']).sum()
unifiedview.groupby(['key','MTM']).count()
unifiedview.groupby(['key','MTM']).mean()
Is this helping you get in the right direction?
What version of pandas are you using? If you check the documentation it states:
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series
Since you are transforming your dataframe into a series object this might be the cause of your issue.
The documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
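To make the Series versus dataframe point concrete, here is a small sketch. The data is invented, and it groups by 'key' only (an assumption on my part, since grouping by 'MTM' and summing 'MTM' at the same time is probably not what is wanted).

```python
import pandas as pd

# Hypothetical data with the column names from the question.
unifiedview = pd.DataFrame({"key": ["a", "a", "b"], "MTM": [1.0, 2.0, 5.0]})

# Selecting one column before aggregating gives a Series indexed by 'key'.
summed = unifiedview.groupby("key")["MTM"].sum()

# reset_index turns it back into a dataframe with 'key' and 'MTM' columns.
as_frame = summed.reset_index()

# With a dataframe, to_csv writes the header row.
csv_text = as_frame.to_csv(index=False)
print(csv_text)
```

Passing a path to to_csv instead of leaving it empty writes the same text to a file.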

Delete first column and then take them as a index pandas

I have word2vec vectors that were saved to a txt file with save_word2vec_format from Gensim, and I read this file with pandas (picture below). How can I delete the first row and use it as the index?
My txt file: https://drive.google.com/file/d/1O206N93hPSmvMjwc0W5ATyqQMdMwhRlF/view?usp=sharing
Try this.
To use the column headers as the index:
_X_T.index=_X_T.columns
To use the first row as the index:
_X_T.index=_X_T.iloc[0]
save the row:
new_index = df.iloc[0]
drop it to avoid length mismatch:
df.drop(df.index[0], inplace=True)
and set it:
df.set_index(new_index, inplace=True)
You will get a SettingWithCopyWarning, but that's the most elegant solution I could come up with.
If you want to use the headers (and not the first row) as the index, do:
df.index = df.columns
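If the file has the usual word2vec text layout (an assumption, since the file itself is not shown here: one line with the vocabulary size and vector dimension, then one word per line followed by its numbers, separated by spaces), it may be simpler to handle this while reading. A sketch with made-up contents:

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents in the word2vec text layout:
# a "vocab_size vector_size" line, then word + vector per line.
text = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"

vectors = pd.read_csv(
    StringIO(text),  # replace with the path to the txt file
    sep=" ",
    skiprows=1,      # skip the "2 3" size line
    header=None,     # the remaining lines hold data, not column names
    index_col=0,     # use the word as the index
)

print(vectors)
```

This way there is no row to delete afterwards and the words are already the index.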

pandas pivot dataframe structure

raw data
pivot table
question
how can I replace industryName with tradeDate and remove that blank row? I want to make it look like:
the screenshot below is created by IPython-Dashboard
You can use reset_index to convert your index to a regular column.
data_pivot.reset_index()
tradeDate is the name of your index. You can remove it via:
data_pivot.index.name = None
industryName is the name of your columns. You can change it to tradeDate via:
data_pivot.columns.name = 'tradeDate'
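Put together on a tiny made-up table (the values and the "value" column are hypothetical; tradeDate and industryName are the names from the question), the two assignments look like this:

```python
import pandas as pd

# Hypothetical raw data in long format.
raw = pd.DataFrame(
    {
        "tradeDate": ["2016-01-04", "2016-01-04", "2016-01-05", "2016-01-05"],
        "industryName": ["Banks", "Tech", "Banks", "Tech"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

data_pivot = raw.pivot(index="tradeDate", columns="industryName", values="value")

# Remove the index name, which is what produces the extra blank header row.
data_pivot.index.name = None

# Relabel the columns axis so the corner cell reads tradeDate.
data_pivot.columns.name = "tradeDate"

print(data_pivot)
```

The data itself is untouched; only the two axis labels change.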
