I am writing a PySpark program that takes a txt file and adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named Creation_DT and Created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering whether there is an easier way to do this and optimize my script using PySpark.
import pandas as pd

df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Created_By', value="Sean")
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to PySpark code:
from pyspark.sql import SparkSession, functions as sf
import datetime

spark = SparkSession.builder.getOrCreate()

# read in using the DataFrame reader
# if you store your csv locally, the path should use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS.
spdf = (spark.read.format("csv")
        .option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))

spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Created_By", sf.lit("Sean"))
    # withColumn appends on the right, so reorder to put the new columns first
    .select("Creation_DT", "Created_By", *spdf.columns)
)

spdf2.write.csv("file:///home/path/new/new_file.txt")
This code assumes you are filling Creation_DT and Created_By with the same literal value for every row.
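If you want the actual load timestamp (as in the question's sample output) rather than a fixed date string, one variation is current_timestamp(), which Spark evaluates when the job runs; a minimal sketch:

# variation: let Spark stamp the current timestamp itself
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.current_timestamp())
    .withColumn("Created_By", sf.lit("Sean"))
    .select("Creation_DT", "Created_By", *spdf.columns)
)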
I don't see any PySpark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] = pd.to_datetime('today')
df['Created_By'] = "Sean"
cols = cols.insert(0, 'Created_By')
cols = cols.insert(0, 'Creation_DT')
# reindex the columns (assigning df.columns = cols would only relabel them)
df = df[cols]
df.to_csv("/home/path/new/new_file.txt", index=False)
Basically, I have one CSV file called 'Leads.csv' that contains all the sales leads we already have. I want to turn its 'Leads' column into a list, then check a 'Report' CSV to see whether any of those leads are already in there, and filter them out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads']) - set(df_leads['Leads'])
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are significantly faster than lists for this kind of membership test.
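For completeness, a minimal end-to-end sketch of the same idea, including the write-out step from the question (isin also accepts a set directly):

import pandas as pd

df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')

# keep only the report rows whose lead is not already in Leads.csv
df_filtered = df[~df['Leads'].isin(set(df_leads['Leads']))]
df_filtered.to_csv('Filtered Report.csv', index=False)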
I have been working on automating a series of reports in Python. I have been trying to create a series of pivot tables from an imported CSV (binlift.csv). I have found the pandas library very useful for this; however, I can't seem to find anything that helps me write the pandas-created pivot tables to my Excel document (Template.xlsx), and was wondering if anyone can help. So far I have written the following code:
import openpyxl
import pandas as pd
import numpy as np

file1 = "Template.xlsx"  # template file
file2 = "binlift.csv"    # raw data csv

wb1 = openpyxl.load_workbook(file1)  # opens template
ws1 = wb1.create_sheet("Raw Data")   # create a new sheet in template called Raw Data
summary = wb1.worksheets[0]          # variables given to sheets for manipulation
rawdata = wb1.worksheets[1]

headings = ["READER", "BEATID", "LIFTYEAR", "LIFTMONTH", "LIFTWEEK", "LIFTDAY", "TAGGED", "UNTAGGEDLIFT", "LIFT"]
df = pd.read_csv(file2, names=headings)

pivot_1 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH", "LIFTWEEK"], values=["TAGGED", "UNTAGGEDLIFT", "LIFT"], aggfunc=np.sum)
pivot_2 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH"], values=["TAGGED", "UNTAGGEDLIFT"], aggfunc=np.sum)
pivot_3 = pd.pivot_table(df, index=["READER"], values=["TAGGED", "UNTAGGEDLIFT", "LIFT"], aggfunc=np.sum)

print(pivot_1)
print(pivot_2)
print(pivot_3)

wb1.save('test.xlsx')
pandas has an option to write xlsx files: pandas.ExcelWriter.
Here we basically get all the level-0 index values of the pivot table, then go over them one by one, subsetting the table and writing each part to its own sheet.
writer = pd.ExcelWriter('output.xlsx')
for manager in pivot_1.index.get_level_values(0).unique():
    temp_df = pivot_1.xs(manager, level=0)
    # sheet names must be strings, so cast the index value
    temp_df.to_excel(writer, sheet_name=str(manager))
writer.save()
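If you want the pivots written into the existing Template.xlsx rather than a separate output file, one option is a sketch like the following, assuming a pandas version that supports append mode with the openpyxl engine; the sheet names here are placeholders:

# append the three pivots as new sheets in the template
with pd.ExcelWriter("Template.xlsx", engine="openpyxl", mode="a") as writer:
    pivot_1.to_excel(writer, sheet_name="Pivot 1")
    pivot_2.to_excel(writer, sheet_name="Pivot 2")
    pivot_3.to_excel(writer, sheet_name="Pivot 3")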
I am very new to Python and I want to replace an exact ticker with a reference to a column of a DataFrame I created from a CSV file; can this be done? I'm using:
import pandas as pd
import numpy as np
import pdblp
import blpapi
con = pdblp.BCon(debug=False, port=8194, timeout=5000)
con.start()
con.ref("CLF0CLH0 Comdty","PX_LAST")
tickers = pd.read_csv("Tick.csv")
so "tickers" has a colum 'ticker1' which is a list of tickers, i want to replace
con.ref("CLF0CLH0 Comdty","PX_LAST") with somthing like
con.ref([tickers('ticker1')],"PX_LAST")
any ideas?
Assuming you want to load all tickers into one DataFrame, I think it would look something like this:
frames = []
for ticker in tickers['ticker1']:
    # con.ref returns the reference data for this ticker as a DataFrame
    df_tmp = con.ref(ticker, "PX_LAST")
    frames.append(df_tmp)
df = pd.concat(frames, ignore_index=True)
I ended up using the .tolist() function, and it worked well:
tickers = pd.read_csv("Tick.csv")
tickers1=tickers['ticker'].tolist()
con.ref(tickers1, ["PX_LAST"])
I have a CSV file where I made some modifications to two columns. My question is the following: how can I print my CSV file with the updated columns? My code is the following:
import pandas as pd
data = pd.read_csv("testdataset.csv")
data = data.join(pd.get_dummies(data["ship_from"]))
data = data.drop("ship_from", axis=1)
data['market_name'] = data['market_name'].map(lambda x: str(x)[39:-1])
data = data.join(pd.get_dummies(data["market_name"]))
data = data.drop("market_name", axis=1)
Thank you in advance!
You can write to a file with pandas.DataFrame.to_csv:
data.to_csv('your_file.csv')
However, you can view it without writing a file with
print(data.to_csv())
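If you just want a readable preview in the console rather than raw CSV text, to_string (or printing the DataFrame directly) keeps the tabular layout:

print(data.to_string())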
I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
(Screenshots of fileA.xlsx and fileB.xlsx omitted; each file contains at least an ID column and a Message column.)
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?
You can use pandas' merge to first get the rows that appear in both files, then use them as a filter.
import pandas as pd

df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)

# outer merge with indicator marks each ID as 'left_only', 'right_only', or 'both'
df_new = df_A.merge(df_B, on='ID', how='outer', indicator=True)
df_common = df_new[df_new['_merge'] == 'both']

# drop the IDs that appear in both files
df_A = df_A[~df_A.ID.isin(df_common.ID)]
df_B = df_B[~df_B.ID.isin(df_common.ID)]
df_A and df_B now contain the rows from fileA and fileB respectively, without the rows common to both.
Hope this helps.
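To actually remove those rows from fileB on disk, as the question asks, you would write the filtered frame back out; a minimal sketch (the output name here is just an illustration):

df_B.to_excel("fileB_filtered.xlsx", index=False)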
Here I am trying it with pandas (you also have to install xlrd for opening xlsx files). It takes the values from the second file that are not in the first file, then writes them back out under the second file's name, rewriting the second file:
import pandas as pd

a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')

# element-wise comparison: keep rows of b that differ from the aligned rows of a
# (this only works when a and b have the same shape and row order)
diff = b[b != a].dropna()
diff.to_excel("b.xlsx", sheet_name='Sheet1', index=False)
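Since the element-wise comparison above assumes both files line up row for row, a merge-based anti-join (in the spirit of the previous answer) is a more robust sketch for the general case:

# rows of b with no exact match in a, merging on all shared columns
merged = b.merge(a, how='outer', indicator=True)
only_b = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
only_b.to_excel("b.xlsx", sheet_name='Sheet1', index=False)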