I have a long list of pandas transformation commands that I need to run against a pandas DataFrame:
df['newvar_A'] = df['somevar'] * df['somevar']
df['newvar_C'] = df['somevar'] * df['somevar']
df['newvar_D'] = df['somevar'] * df['somevar']
df['newvar_ETC'] = df['somevar'] * df['somevar']
It's a long list (about 150 lines). Is it possible to keep this in a separate script called transformations.py and include it from an already existing script? The idea is to keep the main script simple, so I'd like the main script to look like this:
import pandas as pd
df = pd.read_csv('data.csv')
...
#Run transformations
insert file = "transformations.py"
...
#rest of the main script
Is there a Python command to call another Python script (assuming this script is located in the same folder as the working directory)?
Thanks!
You can import the script; that's the best way, as per this post.
A small example
sample.csv
name,age
sharon,12
shalom,10
The script which I am going to import
nameChange.py
import pandas as pd
# transform the csv file
data = pd.read_csv('sample.csv')
data.iloc[0,0] = 'justin'
data.to_csv('sample.csv',index = False)
The main code
stackoverflow.py
import pandas as pd
# before transform
data = pd.read_csv('sample.csv')
print(data)
# call the script
import nameChange
# do the work after the script runs
transformed_data = pd.read_csv('sample.csv')
print(transformed_data)
Output
name age
0 sharon 12
1 shalom 10
name age
0 justin 12
1 shalom 10
To run the above code without modifying the original csv
The script which I am going to import
nameChange.py
import pandas as pd
import pickle
# transform the DataFrame that was saved by stackoverflow.py
data = pickle.load(open('data.sav','rb'))
data.iloc[0,0] = 'justin'
# saving the df
pickle.dump(data,open('data.sav','wb'))
The main code
stackoverflow.py
import pandas as pd
import pickle
# before transform
data = pd.read_csv('sample.csv')
print(data)
pickle.dump(data,open('data.sav','wb'))
# call the script
import nameChange
transformed_data = pickle.load(open('data.sav','rb'))
# do the work after the script runs
print(transformed_data)
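If the goal from the original question is mainly to keep the ~150 DataFrame transformations out of the main script, a variation on the same import idea is to wrap them in a function inside transformations.py and pass the DataFrame in, so nothing has to round-trip through a CSV or pickle file. A minimal sketch (the function name apply_transformations and the column names are placeholders, not from the question):
transformations.py
def apply_transformations(df):
    # the ~150 assignments go here, operating on the frame that is passed in
    df['newvar_A'] = df['somevar'] * df['somevar']
    df['newvar_C'] = df['somevar'] * df['somevar']
    # ...
    return df
The main script then becomes:
import pandas as pd
from transformations import apply_transformations

df = pd.read_csv('data.csv')
df = apply_transformations(df)
# rest of the main script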
Related
So I'm learning PySpark by playing around with the DMOZ dataset in a jupyter notebook attached to an EMR cluster. The process I'm trying to achieve is as follows:
Load a csv with the location of files in an s3 public dataset in to a PySpark DataFrame (~130k rows)
Map over the DF with a function that retrieves the file contents (html) and rips the text
Join the output with the original DF as a new column
Write the joined DF to s3 (the problem: it seems to hang forever; it's not a large job and the output JSON should only be a few gigs)
All of the writing is done in a function called run_job()
I let it sit for about 2 hours on a cluster with 10 m5.8xlarge instances, which should be enough (?). All of the other steps execute fine on their own, except for the df.write(). I have tested on a much smaller subset and it wrote to s3 with no issue, but when I go to do the whole file it seemingly hangs at "0/n jobs complete."
I am new to PySpark and distributed computing in general, so it's probably a simple "best practice" that I am missing. (Edit: Maybe it's in the config of the notebook? I'm not using any magics to configure Spark currently; do I need to?)
Code below...
import html2text
import boto3
import botocore
import os
import re
import zlib
import gzip
from bs4 import BeautifulSoup as bs
from bs4 import Comment
# from pyspark import SparkContext, SparkConf
# from pyspark.sql import SQLContext, SparkSession
# from pyspark.sql.types import StructType, StructField, StringType, LongType
import logging
def load_index():
    input_file = 's3://cc-stuff/uploads/DMOZ_bussineses_ccindex.csv'
    df = spark.read.option("header", True) \
        .csv(input_file)
    #df = df.select('url_surtkey','warc_filename', 'warc_record_offset', 'warc_record_length','content_charset','content_languages','fetch_time','fetch_status','content_mime_type')
    return df
def process_warcs(id_, iterator):
    html_textract = html2text.HTML2Text()
    html_textract.ignore_links = True
    html_textract.ignore_images = True
    no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
    s3client = boto3.client('s3', config=no_sign_request)
    text = None
    s3pattern = re.compile('^s3://([^/]+)/(.+)')
    PREFIX = "s3://commoncrawl/"
    for row in iterator:
        try:
            start_byte = int(row['warc_record_offset'])
            stop_byte = (start_byte + int(row['warc_record_length']))
            s3match = s3pattern.match((PREFIX + row['warc_filename']))
            bucketname = s3match.group(1)
            path = s3match.group(2)
            #print('Bucketname: ',bucketname,'\nPath: ',path)
            resp = s3client.get_object(Bucket=bucketname, Key=path, Range='bytes={}-{}'.format(start_byte, stop_byte))
            content = resp['Body'].read()#.decode()
            data = zlib.decompress(content, wbits=zlib.MAX_WBITS | 16).decode('utf-8', errors='ignore')
            data = data.split('\r\n\r\n', 2)[2]
            soup = bs(data, 'html.parser')
            for x in soup.findAll(text=lambda text: isinstance(text, Comment)):
                x.extract()
            for x in soup.find_all(["head", "script", "button", "form", "noscript", "style"]):
                x.decompose()
            text = html_textract.handle(str(soup))
        except Exception as e:
            pass
        yield (id_, text)
def run_job(write_out=True):
    df = load_index()
    df2 = df.rdd.repartition(200).mapPartitionsWithIndex(process_warcs).toDF()
    df2 = df2.withColumnRenamed('_1', 'idx').withColumnRenamed('_2', 'page_md')
    df = df.join(df2.select('page_md'))
    if write_out:
        output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
        df.coalesce(4).write.json(output)
    return df

df = run_job(write_out=True)
So I managed to make it work. I attribute this to one (or both) of the two changes below. I also changed the hardware configuration and opted for a larger quantity of smaller instances. Gosh, I just LOVE it when I spend an entire day in a deep state of utter confusion when all I needed to do was add a "/" to the save location...
1. I added a trailing "/" to the output file location in s3.
Old:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
New:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML/"
2. I removed the "coalesce" in the "run_job()" function. I have 200 output files now, but it worked and it was super quick (under 1 min).
Old:
df.coalesce(4).write.json(output)
New:
df.write.mode('overwrite').json(output)
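For what it's worth, if fewer output files are wanted without reintroducing the slowdown, one option (untested here, and the partition count is arbitrary) is to repartition before writing instead of coalescing: coalesce(4) squeezes all of the upstream fetch/parse work into only 4 tasks, while repartition adds a shuffle but keeps that work fully parallel.
df.repartition(50).write.mode('overwrite').json(output)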
I've got an Excel file (xlsx, shape 1180x6) that I'm trying to manipulate: essentially inserting an empty row between every pair of existing rows and filling it by re-arranging the existing data. The code runs fine when I try it with just 10 rows of data but fails when I run it on the entire 1180 rows. It also runs for a long time before spitting out the same unprocessed data. Is openpyxl not built for this? Just wondering if there's a more efficient way of doing it. Here's my code. Below the code is the data after using a few rows, which is what I need, but it fails for the entire data set.
%%time
import pandas as pd
import numpy as np
from openpyxl import load_workbook
import os
xls = pd.ExcelFile('input.xlsx')
df = xls.parse(0)
wb = load_workbook('input.xlsx')
#print(wb.sheetnames)
sh1=wb['Sheet1']
df.head()
#print(sh1.max_column)
for y in range(2, (sh1.max_row+1)*2, 2):
    sh1.insert_rows(y)
wb.save('output.xlsx')
m=3
for k in range(2, sh1.max_row+1, 2):
    sh1.cell(row=k, column=1).value = sh1.cell(row=m, column=1).value  # copy from one cell and paste
    sh1.cell(row=k, column=2).value = sh1.cell(row=m, column=3).value
    sh1.cell(row=k, column=3).value = sh1.cell(row=m, column=2).value
    sh1.cell(row=k, column=4).value = 'A'
    sh1.cell(row=m, column=4).value = 'H'
    sh1.cell(row=k, column=5).value = sh1.cell(row=m, column=6).value
    sh1.cell(row=k, column=6).value = sh1.cell(row=m, column=5).value
    m += 2
wb.save('output.xlsx')
xls = pd.ExcelFile('output.xlsx')
df1 = xls.parse(0)
wb1 = load_workbook('output.xlsx')
df1
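One possible alternative, sketched here under the assumption that the sheet has exactly the six columns used in the loop above (column 4 holds the H/A tag, columns 2/3 and 5/6 are swapped in the mirrored row) and that no other workbook formatting needs to be preserved: build the interleaved rows in pandas and write the sheet once, instead of calling insert_rows and editing cells 1180 times.
import pandas as pd

df = pd.read_excel('input.xlsx', sheet_name=0)

# build the mirrored ('A') rows in memory
mirror = df.copy()
mirror.iloc[:, [1, 2]] = df.iloc[:, [2, 1]].values   # swap columns 2 and 3
mirror.iloc[:, [4, 5]] = df.iloc[:, [5, 4]].values   # swap columns 5 and 6
mirror.iloc[:, 3] = 'A'
df.iloc[:, 3] = 'H'

# interleave: each mirrored row directly before its original row
# (mergesort is stable, so for each index the mirror stays first)
out = (pd.concat([mirror, df])
         .sort_index(kind='mergesort')
         .reset_index(drop=True))
out.to_excel('output.xlsx', index=False)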
I am writing a PySpark program that takes a txt file and then adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to do this and optimize my script using PySpark.
import pandas as pd
import numpy as np

df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
df = pd.DataFrame(df)
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY', value="Sean")
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to PySpark code:
from pyspark.sql import DataFrame, functions as sf
import datetime

# read in using the dataframe reader
# if you store your csv locally, use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS
spdf = (spark.read.format("csv").option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))

spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
)

spdf2.write.csv("file:///home/path/new/new_file.txt")
This code assumes you are filling creation_dt and create_by with the same value for every row.
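If an actual timestamp (rather than a formatted date string) is preferred for Creation_DT, pyspark.sql.functions.current_timestamp() should work as a drop-in replacement; a small sketch using the same spdf as above:
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.current_timestamp())  # timestamp column instead of a string
    .withColumn("Create_BY", sf.lit("Sean"))
)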
I don't see you using any PySpark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] =pd.to_datetime('today')
df['Create_BY']="Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
df.columns = cols
df.to_csv("/home/path/new/new_file.txt", index=False)
I have several .pcap files whose data I want to write to one large dask dataframe. Currently, my code initializes a dask dataframe using data from the first file. It is then supposed to process the rest of the pcap files and add them to that dask dataframe using merge/concat. However, when I check the number of rows of the merged dask dataframe, it doesn't increase. What is happening?
I am also not sure if I am using the right approach for my use case. I am trying to convert my entire dataset into a giant dask dataframe and write it out to an h5 file. My computer doesn't have enough memory to load the entire dataset, so that's why I'm using dask. The idea is to load the dask dataframe that contains the entire dataset so I can do operations on all of it. I'm new to dask and I've read over some of the documentation, but I'm still fuzzy about how dask handles loading data from disk instead of memory. I'm also fuzzy about how partitions work in dask; specifically, I'm not sure how chunksize differs from partitions, so I'm having trouble properly partitioning this dataframe. Any tips and advice would be helpful.
As said before, I've read over the main parts of the documentation.
I've tried using dd.merge(dask_df, panda_df) as shown in the documentation. When I initialize the dask dataframe, it starts with 6 rows. When I use merge, the row count decreases to 1.
I've also tried using concat. Again, I have a count of 6 rows during initialization. However, after the concat operations the row count still remains at 6. I would expect the row count to increase.
Here is the initialization function
import os
import sys
import h5py
import pandas as pd
import dask.dataframe as dd
import gc
import pprint
from scapy.all import *
flags = {
    'R': 0,
    'A': 1,
    'S': 2,
    'DF': 3,
    'FA': 4,
    'SA': 5,
    'RA': 6,
    'PA': 7,
    'FPA': 8
}
def initialize(file):
    global flags
    data = {
        'time_delta': [0],
        'ttl': [],
        'len': [],
        'dataofs': [],
        'window': [],
        'seq_delta': [0],
        'ack_delta': [0],
        'flags': []
    }
    scap = sniff(offline=file, filter='tcp and ip')
    for packet in range(0, len(scap)):
        pkt = scap[packet]
        flag = flags[str(pkt['TCP'].flags)]
        data['ttl'].append(pkt['IP'].ttl)
        data['len'].append(pkt['IP'].len)
        data['dataofs'].append(pkt['TCP'].dataofs)
        data['window'].append(pkt['TCP'].window)
        data['flags'].append(flag)
        if packet != 0:
            lst_pkt = scap[packet-1]
            data['time_delta'].append(pkt.time - lst_pkt.time)
            data['seq_delta'].append(pkt['TCP'].seq - lst_pkt['TCP'].seq)
            data['ack_delta'].append(pkt['TCP'].ack - lst_pkt['TCP'].ack)
    panda = pd.DataFrame(data=data)
    panda['ttl'] = panda['ttl'].astype('float16')
    panda['flags'] = panda['flags'].astype('float16')
    panda['dataofs'] = panda['dataofs'].astype('float16')
    panda['len'] = panda['len'].astype('float16')
    panda['window'] = panda['window'].astype('float32')
    panda['seq_delta'] = panda['seq_delta'].astype('float32')
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    df = dd.from_pandas(panda, npartitions=6)
    gc.collect()
    return df
Here is the concatenation function
def process(file):
    global flags
    global df
    data = {
        'time_delta': [0],
        'ttl': [],
        'len': [],
        'dataofs': [],
        'window': [],
        'seq_delta': [0],
        'ack_delta': [0],
        'flags': []
    }
    scap = sniff(offline=file, filter='tcp and ip')
    for packet in range(0, len(scap)):
        pkt = scap[packet]
        flag = flags[str(pkt['TCP'].flags)]
        data['ttl'].append(pkt['IP'].ttl)
        data['len'].append(pkt['IP'].len)
        data['dataofs'].append(pkt['TCP'].dataofs)
        data['window'].append(pkt['TCP'].window)
        data['flags'].append(flag)
        if packet != 0:
            lst_pkt = scap[packet-1]
            data['time_delta'].append(pkt.time - lst_pkt.time)
            data['seq_delta'].append(pkt['TCP'].seq - lst_pkt['TCP'].seq)
            data['ack_delta'].append(pkt['TCP'].ack - lst_pkt['TCP'].ack)
    panda = pd.DataFrame(data=data)
    panda['ttl'] = panda['ttl'].astype('float16')
    panda['flags'] = panda['flags'].astype('float16')
    panda['dataofs'] = panda['dataofs'].astype('float16')
    panda['len'] = panda['len'].astype('float16')
    panda['window'] = panda['window'].astype('float32')
    panda['seq_delta'] = panda['seq_delta'].astype('float32')
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    #merge version: dd.merge(df, panda)
    dd.concat([df, dd.from_pandas(panda, npartitions=6)])
    gc.collect()
And here is the main program
directory = 'dev/streams/'
files = os.listdir(directory)
df = initialize(directory+files[0])
files.remove(files[0])
for file in files:
    process(directory+file)
print(len(df))
using merge:
print(len(df)) = 1
using concat:
print(len(df))=6
expected:
print(len(df)) > 10,000
Try explicitly assigning the result of your dask concat back to df:
df = dd.concat([df, dd.from_pandas(panda,npartitions=6)])
And don't duplicate the exact same blocks of code; encapsulate them in another function:
def process_panda(file_wpath, flags):
    data = {
    [...]
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    return panda
Then you just have to test if the file to process is the first, so your main code becomes:
import os
import sys
import h5py
import pandas as pd
import dask.dataframe as dd
import gc
import pprint
from scapy.all import *
flags = {
    'R': 0,
    'A': 1,
    'S': 2,
    'DF': 3,
    'FA': 4,
    'SA': 5,
    'RA': 6,
    'PA': 7,
    'FPA': 8
}
directory = 'dev/streams/'
files = os.listdir(directory)
for file in files:
    file_wpath = os.path.join(directory, file)
    panda = process_panda(file_wpath, flags)
    if file == files[0]:
        df = dd.from_pandas(panda, npartitions=6)
    else:
        df = dd.concat([df, dd.from_pandas(panda, npartitions=6)])
    gc.collect()
print(len(df))
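A further optional refinement, sketched under the assumption that process_panda returns a plain pandas DataFrame as above: instead of calling dd.concat inside the loop, which rebuilds the dask graph on every iteration, you could wrap each file in dask.delayed and build the dask dataframe once with dd.from_delayed (directory and files come from the code above):
import dask
import dask.dataframe as dd

delayed_frames = [
    dask.delayed(process_panda)(os.path.join(directory, f), flags)
    for f in files
]
df = dd.from_delayed(delayed_frames)
print(len(df))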
I've got a CSV file with 20 columns and about 60,000 rows.
I'd like to read fields 2 to 20 only. I've tried the code below, but the browser (using IPython) freezes and it just goes on for ages.
import numpy as np
from numpy import genfromtxt
myFile = 'sampleData.csv'
myData = genfromtxt(myFile, delimiter=',', usecols=(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19))
print myData
How could I tweak this to work better & actually produce output please?
import pandas as pd
myFile = 'sampleData.csv'
df = pd.DataFrame(pd.read_csv(myFile, skiprows=1))  # skipping the header row
print df
This works like a charm
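If only fields 2 to 20 are actually needed (as in the original question), read_csv can also restrict the columns while parsing, which keeps memory down; a small sketch, assuming 0-based column positions 1 through 19:
import pandas as pd

myFile = 'sampleData.csv'
df = pd.read_csv(myFile, usecols=range(1, 20))  # columns 2..20 in 1-based terms
print(df.head())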