How to make a loop in PySpark - Python

I have this code:
list_files = glob.glob("/t/main_folder/*/file_*[0-9].csv")
test = sorted(list_files, key=lambda x: x[-5:])
This code finds the files I need to work with; it found 5 CSV files in different folders.
In the next step I use the code below to process every file that was found. I need to run a full outer join for each file, first for main_folder/folder1/file1.csv, then for main_folder/folder2/file2, and so on until the last file found, one by one.
That is why I need a loop.
df_deltas = spark.read.format("csv").schema(schema).option("header","true")\
.option("delimiter",";").load(test)
df_mirror = spark.read.format("csv").schema(schema).option("header","true")\
.option("delimiter",",").load("/t/org_file.csv").cache()
df_deltas.createOrReplaceTempView("deltas")
df_mirror.createOrReplaceTempView("mirror")
df_mir2=spark.sql("""select
coalesce (deltas.DATA_ACTUAL_DATE,mirror.DATA_ACTUAL_DATE) as DATA_ACTUAL_DATE,
coalesce (deltas.DATA_ACTUAL_END_DATE,mirror.DATA_ACTUAL_END_DATE) as DATA_ACTUAL_END_DATE,
coalesce (deltas.ACCOUNT_RK,mirror.ACCOUNT_RK) as ACCOUNT_RK,
coalesce (deltas.ACCOUNT_NUMBER,mirror.ACCOUNT_NUMBER) as ACCOUNT_NUMBER,
coalesce (deltas.CHAR_TYPE,mirror.CHAR_TYPE) as CHAR_TYPE,
coalesce (deltas.CURRENCY_RK,mirror.CURRENCY_RK) as CURRENCY_RK,
coalesce (deltas.CURRENCY_CODE,mirror.CURRENCY_CODE) as CURRENCY_CODE,
coalesce (deltas.CLIENT_ID,mirror.CLIENT_ID) as CLIENT_ID,
coalesce (deltas.BRANCH_ID,mirror.BRANCH_ID) as BRANCH_ID,
coalesce (deltas.OPEN_IN_INTERNET,mirror.OPEN_IN_INTERNET) as OPEN_IN_INTERNET
from mirror
full outer join deltas on
deltas.ACCOUNT_RK=mirror.ACCOUNT_RK
""")
df_deltas = spark.read.format("csv").schema(schema).option("header","true")\
.option("delimiter",";").load(test)  # <- here I pass the list of found files to .load()
How can I make a loop that processes the first found file, then the second, and so on?

You can use a for loop to do that:
for idx, file in enumerate(test):
    globals()[f"df_{idx}"] = spark.read.format("csv").schema(schema).option("header", "true").option("delimiter", ";").load(file)
This will create DataFrames in the global namespace named df_0 for the first file, df_1 for the second file, and so on. You can then use these DataFrames however you want.
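If the goal is instead to apply the full outer join for each found file in turn, folding each result back into the mirror, you can loop over the sorted list and re-run the query, feeding the output of one iteration into the next. This is only a minimal sketch, assuming spark, schema and test are defined as in the question, and it rebuilds the same coalesce list as the query above programmatically:

# Build the coalesce select list from the same columns used in the query above
cols = ["DATA_ACTUAL_DATE", "DATA_ACTUAL_END_DATE", "ACCOUNT_RK", "ACCOUNT_NUMBER",
        "CHAR_TYPE", "CURRENCY_RK", "CURRENCY_CODE", "CLIENT_ID", "BRANCH_ID", "OPEN_IN_INTERNET"]
select_list = ",\n".join(f"coalesce(deltas.{c}, mirror.{c}) as {c}" for c in cols)

df_mirror = spark.read.format("csv").schema(schema).option("header", "true")\
    .option("delimiter", ",").load("/t/org_file.csv")

for file in test:
    # Read only the current delta file
    df_deltas = spark.read.format("csv").schema(schema).option("header", "true")\
        .option("delimiter", ";").load(file)
    df_deltas.createOrReplaceTempView("deltas")
    df_mirror.createOrReplaceTempView("mirror")
    # Same full outer join as in the question; the result becomes the mirror for the next file
    df_mirror = spark.sql(f"""select
        {select_list}
        from mirror
        full outer join deltas on deltas.ACCOUNT_RK = mirror.ACCOUNT_RK
        """)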

Related

Implement pandas groupby method on a dataframe with certain conditions

I am working with about 1000 XML files. I have written a script where the program loops through the folder containing these XML files, and I have achieved the following:
Created a list with all the paths of the XML files.
Read the files and extracted the values I need to work with.
Built a new dataframe which consists of the only two columns I need to work with.
Here is the full code:
import glob
import pandas as pd

# Empty list to store path of xml files
path_list = []
# Function to iterate folder and store path of xml files.
# Can be modified to take the path as an argument via command line if required
time_sum = []
testcase = []

def calc_time(path):
    for path in glob.iglob(f'{path}/*.xml'):
        path_list.append(path)
    try:
        for file in path_list:
            xml_df = pd.read_xml(file, xpath=".//testcase")
            # Get the classname values from the XML file
            testcase_v = xml_df.at[0, 'classname']
            testcase.append(testcase_v)
            # Get the aggregate time value of all instances of the classname
            time_sum_test = xml_df['time'].sum()
            time_sum.append(time_sum_test)
        new_df = pd.DataFrame({'testcase': testcase, 'time': time_sum})
    except Exception as ex:
        msg_template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = msg_template.format(type(ex).__name__, ex.args)
        print(message)

calc_time('assignment-1/data')
Now I need to group these values based on the following condition:
Equally distribute the classnames by their time into 5 groups, so that the total time for each group is approximately the same.
The new_df looks like this:
'TestMensaSynthesis': 0.49499999999999994,
'SyncVehiclesTest': 0.303,
'CallsPromotionEligibilityTask': 3.722,
'TestSambaSafetyMvrOverCustomer': 8.546,
'TestScheduledRentalPricingEstimateAPI': 1.6360000000000001,
'TestBulkImportWithHWRegistration': 0.7819999999999999,
'calendars.tests.test_intervals.TestTimeInterval': 0.006,
The dataframe has more than 1000 rows containing the classname and time.
I need to add a groupby statement which makes 5 groups of these classes so that the total time of the groups is approximately equal to each other.
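This balanced split is not something groupby does directly; one common approach is a greedy partition: sort the classes by time in descending order and always assign the next class to the group with the smallest running total. A minimal sketch, assuming new_df (built from the testcase and time_sum lists above) is available with the testcase and time columns:

def split_into_groups(df, n_groups=5):
    # Place the longest-running classes first, then balance the rest
    ordered = df.sort_values('time', ascending=False)
    groups = [[] for _ in range(n_groups)]
    totals = [0.0] * n_groups
    for _, row in ordered.iterrows():
        # Assign the next class to the group with the smallest running total
        idx = totals.index(min(totals))
        groups[idx].append(row['testcase'])
        totals[idx] += row['time']
    return groups, totals

groups, totals = split_into_groups(new_df)
for i, (members, total) in enumerate(zip(groups, totals)):
    print(f"group {i}: total time {total:.3f}, {len(members)} testcases")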

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare the data and get a percentage of accuracy based on them.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudocode would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific rows/columns for an entire Excel sheet?
Convert both to pandas DataFrames and compare them, similarly to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and DataFrames is, in my opinion, the first step to working with it.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example, so that's good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')  # dropped the S from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # in the original example it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contain integers, and according to your specifications as best I can at the moment. It's my first attempt, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you asked for in basic terms and gives you the skeleton: it imports the CSV in Python using the pandas module, converts it to DataFrames, works on specific columns only in those DataFrames, makes new result columns, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works. It is work in progress and there are some redundant lines of code in it (I'm a self-teaching newbie on my weekends), and I'll hopefully improve its readability, scope and functionality later, but you can see how to convert your columns and rows into lists with pandas DataFrames, start to do calculations with them in Python, and get your results back out to a new CSV. It's a starting point for answering your question going forward.
import pandas as pd
from pandas import DataFrame
import csv
import itertools  # redundant now?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats)  # good for comparing whole columns as a block; part of a later attempt to compare text fields with integer fields, doesn't affect the program
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re  # for string matching & comparison; redundant in this example, but you'll need it to compare tables of strings
#filename = 'CATS&LABS64.csv'  # redundant now that the export is handled below
#csvfile = open(filename, 'a')  # from my first attempt to export the results - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)
# You can do other matches, comparisons and calculations yourself here and add them to the output
# Save the original data plus the new difference column to a new CSV
A.to_csv('some_name5523.csv', index=False)
Yes, I know it's by no means perfect at all, but I wanted to give you the heads up about pandas and DataFrames for doing what you want moving forward.
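As a more direct take on the accuracy calculation itself, here is a minimal pandas sketch; the file name data.csv is a placeholder, and it assumes the two columns to compare sit at positions 8 and 10, as in the pseudocode:

import pandas as pd

# Placeholder file name; the columns are picked by position (8 and 10)
df = pd.read_csv('data.csv')
category = df.iloc[:, 8]
label = df.iloc[:, 10]

correct = (category == label).sum()   # rows where the two columns agree
total_checked = len(df)
difference = total_checked - correct

accuracy = correct / total_checked * 100
print(f"correct: {correct}, different: {difference}, accuracy: {accuracy:.2f}%")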

Save multiple dataFrames in a loop using to_pickle

Hi, I have 4 pandas DataFrames: df1, df2, df3, df4.
What I'd like to do is iterate (using a for loop) over saving these DataFrames with to_pickle.
What I did is this:
out = 'mypath\\myfolder\\'
r = [orders, adobe, mails, sells]
for i in r:
    i.to_pickle(out + '\\i.pkl')
The command runs, but it does not save each DataFrame under its own name; it keeps overwriting the same file i.pkl (I think because my code is not correct).
It seems it can't name each file after its DataFrame (e.g. inside the for loop, orders is saved with the name i.pkl, and the same happens to the other DataFrames involved).
What I expect is to have the 4 DataFrames saved with the names from the list r (so: orders.pkl, adobe.pkl, mails.pkl, sells.pkl).
How can I do this?
You can't stringify the variable name (this is not something you generally do), but you can do something simple:
import os
out = 'mypath\\myfolder\\'
df_list = [df1, df2, df3, df4]
for i, df in enumerate(df_list, 1):
    df.to_pickle(os.path.join(out, f'df{i}.pkl'))
If you want to provide custom names for your files, here is my suggestion: use a dictionary.
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))
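For completeness, a minimal sketch of loading the pickles back in later, assuming the same out folder and names as above:

import os
import pandas as pd

out = 'mypath\\myfolder\\'
names = ['orders', 'adobe', 'mails', 'sells']
# Load each pickle back into a dict keyed by its original name
loaded = {name: pd.read_pickle(os.path.join(out, f'{name}.pkl')) for name in names}
print(loaded['orders'].head())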

How does one use the mapdict argument in pyexcel?

I'm having some trouble with the mapdict argument in the save_to_database function in pyexcel.
It seems that I still need to have a row of column names at the beginning of my files, otherwise I get an error. Doesn't mapdict specify the names to use for each column once they have been converted to a dictionary?
I'm very unsure of what this argument actually does...
Any help would be appreciated!!
Look, it's simple.
If you have a CSV like this:
brand,sku,description,quantity,price
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
br,qw3234,s sdf sd ,4,23.5
you don't need mapdict.
But if your CSV comes without that first row,
you need it. For example, here is one piece from my Flask project:
def article_init_func(row):
    warehouse = Warehouse.query.filter_by(id=id).first()
    a = Article()
    a.pricelist_id = p.id
    a.sku = row['sku']
    a.description = row['description']
    a.brand = row['brand']
    a.quantity = row['quantity']
    a.city = warehouse.city
    a.price = row['price']
    return a

map_row = ['brand', 'sku', 'description', 'quantity', 'price']
request.save_to_database(
    field_name='file', session=db.session,
    initializer=article_init_func,
    table=Article,
    mapdict=map_row)

Sort a list based on values from dataframe

Is my approach here the right way to do it in Python? As I'm new to Python, I appreciate any feedback you can provide, especially if I'm way off here.
My task is to order a list of file names based on values from a dataset. Specifically, these are file names that I need to sort based on site information. The resulting list is the order in which the reports will be printed.
Site Information
key_info = pd.DataFrame({
'key_id': ['1010','3030','2020','5050','4040','4040']
, 'key_name': ['Name_A','Name_B','Name_C','Name_D','Name_E','Name_E']
, 'key_value': [1,2,3,4,5,6]
})
key_info = key_info[['key_id','key_name']].drop_duplicates()
key_info['key_id'] = key_info.key_id.astype('str').astype('int64')
Filenames
These are the file names I need to sort. In this example, I sort by just the key_id, but I assume I could easily add a column to site information and sort it by that as well.
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
Sorting
The resulting "filenames" is the final sorted list.
names_df = pd.DataFrame({'filename': filenames})
names_df['key_id'] = names_df.filename.str[:4].astype('str').astype('int64')
merged_df = pd.merge(key_info, names_df, on='key_id', how='right')
merged_df = merged_df.sort_values('key_id')
filenames = merged_df['filename'].tolist()
I'm looking for any solutions that might be better or more Pythonic. Or, if there is a more appropriate place to post "code review" questions.
I like your use of Pandas, but it isn't the most Pythonic, since it relies on data structures that Pandas layers on top of Python. Nevertheless, I think we can improve on what you have. I will show an improved Pandas version and a completely native Python way to do it. Either is fine, I suppose.
The strictly Python version is best for people who don't know Pandas, as there's a large learning curve associated with it.
Common
For both examples, let's assume a function like this:
def trim_filenames(filename):
    return filename[0:4]
I use this in both examples.
Improvements
# Load the DataFrame and give it a proper index (I added some data)
key_info = pd.DataFrame(index=['2020','5050','4040','4040','6000','7000','1010','3030'], data={'key_name':['Name_C','Name_D','Name_E','Name_E','Name_F','Name_G','Name_A','Name_B'], 'key_value' :[1,2,3,4,5,6,7,8]})
# Eliminate duplicates and sort in one step
key_info = key_info.groupby(key_info.index).first()
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
names_df = pd.DataFrame({'filename': filenames})
# Let's give this an index too so we can match on the index (note the function call)
names_df.index=names_df.filename.transform(trim_filenames)
combined = pd.concat([key_info,names_df], axis=1)
combined matches by index, but there are some keys with no filenames. It looks like this now:
key_name key_value filename
1010 Name_A 7 1010_Filename
2020 Name_C 1 2020_Filename
3030 Name_B 8 3030_Filename
4040 Name_E 3 4040_Filename
5050 Name_D 2 5050_Filename
6000 Name_F 5 NaN
7000 Name_G 6 NaN
Now we drop the NaN columns and create the list of filenames:
combined.filename.dropna().values.tolist()
['1010_Filename', '2020_Filename', '3030_Filename', '4040_Filename', '5050_Filename']
Python Only Version (no framework)
key_info = {'2020' : {'key_name':'Name_C', 'key_value':1},'5050' : {'key_name':'Name_D', 'key_value':2},'4040' : {'key_name':'Name_E', 'key_value':3},'4040' : {'key_name':'Name_E', 'key_value':4},'6000' : {'key_name':'Name_F', 'key_value':5},'7000' : {'key_name':'Name_G', 'key_value':6},'1010' : {'key_name':'Name_A', 'key_value':7},'3030' : {'key_name':'Name_B', 'key_value':8}}
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
# Let's get a dictionary of filenames that is keyed by the same key as in key_info:
hashed_filenames = {}
for filename in filenames:
    # Note here I'm using the function again
    hashed_filenames[trim_filenames(filename)] = filename
# We'll store the new filenames in new_filenames:
new_filenames = []
# sort the key info and loop it
for key in sorted(key_info.keys()):
    # for each key, if the key matches in the hashed_filenames, then add it to the list
    if key in hashed_filenames:
        new_filenames.append(hashed_filenames[key])
Summary
Both solutions are concise, and I like Pandas, but I prefer something that is immediately readable by anyone who knows Python. The Python-only solution (of course, they are both Python) is the one you should go with, in my opinion.
out_list = []
for x in key_info.key_id:
    for f in filenames:
        if str(x) in f:
            out_list.append(f)

out_list
['1010_Filename', '3030_Filename', '2020_Filename', '5050_Filename', '4040_Filename']
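A compact variant of the same idea (a sketch assuming key_info and filenames as defined in the question): build a map from each key_id to its position in key_info and use it as the sort key, so the files follow key_info's row order and any unknown ids fall to the end.

# Map each key_id to its position in key_info; unknown ids sort to the end
order = {str(k): i for i, k in enumerate(key_info.key_id)}
sorted_files = sorted(filenames, key=lambda f: order.get(f[:4], len(order)))
print(sorted_files)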
