Function to print a dataframe that takes the df name as an argument - Python

Inside the function, I can't use an argument to stand in for the dataframe in df.to_csv().
I have a long script to pull apart and understand. To do so I want to save the different dataframes it uses and store them in order. I created a function to do this and add the order number, e.g. 01 (number_of_interim_exports), to the name (from an argument).
My problem is that I need to use this for multiple dataframe names, but the df.to_csv part won't accept an argument in place of df...
def print_interim_results_any(name, num_exports, df_name):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csvName = str(number_of_interim_exports).zfill(2) + "_" + name
        interimFileName = "interim_export_" + csvName + ".csv"
        df.to_csv(interimFileName, sep=';', encoding='utf-8', index=False)  # fails: 'df' is not defined inside the function
        number_of_interim_exports += 1

I think I just screwed something else up; this works fine:
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3]})

def f(frame):
    frame.to_csv("interimFileName.csv")

f(df)
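Putting the two together, here is a minimal sketch of the helper that takes the DataFrame object itself as an argument, assuming the number_of_interim_exports and print_interim_outputs globals from the question:

import pandas as pd

number_of_interim_exports = 1
print_interim_outputs = 1

def print_interim_results_any(name, frame):
    # Export `frame` to a numbered interim CSV; `name` is only used to label the file.
    global number_of_interim_exports
    if print_interim_outputs == 1:
        csv_name = str(number_of_interim_exports).zfill(2) + "_" + name
        frame.to_csv("interim_export_" + csv_name + ".csv", sep=';', encoding='utf-8', index=False)
        number_of_interim_exports += 1

df = pd.DataFrame({1: [1, 2, 3]})
print_interim_results_any("raw_input", df)  # writes interim_export_01_raw_input.csv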

Get only the name of a DataFrame - Python - Pandas

I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that would take the names of my DFs and export them to CSV files that would be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't cast 'x' as a DataFrame it doesn't work, but that's not my problem.
As you can see, I have set up a list variable which I would like to fill with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so... there it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you are creating them / reading data into them).
Then you can access the name like a normal attribute:
import pandas as pd

df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
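Building on that, a sketch of the export helper from the question, assuming every frame in the list has had .name set beforehand (the frames and names below are just illustrative):

import pandas as pd

df_a = pd.DataFrame({'Fact202001': [1, 2, 3]})
df_a.name = 'factures'   # hypothetical name, set when the frame is created
df_b = pd.DataFrame({'val': [4, 5, 6]})
df_b.name = 'valeurs'    # hypothetical name

df_liste = [df_a, df_b]

def export(frames):
    # Write each frame to '<its name>.csv'; relies on .name having been set on each one.
    for frame in frames:
        frame.to_csv(f'{frame.name}.csv', encoding='utf-8')

export(df_liste)  # writes factures.csv and valeurs.csv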

Creating a python function to change sequence of columns

I am able to change the sequence of columns using the code below, which I found on Stack Overflow. Now I am trying to convert it into a function for regular use, but it doesn't seem to do anything. PyCharm says the value of local variable df_name is not used on the last line of my function.
Working Code
columnsPosition = list(df.columns)
F, H = columnsPosition.index('F'), columnsPosition.index('H')
columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
df = df[columnsPosition]
My Function - doesn't work, I need to make this work
def change_col_seq(df_name, old_col_position, new_col_position):
    columnsPosition = list(df_name.columns)
    F, H = columnsPosition.index(old_col_position), columnsPosition.index(new_col_position)
    columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
    df_name = df_name[columnsPosition]  # PyCharm has an issue with this line
I have tried adding a return to the last statement of the function, but I am unable to make it work.
To re-order the Columns
To change the position of 2 columns:
def change_col_seq(df_name: pd.DataFrame, old_col_position: str, new_col_position: str):
    # swap the two columns' data, then swap the labels back so only the positions change
    df_name[new_col_position], df_name[old_col_position] = df_name[old_col_position].copy(), df_name[new_col_position].copy()
    df = df_name.rename(columns={old_col_position: new_col_position, new_col_position: old_col_position})
    return df
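The re-ordered frame is returned, so the caller has to assign the result back (this is what the original function was missing), e.g. with the 'F' and 'H' columns from your working code:

df = change_col_seq(df, 'F', 'H')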
To Rename the Columns
You can use the rename method (Documentation)
If you want to change the name of just one column:
def change_col_name(df_name, old_col_name: str, new_col_name: str):
    df = df_name.rename(columns={old_col_name: new_col_name})
    return df
If you want to change the names of multiple columns:
def change_col_name(df_name, old_col_name: list, new_col_name: list):
    df = df_name.rename(columns=dict(zip(old_col_name, new_col_name)))
    return df

Create and append pandas dummy variables with pipe

I am trying to create a Pandas pipeline that creates dummy variables and appends the columns to the existing dataframe.
Unfortunately, I can't get the appended columns to stick once the pipeline is finished.
Example:
def function(df):
    pass

def create_dummy(df):
    a = pd.get_dummies(df['col'])
    b = df.append(a)
    return b

def mah_pipe(df):
    (df.pipe(function)
       .pipe(create_dummy)
       .pipe(print))
    return df

print(mah_pipe(df))
First - I have no idea if this is good practice.
What's weird is that the .pipe(print) prints the dataframe with appended columns. Yay.
But the statement print(mah_pipe(df)) does not. I thought they would behave the same way.
I have tried to read the documentation about pd.pipe but I couldn't figure it out.
Hoping someone could help shed some light on what's going on.
This is because print in Python returns None. Since the result of the pipe chain is never captured, the transformed df is lost after the print.
pipes in Pandas
In Pandas we expect a pipe chain to flow (df) -> [pipe1] -> (df_1) -> [pipe2] -> (df_2) -> ... -> [pipeN] -> (df_N), so every step should return a frame unless it is the last pipe. By having print as the last pipe, the output of the whole chain is None.
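A quick way to see this:

result = df.pipe(print)  # prints the frame...
print(result)            # ...but prints None, because print() returns None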
Solution
...
def start_pipe(dataf):
    # make a copy to avoid modifying the original
    dataf = dataf.copy()
    return dataf

def create_dummies(dataf, column_name):
    dummies = pd.get_dummies(dataf[column_name])
    dataf[dummies.columns] = dummies
    return dataf

def print_dataf(dataf, n_rows=5):
    print(dataf.head(n_rows))
    return dataf  # this is important
# usage
...
dt = (df
      .pipe(start_pipe)
      .pipe(create_dummies, column_name='a')
      .pipe(print_dataf, n_rows=10)
      )

def mah_pipe(df):
    df = (df
          .pipe(start_pipe)
          .pipe(create_dummies, column_name='a')
          .pipe(print_dataf, n_rows=10)
          )
    return df

print(mah_pipe(df))
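As a side note on the original attempt: df.append stacks rows, not columns. A small, self-contained sketch of the column-wise alternative with pd.concat (column 'a' is just an example):

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x']})
dummies = pd.get_dummies(df['a'])
df = pd.concat([df, dummies], axis=1)  # column-wise, unlike df.append which stacks rows
print(df)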

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns, compare the data, and get a percentage of accuracy based on it.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if category != label:
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas DataFrames and compare them, similarly to the example below. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and DataFrames would be the first step to working with it, imo.
I've taken the liberty and the time/effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example, so that's good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')  # dropped the s from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')

print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()

Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()  # it was Reference_sequences

Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))

print(Ref_dict)
print(Unknown_dict)

import re

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # in the original example it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')

f.close()
And I did it myself too, assuming your columns contain integers, and following your specifications as best I can at the moment. It's my first attempt, so go easy. You could use the code below as a benchmark for how to move forward on your question.
Basically it does what you want and gives you the skeleton: it imports the CSV in Python using the pandas module, converts it to DataFrames, works only on the specific columns in those DataFrames, makes new result columns, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works! I will hopefully come back to it at a later date to improve its readability, scope, functionality and abilities.
This is a work in progress and there are redundant lines of code in it, but you can see how to convert your columns and rows into lists and DataFrames with pandas, start to do calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace=True)
# A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc

print(A.columns)

MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

# Good for comparing whole columns as a block
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

import re  # for string matching & comparison; redundant for these integer columns, but you'll need it to compare text

print("Given Dataframe :\n", A)

# Row-by-row difference between the two integer columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)

# You can do other matches, comparisons and calculations here and add them to the output

# Save the results alongside the original data to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know, it's by no means perfect at all, but I wanted to give you a heads-up about pandas and DataFrames for doing what you want moving forward.
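For the accuracy percentage described in the question itself, a minimal sketch (assuming the two columns of interest are called 'Category' and 'Label', as above, and that 'data.csv' is a hypothetical file name):

import pandas as pd

df = pd.read_csv('data.csv')             # hypothetical file with the ~10 columns
matches = df['Category'] == df['Label']  # element-wise comparison of the two columns
correct = int(matches.sum())
total_checked = len(df)
print(f"Accuracy: {correct / total_checked:.1%} ({correct}/{total_checked})")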

Chaining output between different functions

I'm looking for the name of a procedure that hands output from one function to several others (I'm trying to find better words for my problem). Some pseudo/actual code would be really helpful.
I have written the following code:
def read_data():
    read data from a file
    create df
    return df

def parse_data():
    sorted_df = read_data()
    count lines
    sort by date
    return sorted_df

def add_new_column():
    new_column_df = parse_data()
    add new column
    return new_column_df

def create_plot():
    plot_data = add_new_column()
    create a plot
    display chart
What I'm trying to understand is how to skip a function, e.g. create the following chain: read_data() -> parse_data() -> create_plot().
As the code looks right now (due to all the return values and how they are passed between functions), it requires me to change the input data in the last function, create_plot().
I suspect that I'm creating logically incorrect code.
Any thoughts?
Original code:
import pandas as pd
import matplotlib.pyplot as plt

# Read csv file into a data frame
def read_data():
    raw_data = pd.read_csv('C:/testdata.csv', sep=',', engine='python', encoding='utf-8-sig').replace({'{': '', '}': '', '"': '', ',': ' '}, regex=True)
    return raw_data

def parse_data(raw_data):
    ...
    # Convert CreationDate column into datetime
    raw_data['CreationDate'] = pd.to_datetime(raw_data['CreationDate'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    raw_data.sort_values(by=['CreationDate'], inplace=True, ascending=True)
    parsed_data = raw_data
    return parsed_data

raw_data = read_data()
parsed = parse_data(raw_data)
Pass the data in instead of just effectively "nesting" everything. Any data that a function requires should ideally be passed in to the function as a parameter:
def read_data():
    read data from a file
    create df
    return df

def parse_data(sorted_df):
    count lines
    sort by date
    return sorted_df

def add_new_column(new_column_df):
    add new column
    return new_column_df

def create_plot(plot_data):
    create a plot
    display chart

df = read_data()
parsed = parse_data(df)
added = add_new_column(parsed)
create_plot(added)
Try to make sure functions are only handling what they're directly responsible for. It isn't parse_data's job to know where the data is coming from or to produce the data, so it shouldn't be worrying about that. Let the caller handle that.
The way I have things set up here is often referred to as "piping" or "threading". Information "flows" from one function into the next. In a language like Clojure, this could be written as:
(-> (read-data)
    (parse-data)
    (add-new-column)
    (create-plot))
The threading macro -> frees you from manually needing to handle the data passing. Unfortunately, Python doesn't have anything built in to do this, although it can be achieved using external modules.
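For illustration, a tiny helper that emulates this kind of threading in plain Python (a sketch, not a library API; thread_first is a made-up name):

from functools import reduce

def thread_first(value, *funcs):
    # Pass `value` through each function in turn, like Clojure's -> macro.
    return reduce(lambda acc, fn: fn(acc), funcs, value)

# Skipping a step is then just a matter of leaving it out of the chain:
# thread_first(read_data(), parse_data, create_plot)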
Also note that since dataframes seem to be mutable, you don't actually need to return the altered frames from the functions. If you're just mutating the argument directly, you could pass the same data frame to each of the functions in order instead of placing it in intermediate variables like parsed and added. The way I'm showing here is a general way to set things up, but it can be altered depending on your exact use case.
Use a class to contain your code:
class DataManipulation:
    def __init__(self, path):
        self.df = pd.DataFrame()
        self.read_data(path)

    @staticmethod
    def new(file_path):
        return DataManipulation(file_path)

    def read_data(self, path):
        read data from a file
        self.df = create df

    def parse_data(self):
        use self.df
        count lines
        sort by date
        return self

    def add_new_column(self):
        use self.df
        add new column
        return self

    def create_plot(self):
        use self.df
        create a plot
        display chart
        return self
And then,
d = DataManipulation.new(filepath).parse_data().add_new_column().create_plot()
