Creating a function that takes an undefined number of parameters - Python

What would be the best way to change code like the below so that it can accept as many dataframes as we would like?
def q_grab(df, df2, df3, q):  # accepts three dataframes and a column name; looks up the column in all dataframes and combines them into one
    data = df[q], df2[q], df3[q]
    headers = [q+"_1", q+"_2", q+"_3"]
    data2 = pd.concat(data, axis=1, keys=headers)
    return data2

q = 'covid_condition'
data2 = q_grab(df, df2, df3, q)  # if I run the function pid_set first, it will create a new df based on pID, it looks like

One approach is to use the * operator to gather the positional arguments into a tuple
(but make your final argument keyword-only, so it isn't part of that tuple).
Something like this:
def q_grab(*dfs, q=None):  # q is keyword-only, signaling the end of the positional dataframe arguments
    data = [df[q] for df in dfs]
    headers = [q + "_" + str(i) for i in range(1, len(dfs) + 1)]  # _1, _2, ... to match the original headers
    data2 = pd.concat(data, axis=1, keys=headers)
    return data2

q = 'covid_condition'
data2 = q_grab(df, df2, df3, q=q)
A probably cleaner alternative is to go ahead and pass a list of dataframes as the first argument:
def q_grab(dfs, q):
called with:
data2 = q_grab([df, df2, df3], q)
using the function code as above.
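Spelled out in full, that list-based version would look something like this (a minimal sketch, assuming pandas is imported as pd and the headers are numbered from 1 as in the question):

import pandas as pd

def q_grab(dfs, q):
    # dfs is an explicit list of dataframes; q is the column to extract from each
    data = [df[q] for df in dfs]
    headers = [q + "_" + str(i) for i in range(1, len(dfs) + 1)]
    return pd.concat(data, axis=1, keys=headers)

data2 = q_grab([df, df2, df3], 'covid_condition')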

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
Outside of a loop, the following code does what I'd like. But it doesn't work when looping and adding new ticker data on each successive pass (or I don't know how to make it work within the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow, since you create a brand-new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores all the ticker DataFrames. After the loop, create the master DataFrame from that list with pandas.concat:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
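One point worth making explicit: pd.concat with axis=1 aligns on the index, so a date present for one ticker but missing for another automatically becomes NaN in the missing ticker's columns, which is exactly the behavior asked for. A tiny demonstration with made-up frames (the names and values here are mine, for illustration only):

import pandas as pd

a = pd.DataFrame({'tkr1Col1': [1, 2, 3]},
                 index=pd.to_datetime(['2022-06-25', '2022-06-26', '2022-06-27']))
b = pd.DataFrame({'tkr2Col1': [10, 20]},
                 index=pd.to_datetime(['2022-06-26', '2022-06-27']))

# tkr2Col1 is NaN on 2022-06-25, since tkr2 has no data for that date
print(pd.concat([a, b], axis=1))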
As a suggestion, here is a cleaner way of defining your clean_func, using DataFrame.set_index and DataFrame.add_prefix:
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you prefer, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call, by specifying the index_col and parse_dates parameters:
import pandas as pd

def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
Before the loop, create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
make sure your parentheses are correct per above: the dataframes must be wrapped in a tuple, otherwise the second dataframe lands in concat's second positional slot, which is axis.
With the code above, you can skip the original step:
df2 = clean_func(tkr, df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined, df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat call.
Same answer as GC123, but here is a full example which mimics reading from separate files and concatenating them:
import pandas as pd
import io

fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1, fake_file_2]

combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)

print(combined)
Output: the same combined table shown at the end below.
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01

Is there a way of dynamically finding partial matching numbers between columns in pandas dataframes?

I'm looking for a way of comparing partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from csv files.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)

df2['Id_number_length'] = df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)

df1['S_number'] = pd.to_numeric(df1['S_number'], downcast='integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast='integer')

inner_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='inner')
print('MATCH!:\n', inner_join)

outer_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='outer', indicator=True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following as the result of the inner join (or whatever method works):

df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like the following (very crude) pseudocode would work, using count_list to strip parts of the numbers in df1 so they fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):

for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...

But the lengths in count_list are variable, so this for loop is an inefficient approach.
Any help with this will be very appreciated; I've been stuck on this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, the SSN 123456789 can be mapped to [1,2,3...9,12,23,34,45..89,...12345678,23456789,123456789], i.e. every contiguous substring. While this looks wasteful, from an algorithmic standpoint it is O(1) work per row and therefore O(N) in total.
Using this new column as the key, a simple merge can then combine the two DataFrames, which is typically O(N log N).
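A minimal sketch of this idea (the substrings helper and the small example frames here are illustrative, not from the answer; it covers the direction where Id_number is contained in S_number, and the reverse direction, e.g. '0271600' containing '271600', would require exploding df2 the same way):

import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "814452", "215446"]})
df2 = pd.DataFrame({"Id_number": ["14452", "60078", "342964", "1544"]})

def substrings(s):
    # All contiguous substrings of s; for fixed-width IDs this is a
    # constant number per row (45 for a 9-digit SSN)
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

# Explode each S_number into one row per substring, keeping the original value
exploded = df1.assign(key=df1["S_number"].map(substrings)).explode("key")

# An ordinary inner merge on the substring key now finds the partial matches
matches = exploded.merge(df2, left_on="key", right_on="Id_number")
print(matches[["S_number", "Id_number"]])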
Here is an example of what I would do. I hope I've understood; feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel, delayed

# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})

# Initiate an empty list for the index pairs
IDX = []

# Using a function so the work can be parallelized if the database is big
def func(x, y):
    # True when every character of df1.S_number[x] appears somewhere in
    # df2.Id_number[y] (character membership, not a strict substring test)
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)

# Using all available processors
number_of_cpu = joblib.cpu_count()

# Preparing a delayed function to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))

# Running it with processes rather than threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")

# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))

# Dropping the Nones
IDX = list(filter(None, IDX[0]))

# Making df3 from the tuples of indexes
df3 = pd.DataFrame(IDX)

# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
Output: df3, listing each matching index pair alongside its S_number and Id_number values.

Cannot assign to function call when looping through and converting excel files

With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
'df{}'.format(str(i)) = pd.read_excel('test.xlsx',sheet_name=snlist,
skiprows=range(6))
^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df + str(i) also returns an error.
I want the result to be something like:
df1 = pd.read_excel.. list1...
df2 = pd.read_excel... list2....
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)), which is a string like "df1", "df2", etc. That is why you get this error message; it is probably confusing because Python treats the left-hand side as a "function call" you are trying to assign to.
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append it to a list, or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
With the list, you can access the DataFrames by indexing like this:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note: list indexes start at 0 here instead of 1
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary (whose keys run from 1 to 12 here), you'd probably want:
for i, df in dataframes.items():
    print(i, df)
    # Here, i is the key and df is the actual DataFrame
If you really do want df1, df2 etc as the keys, then do this instead:
dataframes[f'df{i}'] = df
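A lookup then works by name, for example (assuming the loop above has filled the dictionary):
first_sheet = dataframes['df1']  # the DataFrame read from the first sheet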

Pandas: apply a specific function to columns and create column in new dataframe

I have a dataframe df1, like this:
date          sentence
29/03/2019    i like you
.....
I want to create a new dataframe df2 like this:
date          verb    object
29/03/2019    like    you
....
with a function like this:
def getSplit(df1):
    verbList = []
    objList = []
    for row in df1['sentence']:
        verb = getVerb(row)
        obj = getObj(row)
        verbList.append(verb)
        objList.append(obj)
    df2 = df1[['date']].copy()
    df2['verb'] = verbList
    df2['object'] = objList
    return df2

My function runs well, but it's slow. Could someone help me improve the function so that it runs faster?
Thank you.
You can use the pandas apply method, which is typically faster than an explicit Python loop:
def getverb(row):
    pass  # your function

def getobj(row):
    pass  # your function

df2 = df1.copy()  # making a copy of your dataframe
df2['verb'] = df2['sentence'].apply(getverb)
df2['obj'] = df2['sentence'].apply(getobj)
df2.drop('sentence', axis=1, inplace=True)  # dropping the sentence column
df2
I hope this helps.

How can I save a pandas dataframe to a Django model?

I am trying to read a csv file using pandas, parse it, and then upload the results into my Django database. For now I am converting each dataframe column to a list and then iterating over the lists to save the rows in the DB. But my solution is inefficient when the lists are really big. How can I make it better?
fileinfo = pd.read_csv(csv_file, sep=',',
                       names=['Series_reference', 'Period', 'Data_value', 'STATUS',
                              'UNITS', 'Subject', 'Group', 'Series_title_1', 'Series_title_2',
                              'Series_title_3', 'Series_tile_4', 'Series_tile_5'],
                       skiprows=1)
# serie = fileinfo[fileinfo['Series_reference']]
s = fileinfo['Series_reference'].values.tolist()
p = fileinfo['Period'].values.tolist()
d = fileinfo['Data_value'].values.tolist()
st = fileinfo['STATUS'].values.tolist()
u = fileinfo['UNITS'].values.tolist()
sub = fileinfo['Subject'].values.tolist()
gr = fileinfo['Group'].values.tolist()
stt = fileinfo['Series_title_1'].values.tolist()

count = 0  # count must be initialized before the loop
while count < len(s):
    b = Testdata(
        Series_reference=s[count],
        Period=p[count],
        Data_value=d[count],
        STATUS=st[count],
        UNITS=u[count],
        Subject=sub[count],
        Group=gr[count],
        Series_title_1=stt[count]
    )
    b.save()
    count = count + 1
You can use the pandas apply function. Pass axis=1 to apply a given function to every row:
df.apply(
    creational_function,  # method that creates your structure
    axis=1,               # apply to every row
    args=(arg1, arg2)     # additional args to creational_function
)
In creational_function, the first argument received is the row, from which you can access specific columns just as with the original dataframe:
def creational_function(row, arg1, arg2):
    s = row['Series_reference']
    # For brevity I skip the other arguments...
    # Create Testdata
    # Save
Note that arg1 and arg2 are the same for every row.
If you want to do something more with your created Testdata objects, you can change creational_function to return a value; df.apply will then return a Series containing all the elements returned by the passed function.
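Building on that last point, here is a hedged sketch (assuming the Testdata model and the fileinfo dataframe from the question) that returns unsaved instances from creational_function and persists them in a single query with Django's bulk_create, instead of one INSERT per row:

def creational_function(row):
    # Build an unsaved model instance from the row's columns
    # (only a few fields shown; add the rest the same way)
    return Testdata(
        Series_reference=row['Series_reference'],
        Period=row['Period'],
        Data_value=row['Data_value'],
    )

# axis=1 applies the function to every row; the result is a Series of Testdata objects
objects = fileinfo.apply(creational_function, axis=1)

# One bulk INSERT instead of one query per row
Testdata.objects.bulk_create(list(objects))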
