Iterating on Pandas DataFrame to pass data into API - python

I am creating a script that reads a Google Sheet, transforms the data, and passes it into my ERP API to automate the creation of Purchase Orders.
I have got as far as outputting the data in a dataframe, but I need help with how to iterate through it and pass it to the API in the correct format.
DataFrame Example (dfRow):
productID vatrateID amount price
0 46771 2 1 1.25
1 46771 2 1 2.25
2 46771 2 2 5.00
Formatting of the API data:
vatrateID1=dfRow.vatrateID[0],
amount1=dfRow.amount[0],
price1=dfRow.price[0],
productID1=dfRow.productID[0],
vatrateID2=dfRow.vatrateID[1],
amount2=dfRow.amount[1],
price2=dfRow.price[1],
productID2=dfRow.productID[1],
vatrateID3=dfRow.vatrateID[2],
amount3=dfRow.amount[2],
price3=dfRow.price[2],
productID3=dfRow.productID[2],
I would like to create a function that would iterate through the DataFrame and return the data in the correct format to pass to the API.
I'm new to Python and struggle most with iterating / loops, so any help is much appreciated!

First, you can always loop over the rows of a dataframe using df.iterrows(). Each step through this iterator yields a tuple containing the row index and the row contents as a pandas Series object. So, for example, this would do the trick:
for ix, row in df.iterrows():
    for column in row.index:
        print(f"{column}{ix}={row[column]}")
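If you need to hand those values to the API rather than print them, the same loop can collect them into a dictionary of keyword arguments instead. This is just a sketch: the numbering starts at 1 to match the field names shown in the question (vatrateID1 for the first row), and api_create_purchase_order is a hypothetical stand-in for whatever call your ERP client actually exposes.
def build_po_payload(dfRow):
    # Builds {"vatrateID1": ..., "amount1": ..., ...} from the dataframe,
    # numbering the rows from 1 to match the API field names.
    payload = {}
    for i, (ix, row) in enumerate(dfRow.iterrows(), start=1):
        for column in row.index:
            payload[f"{column}{i}"] = row[column]
    return payload

payload = build_po_payload(dfRow)
# api_create_purchase_order(**payload)  # hypothetical ERP call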
You can also do it without resorting to loops. This is great if you need performance, but if performance isn't a concern then it is really just a matter of taste.
# first, "melt" the data, which puts all of the variables on their own row
x = df.reset_index().melt(id_vars='index')
# now join the columns together to produce the rows that we want
s = x['variable'] + x['index'].map(str) + '=' + x['value'].map(str)
print(s)
0 productID0=46771.0
1 productID1=46771.0
2 productID2=46771.0
3 vatrateID0=2.0
...
10 price1=2.25
11 price2=5.0
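If you need the melted version as keyword arguments rather than printed strings, the same frame can be collapsed into a dict (a sketch; the +1 shifts the 0-based index to the 1-based suffixes the API format above uses, and api_call is a hypothetical placeholder):
payload = dict(zip(x['variable'] + (x['index'] + 1).astype(str), x['value']))
# api_call(**payload)  # hypothetical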

Related

How to create python function that performs multiple checks on a dataframe?

I have multiple inventory tables like so:
line no    -1 qty     -2 qty
1          -          3
2          42.1 FT    -
3          5          -
4          -          10 FT
5          2          1
6          6.7        -
or
line no    qty
1          2
2          4.5 KG
3          5
4
5          13
6          AR
I want to create a logic check for the quantity column using Python. (The table may have more than one qty column and I need to be able to check all of them. In both examples, I have the tables formatted as dataframes.)
Acceptable criteria:
integer with or without "EA" (meaning each)
"AR" (as required)
integer or float with unit of measure
if multiple QTY columns, then "-" is also accepted (first table)
I want to return a list per page containing the line no. corresponding to rows where the quantity value is missing (line 4, second table) or does not meet the acceptance criteria (line 6, table 1). If the line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but this creates the quantity columns as a list and yields the following
AttributeError: 'list' object has no attribute 'isnull'
Intro and Suggestions:
Welcome to Stack Overflow. A general tip when asking questions on S.O.: include as much information as possible. In addition, always identify the libraries you want to use and the approach you would accept, since there can be multiple solutions to the same problem; it looks like you've done that.
Also, it is best to share all, or at least most, of your attempted solutions so others can follow your thought process and suggest the best approach for a potential solution.
The Solution:
It wasn't clear if the solution you are looking for required that you read the PDF to create the dataframe or if converting the PDF to a CSV and processing the data using the CSV was sufficient. I took the latter approach.
import tabula as tb
import pandas as pd
#PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
#CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"
#Read the PDF
#id = tb.read_pdf(input_file_path, pages='all')
#Convert the PDF to CSV
cv = tb.convert_into(input_file_path, output_file_path, "csv", pages="all")
#Read initial data
id = pd.read_csv(output_file_path, delimiter=",")
#Print the initial data
print(id)
#Create the dataframe
df = pd.DataFrame(id, columns = ['qty'])
#Print the data as a DataFrame object; boolean values when conditions met
print(df.notna())
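The convert-to-CSV step gets the data into a dataframe, but the acceptance criteria themselves still have to be applied. Below is one way to do that with a regular expression; it is only a sketch, based on the column names from your attempt (qty columns, line_no), and the function name and the exact unit-of-measure pattern are assumptions rather than part of the original answer.
import re

def check_quantities(df):
    # Accept: integer (optionally followed by "EA"), "AR",
    # or an integer/float followed by a unit of measure.
    # "-" is accepted only when there is more than one qty column.
    pattern = re.compile(r'^(\d+\s*(EA)?|AR|\d+(\.\d+)?\s+[A-Z]+)$')
    qty_cols = [col for col in df.columns if 'qty' in col]
    allow_dash = len(qty_cols) > 1

    failed = []
    for _, row in df.iterrows():
        for col in qty_cols:
            value = str(row[col]).strip()
            if value == '-' and allow_dash:
                continue
            if value in ('', 'nan') or not pattern.match(value):
                failed.append(row['line_no'])
                break
    return failed if failed else True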

Iterate through two dataframes and create a dictionary mapping each string in one dataframe to the strings in the second dataframe that contain it as a substring (values)

I have two dataframes. One is very large, with over 4 million rows of data, while the other has about 26k. I'm trying to create a dictionary where the keys are the strings of the smaller dataframe. This dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings; I want to check if the substring from df1 is in the strings in df2 and then create my dict.
No matter what I try, my code takes too long, and I keep looking for faster ways to iterate through the df's.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
part_name
0 aaa
1 bb
2 856
3 cool
4 man
5 a0
df2
full_name
0 aaa35688d
1 coolbbd
2 8564578
3 coolaaa
4 man4857684
5 a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops perform very badly time-wise as the data grows larger. Luckily, pandas allows us to perform vectorised operations across rows/columns.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr)].tolist() for substr in df1.part_name}
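One caveat worth adding (not part of the original answer): str.contains treats its pattern as a regular expression by default, so if the part names can contain characters like . or +, pass regex=False to keep the match literal:
org_dict = {
    substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
    for substr in df1.part_name
}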

What is the Best way to compare large datasets from two different sources in Python?

I have large datasets from 2 sources: one is a huge CSV file and the other comes from a database query. I am writing a validation script to compare the data from both sources and log/print the differences. One thing I think is worth mentioning is that the data from the two sources is not in the exact same format or the same order. For example:
Source 1 (CSV files):
email1#gmail.com,key1,1
email2#gmail.com,key1,3
email1#gmail.com,key2,1
email1#gmail.com,key3,5
email2#gmail.com,key3,2
email2#gmail.com,key3,2
email3#gmail.com,key2,3
email3#gmail.com,key3,1
Source 2 (Database):
email key1 key2 key3
email1#gmail.com 1 1 5
email2#gmail.com 3 2 <null>
email4#gmail.com 1 1 5
The output of the script I want is something like:
source1 - source2 (or csv - db): 2 rows total with differences
email2#gmail.com 3 2 2
email3#gmail.com <null> 3 1
source2 - source1 (or db-csv): 2 rows total with differences
email2#gmail.com 3 2 <null>
email4#gmail.com 1 1 5
The output format could be a little different to show more differences, more clearly (from thousands/millions of records).
I started writing the script to save the data from both sources into two dictionaries, and loop through the dictionaries or create sets from the dictionaries, but it seems like a very inefficient process. I considered using pandas, but pandas doesn't seem to have a way to do this type of comparison of dataframes.
Please tell me if there's a better/more efficient way. Thanks in advance!
You were on the right path. What you want is to quickly match the two tables. Pandas is probably overkill.
You probably want to iterate through your first table and create a dictionary. What you don't want to do is scan the whole second list for each element; even small lists would require a huge number of searches.
The csv module is a good way to read your data from disk. For each row, put it in a dictionary where the key is the email and the value is the complete row. On a typical desktop computer you can iterate over 10 million rows in a second.
Now iterate through the second table, and for each row use the email to get the data from the dictionary. Since a dict is a data structure with O(1) key lookups, you'll only touch N + M rows in total. You should be able to compare both tables in a couple of seconds. It is really simple. Here is some sample code:
import csv

firstTable = {}
with open('firstTable.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        firstTable[row[0]] = row  # email is in row[0]

for row2 in get_db_table2():
    email = row2[0]
    row1 = firstTable[email]  # this is a hash; the access is very quick
    my_complex_comparison_func(row1, row2)
If you don't have enough RAM to fit all the keys of the first dictionary in memory, you can use the shelve module for the firstTable variable. That creates an index on disk with very quick access.
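For example (a sketch; the file name is arbitrary, and shelve keeps the same dictionary-style access while storing the rows in a dbm file on disk):
import csv
import shelve

# Same idea as above, but the lookup table lives on disk instead of in RAM
with shelve.open('firstTable.shelf') as firstTable, \
        open('firstTable.csv', 'r') as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        firstTable[row[0]] = row  # shelve keys must be strings, which emails already are
    # firstTable[email] now works like the in-memory dict, backed by disk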
Since one of your tables is already in a database, maybe what I'd do first is use the database to load the CSV data into a temporary table. Create an index, and do an inner join on the tables (or an outer join if you need to know which rows don't have data in the other table). Databases are optimized for this kind of operation. You can then run a select from Python to get the joined rows and use Python for your complex comparison logic.
You can use pivot to convert the df, then use drop_duplicates after concat:
df2 = df2.applymap(lambda x: pd.to_numeric(x, errors='ignore'))
pd.concat([df.pivot(*df.columns).reset_index(), df2], keys=['db', 'csv']).\
    drop_duplicates(keep=False).\
    reset_index(level=0).\
    rename(columns={'level_0': 'source'})
Out[261]:
key source email key1 key2 key3
1 db email2#gmail.com 3 2 2
1 csv email2#gmail.com 3 2 <null>
Notice: here I am using to_numeric to convert the values in your df2 to numeric.
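Another pandas option (my addition, not from either answer above): reshape the CSV data with pivot and then do an outer merge with indicator=True, which labels each row as present only in the CSV data, only in the database data, or in both. A sketch, assuming df holds the CSV rows as (email, key, value) and df2 holds the database table with matching column names:
import pandas as pd

# Reshape the CSV rows (email, key, value) into one row per email
csv_wide = df.pivot(*df.columns).reset_index()

# Outer merge on all shared columns; indicator=True adds a '_merge' column
merged = csv_wide.merge(df2, how='outer', indicator=True)

# 'left_only'  -> row with these values exists only in the CSV data
# 'right_only' -> row with these values exists only in the database data
differences = merged[merged['_merge'] != 'both']
print(differences)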

Calculating running total

I have a data frame df and I would like to keep a running total of the names that occur in a column of that data frame. I am trying to calculate the running total column:
name running total
a 1
a 2
b 1
a 3
c 1
b 2
There are two ways I thought to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Change the count in the field for each value in the dataframe. In Excel I would use a COUNTIF combined with a drag-down formula, using A$1:A1 to fix the first value but keep the second value relative, so that the range I am looking at changes with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
#bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
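Applied to the sample data above, this reproduces the expected column (a quick check, assuming the names appear in the order shown in the question):
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'a', 'c', 'b']})
df['running total'] = df.groupby(['name']).cumcount() + 1
print(df)
#   name  running total
# 0    a              1
# 1    a              2
# 2    b              1
# 3    a              3
# 4    c              1
# 5    b              2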

Pandas not saving changes when iterating rows

let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the variable Shots by a random value (multiplier in the code) and recalculate the StG variable, which is nothing but Shots/Goals. The code I used is:
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Shots'] *= multiplier
    row['StG'] = float(row['Shots'])/float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) to obtain:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row by row during the for iteration I see them change, but it's like they don't get saved in the df.
I think it is because I'm simply accessing the values, not the actual dataframe.
I thought I should add something like df.row[], but that returns an error saying DataFrame has no row attribute.
Thanks for the help.
____EDIT____
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Impresions'] *= multiplier
    row['Clicks'] *= np.random.randint(1, multiplier+1)
    row['Ctr'] = float(row['Clicks'])/float(row['Impresions'])
    row['Mult'] = multiplier
    #print(row['Clicks'], row['Impresions'], row['Ctr'], row['Mult'])
The main condition is that the number of Clicks can't ever be higher than the number of Impressions.
Then I recalculate the ratio between Clicks/Impressions in Ctr.
I am not sure if multiplying the entire column is the best choice for maintaining the condition that for each row Impr >= Clicks, hence I went row by row.
From the pandas docs about iterrows(): pandas.DataFrame.iterrows
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
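The same idea carries over to the columns in your edit. Because each row's Clicks factor is drawn from 1..multiplier, the condition Clicks <= Impresions is preserved without any row loop. This is only a sketch: the column names are taken from your edit, and it relies on np.random.randint accepting an array as its upper bound.
import numpy as np

multipliers = np.random.randint(1, 5+1, size=len(df))
df['Impresions'] *= multipliers
# One draw per row, each bounded by that row's multiplier
df['Clicks'] *= np.random.randint(1, multipliers + 1)
df['Ctr'] = df['Clicks'] / df['Impresions']
df['Mult'] = multipliers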
Define a function that returns a series:
def f(x):
    m = np.random.randint(1, 5+1)
    return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise; it returns another data frame, which can be used to replace some columns in the existing data frame or to create new columns:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.
