Iteration over df rows to sum groups of items - python

I am new to coding. I looked for similar questions on this site, which helped me come up with a working version of my code, but I need help making it more professional.
I need help with iterating over the rows of a data frame in pandas. What I want to do is find identical items (e.g. Groceries) in the 'Description' column, total (sum) their values from the 'Amount' column, and finally write the result to a .csv file. The reason I am doing all this is to compile data for a bar graph I want to create based on those categories.
I was able to accomplish all that with the following code, but it is most likely not very pythonic or efficient. What I did is use a print statement nested in an if statement to get the category label (i) and amount printed to file. The issue is that I had to add a lot of things for the whole to work. First, I had to create an empty list to make sure that the if statement does not trigger .loc every time it sees a desired item in the 'Description' column. Second, I am not sure if saving print statements is the best way to go here, as it appears very ersatz. It feels like I am a step away from using punch cards. In short, I would appreciate it if someone could help me bring my code up to standard.
'''
import sys

used_list = []
for i in df['Description']:
    if i in used_list:
        continue
    sys.stdout = open(file_to_hist, "a")
    print(i, ',', df.loc[df['Description'] == i, 'Amount'].sum())
    used_list.append(i)
'''
I also tried a slightly different approach (saving results directly into a df), but then I get NaN values all across the 'Amount' column and no other errors (exit code 0) to help me understand what is going on:
'''
used_list = []
df_hist_data = pd.DataFrame(columns=['Description', 'Amount'])
for i in df['Description']:
    if i in used_list:
        continue
    df_hist_data = df_hist_data.append({'Description' : i}, {'Amount' : df.loc[df['Description'] == i, 'Amount'].sum()})
    used_list.append(i)
print(df_hist_data)
'''

You can select only the rows that match a criterion with
df[ a boolean mask here ]
When doing df["a column name"] == "value" you actually get a boolean mask: a Series that is True for the rows where "a column name" == "value" and False for the others.
To summarize: Dataframe[Dataframe["Description"] == "banana"] gives you a view of a new dataframe where only the rows matching your condition are kept (the original dataframe is not altered).
If you select the "Amount" column of this filtered dataframe and .sum() it, you have what you wanted, in one line.
That's the typical "pandorable" (the pandas equivalent of "pythonic") way of doing conditional sums.
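For instance, on a small frame (values made up), the conditional sum described above is just:

import pandas as pd

Dataframe = pd.DataFrame({"Description": ["apple", "banana", "banana"],
                          "Amount": [15, 1, 4]})

# Boolean mask -> filtered view -> summed column, all in one line.
total = Dataframe[Dataframe["Description"] == "banana"]["Amount"].sum()
print(total)  # 5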
If the rows you need to select can match multiple values, use .isin() to get your boolean mask:
Dataframe["Description"].isin(["banana","apple"])
Then, to scan all possible values of "Description" in your dataframe, use .unique() when generating your iterator.
You can then append a Series per value to your empty dataframe before saving it as csv.
Overall, we get this code:
import pandas as pd

Dataframe = pd.DataFrame([
    {"Description": "apple", "Amount": 15},
    {"Description": "banana", "Amount": 1},
    {"Description": "berry", "Amount": 155},
    {"Description": "banana", "Amount": 4}])

df_hist_data = pd.DataFrame(columns=['Description', 'Sum'])
for item in Dataframe["Description"].unique():
    # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
    # on newer versions use pd.concat instead
    df_hist_data = df_hist_data.append(pd.Series(
        {"Description": item,
         "Sum": Dataframe[Dataframe["Description"].isin([item])]["Amount"].sum()}
    ), ignore_index=True)
OUT:
  Description  Sum
0       apple   15
1      banana    5
2       berry  155
You can also do it even more pythonically, in one line, with a list comprehension:
selector = "Description"
sum_on = "Amount"
new_df = pd.DataFrame([{selector: item,
                        sum_on: Dataframe[Dataframe[selector].isin([item])][sum_on].sum()}
                       for item in Dataframe[selector].unique()])
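For what it's worth, the same per-category totals can also come straight from groupby, which is the usual pandas idiom for this kind of aggregation; a minimal sketch on the same toy data (the output filename is made up):

import pandas as pd

df = pd.DataFrame([{"Description": "apple", "Amount": 15},
                   {"Description": "banana", "Amount": 1},
                   {"Description": "berry", "Amount": 155},
                   {"Description": "banana", "Amount": 4}])

# One row per unique Description with the summed Amount,
# ready to write out for the bar graph.
totals = df.groupby("Description", as_index=False)["Amount"].sum()
totals.to_csv("hist_data.csv", index=False)  # "hist_data.csv" is a made-up filename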

Related

Remove extra index in a dataframe

I would like to remove the extra index called service_type_id, which I have not included in my code but which just appears without any reason. I am using Python.
My code is
data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')
The output is this table with the extra index (screenshot omitted).
I believe it is something to do with the groupby and unstack. Kindly highlight why there is an extra index and what my code should be.
The dataset
https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing
I hope pandas.DataFrame.droplevel can do the job for your query:
import pandas as pd

df = pd.read_csv('Dataset - Transaction.csv')
data_tr = (df.groupby(['transaction_id', 'service_type']).sum()
             .unstack()
             .reset_index()
             .fillna(0)
             .set_index('transaction_id')
             .droplevel(0, 1))
data_tr.head(2)
Output (screenshot omitted).
df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id:
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)
MultiIndex([('service_type_id', '3 Phase Wiring'),
('service_type_id', 'AV Equipment')
...
('service_type_id', 'Yoga Lessons'),
('service_type_id', 'Zumba Classes')],
names=[None, 'service_type'], length=188)
#print(data_tr.info())
Initially there was only one column (service_type_id) and two index levels (transaction_id, service_type). After you unstack, service_type becomes tuple-like columns (a MultiIndex), where each service type holds its value of service_type_id. droplevel(0, 1) drops the outer level along the column axis, converting the columns from a MultiIndex to a single Index, as follows:
print(data_tr.columns)
Index(['3 Phase Wiring', ......,'Zumba Classes'],
dtype='object', name='service_type', length=188)
It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that your sum generates a sum for every (numerical) column.
For insight, try to execute just
data.groupby(['transaction_id', 'service_type']).sum()
Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:
data_tr = data[['transaction_id', 'service_type', 'service_type_id']].pivot(
    index='transaction_id',
    columns='service_type',
    values='service_type_id',
).fillna(0)
Depending on how you like your data structure, you can follow up with a .reset_index()
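For illustration, here is the same pivot on a few made-up rows (column names taken from the question, values invented):

import pandas as pd

# Made-up rows using the question's column names.
data = pd.DataFrame({'transaction_id': [1, 1, 2],
                     'service_type': ['Yoga Lessons', 'Zumba Classes', 'Yoga Lessons'],
                     'service_type_id': [187, 188, 187]})

data_tr = data[['transaction_id', 'service_type', 'service_type_id']].pivot(
    index='transaction_id', columns='service_type', values='service_type_id'
).fillna(0)
print(data_tr)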

How to extract specific row from dataset and append into another list

I have a dataset of Big Mart sales. First I select a column (Item_Type) and then use a for loop to match 'Dairy' across the whole column. The problem I'm facing is that I could not get the whole row. I'm a beginner.
(Screenshots of the dataset and of mainlist omitted.)
df = pd.read_csv('bm_Train.csv')
dt = df['Item_Type']
list_of_item = dt.to_list()
mainlist = []
i = 1
for x in list_of_item:
    if x == "Dairy":
        mainlist.append(dt[:i])
    i += i
mainlist
If you want to extract the contents of the rows matching the filtering criteria into a list, you can use .loc to filter the contents and then .values.tolist() to convert the row contents into a list, as follows:
mainlist = df.loc[df['Item_Type'] == "Dairy"].values.tolist()
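For example, on a toy frame (made-up values; Item_MRP is just an invented second column for illustration):

import pandas as pd

df = pd.DataFrame({'Item_Type': ['Dairy', 'Soft Drinks', 'Dairy'],
                   'Item_MRP': [249.8, 48.3, 141.6]})

mainlist = df.loc[df['Item_Type'] == "Dairy"].values.tolist()
print(mainlist)  # [['Dairy', 249.8], ['Dairy', 141.6]]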
If you want to filter the dataset based on 'Item_Type' == 'Dairy', you can try with the following code.
df_new = df.loc[df['Item_Type'] == 'Dairy'].copy()
On the other hand, if you only want the distinct values of the 'Item_Type' column:
list_of_item = df['Item_Type'].unique().tolist()
Otherwise, please let me know what you want.

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas data frame, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc")
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:  # plain `in` does the substring check
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
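For reference, a minimal sketch of the itertuples variant of the loop above (same assumed column names; values assumed to be strings):

# Sketch only: itertuples yields namedtuples, with the index available as .Index.
for df_row in df.itertuples():
    for box_row in box.itertuples():
        if df_row.date in box_row.boxscore_index:  # substring match
            df.at[df_row.Index, "date"] = box_row.boxscore_index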
The KeyError: 0 is saying you can't get the row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]; you'd have to convert all the .loc calls to that.
Like the other post said though, I wouldn't go that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work, or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")

I have a list where I want each element of the list to be a single row

I have a list of lists and I want to assign each of the lists to a specific column. I have created the columns of the DataFrame, but in each column the elements are coming out as a list. I want each element of this list to be a separate row in that particular column.
Here's what I did:
df = pd.DataFrame([np.array(dataset).T],columns=list1)
print(df)
Attached is a screenshot of the output.
I want each element of that list to be a row, as my output.
This should do the work for you:
import pandas as pd
Fasteners = ['Screws & Bolts', 'Threaded Rods & Studs', 'Eyebolts', 'U-Bolts']
Adhesives_and_Tape = ['Adhesives','Tape','Hook & Loop']
Weld_Braz_Sold = ['Electrodes & Wire','Gas Regulators','Welding Gloves','Welding Helmets & Glasses','Protective Screens']
df = pd.DataFrame({'Fastener': pd.Series(Fasteners),
                   'Adhesives_and_Tape': pd.Series(Adhesives_and_Tape),
                   'Weld_Braz_Sold': pd.Series(Weld_Braz_Sold)})
print(df)
Please provide the structure of the database you are starting from, or the structure of the respective lists; I can then give you a more focused answer to your specific problem.
If the structure gets larger, you can also iterate through all the lists when generating the data frame. This is just the basic process to solve your question.
Feel free to comment for further help.
EDIT
If you want to loop through a database of lists, use the following code additionally:

for i in range(len(list1)):
    df.iloc[:, i] = pd.Series(dataset[i])
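As a generalization of the same pd.Series construction, a dict comprehension also works; a sketch with hypothetical names matching the question (list1 as column names, dataset as one list of values per column):

import pandas as pd

# Hypothetical inputs mirroring the question's names.
list1 = ['Fastener', 'Adhesives_and_Tape']
dataset = [['Screws & Bolts', 'Eyebolts'],
           ['Adhesives', 'Tape', 'Hook & Loop']]

# pd.Series pads the shorter columns with NaN, so unequal lengths are fine.
df = pd.DataFrame({name: pd.Series(values) for name, values in zip(list1, dataset)})
print(df)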

python pandas how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
note
The ultimate goal is to merge the two sets on (shipnumber == shipnumber | tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th.
Why not do something like:
Count = 0

def MergeFunction(rowElement):
    global Count
    df2_row = df2.iloc[Count]  # pairs df1 and df2 rows positionally
    if (df2_row['ShipNumber'] in rowElement['Comment']
            or df2_row['TrackNumber'] in rowElement['Comment']):
        rowElement['AmountReceived'] = df2_row['AmountReceived']
    Count += 1
    return rowElement

df1['AmountReceived'] = 0  # fill with zeros first
new_df = df1.apply(MergeFunction, axis=1)
Here is an example based on some made-up data. Ignore the complete nonsense I've put in the dataframes; I was just typing random stuff to get a sample df to play with.
import pandas as pd

x = pd.DataFrame({'Location': ['Chicago', 'Houston', 'Los Angeles', 'Boston', 'NYC', 'blah'],
                  'Comments': ['chicago is winter', 'la is summer', 'boston is winter',
                               'dallas is spring', 'NYC is spring', 'seattle foo'],
                  'Dir': ['N', 'S', 'E', 'W', 'S', 'E']})
y = pd.DataFrame({'Location': ['Miami', 'Dallas'],
                  'Season': ['Spring', 'Fall']})

def findval(row):
    comment, location, season = map(lambda v: str(v).lower(), row)
    return location in comment or season in comment

merged = pd.concat([x, y])
merged['Helper'] = merged[['Comments', 'Location', 'Season']].apply(findval, axis=1)
print(merged)

filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column that records whether the string of one column is found in another. Once you have that helper column, just filter on the True values.
You could index the Comment field using a library like Whoosh, and then do a text search for each shipment number you want to search by.
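A rough sketch of that idea on toy stand-ins for the question's tables, using only Whoosh's basic API (Schema, create_in, QueryParser); treat this as a starting point under those assumptions, not a tested solution:

import os
import pandas as pd
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# Toy stand-ins for the question's tables.
df1 = pd.DataFrame({'Comment': ['received SN123 in good shape', 'no identifier given']})
df2 = pd.DataFrame({'ShipNumber': ['SN123'], 'AmountReceived': [42.0]})

# Index every Comment, remembering which df1 row it came from.
schema = Schema(row=ID(stored=True), comment=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)
writer = ix.writer()
for i, text in df1['Comment'].items():
    writer.add_document(row=str(i), comment=str(text))
writer.commit()

# Search the index for each ShipNumber.
with ix.searcher() as searcher:
    parser = QueryParser("comment", ix.schema)
    for ship in df2['ShipNumber']:
        for hit in searcher.search(parser.parse(str(ship))):
            print(ship, '-> matches df1 row', hit['row'])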
