Dropping Columns From DataFrame to only show needed ones - python

Basically I want to drop some columns that I don't need, and I'm stumped as to why this isn't working:
import os
import pandas

def summarise(indir, outfile):
    os.chdir(indir)
    filelist = ".txt"
    dflist = []
    colnames = ["DSP Code", "Report Date", "Initial Date", "End Date", "Transaction Type", "Sale Type",
                "Distribution Channel", "Products Origin ID", "Product ID", "Artist", "Title", "Units Sold",
                "Retail Price", "Dealer Price", "Additional Revenue", "Warner Share", "Entity to be billed",
                "E retailer name", "E retailer Country", "End Consumer Country", "Price Code", "Currency Code"]
    for filename in filelist:
        print(filename)
        df = pandas.read_csv('SYB_M_20171001_20171031.txt', header=None, encoding='utf-8', sep='\t',
                             names=colnames, skiprows=3)
        df['data_revenue'] = df['Units Sold'] * df['Dealer Price']  # Multiplying Units with Dealer price = Revenue
        df = df.sort_values(['End Consumer Country', 'Currency Code'])  # Sorts the columns alphabetically
        df.to_csv(outfile + r"\output.csv", index=None)
        dflist.append(filename)
        df.drop(columns='DSP Code')

summarise(r"O:\James Upson\Sound Track Your Brand Testing\SYB Test",
          r"O:\James Upson\Sound Track Your Brand Testing\SYB Test Formatted")
I want to drop all the columns listed in colnames except 'Units Sold', 'Dealer Price', 'End Consumer Country', and 'Currency Code'. I tried to remove one column using df.drop(columns='DSP Code'), but it doesn't seem to work.
Any help would be greatly appreciated :)

You can do it like this:
df.drop(['Col_1', 'col_2'], axis=1, inplace=True)
Or:
df = df.drop(columns=['Col_1', 'col_2'])
As suggested in the comment section, use usecols, which filters the columns at read time: only the listed columns are parsed, so the rest are never processed, which is faster and uses fewer resources:
df = pandas.read_csv('SYB_M_20171001_20171031.txt', encoding='utf-8', sep='\t', usecols=["col1", "col2", "col3"], skiprows=3)
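Here is a runnable sketch of the usecols approach, using an inline stand-in for one of the tab-separated report files (the rows and the shortened colnames list are hypothetical, just to keep the example small):

```python
import io
import pandas

# Hypothetical stand-in for one report file: three junk rows, then tab-separated data
raw = ("junk\njunk\njunk\n"
       "A1\tArtist 1\t5\t2.0\tSE\tSEK\n"
       "A2\tArtist 2\t10\t1.5\tDE\tEUR\n")
colnames = ["Product ID", "Artist", "Units Sold", "Dealer Price",
            "End Consumer Country", "Currency Code"]  # shortened for the sketch
wanted = ["Units Sold", "Dealer Price", "End Consumer Country", "Currency Code"]

# usecols keeps only the wanted columns at parse time
df = pandas.read_csv(io.StringIO(raw), sep='\t', header=None,
                     names=colnames, skiprows=3, usecols=wanted)
print(list(df.columns))  # ['Units Sold', 'Dealer Price', 'End Consumer Country', 'Currency Code']
df['data_revenue'] = df['Units Sold'] * df['Dealer Price']
print(df['data_revenue'].tolist())  # [10.0, 15.0]
```

Note that when header=None and names are given, the strings in usecols are matched against those names.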

df.drop(columns='DSP Code')
This bit is not working because you are not assigning the result back: by default, drop returns a new DataFrame instead of modifying df in place.
df = df.drop(columns='DSP Code')
You can also just keep the columns you care about by copying them into a second dataframe.

According to pandas.DataFrame.drop, it returns a dataframe unless you do the operation inplace.
Returns:
dropped : pandas.DataFrame
inplace : bool, default False
If True, do operation inplace and return None.
Either do it in place: df.drop(columns=['DSP Code'], inplace=True), or store the returned dataframe: df = df.drop(columns=['DSP Code']).

Just do:
df = df[['Units Sold', 'Dealer Price', 'End Consumer Country', 'Currency Code']]
Note the double brackets: you are selecting with a list of column names. This keeps the ones you want instead of dropping the others.
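Both routes end up with the same frame; here is a tiny sketch with hypothetical values:

```python
import pandas as pd

# Toy frame standing in for the report data (hypothetical values)
df = pd.DataFrame({"DSP Code": ["X"], "Units Sold": [5],
                   "Dealer Price": [2.0], "Currency Code": ["SEK"]})

keep = ["Units Sold", "Dealer Price", "Currency Code"]
kept = df[keep]                          # select the columns to keep
dropped = df.drop(columns=["DSP Code"])  # or drop the unwanted ones
print(kept.equals(dropped))  # True
```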

Related

Adding Rows to DataFrame in Pandas - Error - ValueError: cannot set a row with mismatched columns

I'm learning pandas and trying to write some code for myself, so I'm a noob at this.
I'm running into an error:
ValueError: cannot set a row with mismatched columns
I have an app created in PySimpleGui which would take input from user and create a database with columns predefined.
I'm unable to add the rows and running into this issue.
I've made sure I have 6 columns and that the number of values to be inserted is 6, but I still get this error.
The code I have at the moment is :
output_path_csv = os.path.join(os.path.join(os.environ['USERPROFILE']), 'Desktop\\Daily_Tracker.xlsx')
if os.path.exists(output_path_csv):
    pass
else:
    x_header = ['Date', 'Case Number', 'Serial Number', 'Product Name', 'Version', 'Issue']
    df = pd.DataFrame(columns=x_header)
    df.to_excel(output_path_csv, index=False)

p_name = prod_name(values['Serial_Num'])  # Determining the product name
values['Product_Name'] = p_name
v_name = ver_name(values['Serial_Num'])  # Determining the version type of the product selected
values['Product_Type'] = v_name
c_num = values.get('Case_Num')
s_num = values.get('Serial_Num')
c_Issue = values.get('Issue')

# to check for e-mail address
sp_char = "#"
if sp_char in values['Email_Add']:
    doc.render(values)
    output_path_doc = os.path.join(os.path.join(os.environ['USERPROFILE']), 'Desktop\\Notes.docx')
    try:
        doc.save(output_path_doc)
        popup("File Saved", f"File has been saved here : {output_path_doc}")
        os.startfile(output_path_doc)
    except PermissionError:
        popup('File seems to be opened! Please close the file!')
        pass
else:
    popup('Missing # character in the e-mail. Please update')
    pass

df2 = pd.read_excel(output_path_csv)
df2.loc[len(df2.index)] = [to_date, c_num, s_num, p_name, v_name, c_Issue]
df2.to_excel(output_path_csv)
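One common cause of this exact error, which may apply here, is the final df2.to_excel(output_path_csv) call: without index=False it writes the index as an extra column, so the next pd.read_excel returns 7 columns and a 6-item row no longer fits. A minimal reproduction without any Excel file:

```python
import pandas as pd

df = pd.DataFrame(columns=["Date", "Case Number", "Serial Number",
                           "Product Name", "Version", "Issue"])
df.loc[len(df.index)] = ["d", "c", "s", "p", "v", "i"]  # 6 values, 6 columns: OK

# Saving with df.to_excel(path) (no index=False) writes the index as an extra
# column, so the next read_excel yields 7 columns; simulate that here:
df7 = df.reset_index()
try:
    df7.loc[len(df7.index)] = ["d", "c", "s", "p", "v", "i"]  # 6 values, 7 columns
except ValueError as e:
    print(e)  # cannot set a row with mismatched columns
```

If this is the cause, passing index=False to the final to_excel call (as is already done for the first save) should fix it.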

How to search equal product numbers in two columns from two different excel tables and copy-paste certain cells from matched row to new table

I have two excel tables:
old_data.xlsx
Product number   Name             Current price   Other columns
1000             Product name 1   10
AB23104          Product name 2   5
430267           Product name 3   20
new_data.xlsx
Product number   Name                     New price   Other columns
AB23104          Renamed product name 2   20
1000             Renamed product name 1   5
345LKT10023      Product name 4           100
Expected result: table below + 2 feedback messages somewhere
Message 1) Product ID 430267 is missing in new data file
Message 2) Product ID 345LKT10023 is newly added
Product ID    Name of product   New price   Old price
AB23104       Product name 2    20          5
1000          Product name 1    5           10
345LKT10023   Product name 4    100         100
I have this code for now, but it is not working and not finished due to lack of knowledge on my part:
import openpyxl
import pandas as pd

new_datacols = [0, 1, 2]
old_datacols = [0, 1, 2]
new_data = pd.read_excel('new_data.xlsx', skiprows=1, usecols=new_datacols, index_col=0)
old_data = pd.read_excel('old_data.xlsx', skiprows=1, usecols=old_datacols, index_col=0)

def format_data():
    # combine_type = inner, left, right, outer
    df = pd.merge(new_data, old_data, on='Product number', how='outer')
    df = df.rename(columns={"Product number": "Product ID",
                            "Name": "Name of product",
                            "Current price": "Old price"})
    nan_value = float("NaN")
    df.replace("", nan_value, inplace=True)
    df.dropna(subset=["Name of product"], inplace=True)
    df = df[['Product ID', 'Name of product',
             'New price', 'Old price']]
    print(df.columns)
    # df.to_excel('updated_table.xlsx')

if __name__ == "__main__":
    format_data()
This is my attempt. It puts the messages in another sheet in the same file.
import os
import pandas as pd
old_data_filename = r"old_data.xlsx"
new_data_filename = r"new_data.xlsx"
new_spreadsheet_filename = r"updated_products.xlsx"
# Load spreadsheets into a dataframe and set their indexes to "Product number"
old_data_df = pd.read_excel(old_data_filename).set_index("Product number")
new_data_df = pd.read_excel(new_data_filename).set_index("Product number")
# Determine which products are new/missing, and store the corresponding
# messages in a list, which will be written to its own spreadsheet at the end
old_data_products = set(old_data_df.index)
new_data_products = set(new_data_df.index)
new_products = new_data_products - old_data_products
missing_products = old_data_products - new_data_products
messages = [f"Product ID {product} is missing in new data file" for product in missing_products]
messages.extend(f"Product ID {product} is newly added" for product in new_products)
messages = [f"Message {i}) {message}" for i, message in enumerate(messages, start=1)]
# Keep the original product names
new_data_df.update(old_data_df["Name"])
# Old price is the same as new price unless the product is in old_data_df, in which
# case it is old_data_df["Current price"]
new_data_df["Old price"] = new_data_df["New price"]
new_data_df["Old price"].update(old_data_df["Current price"])
# Rename the columns
new_data_df.reset_index(inplace=True)
new_data_df.rename(columns={"Product number": "Product ID",
                            "Name": "Name of product"}, inplace=True)
# Remove all other columns except the ones we want
new_data_df = new_data_df[["Product ID",
                           "Name of product",
                           "New price", "Old price"]]
# Write the new products and messages to separate sheets in the same file
with pd.ExcelWriter(new_spreadsheet_filename) as writer:
    new_data_df.to_excel(writer, "Products", index=False)
    pd.DataFrame({"Messages": messages}).to_excel(writer, "Messages", index=False)
# Launch the new spreadsheet
os.startfile(new_spreadsheet_filename)
EDIT: Code that works with the actual spreadsheets:
import os
import pandas as pd
old_data_filename = r"old_data.xlsx"
new_data_filename = r"new_data.xlsx"
new_spreadsheet_filename = r"updated_products.xlsx"
# Load spreadsheets into a dataframe and set their indexes to "Product number"
old_data_df = pd.read_excel(old_data_filename).set_index("Product ID")
new_data_df = pd.read_excel(new_data_filename).set_index("Product ID")
# Remove duplicated indexes for both the dataframes, keeping only the first occurrence
old_data_df = old_data_df[~old_data_df.index.duplicated()]
new_data_df = new_data_df[~new_data_df.index.duplicated()]
# Determine which products are new/missing, and store the corresponding
# messages in a list, which will be written to its own spreadsheet at the end
old_data_products = set(old_data_df.index)
new_data_products = set(new_data_df.index)
new_products = new_data_products - old_data_products
missing_products = old_data_products - new_data_products
messages = [f"Product ID {product} is missing in new data file" for product in missing_products]
messages.extend(f"Product ID {product} is newly added" for product in new_products)
messages = [f"Message {i}) {message}" for i, message in enumerate(messages, start=1)]
# Keep the original product names
new_data_df.update(old_data_df["Name"])
# Old price is the same as new price unless the product is in old_data_df, in which
# case it is old_data_df["Current price"]
new_data_df["Old price"] = new_data_df["New price"]
new_data_df["Old price"].update(old_data_df["Current price"])
# Rename the "Name" column to "Name of product"
new_data_df.rename(columns={"Name": "Name of product"}, inplace=True)
# Remove all other columns except the ones we want
new_data_df.reset_index(inplace=True)
new_data_df = new_data_df[["Product ID",
                           "Name of product",
                           "New price", "Old price"]]
# Write the new products and messages to separate sheets in the same file
with pd.ExcelWriter(new_spreadsheet_filename) as writer:
    new_data_df.to_excel(writer, "Products", index=False)
    pd.DataFrame({"Messages": messages}).to_excel(writer, "Messages", index=False)
# Launch the new spreadsheet
os.startfile(new_spreadsheet_filename)
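As an aside, pandas can derive the new/missing product lists directly from the merge itself via the indicator argument, which tags each row with whether it came from the left frame, the right frame, or both. A minimal sketch using inline stand-ins for the two spreadsheets (hypothetical values matching the question's tables):

```python
import pandas as pd

# Inline stand-ins for old_data.xlsx and new_data.xlsx
old = pd.DataFrame({"Product number": ["1000", "AB23104", "430267"],
                    "Current price": [10, 5, 20]})
new = pd.DataFrame({"Product number": ["AB23104", "1000", "345LKT10023"],
                    "New price": [20, 5, 100]})

# indicator=True adds a "_merge" column: left_only / right_only / both
merged = pd.merge(new, old, on="Product number", how="outer", indicator=True)
missing = merged.loc[merged["_merge"] == "right_only", "Product number"].tolist()
added = merged.loc[merged["_merge"] == "left_only", "Product number"].tolist()
print(missing)  # ['430267']
print(added)    # ['345LKT10023']
```

This avoids building the two index sets by hand, at the cost of an extra column to drop afterwards.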

How do I reset the outputs generated from my ipywidget command button so that only the updated DataFrame is displayed?

Sorry for the potentially confusing phrasing of my question. Essentially, I am trying to make it so that every time I press the 'Add Data' command button there is only one DataFrame displayed. The one that should be displayed is the DF that is modified when the button is pressed. Currently, though, it will append the output with the recently modified DF, on top of the older versions that were created from earlier clicks of the button.
I'm using this code as part of a larger program for performing Monte Carlo simulations and back testing. My goal for these widgets is to input all the option positions I take on certain assets. That way, I can have a consolidated DF of my positions to speed up my analysis in later sections of this program and be available for other programs. The 'Add Data' button will input the values of the other widgets into a dictionary and concat that dictionary with the existing portfolio DF (which is saved in a CSV file).
I believe my problem is caused by me not properly utilizing the ipywidget Output() function, but have not been able to find a workable solution to my problem.
Also, I am writing in a Jupyter Notebook.
import pandas as pd
import datetime
from datetime import *
import ipywidgets as widgets
from ipywidgets import *
############################################## The following section is usually in a separate cell so I can
df = {   # refresh my portfolio every day, but still add to the DF throughout the day
    'Datetime': [],
    'Expire': [],
    'Type': [],
    'Quantity': [],
    'Strike': [],
    'Spot': []
}
df = pd.DataFrame(df)
df.to_csv("portfolio.csv", index=False)
##############################################
Type = widgets.Dropdown(
    options=['Call', 'Put'],
    value='Call',
    description='Select Binary Type',
    disabled=False,
    layout={'width': 'max-content'},
    style={'description_width': 'max-content'}
)
Quantity = widgets.BoundedIntText(
    value=1,
    min=1,
    max=10,
    step=1,
    description='Quantity:',
    disabled=False,
    layout={'width': 'max-content'},
    style={'description_width': 'max-content'}
)
Strike = widgets.BoundedIntText(
    min=1500,
    max=3500,
    step=1,
    description='Strike:',
    disabled=False,
    layout={'width': 'max-content'},
    style={'description_width': 'max-content'}
)
Spot = widgets.BoundedIntText(
    min=1500,
    max=3500,
    step=1,
    description='Spot:',
    disabled=False,
    layout={'width': 'max-content'},
    style={'description_width': 'max-content'}
)
Add = widgets.Button(description="Add Data")
out = Output()
def add_on_click(b):
    dt = datetime.now()
    option = Type.value
    quant = Quantity.value
    strike = Strike.value
    spot = Spot.value
    df = pd.read_csv("portfolio.csv")
    now = datetime.now()
    add = {
        'Datetime': dt,
        'Expire': datetime(now.year, now.month, now.day, 14, 15, 0, 1),
        'Type': option,
        'Quantity': quant,
        'Strike': strike,
        'Spot': spot
    }
    add = pd.DataFrame(add, index=[0])
    df = pd.concat([df, add], sort=True)  # ignore_index=True)
    df.to_csv("portfolio.csv", index=False)
    display(df, out)
Add.on_click(add_on_click)
items = [Type, Quantity, Strike, Spot, Add]
box_layout = Layout(display='flex',
                    flex_flow='row',
                    align_items='stretch',
                    width='100%')
box_auto = Box(children=items, layout=box_layout)
display_widgets = VBox([box_auto])
display_widgets
Change your last lines of add_on_click to:
out.clear_output()
with out:
    display(df)
You can try
from IPython.display import clear_output

def add_on_click(b):
    with out:
        clear_output()
        display(df)
        # rest of the code goes here

How to save output of my python script as a CSV file?

Having a bit of an issue trying to figure out how to save the output of my Python script as a CSV. When I run this script, the file does not appear in the location I need in order to access it. Any suggestions?
import pandas as pd
import os
folder_path = os.path.join("T:", "04. Testing","3. Wear Testing","TESTS","CKUW","180604 OP STRAPLESS","Survey Response Data")
mapping_path = os.path.join(folder_path + r'\Survey_MappingTable Strapless.xlsx')
# Read mapping table
mapping = pd.ExcelFile(mapping_path)
mapping.sheet_names
# ['SurveyInfo', 'Question Mapping', 'Answer Mapping']
# Transform sheets to 3 tables (surveyinfo, Q_mapping, A_mapping)
surveyinfo = mapping.parse("SurveyInfo")
Q_mapping = mapping.parse("Question Mapping", skiprows = 2)
A_mapping = mapping.parse("Answer Mapping", skiprows = 3)
# Get input file name and read the data. Table name is df.
input_file_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Input File Name','Value'].to_string(index=False)
path = os.path.join(r'T:\04. Testing\3. Wear Testing\TESTS\CKUW\180604 OP STRAPLESS\Survey Response Data',input_file_name)
df = pd.read_csv(path,header=None,engine='python')
# ,encoding='utf-8' Tried this as a way to fix but it didn't work
# Fill in previous column names if blank, using the preceding header
df.iloc[0] = df.iloc[0].fillna(method='ffill')
# Read the count of columns
n_col = len(df.iloc[0])
n_respondent = len(df)-2
c_name = []
for i in range(n_col):
    # Multiple columns, each with a different single answer; the question text combines the category,
    # e.g. support and comfort are both in the satisfaction category.
    # If it's a satisfaction question, concatenate first row and second row
    if "satisfaction" in df.iloc[0][i]:
        c_name.append(df.iloc[0][i] + df.iloc[1][i])
    elif "functionality" in df.iloc[0][i]:
        c_name.append(df.iloc[0][i] + df.iloc[1][i])
    elif ("shape" in df.iloc[0][i]) and ("please specify" in df.iloc[1][i]):
        c_name.append(df.iloc[0][i] + df.iloc[1][i])
    elif ("room in the cup" in df.iloc[0][i]) and ("please specify" in df.iloc[1][i]):
        c_name.append(df.iloc[0][i] + df.iloc[1][i])
    # "-" in the column header is part of the question and part of the response
    elif ("wire" in df.iloc[0][i]) and ("Response" not in df.iloc[1][i]):
        if "-" in df.iloc[1][i]:
            c_name.append(df.iloc[0][i] + df.iloc[1][i][df.iloc[1][i].find("-") + 2:])
        else:
            c_name.append(df.iloc[0][i] + df.iloc[1][i])
        for j in range(n_respondent):
            if pd.notnull(df.iloc[j + 2, i]) and "please specify" not in df.iloc[1, i]:
                df.iloc[j + 2, i] = df.iloc[1, i][:df.iloc[1][i].find("-") - 1]
    # Multiple columns, each with a different single answer, where the question text does not
    # combine the category. Used to combine band and cup size
    elif "size bra do you typically wear?" in df.iloc[0][i]:
        c_name.append(df.iloc[0][i])
        for j in range(n_respondent):
            if pd.notnull(df.iloc[j + 2, i]):
                df.iloc[j + 2, i] = df.iloc[1, i] + df.iloc[j + 2, i]
    # Single answer to the question, or multiple answers where the answer equals the column header
    else:
        c_name.append(df.iloc[0][i])
# Make the column names as the first row
df.columns = c_name
# Drop the first and second rows
df2 = df.drop(df.index[[0,1]])
# Transform the wide dataset to a long dataset;
r = list(range(10))+list(range(17,20)) # skipping "What size bra do you typically wear? (only select one size)"
df_long = pd.melt(df2,id_vars = list(df.columns[r]), var_name = 'Question', value_name = 'Answer')
# Delete rows with null value to answer
df_long_notnull = df_long[pd.notnull(df_long['Answer'])]
# Make typically wear as a column dimension
sizewear = df_long_notnull.loc[df_long_notnull['Question'] == 'What size bra do you typically wear? (Only select one size)']
sizewear2 = sizewear[['Respondent ID','Collector ID','Email Address','Answer']]
sizewear2.columns = ['Respondent ID','Collector ID','Email Address','What size bra do you typically wear?']
df_long_notnull2 = df_long_notnull[df_long_notnull['Question'] != 'What size bra do you typically wear? (Only select one size)']
df_final = pd.merge(df_long_notnull2, sizewear2, how='left', on=['Respondent ID','Collector ID','Email Address'])
# Join Answer description mapping table
df_full = pd.merge(df_final, A_mapping, how='left', left_on = ['Question','Answer'], right_on = ['Question','Answer Description'])
df_full.loc[df_full['Answer_y'].isnull(),'Answer_y'] = df_full['Answer_x']
df_full.loc[df_full['Answer Description'].isnull(),'Answer Description'] = df_full['Answer_x']
df_full = df_full.drop(labels = ['Answer_x'], axis=1)
df_full = df_full.rename(columns = {'Answer_y':'Answer','Answer Description':'Answer Desc'})
# Join Question Mapping table
df_full = pd.merge(df_full,Q_mapping, how='left', left_on = ['Question'], right_on = ['Raw Column Name'])
df_full = df_full.drop(labels = ['Raw Column Name'], axis=1)
# Get Survey Info
product_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Product Name','Value'].to_string(index=False)
if "," in surveyinfo.loc[surveyinfo['Parameter Name']=='Style Number','Value'].item():
style_number = surveyinfo.loc[surveyinfo['Parameter Name']=='Style Number','Value'].to_string(index=False).split(',')
style_number = [s.strip() for s in style_number]
else:
style_number = surveyinfo.loc[surveyinfo['Parameter Name']=='Style Number','Value'].to_string(index=False)
if "," in surveyinfo.loc[surveyinfo['Parameter Name']=='Style Name','Value'].item():
style_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Style Name','Value'].to_string(index=False).split(',')
style_name = [s.strip() for s in style_name]
else:
style_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Style Name','Value'].to_string(index=False)
# get survey information
survey_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Survey Name','Value'].to_string(index=False)
survey_id = surveyinfo.loc[surveyinfo['Parameter Name']=='Survey ID','Value'].item()
survey_year = surveyinfo.loc[surveyinfo['Parameter Name']=='Survey Year','Value'].item()
survey_mo = surveyinfo.loc[surveyinfo['Parameter Name']=='Survey Month','Value'].item()
output_file_name = surveyinfo.loc[surveyinfo['Parameter Name']=='Output File Name','Value'].to_string(index=False)
# adding columns for survey information
df_full['Product Name'] = product_name
df_full['Survey Name'] = survey_name
df_full['Survey ID'] = survey_id
df_full['Survey Year'] = survey_year
df_full['Survey Month'] = survey_mo
### create a table with style_number and style_name
if type(style_name) == list:
    style_t = pd.DataFrame(list(zip(style_name, style_number)), columns=["Style_Name", "Style_Number"])
    df_full = pd.merge(df_full, style_t, how='left', left_on=['Which style did you receive?'], right_on=['Style_Name'])
else:
    df_full['Style Name'] = style_name
    df_full['Style Number'] = style_number
# Identify the path for saving output file
path_out = os.path.join("C:","Users","Sali3",output_file_name)
# Save as comma separated csv file
df_full.to_csv(path_out, sep=',', index = False)
The last portion of this script is where I'm having the problem: path_out should point to my local C: drive, where the CSV file should be saved. Please help.
Assuming you are on Windows, the documentation on os.path.join says:
On Windows, the drive letter is not reset when an absolute path component (e.g., r'\foo') is encountered. If a component contains a drive letter, all previous components are thrown away and the drive letter is reset. Note that since there is a current directory for each drive, os.path.join("c:", "foo") represents a path relative to the current directory on drive C: (c:foo), not c:\foo.
This should fix your problem:
path_out = os.path.join("C:\\","Users","Sali3",output_file_name)
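To see the difference without a Windows machine, ntpath (the module os.path resolves to on Windows) can be used on any platform. The paths below mirror the ones from the question; "out.csv" is a hypothetical stand-in for output_file_name:

```python
import ntpath  # os.path is ntpath on Windows; importable on any platform

# Without a root separator, "C:" joins into a drive-relative path
relative = ntpath.join("C:", "Users", "Sali3", "out.csv")
# With the separator, the result is anchored at the drive's root
absolute = ntpath.join("C:\\", "Users", "Sali3", "out.csv")
print(relative)  # C:Users\Sali3\out.csv
print(absolute)  # C:\Users\Sali3\out.csv
```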

Change the name of excel worksheet with pandas

import pandas as pd
import numpy as np
df = pd.read_excel(r"C:\Users\venkagop\Subbu\promo validation testing\P 02. Promotions-UK C1.xls")
df = df[['Promotions', 'Promotions: AE', 'Promotions: Anaplan ID', 'Promotions: Is Optima Scenario?', 'Promotions: SIDs', 'Set Inactive?', 'Start Date', 'End Date', 'Promo Period', 'Promo Optima Status', 'Change Promo Status']]
df = df[(df['Promo Period'] == 'FY1819')]
df = df[(df['Set Inactive?'] == 0 ) & (df['Promotions: Is Optima Scenario?'] == 1)]
df.dropna(subset=['Promotions: SIDs'], inplace=True)
df['Optima vs Anaplan Promo Status Validation'] = ""
df['Optima vs Anaplan Promo Status Validation'] = np.where(df['Promo Optima Status'] == df['Change Promo Status'], 'True', 'False')
df.to_excel(r"C:\Users\venkagop\Subbu\mytest.xls", index = False)
# after this I want to change the Sheet1 name to some other name
There are 2 ways you can approach this problem.
Approach 1
Save the excel file to the correct worksheet name from the beginning, by using the sheet_name argument.
import pandas as pd
writer = pd.ExcelWriter(r'C:\Users\venkagop\Subbu\mytest.xls')
df.to_excel(writer, sheet_name='MySheetName', index=False)
writer.save()
Approach 2
If Approach 1 is not possible, change the worksheet name at a later stage using openpyxl. The advantage of this method is that you avoid the cost of converting the pandas DataFrame to Excel format again.
import openpyxl

# Note: openpyxl only reads .xlsx workbooks; for the legacy .xls format
# a different library is needed.
file_loc = r'C:\Users\venkagop\Subbu\mytest.xls'
ss = openpyxl.load_workbook(file_loc)
ss_sheet = ss.get_sheet_by_name('Sheet1')  # deprecated in newer openpyxl; ss['Sheet1'] also works
ss_sheet.title = 'MySheetName'
ss.save(file_loc)
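As a quick sanity check of Approach 2's core idea: renaming is just an assignment to the sheet's title attribute. A sketch with an in-memory workbook, no file involved:

```python
import openpyxl

wb = openpyxl.Workbook()    # a fresh workbook starts with one sheet named "Sheet"
ws = wb.active
ws.title = "MySheetName"    # renaming a sheet is an assignment to .title
print(wb.sheetnames)  # ['MySheetName']
```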