I'm playing with some data from an Excel file. I imported the file, made it into a dataframe, and now want to search a column named 'Category' for certain keywords, find the matching rows, and return another column ('Asin') for them. I'm having trouble finding the correct syntax to make this work.
The code below is my attempt at an if statement:
import pandas as pd
import numpy as np
file = r'C:/Users/bryanmccormack/Downloads/hasbro_dummy_catalog.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('asins')
df
check = df.loc[df.Category == 'Action Figures'] = 'Asin'
print(check)
Alex Fish provided the correct answer, if I understand the question.
To elaborate, df.loc[df.Category == 'Action Figures'] returns a data frame with the rows that meet the bracketed condition, so ['Asin'] at the end returns the "Asin" column from that data frame.
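As a minimal sketch of that selection (the small frame below is a made-up stand-in for the 'asins' sheet, which isn't shown in the question), the row condition and the column label can also be passed to .loc in a single call:

```python
import pandas as pd

# Hypothetical stand-in for the 'asins' sheet; the real data is not shown.
df = pd.DataFrame({
    "Category": ["Action Figures", "Board Games", "Action Figures"],
    "Asin": ["B001", "B002", "B003"],
})

# Boolean mask picks the rows, the label picks the column, in one .loc call.
check = df.loc[df["Category"] == "Action Figures", "Asin"]
print(check.tolist())

# Keyword search instead of exact equality; na=False treats missing values as no match.
keyword_hits = df.loc[df["Category"].str.contains("Figures", na=False), "Asin"]
print(keyword_hits.tolist())
```

The str.contains variant is probably closer to "certain keywords" than an exact == comparison, but that depends on what the Category values actually look like.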
Fyi,
check = df.loc[df.Category == 'Action Figures'] = 'Asin'
This is a multiple assignment statement - that is,
a = b = 4
is the same as
b = 4
a = b
So your code is apparently rewriting some values of your data frame df, which you probably don't want.
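To see the overwrite for yourself, here is a small sketch on made-up data (the frame and the copy named work are assumptions for illustration, not the asker's file):

```python
import pandas as pd

# Made-up data standing in for the real sheet.
df = pd.DataFrame({
    "Category": ["Action Figures", "Board Games"],
    "Asin": ["B001", "B002"],
})
work = df.copy()  # work on a copy so the original frame stays intact

# Same shape as the statement in the question: two assignment targets, one value.
check = work.loc[work["Category"] == "Action Figures"] = "Asin"

print(check)  # the plain string 'Asin', not a selection from the frame
print(work)   # every column of the matching row now holds 'Asin'
```

So check ends up as a string and the matching rows are overwritten, which is why the print in the question shows nothing useful.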
Related
I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to write pandas code that, at the very least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it, and I am assuming there are likely to be more efficient ways, is to import the excel file into a dataframe, and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line into a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date':date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT:
Here is a sample CSV I'm working with
Here is my code:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[df['productOMS'].str.contains('^[a-z]+')]
    #REGEX to filter anything that would ^ (start with) a letter

inputFile = inputFile
deleteSearchTerm(inputFile)
What I want to do:
Anything in the column ProductOMS that begins with a letter is a row that I don't want. So I'm trying to delete those rows based on a condition, and I was also trying to use regular expressions so I'd get a little more comfortable with them.
I tried to do that with:
df = df[df['productOMS'].str.contains('^[a-z]+')]
where if any of the rows starts with any lower case letter I would drop it (I think)
Please let me know if I need to add anything to my post!
Edit:
Here is a link to a copy of the file I'm working with.
https://drive.google.com/file/d/1Dsw2Ana3WVIheNT43Ad4Dv6C8AIbvAlJ/view?usp=sharing
Another Edit: Here is the dataframe I'm working with
productNum,ProductOMS,productPrice
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
New Edit:
Here's some updated code using an answer
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)

inputFile = inputFile
deleteSearchTerm(inputFile)
When I run this code and print out the dataframes this gets rid of the rows that start with 'search'. However my CSV file is not updating
The issue here is that you're most likely dealing with mixed data types.
If you just want numeric values, you can use pd.to_numeric:
df = pd.DataFrame({'A' : [0,1,2,3,'a12351','123a6']})
df[~pd.to_numeric(df['A'],errors='coerce').isnull()]
A
0 0
1 1
2 2
3 3
But if you only want to test the first character, then:
df[~df['A'].astype(str).str.contains('^[a-z]')==True]
A
0 0
1 1
2 2
3 3
5 123a6
Edit: it seems the first solution works, but you need to write the result back to your csv?
You need to use the to_csv method. I'd recommend you read "10 minutes to pandas".
As for your function, let's edit it a little so that it takes a source csv file and writes out an edited version. It saves the file to the same location with _edited added to the name. Feel free to edit/change.
from pathlib import Path
import pandas as pd

def delete_search_term(input_file, column):
    """
    Takes in a file and removes any rows with non-numeric strings in a given column
    input_file : path to your file.
    column : column with strings that you want to remove.
    """
    file_path = Path(input_file)
    if not file_path.is_file():
        raise Exception('This file path is not valid')
    df = pd.read_csv(input_file)
    # Keep only the rows whose value in `column` parses as a number
    df = df[~pd.to_numeric(df[column],errors='coerce').isnull()]
    print(f"Creating file as:\n{file_path.parent.joinpath(f'{file_path.stem}_edited.csv')}")
    return df.to_csv(file_path.parent.joinpath(f"{file_path.stem}_edited.csv"),index=False)
Solution:
import numpy as np
import pandas as pd
def deleteSearchTerm(inputFile):
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df.to_csv(inputFile)

inputFile = filePath
inputFile = deleteSearchTerm(inputFile)
Data from the source csv as shared at the google drive location:
'''
productNum,ProductOMS,productPrice,Unnamed: 3
2463448,1002623072,419.95,
2463413,1002622872,289.95,
2463430,1002622974,309.95,
2463419,1002622908,329.95,
2463434,search?searchTerm=2463434,,
2463423,1002622932,469.95,
'''
import pandas as pd
df = pd.read_clipboard()
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
4 2463434 search?searchTerm=2463434 NaN NaN
5 2463423 1002622932 469.95 NaN
df1 = df.loc[df['ProductOMS'].str.isdigit()]
print(df1)
Output:
productNum ProductOMS productPrice Unnamed: 3
0 2463448 1002623072 419.95 NaN
1 2463413 1002622872 289.95 NaN
2 2463430 1002622974 309.95 NaN
3 2463419 1002622908 329.95 NaN
5 2463423 1002622932 469.95 NaN
I hope it helps you:
df = pd.read_csv(filename)
df = df[~df['ProductOMS'].str.contains('^[a-z]+')]
df.to_csv(filename)
For the most part your function is fine, but you seem to have forgotten to save the CSV, which is done with the df.to_csv() method.
Let me rewrite the code for you:
import pandas as pd
def processAndSaveCSV(filename):
    # Read the CSV file
    df = pd.read_csv(filename)
    # Retain only the rows where `ProductOMS` starts with a digit
    # (raw string so that \d is not treated as a string escape)
    df = df[df['ProductOMS'].str.contains(r'^\d+')]
    # Save CSV File - Rewrites file
    df.to_csv(filename)
Hope this helps :)
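One thing worth knowing about the regex approach: a pattern anchored only at the start, like ^\d+, keeps any value that begins with a digit, not just values that are digits all the way through. A small sketch of the difference, using made-up values ('123a6' is borrowed from another answer here) and str.fullmatch as one possible stricter check:

```python
import pandas as pd

# Made-up values: one clean number, one search URL, one mixed string.
s = pd.Series(["1002623072", "search?searchTerm=2463434", "123a6"])

starts_with_digit = s.str.contains(r"^\d+")  # also True for '123a6'
all_digits = s.str.fullmatch(r"\d+")         # True only if every character is a digit

print(s[starts_with_digit].tolist())
print(s[all_digits].tolist())
```

For the sample file in the question both checks give the same rows, so this only matters if mixed values like '123a6' can show up.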
It looks like a scope problem to me.
First we need to return df:
def deleteSearchTerm(inputFile):
    #(1) Open the file
    df = pd.read_csv(inputFile)
    print(df)
    #(2) Filter every row where the first letter is 's' from search term
    df = df[~pd.to_numeric(df['ProductOMS'],errors='coerce').isnull()]
    print(df)
    return df
Then replace the line
deleteSearchTerm(inputFile)
with:
inputFile = deleteSearchTerm(inputFile)
Basically your function is not returning anything.
After you fix that you just need to redefine your inputFile variable to the new dataframe your function is returning.
If you already defined df earlier in your code and you're trying to manipulate it, then the function is not actually changing your existing global df variable. Instead it's making a new local variable under the same name.
To fix this we first return the local df and then re-assign the global df to the local one.
You should be able to find more information about variable scope at this link:
https://www.geeksforgeeks.org/global-local-variables-python/
It also appears you never actually update your original file.
Try adding this to the end of your code:
df.to_csv('CSV file name', index=True)
The index argument just says whether the row index is written to the file as an extra first column.
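Putting the two points together (return the frame, then write it out), here is a rough sketch. The helper name delete_search_rows, the temporary file, and the three sample rows are all made up for illustration; index=False is my choice here so that no extra index column is added to the file:

```python
import tempfile
from pathlib import Path

import pandas as pd


def delete_search_rows(csv_path):
    """Return only the rows whose ProductOMS value parses as a number."""
    df = pd.read_csv(csv_path)
    numeric = pd.to_numeric(df["ProductOMS"], errors="coerce").notna()
    return df[numeric]


# Hypothetical sample file written to a temporary directory.
csv_path = Path(tempfile.mkdtemp()) / "sample.csv"
csv_path.write_text(
    "productNum,ProductOMS,productPrice\n"
    "2463448,1002623072,419.95\n"
    "2463434,search?searchTerm=2463434,\n"
    "2463423,1002622932,469.95\n"
)

cleaned = delete_search_rows(csv_path)  # capture the frame the function returns
cleaned.to_csv(csv_path, index=False)   # only this call changes the file on disk

reloaded = pd.read_csv(csv_path)
print(reloaded)
```

Filtering inside the function never touches the file by itself; the file only changes when to_csv is called on the result.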
I am trying to write the following MATLAB code in Python:
function [x,y,z] = Testfunc(filename, newdata, a, b)
    sheetname = 'Test1';
    data = xlsread(filename, sheetname);
    if data(1) == 1
        newdata(1,3) = data(2);
        newdata(1,4) = data(3);
        newdata(1,5) = data(4);
        newdata(1,6) = data(5)
    else
        ....
        ....
        ....
It is a very long function, but this is the part where I am stuck and have no clue at all.
This is what I have written so far in Python:
import pandas as pd
def test_func(filepath, newdata, a, b):
    data = pd.read_excel(filepath, sheet_name = 'Test1')
    if data[0] == 1:
I am stuck here, and I am not even sure whether the 'if' statement is right. I am looking for suggestions and help.
Info: the Excel sheet has 1 row and 13 columns; newdata is also a 2-D matrix.
Try running that code and printing out your dataframe (print(data)). You will see that a dataframe is different from a MATLAB matrix. By default read_excel treats the first row as column headers, so with a single-row sheet you will probably end up with column names and no data rows. To stop pandas from using that row as the header, use:
data = pd.read_excel(filepath, sheet_name='Test1', header=None)
Indexing a DataFrame with data[0] selects the column labelled 0 and returns a Series, not a single cell, so the comparison does not test one value against 1. To read a given cell, index by position with the iloc indexer on your dataframe: data.iloc[0,0] accesses row 0, column 0, which corresponds to data(1) in MATLAB. Your code should look like this:
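import pandas as pd
def test_func(filepath, newdata, a, b):
    # header=None keeps the single row as data instead of column names
    data = pd.read_excel(filepath, sheet_name = 'Test1', header=None)
    if data.iloc[0,0] == 1:
        # assumes newdata is a DataFrame; to_numpy() copies by position, not by label
        newdata.iloc[0,2:6] = data.iloc[0,1:5].to_numpy()
    ....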
I suggest you read up on indexing in pandas.
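If it helps, here is a small sketch of the MATLAB-to-pandas index mapping that runs without the Excel file. The in-memory frame stands in for the sheet read with header=None, and newdata is assumed to be a NumPy array with a made-up shape, since the real one isn't given:

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel(..., header=None) on a sheet with 1 row and 13 columns.
data = pd.DataFrame([[1, 10, 20, 30, 40, 0, 0, 0, 0, 0, 0, 0, 0]])
newdata = np.zeros((2, 8))  # hypothetical shape; the real one is not given

if data.iloc[0, 0] == 1:  # MATLAB: data(1) == 1
    # MATLAB: newdata(1,3:6) = data(2:5)
    # Python indices start at 0 and slices exclude the end position.
    newdata[0, 2:6] = data.iloc[0, 1:5].to_numpy()

print(newdata[0])
```

The main things to keep in mind are the shift from 1-based to 0-based indices and that a Python slice a:b stops before b.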
I have one Excel sheet in the right format (a certain number of headers with specific names). I have another Excel sheet and need to check whether it is in the right format too (it must have the same number of headers and the same header names; it does not matter if the values below the headers differ). How can I solve this? Is NLP or some other method suitable?
If you have to compare two Excel files, you could try something like this (I also add some example Excel files):
def areHeaderExcelEqual(excel1, excel2) :
    equals = True
    if len(excel1.columns) != len(excel2.columns):
        return False
    for i in range(len(excel1.columns)):
        if excel1.columns[i] != excel2.columns[i] :
            equals = False
    return equals
And that's an application:
import pandas as pd
#create first example Excel
df_out = pd.DataFrame([('string1',1),('string2',2), ('string3',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp1.xlsx')
#create second example Excel
df_out = pd.DataFrame([('string5',1),('string2',5), ('string2',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp2.xlsx')
# create third example Excel
df_out = pd.DataFrame([('string1',1),('string4',2), ('string3',3)], columns=['MyName', 'MyValue'])
df_out.to_excel('tmp3.xlsx')
excel1 = pd.read_excel('tmp1.xlsx')
excel2 = pd.read_excel('tmp2.xlsx')
excel3 = pd.read_excel('tmp3.xlsx')
print(areHeaderExcelEqual(excel1, excel2))
print(areHeaderExcelEqual(excel1, excel3))
Note: the Excel files are created here just to show the different outputs.
For example, excel1 looks like this:
The idea is the same for the other files. To have more insights, see How to create dataframes.
Here's your code:
f1 = pd.read_excel('file1.xlsx')
f2 = pd.read_excel('file2.xlsx')
print(areHeaderExcelEqual(f1, f2))
You can use pandas for that comparison.
import pandas as pd
f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')
header_threshold = 5 # any number of headers
print(len(f1.columns) == header_threshold)
print(f1.columns) # get the column names as values
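A compact way to do the same check is to compare the column lists directly. This is only a sketch: the frames are built in memory instead of read from Excel, and the helper name same_headers is made up:

```python
import pandas as pd

# In-memory stand-ins for the template sheet and two sheets to check.
template = pd.DataFrame(columns=["Name", "Value"])
same = pd.DataFrame([("string5", 1)], columns=["Name", "Value"])
different = pd.DataFrame([("string1", 1)], columns=["MyName", "MyValue"])


def same_headers(a, b):
    """True when both frames have the same header names in the same order."""
    return list(a.columns) == list(b.columns)


print(same_headers(template, same))
print(same_headers(template, different))

# Headers expected by the template but absent from the checked sheet.
missing = sorted(set(template.columns) - set(different.columns))
print(missing)
```

Comparing lists checks the count, the names, and the order in one go; if order should not matter, comparing sets of column names would be the variant to use.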
I'm working on a project and ran into a messy situation where I have to split a data frame based on its first column. The data frame comes from SQL queries and goes through a lot of manipulation, which is why I'm not posting the code here.
Target: The data frame I've with me is like the below screenshot, and its available as an xlsx file.
Output: I'm looking for output like the attached file here:
The thing is, I can't work out the logic to do this on the dataframe itself, as I'm a newbie in Python.
I think you can do this:
import pandas as pd
df = df.set_index('Placement# Name')
# %m is the month; %M would insert the minutes instead
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].groupby(level=0).sum()\
    .assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date',append=True).sort_index(level=0)
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n,g in df_out.groupby(level=0):
    g.to_excel(writer, startrow=startline, index=True)
    startline += len(g)+2
# close() writes the file; writer.save() was removed in pandas 2.0
writer.close()
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
import pandas
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Here "Needed value" stands for the value in the "Placement# Name" column of the rows you want to keep.
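If you need one piece per value of the first column instead of a single filtered value, groupby can do the split. A minimal sketch on made-up data (only the "Placement# Name" column name comes from the question; the other columns and values are invented):

```python
import pandas as pd

# Made-up rows; only the 'Placement# Name' column name is taken from the question.
df = pd.DataFrame({
    "Placement# Name": ["A", "A", "B"],
    "Clicks": [10, 5, 7],
    "Spend": [1.0, 2.0, 3.0],
})

# One sub-frame per placement, keyed by the placement name.
pieces = {name: group for name, group in df.groupby("Placement# Name")}
print(pieces["A"])

# Per-placement subtotals of the numeric columns.
subtotals = df.groupby("Placement# Name")[["Clicks", "Spend"]].sum()
print(subtotals)
```

Each entry in pieces can then be written to its own sheet or block of rows, depending on how the output file should look.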