how to only read rows with the first column containing name i - python

I have two Excel worksheets that I am reading in Python. The first worksheet has a list of company names. The second has the same company names repeated across multiple rows, with the corresponding data in the columns to the right.
[![Worksheet 1][1]][1]
[![Worksheet 2][2]][2]
I want to set up a condition: if the name in column A of worksheet 2 matches a name in worksheet 1, print the data (columns A:F of worksheet 2) only for the rows with that name.
I am pretty new to coding and have been experimenting a lot without much luck. Right now I don't have much code because I started over. I have mostly been reading the files with pandas and sometimes trying openpyxl.
import pandas as pd
import xlsxwriter as xlw
import openpyxl as xl
TickList = pd.read_excel("C:\\Users\\Ashley\\Worksheet1.xlsx",sheet_name='Tickers', header=None)
stocks = TickList.values.ravel()
Data = pd.read_excel("C:\\Users\\Ashley\\Worksheet2.xlsx", sheet_name='Pipeline', header=None, usecols="A:F")
data = Pipeline.values.ravel()
for i in stocks:
    for t in data:
        if i == t:
            print(data)
[1]: https://i.stack.imgur.com/f6mXI.png
[2]: https://i.stack.imgur.com/4vKGR.png

I would imagine that the first thing going wrong is that you never specify which value in "t" each "i" in stocks should be matched against. Remember that "data" holds all of the values, not just the names. You have to match "i" against (probably) the first column of each row. What you are doing here is akin to a VLOOKUP without the lookup range properly specified.
I do not know exactly how the ravel() function stores the data, but I believe something like this is more likely to work:
for i in stocks:
    for t in data:
        if i == t[0]:
            print(t)
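One way to sketch the same row-matching idea without nested loops is to keep worksheet 2 as a DataFrame and filter on its first column. Note that ravel() flattens the values into a one-dimensional array, so the row structure is lost once it is applied to the data sheet. The frames below are small in-memory stand-ins for the two worksheets (the company names and numbers are made up for illustration), and because the sheets are read with header=None the first column is labelled 0.

```python
import pandas as pd

# Hypothetical stand-in for worksheet 1 (read with header=None): one column of names.
tick_list = pd.DataFrame(["AAA", "CCC"])
stocks = tick_list.values.ravel()

# Hypothetical stand-in for worksheet 2 (columns A:F, read with header=None).
data = pd.DataFrame([
    ["AAA", 1, 2, 3, 4, 5],
    ["BBB", 6, 7, 8, 9, 10],
    ["AAA", 11, 12, 13, 14, 15],
    ["CCC", 16, 17, 18, 19, 20],
])

# Keep only the rows whose first column (label 0) appears in the name list.
matches = data[data[0].isin(stocks)]
print(matches)
```

The boolean mask from isin is applied to whole rows, so every column of a matching row is kept.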

Related

How to replace the blank cells in Excel with 0 using Python?

I'm trying to replace the blank cells in Excel with 0 using Python. I have a loop script that checks two Excel files with the same worksheet, column headers and values. As shown in the attached picture, the script writes the count to the Count column in Excel 2 if the value in column A of Excel 2 matches a value in column A of Excel 1. The problem is that for values in column A of Excel 2 that have no match in column A of Excel 1, the Count column in Excel 2 is left blank.
Below is the part of the script that checks the two Excel files. I tried the suggestion from the linked question "Pandas: replace empty cell to 0", but it doesn't work for me and I get this error message: result.fillna(0, inplace=True) NameError: name 'result' is not defined. Guidance on how to achieve my goal would be much appreciated. Thank you in advance.
import pandas as pd
import os
import openpyxl
daily_data = openpyxl.load_workbook('C:/Test.xlsx')
master_data = openpyxl.load_workbook('C:/Source.xlsx')
daily_sheet = daily_data['WorkSheet']
master_sheet = master_data['WorkSheet']
for i in daily_sheet.iter_rows():
    Column_A = i[0].value
    row_number = i[0].row
    for j in master_sheet.iter_rows():
        if j[0].value == Column_A:
            daily_sheet.cell(row=row_number, column=6).value = j[1].value
            #print(j[1].value)
daily_data.save('C:/Test.xlsx')
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
It seems you've made a few fundamental mistakes in your approach. First off, "result" is an object; specifically, it is a dataframe that someone else created in that other post, not your dataframe. You need to call the method on your own dataframe. Python is object oriented, meaning that objects are the key players, and .fillna() is a method that operates on your object. The usage for a toy example is as follows:
my_df = pd.read_csv(my_path_to_my_df_)
my_df.fillna(0, inplace=True)
Also, this method is for dataframes, so you will need to convert your data from the object that the openpyxl library creates (at least that is what I would assume; I haven't used this library before). With your data you would therefore do this:
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
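To see what fillna does without touching any files, here is a minimal sketch on a made-up dataframe. The column names and values are assumptions for illustration only; NaN stands in for a blank Count cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet read from Excel; NaN marks a blank Count cell.
daily = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Count": [3.0, np.nan, 7.0],
})

# fillna returns a new DataFrame unless inplace=True is passed.
filled = daily.fillna(0)
print(filled)
```

The filled values exist only in memory at this point. Getting them back into the workbook would need a further step, for example writing the dataframe out with to_excel.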

Start reading from a column where data exists is in excel using pandas [duplicate]

How could I read an Excel file from column AF and onwards? I don't know the last column letter name and the file is too large to constantly keep checking.
df = pd.read_excel(r"Documents\file.xlsx", usecols="AF:")
You can't express an open-ended range directly in the read_excel function, so we have to look at other options.
For the moment you could write 'AF:XFD', because 'XFD' is the last column in Excel, but pandas warns that this behaviour is deprecated and will raise a ParserError in the future, so it's not recommended.
You can use other libraries to find the last column, but that is not very fast: the Excel file is read, then the last column is checked, and only after that is the dataframe created.
If I had this problem I would do everything in pandas by adding .iloc at the end. 'AF' is the 32nd column in Excel, which is position 31 with zero-based indexing, so:
df = pd.read_excel(r"Documents\file.xlsx").iloc[:, 31:]
It will return all columns from 'AF' till the end without having a noticeable impact on performance.
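Since .iloc positions count from zero while Excel column numbers count from one, it is easy to be off by one here. A small helper can make the conversion explicit. This is only a sketch, and the function name excel_col_to_index is made up for illustration.

```python
def excel_col_to_index(letters: str) -> int:
    """Convert an Excel column label such as 'AF' to a zero-based position."""
    number = 0
    for ch in letters.upper():
        # Column labels are base-26 with digits A=1 .. Z=26.
        number = number * 26 + (ord(ch) - ord("A") + 1)
    # Excel numbers columns from 1; positional indexing starts at 0.
    return number - 1

print(excel_col_to_index("A"))    # 0
print(excel_col_to_index("AF"))   # 31
print(excel_col_to_index("XFD"))  # 16383
```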
You can use the xlwings library to determine the last column that contains data and then use it in your read_excel call.
import pandas as pd
import xlwings as xw
app = xw.App(visible=False)
book = app.books.open(r"Documents\file.xlsx")
sheet = book.sheets[0]
# Read the last column number before closing Excel.
colNum = sheet.range('AF1').current_region.last_cell.column
app.quit()
df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, colNum)))
where 31 is the zero-based position of column AF.

How to read a main row or main column in DataFrame and search the string for cell values?

I have an Excel workbook that I convert into a dataframe, and I'm trying to read each row and column and assign variables. The problem is that the data is not always in the same place, so I can't hardcode its location. So I capture the first row and first column, and I'm trying to search them to find the data and get its location.
Here is my code:
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook(strTotalFile, data_only=True)
for sheet in wb.sheetnames:
    ws = wb[sheet]
    data = ws.values
    pd.set_option('display.max_columns', None)
    df = pd.DataFrame(data)
    mainRow = df.iloc[[17]]
    mainCol = df.iloc[:,1]
This part of your question really bothers me:
"the data is not always in the same place"
I take this to mean that your rows or columns do not follow a consistent layout.
That would cause no end of problems for you.
But assuming I am wrong, you can get by with simple commands. The accepted answer to the question linked below (not my answer) covers what you need.
How do I select rows from a DataFrame based on column values?
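If the goal is to find where a known label sits when its position varies, one possible approach is to compare every cell against the label and read off the matching coordinates. This is a rough sketch, not something taken from the linked answer; the sheet contents and the label "Total" are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sheet contents where the label position is not known in advance.
df = pd.DataFrame([
    [None, None, None],
    [None, "Total", 42],
    [None, "Tax", 7],
])

# np.argwhere returns the (row, column) positions where the comparison is True.
positions = np.argwhere(df.to_numpy() == "Total")
row, col = positions[0]
print(row, col)               # position of the label
print(df.iloc[row, col + 1])  # value stored in the cell to its right
```

If the label can appear more than once, positions holds one (row, column) pair per occurrence, so it is worth checking its length before indexing.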

Change substring in a column from a dict (Python Pandas)

I've got two dataframes from two Excel sheets (in the same file). I want to replace the name of each molecule in the first sheet with the "official id" from a database in the second sheet.
import pandas as pd
reactions = pd.read_excel ("/Users/Python/reactions.xlsx")
molecules = pd.read_excel ("/Users/Python/reactions.xlsx" ,
sheet_name= 'METS')
d = molecules.set_index('MOLID')['MOLNAME'].to_dict()
#not work
reactions['EQUATION'] = reactions['EQUATION'].str.replace('\d+','').replace(d)
I have the old/new molecule names in a dictionary that I created from the second sheet:
d
It looks like this:
{....'glucose[c]': 'glc_D',
'glucose[s]': 'glc_D',
'glucose[x]': 'glc_D', ....}
In the first dataframe, the column where I want to change the molecule names is called EQUATION, and its values look like: "ATP[c] + glucose[c] => ADP[c] + glucose6phosphate[c]"
I tried the code above; it doesn't raise an error, but the molecules in my dataframe haven't changed.
Thank you for your time.
How to replace multiple substrings in a Pandas series using a dictionary?
I think you can do it if you adjust the code to something like this, assigning the result back to the column:
reactions['EQUATION'] = reactions['EQUATION'].apply(lambda x: ' '.join([d.get(i, i) for i in x.split()]))
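To show what the split-and-lookup step does on a single equation, here is a plain Python sketch. The dictionary is only the excerpt shown in the question, and the helper name rename_tokens is made up for illustration.

```python
# Excerpt of the mapping from the question; the real dictionary has more entries.
d = {
    "glucose[c]": "glc_D",
    "glucose[s]": "glc_D",
    "glucose[x]": "glc_D",
}

def rename_tokens(equation: str, mapping: dict) -> str:
    """Replace each whitespace-separated token that is a key in mapping."""
    return " ".join(mapping.get(token, token) for token in equation.split())

equation = "ATP[c] + glucose[c] => ADP[c] + glucose6phosphate[c]"
print(rename_tokens(equation, d))
# ATP[c] + glc_D => ADP[c] + glucose6phosphate[c]
```

Because whole tokens are looked up, glucose6phosphate[c] is left alone even though it starts with "glucose". This relies on the molecule names being separated by spaces, as they are in the example equation.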

Output Issues with pandas, Python

Begin Code
import pandas as pd
df = pd.read_csv('C:/Users/lhicks/Documents/Corporate/test.csv', 'r')
saved_column = df.FirstName
saved_column2 = df.LastName
saved_column3 = df.Email
print saved_column
print saved_column2
print saved_column3
Itemlist = []
Itemlist.append(saved_column)
print Itemlist
End of Code
The objective is to select specific columns from a specified xls sheet, grab all the rows from the specified columns, and then print that data out.
The current issue is the data is grabbed correctly, but after 29-30 rows, it prints/stores a "...", and then jumps to line item 880s, and finishes out from there.
The additional issue is that it also stores this as the value, making it worthless due to not providing the full dataset.
The eventual process is to add the selected columns to a new xls sheet to clean up the old data, and then add the rows to a templated document to generate an advertisement letter.
The first question is how do I get all the fields to populate? The second is what is the best approach for this? Please provide additional links as well if possible; this is a practical learning experience for me.
Pandas shortens your data when printing it.
NOTE: all the data is still there (print df.shape to check). The truncation is just a convenient way not to flood your screen with tons of rows/columns.
Try this:
fn = 'C:/Users/lhicks/Documents/Corporate/test.csv'
cols = ['FirstName','LastName','Email']
df = pd.read_csv(fn, usecols=cols)
df.to_excel('/path/to/excel.xlsx', index=False)
This will parse only the ['FirstName','LastName','Email'] columns from the CSV file and export them to an Excel file.
UPDATE:
If you want to control how many rows pandas prints:
with pd.option_context("display.max_rows", 200):
    print(df)
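A quick way to convince yourself that only the printed view is shortened is to compare the shape with a full rendering. This is a small sketch on a made-up frame; the 100 generated names are placeholders, not data from the question.

```python
import pandas as pd

# Hypothetical frame with more rows than the default display limit.
df = pd.DataFrame({"FirstName": [f"name{i}" for i in range(100)]})

# The frame itself still holds every row; only the printed view is shortened.
print(df.shape)

# to_string() renders all rows regardless of display.max_rows.
full_text = df.to_string()
print(len(full_text.splitlines()))  # header line plus one line per row
```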
