I'm new to python and just trying to redo my first project from matlab. I've written a code in vscode to import an excel file using pandas
import pandas as pd

filename = r'C:\Users\user\Desktop\data.xlsx'
sheet = ['data']
with pd.ExcelFile(filename) as xls:
    Dateee = pd.read_excel(xls, sheet, index_col=0)
Then I want to access data in a row and column.
I tried to print data using code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there any way to access the data (as a list)?
You can iterate on each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
df is whatever name the dataframe was assigned to. (The question's code is inconsistent about the variable name, so I didn't reuse it.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values, see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or df.iat and df.at for getting or setting single values; see the pandas documentation for DataFrame.iat and DataFrame.at.
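For the original question, a minimal sketch that reads the sheet into a single DataFrame and then pulls values out (the path and sheet name are taken from the question; this assumes the sheet really is named 'data'):
import pandas as pd

filename = r'C:\Users\user\Desktop\data.xlsx'

# Passing a single sheet name (not a list) returns one DataFrame
# rather than a dict of DataFrames.
df = pd.read_excel(filename, sheet_name='data', index_col=0)

# One column as a plain Python list
print(df[df.columns[0]].to_list())

# A single value by position (row 0, column 0)
print(df.iat[0, 0])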
I am trying to read an Excel file and write every fourth row into a new Excel file. I'm using Pandas to read and write, and if int(num % 4) == 0 to determine which rows to select, but the iteration and subsequent writing continue to escape me. I've tried my best to look up answers, but I'm a new programmer and struggling :/
If you're using Pandas, I'm assuming you've loaded the data into a dataframe?
If so then consider this:
import pandas as pd

df = pd.read_csv('YourFile.csv')
df = df.iloc[::4]  # keep every 4th row; the slice must be assigned back to df

# once you're done with the data you can save it to another csv file
df.to_csv('OutputFile.csv')
This will leave your dataframe df with every 4th row from your original dataframe/file (positions 0, 4, 8, …; use df.iloc[3::4] if you want the 4th, 8th, 12th rows instead). You can then read/write to each row left in the dataframe df. To visualize the before and after, just insert df.head() before and after the df = df.iloc[::4] line.
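Since the question is about Excel files specifically, the same pattern works with pandas' Excel I/O. A sketch, with placeholder file names, assuming an Excel engine such as openpyxl is installed:
import pandas as pd

df = pd.read_excel('YourFile.xlsx')   # placeholder input file
df = df.iloc[::4]                     # keep every 4th row, starting with the first
df.to_excel('OutputFile.xlsx', index=False)  # placeholder output file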
It's not entirely clear what the specific problem is, but you should try pandas' iloc property (or even loc, depending on your df); more info here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
I'm new to this page. I've managed to find myself in a little bit of an issue. Using python I'm looking for a way to loop through the different cells of an excel column using pandas and dataframes. The code I'm using is:
variable = pd.DataFrame(data, columns=['Column'])
for cell in variable:
    print(cell)
And this only prints the first cell.
What am I doing wrong?
Not exactly sure what you are trying to do, but here is a way to remove every row where a column contains a particular piece of text (repeated entries of 'Player' in this example):
df = df[df.column_name.apply(lambda x: x != 'Player')]
This checks every value in the column, and of course you can change the condition after the colon to whatever test you want.
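If the goal in the question is simply to loop over every cell in the column, note that iterating over a DataFrame yields its column names, not its values, which is why only one thing was printed. A minimal sketch that iterates the column itself (the literal data here is just a stand-in):
import pandas as pd

# stand-in for the question's DataFrame
variable = pd.DataFrame({'Column': ['a', 'b', 'c']})

# iterate over the column (a Series) rather than the DataFrame
for cell in variable['Column']:
    print(cell)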
So I have a Python script that compares two dataframes to find any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process and know that iteration is the problem, but I haven't had much luck with the various pandas/NumPy methods such as merge and where.
Couple of caveats:
- The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
- I want to only use the column names from one of the dataframes.
- df_new represents new information to be checked against what is currently on file (df_current).
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)

for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue

# create an xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe, but if you want to test unique IDs in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
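Put together with the example tables above, a minimal sketch of the whole flow might look like this (only a couple of columns are reproduced; the output path is the one from the question):
import pandas as pd

# column names differ between the two sources, so keep them in variables
current_col_name = 'partno'
new_col_name = 'mfgr_num'

df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})

# keep only the rows of df_new whose key does not appear in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 / Logo Vest row remains

# write the result (requires the data/ directory to exist)
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)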
I'm trying to write a script that merges two excel files together. One has been hand processed and has a bunch of custom formatting done to it, and the other is an auto-generated file. Doing the merge in pandas is simple enough, but preserving the formatting is proving troublesome. I found the styleframe library, which seems like it should simplify what I'm trying to do, as it can import style info in addition to the raw data. However, I'm having problems actually implementing the code.
My question is this: how can I pull style information from each individual cell in the excel and then apply that to my merged dataframe? Note that the data is not formatted consistently across columns or rows, so I don't think I can apply styles in this manner. Here's the relevant portion of my code:
# iterate through all cells of merged dataframe
for rownum, row in output_df.iterrows():
    for column, value in row.iteritems():
        filename = row['File Name']
        cur_style = orig_excel.loc[orig_excel['File Name'] == filename, column][0].style  # pulls the style of the relevant cell in the original excel document
        target_style = output_df.loc[output_df['File Name'] == filename, column][0].style  # style of the cell in the merged dataframe
        target_style = cur_style  # set style in current output_df cell to match original excel file style
This code runs (slowly), but it doesn't seem to actually apply any styling to the output styleframe.
Looking through the documentation, I don't really see a method for applying styles at an individual styleframe container level--everything is geared towards doing it as a row or column. It also seems like you need to use a styler object to set the style.
Figured it out. I rejiggered my dataframe so that I could just use a .at instead of a .loc lookup. This, coupled with the apply_style_by_indexes method, got me where I needed to be:
for index, row in orig_excel.iterrows():
    for column, value in row.iteritems():
        index_num = output_df.index.get_loc(index)
        # Pull style to copy to new df
        cur_style = orig_excel.at[index, column].style
        # Apply original style to new df
        output_df.apply_style_by_indexes(output_df.index[index_num],
                                         cur_style,
                                         cols_to_style=column)
I have an existing Excel file, and another Excel file that has around 40,000 rows and around 300 columns (the original post showed screenshots of shortened versions of both). I would like to append values from the second Excel file to my existing one, but only the values that match the values in col4 of my existing file, so I end up with the combined result. Hope you guys get the picture of what I am trying to do.
Yes, that is possible in pandas, and it is way faster than anything in Excel:
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
This will look for the column "col4" in both tables, so it needs to be named this way in both of them.
Also be aware that if you have multiple rows in the second table for a single value in the first table, the result will contain as many lines as there are matches in the second table.
To read the Excel file you can use:
import pandas as pd
xl=pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')
For a more detailed description, see the pandas documentation.
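Putting it together, a minimal sketch of the whole flow (the second file name, its sheet name, and the output name are placeholders, not part of the original answer):
import pandas as pd

xl = pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')

# the second, much larger file (placeholder file and sheet names)
SecondTable = pd.read_excel('SecondFile.xlsx', 'sheet_name_SECOND_TABLE')

# left join on col4: every row of FirstTable is kept, matching columns from SecondTable are appended
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')

df_result.to_excel('Merged.xlsx', index=False)  # placeholder output file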