I'm looking to insert a few characters at the beginning of a cell in a CSV using Python. Python needs to do this to the same cell on each row.
As an example, see:
Inserting values into specific cells in csv with python
So:
row 1 - cell 3 'Qwerty' - add 3 characters (HAL) to the beginning of the cell, so the cell now reads 'HALQwerty'
row 2 - cell 3 'Qwerty' - add 3 characters (HAL) to the beginning of the cell, so the cell now reads 'HALQwerty'
row 3 - cell 3 'Qwerty' - add 3 characters (HAL) to the beginning of the cell, so the cell now reads 'HALQwerty'
Does anyone know how to do this?
I found this link:
https://www.protechtraining.com/blog/post/python-for-beginners-reading-manipulating-csv-files-737
But it doesn't go into enough detail.
The simplest way is probably to use Pandas. First run 'pip install pandas':
import pandas as pd
# read the CSV file and store into dataframe
df = pd.read_csv("test.csv")
# change value of a single cell directly
# this is selecting index 4 (row index) and then the column name
df.at[4,'column-name'] = 'HALQwerty'
# change multiple values simultaneously
# here we have a range of rows (0:4, and with .loc the end is inclusive) and a couple of columns
df.loc[0:4, ['Num','NAME']] = [100, 'HALQwerty']
# write out the CSV file (index=False avoids writing an extra index column)
df.to_csv("output.csv", index=False)
Pandas allows for a lot of control over your CSV files, and it's well documented:
https://pandas.pydata.org/docs/index.html
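For the literal case in the question, prepending "HAL" to one column on every row, a minimal sketch (the column name 'col3' here is a placeholder for your real header):
import pandas as pd

df = pd.read_csv("test.csv")
# prepend 'HAL' to every value in the target column; 'col3' is a placeholder
df['col3'] = 'HAL' + df['col3'].astype(str)
df.to_csv("output.csv", index=False)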
Edit: To allow conditional prepending of text:
df = pd.DataFrame({'col1':['a', 'QWERTY', "QWERTY", 'b'], 'col2':['c', 'tortilla', 'giraffe', 'monkey'] })
mask = (df['col1'] == 'QWERTY')
df.loc[mask, 'col1'] = 'HAL' + df['col1'].astype(str)
The mask is the subset of rows that match a condition (here, where the cell value equals "QWERTY"). The .loc indexer selects that subset of the dataframe and lets you apply whatever change you want.
Related
I'm trying to replace the blank cells in Excel with 0 using Python. I have a loop script that checks 2 Excel files with the same worksheet, column headers and values. Now, from the picture attached,
the script writes the count to the Count column in Excel 2 if the value of Column A in Excel 2 matches the value of Column A in Excel 1. Now, the problem I have is with the values in Column A of Excel 2 that don't have a match in Column A of Excel 1: the script leaves the Count column in Excel 2 blank for those rows.
Below is the part of the script that checks the 2 Excel files. I tried the suggestion from this link: Pandas: replace empty cell to 0, but it doesn't work for me; running result.fillna(0, inplace=True) gives the error message NameError: name 'result' is not defined. Guidance on how to achieve my goal would be very nice. Thank you in advance.
import pandas as pd
import os
import openpyxl
daily_data = openpyxl.load_workbook('C:/Test.xlsx')
master_data = openpyxl.load_workbook('C:/Source.xlsx')
daily_sheet = daily_data['WorkSheet']
master_sheet = master_data['WorkSheet']
for i in daily_sheet.iter_rows():
    Column_A = i[0].value
    row_number = i[0].row
    for j in master_sheet.iter_rows():
        if j[0].value == Column_A:
            daily_sheet.cell(row=row_number, column=6).value = j[1].value
            #print(j[1].value)
daily_data.save('C:/Test.xlsx')

daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
It seems you've made a few fundamental mistakes in your approach. First off, "result" is an object; specifically, it's a dataframe that someone else made (from that other post). It is not your dataframe, so you need to call the method on your own dataframe. Python takes an object-oriented approach, meaning that objects are the key players, and .fillna() is a method that operates on your object. The usage for a toy example is as follows:
my_df = pd.read_csv(my_path_to_my_df_)
my_df.fillna(0, inplace=True)
Also, this method is for dataframes, so you will need to convert from the object the openpyxl library creates; at least that's what I would assume, as I haven't used that library before. Therefore, for your data you would do this:
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
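One thing to note: fillna only changes the DataFrame in memory. If you want the zeros persisted to the file, you also need a to_excel call at the end; a minimal sketch reusing the same path:
import pandas as pd

daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
# write the filled values back; index=False avoids adding an extra index column
daily_data.to_excel('C:/Test.xlsx', index=False)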
I have an Excel file which has some unwanted rows (some blank and some with text) before my real header. When I read it through pandas, the below code works fine.
df = pd.read_excel("C:/path.xlsx", skiprows=15)
BUT, the problem is that the number of unwanted rows can change every time I pull the data. I do not want to manually check and change the skiprows value.
What is the easiest way of fixing this? I am a beginner, so if you provide a solution, please explain it in a bit of detail.
If it matters, my 1st column header is always "No.", just in case you want to use it as a reference in the code.
Pandas is not the best library for such operations; I would recommend using openpyxl with a for loop. I cannot see any sample of your data, but if you really want to use pandas, we can consider something like this:
Let's say the data has some unwanted rows above a header row whose first cell is "No.". Then:
Read all data and do not skip any row:
df = pd.read_excel("C:/path.xlsx")
Find the index of the first row with "No." in the first column and take everything from there to the end:
df = df.iloc[df[df.iloc[:, 0].eq('No.')].index[0]:, :].reset_index(drop=True)
Set first row as columns:
df.columns = df.iloc[0]
Drop the first row (we do not need it because the column names are repeated there):
df = df.drop(0).reset_index(drop=True)
After these operations we get our output:
print(df)
No. Column1 Column2 Column3
0 1 a b c
1 2 c a b
2 3 b c a
3 4 y b z
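Putting the four steps together, a minimal sketch of a reusable helper (the function name is made up, and it assumes the marker "No." sits in the first column of the real header row):
import pandas as pd

def read_with_dynamic_header(path, marker="No."):
    # read the sheet without assuming any header row
    raw = pd.read_excel(path, header=None)
    # locate the first row whose first column equals the marker
    header_idx = raw[raw.iloc[:, 0].eq(marker)].index[0]
    # everything below that row is data; the row itself holds the column names
    df = raw.iloc[header_idx + 1:].reset_index(drop=True)
    df.columns = raw.iloc[header_idx]
    return df

df = read_with_dynamic_header("C:/path.xlsx")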
I would like to sort out only the columns and rows I want to use when downloading a CSV.
with
df = pd.read_csv("https://data.org/data.csv",usecols = ['Lion','Tree'])
I can read only the columns I want, but how can I read only the rows whose "Lion" column contains the word "animal", for example?
If what you're asking is to filter rows while reading the csv file, the answer is that there is no built-in way to do that.
But you can do what you want once the csv file has been loaded into a DataFrame, like this:
df = df.loc[df['Lion'] == 'animal']
Explanation:
DataFrame.loc allows you to access a group of rows and columns by label(s) or a boolean array.
And here, df['Lion'] == 'animal' will return a boolean array like for example:
0 True
3 True
This means that rows 0 and 3 match the condition (their values equal the string 'animal'), so loc will select those two rows.
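So, combined with the read_csv call from the question, the whole thing would look like this:
import pandas as pd

df = pd.read_csv("https://data.org/data.csv", usecols=['Lion', 'Tree'])
# keep only the rows where the 'Lion' column equals 'animal'
df = df.loc[df['Lion'] == 'animal']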
I'm trying to compare two dataframes (the headers are the same in both) and highlight the data which is not the same in both frames.
Now I want to write the highlighted rows to an Excel sheet, keeping the headers, and I'm unable to do that.
You can check for differences by comparing each element of each corresponding row (here I use the unique id column to find corresponding rows). If there is a difference, you can collect that row and build a new dataframe from the collected rows. Finally, save the new dataframe in Excel format.
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [2, 2, 3], [3, 2, 3]], columns=['id', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [2, "different", 2], [3, 2, 3]], columns=['id', 'B', 'C'])

different_rows = []
for i, row in df1.iterrows():
    # find the row in df2 with the matching unique id
    compare_row = df2.loc[df2['id'] == row['id']].iloc[0]
    if all(row == compare_row):
        continue
    different_rows.append(compare_row)

# DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
df_different_rows = pd.DataFrame(different_rows)
This produces another df that has all the rows that are different between df1 and df2.
print(df_different_rows)
  id          B  C
1  2  different  2
Save using the .to_excel() method:
df_different_rows.to_excel('df_different_rows.xlsx')
Check out openpyxl (e.g. PatternFill) if you want to highlight cells in the Excel file.
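For completeness, a minimal sketch of highlighting with PatternFill (this just fills every data cell yellow; adapt it to mark only the differing cells):
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

wb = load_workbook('df_different_rows.xlsx')
ws = wb.active
yellow = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
for row in ws.iter_rows(min_row=2):  # skip the header row
    for cell in row:
        cell.fill = yellow
wb.save('df_different_rows.xlsx')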
Step 1: Select the rows you want and store them in a new frame, say df (selecting rows in Python can be done using this).
Step 2: Use this:
df.to_excel(r'C:\Users\Desktop\selected_dataframe.xlsx')
# Don't forget to add '.xlsx' at the end of the path
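Putting both steps together (big_df, the source path and the filter condition here are all placeholders for your own frame and comparison):
import pandas as pd

big_df = pd.read_excel(r'C:\Users\Desktop\data.xlsx')  # hypothetical source file
# Step 1: select the rows you want
df = big_df.loc[big_df['B'] == 'different']
# Step 2: export them, keeping the headers
df.to_excel(r'C:\Users\Desktop\selected_dataframe.xlsx', index=False)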
I have to import tables from multiple Excel files into Python, specifically as pandas DataFrames.
The problem is that the Excel files do not all have the same structure;
in particular, in some of them the table starts at cell A1 while in others it starts at A2 or even B1 or B2.
The only thing that stays constant across all the files is the headers, positioned in the first row of the table.
So, for instance, "setting" is always written in the first row of the table, but "setting" is sometimes at position A1 and sometimes at position B2.
Currently, I am manually modifying the numHeader and skip variables passed to pandas.read_excel for each single Excel file, but since there are quite a lot of files, doing this manually every time is quite a waste of time.
CurrExcelFile = pd.read_excel(Files[i], \
header=numHeader, skiprows=skip)
Does a package exist, or could one easily be written, that takes as a parameter a string identifying the first element of a table?
Then I could just pass "setting" and the script would automatically get the index of that cell and start fetching data from there.
UPDATE:
So currently I have managed it by first importing the whole sheet, finding where the "setting" value is, then dropping the unnecessary columns, renaming the data frame's columns and finally dropping the unnecessary rows.
import numpy as np
import pandas as pd

Test = pd.read_excel('excelfile', sheet_name='sheetname')

# Find the row index and the column position of the cell containing "setting"
for column in Test.columns:
    tmp = Test[column] == "setting"
    if len(Test.loc[tmp].index) == 1:
        RowInd = Test.loc[tmp].index[0]
        ColPos = Test.columns.get_loc(column)

# Drop the columns to the left of the "setting" column
ColumnsToDrop = Test.columns[np.arange(0, ColPos)]
Test.drop(ColumnsToDrop, inplace=True, axis=1)

# Rename the columns using the row that holds the real header
Test.columns = (Test.iloc[RowInd])

# Drop everything above and including the header row
Test.drop(np.arange(0, RowInd + 1), inplace=True, axis=0)
This is rather a workaround, and I wish there were an easier solution.
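A somewhat shorter way to express the same idea, sketched as a helper (the function name is made up, and it assumes "setting" appears exactly once in the sheet):
import pandas as pd

def read_table_from_marker(path, sheet, marker="setting"):
    # read the raw sheet with no header so cell positions are preserved
    raw = pd.read_excel(path, sheet_name=sheet, header=None)
    # find the (row, column) of the first cell equal to the marker
    hits = raw.eq(marker)
    row_idx = hits.any(axis=1).idxmax()
    col_idx = hits.any(axis=0).idxmax()
    # slice off everything above and to the left, then promote the header row
    block = raw.iloc[row_idx:, col_idx:]
    block.columns = block.iloc[0]
    return block.iloc[1:].reset_index(drop=True)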