I have two files that partly contain the same items, which can be detected via a unique identifier (UID).
What I am trying to achieve: compare the UIDs in the first file to the UIDs in the second file. Where they match, a column in the first file should be filled with the content of the corresponding column in the second file.
import pandas as pd

dfFile2 = pd.read_csv("File2.csv", sep=";")
dfFile1 = pd.read_csv("File1.csv", sep=";")

UIDURLS = dfFile2["UID"]
UIDKonf = dfFile1["UID"]
URLSurl = dfFile2["URL"]   # was `dfUrls`, which is never defined
URLSKonf = dfFile1["URL"]  # was `dfKonf`, which is never defined

for i in range(len(UIDKonf)):
    for j in range(len(UIDURLS)):
        if UIDKonf.at[i] == UIDURLS.at[j]:
            URLSKonf.at[i] = URLSurl[j]
The code above does not give me any errors, but I also want it to write directly into the original .csv rather than only into the DataFrame. How could I achieve that?
Best
Once you have created the DataFrame with the updated information you want, you can write it back to a CSV using DataFrame.to_csv.
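A minimal sketch of the whole round trip, with made-up data standing in for the two files (the `File1.csv`/`File2.csv` names, the `UID`/`URL` columns, and `sep=";"` come from the question; `Series.map` is used here as a compact alternative to the nested loops):

```python
import pandas as pd

# Toy stand-ins for the contents of File1.csv and File2.csv.
df1 = pd.DataFrame({"UID": [1, 2, 3], "URL": ["", "", ""]})
df2 = pd.DataFrame({"UID": [2, 3], "URL": ["b.example", "c.example"]})

# Build a UID -> URL lookup from the second file and apply it to the first;
# UIDs without a match keep their original value.
mapping = df2.set_index("UID")["URL"]
df1["URL"] = df1["UID"].map(mapping).fillna(df1["URL"])

# Overwrite the original file with the updated frame.
df1.to_csv("File1.csv", sep=";", index=False)
```

`index=False` keeps pandas from writing its row index as an extra column, so the file keeps the same shape it was read with.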
I am new to programming and trying to learn. I am comparing two documents that contain very similar data. I want to find out whether data from the column "concatenate" is also found in the column "concatenate" of the other document, because I want to know what changes were made during the last update of the file.
If a value cannot be found, the whole row should be copied to a new document; then I know that this dataset has been changed.
Here is the code I have:
import pandas as pd
# load the data from the two files into Pandas dataframes
df1 = pd.read_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/MergedKeepa_2023-02-05.xlsx')
df2 = pd.read_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/MergedKeepa_2023-02-04.xlsx')
# extract the values from column Concatenate in both dataframes
col_a_df1 = df1['concatenate']
col_a_df2 = df2['concatenate']
# find the intersection of the values in column A of both dataframes
intersection = col_a_df1.isin(col_a_df2)
# filter the rows of df1 where the value in column A is not found in df2
df1 = df1[intersection]
# write the filtered data to a new Excel file
df1.to_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/filtered_data.xlsx', index=False)
I just duplicated the two input files, which means I should receive a blank document, but data is still being copied to the new sheet.
What did I do wrong?
Many thanks for your support!
If the value cannot be found, this whole row should be copied to a new document.
IIUC, you need ~, the NOT operator, to negate your boolean mask:
df1 = df1[~intersection]
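A small self-contained sketch of the fix, with toy frames standing in for the two Excel files: `isin` marks the rows of `df1` that also appear in `df2`, so negating the mask keeps only the changed rows.

```python
import pandas as pd

# Hypothetical stand-ins for the two workbooks.
df1 = pd.DataFrame({"concatenate": ["a", "b", "c"]})
df2 = pd.DataFrame({"concatenate": ["a", "c"]})

# True where a df1 value also exists in df2.
intersection = df1["concatenate"].isin(df2["concatenate"])

# Negate the mask: keep only rows of df1 NOT found in df2.
changed = df1[~intersection]
```

With identical input files the mask is all True, so `df1[~intersection]` is empty, which matches the expected blank output.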
I'm trying to replace the blank cells in Excel with 0 using Python. I have a loop script that checks two Excel files with the same worksheet, column headers, and values. The script writes the count to the Count column in Excel 2 if the value of column A in Excel 2 matches the value of column A in Excel 1. The problem I have is that for values in column A of Excel 2 that don't have a match in column A of Excel 1, the Count column in Excel 2 is left blank.
Below is the part of the script that checks the two Excel files. I tried the suggestion from this link, Pandas: replace empty cell to 0, but it doesn't work for me: I get the error result.fillna(0, inplace=True) NameError: name 'result' is not defined. Guidance on how to achieve my goal would be very nice. Thank you in advance.
import pandas as pd
import openpyxl

daily_data = openpyxl.load_workbook('C:/Test.xlsx')
master_data = openpyxl.load_workbook('C:/Source.xlsx')
daily_sheet = daily_data['WorkSheet']
master_sheet = master_data['WorkSheet']

for i in daily_sheet.iter_rows():
    Column_A = i[0].value
    row_number = i[0].row
    for j in master_sheet.iter_rows():
        if j[0].value == Column_A:
            daily_sheet.cell(row=row_number, column=6).value = j[1].value
            #print(j[1].value)

daily_data.save('C:/Test.xlsx')
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
It seems you've made a few fundamental mistakes in your approach. First off, result is an object: specifically, it's a DataFrame that someone else made (in that other post); it is not your DataFrame. You need to call the method on your own DataFrame. Python takes an object-oriented approach, meaning that objects are the key players: .fillna() is a method that operates on your object. The usage in a toy example is as follows:
my_df = pd.read_csv(my_path_to_my_df_)
my_df.fillna(0, inplace=True)
Also, this method is for DataFrames, so you will need to convert your data from the object the openpyxl library creates (at least, that's what I would assume; I haven't used that library before). For your data, that would be:
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
I've placed the contents of a .csv file in the data list with the following code in a Jupyter notebook:
import csv

data = []
with open("president_county_candidate.csv", "r") as f:
    contents = csv.reader(f)
    for c in contents:
        data.append(c)
I can only get an element through its index number, but that gives me the whole row of the list. How can I select specific elements, and get the count?
You can use the pandas library to read the CSV into a DataFrame and perform various operations on it. Refer to the code below:
import pandas as pd
df = pd.read_csv('president_county_candidate.csv')
print(df.shape)
This will give you the number of rows and columns present in the CSV file.
Also, in order to extract specific columns, you can use:
newdf = df[['candidates', 'votes']]
This will give you the new dataframe with the above mentioned columns.
Also, here is a solution following the approach mentioned in your question: extract the column indices and then, while parsing the contents, use the index numbers of the columns you need.
For example:
import csv

data = []
with open("president_county_candidate.csv", "r") as f:
    contents = csv.reader(f)
    for c in contents:
        data.append([c[2], c[4]])
This will collect only the Candidate and Votes data.
Note: it's better to get the index number of a specific column using list.index('columnName') and pass that variable instead of a hard-coded index.
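A sketch of that note in action, using the header row to look up column positions by name (the column names `candidate` and `votes` and the sample row are assumptions; the real file is president_county_candidate.csv):

```python
import csv
import io

# Hypothetical CSV text standing in for the real file's contents.
text = (
    "state,county,candidate,party,votes\n"
    "Delaware,Kent,Joe Biden,DEM,44552\n"
)

reader = csv.reader(io.StringIO(text))
header = next(reader)                   # first row holds the column names
cand_idx = header.index("candidate")    # positions looked up by name,
votes_idx = header.index("votes")       # not hard-coded as 2 and 4

data = [[row[cand_idx], row[votes_idx]] for row in reader]
```

If the file's column order ever changes, this version keeps working, while hard-coded indices would silently pick the wrong columns.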
I am trying to export some data from Python to Excel using pandas, and not succeeding. The data is a dictionary whose keys are tuples of 4 elements.
I am currently using the following code:
df = pd.DataFrame(data)
df.to_excel("*file location*", index=False)
and I get an exported 2-column table as follows:
I am trying to get an Excel table where the first 3 elements of the key are split into their own columns, and the 4th element of the key (Period, in this case) becomes a column name, similar to the example below.
I have tried various additions to the code above, but I'm a bit new to this, and nothing has worked so far.
Based on what you show us (which is not reproducible), you need pandas.MultiIndex:
df_ = df.set_index(0)  # `0` since your tuples seem to be located in the first column

# Convert the flat index of tuples into an N-dimensional index
df_.index = pd.MultiIndex.from_tuples(df_.index)

# `unstack` does the job of pivoting your periods into columns
df_.unstack(level=-1).droplevel(0, axis=1).to_excel(
    "file location", index=True
)
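To see what the `set_index` / `MultiIndex.from_tuples` / `unstack` chain does, here is a tiny stand-in for the dictionary-built frame (the 4-tuple keys and period values are invented for illustration):

```python
import pandas as pd

# Toy frame: column 0 holds 4-element tuple keys, column 1 holds the values.
df = pd.DataFrame({
    0: [("A", "x", 1, "2021"), ("A", "x", 1, "2022")],
    1: [10, 20],
})

df_ = df.set_index(0)
df_.index = pd.MultiIndex.from_tuples(df_.index)

# The last tuple element (the period) becomes the column labels.
wide = df_.unstack(level=-1).droplevel(0, axis=1)
```

The first three tuple elements remain as a 3-level row index, and each distinct period becomes its own column, which is the requested layout.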
You could try exporting to a CSV instead:
df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', index = False)
which can then easily be converted to an Excel file.
I'm trying to standardize data in a large CSV file. I want to replace the string "Greek" with a different string, "Q35497", but only in a single column (I don't want to replace every instance of the word "Greek" with "Q35497" in every column, just in a column named "P407"). This is what I have so far:
data_frame = pd.read_csv('/data.csv')
data_frame["P407"] = data_frame['P407'].astype(str)
data_frame["P407"].str.replace('Greek', 'Q35497')
But what this does is just create a single column "P407" with a list of strings (such as 'Q35497') and I can't append it to the whole csv table.
I tried using DataFrame.replace:
data_frame = data_frame.replace(
    to_replace={"P407": {'Greek': 'Q35497'}},
    inplace=True
)
But this just creates an empty set. I also can't figure out why data_frame["P407"] creates a separate series that cannot be added to the original csv file.
Your approach is correct, but you forgot to store the modified column back into the DataFrame:
data_frame = pd.read_csv('/data.csv')
data_frame["P407"] = data_frame["P407"].str.replace('Greek', 'Q35497')
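A runnable sketch of the full fix, with made-up rows standing in for /data.csv (the second column name `label` is invented): `str.replace` returns a new Series rather than modifying the column, so the result must be assigned back, and the frame written out to change the file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV's contents.
data_frame = pd.DataFrame({"P407": ["Greek", "Latin"], "label": ["a", "b"]})

# Assign the replaced Series back into the same column.
data_frame["P407"] = data_frame["P407"].str.replace("Greek", "Q35497")

# Write the whole table, with the modified column, back to disk.
data_frame.to_csv("data.csv", index=False)
```

This also explains the `DataFrame.replace` attempt: with `inplace=True`, `replace` returns None, so assigning its result back wipes out the variable.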