combine/merge two csv using pandas/python - python

I have two csvs, I want to combine or merge these csvs as left join...
my key column is "id", I have same non-key column as "result" in both csvs, but I want to override "result" column if any value exists in "result" column of 2nd CSV . How can I achieve that using pandas or any scripting lang. Please see my final expected output.
Input
input.csv:
id,scenario,data1,data2,result
1,s1,300,400,"{s1,not added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,
output.csv:
id,result
1,"{s1,added}"
3,"{s3,added}"
Expected Output
final_output.csv
id,scenario,data1,data2,result
1,s1,300,400,"{s1,added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,"{s3,added}"
Current Code:
import pandas as pd
a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='test_id',how='left')
merged.to_csv("final_output.csv", index=False)
Question:
Using this code I am getting the result column twice. I want only once and it should override if value exists in that column. How do I get a single result column?

try this, this works as well
import pandas as pd
import numpy as np
c=pd.merge(a,b,on='id',how='left')
lst=[]
for i in c.index:
if(c.iloc[i]['result_x']!=''):
lst.append(c.iloc[i]['result_x'])
else:
lst.append(c.iloc[i]['result_y'])
c['result']=pd.Series(lst)
del c['result_x']
del c['result_y']

This will combine the columns as desired:
import pandas as pd
a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='id', how='outer')
def merge_results(row):
y = row['result_y']
return row['result_x'] if isinstance(y, float) else y
merged['result'] = merged.apply(merge_results, axis=1)
del merged['result_x']
del merged['result_y']
merged.to_csv("final_output.csv", index=False)

You can also use concat as below.
import pandas as pd
a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
frames=[a,b]
mergedFrames=pd.DataFrame()
mergedFrames=pd.concat(frames, sort=True)
mergedFrames.to_csv(path/to/location)
NOTE: The sort=True is added to avoid some warnings

Related

Is there a way to remove header and split columns with pandas read_csv?

[Edited: working code at the end]
I have a CSV file with many rows, but only one column. I want to separate the rows' values into columns.
I have tried
import pandas as pd
df = pd.read_csv("TEST1.csv")
final = [v.split(";") for v in df]
print(final)
However, it didn't work. My CSV file doesn't have a header, yet the code reads the first row as a header. I don't know why, but the code returned only the header with the splits, and ignored the remainder of the data.
For this, I've also tried
import pandas as pd
df = pd.read_csv("TEST1.csv").shift(periods=1)
final = [v.split(";") for v in df]
print(final)
Which also returned the same error; and
import pandas as pd
df = pd.read_csv("TEST1.csv",header=None)
final = [v.split(";") for v in df]
print(final)
Which returned
AttributeError: 'int' object has no attribute 'split'
I presume it did that because when header=None or header=0, it appears as 0; and for some reason, the final = [v.split(";") for v in df] is only reading the header.
Also, I have tried inserting a new header:
import pandas as pd
df = pd.read_csv("TEST1.csv")
final = [v.split(";") for v in df]
headerList = ['Time','Type','Value','Size']
pd.DataFrame(final).to_csv("TEST2.csv",header=headerList)
And it did work, partly. There is a new header, but the only row in the csv file is the old header (which is part of the data); none of the other data has transferred to the TEST2.csv file.
Is there any way you could shed a light upon this issue, so I can split all my data?
Many thanks.
EDIT: Thanks to #1extralime, here is the working code:
import pandas as pd
df = pd.read_csv("TEST1.csv",sep=';')
df.columns = ['Time','Type','Value','Size']
df.to_csv("TEST2.csv")
Try:
import pandas as pd
df = pd.read_csv('TEST1.csv', sep=';')
df.columns = ['Time', 'Type', 'Value', 'Size']

Extra column appears when appending selected row from one csv to another in Python

I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s,d):
import pandas as pd
df = pd.read_csv(s, sep=';', header=None)
df_t = df.T
df_t.iloc[0:1, 0:1] = 'Time Point'
df_t.at[1, 0] = 1
df_t.columns = df_t.iloc[0]
df_new = df_t.drop(0)
pdb = pd.read_csv(d, sep=';')
newpd = pdb.append(df_new)
from pandas import DataFrame
newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, there is an extra "Unnamed" column appearing on the left:
Do you know how to fix that?..
Please, help :(
My csv documents from which I select a column look like this:
You have to add index=False to your to_csv() method

How to delete CSV specific row.(like user_name = Max)

I have a CSV file of around 40K rows. And I want to delete 10K rows with conditions(eg: user_name = Max). And my data is like :
user1_name,user2_name,distance
"Unews","CCSSuptConnelly",""
"Unews","GwapTeamFre",""
"Unews","WilsonRecDept","996.27"
"Unews","ChiOmega_ISU","1025.03"
"Unews","officialtshay",""
"Unews","hari",""
"Unews","lashaunlester7",""
"Unews","JakeSlaughter5","509.53"
Thank you!
import pandas as pd
Read the csv
df = pd.read_csv('filename')
Create an index
index_names = df[ df['user2_name'] == 'Max' ].index
Drop it
df.drop(index_names, inplace = True)
You can use the Pandas library for this kind of problems and then use the .loc[] function. Link to the docs: Loc Function in pandas
import pandas as pd
df = pd.read_csv('name.csv')
df_filtered = df.loc[!(df['user_name'] == 'Max']),:]

Removing duplicates for a row with duplicates in one column dynamic data

I am attempting to remove duplicates for Column D for dynamic data with no headers or identifying features. I am attempting to delete all the rows where there are duplicates for Column D. I am converting excel to a dataframe, removing duplicates and then putting it back into excel. However I keep getting an assortment of errors or no duplicates removed. I am from a VBA background but we are migrating to Python
Attempted:
df.drop_duplicates(["C"])
df = pd.DataFrame({"C"})
df.groupby(["C"]).filter(lambda df:df.shape[0] == 1)
As well an assortment of other variations. I was able to do this in VBA with one line. Any ideas why this keeps causing this issue.
\\ import pandas as pd
df = pd.DataFrame({"C"]})
df.drop_duplicates(subset=[''C'], keep=False)
DG=df.groupby([''C'])
print pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1])
I was able to do this in VBA with one line. Any ideas why this keeps causing this issue.
Code itself Template-
df = pd.read_excel("C:/wadwa.xlsx", sheetname=0)
columns_to_drop = ['d.1']
#columns_to_drop = ['d.1', 'b.1', 'e.1', 'f.1', 'g.1']
import pandas as pd
Df = df[[col for col in df.columns if col not in columns_to_drop]]
print(df)
writer = pd.ExcelWriter('C:/dadwa/dwad.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
print(df)
Code:
import pandas as pd
df = pd.read_excel("C:/Users/Documents/Book1.xlsx", sheetname=0)
import pandas as pd
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
writer = pd.ExcelWriter('C:/Users//Documents/Book2.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
print(df)
I think you need assign back and select 4.th columns by position:
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)

pandas Combine Excel Spreadsheets

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can use something for the parse argument that will mean "all spreadsheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a=xl.sheet_names
b = pd.DataFrame()
for i in a:
b.append(xl.parse(i))
b
But it's not "working".
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd
Set sheetname to None in order to load all sheets into a dict of dataframes
and ignore index to avoid overlapping values later (see comment by #bunji)
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
Then concatenate all dataframes
cdf = pd.concat(df.values())
print(cdf)
import pandas as pd
f = 'file.xlsx'
df = pd.read_excel(f, sheet_name=None, ignore_index=True)
df2 = pd.concat(df, sort=True)
df2.to_excel('merged.xlsx',
engine='xlsxwriter',
sheet_name=Merged,
header = True,
index=False)

Categories