I'm new to Python and trying to do the following.
I have a csv file like below (input.csv):
a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
I'd like to remove duplicates within each row to get the output below:
a,v,s,f (output.csv)
china,usa, and uk,france
india,australia,usa,uk
japan,south africa,,new zealand
Notice that although 'usa' is repeated in two different rows, it is kept intact, unlike 'china' and 'japan', which are repeated within the same row.
I tried using OrderedDict from collections in the following way:
from collections import OrderedDict
out = open ("output.csv","w")
items = open("input.csv").readlines()
print >> out, list(OrderedDict.fromkeys(items))
but it moved all the data into one single row
Iterating over rows and deleting items as you go can corrupt the dataset: every item has a position (row/column), and deleting one shifts the items after it.
Pandas is a good fit for such scenarios. By selecting the items in the same row, you can apply a function that rebuilds the row while respecting each item's position. The in operator handles cases like "china and uk", and duplicated values are replaced with an empty string.
def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d
Your code would look like:
import pandas as pd
from io import StringIO

data = """a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand"""

df = pd.read_csv(StringIO(data))

def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d

print(df.apply(trans, axis=1))
       a             v        s            f
0  china           usa   and uk       france
1  india     australia      usa           uk
2  japan  south africa           new zealand
To read your own csv file instead of the inline data, you just need to pass the file name. More details can be found in the pandas read_csv documentation:
df= pd.read_csv("filename.csv")
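For completeness, a minimal sketch of writing the result back out to output.csv, assuming the file names from the question and the trans function above (on recent pandas, result_type='expand' turns the returned lists back into columns):

import pandas as pd

df = pd.read_csv("input.csv")
out = df.apply(trans, axis=1, result_type='expand')  # expand each returned list back into columns
out.columns = df.columns
out.to_csv("output.csv", index=False)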
This can actually be asked more specifically as "How to remove duplicate items from lists", for which there's an existing solution: Removing duplicates in lists
So, assuming that your CSV file looks like this:
items.csv
a,v,s,f
china,usa,china,uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
I intentionally changed "china and uk" in line 2 to "china,uk". Note below.
Then the script to remove duplicates could be:
with open('items.csv', 'r') as f:
    for line in f.readlines():
        print(list(set(line.strip().split(','))))
Note: if the 2nd line really does contain "china and uk", you'd have to do something different than processing the file as a plain CSV.
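If you also want to keep the original column positions (so a removed duplicate leaves an empty field, as in the desired output), here is a minimal sketch, assuming the input.csv/output.csv names from the question; note that it only blanks exact duplicates and does not handle the substring case like "china and uk":

import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        seen = set()
        out_row = []
        for cell in row:
            # blank the cell if it already appeared in this row, keeping the column count intact
            out_row.append('' if cell in seen else cell)
            seen.add(cell)
        writer.writerow(out_row)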
Related
I'm trying to filter out bogus locations from a column in a data frame. The column is filled with locations taken from tweets, and some of them aren't real; I'm trying to separate those from the valid locations. Below is the code I have, but the output is wrong: it only returns France. I'm hoping someone can spot what I'm doing wrong here, or suggest another approach. Let me know if I haven't explained it well enough. Also, I assign the variables both outside and inside the function for testing purposes.
import pandas as pd
cn_csv = pd.read_csv("~/Downloads/cntry_list.csv") #this is just a list of every country along with respective alpha 2 and alpha 3 codes, see the link below to download csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv") #this is a dataframe with multiple columns, one being "source location" See edit below that displays data in "Source Location" column
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]
def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked
print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file. I downloaded the csv of the table data. https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make list from unique values of Source Location that match values from country_names
valid_names = list(results[results['Source Location']
.isin(country_names)]['Source Location']
.unique())
# with ~ select unique values that don't match country_names values
tobe_checked = list(results[~results['Source Location']
.isin(country_names)]['Source Location']
.unique())
That simpler approach should also avoid your unwanted result of only France being returned. Note, though, that the problem in your code may be the file read inside the function: it loads "cntrylst.csv" while the code outside the function loads "cntry_list.csv", as indicated by ScottC.
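If you'd rather keep the loop-based version, note that each i is a list produced by split(', '), not a single string, so the membership test against country_names.values only ever matches single-element entries such as ['France'], which would explain why only France comes back. A sketch of a token-level check, assuming the same column names as in your code:

def country_name_check(src_locs, country_names):
    valid_names = []
    tobe_checked = []
    for entry in src_locs:
        # compare each comma-separated token against the country list
        if any(token in country_names.values for token in entry.split(', ')):
            valid_names.append(entry)
        else:
            tobe_checked.append(entry)
    return valid_names, tobe_checked

valid, unknown = country_name_check(results["Source Location"], cn_csv['country'])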
Python 3.9.5/Pandas 1.1.3
I have a very large csv file with values that look like:
Ac\\Nme Products Inc.
and all the values are different company names with double backslashes in random places throughout.
I'm attempting to get rid of all the double backslashes. It's not working in Pandas, although a simple test against a standalone value using str.replace does work.
Example:
org = "Ac\\Nme Products Inc."
result = org.replace("\\","")
print(result)
returns AcNme Products Inc. as the output, as I would expect.
However, using Pandas with the names in a csv file:
import pandas as pd
csv_input = pd.read_csv('/Users/me/file.csv')
csv_input.replace("\\", "")
csv_input.to_csv('/Users/me/file_revised.csv', index=False)
When I open the new file_revised.csv file, the value still shows as Ac\\Nme Products Inc.
EDIT 1:
Here is a snippet of file.csv as requested:
id,company_name,address,country
1000566,A1 Comm\\Nodity Traders,LEVEL 28 THREE PACIFIC PLACE 1 QUEEN'S RD EAST HK,TH
1000579,"A2 A Mf\\g. Co., Ltd.",53 YONG-AN 2ND ST. TAINAN TAIWAN,CA
1000585,"A2 Z Logisitcs Indi\\Na Pvt., Ltd.",114A/1 1ST FLOOR SOUTH RAJA ST TUTICORIN - 628 001 TAMILNADU - INDIA,PE
Pandas doesn't have a DataFrame-level .str string accessor, but the frame can be updated per column:
for col in csv_input.columns:
    if col == 'that_int_column':  # skip non-string columns; .str only works on text
        continue
    # regex=False treats the pattern as a literal string, so every backslash is stripped
    csv_input[col] = csv_input[col].str.replace("\\", "", regex=False)
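Worth noting, as an aside beyond the original answer: the attempt in the question fails for two reasons. DataFrame.replace without regex=True only substitutes whole-cell matches, and its result was never assigned back. A frame-wide alternative sketch:

import pandas as pd

csv_input = pd.read_csv('/Users/me/file.csv')
# with regex=True, replace() substitutes inside strings in every text column
csv_input = csv_input.replace(r'\\', '', regex=True)
csv_input.to_csv('/Users/me/file_revised.csv', index=False)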
I have a Pandas dataframe (shape 40x2; two rows are shown below):
Honduras ['Water\nAgriculture\nHealth\nBiodiversity and...
Hungary ['not explicitly mentioned']
I would like to get rid of the list brackets [, ] and quotation ' signs so that it is saved in csv or xlsx in the following form:
Honduras Water,
Agriculture,
Health,
Biodiversity and...
Hungary not explicitly mentioned
Thank you very much.
Best,
Sharif
This considers only the first row, but you can apply the same transformation to all rows of your dataframe:
l1 = "Honduras ['Water\nAgriculture\nHealth\nBiodiversity']"
l1 = l1.replace(']', '').replace('[', '').replace("'", '').replace(' ', '')
l1 = l1.split('\n')
l1
Output:
['HondurasWater', 'Agriculture', 'Health', 'Biodiversity']
Note: after applying this to all rows, you will get a list for each row; you can then convert these lists to a dataframe and write it to csv.
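A minimal sketch of that last step, assuming the frame has two columns with the hypothetical names country and topics:

import pandas as pd

# hypothetical reconstruction of the 40x2 frame from the question
df = pd.DataFrame({
    'country': ['Honduras', 'Hungary'],
    'topics': ["['Water\nAgriculture\nHealth\nBiodiversity']",
               "['not explicitly mentioned']"],
})

def clean(cell):
    # strip the list/quote characters, then split the embedded newlines into a list
    cell = cell.replace('[', '').replace(']', '').replace("'", '')
    return cell.split('\n')

df['topics'] = df['topics'].apply(clean)
df.to_csv('output.csv', index=False)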
I have thousands of rows in a list like the one below that I would like to convert into a pandas table consisting of different columns.
2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe
Ideally I would like to create a pandas structure with the following columns: Date, Sale, ID, Region, and then fill it with the matching values.
E.g. so in the first row I have sales = 120, ID = 534343, region = North America and date = 2018-12-03 21:15:24.
Given that I have thousands of rows, what code could make this work?
Supposing your list is in a file, first read it into a string (if it is already a list in memory, the code below will differ slightly), then apply the parsing code.
To read into a string:
with open('/file/path/myfile.txt', 'r') as f:
    s = f.read()
Code for parsing:
import re
import pandas as pd

s = """2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe"""

sales_re = "Sales:([0-9]+)"
id_re = "ID:([0-9]+)"

lst = []
for line in s.split('\n'):
    date = line[0:19]
    sale = re.search(sales_re, line).groups()[0]
    id = re.search(id_re, line).groups()[0]
    # search from the last ":", add 1 to step over ":", len(id) to step over the digits, 1 to skip the space
    region = line[line.rfind(":") + 1 + len(id) + 1:]
    lst.append([date, sale, id, region])

df = pd.DataFrame(lst)
df.columns = ['date', 'sale', 'id', 'region']
In the example above, I assume everything is loaded into a string. I then use regular expressions to extract the harder parts of each line and append each parsed line to a list. Finally, the pandas.DataFrame constructor converts the list into a dataframe.
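As an optional follow-up (an addition beyond the original answer), the parsed columns can be given proper dtypes afterwards:

# parse timestamps and convert the numeric columns from strings to integers
df['date'] = pd.to_datetime(df['date'])
df['sale'] = df['sale'].astype(int)
df['id'] = df['id'].astype(int)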
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain values (such as Prize_Pool) results in pandas reading these entries as strings. I need to convert them to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is numerical:
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
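Since the question asks for a choice of which repeated entry gets removed, the keep argument covers that (a quick sketch):

In [8]: d.drop_duplicates('Contest_Date_EST', keep='last')    # keep the last repeated timestamp
In [9]: d.drop_duplicates('Contest_Date_EST', keep=False)     # drop every repeated timestamp entirely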
Edit: Just realized you're using pandas - I should have looked at that. I'll leave this here for now in case it's applicable, but if it gets downvoted I'll take it down by virtue of peer pressure :) I'll try to update it to use pandas later tonight.
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from the CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            i = 1
            for value in rows[key]:
                print("\nValue {index} : {value}".format(index=i, value=value))
                i += 1

    def readCsv(self, fileName):
        with open(fileName, newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Format each row into the final structure
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(list(v.values())) for v in g]
            return groupedRows

    def normalizeRow(self, row):
        row[9] = float(row[9].replace(',', ''))  # "Prize_Pool" is the 10th column in the header
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
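One caveat worth flagging: itertools.groupby() only merges consecutive rows with equal keys, so if duplicate timestamps aren't adjacent in the file, sort the rows first. A sketch:

# sort so that equal timestamps sit next to each other before grouping
rows = sorted(reader, key=lambda r: r["Contest_Date_EST"])
for k, g in itertools.groupby(rows, lambda r: r["Contest_Date_EST"]):
    first = next(g)  # e.g. keep only the first row per timestamp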
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)