How to merge and combine rows with the same id (index) in Python?

I am new to Python and I am working with a CSV file with over 10,000 rows. In my CSV file there are many rows with the same id, which I would like to merge into one row while also combining their information.
For instance, data.csv looks like this (id and info are the column names):
id| info
1112| storage is full and needs extra space
1112| there is many problems with space
1113| pickup cars come and take the garbage
1113| payment requires for the garbage
and I want to get the output as:
id| info
1112| storage is full and needs extra space there is many problems with space
1113| pickup cars come and take the garbage payment requires for the garbage
I already looked at a few posts such as 1 2 3, but none of them helped me answer my question.
It would be great if you could describe your solution with Python code that I can run and learn from on my side.
Thank you
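For reference, here is a minimal pandas sketch of the transformation being asked for (an assumption: data.csv uses "|" as the delimiter and the columns are literally named id and info):
import pandas as pd

# skipinitialspace drops the blank after each "|" in the sample.
df = pd.read_csv("data.csv", sep="|", skipinitialspace=True)

# Group the rows by id and join each group's info strings with a space.
merged = df.groupby("id")["info"].apply(" ".join).reset_index()
print(merged)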

I can think of a simpler way:
some_dict = {}
for idt, txt in reader:  #~ reader should yield your (id, info) rows
    some_dict[idt] = some_dict.get(idt, "") + txt
It should create your dream structure without any imports, and I hope in the most efficient way.
Just to explain: get takes a second argument, the value to return if the key isn't found in the dict. So for a new id it starts from an empty string and appends the text; if the id was already there, the text is appended to what already exists.
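A tiny illustration of that default:
d = {"1112": "storage is full"}
print(d.get("1112", ""))  # key found: the stored text is returned
print(d.get("9999", ""))  # key missing: the default "" is returned instead of a KeyError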
#Edit:
Here is a complete example with a reader :). Replace the filename and the column handling to match your file :)
import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for idt, info in reader:
        temp = some_dict.get(idt, "")
        some_dict[idt] = temp + " " + info if temp else info
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
This is a full program which should work for you.
However, it won't work if you have more than 2 columns in the file; in that case you can just replace idt, info with row and use indexes for the first and second elements.
#Next Edit:
For more than 2 columns:
import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        temp = some_dict.get(row[0], "")
        some_dict[row[0]] = temp + " " + row[1] if temp else row[1]
        #~ Here you can do something with the other columns if you want.
        #~ Example: another_dict[row[2]] = another_dict.get(row[2], "") + row[3]
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")

Just make a dictionary where the ids are keys:
from collections import defaultdict

by_id = defaultdict(list)
for id, info in your_list:
    by_id[id].append(info)

for key, value in by_id.items():
    print(key, value)
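To get the single merged string per id from the desired output, one extra join step would do it, for example:
merged = {key: " ".join(value) for key, value in by_id.items()}
for key, value in merged.items():
    print(key, value)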

Related

Difficulties in extracting data from CSV and matching it with Product model

I have a Django Python project where I'm trying to build an application that lets you upload a CSV file and then extracts its values to generate a sales report PDF. Products can be added via the admin console, and if those items appear in the CSV file (at a fixed position for now, e.g. item 4 in each row gets checked), they get extracted and added to the report (calculating the sum and storing the date of each purchase).
The csv file is as follows:
POS,Transaction id,Product,Quantity,Customer,Date
1,E100,TV,1,Test Customer,2022-09-19
2,E100,Laptop,3,Test Customer,2022-09-20
3,E200,TV,1,Test Customer,2022-09-21
4,E300,Smartphone,2,Test Customer,2022-09-22
5,E300,Laptop,5,New Customer,2022-09-23
6,E300,TV,1,New Customer,2022-09-23
7,E400,TV,2,ABC,2022-09-24
8,E500,Smartwatch,4,ABC,2022-09-25
I'm having 2 main problems. The first is that, following a tutorial by someone using a Mac (I'm on Windows, and saving the CSV file in Macintosh format didn't fix this), the code he uses just doesn't work for me. It literally returns an empty string:
with open(obj.csv_file.path, 'r') as f:
    reader = csv.reader(f)
    reader.__next__()
    for row in reader:
        data = "".join(row)
        data = data.split(';')
        data.pop()
My workaround is then to write the following code, which generates a string separated by ';':
for row in reader:
    print(row, type(row))
    data = " ".join(row)
    data = data.split(";")
As part of this first problem, I'm currently unable to grab the elements. I'm thinking that I probably need to convert the values into a list, but there's a problem with that, which is my main problem.
Going further in the code:
transaction_id = data[1]
product = data[2]
quantity = int(data[3])
customer = data[4]
date = parse_date(data[5])
print(transaction_id, product, quantity, customer, date)
try:
    product_obj = Product.objects.get(name__iexact=product)
except Product.DoesNotExist:
    product_obj = None
return HttpResponse()
The terminal prints out:
Quit the server with CONTROL-C.
file is being uploaded
['1', 'E100', 'TV', '1', 'Test Customer', '9/19/2022'] <class 'list'>
['1 E100 TV 1 Test Customer 9/19/2022'] <class 'list'>
(…)
transaction_id = data[1]
IndexError: list index out of range
It turns out that product_obj always ends up as None. It was also None when I was playing around with the iteration, where I could occasionally grab elements, but product_obj stayed None.
Using csv.reader there is no need to join the row and then split it back, etc.:
import csv

with open(obj.csv_file.path, 'r') as f:
    rdr = csv.reader(f)
    next(rdr)  # skip the header row

    # alternative 1: unpack the row inside the loop body
    for row in rdr:
        pos, transaction_id, product, quantity, customer, transaction_date = row
        # here work with product to check if it exists

    # alternative 2: unpack directly in the for statement
    for pos, transaction_id, product, quantity, customer, transaction_date in rdr:
        pass  # here work with product to check if it exists
Using csv.DictReader:
import csv

with open(obj.csv_file.path, 'r') as f:
    rdr = csv.DictReader(f)
    for row in rdr:
        product = row['Product']  # this will work even when the column order/number has changed
        # here work with product to check if it exists
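Putting that together with the lookup from the question, a sketch might look like this (Product, parse_date, and obj come from the question's own code; it assumes the dates are ISO-formatted, as in the sample CSV):
import csv

with open(obj.csv_file.path, 'r') as f:
    for row in csv.DictReader(f):
        product = row['Product']
        quantity = int(row['Quantity'])
        date = parse_date(row['Date'])  # expects ISO dates such as 2022-09-19
        try:
            product_obj = Product.objects.get(name__iexact=product)
        except Product.DoesNotExist:
            product_obj = None  # the product was not added via the admin console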
Now I see this is your third question in a row on the same problem. You've got some good working code and some ill advice. Anyway, it doesn't look like you actually tried to understand the code you were given.

What is wrong with my Python code? I'm trying to implement multiple search criteria for multiple files

I want to make changes to my code so I can search multiple input files and type multiple inputs, for example checking whether client order number 7896547 exists in the input files I provide. What is the best way to implement multiple search criteria for multiple files?
What I mean is giving, let's say, more than 50 inputs (1234567, 1234568, etc.) and also searching through multiple files (more than 10). What is the most efficient way to achieve this?
My code:
import csv

data = []
with open("C:/Users/CSV/salesreport1.csv", "C:/Users/CSV//salesreport2.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
name = input("Enter a string: ")
col = [x[0] for x in data]
if name in col:
    for x in range(0, len(data)):
        if name == data[x][0]:
            print(data[x])
else:
    print("Does not exist")
I thought I could add input files by just adding another file name in the open() part?
Also, for typing multiple inputs, is there a way to avoid using an array?
I mean identifierlist = ["x","y","z"], I don't want to do this.
As mentioned in the comments, you can read the CSV file as a DataFrame and use df["column_name"].str.find() to check for the occurrence of a value in a column.
import pandas as pd

df = pd.read_csv('file.csv')  # read the CSV file as a DataFrame
name = input("Enter a string: ")
result = df["column_name"].str.find(name)  # the lowest index of each occurrence is returned (-1 if absent)
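str.find returns -1 where there is no match, so to actually display the matching rows you could filter on that (a sketch; astype(str) is a precaution in case the column is numeric):
# Keep only the rows where the search string occurs in the column.
matches = df[df["column_name"].astype(str).str.find(name) != -1]
print(matches)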
Just to be clear: you are loading in some .csvs, getting a name from the user, and then printing out the rows that have that name in column 0? This seems like a nice use case for a Python dictionary, a common and simple data type in Python. If you aren't familiar, a dictionary lets you store information under some key. For example, you might have a key 'Daniel' that stores a list of information about someone named Daniel. In this situation, you could go through the rows and put every row into a dict with the name as the key. This would look like:
names_dict = {}
for line in file:
    # split the line and pull out the name
    split_line = line.split(",")
    name = split_line[0]
    names_dict[name] = line
and then to look up a name in the dict, you would just do:
daniel_info = names_dict['Daniel']
or more generally,
info = names_dict[name]
You could also use the dict for getting your list of names and checking whether a name exists, because Python dicts have a built-in membership test, the in operator. You could just say:
if 'Daniel' in names_dict:
or again,
if name in names_dict:
Another cool thing you could do with this project would be to let the user choose which column they are searching. For example, let them put in 3 for the column, and then search on whatever is in column 3 of each row: location, email, etc.
Finally, I will just show a complete concrete example of what I would do:
import csv

## you could add to this list or get it from the user
files_to_open = ["C:/Users/CSV/salesreport1.csv", "C:/Users/CSV//salesreport2.csv"]
data = []
## iterate through the list of files and add their rows to the list
for file in files_to_open:
    csvfile = open(file, "r")
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
    csvfile.close()

keys_dict = {}
column = int(input("Enter the column you want to search: "))
val = input("Enter the value you want to search for in this column: ")
for row in data:
    ## gets the thing in the right column
    v = row[column]
    ## adds the row to the dict with the right key (a later row with the same key overwrites an earlier one)
    keys_dict[v] = row
if val in keys_dict:
    print(keys_dict[val])
else:
    print("Nothing was found at this column and key!")
EDIT: For the multiple inputs, you could ask the user to type the names separated by commas, like "Daniel,Sam,Mike", and then split that input on the commas with output.split(","). You could then do:
names = output.split(",")
for name in names:
    if name in names_dict:
        print(names_dict[name])
    else:
        print(name, "was not found")
You can't use open like that.
According to the documentation, you can only pass one file per call.
Seeing as you want to check a large number of files, here's an example of a very simple approach that checks all the CSVs in the same folder as this script:
import csv
import os

# this gets all the filenames of files that end with .csv in the directory
csv_files = [x for x in os.listdir() if x.endswith('.csv')]
name = input("Enter a string: ")
# loop over every filename
for path in csv_files:
    # open the file
    with open(path) as file:
        reader = csv.reader(file)
        for row in reader:
            # if the word is an exact match for an entry in the row
            if name in row:
                # print the row
                print(row)
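If the files live in a specific folder rather than next to the script, the same loop works with glob (a sketch using the folder from the question):
import csv
import glob

name = input("Enter a string: ")
# collect every .csv in the given folder
for path in glob.glob("C:/Users/CSV/*.csv"):
    with open(path) as file:
        for row in csv.reader(file):
            if name in row:
                print(row)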

writing to a single CSV file from multiple dictionaries

Background
I have multiple dictionaries of different lengths. I need to write the values of the dictionaries to a single CSV file. I figured I could loop through each dictionary one by one and write the data to the CSV, but I ran into a small formatting issue.
Problem/Solution
I realized that after I loop through the first dictionary, the data of the second dictionary gets written starting in the row where the first dictionary ended, as displayed in the first image. I would ideally want my data to print as shown in the second image.
My Code
import csv

e = {'Jay': 10, 'Ray': 40}
c = {'Google': 5000}

def writeData():
    with open('employee_file20.csv', mode='w') as csv_file:
        fieldnames = ['emp_name', 'age', 'company_name', 'size']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for name in e:
            writer.writerow({'emp_name': name, 'age': e.get(name)})
        for company in c:
            writer.writerow({'company_name': company, 'size': c.get(company)})

writeData()
PS: I will have more than 2 dictionaries, so I am looking for a generic way to print data in the rows under the header for all the dictionaries. I am open to all solutions and suggestions.
If all dictionaries are of equal size, you could use zip to iterate over them in parallel. If they aren't of equal size, and you want the iteration to pad to the longest dict, you could use itertools.zip_longest
For example:
import csv
from itertools import zip_longest

e = {'Jay': 10, 'Ray': 40}
c = {'Google': 5000}

def writeData():
    with open('employee_file20.csv', mode='w') as csv_file:
        fieldnames = ['emp_name', 'age', 'company_name', 'size']
        writer = csv.writer(csv_file)
        writer.writerow(fieldnames)
        # note: this assumes e is the longer dict, so only missing companies need padding
        for employee, company in zip_longest(e.items(), c.items()):
            row = list(employee)
            row += list(company) if company is not None else ['', '']  # write empty fields if no company
            writer.writerow(row)

writeData()
If the dicts are of equal size, it's simpler:
import csv

e = {'Jay': 10, 'Ray': 40}
c = {'Google': 5000, 'Yahoo': 3000}

def writeData():
    with open('employee_file20.csv', mode='w') as csv_file:
        fieldnames = ['emp_name', 'age', 'company_name', 'size']
        writer = csv.writer(csv_file)
        writer.writerow(fieldnames)
        for employee, company in zip(e.items(), c.items()):
            writer.writerow(employee + company)

writeData()
A little side note: if you use Python 3 (3.7+), dictionaries preserve insertion order. This isn't the case in Python 2, so if you use Python 2 you should use collections.OrderedDict instead of the standard dictionary.
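For example, a drop-in replacement that keeps insertion order on Python 2 as well:
from collections import OrderedDict

# Ordered by insertion on every Python version.
e = OrderedDict([('Jay', 10), ('Ray', 40)])
c = OrderedDict([('Google', 5000)])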
There might be a more Pythonic solution, but I'd do something like this. I haven't used the csv writer module before, so I just built my own comma-separated output:
e = {'Jay': 10, 'Ray': 40}
c = {'Google': 5000}
dict_list = [e, c]  # add more dicts here
max_dict_size = max(len(d) for d in dict_list)
output = ""
# Add header information here.
for i in range(max_dict_size):
    for j in range(len(dict_list)):
        key, value = dict_list[j].popitem() if len(dict_list[j]) else ("", "")
        output += f"{key},{value},"
    output += "\n"
# Now output should contain the full text of the .csv file.
# Do the file manipulation here.
# You could also do it after each row,
# where I currently have the output += "\n".
Edit: A little more thinking and I found something that might polish this a bit. You could first map each dictionary to a list of its keys using the .keys() method, appending those lists to an empty list.
The advantage of that is that you'd be able to go "forward" instead of popping the items off the back. It also wouldn't destroy the dictionaries.
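A rough sketch of that idea (my reading of the suggested polish, with hypothetical variable names):
e = {'Jay': 10, 'Ray': 40}
c = {'Google': 5000}
dict_list = [e, c]

# Snapshot the keys up front so the dictionaries stay intact.
keys_list = [list(d.keys()) for d in dict_list]
max_dict_size = max(len(keys) for keys in keys_list)

output = ""
for i in range(max_dict_size):
    for d, keys in zip(dict_list, keys_list):
        key = keys[i] if i < len(keys) else ""
        value = d[key] if key else ""
        output += f"{key},{value},"
    output += "\n"
print(output)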

How do you check if a list has an entry with certain values?

It's a bit hard to explain my problem, but I'll do my best.
I have a .csv file with around 38k entries, and all entries have the same format. The format is:
Name1, party1, name2, party2, date, URL
Now I need to search through this .csv file and check for each entry whether an entry with the names and parties swapped exists.
So for example, I have the following entry:
S. Faber, CDA, J.A. v. Kemenade, PvdA, 1980.06.24, http://polidocs.nl/XML/MOT/1970028.xml
Where
name1 = S. Faber,
party1 = CDA,
name2 = J.A. v. Kemenade,
party2 = PvdA,
date = 1980.06.24,
URL = http://polidocs.nl/XML/MOT/1970028.xml
Now I need to check if there is an entry with these exact values:
J.A. v. Kemenade, PvdA, S. Faber, CDA, date, URL <- where date and URL don't matter
Any ideas?
If I understand your question correctly, the following piece of code should help you:
f = open("file.csv")
parsed_lines = []
for line in f:
    vals = line.split(",")
    parsed_lines.append([v.strip() for v in vals])
for idx, vals in enumerate(parsed_lines):
    for jdx in range(idx + 1, len(parsed_lines)):
        if (vals[0] == parsed_lines[jdx][0]) and \
           (vals[1] == parsed_lines[jdx][1]) and \
           (vals[2] == parsed_lines[jdx][2]) and \
           (vals[3] == parsed_lines[jdx][3]):
            print("line #%s looks similar to line #%s" % (idx + 1, jdx + 1))
You can try the built-in csv module and its DictReader class. Try something like this:
import csv

your_data = []
with open('data.csv') as csv_file:
    # assuming the file has no header row, the field names are given explicitly
    reader = csv.DictReader(csv_file, fieldnames=['name1', 'party1', 'name2', 'party2', 'date', 'URL'])
    for row in reader:
        # check all of your conditions here
        if row['name1'] == 'S. Faber' and row['party1'] == 'CDA':
            your_data.append(row)
I did not test the code, but it looks fine. You can find an example and more in the csv module documentation.
If you're sure about the data format, there's no need for object creation.
fo = open("data.csv", "r")
lines = fo.readlines()
print("S. Faber, CDA, J.A. v. Kemenade, PvdA, 1980.06.24, http://polidocs.nl/XML/MOT/1970028.xml\n" in lines)
notice the additional \n at the end of the string
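To sidestep the trailing newline altogether, one could strip each line first, a small variation on the same idea:
with open("data.csv") as fo:
    lines = [line.strip() for line in fo]
print("S. Faber, CDA, J.A. v. Kemenade, PvdA, 1980.06.24, http://polidocs.nl/XML/MOT/1970028.xml" in lines)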

Trouble with sorting a list and "for" statement syntax

I need help sorting a list from a text file. I'm reading a .txt file, adding some data, sorting it by population change %, and finally writing the result to a new text file.
The only thing giving me trouble now is the sort. I think the for statement syntax is what's causing my issues: I'm unsure where in the code I would add the sort and how I would apply it to the output of the for loop.
The population change data I am trying to sort by is item [1] in the list.
#Read file into script
NCFile = open("C:\filelocation\NC2010.txt")
#Save a write file
PopulationChange = open("C:\filelocation\Sorted_Population_Change_Output.txt", "w")
#Read everything into lines, except for first(header) row
lines = NCFile.readlines()[1:]
#Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010 - population2000) / population2000) * 100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]
#Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a CSV you might want to look into the csv module. In particular, DictReader will read the data in as a list of dictionaries based on the header row. I'm making up the field names below, but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open("C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010 - population2000) / population2000) * 100
    row['popChange'] = "{0:.2f}".format(popChange)

with open("C:\filelocation\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'],
                        extrasaction='ignore')  # ignore the other columns when writing
    writer.writeheader()
    writer.writerows(data)
This will give you a 2-column CSV of ['countyName', 'popChange']. You will need to replace these with the correct field names.
You need to read all of the lines in the file before you can sort them. I've created a list called change to hold (population change, county name) tuple pairs. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
lines = NCFile.readlines()[1:]
change = []
for line in lines:
row = line.split(",")
country_name = row[1]
population_2000 = float(row[6])
population_2010 = float(row[8])
pop_change = ((population_2010 / population_2000) - 1) * 100
change.append((pop_change, country_name))
change.sort()
output_rows = []
[output_rows.append("{0}, {1:.2f}\n".format(pair[1], pair[0]))
for pair in change]
with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows; it swaps each pair back into the desired order, i.e. county name first.
