Reshaping CSV by eliminating redundant row keys and merging fields using Python

I have a CSV file in the form of:
'userid','metric name (1-10)','value'
The column 'metric name' has upwards of 10 different metrics, so the same userid will have multiple rows associated with it. What I would like to accomplish would be something like this:
'userid1', 'metric name 1'='value1', 'metric name 2'='value2', 'metric name 3'='value3'... 'metric name 10' = 'value10'
A single row for each userid with all the metrics and values associated with that user in k/v pairs.
I started playing around with pivot but that function doesn't really do what I need it to...
import pandas as pd
data=pd.read_csv('bps.csv')
data.pivot('entityName', 'metricName', 'value').stack()
I am thinking I need to iterate through the dataset by user, grab the metrics associated with that user, and build the metric k/v pairs during each iteration before going on to a new user. I did a pretty thorough job of searching the internet, but I didn't find exactly what I was looking for. Please let me know if there is a simple library I could use.
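For reference, pivot can produce exactly this one-row-per-user shape once the trailing .stack() call is dropped and explicit arguments are given; a minimal sketch, assuming the same file and column names as the snippet above:
import pandas as pd

# a minimal sketch, assuming columns named entityName, metricName, value
data = pd.read_csv('bps.csv')
wide = data.pivot(index='entityName', columns='metricName', values='value')
wide.to_csv('bps_wide.csv')  # one row per user, one column per metric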

Here comes a solution using only standard Python, not any framework.
Starting from the following data file:
id1,name,foo
id1,age,10
id2,name,bar
id2,class,example
id1,aim,demonstrate
You can execute the following code:
separator = ","
userIDKey = "userID"
defaultValue = "No data"
data = {}

# collect the data
with open("data.csv", 'r') as dataFile:
    for line in dataFile:
        # remove the end-of-line character
        line = line.replace("\n", "")
        userID, fieldName, value = line.split(separator)
        if userID not in data:
            data[userID] = {userIDKey: userID}
        data[userID][fieldName] = value

# retrieve all the column headers in use
columnsHeaders = set()
for key in data:
    dataset = data[key]
    for datasetKey in dataset:
        columnsHeaders.add(datasetKey)
columnsHeaders.remove(userIDKey)
columnsHeaders = list(columnsHeaders)
columnsHeaders.sort()

def getValue(key, dic):
    if key in dic:
        return dic[key]
    else:
        return defaultValue

# then export the result
with open("output.csv", 'w') as outputFile:
    # export the header line first
    outputFile.write(userIDKey)
    for header in columnsHeaders:
        outputFile.write(", {0}".format(header))
    outputFile.write("\n")
    # then export each data line
    for key in data:
        dataset = data[key]
        outputFile.write(dataset[userIDKey])
        for header in columnsHeaders:
            outputFile.write(", {0}".format(getValue(header, dataset)))
        outputFile.write("\n")
And then you get the following result:
userID, age, aim, class, name
id1, 10, demonstrate, No data, foo
id2, No data, No data, example, bar
I think this code can easily be modified to match your objectives if required (for example, writing header=value pairs instead of bare values to get the exact k/v format you asked for).
Hope it helps.
Arthur.

Related

What is wrong with my Python code? I'm trying to implement multiple file search criteria for files

I want to make changes to my code so I can search multiple input files and type multiple inputs, checking, say, whether client order number 7896547 exists in the input files I supply. What is the best way to implement multiple search criteria across multiple files?
What I mean is giving, say, more than 50 inputs (1234567, 1234568, etc.), and also searching through multiple files (more than 10). What is the most efficient way to achieve this?
My code:
import csv

data = []
# note: this call is the problem the answers below address --
# open() does not accept two filenames like this
with open("C:/Users/CSV/salesreport1.csv", "C:/Users/CSV//salesreport2.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
name = input("Enter a string: ")
col = [x[0] for x in data]
if name in col:
    for x in range(0, len(data)):
        if name == data[x][0]:
            print(data[x])
else:
    print("Does not exist")
I thought I could just add another file by adding its name in the open() part?
Also, to type multiple inputs, is there a way to avoid using an array? I mean not doing identifierlist = ["x","y","z"].
As mentioned in the comments, you can read a CSV file as a DataFrame and use df["column"].str.findall() to check for occurrences of a value in a column.
import pandas as pd
df = pd.read_csv('file.csv') # read CSV file as dataframe
name = input("Enter a string: ")
result = df["column_name"].str.findall(name) # returns the list of matches found in each row
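To go from matches to the actual rows, a hedged follow-up (assuming the same df and column_name as above) could be:
# keep only the rows whose column contains the entered string
matches = df[df["column_name"].str.contains(name, regex=False, na=False)]
print(matches)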
Just to be clear, you are loading in some .csvs, getting a name from the user, and then printing out the rows that have that name in column 0? This seems like a nice use case for a Python dictionary, a common and simple data type in Python. If you aren't familiar, a dictionary lets you store information by some key. For example, you might have a key 'Daniel' that stores a list of information about someone named Daniel. In this situation, you could go through the array and put every row into a dict with the name as the key. This would look like:
names_dict = {}
for line in file:
    # split the line and grab the name from the first column
    split_line = line.strip().split(",")
    name = split_line[0]
    names_dict[name] = line
and then to look up a name in the dict, you would just do:
daniel_info = names_dict['Daniel']
or more generally,
info = names_dict[name]
You could also use the dict for getting your list of names and checking whether a name exists, because Python dicts have a built-in operator for checking whether a key exists: in. You could just say:
if 'Daniel' in names_dict:
or again,
if name in names_dict:
Another cool thing you could do with this project would be to let the user choose what column they are searching. For example, let them put in 3 for the column, and then search on whatever is in column 3 of that row, location, email, etc.
Finally, I will just show a complete concrete example of what I would do:
import csv

## you could add to this list or get it from the user
files_to_open = ["C:/Users/CSV/salesreport1.csv", "C:/Users/CSV//salesreport2.csv"]
data = []
## iterate through the list of files and add each body to the list
for file in files_to_open:
    with open(file, "r") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            data.append(row)
keys_dict = {}
column = int(input("Enter the column you want to search: "))
val = input("Enter the value you want to search for in this column: ")
for row in data:
    ## gets the value in the chosen column
    v = row[column]
    ## stores the row under that key (later rows overwrite earlier ones)
    keys_dict[v] = row
if val in keys_dict:
    print(keys_dict[val])
else:
    print("Nothing was found at this column and key!")
EDIT: for the multiple inputs, you could ask the user to type them separated by commas, like "Daniel,Sam,Mike", and then split that input on the commas with user_input.split(","). You could then do:
names = user_input.split(",")
for name in names:
    if name in names_dict:
        print(names_dict[name])
    else:
        print(name, "was not found")
You can't use open like that.
According to the documentation, you can only pass one file per call.
Seeing as you want to check a large number of files, here's an example of a very simple approach that checks all the CSVs in the same folder as this script:
import csv
import os

# this gets the filenames of all files in the current directory that end with .csv
csv_files = [x for x in os.listdir() if x.endswith('.csv')]
name = input("Enter a string: ")
# loop over every filename
for path in csv_files:
    # open the file
    with open(path) as file:
        reader = csv.reader(file)
        for row in reader:
            # if the word is an exact match for an entry in the row
            if name in row:
                # print the row
                print(row)
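If you would rather keep an explicit list of files instead of scanning a folder, the standard library's fileinput module can chain them into one stream; a sketch, with the filenames as placeholders:
import fileinput

name = input("Enter a string: ")
# reads the listed files one after another as a single stream of lines
with fileinput.input(files=["salesreport1.csv", "salesreport2.csv"]) as f:
    for line in f:
        # substring match on the raw line, looser than the exact cell match above
        if name in line:
            print(line.rstrip())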

How to merge and combine rows with same id (index) in python?

I am new to Python and I am working with a CSV file with over 10,000 rows. In my CSV file there are many rows with the same id, which I would like to merge into one, combining their information as well.
For instance, the data.csv look like (id and info is the name of columns):
id| info
1112| storage is full and needs extra space
1112| there is many problems with space
1113| pickup cars come and take the garbage
1113| payment requires for the garbage
and I want to get the output as:
id| info
1112| storage is full and needs extra space there is many problems with space
1113| pickup cars come and take the garbage payment requires for the garbage
I already looked at a few posts, but none of them helped me answer my question.
It would be great if you could describe your help with Python code that I can run and learn from on my side.
Thank you
I can think of a simpler way:
some_dict = {}
for idt, txt in reader:  #~ iterate your (id, info) CSV reader here
    some_dict[idt] = some_dict.get(idt, "") + txt
It should create the structure you want without any imports, and, I hope, in the most efficient way.
Just to explain: get takes a second argument, which is returned when the key isn't found in the dict. So for a new key we start from an empty string and append the text; if the key was found, the text is appended to the existing value.
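A tiny illustration of that get behaviour:
d = {"a": "x"}
print(d.get("a", ""))  # 'x' -> key found, its value is returned
print(d.get("b", ""))  # ''  -> key missing, the default is returned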
#Edit:
Here is a complete example with a reader :). Just replace the variables to match your own file :)
import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for idt, info in reader:
        temp = some_dict.get(idt, "")
        some_dict[idt] = temp + " " + info if temp else info
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
This is a full program which should work for you.
But it won't work if you have more than 2 columns in the file; in that case you can just replace idt, info with row and use indexes for the first and second elements.
#Next Edit:
For more than 2 columns:
import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        temp = some_dict.get(row[0], "")
        some_dict[row[0]] = temp + " " + row[1] if temp else row[1]
        #~ Here you can do something with the other columns if you want.
        #~ Example: another_dict[row[2]] = another_dict.get(row[2], "") + row[3]
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
Just make a dictionary where the ids are keys:
from collections import defaultdict

by_id = defaultdict(list)
for id, info in your_list:
    by_id[id].append(info)

for key, value in by_id.items():
    print(key, value)
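For completeness, a pandas sketch of the same grouping (assuming a comma-separated data.csv with id and info columns, and space-joined text as in the desired output):
import pandas as pd

df = pd.read_csv('data.csv')
merged = df.groupby('id')['info'].apply(' '.join).reset_index()
merged.to_csv('merged.csv', index=False)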

Creating a nested dictionary from file that has only text

I have a csv file that has the headers and the value lines are:
site,access_key,secret_access_key
sa1,something,something
na1,something,something
and so on. I would like the dictionary to look like
site_dict = {"sa1" :{"access_key" : "something", "secret_access_key" : "something"}, "na1" :{"access_key" : "something", "secret_access_key" : "something"}}
I tried what was suggested here: How to create a nested dictionary from a csv file with N rows in Python, but it deals with numeric values and I could not get my head around changing it to string values. Any help would be appreciated. If you make a suggestion or provide an answer, please post it as an answer so I can mark it appropriately. EDIT: I changed sa1 and na1 to keys by adding the quotes.
You can use the csv module for reading and preread the first line to get the key-names:
# create data
with open("f.txt", "w") as f:
    f.write("""site,access_key,secret_access_key
sa1,something111,something111
na1,something222,something222""")

import csv

result = {}
with open("f.txt") as f:
    # get the key names from the 1st line
    fields = next(f).strip().split(",")
    reader = csv.reader(f)
    # process all other lines
    for line in reader:
        # the outer key is the 1st value;
        # inner keys/values come from the header line and the rest of the line's data
        result[line[0]] = dict(zip(fields[1:], line[1:]))
print(result)
Output:
{'sa1': {'access_key': 'something111', 'secret_access_key': 'something111'},
'na1': {'access_key': 'something222', 'secret_access_key': 'something222'}}
Lookup: the csv module documentation.
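An equivalent sketch using csv.DictReader, which parses the header line by itself (same f.txt as above):
import csv

result = {}
with open("f.txt") as f:
    for row in csv.DictReader(f):
        site = row.pop("site")  # outer key; the remaining keys form the inner dict
        result[site] = dict(row)
print(result)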

Python code to process CSV file

I am getting a CSV file updated on a daily basis. I need to process it and create a new file based on the following criteria: if a row is new data, it should be tagged as new, and if it is an update to existing data, it should be tagged as update. How do I write Python code to process this and output a CSV file as follows, based on the date?
Day1 input data
empid,enmname,sal,datekey
1,cholan,100,8/14/2018
2,ram,200,8/14/2018
Day2 input Data
empid,enmname,sal,datekey
1,cholan,100,8/14/2018
2,ram,200,8/14/2018
3,sundar,300,8/15/2018
2,raman,200,8/15/2018
Output Data
status,empid,enmname,sal,datekey
new,3,sundar,300,8/15/2018
update,2,raman,200,8/15/2018
I'm feeling nice, so I'll give you some code. Try to learn from it.
To work with CSV files, we'll need the csv module:
import csv
First off, let's teach the computer how to open and parse a CSV file:
def parse(path):
    with open(path) as f:
        return list(csv.DictReader(f))
csv.DictReader reads the first line of the csv file and uses it as the "names" of the columns. It then creates a dictionary for each subsequent row, where the keys are the column names.
That's all well and good, but we just want the last version with each key:
def parse(path):
    data = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            data[row["empid"]] = row
    return data
Instead of just creating a list containing everything, this creates a dictionary where the keys are the row's id. This way, rows found later in the file will overwrite rows found earlier in the file.
Now that we've taught the computer how to extract the data from the files, let's get it:
old_data = parse("file1.csv")
new_data = parse("file2.csv")
Iterating through a dictionary gives you its keys, which are the ids defined in the data set. For consistency, key in dictionary says whether key is one of the keys in the dictionary. So we can do this:
new = {
    id_: row
    for id_, row in new_data.items()
    if id_ not in old_data
}
updated = {
    id_: row
    for id_, row in new_data.items()
    if id_ in old_data and old_data[id_] != row
}
I'll put csv.DictWriter here and let you sort out the rest on your own.
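A minimal sketch of that final step, assuming the new and updated dicts built above (the status tags and field names come from the desired output):
import csv

with open("output.csv", "w", newline="") as f:
    fieldnames = ["status", "empid", "enmname", "sal", "datekey"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in new.values():
        writer.writerow({"status": "new", **row})     # rows absent from day 1
    for row in updated.values():
        writer.writerow({"status": "update", **row})  # rows changed since day 1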

Write data from one CSV to another in Python

I have three CSV files with the attributes Product_ID, Name, Cost, and Description. Each file contains Product_ID. I want to combine Name (file 1), Cost (file 2), and Description (file 3) into a new CSV file with Product_ID and all three attributes above. I need efficient code, as the files contain over 130,000 rows.
After combining all the data into the new file, I have to load that data into a dictionary,
like: Product_ID as key and Name, Cost, Description as value.
It might be more efficient to read each input .csv into a dictionary before creating your aggregated result.
Here's a solution for reading in each file and storing the columns in a dictionary with Product_IDs as the keys. I assume that each Product_ID value exists in each file and that headers are included. I also assume that there are no duplicate columns across the files aside from Product_ID.
import csv
from collections import defaultdict

entries = defaultdict(list)
files = ['names.csv', 'costs.csv', 'descriptions.csv']
headers = ['Product_ID']

for filename in files:
    with open(filename, 'r') as f:    # Open each file in files.
        reader = csv.reader(f)        # Create a reader to iterate csv lines
        heads = next(reader)          # Grab the first line (headers)
        pk = heads.index(headers[0])  # Get the position of 'Product_ID' in
                                      # the list of headers
        # Add the rest of the headers to the list of collected columns (skip 'Product_ID')
        headers.extend([x for i, x in enumerate(heads) if i != pk])
        for row in reader:
            # For each line, add the new values (except 'Product_ID') to the
            # entries dict with the line's Product_ID value as the key
            entries[row[pk]].extend([x for i, x in enumerate(row) if i != pk])

with open('result.csv', 'w', newline='') as out:  # Open the file to write csv lines
    writer = csv.writer(out)
    writer.writerow(headers)  # Write the headers first
    for key, value in entries.items():
        writer.writerow([key] + value)  # Write each Product_ID
                                        # concatenated with the other values
A general solution that produces a record (possibly incomplete) for each id it encounters while processing the 3 files needs a specialized data structure, which fortunately is just a list with a preassigned number of slots:
d = {id: [name, None, None]
     for id, name in [line.strip().split(',') for line in open(fn1)]}

for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id][1] = cost
    else:
        d[id] = [None, cost, None]

for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id][2] = desc
    else:
        d[id] = [None, None, desc]

for id in d:
    if all(d[id]):
        print(','.join([id] + d[id]))
    else:
        # for this id you don't have complete info,
        # so you have to decide on your own what to do with it
        pass
If you are sure that you don't want to further process incomplete records, the code above can be simplified:
d = {id: [name]
     for id, name in [line.strip().split(',') for line in open(fn1)]}

for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id].append(cost)  # note: append the cost, not the name

for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id].append(desc)

for id in d:
    if len(d[id]) == 3:
        print(','.join([id] + d[id]))
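As an alternative to the hand-rolled loops, here is a pandas sketch of the same three-way join (an assumption: each file has a Product_ID header column):
import pandas as pd

names = pd.read_csv('names.csv')
costs = pd.read_csv('costs.csv')
descriptions = pd.read_csv('descriptions.csv')

# merge on the shared key, then write the combined file
combined = names.merge(costs, on='Product_ID').merge(descriptions, on='Product_ID')
combined.to_csv('result.csv', index=False)

# dictionary keyed by Product_ID, as the question asks
lookup = combined.set_index('Product_ID').to_dict('index')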