There is a table which contains data from 2014. Each user can issue a different number of book categories. The structure is as follows:
User-id|Book-Category
1 |Thrill
2 |Thrill
3 |Mystery
3 |Mystery
The requirement is to find, for each user, the count of each book category issued. This data is already there in CSV files, but it is split year-wise, so I have to add up those values across the years.
e.g.:
data for 2014
u-id|book|count
1 |b1 |2
1 |b2 |4
... ... ...
data for 2015
u-id|book|count
1 |b1 |21
2 |b3 |12
// files in the above format are available up to 2018 (user 1 with book b1 should end up with a total count of 23)
Now, I wrote a Python script in which I built a dictionary and iterated over each row: if the key (u-id + book-category) was already present, I added the count to its value; otherwise, I inserted the key-value pair into the dictionary. I did this for every year-wise file in that script. Since some files have size > 1.5 GB, the script kept running for 7-8 hours and I had to stop it.
Code:
import csv

counts = {}
with open('data_2012.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        key = (row['a'], row['b'])  # (user-id, book-category)
        # add to the running total, inserting the key on first sight
        counts[key] = counts.get(key, 0) + int(row['c'])
# like this, iterating over the year-wise files and finally writing the data to a
# different file; 'a' and 'b' are the column names on the first line of the data
# files, for easy access
Is there a way to achieve this more elegantly in Python, or should I write a Map-Reduce job?
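One option that stays in plain Python + pandas is to stream each file in chunks, so a 1.5 GB CSV never has to fit in memory at once. This is a minimal sketch, not a drop-in answer: the file names (data_2014.csv ... data_2018.csv) and the column names u-id, book, and count are assumptions based on the samples above.

import pandas as pd

totals = {}
for year in range(2014, 2019):
    # chunksize keeps memory bounded even for very large files
    for chunk in pd.read_csv(f'data_{year}.csv', chunksize=1_000_000):
        grouped = chunk.groupby(['u-id', 'book'])['count'].sum()
        for key, value in grouped.items():
            totals[key] = totals.get(key, 0) + value

result = pd.Series(totals).rename_axis(['u-id', 'book']).reset_index(name='count')
result.to_csv('totals.csv', index=False)

Because each chunk is pre-aggregated with groupby before it touches the dictionary, the per-row Python loop from the original script disappears, which is usually where the hours go.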
I want all the combinations of records in a csv file (only one column).
Here is what the column looks like (shown in the attached picture).
Here is my code for what I had done:
import itertools
import pandas as pd

def embeddings_similarity(sentences):
    # first we need to get data into | sentence_a | sentence_b | format
    sentence_pairs = list(itertools.combinations(sentences, 2))
    sentence_a = [pair[0] for pair in sentence_pairs]
    sentence_b = [pair[1] for pair in sentence_pairs]
    sentence_pairs_df = pd.DataFrame({'sentence_a': sentence_a,
                                      'sentence_b': sentence_b})
    # ... (scoring of the pairs omitted here)

corpus = []
for index, row in df.iterrows():
    corpus.append(pre_process(row['Chief Complaints']))
print(embeddings_similarity(corpus))
With the above code I was able to get fairly good output, i.e. 6×6 = 36 rows for the input shown in the picture.
But it takes a long time for more records, so I was wondering whether there is another way to obtain the combinations of all records of a single column in a csv file.
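One easy win, assuming the slow part is the row loop rather than the (omitted) scoring step, is to drop df.iterrows() and build the pair table in one go. A small sketch; df and pre_process are the poster's own objects from the question:

import itertools
import pandas as pd

# df['Chief Complaints'] and pre_process come from the question above
corpus = df['Chief Complaints'].map(pre_process).tolist()
pairs = pd.DataFrame(list(itertools.combinations(corpus, 2)),
                     columns=['sentence_a', 'sentence_b'])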
I have been trying to convert 3 csv files with related keys into a single JSON file using Python.
Originally, I had tried using SAS, but noticed the proc required (I believe) all data to be available in a single row; I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is that the 1st csv is a unique set of data points with no duplicates. The 2nd csv links back to the 1st via saleID, and this creates duplicate rows; the same applies to the 3rd csv.
The 3rd csv has a one-to-many relationship with the 1st, and the 2nd csv has a zero-or-one-to-many relationship with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID

saleID | ProductName
1      | A
2      | B

CSV 2 - can have duplicates, one-to-many relationship with csv 1

WarehouseID | saleID | WarehouseName
1           | 1      | A
2           | 2      | B
1           | 3      | A

CSV 3 - can have duplicates, one-to-many relationship with csv 1

customerID | saleID | CustomerName
1          | 1      | Albert
2          | 2      | Bob
3          | 1      | Cath
The expected format of the JSON would be something like this.
{
  "totalSales": 2,
  "Sales": [
    {
      "saleId": 1,
      "productName": "A",
      "warehouse": [
        { "warehouseID": 1, "warehouseName": "A" }
      ],
      "customer": [
        { "customerID": 1, "customerName": "Albert" },
        { "customerID": 3, "customerName": "Cath" }
      ]
    },
    {
      "saleId": 2,
      "productName": "B",
      "warehouse": [
        { "warehouseID": 2, "warehouseName": "B" }
      ],
      "customer": [
        { "customerID": 2, "customerName": "Bob" }
      ]
    }
  ]
}
What I've tried so far in Python has a similar result to what I achieved in SAS; I think I'm missing the step to capture the warehouse and customer information as arrays.
import pandas as pd

def multicsvtojson():
    salesdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID", "ProductName"))
    warehousedf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
    customerdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
    finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
    finaldf.to_json('finalResult.json', orient='records')
    print(finaldf)
Results:
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]
I am trying to get the index (row number) of the row that holds the headers in my CSV file.
The issue is that the header row can move up or down depending on the output of the report from our system (which I have no control over).
code:
ht = pd.read_csv('file.csv')
test = ht.get_loc('Code')  # 'Code' being the header I'm using to locate the header row
csv1 = pd.read_csv('file.csv', header=test)
df1 = df1.append(csv1)  # appending, as I have many files
If I were to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read_csv.
The error I'm getting is that it's expecting 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number.
Thanks
:-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
As you will see, the "the deadlines" rows vary; there can be 3 or 5 of them based on the code IDs, so the header row can move up or down.
I also did not write out all 26 column headers; not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd
x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    reader = csv.reader(fin)
    idx = next(idx for idx, row in enumerate(reader) if len(row) > 1)  # 4
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)
print(df)
code type arrived_date est_del_date
0 a/wrwgwr12/001 kids 12-dec-18 17-dec-18
1 aa/gjghgj35/030 pet 15-dec-18 18-dec-18
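Since the question mentions appending many files, the same trick extends to a loop. A hedged sketch (the reports/*.csv pattern is an assumption), using pd.concat rather than the deprecated DataFrame.append:

import csv
import glob
import pandas as pd

frames = []
for path in glob.glob('reports/*.csv'):  # hypothetical location of the report files
    # find the first row with more than one field, i.e. the header row
    with open(path, newline='') as fin:
        idx = next(i for i, row in enumerate(csv.reader(fin)) if len(row) > 1)
    frames.append(pd.read_csv(path, skiprows=idx))

df1 = pd.concat(frames, ignore_index=True)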
I have data that comes from MS SQL Server. The data from the query returns a list of names straight from a public database. For instance, if I wanted records with the name "Microwave", something like this would happen:
Microwave
Microwvae
Mycrowwave
Microwavee
Microwave would be spelt in hundreds of ways. I currently solve this with a VLOOKUP in Excel: it looks for the value in the left cell and returns the value on the right. For example:
VLOOKUP(A1,$A$1:$B$4,2,FALSE)
Table:
A B
1 Microwave Microwave
2 Microwvae Microwave
3 Mycrowwave Microwave
4 Microwavee Microwave
I would just copy the VLOOKUP formula down the CSV or Excel file and then use that information for my analysis.
Is there a way in Python to solve this issue in another way?
I could make a long if/elif chain, or even a replace list applied to each line of the csv, but that would save no more time than just using the VLOOKUP. There are thousands of company names spelt wrong, and I do not have the clearance to change the database.
So, Stack, any ideas on how to leverage Python in this scenario?
If you had data like this:
+-------------+-----------+
| typo | word |
+-------------+-----------+
| microweeve | microwave |
| microweevil | microwave |
| macroworv | microwave |
| murkeywater | microwave |
+-------------+-----------+
Save it as typo_map.csv
Then run (in the same directory):
import csv

def OpenToDict(path, index):
    with open(path, 'r', newline='') as f:
        reader = csv.reader(f)
        headings = next(reader)
        heading_nums = {}
        for i, v in enumerate(headings):
            heading_nums[v] = i
        fields = [heading for heading in headings if heading != index]
        file_dictionary = {}
        for row in reader:
            file_dictionary[row[heading_nums[index]]] = {}
            for field in fields:
                file_dictionary[row[heading_nums[index]]][field] = row[heading_nums[field]]
    return file_dictionary

typo_map = OpenToDict('typo_map.csv', 'typo')
print(typo_map['microweevil']['word'])
The structure is slightly more complex than it needs to be for your situation, but that's because this function was originally written to look up more than one column. It will work for you, though, and you can simplify it yourself if you want.
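If maintaining the typo map by hand ever becomes impractical, a possible alternative (not part of the original answer) is fuzzy matching against the list of correct names with the standard library's difflib:

import difflib

# hypothetical list of canonical spellings
known_names = ['Microwave', 'Refrigerator', 'Dishwasher']

def correct(name, candidates=known_names, cutoff=0.8):
    # return the closest canonical spelling, or the input unchanged
    # if no candidate clears the similarity cutoff
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

print(correct('Mycrowwave'))  # -> Microwave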
I am new to Python. I would like to compute the difference between two rows of a csv file when they have the same id. This csv dataset is built from an SQL table export which has more than 3 million rows.
This is an example on how my timeserie's dataset looks like :
DATE - Product ID - PRICE
26/08 - 1 - 4
26/08 - 2 - 3
27/08 - 1 - 5
27/08 - 2 - 3
For instance, I would like to calculate the difference between the price of the product with id 1 on 26/08 and the price of the same product on the next day (27/08), to estimate the price's variation over time. I wondered what the best way would be to manipulate and do calculations over these data in Python, whether with Python's csv module or with SQL queries in the code. I have also heard of the pandas library... Thanks for your help!
Try building a dictionary keyed by product id, and analyze each id after loading:
import csv

dd = {}
with open('prod.csv') as csvf:
    csvr = csv.reader(csvf, delimiter='-')
    for row in csvr:
        # skip blank lines and the header row
        if len(row) == 0 or row[0].startswith('DATE'):
            continue
        dd.setdefault(int(row[1]), []).append((row[0].strip(), int(row[2])))
dd
{1: [('26/08', 4), ('27/08', 5)],
2: [('26/08', 3), ('27/08', 3)]}
This will make it pretty easy to do comparisons.
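For example, a minimal follow-up sketch (my assumption of what the comparison step would look like), printing day-over-day price changes per product; it continues from the dd built above and assumes the (date, price) pairs are already in chronological order:

# pair each (date, price) entry with the next one for the same product
for product_id, series in sorted(dd.items()):
    for (d1, p1), (d2, p2) in zip(series, series[1:]):
        print('product {}: {} -> {}: change {:+d}'.format(product_id, d1, d2, p2 - p1))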