my code so far:
import csv

myIds = ['1234', '3456', '76']
countries = []

# open the file
with open('my.csv', 'r') as infile:
    # read the file as a dictionary for each row ({header: value})
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# extract the variables and assign them to lists
myFileIds = data['id']
myFileCountry = data['country']
listfromfile = [a + " " + b for a, b in zip(myFileIds, myFileCountry)]
The above produces the following in listfromfile:
listfromfile = ['1 Uruguay', '2 Vatican', '1234 US', '3456 UK', '5678 Brazil','10111 Argentina','234567 Spain']
I'm aiming for a list of the countries whose IDs occur in my.csv, but it's also possible that an id from the myIds list won't be present in my.csv. In that case, the corresponding place in the list should hold the value 'Unsupported Country'. Both lists, myIds and countries, should have the same length, so that the first id on my list corresponds to the first country on the other list, and so on. Desired outcome:
myIds = ['1234','3456','76']
countries = ['US', 'UK', 'Unsupported Country']
Alternatively, I'm trying pandas, but with no luck either :(
import pandas as pd
df=pd.read_csv('my.csv')
myIds = ['1234','3456','76']
countries = df.loc[df["id"].isin(myIds),"country"].tolist()
my.csv:
id country
1 Uruguay
2 Vatican
1234 US
3456 UK
5678 Brazil
10111 Argentina
234567 Spain
Could someone help me with this please? thanks in advance!
You can achieve this using dataframes.
import pandas as pd

input_df = pd.read_csv("test.csv")
myIds = ['1234', '3456', '76']
my_ids_df = pd.DataFrame(myIds, columns=['id']).astype(int)
output_df = pd.merge(input_df, my_ids_df, on=['id'], how='right')
output_df['country'] = output_df['country'].fillna('Unsupported Country')
print(list(zip(output_df['id'].values.tolist(), output_df['country'].values.tolist())))
Maybe this would be useful for your purposes:
This assumes your file data is as you have it in your example. Otherwise, you could split on another character.
>>> from collections import defaultdict
>>> country_data = defaultdict(lambda: 'Unsupported Country')
>>>
>>> for line in open("my.csv", 'r'):
...     try:
...         id, country = line.split()
...         country_data[int(id)] = country
...         country_data[country] = int(id)
...     except ValueError:
...         pass  # Row isn't in the right format. Skip it.
...
>>> country_data['Vatican']
2
>>> country_data[2]
'Vatican'
>>> country_data['Moojoophorbovia']
'Unsupported Country'
>>>
If you're not trying to fit a square peg into a round hole by assuming you need two lists to keep in sync, and then trying to fit your file data into them, the above might solve the problem of reading in country data and having it accessible by ID, or of getting the ID from the name of the country.
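That said, the two parallel lists the question asks for fall out of this dictionary with a single comprehension. A sketch, with the file's rows inlined as example data:

```python
from collections import defaultdict

country_data = defaultdict(lambda: 'Unsupported Country')
# Stand-in for the rows parsed from my.csv
for id_, country in [(1234, 'US'), (3456, 'UK'), (5678, 'Brazil')]:
    country_data[id_] = country

myIds = ['1234', '3456', '76']
# Look each id up; missing ids fall back to the default value
countries = [country_data[int(i)] for i in myIds]
print(countries)  # ['US', 'UK', 'Unsupported Country']
```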
import pandas as pd

myIds = ['1234', '3456', '76']
df1 = pd.DataFrame(myIds, columns=['id'])
fields = ['id', 'country']
df2 = pd.read_csv('my.csv', sep=',', usecols=fields)
df3 = df1.merge(df2, on="id", how='left')
df3['country'].fillna('Unsupported Country', inplace=True)
del df3['id']
countries = df3['country'].tolist()
the above works for me. However, still trying to find an easier solution.
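For comparison, here is a plain-dict sketch with no pandas dependency. It assumes the file is comma-separated with id and country headers (adjust the delimiter if your file is space-separated), and uses an inline string to stand in for my.csv:

```python
import csv
import io

# Inline stand-in for my.csv
csv_text = "id,country\n1,Uruguay\n2,Vatican\n1234,US\n3456,UK\n5678,Brazil\n"

myIds = ['1234', '3456', '76']
lookup = {row['id']: row['country'] for row in csv.DictReader(io.StringIO(csv_text))}
# dict.get supplies the fallback for ids that aren't in the file
countries = [lookup.get(i, 'Unsupported Country') for i in myIds]
print(countries)  # ['US', 'UK', 'Unsupported Country']
```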
I am currently facing a problem turning my csv data into a dictionary.
I have 3 columns that I'd like to use in the file:
userID, placeID, rating
U1000, 12222, 3
U1000, 13333, 2
U1001, 13333, 4
I would like to make the result look like this:
{'U1000': {'12222': 3, '13333': 2},
'U1001': {'13333': 4}}
That is to say,
I would like to make my data structure look like:
sample = {}
sample["U1000"] = {}
sample["U1001"] = {}
sample["U1000"]["12222"] = 3
sample["U1000"]["13333"] = 2
sample["U1001"]["13333"] = 4
but I have a lot of data to be processed.
I'd like to get the result with a loop, but I have tried for 2 hours and failed.
--- the following code may confuse you ---
My result currently looks like this:
{'U1000': ['12222', 3],
 'U1001': ['13333', 4]}
The value of the dict is a list rather than a dictionary, and the user "U1000" appears multiple times in the data but only once in my result.
I think my code has many mistakes; if you don't mind, please take a look:
import numpy as np
import pandas as pd

reader = np.array(pd.read_csv("rating_final.csv"))
included_cols = [0, 1, 2]
sample = {}
target = []
target1 = []
for row in reader:
    content = list(row[i] for i in included_cols)
    target.append(content[0])
    target1.append(content[1:3])
sample = dict(zip(target, target1))
How can I improve the code? I have looked through Stack Overflow but, due to my own lack of experience, couldn't work it out.
Can anyone please kindly help me with this?
Many thanks!!
This should do what you want:
import collections

reader = ...
sample = collections.defaultdict(dict)
for user_id, place_id, rating in reader:
    rating = int(rating)
    sample[user_id][place_id] = rating
print(sample)
# -> {'U1000': {'12222': 3, '13333': 2}, 'U1001': {'13333': 4}}
defaultdict is a convenience utility that provides default values whenever you try to access a key that is not in the dictionary. If you don't like that (for example, because you want sample['non-existent-user-id'] to fail with KeyError), use this:
reader = ...
sample = {}
for user_id, place_id, rating in reader:
    rating = int(rating)
    if user_id not in sample:
        sample[user_id] = {}
    sample[user_id][place_id] = rating
You can get the expected output, {'U1000': {'12222': 3, '13333': 2}, 'U1001': {'13333': 4}}, with a dict of dicts:
sample = {}
for row in reader:
    userID, placeID, rating = row[:3]
    sample.setdefault(userID, {})[placeID] = rating  # Possibly int(rating)?
Alternatively, use collections.defaultdict(dict) to avoid the need for setdefault (approaches involving try/except KeyError or if userID in sample: also work, sacrificing the atomicity of setdefault in exchange for not creating empty dicts unnecessarily):
import collections

sample = collections.defaultdict(dict)
for row in reader:
    userID, placeID, rating = row[:3]
    sample[userID][placeID] = rating

# Optional conversion back to plain dict
sample = dict(sample)
The conversion back to plain dict ensures future lookups don't auto-vivify keys, raising KeyError as normal, and it looks like a normal dict if you print it.
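A quick sketch of that difference in behaviour:

```python
import collections

plain = {'U1000': {'12222': 3}}
auto = collections.defaultdict(dict, plain)

# The defaultdict silently creates (and stores) an empty dict for unknown keys
print(auto['U9999'])  # {}

# The plain dict raises KeyError as usual
try:
    plain['U9999']
except KeyError:
    print('KeyError raised')
```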
If the included_cols is important (because names or column indices might change), you can use operator.itemgetter to speed up and simplify extracting all the desired columns at once:
from collections import defaultdict
from operator import itemgetter

included_cols = (0, 1, 2)
# If columns in the data were actually:
#     rating, foo, bar, userID, placeID
# we'd do this instead; itemgetter will handle all the rest:
# included_cols = (3, 4, 0)
get_cols = itemgetter(*included_cols)  # Create function to get needed indices at once

sample = defaultdict(dict)

# map(get_cols, ...) efficiently converts each row to a tuple of just
# the three desired values as it goes, which also lets us unpack directly
# in the for loop, simplifying code even more by naming all variables directly
for userID, placeID, rating in map(get_cols, reader):
    sample[userID][placeID] = rating  # Possibly int(rating)?
I've written a piece of script that I'm sure can be condensed. What I'm trying to achieve is an automatic version of this:
file1 = tkFileDialog.askopenfilename(title='Select the first data file')
file2 = tkFileDialog.askopenfilename(title='Select the first data file')
TurnDatabase = tkFileDialog.askopenfilename(title='Select the turn database file')
headers = pd.read_csv(file1, nrows=1).columns
data1 = pd.read_csv(file1)
data2 = pd.read_csv(file2)
This is how the data is collected.
There's many more lines of code which focus on picking out bits of the data. I'm not going to post it all.
This is what I'm trying to condense:
EntrySummary = []
for key in Entries1.viewkeys():
    MeanFRH = Entries1[key].hRideF.mean()
    MeanFRHC = Entries1[key].hRideFCalc.mean()
    MeanRRH = Entries1[key].hRideR.mean()
    # There's 30 more lines of these...
    # Then the list is updated with this:
    EntrySummary.append({'Turn Number': key, 'Avg FRH': MeanFRH, 'Avg FRHC': MeanFRHC, 'Avg RRH': MeanRRH, ...})  # and so on

EntrySummary = pd.DataFrame(EntrySummary)
EntrySummary.index = EntrySummary['Turn Number']
del EntrySummary['Turn Number']
This is the old code. What I've tried to do is this:
EntrySummary = []
for i in headers():
    EntrySummary.append({'Turn Number': key, str('Avg '[i]): str('Mean'[i])})
print EntrySummary
# The print is only there for me to see if it's worked.
However I'm getting this error at the minute:
    for i in headers():
TypeError: 'Index' object is not callable
Any ideas as to where I've made a mistake? I've probably made a few...
Thank you in advance
Oli
If I'm understanding your situation correctly, you want to replace the long series of assignments in the "old code" you've shown with another loop that processes all of the different items automatically using the list of headers from your data files.
I think this is what you want:
EntrySummary = []
for key, value in Entries1.viewitems():
    entry = {"Turn Number": key}
    for header in headers:
        entry["Avg {}".format(header)] = getattr(value, header).mean()
    EntrySummary.append(entry)
You might be able to come up with some better variables names, since you know what the keys and values in Entries1 are (I do not, so I used generic names).
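If Entries1 was itself sliced out of one combined DataFrame (an assumption, since that code isn't shown), a groupby could compute all the averages without any explicit loop at all. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical combined frame; in the real script these columns come from the CSV files
df = pd.DataFrame({
    'Turn Number': [1, 1, 2, 2],
    'hRideF': [10.0, 12.0, 8.0, 9.0],
    'hRideR': [5.0, 7.0, 4.0, 6.0],
})

# One mean per turn per column, then add the 'Avg ' prefix to each header
EntrySummary = df.groupby('Turn Number').mean()
EntrySummary.columns = ['Avg ' + c for c in EntrySummary.columns]
print(EntrySummary)
```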
I need to find a duplicates in my txt file. The file looks like this:
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2
There are ~30000 lines of this. Now, I need to find duplicates in the second column and save the rows to new files without any duplicates. My code is:
def dedupe(data):
    d = []
    for l in lines:
        if l[0] in d:
            d[l[0]] += l[:1]
        else:
            d[l[0]] = l[1]
    return d
#m - male
#f - female
data = open('plec.txt', 'r')
save_m = open('plec_m.txt', 'w')
save_f = open('plec_f.txt', 'w')
lines = data.readlines()[1:]
for line in lines:
    gender = line.strip().split(',')[3]
    if gender is 'f':
        dedupe(line)
        save_f.write(line)
    elif gender is 'm':
        dedupe(line)
        save_m.write(line)
But I'm getting this error:
Traceback (most recent call last):
  File "plec.py", line 88, in <module>
    dedupe(line)
  File "plec.py", line 75, in dedupe
    d[l[0]] = l[1]
TypeError: list indices must be integers, not str
EDIT 2018-10-28:
I don't remember what I had to deduplicate in this file; I think the 2nd and 4th columns had to be unique, but I'm not sure now. However, I found the wrong part in my code and, because of it, rebuilt all of the code, which now works.
def dedup(my_list, new_file):
    d = list()
    for single_line in my_list:
        if single_line.split(',')[1] not in [i.split(',')[1] for i in d]:
            d.append(single_line)
    print(len(my_list), len(d))
    new_file.writelines(d)

data = open('plec.txt', 'r').readlines()[1:]
males = open('m.txt', 'w')
females = open('f.txt', 'w')
males_list = list()
females_list = list()
for line in data:
    gender = line.split(',')[3]
    if gender == 'm':
        males_list.append(line)
    if gender == 'f':
        females_list.append(line)
dedup(males_list, males)
dedup(females_list, females)
You can use Pandas to read your input file and remove the duplicates based on any column you want.
from StringIO import StringIO
from pandas import DataFrame
data =StringIO("""col1,col2,col3,col4,col5,col6
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2""")
df = DataFrame.from_csv(data, sep=",", index_col=False)
df = df.drop_duplicates(subset='col2')
df.to_csv("no_dups.txt", index=False)
seen = set()
for row in my_filehandle:
    my_2nd_col = row.split(",")[1]
    if my_2nd_col in seen:
        continue
    output_filehandle.write(row)
    seen.add(my_2nd_col)
is one very verbose way of doing this.
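Combining that seen-set idea with the gender split from the question gives a complete sketch. It keeps one seen set per gender (a design choice, since the question writes males and females to separate files) and uses inline rows with a made-up duplicate for illustration:

```python
rows = [
    "3,3090,21,f,2,3",
    "4,231,22,m,2,3",
    "5,3090,13,f,2,2",  # hypothetical duplicate value in the second column
    "6,9942,7,m,2,3",
]

seen = {'m': set(), 'f': set()}
out = {'m': [], 'f': []}
for row in rows:
    cols = row.split(',')
    gender, key = cols[3], cols[1]
    if key in seen[gender]:
        continue  # skip rows whose second column was already seen
    seen[gender].add(key)
    out[gender].append(row)

print(out['f'])  # ['3,3090,21,f,2,3'] -- the duplicate 3090 row was dropped
```

The lists in out can then be written out with writelines, as in the question.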
OP, I don't know what's wrong with your code, but this solution should fit your requirements, assuming they are:
Filter the file based on the second column
Store male and female entries in separate files
Here's the code:
with open('plec.txt') as file:
    lines = map(lambda line: line.split(','), file.read().split('\n'))  # split the file into lines and the lines by comma

filtered_lines_male = []
filtered_lines_female = []
second_column_set = set()
for line in lines:
    if line[1] not in second_column_set:
        second_column_set.add(line[1])  # add to index set
        if line[3] == 'm':
            filtered_lines_male.append(line)  # add to male list
        else:
            filtered_lines_female.append(line)  # add to female list

filtered_lines_male = '\n'.join([','.join(line) for line in filtered_lines_male])  # apply source formatting
filtered_lines_female = '\n'.join([','.join(line) for line in filtered_lines_female])  # apply source formatting

with open('plec_m.txt', 'w') as male_write_file:
    male_write_file.write(filtered_lines_male)  # write male entries

with open('plec_f.txt', 'w') as female_write_file:
    female_write_file.write(filtered_lines_female)  # write female entries
Please use better variable naming the next time you write code and please make sure your questions are more specific.
Would anybody help me solve the following problem? I have tried it on my own and attached my solution. I used a 2-d list, but I want a different solution without a 2-d list, which should be more Pythonic.
Please suggest any other ways of doing this.
Q) Consider share prices for N companies, given for each month since 1990 in a CSV file. The format of the file is as below, with the first line as a header.
Year,Month,Company A, Company B,Company C, .............Company N
1990, Jan, 10, 15, 20, , ..........,50
1990, Feb, 10, 15, 20, , ..........,50
.
.
.
.
2013, Sep, 50, 10, 15............500
The solution should be in this format.
a) For each company, list the year and month in which the share price was highest.
Here is my answer using a 2-d list.
def generate_list(file_path):
    '''
    Return a list of lists containing the file data.'''
    data_list = None  # local variable
    try:
        file_obj = open(file_path, 'r')
        try:
            gen = (line.split(',') for line in file_obj)  # generator, yields one line at a time until EOF (End of File)
            for j, line in enumerate(gen):
                if not data_list:
                    # if data_list is None, create a list containing n empty lists, where n is the number of columns.
                    data_list = [[] for i in range(len(line))]
                if line[-1].find('\n'):
                    line[-1] = line[-1][:-1]  # remove the last list element's '\n' character
                # loop to convert numbers from string to float, leaving the others as strings
                for i, l in enumerate(line):
                    if i >= 2 and j >= 1:
                        data_list[i].append(float(l))
                    else:
                        data_list[i].append(l)
        except IOError, io_except:
            print io_except
        finally:
            file_obj.close()
    except IOError, io_exception:
        print io_exception
    return data_list
def generate_result(file_path):
    '''
    Return a list of tuples containing (max price, year, month,
    company name).
    '''
    data_list = generate_list(file_path)
    re = []  # list to store results as [(max_price, year, month, company_name), ...]
    if data_list:
        for i, d in enumerate(data_list):
            if i >= 2:
                m = max(data_list[i][1:])  # max price for the company
                idx = data_list[i].index(m)  # index of the max price in the list
                yr = data_list[0][idx]  # year at the index of the max price
                mon = data_list[1][idx]  # month at the index of the max price
                com = data_list[i][0]  # company name
                re.append((m, yr, mon, com))
    return re

if __name__ == '__main__':
    file_path = 'C:/Document and Settings/RajeshT/Desktop/nothing/imp/New Folder/tst.csv'
    re = generate_result(file_path)
    print 'result ', re
I have also tried to solve it with a generator, but in that case it gave the result for only one company, i.e. only one column.
p = 'filepath.csv'
f = open(p, 'r')
head = f.readline()
gen = ((float(line.split(',')[n]), line.split(',', 2)[0:2], head.split(',')[n]) for n in range(2, len(head.split(','))) for i, line in enumerate(f))
x = max((i for i in gen), key=lambda x: x[0])
print x
You can use the input data provided below, which is in CSV format:
year,month,company 1,company 2,company 3,company 4,company 5
1990,jan,201,245,243,179,133
1990,feb,228,123,124,121,180
1990,march,63,13,158,88,79
1990,april,234,68,187,67,135
1990,may,109,128,46,185,236
1990,june,53,36,202,73,210
1990,july,194,38,48,207,72
1990,august,147,116,149,93,114
1990,september,51,215,15,38,46
1990,october,16,200,115,205,118
1990,november,241,86,58,183,100
1990,december,175,97,143,77,84
1991,jan,190,68,236,202,19
1991,feb,39,209,133,221,161
1991,march,246,81,38,100,122
1991,april,37,137,106,138,26
1991,may,147,48,182,235,47
1991,june,57,20,156,38,245
1991,july,165,153,145,70,157
1991,august,154,16,162,32,21
1991,september,64,160,55,220,138
1991,october,162,72,162,222,179
1991,november,215,207,37,176,30
1991,december,106,153,31,247,69
The expected output is the following:
[(246.0, '1991', 'march', 'company 1'),
(245.0, '1990', 'jan', 'company 2'),
(243.0, '1990', 'jan', 'company 3'),
(247.0, '1991', 'december', 'company 4'),
(245.0, '1991', 'june', 'company 5')]
Thanks in advance...
Using collections.OrderedDict and collections.namedtuple:
import csv
from collections import OrderedDict, namedtuple

with open('abc1') as f:
    reader = csv.reader(f)
    tup = namedtuple('tup', ['price', 'year', 'month'])
    d = OrderedDict()
    names = next(reader)[2:]
    for name in names:
        # initialize the dict
        d[name] = tup(0, 'year', 'month')
    for row in reader:
        year, month = row[:2]  # Use year, month, *prices = row in py3.x
        for name, price in zip(names, map(int, row[2:])):  # map(int, prices) in py3.x
            if d[name].price < price:
                d[name] = tup(price, year, month)
print d
Output:
OrderedDict([
('company 1', tup(price=246, year='1991', month='march')),
('company 2', tup(price=245, year='1990', month='jan')),
('company 3', tup(price=243, year='1990', month='jan')),
('company 4', tup(price=247, year='1991', month='december')),
('company 5', tup(price=245, year='1991', month='june'))])
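If the list-of-tuples format shown in the question is what you ultimately need, the OrderedDict converts in one comprehension:

```python
from collections import OrderedDict, namedtuple

tup = namedtuple('tup', ['price', 'year', 'month'])
# Two entries from the output above, as example data
d = OrderedDict([
    ('company 1', tup(246, '1991', 'march')),
    ('company 2', tup(245, '1990', 'jan')),
])

# (max price, year, month, company name), matching the expected output
result = [(float(t.price), t.year, t.month, name) for name, t in d.items()]
print(result)
```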
I wasn't entirely sure how you wanted the output, so for now I just have it print to the screen.
import os
import csv
import codecs

## Import data !!!!!!!!!!!! CHANGE TO APPROPRIATE PATH !!!!!!!!!!!!!!!!!
filename = os.path.expanduser("~/Documents/PYTHON/StackTest/tailor_raj/Workbook1.csv")

## Get usable data
data = [row for row in csv.reader(codecs.open(filename, 'rb', encoding="utf_8"))]

## Find number of rows
row_count = (sum(1 for row in data)) - 1

## Find number of columns
## Since this cannot be done explicitly, run through the columns of one row until it fails.
## Failure is caught by try/except so the program does not crash
columns_found = False
column_try = 1
while columns_found == False:
    column_try += 1
    try:
        identify_column = data[0][column_try]
    except:
        columns_found = True

## Set column count to the discovered column count (1 before it failed)
column_count = column_try - 1

## Set which company we are checking (start with the first company listed; since indexing starts at 0, the first company is at 2, not 3)
companyIndex = 2

## This will keep all the company bests as single rows of text. I was not sure how you wanted to output them.
companyBest = []

## Loop through each company
while companyIndex <= column_count:
    ## For each new company reset the rowIndex and highestShare
    rowIndex = 1
    highestShare = rowIndex
    ## Loop through each row
    while rowIndex <= row_count:
        ## Test if the data point is above or equal to the current max
        ## Currently set to use the most recent high point
        if int(data[highestShare][companyIndex]) <= int(data[rowIndex][companyIndex]):
            highestShare = rowIndex
        ## Move on to the next row
        rowIndex += 1
    ## Company best = company name + year + month + value
    companyBest.append(str(data[0][companyIndex]) + ": " + str(data[highestShare][0]) + ", " + str(data[highestShare][1]) + ", " + str(data[highestShare][companyIndex]))
    ## Move on to the next company
    companyIndex += 1

for item in companyBest:
    print item
Be sure to change the filename path to one appropriate for your system.
Output is currently displayed like this:
Company A: 1990, Nov, 1985
Company B: 1990, May, 52873
Company C: 1990, May, 3658
Company D: 1990, Nov, 156498
Company E: 1990, Jul, 987
No generator unfortunately but small code size, especially in Python 3:
from operator import itemgetter
from csv import reader

with open('test.csv') as f:
    year, month, *data = zip(*reader(f))

for pricelist in data:
    name = pricelist[0]
    prices = map(int, pricelist[1:])
    i, price = max(enumerate(prices), key=itemgetter(1))
    print(name, price, year[i + 1], month[i + 1])
In Python 2.x you can do the same thing, but slightly more clumsily, using the following (and a print statement instead):
with open('test.csv') as f:
    columns = zip(*reader(f))
year, month = columns[:2]
data = columns[2:]
Okay, I came up with some gruesome generators! This also makes use of lexicographic tuple comparison and reduce to compare consecutive lines:
from functools import reduce  # only needed in Python 3
import csv

def group(year, month, *prices):
    return ((int(p), year, month) for p in prices)

def compare(a, b):
    return map(max, zip(a, group(*b)))

def run(fname):
    with open(fname) as f:
        r = csv.reader(f)
        names = next(r)[2:]
        return zip(names, reduce(compare, r, group(*next(r))))

list(run('test.csv'))
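To unpack what reduce is doing there: group turns one CSV row into a stream of (price, year, month) triples, one per company, and compare keeps the elementwise (lexicographic) max for each company, so folding compare over all the rows leaves each company's best price. A tiny trace with two rows of the sample data:

```python
from functools import reduce  # only needed in Python 3

def group(year, month, *prices):
    return ((int(p), year, month) for p in prices)

def compare(a, b):
    return map(max, zip(a, group(*b)))

# Two rows from the sample data, with the header already stripped
rows = [
    ['1990', 'jan', '201', '245'],
    ['1991', 'march', '246', '81'],
]
best = list(reduce(compare, rows[1:], group(*rows[0])))
print(best)  # [(246, '1991', 'march'), (245, '1990', 'jan')]
```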