00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
01,0,8749
01,1,8749
01,2,8749
01,3,88609
01,4,88609
01,5,88609
01,6,88611
01,7,88611
01,8,88611
02,0,9006
02,1,9006
02,2,4355
02,3,9013
02,4,9013
02,5,9013
02,6,4341
02,7,4341
02,8,4341
02,9,4341
03,0,6285
03,1,6285
03,2,6285
03,3,6285
03,4,6278
03,5,6278
03,6,6278
03,7,6278
03,8,8960
I have a CSV file, part of which is shown above.
What I want to do is: whenever column 0 has the same value, build an array of column 2 and print it. E.g. for 00 it makes the array
a = [6098, 6098, 6098, 6098, 6094, 6094]
and for 01 it makes the array
a = [8749, 8749, 8749, 88609, 88609, 88609, 88611, 88611, 88611]
I don't know how to loop over this file.
This solution assumes that the first column will appear in sorted order in the file.
def main():
    import csv
    from itertools import groupby
    with open("csv.csv") as file:
        reader = csv.reader(file)
        rows = [[row[0]] + [int(item) for item in row[1:]] for row in reader]
    groups = {}
    for key, group in groupby(rows, lambda row: row[0]):
        groups[key] = [row[2] for row in group]
    print(groups["00"])
    print(groups["01"])
    print(groups["02"])
    print(groups["03"])
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
[6098, 6098, 6098, 6098, 6094, 6094]
[8749, 8749, 8749, 88609, 88609, 88609, 88611, 88611, 88611]
[9006, 9006, 4355, 9013, 9013, 9013, 4341, 4341, 4341, 4341]
[6285, 6285, 6285, 6285, 6278, 6278, 6278, 6278, 8960]
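A caveat worth noting: itertools.groupby only merges *adjacent* rows with equal keys, so the approach above relies on the first column being sorted. If it isn't, sort the rows first. A minimal sketch (the sample rows here are made up, not taken from the file):

```python
from itertools import groupby
from operator import itemgetter

# groupby merges only adjacent runs, so sort by the group key first
rows = [["01", 0, 8749], ["00", 0, 6098], ["01", 1, 8749], ["00", 1, 6094]]
rows.sort(key=itemgetter(0))
groups = {key: [r[2] for r in grp] for key, grp in groupby(rows, key=itemgetter(0))}
print(groups)  # {'00': [6098, 6094], '01': [8749, 8749]}
```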
The idea is to use a dictionary in which 00, 01 etc. will be the keys and each value will be a list. So you need to iterate through the csv data and append the values to the corresponding keys.
import csv

result = {}
with open("your csv file", "r") as csvfile:
    data = csv.reader(csvfile)
    for row in data:
        if row[0] in result:  # dict.has_key() was removed in Python 3
            result[row[0]].append(row[2])
        else:
            result[row[0]] = [row[2]]
print(result)
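A possible simplification (not from the original answer): dict.setdefault collapses the if/else into a single line. Sketch with a few rows inlined for illustration:

```python
result = {}
for row in [["00", "0", "6098"], ["00", "1", "6094"], ["01", "0", "8749"]]:
    # setdefault returns the existing list, or inserts and returns a new one
    result.setdefault(row[0], []).append(row[2])
print(result)  # {'00': ['6098', '6094'], '01': ['8749']}
```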
Here is an approach using collections.defaultdict on the raw text:
from collections import defaultdict
txt = '''00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
01,0,8749
01,1,8749
01,2,8749
01,3,88609
01,4,88609
01,5,88609
01,6,88611
01,7,88611
01,8,88611
02,0,9006
02,1,9006
02,2,4355
02,3,9013
02,4,9013
02,5,9013
02,6,4341
02,7,4341
02,8,4341
02,9,4341
03,0,6285
03,1,6285
03,2,6285
03,3,6285
03,4,6278
03,5,6278
03,6,6278
03,7,6278
03,8,8960'''
data_holder = defaultdict(list)
lines = txt.split('\n')
for line in lines:
    fields = line.split(',')
    data_holder[fields[0]].append(fields[2])

for k, v in data_holder.items():
    print('{} -> {}'.format(k, v))
Output:
02 -> ['9006', '9006', '4355', '9013', '9013', '9013', '4341', '4341', '4341', '4341']
03 -> ['6285', '6285', '6285', '6285', '6278', '6278', '6278', '6278', '8960']
00 -> ['6098', '6098', '6098', '6098', '6094', '6094']
01 -> ['8749', '8749', '8749', '88609', '88609', '88609', '88611', '88611', '88611']
The goal is to write a function get_mapping(map_file) whose only argument map_file is the name of the file containing the IDs. So in our case map_file will be one of "mapping/rno.map", "mapping/mmu.map" and "mapping/hsa.map". These files have respectively four, two and three columns. The first column always corresponds to the Ensembl IDs we use. The get_mapping function should return a list of dictionaries. Every dictionary should have as a key a non-Ensembl ID and as value the corresponding Ensembl ID. The number of dictionaries in the list should be equal to the number of columns minus one.
import sys

def get_mapping(map_file):
    f = open(map_file, "r")
    # Result is a list of dictionaries.
    mapping_list = []
    # Skip the header on the first line.
    header = f.readline()
    header = header.split()
    # Number of dicts in mapping_list: one per non-Ensembl column.
    col = len(header) - 1
    for i in range(col):
        d = {}
        mapping_list.append(d)
    for line in f:
        line = line.strip('\n').split('\t')
        for dic in range(len(mapping_list)):
            mapping_list[dic][line[dic + 1]] = line[0]
    print(mapping_list)
    f.close()
    return mapping_list

get_mapping('rno.map')
get_mapping('mmu.map')
get_mapping('hsa.map')
Items that shouldn't be in the output:
['ENSRNOP00000058792', '', '', '']
['ENSMUSP00000100465', 'MGI:3645509']
['ENSP00000375105', 'P63162', 'Q6LBS1']
How can I remove these?
This is what I should get in the output:
rno mapping
1302936 ENSRNOP00000015679
1302939 ENSRNOP00000027305
1302944 ENSRNOP00000025813
1302945 ENSRNOP00000010637
1302952 ENSRNOP00000003046
1302957 ENSRNOP00000020169
1302959 ENSRNOP00000006804
...
hsa.map
a0a2g6 ensp00000327895
a0a5b6 ensp00000374923
a0a962 ensp00000315112
a0aul9 ensp00000227459
a0av47 ensp00000260810
a0av56 ensp00000292123
a0avg4 ensp00000305200
a0avk6 ensp00000250024
a0avt1 ensp00000313454
a0ejg6 ensp00000321606
mmu.map
mgi:101761 ensmusp00000072556
mgi:101762 ensmusp00000008542
mgi:101763 ensmusp00000077262
mgi:101764 ensmusp00000099514
mgi:101765 ensmusp00000030814
mgi:101766 ensmusp00000035142
mgi:101769 ensmusp00000044048
mgi:101770 ensmusp00000028907
mgi:101771 ensmusp00000037324
mgi:101772 ensmusp00000028991
mgi:101773 ensmusp00000023618
mgi:101774 ensmusp00000003469
mgi:101775 ensmusp00000097404
mgi:101776 ensmusp00000027740
I get in the output:
rno mapping
ENSRNOP00000058792
1302936 ENSRNOP00000015679
1302939 ENSRNOP00000027305
1302944 ENSRNOP00000025813
1302945 ENSRNOP00000010637
1302952 ENSRNOP00000003046
1302957 ENSRNOP00000020169
1302959 ENSRNOP00000006804
1302965 ENSRNOP00000012050
1302972 ENSRNOP00000042145
1302973 ENSRNOP00000033541
...
File sample rno.map
Ensembl_Protein_ID UniProt/SwissProt_Accession UniProt/TrEMBL_Accession RGD_ID
ENSRNOP00000000008 P18088 C9E895 2652
ENSRNOP00000000008 P18088 B3VQJ0 2652
ENSRNOP00000000009 D3ZEM1 1310201
ENSRNOP00000000025 B4F7C7
ENSRNOP00000000029 Q9ES39 620038
ENSRNOP00000000037 Q7TQM3 735156
ENSRNOP00000000052 O70352 Q6IN14 69070
ENSRNOP00000000053 Q9JLM2 68400
ENSRNOP00000000064 P97874 621589
ENSRNOP00000000072 P29419 621377
ENSRNOP00000000074 B2RZ28 1304584
File sample mmu.map
Ensembl_Protein_ID MGI_ID
ENSMUSP00000000001 MGI:95773
ENSMUSP00000000028 MGI:1338073
ENSMUSP00000000033 MGI:96434
ENSMUSP00000000049 MGI:88058
ENSMUSP00000000058 MGI:107571
ENSMUSP00000000090 MGI:88474
ENSMUSP00000000094 MGI:2138865
ENSMUSP00000000095 MGI:98494
ENSMUSP00000000122 MGI:97323
ENSMUSP00000000127 MGI:98955
ENSMUSP00000000129 MGI:105917
ENSMUSP00000000137 MGI:1913963
ENSMUSP00000000153 MGI:95767
ENSMUSP00000000161 MGI:1277979
ENSMUSP00000000163 MGI:1919308
ENSMUSP00000000175 MGI:1914175
ENSMUSP00000000186 MGI:1891427
ENSMUSP00000000187 MGI:95520
File sample hsa.map
Ensembl_Protein_ID UniProt/SwissProt_Accession UniProt/TrEMBL_Accession
ENSP00000000233 P84085 A4D0Z3
ENSP00000000412 P20645 Q96AH2
ENSP00000000442 P11474 Q96I02
ENSP00000000442 P11474 Q96F89
ENSP00000000442 P11474 Q569H8
ENSP00000001008 Q02790
ENSP00000002829 Q13275
ENSP00000003084 P13569 Q9UML7
ENSP00000003084 P13569 Q9UJ19
ENSP00000003084 P13569 Q99989
ENSP00000003084 P13569 Q6KEJ7
ENSP00000003084 P13569 Q6KEJ4
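One way to drop those unwanted entries, as a sketch: only record a mapping when the non-Ensembl field is actually present and non-empty, and guard against rows that have fewer columns than the header:

```python
def get_mapping(map_file):
    with open(map_file) as f:
        header = f.readline().split()
        mapping_list = [{} for _ in range(len(header) - 1)]
        for line in f:
            fields = line.rstrip("\n").split("\t")
            for i, d in enumerate(mapping_list):
                # skip missing/empty non-Ensembl IDs instead of storing '' keys
                if i + 1 < len(fields) and fields[i + 1]:
                    d[fields[i + 1]] = fields[0]
    return mapping_list
```

With this guard, a row like `['ENSRNOP00000058792', '', '', '']` contributes no keys at all, so it never shows up in the printed mappings.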
I have lists that are formatted like so:
order_ids = ['Order ID', '026-2529662-9119536', '026-4092572-3574764', '026-4267878-0816332', '026-5334006-4073138', '026-5750353-4848328', '026-5945233-4883500', '026-5966822-8160331', '026-8799392-8255522', '202-5076008-9615516', '202-5211901-8584318', '202-5788153-3773918', '202-6208325-9677946', '203-1024454-3409960', '203-1064201-9833131', '203-4104559-7038752', '203-5013053-9959554', '203-5768187-0573905', '203-8639245-4145958', '203-9473169-4807564', '204-1577436-4733125', '204-7025768-1965915', '204-9196762-0226720', '205-6427246-2264368', '205-9028779-8764322', '206-0703454-9777135', '206-0954144-1685131', '206-3381432-7615531', '206-3822931-6939555', '206-4658913-5563533', '206-5213573-9997926', '206-5882801-0583557', '206-7158700-9326744', '206-7668862-3913143', '206-8019246-1474732', '206-8541775-0545153']
one = [['Order ID', 'Amount'], ['026-2529662-9119536', '10.42'], ['026-4092572-3574764', '10.42'], ['026-4267878-0816332', '1.75'], ['026-5334006-4073138', '17.990000000000002'], ['026-5750353-4848328', '16.25'], ['026-5945233-4883500', '1.83'], ['026-5966822-8160331', '11.92'], ['026-8799392-8255522', '8.5'], ['202-5076008-9615516', '1.83'], ['202-5211901-8584318', '1.83'], ['202-5788153-3773918', '8.08'], ['202-6208325-9677946', '11.33'], ['203-1024454-3409960', '8.08'], ['203-1064201-9833131', '1.5'], ['203-4104559-7038752', '8.5'], ['203-5013053-9959554', '9.67'], ['203-5113131-7525963', '-8.5'], ['203-5768187-0573905', '3.66'], ['203-8639245-4145958', '5.08'], ['203-9473169-4807564', '3.66'], ['204-1577436-4733125', '1.83'], ['204-7025768-1965915', '1.83'], ['204-9196762-0226720', '11.33'], ['205-8348990-1889964', '-11.33'], ['205-9028779-8764322', '6.91'], ['206-0703454-9777135', '23.84'], ['206-0954144-1685131', '22.66'], ['206-3381432-7615531', '8.08'], ['206-3822931-6939555', '11.92'], ['206-4658913-5563533', '9.67'], ['206-5213573-9997926', '3.66'], ['206-5882801-0583557', '13.92'], ['206-7158700-9326744', '27.5'], ['206-7668862-3913143', '6.58'], ['206-8541775-0545153', '1.83']]
What I want to do is cycle through every item inside order_ids and, if the order_id is present in one, get the corresponding value.
So far what I have tried is:
with open('test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    i = 0
    while i < len(order_ids):
        for order in order_ids:
            try:
                if order == one[i][0]:
                    value_a = one[i][1]
                    print(order, value_a)
                    writer.writerow([order, value_a])
                    i += 1
                else:
                    i += 1
                    pass
            except IndexError:
                i += 1
This is working somewhat, but there are 36 items inside "order_ids" and 36 lists inside "one", yet only 18 rows are being written to my outfile.
An example of one order_id that isn't being written is "206-7668862-3913143", even though it clearly has a value of "6.58" inside "one".
What is stopping the rest of my rows from being written?
You can do this simply with a dictionary. The dict() constructor will accept a nested list of pairs and create a dictionary mapping order_id to amount. Then we can just loop over the order_ids list and write out to test.csv any order_id that appears.
Code:
import csv

d = dict(one)
with open('test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    for order_id in order_ids:
        if order_id in d:
            writer.writerow([order_id, d[order_id]])
test.csv:
Order ID,Amount
026-2529662-9119536,10.42
026-4092572-3574764,10.42
026-4267878-0816332,1.75
026-5334006-4073138,17.990000000000002
026-5750353-4848328,16.25
026-5945233-4883500,1.83
026-5966822-8160331,11.92
026-8799392-8255522,8.5
202-5076008-9615516,1.83
202-5211901-8584318,1.83
202-5788153-3773918,8.08
202-6208325-9677946,11.33
203-1024454-3409960,8.08
203-1064201-9833131,1.5
203-4104559-7038752,8.5
203-5013053-9959554,9.67
203-5768187-0573905,3.66
203-8639245-4145958,5.08
203-9473169-4807564,3.66
204-1577436-4733125,1.83
204-7025768-1965915,1.83
204-9196762-0226720,11.33
205-9028779-8764322,6.91
206-0703454-9777135,23.84
206-0954144-1685131,22.66
206-3381432-7615531,8.08
206-3822931-6939555,11.92
206-4658913-5563533,9.67
206-5213573-9997926,3.66
206-5882801-0583557,13.92
206-7158700-9326744,27.5
206-7668862-3913143,6.58
206-8541775-0545153,1.83
While reading the csv file, I am iterating over it using itertuples:
import sys
import pandas as pd

df = pd.read_csv("/home/aviral/dev/misc/EKYducHrK93oSKCt7nY0ZYne.csv", encoding='utf-8')
count = 0
for row in df.itertuples():
    print(row)
    if count == 0:
        sys.exit()
    count += 1
The value of row is:
Pandas(Index=0, _1='07755aa8-3a15-42ca-8757-58da8a9a298f', _2=nan,
_3='07755aa8-3a15-42ca-8757-58da8a9a298f', _4='2018-03-14T04:43:21.309Z', _5='2018-03-14T04:43:30.679Z', _6='2018-03-14T04:43:30.679Z', User='Vaibhav Inzalkar', Username=919766148649, _9=24, _10=24, _11='Android',
_12='6E6498d700e51ebf', _13='2.3.0', _14='2.3.0', _15=nan, Tags=nan, vill_name='911B675b-B422-41E9-A2ae-6Ccb1e9de2e4', vill_name_parent_response_id='02d93f9d-80df-4e12-8c6b-4c7859a50862', vill_name_taluka_code=4082, census_district_2011='Yavatmal', vill_name_gp_code=194782, vill_name_village_code=542432.0, subdistrict_code=4082, vill_name_anganwadi_code=27510160609.0, census_village_sd_2011='Dhamani', district_code=510, vill_name_taluka_name='Yavatmal', vill_name_gp_name='Dhamni', vill_name_staffname='Anjali Rajendra Pachkawade', vill_name_gram_panchayat_survey_phase='Phase_3', census_subdistrict_2011='Yavatmal', vill_name_auditor_mobile_number=919766148649, vill_name_auditor_name='Vaibhav Inzalkar', vill_name_dist_sd_vill_comb_code=5104082194782542848, vill_name_anganwadi_worker_mobile_number=9404825268.0, _36=nan, buildtype='Pucca House', buildown='Temporary Arrangement', _39=1,
_40=1, staffname='Anjali Rajendra Pachakawde', anganwadi_sevika_own_mobile_yn=nan, anganwadi_worker_mobile_number=nan, wrkrvillyn='Yes', sahayakname='Pandharnishe Madam', helpervillyn='Yes', pregno=6, mothlactano=6, _49=1, imr_child_birth=3, imr_child_deaths=1, child0to6no=43, _53=1.0, angawadi_children_not_suffering_from_sam_age_zero_six=42.0, _55=0.0, adolegirlno=0, _57=nan, _58=1, _59=1, _60=1, _61=0, _62=1, _63=0,
_64=1, _65=1, _66=0, _67=0, _68=0, _69=0, _70=0, _71=0, _72=0, _73=0, _74=0, _75=0, _76=1, _77=0, _78=0, drugs_nothing=0, drugs_nothing_646PdPW9FewY3edC2LeG=0, _81=0, _82=0, _83=0, _84=0,
_85=0, _86=1, _87=1, _88=0, _89=0, _90=0, _91=0, _92=0, _93=0, _94=0, _95=0, _96=0, infraphy_nota=0, basic_util_awc_Register=0, _99=1, _100=1, _101=0, _102=0, _103=0, basic_util_awc_Ventilation=0, _105=0, _106=1, _107=1, _108=1, _109=1, basic_util_awc_Phenyl=1, basic_util_awc_Register_2DN84oFiz565JzqFegx7=0, _112=0, _113=0,
_114=0, _115=0, _116=0, basic_util_awc_Ventilation_2DN84oFiz565JzqFegx7=0, _118=0, _119=0,
_120=0, _121=0, _122=0, basic_util_awc_Phenyl_2DN84oFiz565JzqFegx7=0, basic_util_awc_nota=0, solar_unit=nan, elec_bill_paid=nan, _127=nan,
_128=nan, _129=nan, _130=nan, _131=nan, _132=nan, _133=nan, _134=nan, _135=nan, _136=nan, other=nan, _138=1, _139=1, _140=1, _141=1, _142=1, _143=1, _144=1, _145=0, _146=0, _147=0, _148=0, _149=0, _150=0, _151=0, servpregbaby03_nota=0, anganwadi_children_vaccination_BCG=1.0, anganwadi_children_vaccination_DPT=1.0, anganwadi_children_vaccination_OPV=1.0, _156=0.0, anganwadi_children_vaccination_Measles=1.0, _158=1.0, _159=1.0,
_160=0.0, _161=0.0, _162=0.0, anganwadi_children_vaccination_nota=0.0, _164=1, _165=1, _166=1, _167=1, _168=0, _169=0, _170=0, _171=0, servchild3to6_nota=0, servadolgirl_registration=0, _174=0, _175=0,
_176=0, _177=0, servadolgirl_registration_KoXmKreRO4DAxGuLelRP=0, _179=0, _180=0, _181=0, _182=0, servadolgirl_nothing=0, servadolgirl_nothing_KoXmKreRO4DAxGuLelRP=1, anganwadi_photo='Https://Collect-V2-Production.s3.Ap-South-1.Amazonaws.com/Omurh3lmkcmftts4muxn%2Fwmse68bvz5zwmzcfj9tx%2Fcpo0bvh0kuday74e9cqw%2F94c13850-6008-4087-B742-7B31ad2e4d02.Jpeg', anganwadi_map_latitude=20.338352399999998, anganwadi_map_longitude=78.1930695, anganwadi_map_accuracy=10.0,
_189=nan, problem_1='1)Pakki Building', problem_2='1) Toilet', problem_3='Kichan', problem_4='Elictric Line', problem_5='Water Connection', popserv=55, census_country='India', state_name='Maharashtra', state_code=27, sc_ang_id=1, village_code_census2011_raw=542432.0, phase='Phase 3')
How can I get just the columns and the values?
Something just like this:
_1='07755aa8-3a15-42ca-8757-58da8a9a298f', _2=nan, _3='07755aa8-3a15-42ca-8757-58da8a9a298f', _4='2018-03-14T04:43:21.309Z', _5='2018-03-14T04:43:30.679Z', _6='2018-03-14T04:43:30.679Z', User='Vaibhav Inzalkar', Username=919766148649, _9=24, _10=24, _11='Android',
_12='6E6498d700e51ebf', _13='2.3.0', _14='2.3.0'
Something like this?
Sample dataframe
name city cell
0 A X 124
1 ABC Y 345
2 BAD Z 76
Code:
for i in df.itertuples():
    # Python 3.6+
    print(','.join(f' {j}="{getattr(i, j)}"' for j in df.columns))
    # Python 3.5
    print(','.join(' {0}="{1}"'.format(j, getattr(i, j)) for j in df.columns))
Output:
name="A", city="X", cell="124"
name="ABC", city="Y", cell="345"
name="BAD", city="Z", cell="76"
Writing to file:
import json

with open("dummy.json", "a+") as f:
    for i in df.itertuples():
        x = dict(i._asdict())
        json.dump(x, f)
        f.write("\n")
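If the goal is only the column names and values, note that each named tuple from itertuples exposes `_asdict()`, which maps column names to that row's values directly; a tiny made-up frame for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "ABC"], "city": ["X", "Y"]})
for row in df.itertuples(index=False):
    # _asdict() maps column names to this row's values
    for col, val in row._asdict().items():
        print(f"{col}={val!r}")
```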
I have a *.txt file with the following layout: a long header followed by the data. See below:
field1, field2, field3
field4, field5, field 6, field7, field8
field9, fiel10
field11, field12
1, 1.1, 10.1
2, 0.5, 15
3, 0, 8.3
4, 2.1, 7.8
..
..
This is the code I have made. In order to save the values form the header I have created a dictionary named "header".
import pandas as pd

header = {}
count = 1
with open('file.txt') as f:
    while count <= 4:
        line = f.readline()
        if count == 1:
            header['field1'] = line.split(',')[0]
            header['field2'] = line.split(',')[1]
            header['field3'] = line.split(',')[2]
        if count == 2:
            header['field4'] = line.split(',')[0]
            header['field5'] = line.split(',')[1]
            header['field6'] = line.split(',')[2]
            header['field7'] = line.split(',')[3]
            header['field8'] = line.split(',')[4]
        if count == 3:
            header['field9'] = line.split(',')[0]
            header['field10'] = line.split(',')[1]
        if count == 4:
            header['field11'] = line.split(',')[0]
            header['field12'] = line.split(',')[1]
        count += 1

# Read the full data into a dataframe
df = pd.read_csv('file.txt', skiprows=4, names=['Col1', 'Col2', 'Col3'])
However, I think this is neither a very efficient nor an elegant way to do it. I would appreciate a simpler version using the I/O file pointer or pandas alone. Thanks.
Iterate over the header lines, split them, and iterate over the line entries:
header = {}
with open('file.txt') as fobj:
    counter = 1
    for line in fobj:
        # assuming an empty line between the multi-line header and the data
        if not line.strip():
            break
        for entry in line.split(','):
            header['field{}'.format(counter)] = entry.strip()
            counter += 1

import pprint
pprint.pprint(header)
Output:
{'field1': 'field1',
'field10': 'fiel10',
'field11': 'field11',
'field12': 'field12',
'field2': 'field2',
'field3': 'field3',
'field4': 'field4',
'field5': 'field5',
'field6': 'field 6',
'field7': 'field7',
'field8': 'field8',
'field9': 'field9'}
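A variant of the same idea that avoids re-opening the file: after consuming the header lines, hand the same file object to pandas, which continues reading from the current position, so no skiprows is needed. (The function name and the assumption of exactly four header lines are mine.)

```python
import pandas as pd

def read_with_header(path, header_lines=4, columns=("Col1", "Col2", "Col3")):
    header = {}
    with open(path) as f:
        counter = 1
        for _ in range(header_lines):
            # number the header entries field1, field2, ... across all lines
            for entry in f.readline().split(","):
                header["field{}".format(counter)] = entry.strip()
                counter += 1
        # pandas picks up at the file's current position: the first data row
        df = pd.read_csv(f, names=list(columns))
    return header, df
```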
I am trying to read in from a data file that has lines like:
2007 ANDREA 30 31.40 -71.90 05/13/18Z 25 1007 LOW
2007 ANDREA 31 31.80 -69.40 05/14/00Z 25 1007 LOW
I am trying to create a nested dictionary that has a key holding the year and then the nested dictionary will hold the name and a tuple containing statistics. I would like the return value to look like this:
{'2007': {'ANDREA': [(31.4, -71.9, '05/13/18Z', 25.0, 1007.0), (31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]
However, when I run the code it returns only one set of statistics. It seems to be overwriting itself, because only the last line of statistics in the txt file is returned:
{'2007': {'ANDREA': [(31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]
Here is the code:
def create_dictionary(fp):
    '''Remember to put a docstring here'''
    dict1 = {}
    f = []
    for line in fp:
        a = line.split()
        f.append(a)
    for item in f:
        a = (float(item[3]), float(item[4]), item[5], float(item[6]),
             float(item[7]))
        dict1 = update_dictionary(dict1, item[0], item[1], a)
    print(dict1)
def update_dictionary(dictionary, year, hurricane_name, data):
    if year not in dictionary:
        dictionary[year] = {}
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    else:
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    return dictionary
These lines:
if hurricane_name not in dictionary:
...should be:
if hurricane_name not in dictionary[year]:
Since I was a little late, here's a suggestion instead of an answer to your original question. You can simplify the logic a bit: when the year doesn't exist, the name also can't exist for that year. Everything can be put in a single function, and using a with statement to open the file ensures it is properly closed even if your program encounters an error.
def build_dict(file_path):
    result = {}
    with open(file_path, 'r') as f:
        for line in f:
            items = line.split()
            year, name, data = items[0], items[1], tuple(items[2:])
            if year in result:
                if name in result[year]:
                    result[year][name].append(data)
                else:
                    result[year][name] = [data]
            else:
                result[year] = {name: [data]}
    return result

print(build_dict(file_path))
Output:
{'2007': {'ANDREA': [('30', '31.40', '-71.90', '05/13/18Z', '25', '1007', 'LOW'), ('31', '31.80', '-69.40', '05/14/00Z', '25', '1007', 'LOW')]}}
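For completeness, collections.defaultdict removes the nested if/else entirely: missing years and names are created on first use. A sketch under the same whitespace-separated line format:

```python
from collections import defaultdict

def build_dict(file_path):
    # year -> name -> list of data tuples, levels created lazily on first access
    result = defaultdict(lambda: defaultdict(list))
    with open(file_path) as f:
        for line in f:
            items = line.split()
            result[items[0]][items[1]].append(tuple(items[2:]))
    return result
```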