How to convert a Pandas object, and not the entire dataframe, to a string? - python

While reading the csv file, I am iterating over it using itertuples:
import sys
import pandas as pd

df = pd.read_csv("/home/aviral/dev/misc/EKYducHrK93oSKCt7nY0ZYne.csv", encoding='utf-8')
count = 0
for row in df.itertuples():
    print(row)
    if count == 0:
        sys.exit()
    count += 1
The value of row is:
Pandas(Index=0, _1='07755aa8-3a15-42ca-8757-58da8a9a298f', _2=nan,
_3='07755aa8-3a15-42ca-8757-58da8a9a298f', _4='2018-03-14T04:43:21.309Z', _5='2018-03-14T04:43:30.679Z', _6='2018-03-14T04:43:30.679Z', User='Vaibhav Inzalkar', Username=919766148649, _9=24, _10=24, _11='Android',
_12='6E6498d700e51ebf', _13='2.3.0', _14='2.3.0', _15=nan, Tags=nan, vill_name='911B675b-B422-41E9-A2ae-6Ccb1e9de2e4', vill_name_parent_response_id='02d93f9d-80df-4e12-8c6b-4c7859a50862', vill_name_taluka_code=4082, census_district_2011='Yavatmal', vill_name_gp_code=194782, vill_name_village_code=542432.0, subdistrict_code=4082, vill_name_anganwadi_code=27510160609.0, census_village_sd_2011='Dhamani', district_code=510, vill_name_taluka_name='Yavatmal', vill_name_gp_name='Dhamni', vill_name_staffname='Anjali Rajendra Pachkawade', vill_name_gram_panchayat_survey_phase='Phase_3', census_subdistrict_2011='Yavatmal', vill_name_auditor_mobile_number=919766148649, vill_name_auditor_name='Vaibhav Inzalkar', vill_name_dist_sd_vill_comb_code=5104082194782542848, vill_name_anganwadi_worker_mobile_number=9404825268.0, _36=nan, buildtype='Pucca House', buildown='Temporary Arrangement', _39=1,
_40=1, staffname='Anjali Rajendra Pachakawde', anganwadi_sevika_own_mobile_yn=nan, anganwadi_worker_mobile_number=nan, wrkrvillyn='Yes', sahayakname='Pandharnishe Madam', helpervillyn='Yes', pregno=6, mothlactano=6, _49=1, imr_child_birth=3, imr_child_deaths=1, child0to6no=43, _53=1.0, angawadi_children_not_suffering_from_sam_age_zero_six=42.0, _55=0.0, adolegirlno=0, _57=nan, _58=1, _59=1, _60=1, _61=0, _62=1, _63=0,
_64=1, _65=1, _66=0, _67=0, _68=0, _69=0, _70=0, _71=0, _72=0, _73=0, _74=0, _75=0, _76=1, _77=0, _78=0, drugs_nothing=0, drugs_nothing_646PdPW9FewY3edC2LeG=0, _81=0, _82=0, _83=0, _84=0,
_85=0, _86=1, _87=1, _88=0, _89=0, _90=0, _91=0, _92=0, _93=0, _94=0, _95=0, _96=0, infraphy_nota=0, basic_util_awc_Register=0, _99=1, _100=1, _101=0, _102=0, _103=0, basic_util_awc_Ventilation=0, _105=0, _106=1, _107=1, _108=1, _109=1, basic_util_awc_Phenyl=1, basic_util_awc_Register_2DN84oFiz565JzqFegx7=0, _112=0, _113=0,
_114=0, _115=0, _116=0, basic_util_awc_Ventilation_2DN84oFiz565JzqFegx7=0, _118=0, _119=0,
_120=0, _121=0, _122=0, basic_util_awc_Phenyl_2DN84oFiz565JzqFegx7=0, basic_util_awc_nota=0, solar_unit=nan, elec_bill_paid=nan, _127=nan,
_128=nan, _129=nan, _130=nan, _131=nan, _132=nan, _133=nan, _134=nan, _135=nan, _136=nan, other=nan, _138=1, _139=1, _140=1, _141=1, _142=1, _143=1, _144=1, _145=0, _146=0, _147=0, _148=0, _149=0, _150=0, _151=0, servpregbaby03_nota=0, anganwadi_children_vaccination_BCG=1.0, anganwadi_children_vaccination_DPT=1.0, anganwadi_children_vaccination_OPV=1.0, _156=0.0, anganwadi_children_vaccination_Measles=1.0, _158=1.0, _159=1.0,
_160=0.0, _161=0.0, _162=0.0, anganwadi_children_vaccination_nota=0.0, _164=1, _165=1, _166=1, _167=1, _168=0, _169=0, _170=0, _171=0, servchild3to6_nota=0, servadolgirl_registration=0, _174=0, _175=0,
_176=0, _177=0, servadolgirl_registration_KoXmKreRO4DAxGuLelRP=0, _179=0, _180=0, _181=0, _182=0, servadolgirl_nothing=0, servadolgirl_nothing_KoXmKreRO4DAxGuLelRP=1, anganwadi_photo='Https://Collect-V2-Production.s3.Ap-South-1.Amazonaws.com/Omurh3lmkcmftts4muxn%2Fwmse68bvz5zwmzcfj9tx%2Fcpo0bvh0kuday74e9cqw%2F94c13850-6008-4087-B742-7B31ad2e4d02.Jpeg', anganwadi_map_latitude=20.338352399999998, anganwadi_map_longitude=78.1930695, anganwadi_map_accuracy=10.0,
_189=nan, problem_1='1)Pakki Building', problem_2='1) Toilet', problem_3='Kichan', problem_4='Elictric Line', problem_5='Water Connection', popserv=55, census_country='India', state_name='Maharashtra', state_code=27, sc_ang_id=1, village_code_census2011_raw=542432.0, phase='Phase 3')
How can I get just the columns and the values?
Something just like this:
_1='07755aa8-3a15-42ca-8757-58da8a9a298f', _2=nan, _3='07755aa8-3a15-42ca-8757-58da8a9a298f', _4='2018-03-14T04:43:21.309Z', _5='2018-03-14T04:43:30.679Z', _6='2018-03-14T04:43:30.679Z', User='Vaibhav Inzalkar', Username=919766148649, _9=24, _10=24, _11='Android',
_12='6E6498d700e51ebf', _13='2.3.0', _14='2.3.0'

Something like this?
Sample dataframe
name city cell
0 A X 124
1 ABC Y 345
2 BAD Z 76
Code:
for i in df.itertuples():
    # Python 3.6+ (f-strings)
    print(','.join(f' {j}="{getattr(i, j)}"' for j in df.columns))
    # Python 3.5
    print(','.join(' {0}="{1}"'.format(j, getattr(i, j)) for j in df.columns))
Output:
name="A", city="X", cell="124"
name="ABC", city="Y", cell="345"
name="BAD", city="Z", cell="76"
Writing to file:
with open("dummy.json", "a+") as f:
for i in df.itertuples():
x = dict(i._asdict())
json.dump(x, f)
f.write("\n")

Related

how to sum/aggregate by group without using pandas or import

So I am basically not allowed to use any imports or libraries like pandas or groupby, and I have to categorize the data and sum up the corresponding values. The data is in a csv file.
For example,
S C T
A T 100
A. B 102
A. T. 200
A B. 100
C T 203
C. T. 200
C B 200
C T 200
C. B 200
My expected result should be:
S C T
A T 300
A B. 202
C T 403
C B. 200
C T. 200
C B. 200
Considering that you have a csv file (i.e., columns split by comma):
with open('myfile.csv', 'r') as file:
    header = file.readline().rstrip()
    data = {}
    for row in file:
        state, candidate, value = row.split(',')
        k, value = (state, candidate), int(value)
        data[k] = data.get(k, 0) + value

result_csv = '\n'.join([header] + [f"{','.join(k)},{v}" for k, v in data.items()])
print(result_csv)
Output:
state,candidate,total votes
Alaska,Trump,300
Alaska,Biden,202
colorado,Trump,403
colorado,Biden,200
California,Trump,200
California,Biden,200
Original content of myfile.csv is (use str.replace if necessary):
state,candidate,total votes
Alaska,Trump,100
Alaska,Biden,102
Alaska,Trump,200
Alaska,Biden,100
colorado,Trump,203
colorado,Trump,200
colorado,Biden,200
California,Trump,200
California,Biden,200
# A second approach: read the raw whitespace-separated file and aggregate into a nested dict.
mylist = []
with open("data", "r") as msg:
    for line in msg:
        mylist.append(line.strip().replace(".", ""))

headers = mylist[0].replace("*", "").split()
del mylist[0]
headers[2] = headers[2] + " " + headers[3]   # re-join the two-word "total votes" header

mydict = {}
for line in mylist:
    state = line.split()[0]
    mydict[state] = {}
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    mydict[state][candidate] = 0
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    votes = line.split()[2]
    mydict[state][candidate] = mydict[state][candidate] + int(votes)

print("%-15s %-15s %-15s \n\n" % (headers[0], headers[1], headers[2]))
for state in mydict.keys():
    for candidate in mydict[state].keys():
        print("%-15s %-15s %-15s" % (state, candidate, str(mydict[state][candidate])))
Output:
state candidate total votes
Alaska Trump 300
Alaska Biden 202
colorado Trump 403
colorado Biden 200
California Trump 200
California Biden 200
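Both answers assume the file is already laid out as clean state/candidate/votes rows. If the raw text still contains the stray dots and asterisks shown in the question, a quick normalisation pass can strip them and rewrite the data as comma-separated lines first; a minimal sketch (the file names raw.txt and myfile.csv are assumptions, and every field is assumed to be a single token):
clean_rows = []
with open("raw.txt") as f:
    for line in f:
        line = line.replace(".", "").replace("*", "").strip()   # drop stray punctuation
        if line:
            clean_rows.append(",".join(line.split()))            # whitespace -> commas

with open("myfile.csv", "w") as f:
    f.write("\n".join(clean_rows) + "\n")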

Compare list items with a list of pairs, and output matching pairs

I have lists that are formatted like so:
order_ids = ['Order ID', '026-2529662-9119536', '026-4092572-3574764', '026-4267878-0816332', '026-5334006-4073138', '026-5750353-4848328', '026-5945233-4883500', '026-5966822-8160331', '026-8799392-8255522', '202-5076008-9615516', '202-5211901-8584318', '202-5788153-3773918', '202-6208325-9677946', '203-1024454-3409960', '203-1064201-9833131', '203-4104559-7038752', '203-5013053-9959554', '203-5768187-0573905', '203-8639245-4145958', '203-9473169-4807564', '204-1577436-4733125', '204-7025768-1965915', '204-9196762-0226720', '205-6427246-2264368', '205-9028779-8764322', '206-0703454-9777135', '206-0954144-1685131', '206-3381432-7615531', '206-3822931-6939555', '206-4658913-5563533', '206-5213573-9997926', '206-5882801-0583557', '206-7158700-9326744', '206-7668862-3913143', '206-8019246-1474732', '206-8541775-0545153']
one = [['Order ID', 'Amount'], ['026-2529662-9119536', '10.42'], ['026-4092572-3574764', '10.42'], ['026-4267878-0816332', '1.75'], ['026-5334006-4073138', '17.990000000000002'], ['026-5750353-4848328', '16.25'], ['026-5945233-4883500', '1.83'], ['026-5966822-8160331', '11.92'], ['026-8799392-8255522', '8.5'], ['202-5076008-9615516', '1.83'], ['202-5211901-8584318', '1.83'], ['202-5788153-3773918', '8.08'], ['202-6208325-9677946', '11.33'], ['203-1024454-3409960', '8.08'], ['203-1064201-9833131', '1.5'], ['203-4104559-7038752', '8.5'], ['203-5013053-9959554', '9.67'], ['203-5113131-7525963', '-8.5'], ['203-5768187-0573905', '3.66'], ['203-8639245-4145958', '5.08'], ['203-9473169-4807564', '3.66'], ['204-1577436-4733125', '1.83'], ['204-7025768-1965915', '1.83'], ['204-9196762-0226720', '11.33'], ['205-8348990-1889964', '-11.33'], ['205-9028779-8764322', '6.91'], ['206-0703454-9777135', '23.84'], ['206-0954144-1685131', '22.66'], ['206-3381432-7615531', '8.08'], ['206-3822931-6939555', '11.92'], ['206-4658913-5563533', '9.67'], ['206-5213573-9997926', '3.66'], ['206-5882801-0583557', '13.92'], ['206-7158700-9326744', '27.5'], ['206-7668862-3913143', '6.58'], ['206-8541775-0545153', '1.83']]
What I want to do is cycle through every item inside order_ids, and if the order ID is present in one, get its value.
So far what I have tried is:
with open('test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    i = 0
    while i < len(order_ids):
        for order in order_ids:
            try:
                if order == one[i][0]:
                    value_a = one[i][1]
                    print(order, value_a)
                    writer.writerow([order, value_a])
                    i += 1
                else:
                    i += 1
                    pass
            except IndexError:
                i += 1
This is working somewhat, but there are 36 items inside "order_ids" and 36 lists inside "one", yet only 18 rows are being written to my outfile.
An example of one order_id that isn't being written is "206-7668862-3913143", even though it clearly has a value of "6.58" inside "one".
What is stopping the rest of my rows from being written?
You can do this simply with a dictionary. The dict() constructor accepts a list of key/value pairs and creates a dictionary mapping order ID to amount. Then we can just loop over the order_ids list and write every order ID that appears to test.csv.
Code:
import csv

d = dict(one)
with open('test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    for order_id in order_ids:
        if order_id in d:
            writer.writerow([order_id, d[order_id]])
test.csv:
Order ID,Amount
026-2529662-9119536,10.42
026-4092572-3574764,10.42
026-4267878-0816332,1.75
026-5334006-4073138,17.990000000000002
026-5750353-4848328,16.25
026-5945233-4883500,1.83
026-5966822-8160331,11.92
026-8799392-8255522,8.5
202-5076008-9615516,1.83
202-5211901-8584318,1.83
202-5788153-3773918,8.08
202-6208325-9677946,11.33
203-1024454-3409960,8.08
203-1064201-9833131,1.5
203-4104559-7038752,8.5
203-5013053-9959554,9.67
203-5768187-0573905,3.66
203-8639245-4145958,5.08
203-9473169-4807564,3.66
204-1577436-4733125,1.83
204-7025768-1965915,1.83
204-9196762-0226720,11.33
205-9028779-8764322,6.91
206-0703454-9777135,23.84
206-0954144-1685131,22.66
206-3381432-7615531,8.08
206-3822931-6939555,11.92
206-4658913-5563533,9.67
206-5213573-9997926,3.66
206-5882801-0583557,13.92
206-7158700-9326744,27.5
206-7668862-3913143,6.58
206-8541775-0545153,1.83
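As for why the original loop stops short: the counter i is advanced on every comparison, in both branches, so it only writes a row at positions where order_ids and one happen to line up exactly; as soon as one contains an ID that order_ids doesn't (or vice versa), the positions fall out of step and the remaining matches are skipped. If you also want a row for order IDs that have no amount in one, dict.get with a default keeps them in the output; a minimal variation of the answer above:
import csv

d = dict(one)
with open('test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    for order_id in order_ids:
        # Unmatched IDs get an empty Amount instead of being dropped.
        writer.writerow([order_id, d.get(order_id, '')])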

Looking over a csv file for various conditions

00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
01,0,8749
01,1,8749
01,2,8749
01,3,88609
01,4,88609
01,5,88609
01,6,88611
01,7,88611
01,8,88611
02,0,9006
02,1,9006
02,2,4355
02,3,9013
02,4,9013
02,5,9013
02,6,4341
02,7,4341
02,8,4341
02,9,4341
03,0,6285
03,1,6285
03,2,6285
03,3,6285
03,4,6278
03,5,6278
03,6,6278
03,7,6278
03,8,8960
I have a csv file and a bit of it is shown above.
What I want to do is: if column 0 has the same value, build an array of the column 2 values and print it. I.e., for 00, it makes an array:
a = [6098,6098,6098,6098,6094,6094]
for 01, it makes an array-
a = [8749,8749,88609,88609,88609,88611,88611,88611]
I don't know how to loop over this file.
This solution assumes that the first column will appear in sorted order in the file.
def main():
    import csv
    from itertools import groupby

    with open("csv.csv") as file:
        reader = csv.reader(file)
        rows = [[row[0]] + [int(item) for item in row[1:]] for row in reader]

    groups = {}
    # groupby only merges *consecutive* rows with the same key,
    # hence the sorted-order assumption stated above.
    for key, group in groupby(rows, lambda row: row[0]):
        groups[key] = [row[2] for row in group]

    print(groups["00"])
    print(groups["01"])
    print(groups["02"])
    print(groups["03"])

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
[6098, 6098, 6098, 6098, 6094, 6094]
[8749, 8749, 8749, 88609, 88609, 88609, 88611, 88611, 88611]
[9006, 9006, 4355, 9013, 9013, 9013, 4341, 4341, 4341, 4341]
[6285, 6285, 6285, 6285, 6278, 6278, 6278, 6278, 8960]
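If the first column is not guaranteed to be sorted (or at least grouped) in the file, sorting the rows before the groupby call removes that assumption; e.g., insert this line right after building rows:
# Sort by the first column so identical keys become consecutive before grouping.
rows.sort(key=lambda row: row[0])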
The idea is to use a dictionary in which 00, 01, etc. are the keys and each value is a list. So you need to iterate through the csv data and append each value to the corresponding key.
import csv

result = {}
with open("your csv file", "r") as csvfile:
    data = csv.reader(csvfile)
    for row in data:
        # dict.has_key() only exists in Python 2; the `in` operator works in both.
        if row[0] in result:
            result[row[0]].append(row[2])
        else:
            result[row[0]] = [row[2]]

print(result)
Here
from collections import defaultdict
txt = '''00,0,6098
00,1,6098
00,2,6098
00,3,6098
00,4,6094
00,5,6094
01,0,8749
01,1,8749
01,2,8749
01,3,88609
01,4,88609
01,5,88609
01,6,88611
01,7,88611
01,8,88611
02,0,9006
02,1,9006
02,2,4355
02,3,9013
02,4,9013
02,5,9013
02,6,4341
02,7,4341
02,8,4341
02,9,4341
03,0,6285
03,1,6285
03,2,6285
03,3,6285
03,4,6278
03,5,6278
03,6,6278
03,7,6278
03,8,8960'''
data_holder = defaultdict(list)
lines = txt.split('\n')
for line in lines:
    fields = line.split(',')
    data_holder[fields[0]].append(fields[2])

for k, v in data_holder.items():
    print('{} -> {}'.format(k, v))
Output:
02 -> ['9006', '9006', '4355', '9013', '9013', '9013', '4341', '4341', '4341', '4341']
03 -> ['6285', '6285', '6285', '6285', '6278', '6278', '6278', '6278', '8960']
00 -> ['6098', '6098', '6098', '6098', '6094', '6094']
01 -> ['8749', '8749', '8749', '88609', '88609', '88609', '88611', '88611', '88611']
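To apply the same idea directly to the csv file rather than an embedded string (the question says the data lives in a file), a defaultdict plus csv.reader works; a minimal sketch, assuming the file is named data.csv:
import csv
from collections import defaultdict

groups = defaultdict(list)
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        if row:                                # skip blank lines
            groups[row[0]].append(int(row[2]))

for key, values in groups.items():
    print(key, '->', values)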

Connecting data in python to spreadsheets

I have a dictionary in Python 2.7.9. I want to present the data in my dictionary in a spreadsheet. How can I accomplish this? Note, the dictionary has over 15 different items inside.
Dictionary:
{'Leda Doggslife': '$13.99', 'Carson Busses': '$29.95', 'Derri Anne Connecticut': '$19.25', 'Bobbi Soks': '$5.68', 'Ben D. Rules': '$7.50', 'Patty Cakes': '$15.26', 'Ira Pent': '$16.27', 'Moe Tell': '$10.09', 'Ido Hoe': '$14.47', 'Ave Sectomy': '$50.85', 'Phil Meup': '$15.98', 'Al Fresco': '$8.49', 'Moe Dess': '$19.25', 'Sheila Takya': '$15.00', 'Earl E. Byrd': '$8.37', 'Rose Tattoo': '$114.07', 'Gary Shattire': '$14.26', 'Len Lease': '$11.11', 'Howie Kisses': '$15.86', 'Dan Druff': '$31.57'}
Are you trying to write your dictionary to an Excel spreadsheet?
In that case, you could use the win32com library:
import win32com.client

xlApp = win32com.client.DispatchEx('Excel.Application')
xlApp.Visible = 0
xlBook = xlApp.Workbooks.Open(my_filename)
sht = xlBook.Worksheets(my_sheet)
row = 1
# my_dict is your dictionary (renamed so it doesn't shadow the built-in dict)
for element in my_dict.keys():
    sht.Cells(row, 1).Value = element
    sht.Cells(row, 2).Value = my_dict[element]
    row += 1
xlBook.Save()
xlBook.Close()
Note that this code will only work if the workbook already exists.
Otherwise:
import win32com.client

xlApp = win32com.client.DispatchEx('Excel.Application')
xlApp.Visible = 0
xlBook = xlApp.Workbooks.Add()
sht = xlBook.Worksheets(my_sheet)
row = 1
for element in my_dict.keys():
    sht.Cells(row, 1).Value = element
    sht.Cells(row, 2).Value = my_dict[element]
    row += 1
xlBook.SaveAs(my_filename)
xlBook.Close()
I hope it will be the right answer to your question.
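If you don't need Excel itself (win32com requires Windows with Excel installed), writing the dictionary out as a CSV file that any spreadsheet program can open is simpler; a minimal sketch using only the standard library (prices.csv and the variable name prices are assumptions):
import csv

prices = {'Leda Doggslife': '$13.99', 'Carson Busses': '$29.95'}  # ... your full dictionary

# Python 2.7: open in binary mode for the csv module; on Python 3 use open('prices.csv', 'w', newline='')
with open('prices.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Amount'])
    for name, amount in prices.items():
        writer.writerow([name, amount])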

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None
with open(csv_in, 'rU') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
        myID = ', '.join(UID)
        ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this Python notebook from NLTK. At the end of it, you see an operation very close to yours, reading a csv file containing tweets:
In [25]:
import pandas as pd

tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user:
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook):
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
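To get exactly the grouped-by-user layout from the question (every tweet listed under its user_id), a pandas groupby on the csv works too; a minimal sketch (the column names passed to names= are assumptions, adjust them to your file):
import pandas as pd

cols = ['lat', 'long', 'user_id', 'user_name', 'date', 'time', 'tweet', 'flag', 'compound', 'vote']
df = pd.read_csv(csv_in, names=cols)

for user_id, group in df.groupby('user_id'):
    print(user_id)
    for _, row in group.iterrows():
        print('    ', row['lat'], row['long'], row['tweet'])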
