I'm trying to write a Python script that sends queries to the TweetSentiments.com API.
The idea is that it will work like this:
read CSV tweet file > construct query > query API > format JSON response > write to CSV file.
So far I've come up with this:
import csv
import urllib
import os
count = 0
TweetList = []  ## Creates empty list to store tweets.
TweetWriter = csv.writer(open('test.csv', 'w'), dialect='excel', delimiter=' ', quotechar='|')
TweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for rows in TweetReader:
    TweetList.append(rows)
#print TweetList [0]
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
    connect = httplib.HTTPConnection("http://data.tweetsentiments.com:8080/api/analyze.json?q=")
    connect.result = json.load(urllib.request("POST", "", data))
    TweetWriter.write(result)
But when it's run I get “line 20, data = urllib.urlencode(TweetList[rows]) TypeError: list indices must be integers, not list”.
I know my list “TweetList” is storing the tweets just as I'd like, but I don't think I'm using urllib.urlencode correctly. The API requires that queries are sent like this:
http://data.tweetsentiments.com:8080/api/analyze.json?q= (text to analyze)
So the idea was that urllib.urlencode would simply add the tweets to the end of the address to allow a query.
The last four lines of code have become a mess after looking at so many examples. Your help would be much appreciated.
I'm not 100% sure what you're trying to do, since I don't know the format of the files you are reading, but this part looks suspicious:
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
Since TweetList is a list, the for loop assigns one element of the list (not its index) to rows on each iteration. So this, for example:
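list = [1, 2, 3, 4]
for num in list:
    print num
will print 1 2 3 4. But this:
list = [1, 2, 3, 4]
for num in list:
    print list[num]
will print 2 3 4 and then fail with this error: IndexError: list index out of range. In your case the elements are themselves lists (the CSV rows), which is why you get a TypeError instead.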
Can you please elaborate a bit more about the format of the files you are reading?
Edit
If I understand you correctly, you need something like this:
tweets = []
tweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for row in tweetReader:
    tweets.append({ 'tweet': row[0], 'date': row[1] })
for row in tweets:
    data = urllib.urlencode(row)
    .....
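For the query itself, one possible shape is to pass the tweet text to urlencode under the parameter name q and append the result to the endpoint. This is only a minimal sketch, written for Python 3, where urlencode lives in urllib.parse (in Python 2 it is urllib.urlencode). The helper name build_query_url and the sample tweet are made up for illustration, and the sketch only builds the URL; it does not contact the service.

```python
from urllib.parse import urlencode

API_URL = "http://data.tweetsentiments.com:8080/api/analyze.json"

def build_query_url(tweet_text):
    """Return the analyze URL for one tweet (hypothetical helper)."""
    # urlencode expects a dict (or a sequence of pairs), not a bare list
    return API_URL + "?" + urlencode({"q": tweet_text})

# Hypothetical rows as csv.reader would produce them: [tweet, date]
tweets = [["I love this!", "2012-01-01"]]
for row in tweets:
    print(build_query_url(row[0]))
```

urlencode takes care of escaping spaces and punctuation, so the tweet text does not need to be cleaned up by hand first.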
Related
When I run my code, only the first item in the list "listWP" is checked against "dataGA". I've tried a bunch of things with the same result. I'm new to coding, so apologies for my ignorance.
import csv
f_GA = open('MS_GA.csv', 'rt')
f_WP = open('MS_WP.csv', 'rt')
dataGA = csv.reader(f_GA, delimiter=',')
dataWP = csv.reader(f_WP, delimiter=',')
listWP = []
for row in dataWP:
    for i in row:
        b = i[29:]
        listWP.append(b)
for url in listWP:
    for row in dataGA:
        for i in row:
            if url in i:
                print (i + " ||is top site")
The current output is only the first item of listWP checked against dataGA; I would like all items to be checked.
csv.reader() returns an iterator. The first time you iterate through it, you consume all the rows. Every subsequent pass gets nothing, since there are no more rows left to iterate over. You should convert it to a list.
dataGA = list(csv.reader(f_GA, delimiter = ','))
However, it would probably be better if you redesigned your data structures. Instead of all those nested loops, convert the contents of dataGA to a set and then just use if url in url_set:.
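To make the exhaustion visible, here is a small sketch that uses in-memory text in place of the real files. The sample URLs and the names ga_file and wanted are made up for illustration. Note that a set only helps with exact matches; the substring test from the question still needs a loop over the cells.

```python
import csv
import io

# Hypothetical in-memory stand-in for MS_GA.csv
ga_file = io.StringIO("http://a.example/page1\nhttp://b.example/page2\n")
wanted = ["a.example", "b.example"]

reader = csv.reader(ga_file)
first_pass = [row for row in reader]
second_pass = [row for row in reader]   # the reader is already exhausted
print(len(first_pass), len(second_pass))

# Materialise the cells once, then loop over them as often as needed
ga_cells = [cell for row in first_pass for cell in row]
matches = [cell for url in wanted for cell in ga_cells if url in cell]
print(matches)

# Exact-match variant using a set
cell_set = set(ga_cells)
print("http://a.example/page1" in cell_set)
```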
Currently my program reads the data file (electricity readings), sums the values in each row, and then replaces negative sums with 0 while keeping the positive ones as they are. The program does this perfectly. This is the code I currently have:
import csv
from datetime import timedelta
from collections import defaultdict
def convert(item):
    try:
        return float(item)
    except ValueError:
        return 0
sums = defaultdict(list)
def daily():
    lista = []
    with open('Data.csv', 'r') as inp:
        reader = csv.reader(inp, delimiter = ';')
        headers = next(reader)
        for line in reader:
            mittaus = max(0,sum([convert(i) for i in line[1:-2]]))
            lista.append()
            #print(line[0],mittaus) ('#'only printing to check that it works ok)
daily()
My question is: How can I save the data to lists, so I can use them, and add all the values per day, so should look something like this:
1.1.2016;358006
2.1.2016;39
3.1.2016;0 ...
8.1.2016;239143
Once these are in a list (to be saved later to a new data file), the program should calculate the cumulative values, which should look like this:
1.1.2016;358006
2.1.2016;358045
3.1.2016;358045...
8.1.2016;597188
Having done this, it should be ready to write the data to a new CSV file.
A small peek at what's in the data file: https://pastebin.com/9HxwcixZ [It's actually delimited with ';', not with ' ' as in the pastebin]
The data file: https://files.fm/u/yuf4bbuk
I have clarified the questions, so you might have seen me ask before. These should be done without external libraries. I hope to find some help.
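One way this could be approached with only the standard library is sketched below. It rests on assumptions that the question does not confirm: that the first column holds a timestamp such as "1.1.2016 0:00" (so the date is the text before the first space), and that the sample rows, column names and the helper daily_totals are purely illustrative. The per-row clamping and the line[1:-2] slice are taken from the code above.

```python
import csv
import io
from itertools import accumulate

def convert(item):
    try:
        return float(item)
    except ValueError:
        return 0.0

def daily_totals(rows):
    """Sum the clamped per-row values for each date, keeping first-seen order."""
    totals = {}
    for line in rows:
        day = line[0].split(" ")[0]          # assumption: "d.m.yyyy hh:mm"
        value = max(0, sum(convert(i) for i in line[1:-2]))
        totals[day] = totals.get(day, 0) + value
    return totals

# Hypothetical sample standing in for Data.csv
sample = io.StringIO(
    "time;a;b;x;y\n"
    "1.1.2016 0:00;10;5;0;0\n"
    "1.1.2016 1:00;-20;5;0;0\n"
    "2.1.2016 0:00;3;4;0;0\n"
)
reader = csv.reader(sample, delimiter=";")
next(reader)                                  # skip the header row
totals = daily_totals(reader)
days = list(totals)
cumulative = list(accumulate(totals.values()))

# Write "date;cumulative" rows; a real file opened with newline="" works the same way
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerows(zip(days, cumulative))
print(out.getvalue())
```

Keeping the daily totals in a dict means the same data can be written out as-is for the per-day file and passed through accumulate for the running totals.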
We had our customer details spread over 4 legacy systems and have subsequently migrated all the data into 1 new system.
Our customers previously had different account numbers in each system and I need to check which account number has been used in the new system, which supports API calls.
I have a text file containing all the possible account numbers, structured like this:
30000001, 30000002, 30000003, 30000004
30010000, 30110000, 30120000, 30130000
34000000, 33000000, 32000000, 31000000
Where each row represents all the old account numbers for each customer.
I'm not sure if this is the best approach but I want to open the text file and create a nested list:
[['30000001', '30000002', '30000003', '30000004'], ['30010000', '30110000', '30120000', '30130000'], ['34000000', '33000000', '32000000', '31000000']]
Then I want to iterate over each list but to save on API calls, as soon as I have verified the new account number in a particular list, I want to break out and move onto the next list.
import json
from urllib2 import urlopen
def read_file():
    lines = [line.rstrip('\n') for line in open('customers.txt', 'r')]
    return lines
def verify_account(*customers):
    verified_accounts = []
    for accounts in customers:
        for account in accounts:
            url = api_url + account
            response = json.load(urlopen(url))
            if response['account_status'] == 1:
                verified_accounts.append(account)
                break
    return verified_accounts
The main issue is when I read from the file it returns the data like below so I can't iterate over the individual accounts.
['30000001, 30000002, 30000003, 30000004', '30010000, 30110000, 30120000, 30130000', '34000000, 33000000, 32000000, 31000000']
Also, is there a more Pythonic way, using list comprehensions or similar, to iterate over and check the account numbers? There seems to be too much nesting for Python.
The final item to mention is that there are well over 255 customers to check (almost 1000). Will I be able to pass more than 255 arguments into a function?
What about this? Just use str.split():
l = []
with open('customers.txt', 'r') as f:
    for i in f:
        l.append([s.strip() for s in i.split(',')])
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]
How about this?
with open('customers.txt','r') as f:
    final_list=[i.split(",") for i in f.read().replace(" ","").splitlines()]
print final_list
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]
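Once the nested list exists, the check with an early exit can be flattened a little. The sketch below is one possible shape, not a drop-in replacement: the names first_verified and fake_check are hypothetical, and fake_check merely stands in for the real API lookup so that the example runs offline. Passing the whole nested list as a single argument also means the number of customers never becomes a question of argument counts.

```python
def first_verified(accounts, is_active):
    """Return the first account for which is_active(account) is true, else None."""
    # next() stops pulling from the generator at the first hit, so later
    # accounts in the same row are never checked
    return next((a for a in accounts if is_active(a)), None)

def verify_accounts(customers, is_active):
    found = (first_verified(accounts, is_active) for accounts in customers)
    return [account for account in found if account is not None]

customers = [
    ["30000001", "30000002", "30000003", "30000004"],
    ["30010000", "30110000", "30120000", "30130000"],
]

calls = []
def fake_check(account):          # stand-in for the real API lookup
    calls.append(account)
    return account in {"30000002", "30010000"}

print(verify_accounts(customers, fake_check))
print(calls)
```

The calls list shows which accounts were actually looked up, which makes it easy to confirm that the remaining numbers in a row are skipped after a hit.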
I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with one of the SNACODE prefixes corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004, and I want all businesses whose codes start with one of my grocery codes, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked if my SNACODE column started with any of my codes (which probably was a bad idea but the only way I could get to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it outputted a 72MB file (200k+ rows) of grocery stores
codes = [4451,4452,447,772,45299,45291,45212] #codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns = headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header = False)
            print code
            break #break code for loop if match
grocery.to_csv("grocery.csv", sep = '\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1024*1024, dtype = str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
    print "Processed chunk and found {} matches".format(len(x))
output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index = False)
Thanks!
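To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no csv parsing) and check them against the regexp...
import re
codes = [4451,4452,447,772,45299,45291,45212]
col_num = 4 # Number of fields that precede the SNACODE column
# Skip col_num fields, then require the next field to start with one of the codes.
# The alternation is grouped so that the field-skipping prefix applies to every code.
expr = re.compile("[^,]*," * col_num +
                  "(?:" + "|".join(map(str, codes)) + ")" +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L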
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
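A small self-contained sketch of the simple (no quoting) variant, run against made-up sample lines, may help to see what the pattern accepts. The business names and the column position here are assumptions for illustration only. The alternation is wrapped in a group so that the "skip N fields" prefix applies to every code, not just the first one.

```python
import re

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
col_num = 1  # assumption for this sample: SNACODE is the second field

pattern = re.compile(
    "(?:[^,]*,){%d}(?:%s)" % (col_num, "|".join(map(str, codes)))
)

# Hypothetical sample lines standing in for the 5GB file
sample_lines = [
    "Acme Foods,44511003,NY",
    "Bolt Hardware,44419041,NJ",
    "Corner Deli,44522004,CT",
    "4451 Diner,42443013,PA",
]
matched = [line for line in sample_lines if pattern.match(line)]
print(matched)
```

The last sample line starts with 4451 in its name field but is not a grocery code, which is the case the field-skipping prefix is there to reject.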
You can probably make your pandas solution much faster:
import pandas as pd
codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]
sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})
with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
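If whole rows are needed rather than just the code column, a plain streaming pass with the csv module is another option. This is a sketch under assumptions: the NAME column and the sample rows are made up, and grocery_rows is a hypothetical helper. It relies on str.startswith accepting a tuple of prefixes, which removes the inner loop over codes.

```python
import csv
import io

codes = ("4451", "4452", "447", "772", "45299", "45291", "45212")

def grocery_rows(rows, column="SNACODE"):
    """Yield only the rows whose code column starts with one of the codes."""
    for row in rows:
        # str.startswith accepts a tuple and checks every prefix in one call
        if row[column].startswith(codes):
            yield row

# Hypothetical sample standing in for infogroup_bus_2010.csv
sample = io.StringIO(
    "NAME,SNACODE\n"
    "Acme Foods,44511003\n"
    "Bolt Hardware,44419041\n"
    "Corner Deli,44522004\n"
)
kept = list(grocery_rows(csv.DictReader(sample)))
print([row["NAME"] for row in kept])
```

Because the function is a generator, only one row is held in memory at a time, so the same loop can be pointed at a real file handle of any size.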
I am just starting out with Python. I have some Fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of Python.
I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, whose keys are the N values.
import csv
import re
path = 'vector_data.txt'
# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
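A slightly shorter variant of the same idea is sketched below, using collections.defaultdict in place of the try/except and str.split() in place of the csv reader, so that runs of several spaces between columns do not matter. The function name parse and the decision to skip non-matching lines are assumptions of this sketch; the sample lines are the ones from the question.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(r"h(\d+)N(\d+)")

def parse(lines):
    t, hm, hs = defaultdict(list), defaultdict(list), defaultdict(list)
    for line in lines:
        fields = line.split()          # split on any run of whitespace
        if len(fields) < 3:
            continue
        m = LINE_RE.match(fields[0])
        if m is None:
            continue                   # skip lines that are not data rows
        n = int(m.group(2))
        t[n].append(int(m.group(1)))
        hm[n].append(float(fields[1]))
        hs[n].append(float(fields[2]))
    return t, hm, hs

sample = [
    "h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438",
    "h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651",
    "h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627",
]
t, hm, hs = parse(sample)
print(dict(t))
```

With this layout, t[1][0] plays the role of tN1(1,1) from the question, shifted to zero-based indexing.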
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS": #! indicates data to follow, after 4 lines of junk text
        for i in range (0,4):
            junk = file.readline()
        for i in range (0,int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])
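For anyone who wants to try the same logic without the original output file, here is a self-contained sketch. The function name read_observations, the junk lines and the sample values are made up for illustration, and unlike the snippet above it converts the two value columns to float rather than keeping them as strings.

```python
import io
from itertools import islice

MARKER = " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS"

def read_observations(handle, nobs, skip=4):
    """Collect nobs data lines that follow the marker and its junk lines."""
    obsname, hm, hs = [], [], []
    for line in handle:
        if line.startswith(MARKER):
            # consume the junk lines that follow the marker
            for _ in islice(handle, skip):
                pass
            for data_line in islice(handle, nobs):
                sline = data_line.split()
                obsname.append(sline[0])
                hm.append(float(sline[1]))
                hs.append(float(sline[2]))
            break
    return obsname, hm, hs

# Hypothetical file contents
sample = io.StringIO(
    "header\n"
    " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS\n"
    "junk 1\njunk 2\njunk 3\njunk 4\n"
    "h5097600N1 2348.13 2348.35\n"
    "h5443200N1 2348.12 2348.36\n"
    "trailer\n"
)
names, hm, hs = read_observations(sample, nobs=2)
print(names, hm, hs)
```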