Merging requested tables in a loop - python

I am trying to either merge or concatenate tables that I am generating through a loop in Python.
Here's what I have:
for i in some_list:
    # replacing with the ith term to request that particular value
    url = "https://some_url/%s" % str(i)
    # accessing the table corresponding to my request
    request = pd.read_html(url)[0]
    # request1 is a table with the same columns as request
    request1 = request1.merge(request, how='outer')
request1
Essentially, I want to append to my original request1 table, which has the same columns as the request table. However, I am getting an error:
"You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat"

You may want to use pd.concat instead:
dflist = []
for i in some_list:
    # replacing with the ith term to request that particular value
    url = "https://some_url/%s" % str(i)
    # accessing the table corresponding to my request
    request = pd.read_html(url)[0]
    dflist.append(pd.DataFrame(request))  # adding the DataFrame constructor here
request1 = pd.concat(dflist)
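If the row index from each page does not matter, ignore_index=True gives the combined table a clean 0..n-1 index:
request1 = pd.concat(dflist, ignore_index=True)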

Related

How to use date filter correctly on aws dynamodb boto3

I want to retrieve items from a table in DynamoDB, then append this data below the last row of a table in BigQuery.
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table')
response = table.scan(FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query))
# first part
data = response['Items']
# second part
while response.get('LastEvaluatedKey'):
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])

df = pd.DataFrame(data)
df = df[['query', 'created_at', 'result_count', 'id', 'isfuzy']]
# load df to big query
.....
The date filter works, but in the while loop (the second part) the code retrieves all items.
After the first part I have 100 rows, but after this code
while response.get('LastEvaluatedKey'):
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
I have 500,000 rows. I could use only the first part, but I know there is a 1 MB limit per scan, which is why I am using the second part. How can I get the data in the given date range?
Your 1st scan API call has a FilterExpression set, which applies your data filter:
response = table.scan(FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query))
However, the 2nd scan API call doesn't have one set and thus is not filtering your data:
response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
Apply the FilterExpression to both calls:
while response.get('LastEvaluatedKey'):
    response = table.scan(
        ExclusiveStartKey=response['LastEvaluatedKey'],
        FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query)
    )
    data.extend(response['Items'])
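Since the same condition is needed on every page, you can build it once. A minimal sketch of the full paginated scan (variable names follow the question):
from boto3.dynamodb.conditions import Attr

date_filter = Attr('created_at').gt(max_date_of_the_table_in_big_query)
response = table.scan(FilterExpression=date_filter)
data = response['Items']
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=date_filter,
        ExclusiveStartKey=response['LastEvaluatedKey'],
    )
    data.extend(response['Items'])
Note that FilterExpression is applied after the items are read, so each 1 MB page may return only a few matching items; the loop still walks the whole table, it just returns less of it.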

Why does a Where clause cause more data to be returned?

I have a script that uses shareplum to get items from a very large and growing SharePoint (SP) list. Because of the size, I encountered the dreaded 5000 item limit set in SP. To get around that, I tried to page the data based on the 'ID' with a Where clause on the query.
# this is wrapped in a while.
# the idx is updated to the latest max if the results aren't empty.
df = pd.DataFrame(columns=cols)
idx = 0
query = {'Where': [('Gt', 'ID', str(idx))], 'OrderBy': ['ID']}
data = sp_list.GetListItems(view, query=query, row_limit=4750)
df = df.append(pd.DataFrame(data[0:]))
That seemed to work but, after I added the Where, it started returning rows not visible on the SP web list. For example, the minimum ID on the web is, say, 500 while shareplum returns rows starting at 1. It also seems to be pulling in rows that are filtered out on the web. For example, it includes column values not included on the web. If the Where is removed, it brings back the exact list viewed on the web.
What is it that I'm getting wrong here? I'm brand new to shareplum; I looked at the docs but they don't go into much detail and all the examples are rather trivial.
Why does a Where clause cause more data to be returned?
After further investigation, it seems shareplum will ignore any filters applied to the list to create the view when a query is provided to GetListItems. This is easily verified by removing the query param.
As a workaround, I'm now paging 'All Items' with a row_limit and query as below. This at least lets me get all the data and do any further filtering/grouping in python.
df = pd.DataFrame(columns=cols)
idx = 0
more = True
while more:
    query = {'Where': [('Gt', 'ID', str(idx))]}
    # page 'All Items' based on 'ID' > idx
    data = sp_list.GetListItems('All Items', query=query, row_limit=4500)
    data_df = pd.DataFrame(data[0:])
    if not data_df.empty:
        df = df.append(data_df)
        ids = pd.to_numeric(data_df['ID'])
        idx = ids.max()
    else:
        more = False
Why shareplum behaves this way is still an open question.
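A side note: DataFrame.append was removed in pandas 2.0, so on newer pandas the same paging loop can collect the page frames in a list and concatenate once at the end. A minimal sketch under that assumption:
frames = []
idx = 0
more = True
while more:
    query = {'Where': [('Gt', 'ID', str(idx))]}
    data = sp_list.GetListItems('All Items', query=query, row_limit=4500)
    data_df = pd.DataFrame(data[0:])
    if not data_df.empty:
        frames.append(data_df)
        idx = pd.to_numeric(data_df['ID']).max()
    else:
        more = False
df = pd.concat(frames) if frames else pd.DataFrame(columns=cols)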

Loop in web Table

I am new to Python and I am trying to get some values from a table in a web page. I need to get the values shown in yellow on the page:
I have this code. It gets all the values in the "Instruments" column, but I don't know how to get the specific values:
body = soup.find_all("tr")
for Rows in body:
    RowValue = Rows.find_all('th')
    if len(RowValue) > 0:
        CellValue = RowValue[0]
        ThisWeekValues.append(CellValue.text)
Any suggestions?
ids = driver.find_elements_by_xpath('//*[@id]')
for element in ids:
    if element.get_attribute('id') == 'your_element_id':
        # do something with the matching element
        ...
One of the ways could be this, since only the id is different.
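Since the question already parses the page with BeautifulSoup, a filter on the parsed rows may be closer to the existing code. A minimal sketch, assuming the wanted instrument names are known (the set below is made up, since the screenshot is not reproduced here):
wanted = {'Instrument A', 'Instrument B'}  # hypothetical names of the highlighted rows

for row in soup.find_all('tr'):
    header = row.find('th')
    if header is not None and header.get_text(strip=True) in wanted:
        ThisWeekValues.append(header.get_text(strip=True))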

Getting the column name of a specific data with sqlalchemy

I am using SQLAlchemy 0.8 and I want to get the column name of the input only, not all the columns in the table.
Here is the code:
rec = raw_input("Enter keyword to search: ")
res = session.query(test.__table__).filter(test.fname == rec).first()
data = ','.join(map(str, res)) + ","
print data
# saw this on SO but it's not the one I wanted; it displays all of the columns
columns = [m.key for m in data.columns]
print columns
You can just query for the columns you want. If you have some model MyModel, you can do:
session.query(MyModel.wanted_column1, ...)  # rest of the query
This selects only the columns mentioned there.
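With a column-specific query like this, each returned row is a keyed tuple, so (assuming SQLAlchemy 0.8 as in the question) row.keys() gives exactly the names of the selected columns:
row = session.query(MyModel.wanted_column1).first()  # assumes at least one match
print row.keys()  # ['wanted_column1']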
You can use the select syntax.
Or if you still want the model object to be returned and certain columns not loaded, you can use deferred column loading.
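A minimal sketch of the deferred approach, assuming the mapped class test from the question also has a column lname (the column name is illustrative; in 0.8 the deferred column is named as a string):
from sqlalchemy.orm import defer

# `lname` stays unloaded until the attribute is first accessed
res = session.query(test).options(defer('lname')).filter(test.fname == rec).first()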

Fetch and render fetched values from MySQL tables as XML

I am fetching results out of a query from a table:
def getdata(self):
    self.cursor.execute("....")
    fetchall = self.cursor.fetchall()
    result = {}
    for row in fetchall:
        detail1 = row['mysite']
        details2 = row['url']
        result[detail1] = row
    return result
Now I need to process the result set it generates:
def genXML(self):
    data = self.getdata()
    doc = Document()  # create the XML tree structure
The idea is that data would hold all the rows fetched by the query, so I can extract each column's values from it. Somehow I am not getting the desired output. My requirement is to fetch a result set via a DB query and store it in a placeholder so that I can easily access it later from other methods or locations.
================================================================================
I tried the technique below, but in the method getXML() I am still unable to get each dict row so that I can traverse and manipulate it:
fetchall = self.cursor.fetchall()
results = []
result = {}
for row in fetchall:
    result['mysite'] = row['mysite']
    result['mystart'] = row['mystart']
    ..................................
    results.append(result)
return results
def getXML(self):
    doc = Document()
    charts = doc.createElement("charts")
    doc.appendChild(charts)
    chartData = self.grabChartData()
    for site in chartData:
        print site[??]
So how do I get each chartData row's values so that I can loop over them?
Note: I found that only the last fetched row's values are printed in chartData. Say the query returns 2 rows; if I print the list in the getXML() method as below, both rows are the same:
chartData[0]
chartData[1]
How can I uniquely add each result to the list?
Here you are modifying and adding the same dict to results over and over again:
result = {}
for row in fetchall:
    result['mysite'] = row['mysite']
    result['mystart'] = row['mystart']
    ..................................
    results.append(result)
Create the dictionary inside the loop to solve this:
for row in fetchall:
    result = {}  # a fresh dict for every row
    ...
    results.append(result)
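Putting it together, a minimal sketch of the corrected flow with minidom (the query and column names are illustrative, and a dict cursor such as MySQLdb.cursors.DictCursor is assumed):
from xml.dom.minidom import Document

def grabChartData(self):
    # hypothetical query; the real one is elided in the question
    self.cursor.execute("SELECT mysite, mystart FROM charts")
    results = []
    for row in self.cursor.fetchall():
        result = {}                        # fresh dict per row
        result['mysite'] = row['mysite']
        result['mystart'] = row['mystart']
        results.append(result)
    return results

def getXML(self):
    doc = Document()
    charts = doc.createElement("charts")
    doc.appendChild(charts)
    for site in self.grabChartData():
        chart = doc.createElement("chart")
        chart.setAttribute("mysite", str(site['mysite']))
        chart.setAttribute("mystart", str(site['mystart']))
        charts.appendChild(chart)
    return doc.toprettyxml()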
