I'm getting this JSON extract using Scrapy, but the desc has the amount and the amount type in it; this could be g, gr, kg, L, etc. I want to know if it's possible to extract this data and add it into an additional field.
How could this be achieved, either within Scrapy or in a separate process once the file has been created?
P.S. I'm totally new to JSON and Scrapy and I'm learning.
Current
{
'p_desc': ['Coffee 225 g '],
'p_price': ['8.00']
}
Desired
{
'p_desc': ['Coffee'],
'p_amount': [225],
'p_amount_type': ['g'],
'p_price': ['8.00']
}
Something like this works if the data has a regular structure (i.e. every desc contains the amount and the amount type as its last two fields). If not, you might have to use regular expressions.
One observation: if each value is unique you don't need a list; for instance, you can just use 'Coffee' instead of ['Coffee'].
var jsonData = {
'p_desc': ['Grain Black Coffee 225 g'],
'p_price': ['8.00']
};
var p_desc, p_amount, p_amount_type;
// Split on spaces and reverse so the last two tokens (amount type, amount) come first
[p_amount_type, p_amount, ...p_desc] = jsonData['p_desc'][0].split(" ").reverse();
jsonData["p_amount"] = [p_amount];
jsonData["p_amount_type"] = [p_amount_type];
// The remaining tokens are the description; reverse them back into their original order
jsonData["p_desc"] = [p_desc.reverse().join(' ')];
console.log(jsonData);
Also, you might need to remove trailing white-space from the description.
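Since you are working in Scrapy (Python), here is a rough equivalent sketch using a regular expression, for example in a spider callback or an item pipeline. The field names p_desc, p_amount and p_amount_type come from your desired output; the regex and the split_description helper are assumptions for illustration:
import re

# Matches "<description> <amount> <unit>" where the unit is one of a few known suffixes
AMOUNT_RE = re.compile(
    r'^(?P<desc>.+?)\s+(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>g|gr|kg|l|ml)\s*$',
    re.IGNORECASE,
)

def split_description(item):
    # Strip trailing whitespace first, then pull the amount and unit off the end
    match = AMOUNT_RE.match(item['p_desc'][0].strip())
    if match:
        item['p_desc'] = [match.group('desc')]
        amount = match.group('amount')
        item['p_amount'] = [int(amount) if amount.isdigit() else float(amount)]
        item['p_amount_type'] = [match.group('unit')]
    return item

print(split_description({'p_desc': ['Coffee 225 g '], 'p_price': ['8.00']}))
# {'p_desc': ['Coffee'], 'p_price': ['8.00'], 'p_amount': [225], 'p_amount_type': ['g']}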
Related
I have a sample code using google calendar API that requires data to be a string value. Instead of manually typing strings each time the data frame is changed I use df.iat. This is what I have so far and works fine for a single value.
# Create an event
event = {
    'summary': f'{df.iat[0,0]}',
    'description': f'{df.iat[0,1]}',
    'start': {
        'date': '2022-04-04',
    },
    'end': {
        'date': '2022-04-04',
    },
}
which prints out
summary = Order
description = Plastic Cups
date 04-04-2022
But I need to pull multiple values into one string. How do I get this to work properly?
For example, in the description
I want to do f'{df.iat[1,1]}',f'{df.iat[0,1]}',
which would print out description = 7000 Plastic Cups,
but I get errors using this, and I've tried df.iloc. I've also tried just a sample:
(("test"), (f'{df.iat[0,1]}'))
but this only prints the 'test' portion and not the df.iat string.
I've been stuck on this for hours; any help would be appreciated.
f-strings in Python, also known as Literal String Interpolation, can handle multiple variables at the same time. For example:
orderID = "012345"
orderName = "foo"
message = f"Your order {orderName} with orderID {orderID} was registered!"
If you print the aforementioned message variable:
Your order foo with orderID 012345 was registered!
For more info regarding this: PEP-0498
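Applied to your case, pulling multiple cells into one string just means placing both expressions inside a single f-string. A minimal sketch, assuming df is your DataFrame and that df.iat[1,1] holds the quantity and df.iat[0,1] the item name (the sample values below are made up to mirror the question):
import pandas as pd

# Hypothetical DataFrame mirroring the values shown in the question
df = pd.DataFrame([['Order', 'Plastic Cups'], [None, 7000]])

event = {
    'summary': f'{df.iat[0, 0]}',
    # Both cells interpolated into one string
    'description': f'{df.iat[1, 1]} {df.iat[0, 1]}',
    'start': {'date': '2022-04-04'},
    'end': {'date': '2022-04-04'},
}
print(event['description'])  # 7000 Plastic Cups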
I'm trying to parse some data as follows:
subject_data
{"72744387":{"retired":null,"Filename":"2021-07-18 23-16-26 frontlow.jpg"}}
{"72744485":{"retired":null,"Filename":"2021-07-21 07-39-57 frontlow.jpg"}}
{"72744339":{"retired":null,"Filename":"2021-07-17 04-55-03 frontlow.jpg"}}
I'd like to get the file name from all of this data, but I'd like to do so without using that first number, as these numbers are randomized and there are a lot of them. So far I have:
classifications['subject_data_json'] = [json.loads(q) for q in classifications.subject_data]
data = classifications['subject_data_json']
print(data[3])
This prints {'72744471': {'retired': None, 'Filename': '2021-07-21 04-11-45 frontlow.jpg'}}
But I'd like to print just the Filename for each of the data sets. print(data[3]['Filename']) fails, and I'm not sure how to get the information without using the number.
I'd go with a nested comprehension:
print([v['Filename'] for i in data for k, v in i.items()])
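For instance, a self-contained sketch with the rows from your sample, using a plain list in place of the parsed column (iterating over the Series works the same way):
data = [
    {"72744387": {"retired": None, "Filename": "2021-07-18 23-16-26 frontlow.jpg"}},
    {"72744485": {"retired": None, "Filename": "2021-07-21 07-39-57 frontlow.jpg"}},
    {"72744339": {"retired": None, "Filename": "2021-07-17 04-55-03 frontlow.jpg"}},
]

# Ignore the random numeric keys and pull out every Filename
filenames = [v['Filename'] for i in data for k, v in i.items()]
print(filenames)
# ['2021-07-18 23-16-26 frontlow.jpg', '2021-07-21 07-39-57 frontlow.jpg', '2021-07-17 04-55-03 frontlow.jpg']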
My goal is to sort millions of logs by timestamp that I receive out of Elasticsearch.
Example logs:
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:00:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:01:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:02:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:04:09.000Z"}
Unfortunately, I am not able to get all the logs sorted out of Elastic. It seems like I have to do it by myself.
Approaches I have tried to get the data sorted out of elastic:
es = Search(index="somelogs-*").using(client).params(preserve_order=True)
for hit in es.scan():
print(hit['#timestamp'])
Another approach:
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.scan()
)
So I am looking for a way to sort these logs by myself or directly through Elasticsearch. Currently, I am saving all the data in a local 'logs.json', and it seems to me I have to iterate over it and sort it myself.
You should definitely let Elasticsearch do the sorting, then return the data to you already sorted.
The problem is that you are using .scan(). It uses Elasticsearch's scan/scroll API, which unfortunately only applies the sorting params on each page/slice, not the entire search result. This is noted in the elasticsearch-dsl docs on Pagination:
Pagination
...
If you want to access all the documents matched by your query you can
use the scan method which uses the scan/scroll elasticsearch API:
for hit in s.scan():
print(hit.title)
Note that in this case the results won’t be sorted.
(emphasis mine)
Using pagination is definitely an option especially when you have a "millions of logs" as you said. There is a search_after pagination API:
Search after
You can use the search_after parameter to retrieve the next page of
hits using a set of sort values from the previous page.
...
To get the first page of results, submit a search request with a sort
argument.
...
The search response includes an array of sort values for
each hit.
...
To get the next page of results, rerun the previous search using the last hit’s sort values as the search_after argument. ... The search’s query and sort arguments must remain unchanged. If provided, the from argument must be 0 (default) or -1.
...
You can repeat this process to get additional pages of results.
(omitted the raw JSON requests since I'll show a sample in Python below)
Here's a sample how to do it with elasticsearch-dsl for Python. Note that I'm limiting the fields and the number of results to make it easier to test. The important parts here are the sort and the extra(search_after=).
search = Search(using=client, index='some-index')
# The main query
search = search.extra(size=100)
search = search.query('range', **{'#timestamp': {'gte': '2020-12-29T09:00', 'lt': '2020-12-29T09:59'}})
search = search.source(fields=('#timestamp', ))
search = search.sort({
'#timestamp': {
'order': 'desc'
},
})
# Store all the results (it would be better to be wrap all this in a generator to be performant)
hits = []
# Get the 1st page
results = search.execute()
hits.extend(results.hits)
total = results.hits.total
print(f'Expecting {total}')
# Get the next pages
# Real use-case condition should be "until total" or "until no more results.hits"
while len(hits) < 1000:
print(f'Now have {len(hits)}')
last_hit_sort_id = hits[-1].meta.sort[0]
search = search.extra(search_after=[last_hit_sort_id])
results = search.execute()
hits.extend(results.hits)
with open('results.txt', 'w') as out:
for hit in hits:
out.write(f'{hit["#timestamp"]}\n')
That yields already sorted data:
# 1st 10 lines
2020-12-29T09:58:57.749Z
2020-12-29T09:58:55.736Z
2020-12-29T09:58:53.627Z
2020-12-29T09:58:52.738Z
2020-12-29T09:58:47.221Z
2020-12-29T09:58:45.676Z
2020-12-29T09:58:44.523Z
2020-12-29T09:58:43.541Z
2020-12-29T09:58:40.116Z
2020-12-29T09:58:38.206Z
...
# 250-260
2020-12-29T09:50:31.117Z
2020-12-29T09:50:27.754Z
2020-12-29T09:50:25.738Z
2020-12-29T09:50:23.601Z
2020-12-29T09:50:17.736Z
2020-12-29T09:50:15.753Z
2020-12-29T09:50:14.491Z
2020-12-29T09:50:13.555Z
2020-12-29T09:50:07.721Z
2020-12-29T09:50:05.744Z
2020-12-29T09:50:03.630Z
...
# 675-685
2020-12-29T09:43:30.609Z
2020-12-29T09:43:30.608Z
2020-12-29T09:43:30.602Z
2020-12-29T09:43:30.570Z
2020-12-29T09:43:30.568Z
2020-12-29T09:43:30.529Z
2020-12-29T09:43:30.475Z
2020-12-29T09:43:30.474Z
2020-12-29T09:43:30.468Z
2020-12-29T09:43:30.418Z
2020-12-29T09:43:30.417Z
...
# 840-850
2020-12-29T09:43:27.953Z
2020-12-29T09:43:27.929Z
2020-12-29T09:43:27.927Z
2020-12-29T09:43:27.920Z
2020-12-29T09:43:27.897Z
2020-12-29T09:43:27.895Z
2020-12-29T09:43:27.886Z
2020-12-29T09:43:27.861Z
2020-12-29T09:43:27.860Z
2020-12-29T09:43:27.853Z
2020-12-29T09:43:27.828Z
...
# Last 3
2020-12-29T09:43:25.878Z
2020-12-29T09:43:25.876Z
2020-12-29T09:43:25.869Z
There are some considerations on using search_after as discussed in the API docs:
Use a Point In Time or PIT parameter
If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.
You need to first make a POST request to get a PIT ID
Then add an extra 'pit': {'id': xxxx, 'keep_alive': '5m'} parameter to every request
Make sure to use the PIT ID from the last response
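A rough sketch of how that PIT flow could look with the Python clients, assuming elasticsearch-py 7.10+ (open_point_in_time / close_point_in_time) and the same client and elasticsearch-dsl Search as above; treat the exact call names as assumptions to verify against your client version:
# Open a point in time over the index and keep it alive for 5 minutes
pit = client.open_point_in_time(index='some-index', keep_alive='5m')
pit_id = pit['id']

# Attach the PIT to the search body; a search that uses a PIT must not name an index
search = Search(using=client)
search = search.extra(pit={'id': pit_id, 'keep_alive': '5m'})
search = search.sort({'#timestamp': {'order': 'desc'}})

results = search.execute()
# Always reuse the (possibly updated) PIT ID returned by the last response
pit_id = results.to_dict().get('pit_id', pit_id)

# Release the PIT when you are done paging
client.close_point_in_time(body={'id': pit_id})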
Use a tiebreaker
We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits.
This would depend on your Document schema
# Add some ID as a tiebreaker to the `sort` call
search = search.sort(
{'#timestamp': {
'order': 'desc'
}},
{'some.id': {
'order': 'desc'
}}
)
# Include both the sort ID and the some.ID in `search_after`
last_hit_sort_id, last_hit_route_id = hits[-1].meta.sort
search = search.extra(search_after=[last_hit_sort_id, last_hit_route_id])
Thank you Gino Mempin. It works!
But I also figured out that a simpler change does the same job:
by adding .params(preserve_order=True), Elasticsearch will sort all the data.
es = Search(index="somelog-*").using(client)
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.params(preserve_order=True)
.scan()
)
I'm matching two collections residing in 2 different databases against a criterion and creating a new collection for the records that match it.
The code below works with a simple criterion, but I need a different one.
Definitions
function insertBatch(collection, documents) {
var bulkInsert = collection.initializeUnorderedBulkOp();
var insertedIds = [];
var id;
documents.forEach(function(doc) {
id = doc._id;
// Insert without raising an error for duplicates
bulkInsert.find({_id: id}).upsert().replaceOne(doc);
insertedIds.push(id);
});
bulkInsert.execute();
return insertedIds;
}
function moveDocuments(sourceCollection, targetCollection, filter, batchSize) {
print("Moving " + sourceCollection.find(filter).count() + " documents from " + sourceCollection + " to " + targetCollection);
var count;
while ((count = sourceCollection.find(filter).count()) > 0) {
print(count + " documents remaining");
sourceDocs = sourceCollection.find(filter).limit(batchSize);
idsOfCopiedDocs = insertBatch(targetCollection, sourceDocs);
targetDocs = targetCollection.find({_id: {$in: idsOfCopiedDocs}});
}
print("Done!")
}
Call
var db2 = new Mongo("<URI_1>").getDB("analy")
var db = new Mongo("<URI_2>").getDB("clone")
var readDocs= db2.coll1
var writeDocs= db.temp_coll
var Urls = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url" ,{})
var filter= {"Url": {$in: Urls }}
moveDocuments(readDocs, writeDocs, filter, 10932)
In a nutshell, my criterion is the distinct "Url" string. Instead, I want the Url + Date string to be my criterion. There are 2 problems:
In one collection the date is in the format ISODate("2016-03-14T13:42:00.000+0000"), and in the other collection the date format is "2018-10-22T14:34:40Z". So how do I make them uniform so that they match each other?
Assuming we get a solution to 1., and we create a new array of concatenated strings UrlsAndDate instead of Urls, how would we create a similar concatenated field on the fly and match against it in the other collection?
For example: (non-functional code!)
var UrlsAndDate = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url"+"formated_Date" ,{})
var filter= {"Url"+"formated_Date": {$in: Urls }}
readDocs.find(filter)
...and do the same stuff as above!
Any suggestions?
I have a brute-force solution, but it isn't feasible!
Problem:
I want to merge 2 collections, mycoll & coll1. Both have fields named Url and Date. mycoll has 35000 docs and coll1 has 4.7M docs (16+ GB), which can't be loaded into memory.
Algorithm, written using the pymongo client:
iterate over mycoll
create a source string "url+common_date_format"
try to find a match in coll1 (since coll1 is big, I can't load it into memory and treat it as a dictionary, so I'm iterating over each doc in this collection again and again):
iterate over coll1
create a destination string "url+common_date_format"
if src_string == dest_string
insert this doc in a new collection called temp_coll
This is a terrible algorithm, since O(35000*4.7M) would take ages to complete! If I could load the 4.7M docs into memory, the run time would reduce to O(35000), and that's doable!
Any suggestions for another algorithm!
The first thing I would do is create a compound index with {url: 1, date: 1} on both collections, if it doesn't already exist. Say collection A has 35k docs and collection B has 4.7M docs. We can't load all 4.7M docs into memory. You are iterating over the cursor object of B in the inner loop; I assume that once that cursor object is exhausted you query the collection again.
One observation: why iterate over 4.7M docs each time? Instead of fetching all 4.7M docs and then matching, we could just fetch the docs that match the url and date of each doc in A. Converting the a_doc date to the b_doc format and then querying is better than converting both to a common format, which is what forces the 4.7M-doc iteration. Read the pseudo code below.
a_docs = a_collection.find()
c_docs = []
for doc in a_docs:
    url = doc['url']
    # Convert the A-side date into B's format so the query can hit the compound index
    date = convert_to_b_collection_date_format(doc['date'])
    query = {'url': url, 'date': date}
    b_doc = b_collection.find_one(query)
    if b_doc is not None:
        c_docs.append(b_doc)
c_docs = convert_c_docs_to_required_format(c_docs)
c_collection.insert_many(c_docs)
Above, we loop over the 35k docs and run one indexed query per doc. Given that the indexes are created already, each lookup takes logarithmic time, which seems reasonable.
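For the compound index and the date-format conversion mentioned in the question, here is a minimal pymongo sketch; the '<URI_2>' placeholder, the clone database, and the Url/Date field names are taken from the question, and to_datetime is a hypothetical helper:
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient('<URI_2>')
db = client['clone']

# Compound index on Url + Date so each lookup in the loop above is an index scan
db['coll1'].create_index([('Url', ASCENDING), ('Date', ASCENDING)])

# Hypothetical converter: parse the string-dated collection's "2018-10-22T14:34:40Z"
# into a timezone-aware datetime, which pymongo stores and compares like an ISODate
def to_datetime(date_str):
    return datetime.strptime(date_str, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)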
import pandas as pd

x = pd.DataFrame([[5.75, 7.32], [1000000, -2]])
def money(val):
"""
Takes a value and returns properly formatted money
"""
if val < 0:
return "$({:>,.0f})".format(abs((val)))
else:
return "${:>,.0f}".format(abs(val))
x.style.format({0: lambda x: money(x),
1: lambda x: money(x)
})
I am trying to format currency in the pandas Jupyter display with Excel accounting formatting, which would look like the below.
I was most successful with the above code, but I also tried a myriad of CSS and HTML things; I'm not well versed in those languages, so they didn't really work at all.
Your output looks like you are using the HTML display in the Jupyter notebook, so you will need to set pre for the white-space style, because HTML collapses multiple whitespace, and use a monospace font, e.g.:
styles = {
'font-family': 'monospace',
'white-space': 'pre'
}
x_style = x.style.set_properties(**styles)
Now to format the float, a simple right justified with $ could look like:
x_style.format('${:>10,.0f}')
This isn't quite right because you want to convert the negative number to (2), and you can do this with nested formats, separating out the number formatting from justification so you can add () if negative, e.g.:
x_style.format(lambda f: '${:>10}'.format(('({:,.0f})' if f < 0 else '{:,.0f}').format(f)))
Note: this is fragile in the sense that it assumes 10 is sufficient width, vs. Excel, which dynamically left-justifies the $ to the maximum width of all the values in that column.
An alternative way to do this would be to extend string.Formatter to implement the accounting format logic.
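For completeness, a rough sketch of that alternative using Python's string.Formatter; the 'acct' format spec and the AccountingFormatter class are invented for this example, not built-ins:
import string

class AccountingFormatter(string.Formatter):
    """Render numbers accounting-style: $ 1,000,000 or $       (2) for negatives."""
    def format_field(self, value, format_spec):
        if format_spec == 'acct' and isinstance(value, (int, float)):
            # Parenthesize negatives, add thousands separators, right-justify after the $
            body = '({:,.0f})'.format(abs(value)) if value < 0 else '{:,.0f}'.format(value)
            return '${:>10}'.format(body)
        return super().format_field(value, format_spec)

fmt = AccountingFormatter()
print(fmt.format('{:acct}', 1000000))  # $ 1,000,000
print(fmt.format('{:acct}', -2))       # $       (2)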