Retrieve the last n days from a MongoDB collection using Python

I have the code below, which connects to a MongoDB database, selects the specified collection, flattens the documents, and exports them as a CSV.
My problem is that some of the collections in the MongoDB databases are huge, with thousands of documents, so what I am trying to do is filter the query down so that I only bring in data from the last 7 days.
from pymongo import MongoClient
import pandas as pd
import os, uuid, sys
import collections
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core._match_conditions import MatchConditions
from azure.storage.filedatalake._models import ContentSettings
from pandas import json_normalize
mongo_client = MongoClient("connstring")
db = mongo_client.nhdb
table = db.Report
document = table.find()
mongo_docs = list(document)
mongo_docs = json_normalize(mongo_docs)
mongo_docs.to_csv("Report.csv", sep = ",", index=False)
Any help will be much appreciated.
Note: I know a way to do it in Azure Data Factory using the expression below; however, I am not sure how to go about it in Python.
{"createdDatetime":{$gt: ISODate("#{adddays(utcnow(),-7)}")}}

In Python, create a datetime object to use as a filter; for example, this selects the last 7 days:
from datetime import datetime, timedelta
document = table.find({'createdDatetime': {'$gt': datetime.utcnow() - timedelta(days=7)}})
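Since the cutoff is just a Python datetime, the filter can be built and sanity-checked without a database connection. A minimal sketch (the helper name is illustrative, not part of PyMongo):

```python
from datetime import datetime, timedelta

def last_n_days_query(field, n):
    # Builds a PyMongo-style filter matching documents whose `field`
    # is within the last `n` days (UTC); this is the Python counterpart
    # of the ADF expression {"createdDatetime": {$gt: ISODate(...)}}.
    return {field: {'$gt': datetime.utcnow() - timedelta(days=n)}}

query = last_n_days_query('createdDatetime', 7)
print(query['createdDatetime'])  # {'$gt': <datetime 7 days ago>}
# Then: document = table.find(query) instead of table.find()
```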

Related

Reading from MongoDB and converting to a DataFrame

I am trying to read data from MongoDB and convert it into a DataFrame, but I am getting an error that says "("Could not convert ObjectId('620d3f43ae93743dbb6f6846') with type ObjectId: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column _id with type object')".
My code looks like this:
import streamlit as st
import pymongo
from pymongo import MongoClient
import pandas as pd
import json
import string
import io
import re
import time
import csv
from pandas import Timestamp
import certifi
import datetime
from pandas import json_normalize
connection = MongoClient("mongodb+srv://*****:****@cluster0.t4iwt.mongodb.net/testdb?retryWrites=true&w=majority", tlsCAFile=certifi.where())
db=connection["testdb"]
collection=db["test"]
cursor = collection.find()
entries=list(cursor)
entries[:]
df=pd.DataFrame(entries)
st.write(df)
Change the type of _id to str:
df = pd.DataFrame(entries)
df = df.astype({"_id": str})
st.write(df)
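Equivalently, you can stringify _id in the documents themselves before building the DataFrame. A minimal sketch, using a stand-in class in place of bson.ObjectId so it runs without a database:

```python
class FakeObjectId:
    # Stand-in for bson.ObjectId in this sketch; str() on a real
    # ObjectId returns its 24-character hex string in the same way.
    def __init__(self, hex_str):
        self.hex_str = hex_str
    def __str__(self):
        return self.hex_str

entries = [{'_id': FakeObjectId('620d3f43ae93743dbb6f6846'), 'score': 1}]

# Per document, this is what df.astype({"_id": str}) does column-wide.
cleaned = [{**doc, '_id': str(doc['_id'])} for doc in entries]
print(cleaned[0])  # {'_id': '620d3f43ae93743dbb6f6846', 'score': 1}
```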

How to load MongoDB databases with special characters in their names into a pandas DataFrame?

I'm trying to import MongoDB collection data into a pandas DataFrame. When the database name is simple, like 'admin', it loads into the DataFrame fine. However, when I try one of my required databases, named asdev-Admin (line 5), I get an empty DataFrame. Apparently the error is somehow related to the special character in the db name, but I don't know how to get around it. How do I resolve this?
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.asdev-Admin
collection = db.system.groups
data = pd.DataFrame(list(collection.find()))
print(data)
The error states: NameError: name 'Admin' is not defined
You can change db = client.asdev-Admin to db = client['asdev-Admin']. Python parses client.asdev-Admin as the subtraction client.asdev - Admin, which is why it raises NameError: name 'Admin' is not defined; dictionary-style access handles names that are not valid Python identifiers.

Extract Date from UUID

I have a list of UUIDs, and I'm trying to extract datetimes from them using the code below, which I found online, but it throws an error.
Is there any other way, or any change to this code, to extract them?
import uuid
import time_uuid
my_uuid = uuid.UUID('{7a3febe0-3fcc-3a28-bdab-c29438888d12}')
ts = time_uuid.TimeUUID(bytes=my_uuid.bytes).get_timestamp()
Error: No module named time_uuid
Expected: Datetime
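The immediate error just means the package isn't installed (pip install time-uuid should fix it). Note, though, that only version-1 UUIDs embed a timestamp; the example UUID above is version 3 (the first digit of its third group), which is a hash and carries no datetime. For version-1 UUIDs the standard library is enough; a sketch, with the epoch arithmetic spelled out:

```python
import uuid
from datetime import datetime, timedelta

# Version-1 UUID timestamps count 100-nanosecond intervals since
# the Gregorian reform date, 1582-10-15.
GREGORIAN_EPOCH = datetime(1582, 10, 15)

def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

print(uuid1_to_datetime(uuid.uuid1()))  # roughly the current UTC time
```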

SQL with regular expressions in Python

I am running this query in Python, and I get the error "Invalid string literal" in the regular-expression part. I know what this regex does; I'm just not sure what syntax is missing here. Any help would be appreciated.
import pandas as pd
import numpy as np
import datetime
from time import gmtime, strftime
import smtplib
import sys
from IPython.core.display import HTML
PROJECT = 'server'
queryString = '''
SELECT
mn as mName,
dt as DateTime,
ip as LocalIPAddress,
REGEXP_EXTRACT(path, 'ActName:([\s\S\w\W]*?)ActDomain:') AS ActName
FROM Agent.Logs
'''
You're missing an r prefix; use a raw string so Python doesn't try to process the backslashes itself:
REGEXP_EXTRACT(path, r'ActName:([\s\S\w\W]*?)ActDomain:') AS ActName
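You can verify the raw-string version against Python's own re module before embedding it in the SQL; the sample log line here is made up:

```python
import re

# With the r prefix, the backslashes reach the regex engine intact;
# without it, Python tries to interpret '\s' as a string escape.
pattern = r'ActName:([\s\S\w\W]*?)ActDomain:'

log_line = 'ActName: alice ActDomain: example.com'
match = re.search(pattern, log_line)
print(match.group(1).strip())  # -> alice
```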

Python - Library to Import Libraries

I am developing a project which has about a dozen different files. At the top of each file I have almost identical lines which import the same libraries and initialize a connection to my DB:
import re
import urllib2
import datetime
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.sql import *
from sqlalchemy.orm.collections import *
from table_def import Team, Player, Box_Score, Game, Name_Mapper
from datetime import timedelta
from bs4 import BeautifulSoup as bs
from datetime import date, datetime, timedelta
import numpy as np
import argparse
engine = create_engine('sqlite:///ncaaf.db', echo=False)
md = MetaData(bind=engine)
Session = sessionmaker(bind=engine)
s = Session()
teams_table = Table("teams", md, autoload=True)
games_table = Table("games", md, autoload=True)
box_scores_table = Table("box_scores", md, autoload=True)
players_table = Table("players", md, autoload=True)
names_table = Table("names", md, autoload=True)
Can I make a module to import all these modules and to initialize this DB connection? Is that standard? Or dumb for some reason I am not realizing?
When you import something into your module, it becomes available as if it were declared in the module itself. So you can do what you want like this:
In common_imports.py:
from datetime import date, datetime, timedelta
import numpy as np
import argparse
...
In main_module.py:
from common_imports import *
a = np.array([]) # works fine
However, this is not recommended, since explicit is better than implicit. If you do this, someone else (or even future you) won't understand where all these imported names come from. Instead, try either to organize your imports better or to decompose your module into several. For example, in your import list I see argparse, SQL stuff, and numpy, and I can't imagine a single module that would need all of these unrelated libraries.
If you create a package, you can import them in the __init__.py file, although I would suggest leaving them where they are, to increase code readability.
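The re-export mechanism itself can be demonstrated without creating files, by assembling the shared module at runtime (the module name matches the common_imports.py sketch above; the runtime construction is purely for illustration):

```python
import sys
import types

# Build 'common_imports' in memory instead of as a file, purely to
# show that star-importing it re-exports whatever it imported.
common = types.ModuleType('common_imports')
exec('from datetime import date, datetime, timedelta\n'
     'import argparse', common.__dict__)
sys.modules['common_imports'] = common

# A consumer module's `from common_imports import *` now works:
consumer = {}
exec('from common_imports import *', consumer)
print('timedelta' in consumer, 'argparse' in consumer)  # True True
```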
