I'm trying a simple query expression:
df = db.persons.find_one({"name.first": "victor"})
That works fine. When I try the same query with the find method, the result appears to be empty.
df = db.persons.find({"name.first":"victor"})
My objective is a query expression with two conditions, "name.first": "victor" and "name.last": "pedersen". I also tried the $and operator.
df = db.persons.find({"$and": [{"name.first": "victor"}, {"name.last": "pedersen"}]})
Both queries ran without problems in Compass.
While find_one() returns a dict if a matching document exists, find() returns a cursor (an iterator), since there may be more than one matching JSON document.
So if a query returns a result with find_one(), it will also return a result with find(), but this time you have to access it in a loop.
df = db.persons.find({"name.first": "victor"})
for df1 in df:
print(df1) # here you get the result
I think you can also do this:
df = list(db.persons.find({"name.first": "victor"}))
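To cover the original two-field requirement: both conditions can go in a single filter document (an implicit AND), or in the explicit $and form from the question. A minimal sketch, assuming the same db.persons collection and field values as above:

# Both filters are equivalent: a multi-key document is an implicit AND.
query = {"name.first": "victor", "name.last": "pedersen"}
# query = {"$and": [{"name.first": "victor"}, {"name.last": "pedersen"}]}

for doc in db.persons.find(query):
    print(doc)

# Or materialize the cursor in one go:
results = list(db.persons.find(query))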
I'm using PySpark to transform data in a DataFrame. The old extract used this SQL line:
case when location_type = 'SUPPLIER' then SUBSTRING(location_id,1,length(location_id)-3)
I brought in the data, loaded it into a DataFrame, and then tried to do the transform using:
df = df.withColumn("location_id", F.when(df.location_type == "SUPPLIER",
F.substring(df.location_id, 1, length(df.location_id) - 3))
.otherwise(df.location_id))`
The substring method takes an int as the third argument, but the length() function returns a Column object. I had no luck trying to cast it and haven't found a method that accepts the Column. I also tried the expr() wrapper but could not make it work.
The supplier IDs look like 12345-01; the transform needs to strip the -01.
As you mentioned, you can use expr so that substring can take indices that come from other columns, like this:
df = df.withColumn("location_id",
F.when(df.location_type == "SUPPLIER",
F.expr("substring(location_id, 1, length(location_id) - 3)")
).otherwise(df.location_id)
)
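A minimal, self-contained sketch to verify the behaviour, assuming a toy DataFrame with the same column names (location_id, location_type) as the question; the sample values are made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data matching the shape described in the question (assumed values).
df = spark.createDataFrame(
    [("12345-01", "SUPPLIER"), ("99999", "WAREHOUSE")],
    ["location_id", "location_type"],
)

df = df.withColumn(
    "location_id",
    F.when(df.location_type == "SUPPLIER",
           F.expr("substring(location_id, 1, length(location_id) - 3)")
    ).otherwise(df.location_id),
)

df.show()  # the SUPPLIER row becomes 12345, the other row is unchanged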
I am trying to get the result of the date_add function in PySpark, but when I use it, it always returns a Column. To see the actual result I have to add it as a column to a DataFrame, but I want the result stored in a variable. How can I store the resulting date?
df = spark.createDataFrame([('2015-04-08',)], ['dt'])
r = date_add(df.dt, 1)
print(r)
output:- Column<'date_add(dt, 1)'>
But I want output like below
output:- datetime.date(2015, 4, 9)
or
'2015-04-09'
date_add has to be used within a withColumn. If you want the desired output, consider a non-Spark approach using datetime and timedelta.
Alternatively, if your use case requires Spark, use the collect method like so:
r = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
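A minimal sketch of both options, assuming an existing SparkSession named spark as in the question; note that collect() returns a list of Row objects, so the date itself is at r[0][0] (or r[0]['new_col']):

from datetime import date, timedelta
from pyspark.sql.functions import col, date_add

# Option 1: no Spark needed if the input date is already a Python value.
d = date(2015, 4, 8) + timedelta(days=1)   # datetime.date(2015, 4, 9)

# Option 2: compute in Spark, then pull the single value back to the driver.
df = spark.createDataFrame([('2015-04-08',)], ['dt'])
r = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
result = r[0][0]                           # datetime.date(2015, 4, 9)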
I have a df with an "isbn13" column. I also have a function called "isbnlib.meta". This function is from the library isbnlib. I would like to run the function on each row of the "isbn13" column. I'm using the apply function to do that.
df['publisher'] = df['isbn13'].apply(isbnlib.meta)
The issue is that the result for each isbn13 is a dictionary with various fields such as Title, Author, Publisher, etc. I only want the "Publisher" value from that dictionary written into my DataFrame.
How do I return only the "Publisher" value into the DataFrame from the dictionary the function returns?
Thank you in advance.
I suppose your isbnlib.meta() returns a dictionary based on the value in your isbn13 column. If so, you can use a lambda function in the same apply:
df['publisher'] = df['isbn13'].apply(lambda x: isbnlib.meta(x).get('Publisher', None))
In this case, if your dict doesn't have a Publisher key, it will return the default value None.
I am unfamiliar with the isbnlib library but assuming that isbnlib.meta takes in a string and returns a dictionary, you can do:
df['publisher'] = df['isbn13'].apply(lambda x: isbnlib.meta(x)['Publisher'])
Using a lambda function inside .apply() can be very useful for simple tasks like this one.
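If some lookups may fail (network errors, unknown ISBNs), here is a slightly more defensive sketch, assuming isbnlib.meta returns a dict with a 'Publisher' key as in the answers above; the helper name get_publisher is just for illustration:

import isbnlib

def get_publisher(isbn):
    # Returns None instead of raising when the metadata lookup fails.
    try:
        return isbnlib.meta(isbn).get('Publisher')
    except Exception:
        return None

df['publisher'] = df['isbn13'].apply(get_publisher)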
I'm trying to apply the following code (minimal example) to my 2-million-row DataFrame, but for some reason .apply passes more than one row to the function and breaks my code. I'm not sure what changed, but the code did run before.
def function(row):
    return [row["clm1"], row["clm2"]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1)
Does anyone have an idea, or has anyone seen a similar issue?
Important: without swifter everything works fine, but it is too slow given the number of rows.
This should work:

def function(row_different_name):
    return [row_different_name["clm1"], row_different_name["clm2"]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1)

Try changing the name of the function parameter row to some other name.
Based on this previous answer, what you are trying to do should work if you change it like this:

def function(row):
    return [row["clm1"], row["clm2"]]

res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function, axis=1, result_type='expand')

This is because apply on a single column (a Series) lacks result_type as an argument, while apply on a DataFrame has it.
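A small self-contained sketch of what result_type='expand' does, shown with plain pandas (swifter mirrors the pandas apply signature, so the same keyword applies); the column values are made up for the demo:

import pandas as pd

df = pd.DataFrame({"clm1": [1, 2], "clm2": ["a", "b"], "other": [True, False]})

def function(row):
    return [row["clm1"], row["clm2"]]

res = pd.DataFrame()
# result_type='expand' turns each returned list into separate columns,
# so the assignment to two target columns lines up.
res[["clm1", "clm2"]] = df.apply(function, axis=1, result_type='expand')
print(res)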
axis=1 means column, so it will insert it vertically. Is that what you want? Try removing axis=1
I have a table called jobs with a column called job_num.
How do I return the maximum value of the integers in that column?
I have tried
result = Job.select(max(Job.job_num))
I have also tried a few different combinations such as
result = Job.select(Job.job_num).max()
I have also checked the peewee docs.
Can anyone help, please?
You can use "fn.MAX" to apply the SQL MAX function. The "scalar()" method returns a single, scalar result value:
result = Job.select(fn.MAX(Job.job_num)).scalar()
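A minimal sketch showing the needed import, assuming a Job model with an integer job_num field as described in the question:

from peewee import fn

# fn.MAX maps to SQL's MAX(); scalar() unwraps the single aggregate value.
max_job_num = Job.select(fn.MAX(Job.job_num)).scalar()
print(max_job_num)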