I am trying to get the result of the date_add function in PySpark, but it always returns a Column object. To see the actual value I would have to add the result as a column to a DataFrame, but I want the resulting date stored in a variable. How can I do that?
df = spark.createDataFrame([('2015-04-08',)], ['dt'])
r = date_add(df.dt, 1)
print(r)
output:- Column<'date_add(dt, 1)'>
But I want output like the below
output:- datetime.date(2015, 4, 9)
or
'2015-04-09'
date_add has to be used inside a DataFrame expression such as withColumn, so on its own it only gives you a Column. If you just want the resulting date, consider a non-Spark approach using datetime and timedelta.
Alternatively, if your use case requires Spark, use the collect method like so:
r = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
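For illustration, a rough sketch of both options (imports shown for completeness, using the question's df); note that collect() returns a list of Row objects, so one more indexing step is needed to get the date itself:

from datetime import date, timedelta
from pyspark.sql.functions import col, date_add

# Option 1: plain Python, no Spark needed for a single date
d = date(2015, 4, 8) + timedelta(days=1)     # datetime.date(2015, 4, 9)

# Option 2: Spark, then pull the value out of the collected Rows
rows = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
d = rows[0]['new_col']                       # datetime.date(2015, 4, 9)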
Using PySpark to transform data in a DataFrame. The old extract used this SQL line:
case when location_type = 'SUPPLIER' then SUBSTRING(location_id,1,length(location_id)-3)
I brought in the data and loaded it into a DF, then was trying to do the transform using:
df = df.withColumn("location_id", F.when(df.location_type == "SUPPLIER",
                                         F.substring(df.location_id, 1, length(df.location_id) - 3))
                                   .otherwise(df.location_id))
The substring method takes an int as the third argument, but the length() method gives a Column object. I had no luck trying to cast it and haven't found a method that would accept the Column. I also tried using the expr() wrapper but again could not make it work.
The supplier IDs look like 12345-01. The transform needs to strip the -01.
As you mentioned, you can use expr so that substring can take indices computed from other columns, like this:
df = df.withColumn(
    "location_id",
    F.when(df.location_type == "SUPPLIER",
           F.expr("substring(location_id, 1, length(location_id) - 3)")
    ).otherwise(df.location_id)
)
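If you want to sanity-check this on a toy DataFrame, here is a minimal sketch (the sample rows below are invented to match the 12345-01 pattern from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("12345-01", "SUPPLIER"), ("99999", "WAREHOUSE")],
    ["location_id", "location_type"],
)
df = df.withColumn(
    "location_id",
    F.when(df.location_type == "SUPPLIER",
           F.expr("substring(location_id, 1, length(location_id) - 3)")
    ).otherwise(df.location_id),
)
df.show()   # the SUPPLIER row becomes 12345, the other row is untouched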
I want to output JSON in a format like this:
{"553588913747808256":"rumour","524949003834634240":"rumour","553221281181859841":"rumour","580322346508124160":"non-rumour","544307417677189121":"rumour"}
Here, I have a df_prediction_with_id DataFrame whose index is set to id_str:
df_prediction_with_id

                    rumor_or_not
id_str
552800070199148544    non-rumour
544388259359387648    non-rumour
552805970536333314    non-rumour
525071376084791297        rumour
498355319979143168    non-rumour
What I've tried is to use DataFrame.to_json.
json = df_prediction_with_id.to_json(orient='index')
What I've got is:
{"552813420136128513":{"rumor_or_not":"non-rumour"},"544340943965409281":{"rumor_or_not":"non-rumour"}}
Is there any way that I could directly use the value in the column as the value? Thanks.
You can simply select the column and call .to_json():
print(df_prediction_with_id["rumor_or_not"].to_json())
Prints:
{"552800070199148544":"non-rumour","544388259359387648":"non-rumour","552805970536333314":"non-rumour","525071376084791297":"rumour","498355319979143168":"non-rumour"}
I am new to Python and I hope that someone can help me get a handle on map.
I have a function myfunc which takes a value from a dataframe column and, for each value, makes an API call that returns JSON, which I then convert to a DataFrame. Below is pseudo-code for what I'm doing.
For example
import json
import requests
import pandas as pd

def myfunc(factor):
    # This is the API we are posting to
    str_url = "www.foourl.com"
    # This is the factor we post to try and get the result
    request_string = [{"foo": factor}]
    header = {"content-type": "application/json", "AUTH-TOKEN": "Foo"}
    # We post it using our authorization and token
    response = requests.post(str_url, data=json.dumps(request_string), headers=header)
    # Convert the response to JSON format and then to a dataframe
    results_json = response.json()
    return pd.json_normalize(results_json)
I then execute my function using the code below, which works perfectly. I can access each result using result[1] to get the dataframe results for factor[1], result[2] for factor[2], and so on. map returns a <class 'pandas.core.series.Series'>.
# Import the excel sheet and get the factors
df = pd.read_excel('ref_data/search_factors.xlsx')
test = df['factor_number']

# Run the API for every factor
# Then collapse the list into a dataframe
result = test.map(myfunc)
My question is: since all the results are DataFrames with exactly the same structure (5 columns, all with the same names), is there a way to collapse everything into a single DataFrame after all the iterations from map?
I know, for example, that in R you can use bind_rows in dplyr or something like map_df to do the same thing. Is there an equivalent in Python?
Yes, in pandas we have concat:
df = pd.concat(result.tolist())
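If the original row indexes do not matter, ignore_index=True gives the combined frame a clean 0..n-1 index; a minimal sketch:

import pandas as pd

frames = result.tolist()                        # list of same-shaped DataFrames from map
combined = pd.concat(frames, ignore_index=True)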
I have a function that prints the sum along a column of a pandas DataFrame after filtering on some rows (to be defined), along with the percentage this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object selects indexes of an indexable object the same way : does; slice(None) is the slice equivalent of :, so df[slice(None)] selects all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
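A slice cannot be combined with other filters using &, though. If composability matters, an all-True boolean mask works as a do-nothing filter; a minimal sketch (some_col is just a placeholder for whatever second condition you combine it with):

import pandas as pd

filter_f1 = pd.Series(True, index=df.index)    # keeps every row
df[filter_f1]                                  # same rows as df
df[filter_f1 & (df['some_col'] > 0)]           # composes with other boolean filters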
This is a way to select all rows:
df.iloc[range(0, len(df))]
and so is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's a function called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
I'm trying a simple query expression:
df = db.persons.find_one({"name.first": "victor"})
That works fine. But when I try the same query using the find method, the return is empty.
df = db.persons.find({"name.first":"victor"})
My objective is a query expression with 2 arguments, "name.first": "victor" and "name.last": "pedersen". I also tried the $and operator.
df = db.persons.find({"$and": [{"name.first": "victor"}, {"name.last": "pedersen"}]})
With both queries, using Compass, I had no problem.
While find_one() returns a dict if data is available, find() returns a cursor (an iterator), since there may be more than one JSON document.
So if you get a result from a query with find_one(), you will also get a result with find(), but this time you have to access it in a for loop.
df = db.persons.find({"name.first": "victor"})
for df1 in df:
    print(df1)  # here you get the result
I think you can also do this:
df = list(db.persons.find({"name.first": "victor"}))
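For the original two-condition goal, $and is not strictly needed either: listing both fields in one query document performs an implicit AND. A sketch, again materializing the cursor with list():

results = list(db.persons.find({"name.first": "victor", "name.last": "pedersen"}))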