I want to output JSON in a format like this:
{"553588913747808256":"rumour","524949003834634240":"rumour","553221281181859841":"rumour","580322346508124160":"non-rumour","544307417677189121":"rumour"}
Here, I have a df_prediction_with_id DataFrame with its index set to id_str:
df_prediction_with_id
rumor_or_not
id_str
552800070199148544 non-rumour
544388259359387648 non-rumour
552805970536333314 non-rumour
525071376084791297 rumour
498355319979143168 non-rumour
What I've tried is to use DataFrame.to_json.
json = df_prediction_with_id.to_json(orient='index')
What I've got is:
{"552813420136128513":{"rumor_or_not":"non-rumour"},"544340943965409281":{"rumor_or_not":"non-rumour"}}
Is there any way to use the column value directly as the JSON value? Thanks.
You can simply select the column and call .to_json():
print(df_prediction_with_id["rumor_or_not"].to_json())
Prints:
{"552800070199148544":"non-rumour","544388259359387648":"non-rumour","552805970536333314":"non-rumour","525071376084791297":"rumour","498355319979143168":"non-rumour"}
Related
I am trying to get the result of the date_add function in PySpark, but it always returns a Column. To see the actual value, I have to add the result as a column to a DataFrame, but I want the result stored in a variable. How can I store the resulting date?
from pyspark.sql.functions import date_add

df = spark.createDataFrame([('2015-04-08',)], ['dt'])
r = date_add(df.dt, 1)
print(r)
output:- Column<'date_add(dt, 1)'>
But I want output like below:
output:- datetime.date(2015, 4, 9)
or
'2015-04-09'
date_add has to be used within withColumn; on its own it only returns a Column expression. If you just want the value, consider a non-Spark approach using datetime and timedelta.
Alternatively, if your use case requires Spark, use the collect method like so:
from pyspark.sql.functions import col, date_add
r = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
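A sketch of both options, assuming the df from the question is available (the names new_col, plain_result and spark_result are just illustrative):
from datetime import date, timedelta
from pyspark.sql.functions import col, date_add

# Option 1: no Spark needed if the input is an ordinary Python date.
plain_result = date(2015, 4, 8) + timedelta(days=1)
print(plain_result)  # 2015-04-09

# Option 2: compute the column in Spark, collect(), and pull the value
# out of the first Row; after collect() it is a datetime.date object.
rows = (
    df.withColumn("new_col", date_add(col("dt"), 1))
      .select("new_col")
      .collect()
)
spark_result = rows[0]["new_col"]
print(spark_result)  # datetime.date(2015, 4, 9)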
I need to convert a JSON dictionary to a Pandas DataFrame, but the embedding is tripping me up.
Here is basically what the JSON dict looks like.
{
"report": "{'name':{'data':[{'key 1':'value 1','key 2':'value 2'},{'key 1':'value 1','key 2':'value 2'}]}}"
}
In the DataFrame, I want the keys to be the column headers and values in the rows below them.
The extra layer of embedding is throwing me off somewhat from all the usual methods of doing this.
One tricky part is that 'name' will change each time I get this JSON dict, so I can't use an exact string value for 'name'.
Your JSON looks a bit odd: it looks more like a Python dict converted to a string. You can use ast.literal_eval (from the standard-library ast module) to convert it to a real dict, and then use pd.json_normalize to get it into a DataFrame:
import ast
import pandas as pd

j = ...  # the dict from the question, with the string under 'report'
parsed_json = ast.literal_eval(j['report'])
df = pd.json_normalize(parsed_json, record_path=[list(parsed_json)[0], 'data'])
Output:
>>> df
key 1 key 2
0 value 1 value 2
1 value 1 value 2
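Put together with the sample payload from the question, a runnable sketch of this approach:
import ast
import pandas as pd

# The payload from the question: the value under "report" is a string
# that looks like a Python dict (single quotes), not valid JSON.
j = {
    "report": "{'name':{'data':[{'key 1':'value 1','key 2':'value 2'},"
              "{'key 1':'value 1','key 2':'value 2'}]}}"
}

# Safely evaluate the string into a real dict, then normalize the records
# under the (unknown) top-level key.
parsed_json = ast.literal_eval(j["report"])
df = pd.json_normalize(parsed_json, record_path=[list(parsed_json)[0], "data"])
print(df)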
The error suggests that you're trying to index the strings (because the value under report is a string) using another string.
You just need ast.literal_eval to parse the string and the DataFrame constructor. If the 'name' key is unknown, you can iterate over the dict's values after parsing the string.
import ast
import pandas as pd
out = pd.DataFrame(y for x in ast.literal_eval(your_data['report']).values() for y in x['data'])
Output:
key 1 key 2
0 value 1 value 2
1 value 1 value 2
This is my code, where I need to access a certain tuple from the DataFrame df. Can you please help me with this? I can't find any answer regarding this issue.
import pandas as pd
import openpyxl
df_sheet_index = pd.read_excel("path/to/excel/file.xlsx")
df = df_sheet_index.itertuples()
for tuple in df:
    print(tuple)
This is the output
Pandas(Index=0, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=1, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=2, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
...
EDIT: As a general rule, you should use pandas' built-in indexing and lookup functions rather than iterating over the frame. It's more efficient and more readable.
But if you really want to access the tuples:
target_index = 10
for tu in df.itertuples():
    if tu[0] == target_index:
        print(tu)
More generally, each row behaves like a regular tuple, so you can access every element by its position: the index is tu[0], the first column tu[1], the second tu[2], and so on.
NOTE: do not use tuple as a variable name; it shadows the built-in tuple type and can create issues (on top of not being good practice).
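As a concrete example of the "use built-ins instead" advice, assuming df is the DataFrame itself rather than the itertuples() iterator from the question:
target_index = 10

# Label-based and position-based lookup both avoid scanning every row.
row_by_label = df.loc[target_index]      # row whose index label equals 10
row_by_position = df.iloc[target_index]  # row at integer position 10
print(row_by_label)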
If you are trying to get an element at a specific position, you can use .iloc[].
You give it the row and then the column:
df.iloc[-1]["column"]
This gets the value of that column in the last row.
For label-based access, use df.loc["row", "column"].
Note that df.loc[["row"]] returns a DataFrame, while df.loc["row"] returns a Series.
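A short illustration of those access patterns on a hypothetical frame:
import pandas as pd

# Hypothetical frame just to show the indexing behaviour.
df = pd.DataFrame({"column": [10, 20, 30]}, index=["a", "b", "c"])

print(df.iloc[-1]["column"])  # 30 -> last row, then the column
print(df.loc["b", "column"])  # 20 -> row label "b", column "column"

print(type(df.loc["b"]))      # Series (single label)
print(type(df.loc[["b"]]))    # DataFrame (list of labels)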
I have a dataframe as below,
df = pd.DataFrame({'URL_domains':[['wa.me','t.co','goo.gl','fb.com'],['tinyurl.com','bit.ly'],['test.in']]})
Here the column URL_domains holds a list of domains for each observation.
I would like to know the length of each observation's URL domain list, like this:
df['len_of_url_list'] = df['URL_domains'].map(len)
which gives the expected len_of_url_list column, so that case is fine.
In my case, however, these list observations are stored as strings, and running the same code on the string-typed column returns the length of each string rather than the length of the list.
How do I convert the strings back to lists in pandas?
Use ast.literal_eval, because eval is bad practice:
import ast
df['URL_domains'] = df['URL_domains'].map(ast.literal_eval)
df['len_of_url_list'] = df['URL_domains'].map(len)
Another option is eval, which also works here but is riskier on untrusted input:
df["URL_domains"] = df["URL_domains"].apply(eval)
I have a column whose values look like {'duration': 0, 'is_incoming': False}.
I want to fetch 0 and False out of this. How do I split it using Python (pandas)?
I tried: data["additional_data"] = data["additional_data"].apply(lambda x: ",".join(x.split(":")[:-1]))
I want two columns Duration and Incoming_Time
How do I do this?
You can try converting those strings to actual dicts:
from ast import literal_eval
Finally:
out=pd.DataFrame(df['additional_data'].astype(str).map(literal_eval).tolist())
Now if you print out, you will get your expected output.
If needed, use the join() method to attach it back to the original frame:
df = df.join(out)
Now if you print df, you will get your expected result.
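A self-contained sketch of that approach, built from the single value shown in the question and renaming the expanded columns to the desired names:
import pandas as pd
from ast import literal_eval

# Hypothetical one-row frame: additional_data holds a dict stored as a string.
df = pd.DataFrame({"additional_data": ["{'duration': 0, 'is_incoming': False}"]})

# Parse each string into a dict, expand into columns, rename, and join back.
out = pd.DataFrame(df["additional_data"].astype(str).map(literal_eval).tolist())
out = out.rename(columns={"duration": "Duration", "is_incoming": "Incoming_Time"})
df = df.join(out)
print(df)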
If your column additional_data contains real dicts, you can use the .str[] accessor directly to get the dict values by key, as follows:
data['Duration'] = data['additional_data'].str['duration']
data['Incoming_Time'] = data['additional_data'].str['is_incoming']
If your column additional_data contains dicts stored as strings (the dict wrapped in single or double quotes), you need to convert each string to a dict first:
from ast import literal_eval
data['Duration'] = data['additional_data'].map(literal_eval).str['duration']
data['Incoming_Time'] = data['additional_data'].map(literal_eval).str['is_incoming']