I have a big dataframe consisting of 144005 rows. One of the columns of the dataframe is a string of dictionaries like
'{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"}'
I want to convert this string into separate dictionaries. I have been using json.loads() for this, but I have had to iterate over the dictionaries in the string one at a time, convert each to a dictionary with json.loads(), turn that into a new dataframe, and keep appending to this dataframe while iterating over the entire original dataframe.
I wanted to know whether there was a more efficient way to do this, as iterating over all 144005 rows takes a long time.
Here is a snippet of what I have been doing:
d1 = df1['attributes'].values
d2 = df1['ID'].values
for i, j in zip(d1, d2):
    data = json.loads(i)                      # parse one JSON string into a dict
    temp = pd.DataFrame(data, index=[j])
    temp['ID'] = j
    df2 = pd.concat([df2, temp], sort=False)  # DataFrame.append is deprecated in newer pandas
My 'attributes' column consists of a string of dictionaries in each row, and the 'ID' column contains the corresponding ID.
Did it myself.
I used map() along with a lambda function to efficiently apply json.loads() to each row, then converted the result to a dataframe and stored the output.
Here it is.
l1 = df1['attributes'].values
data = map(json.loads, l1)  # the lambda wrapper is unnecessary; json.loads can be passed directly
df2 = pd.DataFrame(data)
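If you also need the IDs next to the parsed attributes (as in the original loop), a minimal follow-up sketch, assuming the rows of df2 stay in the same order as df1's:
df2['ID'] = df1['ID'].values  # row order of df2 matches df1, so the IDs line up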
Just check the type of your column by using type().
If the type is a Series of dictionaries:
data['your column name'].apply(pd.Series)
then you will see every key as a separate column in a dataframe, holding that key's values.
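A minimal sketch of this pattern, reusing the keys from the first question above (the column name is illustrative):
import pandas as pd

data = pd.DataFrame({'attributes': [
    {"Step ID": "78495", "Choice Number": "0", "Campaign Run ID": "23199"},
]})
expanded = data['attributes'].apply(pd.Series)
# expanded now has 'Step ID', 'Choice Number' and 'Campaign Run ID' as separate columns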
I have a dataframe with only one row, which contains around 84 columns such as subscription_id_avg, subscription_id_std, etc. Each column holds either a mean or a standard deviation. I want to convert this dataframe to a nested dictionary such as
{"subscription_id" : {"avg": 0.36, "std": 1.5}}
How do I split the column names and convert it into a nested dictionary form?
Here, _std and _avg appear in every column name, so column_name[:-4] will give the desired property name. For example, subscription_id_avg[:-4] will yield subscription_id. Using this, we can get the keys for the dictionary.
result = {}  # avoid naming the variable dict, which shadows the built-in
for cl in df.columns:
    key_name = cl[:-4]   # e.g. 'subscription_id_avg' -> 'subscription_id'
    suffix = cl[-3:]     # 'avg' or 'std'
    if result.get(key_name) is None:
        result[key_name] = {suffix: df[cl].iloc[0]}
    else:
        result[key_name][suffix] = df[cl].iloc[0]
Here result will be the required Python dictionary. Hope this helps.
This will work only if _avg and _std are the only suffixes in column names.
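For example, with a one-row frame built from the question's expected values, the loop above produces the nested shape from the question:
import pandas as pd

df = pd.DataFrame({'subscription_id_avg': [0.36], 'subscription_id_std': [1.5]})
# after running the loop above:
# result == {'subscription_id': {'avg': 0.36, 'std': 1.5}}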
I'm looking to use value_counts on many columns. I know that I can use df.island.value_counts(), but I want a loop or something more efficient so I don't have to type the name of each column of the DataFrame. It's important to say that I know the specific columns to which I want to apply this function. The code I'm using is:
data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
df = pd.read_csv(data_url)
df.select_dtypes(object).head()  # I use this to find the columns that hold categorical variables, since those are the ones I want to run value_counts on
df.island.value_counts()
categorical_columns = df.select_dtypes(object).columns  # iterable of column headers; .head() is not needed just to read them
counter_dict = {}
# loop through each column, add values to dict
for column in categorical_columns:
    counter_dict[column] = df[column].value_counts()
# each item in dict has the column title as key and the value_counts series as value
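Usage is then straightforward; for example, with the penguins data from the question:
print(counter_dict['island'])  # same output as df.island.value_counts()
print(list(counter_dict))      # the categorical columns that were counted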
I am trying to split the object-type column cells with str.split(pat=","):
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing json strings (or any other strings that need to be parsed into multiple columns)
E.g.
df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json
def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use the pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
       0           1
0  40092  2017-11-06
1  39097  2017-10-31
If you need just one column, you can return a single element from parse_row (instead of the two-element tuple in the example above) and just use df.apply(parse_row, axis=1).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
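Since the question asked for the two columns to be called ID and DATE, a small follow-up sketch is to rename the expanded result:
df1 = df.apply(parse_row, axis=1, result_type='expand')
df1.columns = ['ID', 'DATE']  # column names taken from the question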
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
And afterwards creating a dataframe from that list and concatenating it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
I want to create a dictionary from a dataframe in python.
In this dataframe, one column contains all the keys and another column contains multiple comma-separated values for each key.
DATAKEY DATAKEYVALUE
name mayank,deepak,naveen,rajni
empid 1,2,3,4
city delhi,mumbai,pune,noida
I tried this code to first convert it into a simple dataframe, but the values do not separate row-wise:
columnnames = finaldata['DATAKEY']
collist = list(columnnames)
dfObj = pd.DataFrame(columns=collist)
collen = len(finaldata['DATAKEY'])
for i in range(collen):
    colname = collist[i]
    keyvalue = finaldata.DATAKEYVALUE[i]
    valuelist2 = keyvalue.split(",")
    dfObj = dfObj.append({colname: valuelist2}, ignore_index=True)  # note: DataFrame.append is deprecated in newer pandas
You should modify your question title; it is misleading, because pandas dataframes are "kind of" dictionaries in themselves, which is why the first comment you got referred to pandas' built-in .to_dict() method.
What you want to do is actually iterate over your pandas dataframe row-wise and, for each row, generate a dictionary key from the first column and a dictionary value (a list) from the second column.
For that you will have to use:
an empty dictionary: dict()
the method for iterating over dataframe rows: dataframe.iterrows()
a method to split a single string of values on a separator, such as the split() method you suggested: str.split().
With all these tools all you have to do is:
output = dict()
for index, row in finaldata.iterrows():
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')
Note that this generates a dictionary whose values are lists of strings, and it will not work if the contents of the 'DATAKEYVALUE' column are not single strings.
Also note that this may not be the most efficient solution if you have a very large dataframe.
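For larger frames, a vectorized sketch of the same idea (same column names as above) avoids the Python-level row loop:
output = dict(zip(finaldata['DATAKEY'], finaldata['DATAKEYVALUE'].str.split(',')))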
I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have done it individually for each column, but I want to know whether there is a more efficient way.
So at present I am doing something like this:
repair['SCENARIO'] = repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE'] = repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, pass a list of columns to your above-mentioned command:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that the latter can also be done directly (see comments).
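For reference, the direct form mentioned above converts every column at once:
df = df.astype(str)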
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use a comprehension; wrap the converted columns back into a dataframe with pd.concat, so df stays a DataFrame rather than becoming a plain list of Series:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns], axis=1)
You can also insert a condition to test which columns should be converted - for example, to convert only the columns whose names contain 'to_str':
cols_to_convert = [col_name for col_name in df.columns if 'to_str' in col_name]
df[cols_to_convert] = df[cols_to_convert].astype(str)