I have a dataframe df, containing only one column 'Info', which I want to split into multiple dataframes based on a list of indices, ls = [23,76,90,460,790]. If I want to use np.array_split(), how do I pass the list so that the data is split at these indices, with each index becoming the first row of one of the resulting dataframes?
I don't think you can use np.array_split() here (you could work on the underlying .values of the DataFrame, but you'd get back numpy arrays, not DataFrames) - what you can do is use .iloc and "slice" your DataFrame, e.g.:
from itertools import zip_longest

# each index in ls starts a new chunk; the final chunk runs to the end of df
bounds = [0] + ls
dfs = [df.iloc[s:e] for s, e in zip_longest(bounds, bounds[1:])]
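As a quick sanity check on toy data (a hypothetical 1000-row single-column frame standing in for the real df), each index in ls ends up as the first row of its chunk:

import numpy as np
import pandas as pd
from itertools import zip_longest

df = pd.DataFrame({'Info': np.arange(1000)})   # toy stand-in for the real data
ls = [23, 76, 90, 460, 790]

bounds = [0] + ls
dfs = [df.iloc[s:e] for s, e in zip_longest(bounds, bounds[1:])]

print([len(d) for d in dfs])            # [23, 53, 14, 370, 330, 210]
print([d.index[0] for d in dfs[1:]])    # [23, 76, 90, 460, 790]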
I have the following column multiindex dataframe.
I would like to select (or get a subset of) the dataframe with different columns from each level_0 index (i.e. x_mm and y_mm from virtual, and z_mm, rx_deg, ry_deg, rz_deg from actual). From what I have read I think I might be able to use pandas IndexSlice, but I'm not entirely sure how to use it in this context.
So far my workaround is to use pd.concat, selecting the two sets of columns independently. I have the feeling that this can be done more neatly with slicing.
You can programmatically generate the tuples to slice your MultiIndex:
from itertools import product

# (level_0 labels, level_1 labels) pairs; product() expands each pair into full column tuples
cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

out = df[[t for x in cols for t in product(*x)]]
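As far as I can tell, a single pd.IndexSlice only expresses a cross product, so it cannot pick a different set of second-level columns per first-level group on its own; generating the tuples explicitly gets around that. A small mock-up (column names from the question, the full column structure is assumed) shows what the comprehension expands to:

import pandas as pd
from itertools import product

# mock frame with the column structure described in the question
columns = pd.MultiIndex.from_product(
    [['virtual', 'actual'],
     ['x_mm', 'y_mm', 'z_mm', 'rx_deg', 'ry_deg', 'rz_deg']])
df = pd.DataFrame([[0] * len(columns)], columns=columns)

cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

wanted = [t for x in cols for t in product(*x)]
# [('virtual', 'x_mm'), ('virtual', 'y_mm'),
#  ('actual', 'z_mm'), ('actual', 'rx_deg'), ('actual', 'ry_deg'), ('actual', 'rz_deg')]
out = df[wanted]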
I want to create a dictionary from a dataframe in python.
In this dataframe, one column contains all the keys and another column contains multiple values for each key.
DATAKEY   DATAKEYVALUE
name      mayank,deepak,naveen,rajni
empid     1,2,3,4
city      delhi,mumbai,pune,noida
I tried this code to first convert it into a simple dataframe, but the values are not being separated row-wise:
columnnames = finaldata['DATAKEY']
collist = list(columnnames)
dfObj = pd.DataFrame(columns=collist)
collen = len(finaldata['DATAKEY'])
for i in range(collen):
    colname = collist[i]
    keyvalue = finaldata.DATAKEYVALUE[i]
    valuelist2 = keyvalue.split(",")
    dfObj = dfObj.append({colname: valuelist2}, ignore_index=True)
You should modify your question title; it is misleading, because pandas dataframes are "kind of" dictionaries in themselves, which is why the first comment you got pointed to the .to_dict() pandas built-in method.
What you want to do is actually iterate over your pandas dataframe row-wise, and for each row generate a dictionary key from the first column and a list of values from the second column.
For that you will have to use:
an empty dictionary: dict()
the method for iterating over dataframe rows: dataframe.iterrows()
a method to split a single string of values on a separator, such as the str.split() you suggested.
With all these tools all you have to do is:
output = dict()
for index, row in finaldata.iterrows():
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')
Note that this generates a dictionary whose values are lists of strings. It will not work if the contents of the 'DATAKEYVALUE' column are not single strings.
Also note that this may not be the most efficient solution if you have a very large dataframe.
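If the frame is large, a vectorized version of the same idea (assuming the same two column names) builds the dictionary without a Python-level loop:

# same result as the loop above, using pandas' vectorized string split
output = dict(zip(finaldata['DATAKEY'], finaldata['DATAKEYVALUE'].str.split(',')))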
I have a big dataframe consisting of 144005 rows. One of the columns of the dataframe is a string of dictionaries like
'{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"},
{"Step ID":"78495","Choice Number":"0","Campaign Run ID":"23199"}'
I want to convert this string to separate dictionaries. I have been using json.loads() for this purpose; however, I have had to iterate over the strings one at a time, convert each to a dictionary using json.loads(), convert that to a new dataframe, and keep appending to this dataframe while I iterate over the entire original dataframe.
I wanted to know whether there was a more efficient way to do this as it takes a long time to iterate over an entire dataframe of 144005 rows.
Here is a snippet of what I have been doing:
d1 = df1['attributes'].values
d2 = df1['ID'].values
for i, j in zip(d1, d2):
    data = json.loads(i)
    temp = pd.DataFrame(data, index=[j])
    temp['ID'] = j
    df2 = df2.append(temp, sort=False)
My 'attributes' column consists of a dictionary string in each row, and the 'ID' column contains its corresponding ID.
Did it myself.
I used map along with a lambda function to efficiently apply json.loads() to each row, then converted this data to a dataframe and stored the output.
Here it is.
l1 = df1['attributes'].values
data = map(lambda x: json.loads(x), l1)
df2 = pd.DataFrame(data)
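If the 'ID' column from the original loop also needs to end up in the result, one option (a sketch, assuming rows keep their original order) is to attach it after building the frame:

import json
import pandas as pd

data = map(json.loads, df1['attributes'].values)
df2 = pd.DataFrame(data)
df2['ID'] = df1['ID'].values   # re-attach the IDs, aligned by position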
Just check the type of your column's values using type().
If they are already dictionaries (a Series of dicts rather than strings):
data['your column name'].apply(pd.Series)
Then you will see all keys as separate columns in a dataframe, with their corresponding values.
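For instance, on a toy column that already holds dictionaries (not the asker's actual data):

import pandas as pd

s = pd.Series([{"Step ID": "78495", "Choice Number": "0"},
               {"Step ID": "78496", "Choice Number": "1"}])
print(s.apply(pd.Series))
#   Step ID Choice Number
# 0   78495             0
# 1   78496             1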
I have a dataframe named df_explode:
[input dataframe]
I have a list of strings "myList_groupby".
myList_groupby = ["domain","tag_name","tag_hierarchy","html_attributes","extension","xyz"]
I want to generate a new column named "combined" in df_explode, whose elements will be the string concatenation of all the columns whose header matches a string in the myList_groupby list (i.e. is in myList_groupby).
Output dataframe:
[output dataframe]
I think you need to filter the columns to the subset present in the list and call join per row:
cols = [c for c in myList_groupby if c in df_explode.columns]
df_explode['combined'] = df_explode[cols].astype(str).apply('_'.join, axis=1)
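On a toy frame (made-up values; names from the list that have no matching column, like "xyz" here, are simply skipped):

import pandas as pd

myList_groupby = ["domain", "tag_name", "tag_hierarchy",
                  "html_attributes", "extension", "xyz"]
df_explode = pd.DataFrame({'domain': ['a.com'], 'tag_name': ['div'],
                           'tag_hierarchy': ['html>body>div']})

cols = [c for c in myList_groupby if c in df_explode.columns]
df_explode['combined'] = df_explode[cols].astype(str).apply('_'.join, axis=1)
print(df_explode['combined'][0])   # a.com_div_html>body>div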
Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and a list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series? e.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a series, you can do it with append, but you have to create a series from your value first:
>>> print(x)
A 1
B 2
C 3
>>> print(x.append(pandas.Series([8, 9], index=["foo", "bar"])))
A 1
B 2
C 3
foo 8
bar 9
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will actually create an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on Numpy and are fundamentally reliant on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot, it will be slower than using ordinary Python lists/dicts.
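Worth noting: Series.append and DataFrame.set_value have since been deprecated and removed in newer pandas, but plain label-based assignment with .loc (or .at) covers all three cases from the question, creating the missing row/column/index entry on the fly. A minimal sketch (the extra labels besides 'foo'/'bar' are made up):

import pandas as pd

# single cell in a DataFrame: .loc creates the row and the column if they don't exist
df = pd.DataFrame()
df.loc['foo', 'bar'] = 56
print(df['bar']['foo'])        # 56 (dtype may end up float/object after enlargement)

# a list of (column, row) tuples with matching values
columns_and_rows = [('bar', 'foo'), ('baz', 'qux'), ('quux', 'corge')]
values = [5, 10, 15]
for (col, row), val in zip(columns_and_rows, values):
    df.loc[row, col] = val     # missing cells elsewhere are filled with NaN

# single entry in a Series: assignment by label grows it in place
srs = pd.Series(dtype=float)
srs.loc['foo'] = 2
print(srs['foo'])              # 2.0

The caveat above still applies: each enlargement rebuilds the underlying arrays, so doing this in a tight loop stays slow.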