Good day everyone! I had trouble splitting a nested dictionary into separate columns. I fixed it using the concat and json_normalize functions, but for some reason the code I used removed all the column names and returned NaN for every value...
Does anyone know how to fix this?
Code I used:
import pandas as pd
c = ['photo.photo_replace', 'photo.photo_remove', 'photo.photo_add', 'photo.photo_effect', 'photo.photo_brightness',
'photo.background_color', 'photo.photo_resize', 'photo.photo_rotate', 'photo.photo_mirror', 'photo.photo_layer_rearrange',
'photo.photo_move', 'text.text_remove', 'text.text_add', 'text.text_edit', 'text.font_select', 'text.text_color', 'text.text_style',
'text.background_color', 'text.text_align', 'text.text_resize', 'text.text_rotate', 'text.text_move', 'text.text_layer_rearrange']
df_edit = pd.concat([json_normalize(x)[c] for x in df['editables']], ignore_index=True)
df.columns = df.columns.str.split('.').str[1]
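A sketch of the likely fix, under two assumptions: `json_normalize` should be called as `pd.json_normalize` (the bare name is not imported above), and the rename on the last line should target `df_edit`, not `df`. The tiny `editables` column here is a made-up stand-in for the real data:

```python
import pandas as pd

# toy stand-in for df['editables']: each row holds a nested dict
df = pd.DataFrame({'editables': [
    {'photo': {'photo_add': 1}, 'text': {'text_add': 2}},
    {'photo': {'photo_add': 3}, 'text': {'text_add': 4}},
]})

df_edit = pd.concat(
    [pd.json_normalize(x) for x in df['editables']],  # json_normalize lives in pd
    ignore_index=True,
)
# rename on df_edit, not df: keep the part after the first dot
df_edit.columns = df_edit.columns.str.split('.').str[1]
print(df_edit)
```

Renaming `df.columns` instead of `df_edit.columns` would explain the symptom: the new names never reach the normalized frame, and a length mismatch on `df` scrambles its headers.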
Current problem:
Result I want:
df= pd.DataFrame({
'A':[1,2,3],
'B':[3,3,3]
})
print(df)
A B
0 1 3
1 2 3
2 3 3
c=['new_name1','new_name2']
df.columns=c
print(df)
new_name1 new_name2
0 1 3
1 2 3
2 3 3
Remember: the length of the column-name list (c) must equal the number of columns.
Consider the following toy code that performs a simplified version of my actual question:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,2,3,4,5],
'some column': [0,1,2,3,4],
}
)
df = df.set_index(['n_event'])
print(df)
resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as its name suggests, a resampled version of the original one (with replacement). This is exactly what I want. An example output of the previous code is
some column
n_event
1 0
2 1
3 2
4 3
5 4
some column
n_event
4 3
1 0
4 3
4 3
2 1
Now for my actual question I have the following dataframe:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
}
)
df = df.set_index(['n_event','n_channel'])
print(df)
which looks like
some column
n_event n_channel
1 1 0
2 1
2 1 2
2 3
3 1 4
2 5
4 1 6
2 7
5 1 8
2 9
I want to do exactly the same as before, resample with replacements, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
some column
n_event n_channel
2 1 2
2 3
2 1 2
2 3
3 1 4
2 5
1 1 0
2 1
5 1 8
2 9
As seen, each n_event was treated as a whole and rows within each event were not mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc)?
I have tried df.sample(frac=1, replace=True, ignore_index=False) and a few things using groupby, without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then back from wide to long using melt().
I don't have time to work out a full answer, but I thought I would get this idea to you in case it helps.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
'other col': [5,6,4,3,2,5,2,6,8,7],
}
)
resampled_df = df.pivot(
    index = 'n_event',
    columns = 'n_channel',
    values = df.columns.difference(['n_event','n_channel']),
)
resampled_df = resampled_df.sample(frac=1, replace=True)
resampled_df = resampled_df.stack()
print(resampled_df)
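A loop-free alternative sketch, not taken from the answers above: sample the event labels with NumPy, then select them with .loc, relying on the fact that partial indexing a MultiIndex with a list of level-0 labels keeps each event's rows together and repeats an event's block once per draw:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'n_event': [1,1,2,2,3,3,4,4,5,5],
        'n_channel': [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
    }
).set_index(['n_event','n_channel'])

rng = np.random.default_rng(0)  # seeded only for reproducibility
events = df.index.get_level_values('n_event').unique()
sampled_events = rng.choice(events, size=len(events), replace=True)

# each label in the list pulls in that event's full block of channels
resampled_df = df.loc[sampled_events.tolist()]
print(resampled_df)
```

Unlike the pivot/stack route, this keeps the original (possibly non-rectangular) layout, so it also works when events have different numbers of channels.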
I am trying to convert some Pandas code to Dask.
I have a dataframe that looks like the following:
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2
0 1 1
1 1 0
2 1 1
3 1 1
4 1 1
In Pandas, I can create a Lists column which includes the list names where the row value is 1, like so:
df['Lists'] = df.dot(df.columns+",").str.rstrip(",").str.split(",")
So the Lists column looks like:
Lists
0 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
1 [ListView_Lead_MyUnreadLeads]
2 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
3 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
4 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
In Dask, the dot function doesn't seem to work the same way. How can I get the same behavior / output?
Any help would be appreciated. Thanks!
Related question in Pandas: How to return headers of columns that match a criteria for every row in a pandas dataframe?
Here are some alternative ways to do it in Pandas. You can try whether they work equally well in Dask.
cols = df.columns.values
df['Lists'] = [list(cols[x]) for x in df.eq(1).values]
or try:
df['Lists'] = df.eq(1).apply(lambda x: list(x.index[x]), axis=1)
The first solution, using a list comprehension, performs better if your dataset is large.
Result:
print(df)
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2 Lists
0 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
1 1 0 [ListView_Lead_MyUnreadLeads]
2 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
3 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
4 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
Here's a Dask version with map_partitions:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'ListView_Lead_MyUnreadLeads': [1,1,1,1,1], 'ListView_Lead_ViewCustom2': [1,0,1,1,1] })
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    df = df.copy()
    df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
    return df

ddf.map_partitions(myfunc).compute()
I would like to make a new column with the rank of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that looking at a row tells me where that number falls in the sorted order of the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
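A small sketch of how the method parameter changes tie handling, on a made-up series with a duplicate value:

```python
import pandas as pd

s = pd.Series([4, 3, 4, 1])  # the two 4s tie for ranks 3 and 4

print(s.rank(method='average').tolist())  # [3.5, 2.0, 3.5, 1.0] - ties share the mean rank
print(s.rank(method='min').tolist())      # [3.0, 2.0, 3.0, 1.0] - ties share the lowest rank
print(s.rank(method='first').tolist())    # [3.0, 2.0, 4.0, 1.0] - ties broken by position
```

For the integer order column above, method='first' guarantees distinct integers even with duplicates.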
I have a pandas dataframe like the one below:
"Unnamed: 0"
0 {1:'Apple1', 2:'LemonA', 3:'StrawberryX'}
1 {1:'Apple2', 2:'LemonB', 3:'StrawberryW'}
2 {1:'Apple3', 2:'LemonC', 3:'StrawberryZ'}
So myDf is a DataFrame with 3 rows and 1 column (shape 3×1).
What is the best way to modify it like below:
1 2 3
0 'Apple1' 'LemonA' 'StrawberryX'
1 'Apple2' 'LemonB' 'StrawberryW'
2 'Apple3' 'LemonC' 'StrawberryZ'
After modification my new data shape is 3*3
Assuming you have a series of dicts, you can do
pd.DataFrame(list(df['"Unnamed: 0"']))
1 2 3
0 Apple1 LemonA StrawberryX
1 Apple2 LemonB StrawberryW
2 Apple3 LemonC StrawberryZ
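An alternative sketch with apply(pd.Series), which expands each dict cell into its own set of columns. This assumes the cells are real dicts; if they were loaded from CSV they will be strings and need ast.literal_eval first:

```python
import pandas as pd

df = pd.DataFrame({'"Unnamed: 0"': [
    {1: 'Apple1', 2: 'LemonA', 3: 'StrawberryX'},
    {1: 'Apple2', 2: 'LemonB', 3: 'StrawberryW'},
    {1: 'Apple3', 2: 'LemonC', 3: 'StrawberryZ'},
]})

# each dict becomes a row; its keys become the new column labels
expanded = df['"Unnamed: 0"'].apply(pd.Series)
print(expanded)
```

This is slower than pd.DataFrame(list(...)) on large frames, but it preserves the original index and tolerates dicts with missing keys (filling NaN).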
Thanks to everyone who helped,
I've just figured out the answer to my own question.
This is how I solved it:
I converted all the data in all rows into a dictionary using pandas.DataFrame.to_dict() and saved it in a variable x.
Then I ran the following code:
newDf = pandas.DataFrame(x)
pandas is smart enough to read the dictionary keys as columns. :)
d = {1: ['Apple1', 'Apple2', 'Apple3'], 2: ['LemonA', 'LemonB', 'LemonC'], 3: ['StrawberryX', 'StrawberryY', 'StrawberryZ']}
df = pd.DataFrame(data=d)
df
1 2 3
0 Apple1 LemonA StrawberryX
1 Apple2 LemonB StrawberryY
2 Apple3 LemonC StrawberryZ
That would be the classic solution (you can skip putting the dictionary in the variable "d" and just write data={the data}; it just looks nicer).
I have successfully changed a single column name in the dataframe using this:
df.columns=['new_name' if x=='old_name' else x for x in df.columns]
However, I have lots of columns to update (though not all 240 of them), and I don't want to write out each change individually if I can help it.
I have tried to follow the advice from @StefanK in this thread:
Changing multiple column names but not all of them - Pandas Python
my code:
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
but i am getting an error message:
File "<ipython-input-17-2808488b712d>", line 3
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
^
SyntaxError: can't assign to literal
So, having googled the error and read many more S.O. questions here, it looks to me like it is trying to read the numbers as integers instead of as an index? I'm not certain here, though.
So how do I fix it so it treats the numbers as an index? The column names I am replacing are at least 10 words long each, so I'm keen not to have to type them all out! My only idea is to use iloc somehow, but I'm going into new territory here!
I'd really appreciate some help, please.
Remove the '=' after df.columns in your code and use this instead:
df.columns.values[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
Because a pandas Index does not support mutable operations, convert it to a NumPy array, reassign the positions you want, and set it back:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
arr = df.columns.to_numpy()
arr[[0,2,3]] = list('RTG')
df.columns = arr
print (df)
R B T G E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
So with your data use:
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
arr = df.columns.to_numpy()
arr[idx] = names
df.columns = arr
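An equivalent sketch using rename with a mapping built from the positions, which avoids touching the Index's backing array; the six-column frame here is a small stand-in for the real 240-column one:

```python
import pandas as pd

df = pd.DataFrame(columns=list('ABCDEF'))  # stand-in for the real frame
idx = [0, 2, 3]
names = list('RTG')

# pair each existing name at those positions with its replacement
mapping = dict(zip(df.columns[idx], names))
df = df.rename(columns=mapping)
print(df.columns.tolist())  # ['R', 'B', 'T', 'G', 'E', 'F']
```

rename leaves unmapped columns untouched, so there's no need to spell out the other names, and it returns a new frame rather than mutating in place.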