Multiply two DataFrames on the GPU (cudf) - Python

I have two dataframes on the GPU, and I want to multiply them elementwise.
Here is a simple version of my dataframes:
import cudf
a = cudf.DataFrame()
a['c1'] = [1, 2]
b = cudf.DataFrame()
b['c1'] = [2, 5]
I want to see this output:
c1
0 2
1 10
I am using a.multiply(b); however, I get this error: AttributeError: DataFrame object has no attribute multiply
Can you please help me with that? Thanks.
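If your cuDF build does not have DataFrame.multiply, a minimal workaround sketch (assuming the columns line up and that elementwise arithmetic on cuDF Series via the * operator is available) is to multiply column by column:
import cudf
a = cudf.DataFrame()
a['c1'] = [1, 2]
b = cudf.DataFrame()
b['c1'] = [2, 5]
out = cudf.DataFrame()
out['c1'] = a['c1'] * b['c1']  # elementwise product of the matching columns
print(out)
#    c1
# 0   2
# 1  10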

Related

pandas column of list type

I need to generate a dataframe where column 'b' is of list type.
import pandas as pd
temp = pd.DataFrame(columns=['a','b'])
temp['a'] = 1
temp['b'] = [2,3]
The expected result would be:
a b
1 [2,3]
but the actual result is:
a b
NaN 2
NaN 3
How do I get the expected result?
You should add it row by row, using loc[index], like this:
import pandas as pd
temp = pd.DataFrame(columns=['a','b'])
temp.loc[0] = 2, [2,3]
Result:
   a       b
0  2  [2, 3]
You need to wrap the values in a list if the dataframe is empty.
import pandas as pd
temp = pd.DataFrame(columns=['a','b'])
temp['a'] = [1]
temp['b'] = [[2,3]]
print (temp)
temp['a'] = 5
temp['b'] = [[8,9]]
print (temp)
If the dataframe already has values, you can use a normal assignment.
The first assignment to the empty dataframe will give:
a b
0 1 [2, 3]
Any further updates to the dataframe after the initial assignment will result in:
a b
0 5 [8, 9]
Alternatively, as Tan suggested, you can use .loc or .iloc to store the values.
However, if the initial assignment to the empty dataframe uses a bare scalar instead of a list, column a will end up as NaN (thanks Trenton), as sketched below.
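A minimal sketch of that failure mode, reusing the empty frame from above (the bare scalar adds no rows, so 'a' is still empty when the list assignment to 'b' creates row 0 and 'a' gets reindexed to NaN):
import pandas as pd
temp = pd.DataFrame(columns=['a','b'])
temp['a'] = 1          # bare scalar on an empty frame: column 'a' stays empty
temp['b'] = [[2,3]]    # this creates row 0, but 'a' has no value for it
print(temp)
#      a       b
# 0  NaN  [2, 3]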

Obtain mode from column in groupby [duplicate]

This question already has answers here:
GroupBy pandas DataFrame and select most common value
(13 answers)
Closed 4 years ago.
I'm trying to obtain the mode of a column in a groupby object, but I'm getting this error: incompatible index of inserted column with frame index.
This is the line I'm getting this on, and I'm not sure how to fix it. Any help would be appreciated.
dfBitSeq['KMeans'] = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: x.mode())
Pandas mode() returns a Series (there can be more than one mode), unlike mean and median, which return a scalar. So you just need to select the first value of that slice with x.mode().iloc[0]:
dfBitSeq['KMeans'] = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: x.mode().iloc[0])
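A quick illustration of why the .iloc[0] is needed, on a toy series with two equally common values:
import pandas as pd
s = pd.Series([1, 1, 2, 2, 3])
print(s.mode())          # a Series: both 1 and 2 are modes
# 0    1
# 1    2
# dtype: int64
print(s.mode().iloc[0])  # 1 -- a single scalar that fits into a column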
You can use scipy.stats.mode. Example below.
from scipy.stats import mode
df = pd.DataFrame([[1, 5], [2, 3], [3, 5], [2, 4], [2, 3], [1, 4], [1, 5]],
                  columns=['OnBitSeq', 'KMeans'])
# OnBitSeq KMeans
# 0 1 5
# 1 2 3
# 2 3 5
# 3 2 4
# 4 2 3
# 5 1 4
# 6 1 5
modes = df.groupby('OnBitSeq')['KMeans'].apply(lambda x: mode(x)[0][0]).reset_index()
# OnBitSeq KMeans
# 0 1 5
# 1 2 3
# 2 3 5
If you need to add this back to the original dataframe:
df['Mode'] = df['OnBitSeq'].map(modes.set_index('OnBitSeq')['KMeans'])
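With the example frame above, that map fills a Mode column per row:
#    OnBitSeq  KMeans  Mode
# 0         1       5     5
# 1         2       3     3
# 2         3       5     5
# 3         2       4     3
# 4         2       3     3
# 5         1       4     5
# 6         1       5     5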
You could look at Attach a calculated column to an existing dataframe.
This error looks similar and the answer is pretty useful.

Keep column and row order when storing pandas dataframe in json

When storing data in a json object with to_json, and reading it back with read_json, rows and columns are returned sorted alphabetically. Is there a way to keep the results ordered or reorder them upon retrieval?
You could use orient='split', which stores the index and column information in lists, which preserve order:
In [34]: df
Out[34]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
In [35]: df.to_json(orient='split')
Out[35]: '{"columns":["A","C","B"],"index":[5,4,3],"data":[[0,1,2],[3,4,5],[6,7,8]]}'
In [36]: pd.read_json(df.to_json(orient='split'), orient='split')
Out[36]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
Just remember to use orient='split' on reading as well, or you'll get
In [37]: pd.read_json(df.to_json(orient='split'))
Out[37]:
columns data index
0 A [0, 1, 2] 5
1 C [3, 4, 5] 4
2 B [6, 7, 8] 3
If you want output in the style of orient='records' while keeping the column order, you can build it yourself with a function like this. I don't think it is a wise approach, though, and don't recommend it, because it does not guarantee the order.
import json

def df_to_json(df):
    res_arr = []
    ldf = df.copy()
    ldf = ldf.fillna('')
    lcolumns = [ldf.index.name] + list(ldf.columns)
    for key, value in ldf.iterrows():
        lvalues = [key] + list(value)
        res_arr.append(dict(zip(lcolumns, lvalues)))
    return json.dumps(res_arr)
In addition, for reading the JSON back without the columns being sorted, see Python json.loads changes the order of the object.
Good luck.
Let's say you have a pandas dataframe that you read in:
import pandas as pd
df = pd.read_json('/abc.json')
df.head()
Now there are two ways to save it to JSON using pandas to_json. This one stores the columns, index, and data in separate lists:
df.sample(200).to_json('abc_sample.json', orient='split')
However, to preserve the row-by-row order like in a CSV, use this one:
df.sample(200).to_json('abc_sample_2nd.json', orient='records')
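A small sketch of what the two orients produce, on a hypothetical two-column frame:
import pandas as pd
df = pd.DataFrame({'name': ['x', 'y'], 'score': [1, 2]})
print(df.to_json(orient='split'))
# {"columns":["name","score"],"index":[0,1],"data":[["x",1],["y",2]]}
print(df.to_json(orient='records'))
# [{"name":"x","score":1},{"name":"y","score":2}]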

Apply condition on pandas columns to create a boolean indexing array

I want to drop specific rows from a pandas dataframe. Usually you can do that using something like
df[df['some_column'] != 1234]
What df['some_column'] != 1234 does is create a boolean indexing array that, when used to index df, keeps only the rows where the value is True.
But in some cases, like mine, I don't see how I can express the condition in such a way, and iterating over pandas rows is way too slow to be considered a viable option.
To be more specific, I want to drop all rows where the value of a column is also a key in a dictionary, in a similar manner to the example above.
In a perfect world I would consider something like
df[df['some_column'] not in my_dict.keys()]
which obviously does not work. Any suggestions?
What you're looking for is isin()
import pandas as pd
df = pd.DataFrame([[1, 2], [1, 3], [4, 6],[5,7],[8,9]], columns=['A', 'B'])
In [9]: df
Out[9]:
A B
0 1 2
1 1 3
2 4 6
3 5 7
4 8 9
mydict = {1:'A',8:'B'}
df[df['A'].isin(mydict.keys())]
Out[11]:
A B
0 1 2
1 1 3
4 8 9
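Note that the question asks to drop those rows, so negate the mask with ~:
df[~df['A'].isin(mydict.keys())]
#    A  B
# 2  4  6
# 3  5  7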
You could use query for this purpose, referencing a local variable with @:
keys = list(my_dict.keys())
df.query('some_column not in @keys')
You can use the function isin() to select rows whose column value is in an iterable.
Using lists:
my_list = ['my', 'own', 'data']
df.loc[df['column'].isin(my_list)]
Using dicts:
my_dict = {'key1':'Some value'}
df.loc[df['column'].isin(my_dict.keys())]

How do I filter a pandas DataFrame based on value counts?

I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game with a genre that appears less than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
Help?
Use groupby filter:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 5 6
In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
A B
0 1 2
1 1 4
I recommend reading the split-apply-combine section of the docs.
A solution with better performance is GroupBy.transform with 'size', which returns the count per group as a Series the same length as the original df, so you can filter by boolean indexing:
df1 = df[df.groupby("A")['A'].transform('size') > 1]
Or use Series.map with Series.value_counts:
df1 = df[df['A'].map(df['A'].value_counts()) > 1]
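For instance, on the frame from the first answer, the transform version keeps the same rows as the groupby filter:
import pandas as pd
df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
df[df.groupby("A")['A'].transform('size') > 1]
#    A  B
# 0  1  2
# 1  1  4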
#jezael's solution works very well. Here is a different approach to filter based on value counts.
For example, if the dataset is:
df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})
Convert and save the counts as a dictionary:
count_freq = dict(df['a'].value_counts())
Create a new column by copying the target column, then map the dictionary onto it:
df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)
Now we have a new column with the count frequency; you can define a threshold and filter easily with this column:
df[df.count_freq > 1]
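On the example dataset this keeps the rows whose 'a' value appears more than once:
#    a   b  count_freq
# 0  1  11           2
# 2  3  33           2
# 3  3   4           2
# 4  1  55           2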
Additionally, in case one wants to filter and keep a 'count' column:
attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)
# outputs rows whose grouped 'A' count > 10, with columns 'A' and 'count'
I might be a little late to this party but:
df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)
This is how you create a new dataframe with the counts and then just filter it on the count column:
df[df['theA_you_want_to_count'] > 100]
