Count the majority value across rows in a DataFrame in Python - python

I have a DataFrame like:
df = np.array([[1,5,3,4,5,5,6,],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
and my expected output is the majority value of each row, like:
0 5
1 2
2 6
I'm new with Pandas. Thank you for any help.

With pandas version 0.13.0 or later, you can use df.mode(axis=1)
(check your version with pd.__version__):
df.mode(axis=1)
0
0 5
1 2
2 6
[3 rows x 1 columns]

The concept you are looking for is the mode, which is the most commonly occurring value in a set. SciPy and pandas both have ways to compute modes, through scipy.stats.mode and pandas.DataFrame.mode (which works along an axis). So for this example you could say:
import numpy as np
import scipy.stats

df = np.array([[1,5,3,4,5,5,6],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
results = np.zeros(len(df))  # initialize once, outside the loop
for i in np.arange(len(df)):
    # np.ravel(...)[0] handles both older (array) and newer (scalar) SciPy returns
    results[i] = np.ravel(scipy.stats.mode(df[i]).mode)[0]
This fills a NumPy array with the mode of each row. To do the same thing with pandas you can do:
df = np.array([[1,5,3,4,5,5,6,],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
df.mode(axis = 1)
The documentation is here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.mode.html
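As a side note, scipy.stats.mode also accepts an axis argument, so the explicit loop above can be avoided entirely; a minimal sketch (the shape of the returned mode array varies between SciPy versions):
import numpy as np
from scipy import stats

arr = np.array([[1, 5, 3, 4, 5, 5, 6],
                [1, 2, 2, 3, 4, 5, 6],
                [1, 2, 3, 4, 5, 6, 6]])
# One mode per row; older SciPy keeps a trailing axis, so ravel() flattens it
row_modes = np.ravel(stats.mode(arr, axis=1).mode)
print(row_modes)  # [5 2 6]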

Related

Filtering pandas dataframe using output from idxmax function

I have a pandas.DataFrame with several numerical columns and would like to find the maximum across each row, so I did the below:
df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)
# O/P is a pd.Series like below
0 0.741459
1 0.995978
2 0.978618
3 0.973057
4 0.838006
...
Next I want to find the column label of the maximum value, so I did the below:
filter_ = df.idxmax(axis=1)
# O/P
0 3
1 8
2 7
3 5
4 1
..
Now, using filter_ on the DataFrame, I want to achieve the same result as the max_series variable, without using pd.DataFrame.max(axis=1).
So I tried the below:
df.loc[:, filter_]
or
df.filter(items=filter_, axis=1)
but both give me
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
I don't need a 100000x100000 matrix, I just need my max_series which is 100000x1
So how do I filter the DataFrame using the filter_ and get the pd.Series of maximum across rows?
This could be a faster solution:
%%time
df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)
filter_ = df.idxmax(axis=1)
unique_cols = filter_.unique()
max_series_ = pd.concat(
    [df.loc[df.index.isin(filter_[filter_ == col].index), col] for col in unique_cols]
).sort_index()
from pandas.testing import assert_series_equal
assert_series_equal(max_series_, max_series)
Maybe it can be optimized even further.
This could be one of the solutions,
filter_ = df.idxmax(axis=1)
df.apply(lambda row: row[filter_.loc[row.name]], axis=1)
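For reference, a fully vectorized alternative (a sketch, not from the answers above) uses NumPy fancy indexing to pull each row's maximum directly, with no 100000x100000 intermediate:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(100000, 10)))
filter_ = df.idxmax(axis=1)
# Map each row's winning column label to its positional index,
# then index the underlying array pairwise: one value per row
rows = np.arange(len(df))
cols = df.columns.get_indexer(filter_)
max_series_ = pd.Series(df.to_numpy()[rows, cols], index=df.index)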

Converting NumPy data to a Pandas DataFrame without using pd.DataFrame

I have a question in an assessment where I have to convert NumPy data into a pandas DataFrame. It also should use the data's dtype names as column headers.
I cannot use the pd.DataFrame() function for this task, and a clue has been given that I am still supposed to use pandas methods.
This is the code I have so far:
def convert_to_df(data):
    "converting numpy array into dataframe"
    far = data.tolist()
    return pd.Series(far).to_frame()
which does convert it to a DataFrame, giving this with the test:
0
0 (2020-02-29 13:32:59, 1.23375E+18, 0.67, 0.293...
1 (2020-02-27 00:20:58, 1.23282E+18, 0.442, 0.38...
2 (2020-02-10 18:54:50, 1.22694E+18, 0.577, 0.42...
3 (2020-02-29 05:23:06, 1.23362E+18, 0.514, 0.41...
4 (2020-02-26 03:20:55, 1.23251E+18, 0.426, 0.37...
I'm just confused about how to then get the headers in order. My output when I run the test code is meant to look like this:
created_at ... emotion_category
0 2020-02-29 13:32:59 ... joy
1 2020-02-27 00:20:58 ... fear
2 2020-02-10 18:54:50 ... joy
3 2020-02-29 05:23:06 ... no specific emotion
4 2020-02-26 03:20:55 ... fear
[5 rows x 9 columns]
I have attached a screenshot of the question so you can see the test code and the wording.
Hope someone can help!
The data I'm using looks like this.
You can use pandas Series instead. What I would do is convert each column in the NumPy array to a Series. For example, given the following NumPy array:
data = np.array([[1, 2, 3], [4, 5, 6]])
I will use a for loop to create a Series for each column:
series = []
for i in range(data.shape[1]):
    series.append(pd.Series(data[:, i], name="Serie_" + str(i)))
Finally, concatenate these Series into one DataFrame:
pd.concat(series, axis=1)
The result:
Serie_0 Serie_1 Serie_2
0 1 2 3
1 4 5 6
I hope this helps.
Try this one:
def convert_to_df(data):
    '''convert numpy to pandas'''
    headers = ['created_at', 'tweet_ID', 'valence_intensity', 'anger_intensity',
               'fear_intensity', 'sadness_intensity', 'joy_intensity',
               'sentiment_category', 'emotion_category']
    df = data[:][headers[0]].tolist()
    df = pd.Series(df).to_frame()
    df.columns = ['created_at']
    for i in range(1, len(headers)):
        newcol = data[:][headers[i]].tolist()
        newcol = pd.Series(newcol).to_frame()
        df[headers[i]] = newcol
    return df
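If the input is a NumPy structured array, as the hard-coded header list suggests, a shorter sketch of the same idea (assuming data.dtype.names holds the field names) avoids hard-coding the headers:
def convert_to_df(data):
    # Build one Series per field, named after the field, and concatenate;
    # the dtype names become the column headers without calling pd.DataFrame
    return pd.concat(
        [pd.Series(data[name], name=name) for name in data.dtype.names],
        axis=1,
    )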

How to generate array column with values from other columns using Dask Dataframe

I am trying to convert some Pandas code to Dask.
I have a dataframe that looks like the following:
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2
0 1 1
1 1 0
2 1 1
3 1 1
4 1 1
In Pandas, I can create a Lists column which includes the column name if the row value is 1, like so:
df['Lists'] = df.dot(df.columns+",").str.rstrip(",").str.split(",")
So the Lists column looks like:
Lists
0 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
1 [ListView_Lead_MyUnreadLeads]
2 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
3 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
4 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
In Dask, the dot function doesn't seem to work the same way. How can I get the same behavior / output?
Any help would be appreciated. Thanks!
Related question in Pandas: How to return headers of columns that match a criteria for every row in a pandas dataframe?
Here are some alternative ways to do it in Pandas. You can try whether they work equally well in Dask.
cols = df.columns.values
df['Lists'] = [list(cols[x]) for x in df.eq(1).values]
or try:
df['Lists'] = df.eq(1).apply(lambda x: list(x.index[x]), axis=1)
The first solution using list comprehension provides better performance if your dataset is large.
Result:
print(df)
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2 Lists
0 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
1 1 0 [ListView_Lead_MyUnreadLeads]
2 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
3 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
4 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
Here's a Dask version with map_partitions:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'ListView_Lead_MyUnreadLeads': [1,1,1,1,1], 'ListView_Lead_ViewCustom2': [1,0,1,1,1] })
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    df = df.copy()
    df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
    return df
ddf.map_partitions(myfunc).compute()
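Depending on the Dask version, map_partitions may warn that it is inferring the output schema; if so, you can pass it explicitly through the meta argument (a sketch, assuming the Lists column holds Python lists and is therefore object-typed):
# Describe the output schema up front so Dask can skip meta inference
meta = df.head(0).assign(Lists=pd.Series(dtype=object))
ddf.map_partitions(myfunc, meta=meta).compute()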

Idiomatic way to create pandas dataframe as concatenation of function of another's rows

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose, but nothing seemed to work. The desired output is:
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calculate prod and sum together with numpy.ndarray.reshape, and np.square (or df.pow(2)) for calculating the squares:
out = pd.DataFrame({'x': df.agg(['prod', 'sum'],axis=1).to_numpy().reshape(-1),
'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
You should avoid iterating over rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then:
# df now holds columns a, b, product, sum, asq, bsq, so skip a and b when unpacking
out = [[[p, s], [asq, bsq]] for _, _, p, s, asq, bsq in df.to_numpy()]
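For completeness, if row iteration is unavoidable because f can take any form, itertuples is usually faster than iterrows; a sketch (row._asdict() is used so f's dict-style access keeps working):
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# itertuples avoids building a Series per row; _asdict() restores key access
output_df = pd.concat(
    (f(row._asdict()) for row in input_df.itertuples(index=False)),
    ignore_index=True,
)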

Python - pandas - Append Series into Blank DataFrame

Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with an empty list it is int64, with nothing it is object:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
Unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this, I can confirm that this looks to me like a bug in pandas version 0.12.0, as your original code works fine:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 64-bit using Python 3.3.5.0. I think the problem is pandas, but I would upgrade both pandas and numpy to be safe; I don't think this is a 32- versus 64-bit Python issue.
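A note for readers on current pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so on modern versions the equivalent is pd.concat; a minimal sketch:
import pandas as pd

df = pd.DataFrame()
h = pd.Series(['g', 4, 2, 1, 1])
# append() is gone in pandas >= 2.0; concat a one-row frame instead
df = pd.concat([df, h.to_frame().T], ignore_index=True)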
