A single row of a DataFrame prints one column per line, i.e. the column name followed by the column value, with the next column's name and value on the following line. For example, the code below
import pandas as pd

df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    # other operations go here...
    print(row)
The output for the first row comes out as:
0    100
1    200
2    300
Name: 0, dtype: int64
Is there a way to have each row printed horizontally, ignoring the dtype and Name? Example for the first row:
  0    1    2
100  200  300
Use the to_frame method, then transpose with T:
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    print(row.to_frame().T)
     0    1    2
0  100  200  300
     0    1    2
1  400  500  600
Note: this is similar to @JohnE's answer, in that the method to_frame is syntactic sugar around pd.DataFrame. In fact, if we follow the code,
def to_frame(self, name=None):
    """
    Convert Series to DataFrame

    Parameters
    ----------
    name : object, default None
        The passed name should substitute for the series name (if it has
        one).

    Returns
    -------
    data_frame : DataFrame
    """
    if name is None:
        df = self._constructor_expanddim(self)
    else:
        df = self._constructor_expanddim({name: self})
    return df
we see that it points to _constructor_expanddim:
@property
def _constructor_expanddim(self):
    from pandas.core.frame import DataFrame
    return DataFrame
which, as you can see, simply returns the callable DataFrame.
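To see that equivalence directly, here is a minimal sketch (the variable names are just illustrative):

import pandas as pd

ser = pd.Series([100, 200, 300], name=0)  # one row of the example frame
# to_frame() is a thin wrapper around the DataFrame constructor,
# so both expressions build the same one-row frame after transposing
print(ser.to_frame().T.equals(pd.DataFrame(ser).T))  # True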
Use the transpose property:
df.T
     0    1    2
0  100  200  300
It seems like there should be a simpler answer to this, but try turning it into another DataFrame with one row.
data = {x: y for x, y in zip(df.columns, df.iloc[0])}
sf = pd.DataFrame(data, index=[0])
print(sf.to_string())
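A shorter variant of the same idea (a sketch, not part of the original answer): wrapping the row Series in a list makes pandas treat it as a single row:

print(pd.DataFrame([df.iloc[0]]))
#      0    1    2
# 0  100  200  300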
Sorta combining the two previous answers, you could do:
for index, ser in df.iterrows():
    print(pd.DataFrame(ser).T)
     0    1    2
0  100  200  300
     0    1    2
1  400  500  600
Basically, what happens is that if you extract a row or column from a dataframe, you get a Series, which displays as a column. And it doesn't matter if you do ser or ser.T; it "looks" like a column either way. (I mean, a Series is one-dimensional, not two, but you get the point...)
So anyway, you can convert the series to a dataframe with one row. (I changed the name from "row" to "ser" to emphasize what is happening above.) The key is that you have to convert to a dataframe first (which will be a column by default), and then transpose it.
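A quick illustration of that point (a minimal sketch using the example df from above):

ser = df.iloc[0]
print(ser.T.equals(ser))          # True: transposing a Series is a no-op
print(pd.DataFrame(ser).shape)    # (3, 1): the Series becomes a single column
print(pd.DataFrame(ser).T.shape)  # (1, 3): transposing that frame yields one row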
Related
I am trying to assign an ID to a pandas dataframe based on row count. For this I am trying to apply the logic below to the dataframe:

import math

num = df.shape[0]
for i in range(num):
    print(math.ceil(i/4))
So the idea is that for every 4 consecutive rows, an ID would be assigned. So the resultant dataframe would look like
col_1 Group_ID
v_1 1
v_2 1
v_3 1
v_4 1
v_5 2
v_6 2
v_7 2
v_8 2
v_9 3
v_10 3
--- And so on.
Just a quick thought: how can I use the apply function on df.index?
Can I use the below code?
df['Index'] = df.index
df['GroupID'] = df['Index'].apply(np.ceil)
Any hints?
You can pass a function to apply, so create a named function and pass it:

def everyFour(rowIdx):
    return math.ceil(rowIdx / 4)

df['GroupId'] = df['Index'].apply(everyFour)
or just use a lambda
df['GroupId'] = df['Index'].apply(lambda rowIdx: math.ceil(rowIdx / 4))
Note that this will leave the first row (index 0) in group 0, so you might want to add 1 to the rowIdx before dividing by 4.
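For example, a minimal sketch (assuming the default 0-based RangeIndex) that produces the 1-based groups of four from the question:

df['GroupId'] = df['Index'].apply(lambda rowIdx: math.ceil((rowIdx + 1) / 4))
# equivalently, without apply: df['GroupId'] = df.index // 4 + 1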
I need to add the number of unique values in column C (right table) to the related row in the left table, based on the values in the common column A (as shown in the picture).
Thank you in advance.
Group by column A in the second dataset and calculate the count of unique values in column C, then merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
   A   B  C-count
0  2  22        3
1  3  23        2
2  5  21        1
3  1  24        1
4  6  21        1
Use DataFrameGroupBy.nunique with Series.map for new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
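Since the tables are only shown as a picture, here is a hedged sketch with made-up data shaped the same way, to show the one-liner in action:

import pandas as pd

# hypothetical stand-ins for the pictured left and right tables
df1 = pd.DataFrame({'A': [2, 3, 5], 'B': [22, 23, 21]})
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5], 'C': ['x', 'y', 'z', 'x', 'y', 'x']})

df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
print(df1)
#    A   B  C-count
# 0  2  22        3
# 1  3  23        2
# 2  5  21        1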
This may not be the most efficient way of doing it, so be careful if your dataframes are big.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm supposing that right_table is a pandas.DataFrame.
Now, do the following to build the new column (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, left_table should be the desired DataFrame (as far as I understand what you need).
I'm trying to do a conditional count across records in a pandas dataframe. I'm new to Python and have a working solution using a for loop, but running it on a large dataframe with ~200k rows takes a long time. I believe there is a better way, by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue','green','yellow','blue','green','yellow','orange','purple','red','red'],
        'weight': [4,5,6,4,1,3,9,8,4,1]}
df = pd.DataFrame(data)

# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color'] == c) & (df['weight'] < w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
We can do transform with min and groupby:
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0    0
1    1
2    1
3    0
4    0
5    0
6    0
7    0
8    1
9    0
Name: weight, dtype: int64
#df['c...'] = df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
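Note that the transform('min') trick produces a 0/1 indicator (is this row's weight above its group minimum?), which happens to coincide with the counts in this sample but is not a general count. Since the question asked specifically about apply, here is a hedged row-wise sketch that matches the loop's logic exactly (slower than grouped operations, but general):

df['counts, same color & less weight'] = df.apply(
    lambda row: ((df['color'] == row['color']) & (df['weight'] < row['weight'])).sum(),
    axis=1)  # for each row, count rows with the same color and strictly lesser weight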
I put this dataframe as an example:
import pandas as pd
df = pd.DataFrame({'country':['china','canda','usa' ], 'value':[1000, 850, 1100], 'fact':[1000,200,850]})
df.index=df['country']
df = df.drop('country', axis=1)
I want to iterate over the GDP (value) of each country, and in this iteration I want to create a new column filled with 1 or 0 depending on a condition:
for x in df['value']:
    if x > 900:
        df['answer'] = 1
    else:
        df['answer'] = 0
I would expect a column with the following values:
[1,0,1]
Because Canada has a value lower than 900.
But instead I have a column full of ones.
What is wrong?
Use np.where:

import numpy as np

df["answer"] = np.where(df["value"] > 900, 1, 0)

Or:

df["answer"] = (df["value"] > 900).astype(int)
Output:
         value  fact  answer
country
china     1000  1000       1
canda      850   200       0
usa       1100   850       1
What's wrong with your code: when you do df['answer']=1, the expression assigns 1 to all the rows of the answer column, so the last evaluated value is what the whole column ends up holding.
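If you do want to keep a loop, a minimal sketch (using the country index set up above) that assigns one row at a time instead of overwriting the whole column:

# iterate (index, value) pairs and assign per row with .loc
for country, x in df['value'].items():
    df.loc[country, 'answer'] = 1 if x > 900 else 0

The vectorized np.where version above is still preferable for anything but tiny frames.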
It can even be done without iterating over each row, using:
df['answer'] = df['value'].apply(lambda value: 1 if value > 900 else 0)
EDIT: You are assigning df['answer'] as a whole to some value. The last value evaluated is 1, which is why the entire answer column becomes 1 rather than a particular row.
I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row name where the sum of the series equals 0, so I can later delete those rows. My dataframe is as follows (I create the last column just to sum up the series):
     1  2  3  4  total
Ash  1  0  1  1      3
Bel  0  0  0  0      0
Cay  1  0  0  0      1
Jeg  0  1  1  1      3
Jut  1  1  1  1      4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this is obviously wrong, because I am indexing df.index with the value 0, which will always print the name of the first row. Not sure what method I can implement here, though?
There are great solutions below, and I also found a way using a simpler Python skill, enumerate (because I still find list comprehensions hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])
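A slightly more direct variant of the same loop (a sketch) iterates the index and value together via Series.items(), so no positional lookup is needed:

def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, total in df['total'].items():
        if total == 0:
            print(name)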
One possible way may be the following, where df is filtered using the value in total:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would like to iterate through the rows, then df.iterrows() may be another way as well:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)
Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
     1  2  3  4  total
Bel  0  0  0  0      0
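And since the stated goal was to later delete those rows, a short follow-up sketch:

zero_rows = df.index[df['total'] == 0]  # e.g. Index(['Bel'], dtype='object')
df = df.drop(zero_rows)
# or, equivalently, keep only the non-zero rows:
# df = df[df['total'] != 0]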