Pandas Wrap Display into Multiple Columns - python

All,
I have a pandas dataframe with ~30 rows and 1 column. When I display it in Jupyter, all 30 rows are displayed in one long list. I am looking for a way to wrap the rows into multiple displayed columns, such as below:
Example dataframe:
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
                   'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                   'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad'],
                  columns=['value'])
Example output
   value     value     value
0      a  10     k  20     u
1      b  11     l  21     v
2      c  12     m  22     w
3      d  13     n  23     x
4      e  14     o  24     y
5      f  15     p  25     z
6      g  16     q  26    aa
7      h  17     r  27    ab
8      i  18     s  28    ac
9      j  19     t  29    ad

You can use this helper function:
import numpy as np

def reshape(df, rows=10):
    # assign a row/column position to every value, then pivot into a grid
    length = len(df)
    cols = np.ceil(length / rows).astype(int)
    df = df.assign(rows=np.tile(np.arange(rows), cols)[:length],
                   cols=np.repeat(np.arange(cols), rows)[:length]) \
           .pivot(index='rows', columns='cols', values=df.columns.tolist()) \
           .sort_index(level=1, axis=1).droplevel(level=1, axis=1).rename_axis(None)
    return df
Output
>>> reshape(df)
value value value
0 a k u
1 b l v
2 c m w
3 d n x
4 e o y
5 f p z
6 g q aa
7 h r ab
8 i s ac
9 j t ad
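If you'd rather avoid the pivot machinery, here is a minimal NumPy-based sketch (the reshape_np name is mine, and it assumes a single 'value' column as in the question; frames shorter than a full grid are padded with empty strings):
def reshape_np(df, rows=10):
    # pad the values to a multiple of `rows`, then fill column by column
    # (order='F') so consecutive values run down each displayed column
    vals = df['value'].to_numpy(dtype=object)
    n_cols = -(-len(vals) // rows)  # ceiling division
    padded = np.full(rows * n_cols, '', dtype=object)
    padded[:len(vals)] = vals
    return pd.DataFrame(padded.reshape(rows, n_cols, order='F'),
                        columns=['value'] * n_cols)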

Try (note that n must be passed by keyword in recent pandas versions):
df[['col1', 'col2']] = df['col'].str.split(' ', n=1, expand=True)

Related

How to specify headers for a specific number of columns in a csv and pandas dataframe

I have a csv file with 50 comma-separated values. For example, a row:
3290,171,12,134,23,1824,228,245,147,2999,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
I want to specify headers for the first 11 columns in this csv file. I tried a few approaches, but the data seems to be corrupted.
What I did:
df = pd.read_csv("info/info.data", sep=',', header=None)
df_11 = df.iloc[:, 1:11]
df_11.columns = ['A', 'B', 'C', 'D', 'E', 'F' 'G', 'H', 'I', 'J', 'K']
What am I doing wrong?
Assumption: you want to rename the first 11 columns. (Two bugs in your attempt: 'F' 'G' is missing a comma, so Python concatenates the two strings into the single name 'FG', and iloc[:, 1:11] skips the first column; the first 11 columns are iloc[:, :11].)
# new names for the columns
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
# using a list comprehension, take a value from cols for the first 11 columns
# and keep the remainder as is
new_cols = [cols[c] if c < len(cols) else c for c in range(len(df.columns))]
df.columns = new_cols
df
A B C D E F G H I J ... 45 46 47 48 49 50 51 52 53 54
0 3290 171 12 134 23 1824 228 245 147 2999 ... 0 0 0 0 0 0 0 0 0 1
If you only need the first 11 columns, filter and rename them:
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
# keep as many columns as there are names in the cols list;
# .copy() avoids a SettingWithCopyWarning when assigning the new names
df2 = df.iloc[:, :len(cols)].copy()
# rename the columns
df2.columns = cols
df2
A B C D E F G H I J K
0 3290 171 12 134 23 1824 228 245 147 2999 1
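Alternatively, a one-line rename sketch (assuming header=None, so the column labels are the integers 0 through 54) maps the first positions to letters and leaves the rest untouched:
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
# map integer labels 0..10 to letters; unmapped columns keep their labels
df = df.rename(columns=dict(zip(range(len(cols)), cols)))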

Plot the number of instances in first column with respect to second column in python?

I have this table in Excel which I am trying to analyze. I can plot the overall counts of S and D (in Col3) with the code below, but I cannot figure out how to break those counts down by month (from the dates in Col4). How can I do that?
I would like to get two line plots showing the number of S and D respectively with the corresponding months on the X-axis.
# to plot the overall number of S and D in Col3
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\data.csv', usecols=['Col1', 'Col2', 'Col3', 'Col4'])
df['Col4'] = pd.to_datetime(df['Col4'], format="%m/%d/%Y").dt.date
df.head()
df1 = df[['Col3']].copy()
my_dict = df1['Col3'].value_counts().to_dict()
myList = my_dict.items()
x, y = zip(*myList)
plt.bar(x, y, color = "tomato")
plt.ylabel('Count')
plt.title('Outcome')
plt.show()
Col1 Col2 Col3 Col4
0 Y MA S 2/2/2022
1 N YJ D 4/25/2022
2 N YJ D 3/11/2022
3 N YJ D 4/28/2022
4 Y YJ D 4/21/2022
5 N YJ D 4/21/2022
6 Y WE D 5/25/2022
7 Y WE S 5/7/2022
8 N WE D 3/30/2022
9 N PR D 3/22/2022
10 Y PR S 3/22/2022
The following should do what the OP wants given the data frame as posted:
df = pd.DataFrame({'Col1': ['Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'Y'],
                   'Col2': ['MA', 'YJ', 'YJ', 'YJ', 'YJ', 'YJ', 'WE', 'WE', 'WE', 'PR', 'PR'],
                   'Col3': ['S', 'D', 'D', 'D', 'D', 'D', 'D', 'S', 'D', 'D', 'S'],
                   'Col4': ['2/2/2022', '4/25/2022', '3/11/2022', '4/28/2022', '4/21/2022',
                            '4/21/2022', '5/25/2022', '5/7/2022', '3/30/2022', '3/22/2022', '3/22/2022']})
First extract the month from the dates in Col4 (and sort ascending by month):
df.loc[:, 'Month'] = pd.to_datetime(df.Col4).dt.month
df = df.sort_values('Month', ascending=True)
Col1 Col2 Col3 Col4 Month
0 Y MA S 2/2/2022 2
2 N YJ D 3/11/2022 3
8 N WE D 3/30/2022 3
9 N PR D 3/22/2022 3
10 Y PR S 3/22/2022 3
1 N YJ D 4/25/2022 4
3 N YJ D 4/28/2022 4
4 Y YJ D 4/21/2022 4
5 N YJ D 4/21/2022 4
6 Y WE D 5/25/2022 5
7 Y WE S 5/7/2022 5
Create a pivot table with Month as the index, the values of Col3 (i.e., S and D) as columns, and counts as the cell values:
df1 = df.groupby(['Month', 'Col3'])\
        .size()\
        .unstack(fill_value=0)\
        .reset_index()
Col3 Month D S
0 2 0 1
1 3 3 1
2 4 4 0
3 5 1 1
Plot the results:
plt.plot(df1.Month, df1.D, label='D')
plt.plot(df1.Month, df1.S, label='S')
plt.xlabel('Month')
plt.ylabel('Count')
plt.legend()
plt.show()
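As a side note, a more compact sketch of the same idea (assuming matplotlib as the plotting backend) lets pandas draw one line per Col3 value directly, with no reset_index needed:
counts = df.groupby(['Month', 'Col3']).size().unstack(fill_value=0)
counts.plot(xlabel='Month', ylabel='Count')  # one line each for D and S
plt.show()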

Data Transforming/formatting in Python

I have the following pandas data:
df = {'ID_1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
      'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']}
df = pd.DataFrame(df)
display(df)
ID_1 ID_2
1 a
1 b
1 c
2 f
2 g
3 d
4 v
4 x
4 y
4 z
For each ID_1, I need all pairwise combinations (order doesn't matter) of its ID_2 values. For example:
When ID_1 = 1, the combinations are ab, ac, bc.
When ID_1 = 2, the combination is fg.
Note that if an ID_1 occurs fewer than two times, it yields no combinations (see ID_1 = 3, for example).
Finally, I need to store the combination results in a new dataframe, df2.
One way using itertools.combinations:
from itertools import combinations

def comb_df(ser):
    # all unordered pairs of the group's values, as a two-column frame
    return pd.DataFrame(list(combinations(ser, 2)), columns=["from", "to"])

new_df = df.groupby("ID_1")["ID_2"].apply(comb_df).reset_index(drop=True)
Output:
from to
0 a b
1 a c
2 b c
3 f g
4 v x
5 v y
6 v z
7 x y
8 x z
9 y z
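A merge-based sketch reaches the same result without a Python-level apply, which can help on larger frames (assuming ID_2 values are unique within each ID_1 group, so each unordered pair is kept exactly once):
# self-join on ID_1, then keep each unordered pair once via '<'
pairs = df.merge(df, on='ID_1', suffixes=('_from', '_to'))
new_df = (pairs[pairs['ID_2_from'] < pairs['ID_2_to']]
          .rename(columns={'ID_2_from': 'from', 'ID_2_to': 'to'})
          [['from', 'to']].reset_index(drop=True))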

pandas: groupby sum conditional on other column

I have a dataframe which looks like this:
df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', check whether any value in column 'b' is 'Y' (and keep 'Y' in that case), and then just sum column 'c'.
The resulting dataframe should look like this:
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the below but get an error:
df.groupby('a')['b'].max()['c'].sum()
You can use agg with max and sum. Taking the max of column 'b' works because strings compare lexicographically, so 'Y' > 'N' evaluates to True.
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
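If you prefer the intent to be explicit, an equivalent sketch with named aggregation (available since pandas 0.25) reads almost like the requirement:
# 'b' keeps the max per group ('Y' if any), 'c' is summed per group
out = df.groupby('a', as_index=False).agg(b=('b', 'max'), c=('c', 'sum'))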

Improve performance when creating a new column using another column as a lookup table

I have a main dataframe with 4 columns representing 4 colors and 3 rows representing 3 types of materials. The values in this frame are either 1 or 0, where 1 indicates POSITIVE and 0 NEGATIVE.
I have another, very long dataframe with multiple columns, including a column for COLOR and another for MATERIAL; the combinations differ from row to row. The main table indicates which combinations of COLOR and MATERIAL are considered POSITIVE. I want to create a new column in the long frame called 'FAVOR': for each row whose color/material combination is marked POSITIVE (value 1) in the main table, FAVOR should be 1, else 0.
I did something along the lines of :
for i in pairs:
    main_frame['FAVOR'].loc[(main_frame['Color'] == i[0]) & (main_frame['Material'] == i[1])] = '1'
where pairs is a list I created using the main table, in which each item is a pair of MATERIAL and COLOR for which the value is 1.
The above lines of code ran for over 30 mins and I ran out of patience.
I understand that a row-wise operation like this is typically inefficient in Pandas. But is there any faster way to achieve what I am trying to do?
EDIT:
import pandas as pd
import numpy as np

main_frame = pd.DataFrame({'Color': ['g', 'e', 'e', 'k', 's', 'f', 'o',
                                     'r', 'g', 'e', 'e', 'k', 's'],
                           'Material': ['p', 'r', 'o', 'g', 'r', 'a', 'm',
                                        'm', 'i', 'n', 'g', 'k', 'n']})
lookup_table = pd.DataFrame(np.random.choice([1, 0], 56).reshape(7, 8),
                            index=['g', 'e', 'k', 's', 'f', 'o', 'r'],
                            columns=['p', 'r', 'o', 'g', 'a', 'm', 'i', 'n'])
print(main_frame)
print(lookup_table)

rows = []
for i in lookup_table.index:
    rows.append(i)
cols = []
for j in lookup_table.columns:
    cols.append(j)
pairs = []
for i in rows:
    for j in cols:
        if lookup_table.loc[i, j] == 1:
            pairs.append([i, j])

for i in pairs:
    main_frame['FAVOR'].loc[(main_frame['Color'] == i[0]) & (main_frame['Material'] == i[1])] = '1'
This works very quickly for the sample code, but for my dataset with 1,000,000 records it takes a significant amount of time.
You can use merge after using stack and reset_index on lookup_table. First create df_stack:
df_stack = (lookup_table.stack().reset_index()
            .rename(columns={'level_0': 'Color', 'level_1': 'Material', 0: 'FAVOR'}))
print(df_stack.head(15))
Color Material FAVOR
0 g p 0
1 g r 0
2 g o 1
3 g g 1
4 g a 1
5 g m 0
6 g i 0
7 g n 1
8 e p 0
9 e r 0
10 e o 0
11 e g 0
12 e a 0
13 e m 0
14 e i 0
Each row now pairs a 0 or 1 with one (row, column) couple from your lookup_table; I named those two levels Color and Material so they line up for the merge:
main_frame = main_frame.merge(df_stack, how='left').fillna(0)
The result in main_frame, with my random 0s and 1s:
Color Material FAVOR
0 g p 0.0
1 e r 0.0
2 e o 0.0
3 k g 0.0
4 s r 1.0
5 f a 0.0
6 o m 1.0
7 r m 1.0
8 g i 0.0
9 e n 0.0
10 e g 0.0
11 k k 0.0
12 s n 0.0
It should be much faster than the looped method on a large df.
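Another vectorized sketch skips the merge entirely by indexing the stacked lookup with a MultiIndex built from the two key columns (missing Color/Material pairs fall back to 0 via fill_value):
# look up each (Color, Material) pair directly in the stacked table
keys = pd.MultiIndex.from_frame(main_frame[['Color', 'Material']])
main_frame['FAVOR'] = lookup_table.stack().reindex(keys, fill_value=0).to_numpy()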
